Image Generation
GLIDE
Recently, numerous AGI applications have caught the attention of almost everyone on the internet. Here are some advanced papers that elucidate their key principles and technologies. DiT: The authors explore a new class of diffusion models based on the transformer architecture, Diffusion Transformers (DiTs)1. Before their work, using a U-Net backbone to generate the target image was prevalent rather than a transformer architecture. The authors experiment with variants of standard transformer blocks that incorporate conditioning via adaptive layer norm, cross-attention, and extra input tokens....
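As a rough illustration of the adaptive-layer-norm variant, here is a minimal PyTorch-style sketch of a single DiT-like block. It assumes a conditioning vector (e.g. a timestep/class embedding) of the same width as the tokens; the layer names and sizes are hypothetical simplifications, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class AdaLNDiTBlock(nn.Module):
    """Simplified DiT-style block with adaptive layer norm (adaLN) conditioning.
    Sizes and names are illustrative, not the original implementation."""
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # The conditioning vector is mapped to per-block scale/shift/gate parameters
        # instead of being appended as extra input tokens or used via cross-attention.
        self.ada_mod = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim), cond: (batch, dim)
        shift1, scale1, gate1, shift2, scale2, gate2 = self.ada_mod(cond).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + scale1.unsqueeze(1)) + shift1.unsqueeze(1)
        x = x + gate1.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + scale2.unsqueeze(1)) + shift2.unsqueeze(1)
        x = x + gate2.unsqueeze(1) * self.mlp(h)
        return x
```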
With the swift development of deep neural networks, a multitude of models handling diverse information modalities like text, speech, images, and videos have proliferated. Among AI researchers, it's widely acknowledged that multimodality is the future of AI. Let's explore the advancements in multimodality in recent years. Texts & Images CLIP CLIP (Radford et al., 2021)1 argues that learning directly from raw text about images is a promising alternative that leverages a much broader source of supervision....
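Below is a minimal sketch of the symmetric contrastive objective that CLIP-style training uses to pair images with their captions. The image_feats and text_feats inputs are assumed to come from separate image and text encoders not shown here, and the temperature value is only illustrative.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_feats: torch.Tensor,
                          text_feats: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of paired image/text embeddings."""
    # L2-normalize so the dot product becomes a cosine similarity.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    # (batch, batch) similarity matrix; the diagonal holds the true pairs.
    logits = image_feats @ text_feats.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Cross-entropy in both directions: image -> text and text -> image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```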
Transformer-based techniques have recently improved not only on text data but also in computer vision. Here we focus on Visual Language Models (VLMs) based on transformers. In the beginning, some researchers tried to extend BERT to process visual data and succeeded. For example, VisualBERT and ViLBERT achieve strong performance on many visual tasks by training on two different objectives: 1) a masked modeling task that aims to predict the missing parts of a given input; and 2) a matching task that aims to predict whether the text and the image content match....
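To make the two objectives concrete, here is a small sketch of the corresponding loss heads on top of a BERT-style fused encoder; fused, masked_positions, and the other names are hypothetical placeholders rather than the models' actual APIs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoObjectiveVLMHead(nn.Module):
    """Illustrative heads for the two pretraining objectives described above.
    `fused` is assumed to be the encoder output over concatenated text and
    image-region tokens; names and sizes are hypothetical."""
    def __init__(self, hidden: int, vocab_size: int):
        super().__init__()
        self.mlm_head = nn.Linear(hidden, vocab_size)   # predict masked tokens
        self.match_head = nn.Linear(hidden, 2)          # matched vs. mismatched pair

    def forward(self, fused, masked_positions, token_labels, match_labels):
        # 1) Masked modeling: predict the original token at each masked position.
        batch_idx = torch.arange(fused.size(0)).unsqueeze(1)
        masked_states = fused[batch_idx, masked_positions]          # (B, M, hidden)
        mlm_loss = F.cross_entropy(
            self.mlm_head(masked_states).flatten(0, 1), token_labels.flatten())
        # 2) Matching: classify from the [CLS] state whether text and image agree.
        match_loss = F.cross_entropy(self.match_head(fused[:, 0]), match_labels)
        return mlm_loss + match_loss
```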