With the swift development of deep neural networks, a multitude of models handling diverse information modalities like text, speech, images, and videos have proliferated. Among AI researchers, it’s widely acknowledged that multimodality is the future of AI. Let’s explore the advancements in multimodality in recent years.

Texts & Images

CLIP

CLIP (Radford et al., 2021) argues that learning directly from raw text about images is a promising alternative that leverages a much broader source of supervision. Weighing computation budget against performance, the authors choose contrastive representation learning over a direct prediction objective. Training requires a text encoder and an image encoder to produce text and image representations, and then maximizes the cosine similarity between the matched pairs.

Figure 1. Illustration of CLIP training and inference. (Image source: Learning Transferable Visual Models From Natural Language Supervision)

Pseudocode for the core of a CLIP implementation (Radford et al., 2021) is shown below:

import numpy as np
# image_encoder - ResNet or Vision Transformer
# text_encoder - CBOW or Text Transformer
# I[n, h, w, c] - minibatch of aligned images
# T[n, l] - minibatch of aligned texts
# W_i[d_i, d_e] - learned proj of image to embed
# W_t[d_t, d_e] - learned proj of text to embed
# t - learned temperature parameter
# extract feature representations of each modality
I_f = image_encoder(I) #[n, d_i]
T_f = text_encoder(T) #[n, d_t]
# joint multimodal embedding [n, d_e]
I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)
# scaled pairwise cosine similarities [n, n]
logits = np.dot(I_e, T_e.T) * np.exp(t)
# symmetric loss function
labels = np.arange(n)
loss_i = cross_entropy_loss(logits, labels, axis=0)
loss_t = cross_entropy_loss(logits, labels, axis=1)
loss = (loss_i + loss_t)/2
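
The pseudocode above leaves the encoders and the cross-entropy details abstract. Below is a minimal runnable NumPy sketch of the symmetric contrastive loss only, with random arrays standing in for encoder outputs; the batch size, embedding dimension, and temperature initialization are illustrative assumptions rather than CLIP's exact settings.

import numpy as np

def l2_normalize(x, axis=1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def clip_loss(I_e, T_e, t):
    # scaled pairwise cosine similarities [n, n]
    logits = I_e @ T_e.T * np.exp(t)
    # row-wise and column-wise log-softmax; matched pairs sit on the diagonal
    log_p_i2t = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_p_t2i = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    loss_i = -np.mean(np.diag(log_p_i2t))   # image -> text direction
    loss_t = -np.mean(np.diag(log_p_t2i))   # text -> image direction
    return (loss_i + loss_t) / 2

# stand-ins for encoder outputs already projected into the joint embedding space
n, d_e = 8, 32                              # illustrative batch size and embedding dim
I_e = l2_normalize(np.random.randn(n, d_e))
T_e = l2_normalize(np.random.randn(n, d_e))
print(clip_loss(I_e, T_e, t=np.log(1 / 0.07)))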

CLIP (Radford et al., 2021) is particularly valuable for tasks with relatively little labeled data, thanks to its zero-shot capability. The study underscores the substantial potential of pre-training for multimodal applications, and numerous CLIP-based applications emerged shortly thereafter.

ALIGN

ALIGN (Jia et al., 2021) leverages a noisy dataset of over one billion image alt-text pairs to train a simple dual-encoder architecture that aligns the visual and language representations of image-text pairs with a contrastive loss.

Figure 2. A summary of the ALIGN method, which does not require curated data. (Image source: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision)

The authors apply only minimal frequency-based filtering to the images and alt-texts, such as discarding alt-texts that are shared by more than 10 images and removing irrelevant content (e.g., “1980x1080”, “alt_img”). During pre-training, the authors construct two losses, one for image-to-text classification and one for text-to-image classification:

$$L_{i2t} = -\frac{1}{N} \sum^N_i \log \frac{\exp(x_i^T y_i/\sigma)}{\sum_{j=1}^N \exp(x_i^T y_j/\sigma)}$$

$$L_{t2i} = -\frac{1}{N} \sum^N_i \log \frac{\exp(y_i^T x_i/\sigma)}{\sum_{j=1}^N \exp(y_i^T x_j/\sigma)}$$

where $x_i$ and $y_j$ are the normalized embeddings of the image in the $i$-th pair and of the text in the $j$-th pair, respectively, $N$ is the batch size, and $\sigma$ is the temperature used to scale the logits. Notably, the temperature $\sigma$ is not set manually but learned jointly with all other parameters.
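
Concretely, $L_{i2t}$ and $L_{t2i}$ are the row-wise and column-wise softmax cross-entropies of the same $N \times N$ similarity matrix, with the positives on the diagonal. A hedged NumPy sketch (variable names are illustrative; in the actual model $\sigma$ is a trainable parameter updated by backpropagation):

import numpy as np

def align_losses(x, y, sigma):
    # x, y: L2-normalized image / text embeddings of one batch, both [N, d]
    sim = x @ y.T / sigma                                             # [N, N] scaled similarities
    log_p_i2t = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))  # softmax over texts
    log_p_t2i = sim - np.log(np.exp(sim).sum(axis=0, keepdims=True))  # softmax over images
    L_i2t = -np.mean(np.diag(log_p_i2t))
    L_t2i = -np.mean(np.diag(log_p_t2i))
    return L_i2t, L_t2i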

ViLT

ViLT (Kim et al., 2021) is a minimal vision-and-language architecture: it delegates vision features to the transformer module itself instead of a separate deep visual embedder. This design concentrates most of the computation on modality interaction rather than feature extraction, and thereby achieves competitive performance on vision-and-language tasks without region features or a deep convolutional visual encoder.
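
In practice the shallow visual embedder amounts to a ViT-style patch projection: fixed-size patches are flattened and passed through a single linear layer, and the resulting visual tokens are concatenated with the word embeddings before entering the shared transformer. A minimal NumPy sketch of that step (the 32-pixel patch size and 768-dim hidden size follow common ViT settings but are assumptions here):

import numpy as np

def patch_embed(image, W_patch, patch=32):
    # image: [H, W, C]; W_patch: [patch*patch*C, d] linear projection
    H, W, C = image.shape
    patches = image.reshape(H // patch, patch, W // patch, patch, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)
    return patches @ W_patch                     # [num_patches, d] visual tokens

image = np.random.rand(224, 224, 3)              # illustrative input image
W_patch = np.random.randn(32 * 32 * 3, 768) * 0.02
visual_tokens = patch_embed(image, W_patch)      # (49, 768)
# these tokens are concatenated with word embeddings and fed into one shared transformer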

Figure 3. Four categories of vision-and-language models. CLIP belongs to (b), ViLT belongs to (d). (Image source: ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision)

For pre-training objectives, the authors use image-text matching (ITM) and masked language modeling (MLM). ViLT is competitive with models that are heavily equipped with convolutional visual embedding networks (e.g., Faster R-CNN region features and ResNets). The authors therefore conclude that future work on vision-language pre-training (VLP) should focus on the modality interactions inside the transformer module rather than on an arms race that merely powers up unimodal embedders.

Figure 4. ViLT model architecture. (Image source: ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision)

ALBEF

Most previous methods employ a transformer-based multimodal encoder to jointly model visual and language tokens. Because the visual and text tokens are unaligned, it is challenging for the multimodal encoder to learn an optimal interaction. ALBEF (Li et al., 2021) introduces a contrastive loss to align the text and image representations before modality fusion. To learn more efficiently from raw noisy data, the study also proposes momentum distillation, a self-training method that learns from pseudo-targets produced by a momentum model.

Figure 5. Illustration of ALBEF. It consists of three encoders: an image encoder, a text encoder, and a multimodal encoder. (Image source: Align before Fuse: Vision and Language Representation Learning with Momentum Distillation)

The pre-training process is relatively complex and includes three objectives: image-text contrastive learning (ITC) on the unimodal encoders, and masked language modeling (MLM) and image-text matching (ITM) on the multimodal encoder. Notably, the authors improve ITM with online contrastive hard negative mining.

Image-Text Contrastive Learning (ITC): $$p_{m}^{i2t}(I) = \frac{\exp(s(I, T_{m})/\tau)}{\sum_{m=1}^{M} \exp(s(I, T_{m})/\tau)}, \quad p_{m}^{t2i}(T) = \frac{\exp(s(T, I_{m})/\tau)}{\sum_{m=1}^{M} \exp(s(T, I_{m})/\tau)}$$

$$\mathcal{L}_{\text{itc}} = \frac{1}{2} \mathbb{E}_{(I, T) \sim D} [ \text{H}(y^{i2t}(I), p^{i2t}(I)) + \text{H}(y^{t2i}(T), p^{t2i}(T))]$$

Here, $s(I, T)$ denotes the similarity of image $I$ and text $T$, $\tau$ is a learnable temperature parameter, $y^{i2t}$ and $y^{t2i}$ denote the one-hot ground-truth similarities, and $\text{H}$ is the cross-entropy between two distributions. $\mathcal{L}_{\text{itc}}$ is the image-text contrastive loss.

Masked Language Modeling (MLM):

The MLM task utilizes both the image and the text to predict masked words. The masking strategy is the same as in BERT (Devlin et al., 2018). MLM minimizes a cross-entropy loss:

$$\mathcal{L}_{mlm} = \mathbb{E}_{(I, \widehat{T}) \sim D} H( y^{msk}, p^{msk} (I, \widehat{T}))$$

where $\widehat{T}$ denotes a masked text, $p^{msk}(I, \widehat{T})$ denotes the model’s predicted distribution for a masked token, and $y^{msk}$ is a one-hot vocabulary distribution in which the ground-truth token has probability 1.

Image-Text Matching (ITM):

ITM predicts whether an image-text pair is matched or not. The [CLS] token of the multimodal encoder is used as the joint representation of the image-text pair. The ITM loss is:

$$\mathcal{L}_{itm} = \mathbb{E}_{(I, T) \sim D} H( y^{itm}, p^{itm} (I, T))$$

where $y^{itm}$ is a 2-dimensional one-hot vector representing the ground truth label.
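
The hard negative mining mentioned above reuses the contrastive similarities: for each image, a negative text from the same batch is sampled with probability proportional to its similarity, so harder negatives are seen more often by the ITM head (and symmetrically for each text). A hedged NumPy sketch of the sampling step with illustrative names:

import numpy as np

def sample_hard_negative_texts(sim):
    # sim: [N, N] image-to-text similarity matrix from the contrastive head
    rng = np.random.default_rng()
    N = sim.shape[0]
    weights = np.exp(sim)
    np.fill_diagonal(weights, 0.0)            # never sample the matched (positive) text
    neg_text_idx = np.empty(N, dtype=int)
    for i in range(N):
        p = weights[i] / weights[i].sum()
        neg_text_idx[i] = rng.choice(N, p=p)  # similar-but-wrong texts are sampled more often
    return neg_text_idx                        # neg_text_idx[i]: hard negative text for image i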

Finally, the full objective loss is:

$$\mathcal{L} = \mathcal{L}_{itc} + \mathcal{L}_{mlm} + \mathcal{L}_{itm}$$

Momentum Distillation:

One-hot labels for ITC and MLM penalize all negative predictions regardless of their correctness. To address this, the authors propose to learn from pseudo-targets generated by a momentum model, a continuously evolving teacher consisting of exponential-moving-average (EMA) versions of the unimodal and multimodal encoders. For the ITC task, the momentum loss is:

$$\mathcal{L}_{itc}^{mod} = (1 - \alpha) \mathcal{L}_{itc} + \frac{\alpha}{2} \mathbb{E}_{(I, T) \sim D}[KL(q^{i2t}(I)||p^{i2t}(I)) + KL(q^{t2i}(T) || p^{t2i}(T))]$$

where $q^{i2t}$ and $q^{t2i}$ are the pseudo targets generated by the momentum model.

Similarly, the MLM momentum loss is:

$$\mathcal{L}_{mlm}^{mod} = (1 - \alpha) \mathcal{L}_{mlm} + \alpha\, \mathbb{E}_{(I, \widehat{T}) \sim D}\, KL(q^{msk}(I, \widehat{T})\,||\,p^{msk}(I, \widehat{T}))$$

ALBEF lifted multimodal models to a new level and inspired numerous follow-up studies.
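
The momentum teacher is an EMA copy of the model, and the distillation term is a KL divergence toward its soft predictions. A minimal NumPy sketch of the two pieces; the momentum coefficient and the weight $\alpha$ below are assumed defaults for illustration, not quoted from the paper.

import numpy as np

def ema_update(student, teacher, m=0.995):
    # student, teacher: dicts of parameter arrays; m is the (assumed) momentum coefficient
    for name in teacher:
        teacher[name] = m * teacher[name] + (1 - m) * student[name]
    return teacher

def distilled_itc_loss(hard_itc, log_p_i2t, q_i2t, log_p_t2i, q_t2i, alpha=0.4):
    # hard_itc: ITC loss against one-hot labels; q_*: momentum-model softmax pseudo-targets
    kl_i2t = np.sum(q_i2t * (np.log(q_i2t + 1e-8) - log_p_i2t), axis=1).mean()
    kl_t2i = np.sum(q_t2i * (np.log(q_t2i + 1e-8) - log_p_t2i), axis=1).mean()
    return (1 - alpha) * hard_itc + alpha / 2 * (kl_i2t + kl_t2i)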

VLMO

VLMO (Bao et al., 2022) presents a unified vision-language pre-trained model that jointly learns a dual encoder and a fusion encoder with a modular Transformer network. To encode various modalities (images, text, and image-text pairs) within a single Transformer block, VLMO introduces Mixture-of-Modality-Experts (MoME): a V-FFN expert handles image-only data and an L-FFN expert handles text-only data, while training on image-text pair data uses all three experts (V-FFN, L-FFN, and VL-FFN).

Figure 6. Overview of VLMO pre-training. (Image source: VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts)

Another noteworthy detail is that the self-attention module is kept frozen during the text-only training stage.
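
Conceptually, a MoME block is a standard Transformer block whose feed-forward sub-layer is switched by input modality while the self-attention parameters are shared. A schematic Python sketch in the spirit of the earlier pseudocode (the function arguments are placeholders; layer normalization is omitted):

def mome_block(tokens, modality, shared_self_attention, ffn_experts):
    # ffn_experts maps a modality to its expert, e.g.
    # {"vision": v_ffn, "language": l_ffn, "vision-language": vl_ffn}
    h = tokens + shared_self_attention(tokens)   # attention parameters shared across modalities
    expert = ffn_experts[modality]               # route to the modality-specific FFN
    return h + expert(h)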

Figure 7. Mixture-of-Modality-Experts (MoME). (Image source: VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts)

BLIP

Most existing vision-language pre-training (VLP) frameworks excel only at either understanding-based tasks or generation-based tasks. BLIP (Li et al., 2022) presents a new framework that transfers flexibly to both vision-language understanding and generation tasks. Unlike previous works, BLIP argues that noisy web texts are suboptimal for vision-language learning and addresses this problem with a new data bootstrapping method.

BLIP makes two main contributions: a new model architecture named multimodal mixture of encoder-decoder (MED), and a data bootstrapping method, captioning and filtering (CapFilt), for learning from noisy image-text pairs.

Figure 8. Pre-training model architecture and objectives of BLIP (parameters with the same color are shared; the ITC task is trained without cross-attention). (Image source: BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation)

Figure 9. The BLIP learning framework, including the model workflow and the data workflow. The bootstrapped data is used to pre-train the model. (Image source: BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation)

Figure 10. Examples of captions generated by BLIP; green text is accepted by the filter and red text is rejected. (Image source: BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation)

MAE

Departing from past approaches, MAE (He et al., 2022) proposes an architecture that masks random patches of an input image and reconstructs the missing pixels. Masking a high proportion of the input, such as 75%, yields a nontrivial and meaningful self-supervised task. The MAE encoder and decoder are asymmetric; the decoder predicts the pixel values of the masked patches. The authors also study a variant whose reconstruction target is the normalized pixel values of each masked patch and find that it improves representation quality.
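
The heart of MAE is per-image random masking: shuffle the patch tokens, keep roughly 25%, and feed only those to the encoder, while the decoder later fills mask tokens back in at the removed positions. A minimal NumPy sketch of the masking step (the patch count and width below are illustrative ViT-style shapes):

import numpy as np

def random_masking(tokens, mask_ratio=0.75):
    # tokens: [num_patches, d] patch embeddings of a single image
    rng = np.random.default_rng()
    num_patches = tokens.shape[0]
    num_keep = int(num_patches * (1 - mask_ratio))
    ids_shuffle = np.argsort(rng.random(num_patches))  # random permutation of patch indices
    ids_keep = ids_shuffle[:num_keep]                  # visible patches: the only encoder input
    ids_mask = ids_shuffle[num_keep:]                  # positions the decoder must reconstruct
    return tokens[ids_keep], ids_keep, ids_mask

tokens = np.random.randn(196, 768)                     # e.g. 14x14 patches from a 224x224 image
visible, ids_keep, ids_mask = random_masking(tokens)
print(visible.shape)                                   # (49, 768): only 25% reaches the encoder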

Figure 11. The architecture consists of an encoder and a decoder. (Image source: Masked Autoencoders Are Scalable Vision Learners)

Tokens vs. pixels: the authors compare tokens and pixels as the decoder’s reconstruction targets, and the experiments show that tokenization is not necessary for MAE.

Figure 12. Tokens vs. pixels: tokens bring no clear benefit. (Image source: Masked Autoencoders Are Scalable Vision Learners)

Monkey

Monkey (Li et al., 2023) focuses on enhancing large multimodal models (LMMs) for high-resolution input and detailed scene understanding. Compared with directly interpolating the ViT to handle higher input resolution, Monkey uses a module that divides high-resolution images into smaller patches with a sliding window. Each patch is processed independently by a static visual encoder, enhanced with LoRA adjustments and a trainable visual resampler.
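
A hedged sketch of the sliding-window split: the high-resolution image is cut into crops that match the frozen encoder's native input size, and each crop is encoded separately before the resampler. The 448-pixel window and the 896x1344 input below are illustrative assumptions, not the exact configuration.

import numpy as np

def sliding_window_crops(image, window=448):
    # image: [H, W, C]; H and W are assumed to be multiples of the window size
    H, W, _ = image.shape
    crops = []
    for top in range(0, H, window):
        for left in range(0, W, window):
            crops.append(image[top:top + window, left:left + window])
    return crops                                   # each crop is encoded by the shared frozen ViT

image = np.random.rand(896, 1344, 3)               # illustrative high-resolution input
print(len(sliding_window_crops(image)))            # 6 crops of 448x448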

Figure 13. Each patch is processed independently by a static visual encoder, enhanced with LoRA adjustments and a trainable visual resampler. All patches pass through the shared static ViT encoder, e.g., ViT-BigG with 2B parameters. (Image source: Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models)

Mini-Gemini

Cambrian-1

Cambrian-1 is a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. Previous multimodal models have primarily focused on language understanding, often neglecting the learning of visual representations. In this study, the authors use LLMs and visual instruction tuning as an interface to evaluate various visual representations, offering new insight into different methods (self-supervised, strongly supervised, or combinations thereof). Cambrian-1 identifies three main designs for the multimodal connector: resamplers, Q-Formers, and MLP projectors, each with its own issues; for instance, an MLP projector causes the number of visual tokens to grow rapidly with image resolution. Some significant findings are listed here:

  1. Unfreezing the vision encoder is broadly beneficial, especially for language-supervised models.
  2. High-resolution vision encoders greatly enhance performance on chart and vision-centric tasks, and ConvNet-based architectures are inherently well suited to such tasks.
  3. Two-stage training is beneficial; more adapter data further boosts results.
  4. Language supervision offers strong advantages, but the performance gap can be narrowed with SSL methods given enough data and proper tuning.

LLaVA-OneVision

LLaVA-OneVision is the first single model that simultaneously pushes the performance boundaries of open LMMs in three important vision scenarios: single-image, multi-image, and video. LLaVA-OneVision consists of three components:

  1. LLM: the authors select Qwen-2 as the LLM.
  2. Vision encoder: the authors choose SigLIP as the vision encoder because of its strong performance.
  3. Projector: the researchers use a 2-layer MLP to project image features into the word embedding space (a minimal sketch follows this list).
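
A 2-layer MLP projector is simply two linear maps with a non-linearity in between, taking visual tokens from the vision encoder into the LLM's word-embedding space. A hedged NumPy sketch; the GELU activation and the shapes are assumptions for illustration.

import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mlp_projector(visual_tokens, W1, b1, W2, b2):
    # visual_tokens: [num_tokens, d_vision] features from the vision encoder
    # output: [num_tokens, d_llm] tokens in the LLM word-embedding space
    return gelu(visual_tokens @ W1 + b1) @ W2 + b2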

Another noteworthy aspect is the contribution to high-quality data collection and curation. In total, LLaVA-OneVision releases three dataset collections, including Single-Image 3.2M and OneVision 1.6M.

Figure 15. Training workflow of LLaVA-OneVision. (Image source: LLaVA-OneVision: Easy Visual Task Transfer)

The training strategy has a few key points: Stage 1 updates only the projector, while the subsequent stages update the full model, and the learning rate for the vision encoder is 5 times smaller than that for the LLM.

The vision representation is key to the success of visual encoding. Generally, the goal is to capture as much raw pixel information as possible while keeping the token count limited. The authors observe that scaling resolution is more effective than scaling the number of tokens, and recommend an AnyRes strategy with pooling.
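
One way to picture the pooling in AnyRes-with-pooling is spatial average pooling over each crop's token grid, which keeps the per-crop token count within budget as resolution grows. A hedged NumPy sketch; the 27x27 grid, the token width, and the 2x2 pooling factor are illustrative assumptions rather than the paper's exact configuration.

import numpy as np

def pool_visual_tokens(tokens, grid=27, factor=2):
    # tokens: [grid*grid, d] visual tokens of one crop, laid out on a grid x grid feature map
    d = tokens.shape[1]
    fmap = tokens.reshape(grid, grid, d)
    g = (grid // factor) * factor                  # drop the remainder if grid is not divisible
    fmap = fmap[:g, :g].reshape(g // factor, factor, g // factor, factor, d)
    return fmap.mean(axis=(1, 3)).reshape(-1, d)   # [(g // factor)**2, d] pooled tokens

tokens = np.random.randn(27 * 27, 1152)            # illustrative crop token grid
print(pool_visual_tokens(tokens).shape)            # (169, 1152): 13x13 tokens after 2x2 pooling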

Figure 16. Visual encoder setting. (Image source: LLaVA-OneVision: Easy Visual Task Transfer)

The AnyRes strategy serves as a flexible visual representation framework: single-image, multi-image, and video inputs can all be represented with token sequences of nearly the same length. More details can be found in the paper.

Datasets

  1. Docmatix: released publicly by Hugging Face; contains 1.27 million image-text pairs.
  2. Single-Image 3.2M: a high-quality single-image dataset collection released by LLaVA-OneVision.
  3. OneVision 1.6M: a high-quality single-image, multi-image, and video dataset collection.

References