Bridging Vision and Language: A Deep Dive into CLIP, BLIP, and OWL-ViT
This article explores three groundbreaking models, CLIP, BLIP, and OWL-ViT, and how they use contrastive learning to align images and text across a range of tasks. We will examine how CLIP's strategy of aligning image and text embeddings has influenced BLIP and OWL-ViT, which tackle further challenges in visual understanding and open-vocabulary object detection. Let's look at the impact these models have had on the field.
CLIP (Contrastive Language-Image Pre-training)
OpenAI's CLIP revolutionized the field by using natural language as the supervisory signal for learning visual concepts. The core idea is that the vast amount of text paired with images on the web can teach visual models far more broadly than conventional, fixed-label image datasets allow.
1. Model Architecture and Pre-training
The CLIP framework consists of both a text encoder and an image encoder, which process text and images, respectively.
Image Encoder
- The paper evaluates two architecture families for the image encoder: a modified ResNet and a Vision Transformer (ViT). In total, five ResNets and three Vision Transformers were trained: ResNet-50, ResNet-101, and three EfficientNet-style scaled variants (RN50x4, RN50x16, RN50x64) that use roughly 4x, 16x, and 64x the compute of ResNet-50, plus ViT-B/32, ViT-B/16, and ViT-L/14.
- Among these, ViT-L/14@336px, which was trained at a higher resolution for one additional epoch, performed best and is the model referred to as "CLIP" throughout the paper.
Text Encoder
- Built on a Transformer architecture and incorporating enhancements from Radford et al. (2019), it features a 12-layer design with a width of 512 units and 8 attention heads, totaling around 63 million parameters.
- Text Embedding: It employs byte pair encoding (BPE) with a vocabulary size of 49,152 to efficiently manage various text inputs.
- Sequence Handling: The maximum sequence length is limited to 76 tokens to optimize computational efficiency, with special tokens denoting the start ([SOS]) and end ([EOS]) of sequences.
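As a quick illustration of the BPE tokenization and the special sequence tokens, here is a minimal sketch using the Hugging Face tokenizer for CLIP. The library and checkpoint name are assumptions for illustration, and this implementation's vocabulary count may differ slightly from the figure reported in the paper.

```python
# A quick look at CLIP-style BPE tokenization via the Hugging Face transformers tokenizer.
# The library and checkpoint name are illustrative assumptions; the paper uses its own tokenizer.
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

encoded = tokenizer("a photo of a cat")
print(encoded["input_ids"])                       # BPE token ids, bracketed by the special tokens below
print(tokenizer.bos_token, tokenizer.eos_token)   # start-of-sequence and end-of-sequence markers
print(tokenizer.vocab_size)                       # this implementation's BPE vocabulary size
```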
Training
- The training objective for CLIP is contrastive learning: increase the similarity between embeddings of correct image-text pairs while decreasing it for incorrect pairs, using a symmetric cross-entropy loss over these similarities (a minimal sketch of this loss follows the list below).
- Image-Text Pairs: The dataset consists of pairs linking each image with a corresponding descriptive text, such as captions or titles that accurately depict the visual content.
- Source of Data: These pairs are generally collected from the web or from datasets built for multimodal learning, such as websites where images come with captions or curated datasets with human-written annotations. For CLIP specifically, the authors assembled a dataset of roughly 400 million image-text pairs gathered from the internet.
- Projection Method: A linear projection is utilized to map each encoder's representations into a shared multi-modal embedding space, steering clear of more complex non-linear projections that did not yield significant training efficiency benefits.
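The contrastive objective described above can be written compactly. Below is a minimal PyTorch sketch of the symmetric cross-entropy loss over cosine similarities, following the pseudocode in the paper. The function name, tensor shapes, and the fixed temperature are illustrative assumptions (in CLIP the temperature is a learned parameter), and the input features stand in for the projected outputs of the two encoders.

```python
# Sketch of CLIP's symmetric contrastive objective.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    # L2-normalize the projected features from each encoder.
    image_embeds = F.normalize(image_features, dim=-1)   # [N, d]
    text_embeds = F.normalize(text_features, dim=-1)     # [N, d]

    # Pairwise cosine similarities, scaled by a temperature (fixed here; learned in CLIP).
    logits_per_image = image_embeds @ text_embeds.t() / temperature   # [N, N]
    logits_per_text = logits_per_image.t()

    # The i-th image matches the i-th text, so the targets are the diagonal.
    targets = torch.arange(image_embeds.size(0), device=image_embeds.device)

    # Symmetric cross-entropy over both matching directions.
    loss_i = F.cross_entropy(logits_per_image, targets)
    loss_t = F.cross_entropy(logits_per_text, targets)
    return (loss_i + loss_t) / 2

# Example with random features for a batch of 8 pairs in a 512-d embedding space.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```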
BLIP (Bootstrapping Language-Image Pre-training)
Due to the high costs associated with acquiring quality human-annotated image-text pairs (e.g., the COCO dataset), recent strategies have turned to larger collections of image and alt-text pairs sourced from the web. However, these alt-texts often fail to accurately represent the images, resulting in noisy signals that hinder effective vision-language alignment. To counteract this, BLIP introduces CapFilt, a method designed to filter out noisy pairs and produce new captions for images. The following sections will discuss the model architecture, along with the pre-training and fine-tuning processes.
1. Model Architecture and Pretraining in BLIP
Multimodal Mixture of Encoder-Decoder (MED) is a model that combines both comprehension and generation capabilities, operating in three distinct modes.
Unimodal Encoder:
- This component encodes images and text separately. The text encoder functions similarly to BERT, incorporating a [CLS] token to summarize the sentence. A visual transformer (ViT) processes the image by segmenting it into patches, encoding them into a sequence of embeddings, including an additional [CLS] token for the global image feature.
- It is trained using an image-text contrastive (ITC) loss to align the feature spaces of both encoders, promoting similarity for positive image-text pairs and differentiating them from negative pairs.
Image-grounded Text Encoder:
- This section integrates a cross-attention layer into the text encoder, enhancing it with visual information. A task-specific [Encode] token is added to the text, and the output embedding of this token serves as the multimodal representation of the image-text pair.
- It is trained using an image-text matching (ITM) loss, a binary classification task where the model predicts if an image-text pair is a match or not based on their multimodal features.
Image-grounded Text Decoder:
- This component substitutes the bidirectional self-attention layers with causal self-attention layers, signaling the start and end of sequences using a [Decode] token and an end-of-sequence token, respectively.
- The decoder is trained using a language modeling (LM) loss to generate captions from images, optimizing a cross-entropy loss to maximize the likelihood of text in an autoregressive manner.
Parameter Sharing: All parameters except the self-attention layers are shared between the text encoder and the text decoder to improve training efficiency. The key difference lies in how the two attend to the text: the encoder uses bidirectional self-attention to build representations of the current input tokens, while the decoder uses causal self-attention to predict subsequent tokens. A simplified sketch of how the three pre-training objectives combine is shown below.
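This is a simplified PyTorch illustration, not the official BLIP implementation: the random tensors stand in for MED outputs, the shapes and equal loss weighting are assumptions, and details such as hard-negative sampling for ITM are omitted.

```python
# Conceptual sketch of how BLIP's three pre-training losses combine.
import torch
import torch.nn.functional as F

batch = 8

# --- ITC: unimodal image/text embeddings aligned with a symmetric contrastive loss, as in CLIP.
img_emb = F.normalize(torch.randn(batch, 256), dim=-1)   # stand-in for unimodal image encoder output
txt_emb = F.normalize(torch.randn(batch, 256), dim=-1)   # stand-in for unimodal text encoder output
sims = img_emb @ txt_emb.t() / 0.07
targets = torch.arange(batch)
itc_loss = (F.cross_entropy(sims, targets) + F.cross_entropy(sims.t(), targets)) / 2

# --- ITM: binary match/no-match prediction from the image-grounded text encoder's
#     multimodal [Encode] representation (hard-negative sampling omitted here).
itm_logits = torch.randn(batch, 2)                        # stand-in for ITM head output
itm_labels = torch.ones(batch, dtype=torch.long)          # 1 = matched pair
itm_loss = F.cross_entropy(itm_logits, itm_labels)

# --- LM: autoregressive caption generation from the image-grounded text decoder.
vocab, seq_len = 30522, 20
lm_logits = torch.randn(batch, seq_len, vocab)            # stand-in for decoder logits
lm_targets = torch.randint(0, vocab, (batch, seq_len))    # next-token targets
lm_loss = F.cross_entropy(lm_logits.view(-1, vocab), lm_targets.view(-1))

total_loss = itc_loss + itm_loss + lm_loss
```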
2. CapFilt
CapFilt comprises two main modules: a captioner and a filter. Both modules are initialized from the same pre-trained MED model and fine-tuned individually on the COCO dataset.
- Captioner: This image-grounded text decoder is fine-tuned using the Language Modeling (LM) objective to create synthetic captions for web images, generating one synthetic caption for each web image.
- Filter: This image-grounded text encoder is fine-tuned with the ITC and ITM objectives to judge whether a text matches an image. It removes noisy texts from both the original web alt-texts and the synthetic captions: a text is discarded if the ITM head predicts that it does not match the image.
The filtered image-text pairs produced by CapFilt are combined with high-quality human-annotated pairs to form a new dataset. This dataset is utilized to pre-train a new model, leveraging both synthetic and filtered data to enhance vision-language learning.
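Conceptually, the bootstrapping procedure looks like the loop below. This is a sketch under stated assumptions, not the released BLIP code; the function and method names (`generate`, `matches`) are placeholders.

```python
# Sketch of the CapFilt bootstrapping loop described above.
def capfilt(web_pairs, human_pairs, captioner, filter_model):
    """web_pairs: iterable of (image, alt_text); human_pairs: high-quality annotated pairs."""
    bootstrapped = []
    for image, web_text in web_pairs:
        # Captioner: image-grounded decoder fine-tuned with the LM objective on COCO,
        # producing one synthetic caption per web image.
        synthetic_text = captioner.generate(image)

        # Filter: image-grounded encoder fine-tuned with ITC/ITM; its ITM head decides
        # whether a text matches the image, so unmatched (noisy) texts are discarded.
        for text in (web_text, synthetic_text):
            if filter_model.matches(image, text):
                bootstrapped.append((image, text))

    # The filtered web pairs are combined with human-annotated pairs
    # to pre-train a new MED model.
    return bootstrapped + list(human_pairs)
```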
OWL-ViT (Open-Vocabulary Object Detection with Vision Transformers)
Conventional object detection models, like YOLO (You Only Look Once), are restricted to a fixed set of categories, limiting their detection capabilities to predefined objects. Conversely, open-vocabulary detection models can generalize to new, previously unseen categories by utilizing extra information, such as natural language descriptions or image-text pairs, for recognizing and localizing objects beyond their training scope. This section will explore Vision Transformer for Open-World Localization (OWL-ViT), which establishes a flexible and scalable foundation for future research in open-vocabulary localization tasks.
1. Model Architecture and Pretraining in OWL-ViT
The objective of the paper is to create a straightforward and scalable open-vocabulary object detector using standard Transformer models, capitalizing on their scalability and performance in closed-vocabulary detection. The approach consists of two stages:
- Contrastively pre-training image and text encoders on large-scale image-text data.
- Incorporating detection heads and fine-tuning on medium-sized detection datasets.
Architecture:
- The model uses a Vision Transformer (ViT) as the image encoder and a similar Transformer as the text encoder; this pairing forms the pre-training stage.
- In this stage, both encoders are trained from scratch with a contrastive objective to develop shared representations. Image representations are aggregated via multihead attention pooling (MAP), while the text representation is taken from the final (end-of-sequence) token of the text encoder.
- For the detection stage, the image encoder is adapted by removing token pooling and the final projection layer; each output token is then linearly projected to produce per-object image embeddings for classification.
- The number of predicted objects equals the sequence length of the image encoder, at least 576 tokens for the models considered (ViT-B/32 at an input size of 768 × 768), which is more than enough for current datasets (e.g., LVIS contains at most 294 instances per image).
- Bounding Box Head: The predicted bounding box coordinates are derived by processing image token representations through a small MLP.
- Classification Head: The model features a lightweight classification head attached to each token output, responsible for identifying the class of the object represented by that token.
- Text Embeddings for Open-Vocabulary Classification: Instead of fixed class embeddings, the model employs text embeddings sourced from the text encoder, created by passing category names or textual descriptions through the text encoder.
- Query Matching: During inference, the model matches each predicted object's image embedding against the text-derived query embeddings (which represent the candidate object classes). The similarity between image and text embeddings gives the likelihood that each class (query) applies to the object inside the corresponding predicted box (a simplified sketch of this matching step follows the list below).
- OWL-ViT refrains from fusion of image and text embeddings. Instead, text and image embeddings are matched at the final stage to ascertain which text queries (object descriptions) correspond with which image regions (bounding boxes). This methodology allows for precomputing text embeddings independently from the image, requiring only one forward pass through the image encoder, regardless of the number of text queries, thereby significantly enhancing inference efficiency.
- One- or Few-Shot Transfer: Given the absence of fusion between image and text encoders, the model accommodates various types of embeddings, such as image-derived embeddings. This permits the detection of objects that are challenging to describe textually, utilizing representative images instead.
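To make the late matching concrete, here is a simplified PyTorch sketch of the inference step. The tensor shapes, the threshold value, and the plain sigmoid over cosine similarities are illustrative assumptions; the actual classification head includes additional learned parameters and is trained with detection losses.

```python
# Sketch of OWL-ViT-style late query matching: text query embeddings can be precomputed once
# and reused for every image, since there is no fusion between the two encoders.
import torch
import torch.nn.functional as F

num_tokens, dim, num_queries = 576, 512, 3    # e.g. ViT-B/32 at 768x768 -> 576 per-object tokens

image_embeds = torch.randn(num_tokens, dim)   # stand-in for per-object image embeddings
pred_boxes = torch.rand(num_tokens, 4)        # stand-in for box head output: one box per token
query_embeds = torch.randn(num_queries, dim)  # stand-in for text embeddings of class descriptions

# Classification: similarity between each object embedding and each text query.
logits = F.normalize(image_embeds, dim=-1) @ F.normalize(query_embeds, dim=-1).t()  # [576, 3]
scores = logits.sigmoid()                     # per-query score for each predicted box

# Keep the best-matching query per box and drop low-confidence predictions.
best_scores, best_queries = scores.max(dim=-1)
keep = best_scores > 0.5
detections = [(pred_boxes[i], best_queries[i].item(), best_scores[i].item())
              for i in torch.nonzero(keep).flatten()]
```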
In conclusion, CLIP, BLIP, and OWL-ViT signify substantial progress in merging visual and textual data through contrastive learning. CLIP's innovative approach has paved the way for subsequent models like BLIP and OWL-ViT, each addressing distinct challenges in visual comprehension and object detection. These models not only demonstrate the potential of integrating images and text but also lay the groundwork for future advancements in artificial intelligence.
Thank you for taking the time to read this article. I hope you found it informative!
References
- CLIP paper: https://arxiv.org/abs/2103.00020
- BLIP paper: https://arxiv.org/abs/2201.12086
- OWL-ViT paper: https://arxiv.org/abs/2205.06230