ML + Vision Top-6 Agent Survey - ICCV 2023

Venue: IEEE International Conference on Computer Vision
Year: 2023
Page: 1 / 1
Papers: 19

Papers

What does CLIP know about a red circle? Visual prompt engineering for VLMs Paper

Authors: Aleksandar Shtedritski, C. Rupprecht, A. Vedaldi
Year: 2023
Venue: IEEE International Conference on Computer Vision
DOI: 10.1109/ICCV51070.2023.01101
Citations: 268
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, vlms).
Code: Not found.
Extraction: method/data pending

Abstract

Large-scale Vision-Language Models, such as CLIP, learn powerful image-text representations that have found numerous applications, from zero-shot classification to text-to-image generation. Despite that, their capabilities for solving novel discriminative tasks via prompting fall behind those of large language models, such as GPT-3. Here we explore the idea of visual prompt engineering for solving computer vision tasks beyond classification by editing in image space instead of text. In particular, we discover an emergent ability of CLIP, where, by simply drawing a red circle around an object, we can direct the model’s attention to that region, while also maintaining global information. We show the power of this simple approach by achieving state-of-the-art in zero-shot referring expressions comprehension and strong performance in keypoint localization tasks. Finally, we draw attention to some potential ethical concerns of large language-vision models.

Claim

Large-scale Vision-Language Models, such as CLIP, learn powerful image-text representations that have found numerous applications, from zero-shot classification to text-to-image generation.
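
The visual prompt described in the abstract amounts to drawing a red circle around the object of interest before the image is passed to CLIP. The sketch below shows only that image-editing step on a toy pixel grid. The function name `draw_red_circle`, the grid representation and the ring thickness are illustrative assumptions, not the authors' code; scoring the marked image against candidate captions is not shown.

```python
def draw_red_circle(image, center, radius, thickness=1):
    """Return a copy of `image` (rows of RGB tuples) with a red ring drawn on it.

    Hypothetical helper for illustration: pixels whose distance to `center`
    is within thickness / 2 of `radius` are painted red, all others are kept.
    """
    cx, cy = center
    out = [list(row) for row in image]  # copy so the input stays unmodified
    for y, row in enumerate(out):
        for x in range(len(row)):
            distance = ((x - cx) ** 2 + (y - cy) ** 2) ** 0.5
            if abs(distance - radius) <= thickness / 2:
                row[x] = (255, 0, 0)
    return out


# Toy 9x9 black image with the "object" at the center pixel.
blank = [[(0, 0, 0)] * 9 for _ in range(9)]
marked = draw_red_circle(blank, center=(4, 4), radius=3)
```

Because only the ring is painted, the pixels inside and outside the circle are left unchanged, which matches the abstract's point that global information is maintained.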

Teaching CLIP to Count to Ten Paper

Authors: Roni Paiss, Ariel Ephrat, Omer Tov, Shiran Zada, Inbar Mosseri, M. Irani, Tali Dekel
Year: 2023
Venue: IEEE International Conference on Computer Vision
DOI: 10.1109/ICCV51070.2023.00294
Citations: 206
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, large vision language models).
Code: Not found.
Extraction: method/data pending

Abstract

Large vision-language models (VLMs), such as CLIP, learn rich joint image-text representations, facilitating advances in numerous downstream tasks, including zero-shot classification and text-to-image generation. Nevertheless, existing VLMs exhibit a prominent well-documented limitation – they fail to encapsulate compositional concepts such as counting. We introduce a simple yet effective method to improve the quantitative understanding of VLMs, while maintaining their overall performance on common benchmarks. Specifically, we propose a new counting-contrastive loss used to finetune a pre-trained VLM in tandem with its original objective. Our counting loss is deployed over automatically-created counterfactual examples, each consisting of an image and a caption containing an incorrect object count. For example, an image depicting three dogs is paired with the caption "Six dogs playing in the yard" as a negative example. Our loss encourages discrimination between the correct caption and its counterfactual variant which serves as a hard negative example. To the best of our knowledge, this work is the first to extend CLIP’s capabilities to object counting. Furthermore, we introduce "CountBench" – a new image-text counting benchmark for evaluating object counting capabilities. We demonstrate a significant improvement over state-of-the-art baseline models on this task. Finally, we leverage our counting-aware CLIP model for image retrieval and text-conditioned image generation, demonstrating that our model can produce specific counts of objects more reliably than existing ones.

Claim

Large vision-language models (VLMs), such as CLIP, learn rich joint image-text representations, facilitating advances in numerous downstream tasks, including zero-shot classification and text-to-image generation.
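
The counterfactual captions described in the abstract pair an image with a caption whose object count is wrong. A minimal sketch of that caption-swapping step follows. The helper `make_counterfactual`, the two-to-ten word list and the first-match rule are assumptions made for illustration, and the paper's actual data pipeline may differ.

```python
import random
import re

# Number words considered by this sketch; the two-to-ten range is an assumption.
NUMBER_WORDS = ["two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"]
_NUMBER_RE = re.compile(r"\b(" + "|".join(NUMBER_WORDS) + r")\b", re.IGNORECASE)


def make_counterfactual(caption, rng):
    """Swap the first number word in `caption` for a different one.

    Returns None when the caption contains no number word, since such a
    caption cannot yield a counting hard negative.
    """
    match = _NUMBER_RE.search(caption)
    if match is None:
        return None
    original = match.group(1)
    candidates = [w for w in NUMBER_WORDS if w != original.lower()]
    replacement = rng.choice(candidates)
    if original[0].isupper():
        replacement = replacement.capitalize()
    return caption[:match.start()] + replacement + caption[match.end():]


rng = random.Random(0)
negative = make_counterfactual("Three dogs playing in the yard", rng)
print(negative)
```

The correct caption and its counterfactual would then be scored against the same image, with the loss rewarding a higher score for the correct one.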

Preventing Zero-Shot Transfer Degradation in Continual Learning of Vision-Language Models Paper

Authors: Zangwei Zheng, Mingyu Ma, Kai Wang, Ziheng Qin, Xiangyu Yue, Yang You
Year: 2023
Venue: IEEE International Conference on Computer Vision
DOI: 10.1109/ICCV51070.2023.01752
Citations: 145
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models).
Code: https://github.com/Thunderbeee/ZSCL (URL given in the abstract)
Extraction: method/data pending

Abstract

Continual learning (CL) can help pre-trained vision-language models efficiently adapt to new or under-trained data distributions without re-training. Nevertheless, during the continual training of the Contrastive Language-Image Pre-training (CLIP) model, we observe that the model’s zero-shot transfer ability significantly degrades due to catastrophic forgetting. Existing CL methods can mitigate forgetting by replaying previous data. However, since the CLIP dataset is private, replay methods cannot access the pre-training dataset. In addition, replaying data of previously learned downstream tasks can enhance their performance but comes at the cost of sacrificing zero-shot performance. To address this challenge, we propose a novel method ZSCL to prevent zero-shot transfer degradation in the continual learning of vision-language models in both feature and parameter space. In the feature space, a reference dataset is introduced for distillation between the current and initial models. The reference dataset should have semantic diversity but no need to be labeled, seen in pre-training, or matched image-text pairs. In parameter space, we prevent a large parameter shift by averaging weights during the training. We propose a more challenging Multi-domain Task Incremental Learning (MTIL) benchmark to evaluate different methods, where tasks are from various domains instead of class-separated in a single dataset. Our method outperforms other methods in the traditional class-incremental learning setting and the MTIL by 9.7% average score. Our code locates at https://github.com/Thunderbeee/ZSCL.

Claim

Continual learning (CL) can help pre-trained vision-language models efficiently adapt to new or under-trained data distributions without re-training.
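
The parameter-space part of the abstract, averaging weights during training to limit parameter shift, can be illustrated with a running mean over parameter snapshots. This is a sketch under assumptions: the function name, the flat-list parameter format and the equal weighting of snapshots are illustrative, and the abstract does not specify the exact averaging schedule.

```python
def update_running_average(avg_params, new_params, num_averaged):
    """Fold one more parameter snapshot into a running mean.

    `avg_params` is assumed to hold the mean of `num_averaged` earlier
    snapshots; the result is the mean of `num_averaged + 1` snapshots.
    """
    return [
        (avg * num_averaged + new) / (num_averaged + 1)
        for avg, new in zip(avg_params, new_params)
    ]


# Toy run: three snapshots of a two-parameter model taken during training.
snapshots = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
averaged = snapshots[0]
for count, snapshot in enumerate(snapshots[1:], start=1):
    averaged = update_running_average(averaged, snapshot, count)
print(averaged)  # [3.0, 2.0]
```

The averaged parameters stay closer to the early snapshots than the final snapshot does, which is the intuition behind using averaging to limit drift from the initial model.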

A Sentence Speaks a Thousand Images: Domain Generalization through Distilling CLIP with Language Guidance Paper

Authors: Zeyi Huang, Andy Zhou, Zijia Lin, Mu Cai, Haohan Wang, Yong Jae Lee
Year: 2023
Venue: IEEE International Conference on Computer Vision
DOI: 10.1109/ICCV51070.2023.01073
Citations: 59
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, large vision language model, large vision language models).
Code: Not found.
Extraction: method/data pending

Abstract

Domain generalization studies the problem of training a model with samples from several domains (or distributions) and then testing the model with samples from a new, unseen domain. In this paper, we propose a novel approach for domain generalization that leverages recent advances in large vision-language models, specifically a CLIP teacher model, to train a smaller model that generalizes to unseen domains. The key technical contribution is a new type of regularization that requires the student’s learned image representations to be close to the teacher’s learned text representations obtained from encoding the corresponding text descriptions of images. We introduce two designs of the loss function, absolute and relative distance, which provide specific guidance on how the training process of the student model should be regularized. We evaluate our proposed method, dubbed RISE (Regularized Invariance with Semantic Embeddings), on various benchmark datasets, and show that it outperforms several state-of-the-art domain generalization methods. To our knowledge, our work is the first to leverage knowledge distillation using a large vision-language model for domain generalization. By incorporating text-based information, RISE improves the generalization capability of machine learning models.

Claim

Domain generalization studies the problem of training a model with samples from several domains (or distributions) and then testing the model with samples from a new, unseen domain.

Distribution-Aware Prompt Tuning for Vision-Language Models Paper

Authors: Eulrang Cho, Jooyeon Kim, Hyunwoo J. Kim
Year: 2023
Venue: IEEE International Conference on Computer Vision
DOI: 10.1109/ICCV51070.2023.02011
Citations: 58
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, vlms).
Code: https://github.com/mlvlab/DAPT (URL given in the abstract)
Extraction: method/data pending

Abstract

Pre-trained vision-language models (VLMs) have shown impressive performance on various downstream tasks by utilizing knowledge learned from large data. In general, the performance of VLMs on target tasks can be further improved by prompt tuning, which adds context to the input image or text. By leveraging data from target tasks, various prompt-tuning methods have been studied in the literature. A key to prompt tuning is the feature space alignment between two modalities via learnable vectors with model parameters fixed. We observed that the alignment becomes more effective when embeddings of each modality are ‘well-arranged’ in the latent space. Inspired by this observation, we proposed distribution-aware prompt tuning (DAPT) for vision-language models, which is simple yet effective. Specifically, the prompts are learned by maximizing inter-dispersion, the distance between classes, as well as minimizing the intra-dispersion measured by the distance between embeddings from the same class. Our extensive experiments on 11 benchmark datasets demonstrate that our method significantly improves generalizability. The code is available at https://github.com/mlvlab/DAPT.

Claim

Pre-trained vision-language models (VLMs) have shown impressive performance on various downstream tasks by utilizing knowledge learned from large data.
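
The two quantities named in the abstract, inter-dispersion between classes and intra-dispersion within a class, can be made concrete with a small example. The definitions below (Euclidean distance to a class centroid, and mean pairwise distance between centroids) are assumptions chosen for illustration. The paper's exact formulation, including which modality each term is computed on, may differ.

```python
import math
from itertools import combinations


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def intra_dispersion(embeddings_by_class):
    """Mean distance from each embedding to its own class centroid."""
    distances = []
    for vectors in embeddings_by_class.values():
        center = centroid(vectors)
        distances.extend(math.dist(v, center) for v in vectors)
    return sum(distances) / len(distances)


def inter_dispersion(embeddings_by_class):
    """Mean pairwise distance between class centroids."""
    centers = [centroid(v) for v in embeddings_by_class.values()]
    pairs = list(combinations(centers, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


# Toy 2-D embeddings for two classes.
toy = {
    "cat": [[0.0, 0.0], [0.0, 2.0]],
    "dog": [[4.0, 0.0], [4.0, 2.0]],
}
print(intra_dispersion(toy), inter_dispersion(toy))  # 1.0 4.0
```

A prompt-tuning objective in this spirit would push the first value down and the second value up while the model weights stay fixed.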

Efficient Adaptive Human-Object Interaction Detection with Concept-guided Memory Paper

Authors: Ting Lei, Fabian Caba, Qingchao Chen, Hailin Jin, Yuxin Peng, Yang Liu
Year: 2023
Venue: IEEE International Conference on Computer Vision
DOI: 10.1109/ICCV51070.2023.00596
Citations: 58
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, large vision language models).
Code: https://github.com/ltttpku/ADA-CM (URL given in the abstract)
Extraction: method/data pending

Abstract

Human Object Interaction (HOI) detection aims to localize and infer the relationships between a human and an object. Arguably, training supervised models for this task from scratch presents challenges due to the performance drop over rare classes and the high computational cost and time required to handle long-tailed distributions of HOIs in complex HOI scenes in realistic settings. This observation motivates us to design an HOI detector that can be trained even with long-tailed labeled data and can leverage existing knowledge from pre-trained models. Inspired by the powerful generalization ability of the large Vision-Language Models (VLM) on classification and retrieval tasks, we propose an efficient Adaptive HOI Detector with Concept-guided Memory (ADA-CM). ADA-CM has two operating modes. The first mode makes it tunable without learning new parameters in a training-free paradigm. Its second mode incorporates an instance-aware adapter mechanism that can further efficiently boost performance if updating a lightweight set of parameters can be afforded. Our proposed method achieves competitive results with state-of-the-art on the HICO-DET and V-COCO datasets with much less training time. Code can be found at https://github.com/ltttpku/ADA-CM.

Claim

Human Object Interaction (HOI) detection aims to localize and infer the relationships between a human and an object.

Linear Spaces of Meanings: Compositional Structures in Vision-Language Models Paper

Authors: Matthew Trager, Pramuditha Perera, L. Zancato, A. Achille, Parminder Bhatia, S. Soatto
Year: 2023
Venue: IEEE International Conference on Computer Vision
DOI: 10.1109/ICCV51070.2023.01412
Citations: 55
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, vlms).
Code: Not found.
Extraction: method/data pending

Abstract

We investigate compositional structures in data embeddings from pre-trained vision-language models (VLMs). Traditionally, compositionality has been associated with algebraic operations on embeddings of words from a preexisting vocabulary. In contrast, we seek to approximate representations from an encoder as combinations of a smaller set of vectors in the embedding space. These vectors can be seen as "ideal words" for generating concepts directly within embedding space of the model. We first present a framework for understanding compositional structures from a geometric perspective. We then explain what these compositional structures entail probabilistically in the case of VLM embeddings, providing intuitions for why they arise in practice. Finally, we empirically explore these structures in CLIP’s embeddings and we evaluate their usefulness for solving different vision-language tasks such as classification, debiasing, and retrieval. Our results show that simple linear algebraic operations on embedding vectors can be used as compositional and interpretable methods for regulating the behavior of VLMs.

Claim

We investigate compositional structures in data embeddings from pre-trained vision-language models (VLMs).

Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models Paper

Authors: Baoshuo Kan, Teng Wang, Wenpeng Lu, Xiantong Zhen, Weili Guan, Feng Zheng
Year: 2023
Venue: IEEE International Conference on Computer Vision
DOI: 10.1109/ICCV51070.2023.01436
Citations: 55
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models).
Code: Not found.
Extraction: method/data pending

Abstract

Pre-trained vision-language models, e.g., CLIP, working with manually designed prompts have demonstrated great capacity of transfer learning. Recently, learnable prompts achieve state-of-the-art performance, which however are prone to overfit to seen classes, failing to generalize to unseen classes. In this paper, we propose a Knowledge-Aware Prompt Tuning (KAPT) framework for vision-language models. Our approach takes the inspiration from human intelligence in which external knowledge is usually incorporated into recognizing novel categories of objects. Specifically, we design two complementary types of knowledge-aware prompts for the text encoder to leverage the distinctive characteristics of category-related external knowledge. The discrete prompt extracts the key information from descriptions of an object category, and the learned continuous prompt captures overall contexts. We further design an adaptation head for the visual encoder to aggregate salient attentive visual cues, which establishes discriminative and task-aware visual representations. We conduct extensive experiments on 11 widely-used benchmark datasets and the results verify the effectiveness in few-shot image classification, especially in generalizing to unseen categories. Compared with the state-of-the-art CoCoOp method, KAPT exhibits favorable performance and achieves an absolute gain of 3.22% on new classes and 2.57% in terms of harmonic mean.

Claim

Pre-trained vision-language models, e.g., CLIP, working with manually designed prompts have demonstrated great capacity of transfer learning.
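
The abstract reports gains on new classes and on a harmonic mean. Assuming this is the usual base-to-new summary metric, that is, the harmonic mean of base-class and new-class accuracy (the abstract does not spell this out), it can be computed as below. The accuracy values are made up for illustration and are not results from the paper.

```python
def harmonic_mean(base_acc, new_acc):
    """Harmonic mean of two accuracies, in the same units as the inputs."""
    return 2 * base_acc * new_acc / (base_acc + new_acc)


# Illustrative numbers only: a model that is strong on base classes but
# weaker on new classes is pulled toward the lower of the two scores.
print(round(harmonic_mean(80.0, 60.0), 2))  # 68.57
```

Because the harmonic mean is dominated by the smaller value, a method cannot score well on it by overfitting to seen classes alone.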

Context-Aware Planning and Environment-Aware Memory for Instruction Following Embodied Agents Paper

Authors: Byeonghwi Kim, Jinyeon Kim, Yuyeong Kim, Cheol-Hui Min, Jonghyun Choi
Year: 2023
Venue: IEEE International Conference on Computer Vision
DOI: 10.1109/ICCV51070.2023.01004
Citations: 52
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: embodied agents (matched: embodied agents).
Code: Not found.
Extraction: method/data pending

Abstract

Accomplishing household tasks requires to plan step-by-step actions considering the consequences of previous actions. However, the state-of-the-art embodied agents often make mistakes in navigating the environment and interacting with proper objects due to imperfect learning by imitating experts or algorithmic planners without such knowledge. To improve both visual navigation and object interaction, we propose to consider the consequence of taken actions by CAPEAM (Context-Aware Planning and Environment-Aware Memory) that incorporates semantic context (e.g., appropriate objects to interact with) in a sequence of actions, and the changed spatial arrangement and states of interacted objects (e.g., location that the object has been moved to) in inferring the subsequent actions. We empirically show that the agent with the proposed CAPEAM achieves state-of-the-art performance in various metrics using a challenging interactive instruction following benchmark in both seen and unseen environments by large margins (up to +10.70% in unseen env.).

Claim

Accomplishing household tasks requires to plan step-by-step actions considering the consequences of previous actions.

Going Beyond Nouns With Vision & Language Models Using Synthetic Data Paper

Authors: Paola Cascante-Bonilla, Khaled Shehada, James Smith, Sivan Doveh, Donghyun Kim, Rameswar Panda, Gül Varol, A. Oliva, Vicente Ordonez, R. Feris, Leonid Karlinsky
Year: 2023
Venue: IEEE International Conference on Computer Vision
DOI: 10.1109/ICCV51070.2023.01844
Citations: 49
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models).
Code: Not found.
Extraction: method/data pending

Abstract

Large-scale pre-trained Vision & Language (VL) models have shown remarkable performance in many applications, enabling replacing a fixed set of supported classes with zero-shot open vocabulary reasoning over (almost arbitrary) natural language prompts. However, recent works have uncovered a fundamental weakness of these models. For example, their difficulty to understand Visual Language Concepts (VLC) that go ‘beyond nouns’ such as the meaning of non-object words (e.g., attributes, actions, relations, states, etc.), or difficulty in performing compositional reasoning such as understanding the significance of the order of the words in a sentence. In this work, we investigate to which extent purely synthetic data could be leveraged to teach these models to overcome such shortcomings without compromising their zero-shot capabilities. We contribute Synthetic Visual Concepts (SyViC) - a million-scale synthetic dataset and data generation codebase allowing to generate additional suitable data to improve VLC understanding and compositional reasoning of VL models. Additionally, we propose a general VL finetuning strategy for effectively leveraging SyViC towards achieving these improvements. Our extensive experiments and ablations on VL-Checklist, Winoground, and ARO benchmarks demonstrate that it is possible to adapt strong pre-trained VL models with synthetic data significantly enhancing their VLC understanding (e.g. by 9.9% on ARO and 4.3% on VL-Checklist) with under 1% drop in their zero-shot accuracy.

Claim

Large-scale pre-trained Vision & Language (VL) models have shown remarkable performance in many applications, enabling replacing a fixed set of supported classes with zero-shot open vocabulary reasoning over (almost arbitrary) natural language prompts.

Black Box Few-Shot Adaptation for Vision-Language models Paper

Authors: Yassine Ouali, Adrian Bulat, Brais Martínez, Georgios Tzimiropoulos
Year: 2023
Venue: IEEE International Conference on Computer Vision
DOI: 10.1109/ICCV51070.2023.01424
Citations: 49
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models).
Code: https://github.com/saic-fi/LFA (URL given in the abstract)
Extraction: method/data pending

Abstract

Vision-Language (V-L) models trained with contrastive learning to align the visual and language modalities have been shown to be strong few-shot learners. Soft prompt learning is the method of choice for few-shot downstream adaption aiming to bridge the modality gap caused by the distribution shift induced by the new domain. While parameter-efficient, prompt learning still requires access to the model weights and can be computationally infeasible for large models with billions of parameters. To address these shortcomings, in this work, we describe a black-box method for V-L few-shot adaptation that (a) operates on pre-computed image and text features and hence works without access to the model’s weights, (b) it is orders of magnitude faster at training time, (c) it is amenable to both supervised and unsupervised training, and (d) it can be even used to align image and text features computed from uni-modal models. To achieve this, we propose Linear Feature Alignment (LFA), a simple linear approach for V-L re-alignment in the target domain. LFA is initialized from a closed-form solution to a least-squares problem and then it is iteratively updated by minimizing a re-ranking loss. Despite its simplicity, our approach can even surpass soft-prompt learning methods as shown by extensive experiments on 11 image and 2 video datasets. Code available at: https://github.com/saic-fi/LFA

Claim

Vision-Language (V-L) models trained with contrastive learning to align the visual and language modalities have been shown to be strong few-shot learners.

Distilling Large Vision-Language Model with Out-of-Distribution Generalizability Paper

Authors: Xuanlin Li, Yunhao Fang, Minghua Liu, Z. Ling, Z. Tu, Haoran Su
Year: 2023
Venue: IEEE International Conference on Computer Vision
DOI: 10.1109/ICCV51070.2023.00236
Citations: 48
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, large vision language model, large vision language models).
Code: Not found.
Extraction: method/data pending

Abstract

Large vision-language models have achieved outstanding performance, but their size and computational requirements make their deployment on resource-constrained devices and time-sensitive tasks impractical. Model distillation, the process of creating smaller, faster models that maintain the performance of larger models, is a promising direction towards the solution. This paper investigates the distillation of visual representations in large teacher vision-language models into lightweight student models using a small- or mid-scale dataset. Notably, this study focuses on open-vocabulary out-of-distribution (OOD) generalization, a challenging problem that has been overlooked in previous model distillation literature. We propose two principles from vision and language modality perspectives to enhance student’s OOD generalization: (1) by better imitating teacher’s visual representation space, and carefully promoting better coherence in vision-language alignment with the teacher; (2) by enriching the teacher’s language representations with informative and fine-grained semantic attributes to effectively distinguish between different labels. We propose several metrics and conduct extensive experiments to investigate their techniques. The results demonstrate significant improvements in zero-shot and few-shot student performance on open-vocabulary out-of-distribution classification, highlighting the effectiveness of our proposed approaches. Code released at this link.

Claim

Large vision-language models have achieved outstanding performance, but their size and computational requirements make their deployment on resource-constrained devices and time-sensitive tasks impractical.

Gradient-Regulated Meta-Prompt Learning for Generalizable Vision-Language Models Paper

Authors: Juncheng Li, Minghe Gao, Longhui Wei, Siliang Tang, Wenqiao Zhang, Meng Li, Wei Ji, Qi Tian, Tat-seng Chua, Yueting Zhuang
Year: 2023
Venue: IEEE International Conference on Computer Vision
DOI: 10.1109/ICCV51070.2023.00241
Citations: 35
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models).
Code: Not found.
Extraction: method/data pending

Abstract

Prompt tuning, a recently emerging paradigm, enables the powerful vision-language pre-training models to adapt to downstream tasks in a parameter- and data-efficient way, by learning the "soft prompts" to condition frozen pretraining models. Though effective, it is particularly problematic in the few-shot scenario, where prompt tuning performance is sensitive to the initialization and requires a time-consuming process to find a good initialization, thus restricting the fast adaptation ability of the pre-training models. In addition, prompt tuning could undermine the generalizability of the pre-training models, because the learnable prompt tokens are easy to overfit to the limited training samples. To address these issues, we introduce a novel Gradient-RegulAted Meta-prompt learning (GRAM) framework that jointly meta-learns an efficient soft prompt initialization for better adaptation and a lightweight gradient regulating function for strong cross-domain generalizability in a meta-learning paradigm using only the unlabeled image-text pre-training data. Rather than designing a specific prompt tuning method, our GRAM can be easily incorporated into various prompt tuning methods in a model-agnostic way, and comprehensive experiments show that GRAM brings about consistent improvement for them in several settings (i.e., few-shot learning, cross-domain generalization, cross-dataset generalization, etc.) over 11 datasets. Further, experiments show that GRAM enables the orthogonal methods of textual and visual prompt tuning to work in a mutually-enhanced way, offering better generalizability beyond the uni-modal prompt tuning methods.

Claim

Prompt tuning, a recently emerging paradigm, enables the powerful vision-language pre-training models to adapt to downstream tasks in a parameter- and data-efficient way, by learning the "soft prompts" to condition frozen pretraining models.

LoGoPrompt: Synthetic Text Images Can Be Good Visual Prompts for Vision-Language Models Paper

Authors: Cheng Shi, Sibei Yang
Year: 2023
Venue: IEEE International Conference on Computer Vision
DOI: 10.1109/ICCV51070.2023.00274
Citations: 35
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models).
Code: Not found.
Extraction: method/data pending

Abstract

Prompt engineering is a powerful tool used to enhance the performance of pre-trained models on downstream tasks. For example, providing the prompt "Let’s think step by step" improved GPT-3’s reasoning accuracy to 63% on MultiArith while prompting "a photo of" filled with a class name enables CLIP to achieve 80% zero-shot accuracy on ImageNet. While previous research has explored prompt learning for the visual modality, analyzing what constitutes a good visual prompt specifically for image recognition is limited. In addition, existing visual prompt tuning methods’ generalization ability is worse than text-only prompting tuning. This paper explores our key insight: synthetic text images are good visual prompts for vision-language models! To achieve that, we propose our LoGoPrompt, which reformulates the classification objective to the visual prompt selection and addresses the chicken-and-egg challenge of first adding synthetic text images as class-wise visual prompts or predicting the class first. Without any trainable visual prompt parameters, experimental results on 16 datasets demonstrate that our method consistently outperforms state-of-the-art methods in few-shot learning, base-to-new generalization, and domain generalization.

Claim

Prompt engineering is a powerful tool used to enhance the performance of pre-trained models on downstream tasks.

Why Is Prompt Tuning for Vision-Language Models Robust to Noisy Labels? Paper

Authors: Cheng-En Wu, Yu Tian, Haichao Yu, Heng Wang, Pedro Morgado, Yu Hen Hu, Linjie Yang
Year: 2023
Venue: IEEE International Conference on Computer Vision
DOI: 10.1109/ICCV51070.2023.01420
Citations: 30
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, vision language model).
Code: https://github.com/CEWu/PTNL (URL given in the abstract)
Extraction: method/data pending

Abstract

Vision-language models such as CLIP [27] learn a generic text-image embedding from large-scale training data. A vision-language model can be adapted to a new classification task through few-shot prompt tuning. We find that such a prompt tuning process is highly robust to label noises. This intrigues us to study the key reasons contributing to the robustness of the prompt tuning paradigm. We conducted extensive experiments to explore this property and find the key factors are: 1) the fixed classname tokens provide a strong regularization to the optimization of the model, reducing gradients induced by the noisy samples; 2) the powerful pre-trained image-text embedding that is learned from diverse and generic web data provides strong prior knowledge for image classification. Further, we demonstrate that noisy zero-shot predictions from CLIP can be used to tune its own prompt, significantly enhancing prediction accuracy in the unsupervised setting. The code is available at https://github.com/CEWu/PTNL.

Claim

Vision-language models such as CLIP [27] learn a generic text-image embedding from large-scale training data.

ViLLA: Fine-Grained Vision-Language Representation Learning from Real-World Data Paper

Authors: M. Varma, Jean-Benoit Delbrouck, Sarah Hooper, Akshay Chaudhari, C. Langlotz
Year: 2023
Venue: IEEE International Conference on Computer Vision
DOI: 10.1109/ICCV51070.2023.02031
Citations: 15
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, vlms).
Code: Not found.
Extraction: method/data pending

Abstract

Vision-language models (VLMs), such as CLIP and ALIGN, are generally trained on datasets consisting of image-caption pairs obtained from the web. However, real-world multimodal datasets, such as healthcare data, are significantly more complex: each image (e.g. X-ray) is often paired with text (e.g. physician report) that describes many distinct attributes occurring in fine-grained regions of the image. We refer to these samples as exhibiting high pairwise complexity, since each image-text pair can be decomposed into a large number of region-attribute pairings. The extent to which VLMs can capture fine-grained relationships between image regions and textual attributes when trained on such data has not been previously evaluated. The first key contribution of this work is to demonstrate through systematic evaluations that as the pairwise complexity of the training dataset increases, standard VLMs struggle to learn region-attribute relationships, exhibiting performance degradations of up to 37% on retrieval tasks. In order to address this issue, we introduce ViLLA as our second key contribution. ViLLA, which is trained to capture fine-grained region-attribute relationships from complex datasets, involves two components: (a) a lightweight, self-supervised mapping model to decompose image-text samples into region-attribute pairs, and (b) a contrastive VLM to learn representations from generated region-attribute pairs. We demonstrate with experiments across four domains (synthetic, product, medical, and natural images) that ViLLA outperforms comparable VLMs on fine-grained reasoning tasks, such as zero-shot object detection (up to 3.6 AP50 points on COCO and 0.6 mAP points on LVIS) and retrieval (up to 14.2 R-Precision points).

Claim

Vision-language models (VLMs), such as CLIP and ALIGN, are generally trained on datasets consisting of image-caption pairs obtained from the web.

Regularized Mask Tuning: Uncovering Hidden Knowledge in Pre-trained Vision-Language Models Paper

Authors: Kecheng Zheng, Wei Wu, Ruili Feng, Kai Zhu, Jiawei Liu, Deli Zhao, Zhengjun Zha, Wei Chen, Yujun Shen
Year: 2023
Venue: IEEE International Conference on Computer Vision
DOI: 10.1109/ICCV51070.2023.01071
Citations: 13
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, vlms).
Code: Not found.
Extraction: method/data pending

Abstract

Prompt tuning and adapter tuning have shown great potential in transferring pre-trained vision-language models (VLMs) to various downstream tasks. In this work, we design a new type of tuning method, termed as regularized mask tuning, which masks the network parameters through a learnable selection. Inspired by neural pathways, we argue that the knowledge required by a downstream task already exists in the pre-trained weights but just gets concealed in the upstream pre-training stage. To bring the useful knowledge back into light, we first identify a set of parameters that are important to a given downstream task, then attach a binary mask to each parameter, and finally optimize these masks on the downstream data with the parameters frozen. When updating the mask, we introduce a novel gradient dropout strategy to regularize the parameter selection, in order to prevent the model from forgetting old knowledge and overfitting the downstream data. Experimental results on 11 datasets demonstrate the consistent superiority of our method over previous alternatives. It is noteworthy that we manage to deliver 18.73% performance improvement compared to the zero-shot CLIP via masking an average of only 2.56% parameters. Furthermore, our method is synergistic with most existing parameter-efficient tuning methods and can boost the performance on top of them. Project page can be found here.

Claim

Prompt tuning and adapter tuning have shown great potential in transferring pre-trained vision-language models (VLMs) to various downstream tasks.
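
The core mechanism in the abstract, attaching a binary mask to each selected parameter while the parameters themselves stay frozen, can be sketched as follows. The real-valued `scores`, the zero threshold and the function names are assumptions for illustration. How the masks are optimized (including the gradient dropout strategy) is not shown and is not specified here.

```python
def binarize(scores, threshold=0.0):
    """Turn real-valued mask scores into a 0/1 mask (assumed thresholding rule)."""
    return [1.0 if s > threshold else 0.0 for s in scores]


def masked_weights(frozen_weights, scores):
    """Apply the binary mask; the frozen weights themselves are never changed."""
    mask = binarize(scores)
    return [w * m for w, m in zip(frozen_weights, mask)]


def masked_fraction(scores):
    """Share of parameters switched off by the mask."""
    mask = binarize(scores)
    return mask.count(0.0) / len(mask)


frozen = [0.5, 1.2, 2.0, 0.7]   # pre-trained weights, kept fixed
scores = [0.3, -0.1, 0.7, 0.2]  # learnable mask scores (hypothetical values)
print(masked_weights(frozen, scores))  # [0.5, 0.0, 2.0, 0.7]
print(masked_fraction(scores))         # 0.25
```

Only the scores would be updated on downstream data, so the method selects which pre-trained weights to use rather than learning new ones.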

Ray Conditioning: Trading Photo-consistency for Photo-realism in Multi-view Image Generation Paper

Authors: Eric Ming Chen, Sidhanth Holalkere, Ruyu Yan, Kai Zhang, A. Davis
Year: 2023
Venue: IEEE International Conference on Computer Vision
DOI: 10.1109/ICCV51070.2023.02124
Citations: 11
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: alpha factor search (matched: trading).
Code: Not found.
Extraction: method/data pending

Abstract

Multi-view image generation attracts particular attention these days due to its promising 3D-related applications, e.g., image viewpoint editing. Most existing methods follow a paradigm where a 3D representation is first synthesized, and then rendered into 2D images to ensure photo-consistency across viewpoints. However, such explicit bias for photo-consistency sacrifices photo-realism, causing geometry artifacts and loss of fine-scale details when these methods are applied to edit real images. To address this issue, we propose ray conditioning, a geometry-free alternative that relaxes the photo-consistency constraint. Our method generates multi-view images by conditioning a 2D GAN on a light field prior. With explicit viewpoint control, state-of-the-art photo-realism and identity consistency, our method is particularly suited for the viewpoint editing task.

Claim

Multi-view image generation attracts particular attention these days due to its promising 3D-related applications, e.g., image viewpoint editing.