ML + Vision Top-6 Agent Survey - CVPR 2024 - Page 5 of 5

Venue: Computer Vision and Pattern Recognition
Year: 2024
Page: 5 / 5
Papers: 121-143 / 143

Papers

Cropper: Vision-Language Model for Image Cropping through In-Context Learning Paper

Authors: Seung Hyun Lee, Junjie Ke, Yinxiao Li, Junfeng He, Steven Hickson, Katie Datsenko, Sangpil Kim, Ming-Hsuan Yang, Irfan Essa, Feng Yang
Year: 2024
Venue: Computer Vision and Pattern Recognition
DOI: 10.1109/CVPR52734.2025.02793
Citations: 10
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, large vision language models, vision language model).
Code: Not found.
Extraction: method/data pending

Abstract

The goal of image cropping is to identify visually appealing crops in an image. Conventional methods are trained on specific datasets and fail to adapt to new requirements. Recent breakthroughs in large vision-language models (VLMs) enable visual in-context learning without explicit training. However, downstream tasks with VLMs remain under explored. In this paper, we propose an effective approach to leverage VLMs for image cropping. First, we propose an efficient prompt retrieval mechanism for image cropping to automate the selection of in-context examples. Second, we introduce an iterative refinement strategy to iteratively enhance the predicted crops. The proposed framework, we refer to as Cropper, is applicable to a wide range of cropping tasks, including free-form cropping, subject-aware cropping, and aspect ratio-aware cropping. Extensive experiments demonstrate that Cropper significantly outperforms state-of-the-art methods across several benchmarks.

Claim

The goal of image cropping is to identify visually appealing crops in an image.

ReWind: Understanding Long Videos with Instructed Learnable Memory Paper

Authors: Anxhelo Diko, Ting Wang, Wassim Swaileh, Shiyan Sun, Ioannis Patras
Year: 2024
Venue: Computer Vision and Pattern Recognition
DOI: 10.1109/CVPR52734.2025.01282
Citations: 10
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, vlms).
Code: Not found.
Extraction: method/data pending

Abstract

Vision-Language Models (VLMs) are crucial for applications requiring integrated understanding of textual and visual information. However, existing VLMs struggle with long videos due to computational inefficiency, memory limitations, and difficulties in maintaining coherent understanding across extended sequences. To address these challenges, we introduce ReWind, a novel memory-based VLM designed for efficient long video understanding while preserving temporal fidelity. ReWind operates in a two-stage framework. In the first stage, ReWind maintains a dynamic learnable memory module with a novel read-perceive-write cycle that stores and updates instruction-relevant visual information as the video unfolds. This module utilizes learnable queries and cross-attentions between memory contents and the input stream, ensuring low memory requirements by scaling linearly with the number of tokens. In the second stage, we propose an adaptive frame selection mechanism guided by the memory content to identify instruction-relevant key moments. It enriches the memory representations with detailed spatial information by selecting a few high-resolution frames, which are then combined with the memory contents and fed into a Large Language Model (LLM) to generate the final answer. We empirically demonstrate ReWind’s superior performance in visual question answering (VQA) and temporal grounding tasks, surpassing previous methods on long video benchmarks. Notably, ReWind achieves a +13% score gain and a +12% accuracy improvement on the MovieChat-1K VQA dataset and an +8% mIoU increase on Charades-STA for temporal grounding.

Claim

Vision-Language Models (VLMs) are crucial for applications requiring integrated understanding of textual and visual information.
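
The ReWind abstract above reports an mIoU gain for temporal grounding. As a minimal sketch of how that kind of metric is commonly computed, the snippet below averages the intersection-over-union of predicted and ground-truth time intervals. The function names and the sample intervals are illustrative assumptions, not taken from the paper or from the Charades-STA evaluation code.

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) intervals given in seconds."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Overlap is zero when the intervals do not intersect.
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


def mean_iou(pairs):
    """Average temporal IoU over (prediction, ground_truth) pairs."""
    return sum(temporal_iou(p, g) for p, g in pairs) / len(pairs)


# Hypothetical predictions and annotations, for illustration only.
pairs = [
    ((2.0, 8.0), (4.0, 10.0)),   # partial overlap
    ((0.0, 5.0), (0.0, 5.0)),    # exact match
    ((1.0, 2.0), (3.0, 4.0)),    # no overlap
]
print(mean_iou(pairs))
```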

Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models Paper

Authors: Shuyang Hao, Bryan Hooi, Jun Liu, Kai-Wei Chang, Zi Huang, Yujun Cai
Year: 2024
Venue: Computer Vision and Pattern Recognition
DOI: 10.48550/arXiv.2411.18000
Citations: 10
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, vlms).
Code: Not found.
Extraction: method/data pending

Abstract

Despite inheriting security measures from underlying language models, Vision-Language Models (VLMs) may still be vulnerable to safety alignment issues. Through empirical analysis, we uncover two critical findings: scenario-matched images can significantly amplify harmful outputs, and contrary to common assumptions in gradient-based attacks, minimal loss values do not guarantee optimal attack effectiveness. Building on these insights, we introduce MLAI (Multi-Loss Adversarial Images), a novel jailbreak framework that leverages scenario-aware image generation for semantic alignment, exploits flat minima theory for robust adversarial image selection, and employs multi-image collaborative attacks for enhanced effectiveness. Extensive experiments demonstrate MLAI’s significant impact, achieving attack success rates of 77.75% on MiniGPT-4 and 82.80% on LLaVA-2, substantially outperforming existing methods by margins of 34.37% and 12.77% respectively. Furthermore, MLAI shows considerable transferability to commercial black-box VLMs, achieving up to 60.11% success rate. Our work reveals fundamental visual vulnerabilities in current VLM safety mechanisms and underscores the need for stronger defenses. Warning: This paper contains potentially harmful example text.

Claim

Despite inheriting security measures from underlying language models, Vision-Language Models (VLMs) may still be vulnerable to safety alignment issues.

JoAPR: Cleaning the Lens of Prompt Learning for Vision-Language Models Paper

Authors: Yuncheng Guo, Xiaodong Gu
Year: 2024
Venue: Computer Vision and Pattern Recognition
DOI: 10.1109/CVPR52733.2024.02711
Citations: 9
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models).
Code: Not found.
Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.

ODE: Open-Set Evaluation of Hallucinations in Multimodal Large Language Models Paper

Authors: Yahan Tu, Rui Hu, Jitao Sang
Year: 2024
Venue: Computer Vision and Pattern Recognition
DOI: 10.1109/CVPR52734.2025.01847
Citations: 9
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: mllms, multimodal large language models).
Code: Not found.
Extraction: method/data pending

Abstract

Hallucination poses a persistent challenge for multimodal large language models (MLLMs). However, existing benchmarks for evaluating hallucinations are generally static, which may overlook the potential risk of data contamination. To address this issue, we propose ODE, an open-set, dynamic protocol designed to evaluate object hallucinations in MLLMs at both the existence and attribute levels. ODE employs a graph-based structure to represent real-world object concepts, their attributes, and the distributional associations between them. This structure facilitates the extraction of concept combinations based on diverse distributional criteria, generating varied samples for structured queries that evaluate hallucinations in both generative and discriminative tasks. Through the generation of new samples, dynamic concept combinations, and varied distribution frequencies, ODE mitigates the risk of data contamination and broadens the scope of evaluation. This protocol is applicable to both general and specialized scenarios, including those with limited data. Experimental results demonstrate the effectiveness of our protocol, revealing that MLLMs exhibit higher hallucination rates when evaluated with ODE-generated samples, which indicates potential data contamination. Furthermore, these generated samples aid in analyzing hallucination patterns and fine-tuning models, offering an effective approach to mitigating hallucinations in MLLMs. Our code is available at https://github.com/Iridescent-y/ODE.

Claim

Hallucination poses a persistent challenge for multimodal large language models (MLLMs).

Noise Diffusion for Enhancing Semantic Faithfulness in Text-to-Image Synthesis Paper

Authors: Boming Miao, Chunxiao Li, Xiaoxiao Wang, Andi Zhang, Rui Sun, Zizhe Wang, Yao Zhu
Year: 2024
Venue: Computer Vision and Pattern Recognition
DOI: 10.1109/CVPR52734.2025.02195
Citations: 9
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, large vision language models, lvlms).
Code: Not found.
Extraction: method/data pending

Abstract

Diffusion models have achieved impressive success in generating photorealistic images, but challenges remain in ensuring precise semantic alignment with input prompts. Optimizing the initial noisy latent offers a more efficient alternative to modifying model architectures or prompt engineering for improving semantic alignment. A recent approach, InitNo, refines the initial noisy latent by leveraging attention maps; however, these maps capture only limited information, and the effectiveness of InitNo is highly dependent on the initial starting point, as it tends to converge on a local optimum near this point. To this end, this paper proposes leveraging the language comprehension capabilities of large vision-language models (LVLMs) to guide the optimization of the initial noisy latent, and introduces the Noise Diffusion process, which updates the noisy latent to generate semantically faithful images while preserving distribution consistency. Furthermore, we provide a theoretical analysis of the condition under which the update improves semantic faithfulness. Experimental results demonstrate the effectiveness and adaptability of our framework, consistently enhancing semantic alignment across various diffusion models.

Claim

Diffusion models have achieved impressive success in generating photorealistic images, but challenges remain in ensuring precise semantic alignment with input prompts.

NLPrompt: Noise-Label Prompt Learning for Vision-Language Models Paper

Authors: Bikang Pan, Qun Li, Xiaoying Tang, Wei Huang, Zhen Fang, Feng Liu, Jingya Wang, Jingyi Yu, Ye Shi
Year: 2024
Venue: Computer Vision and Pattern Recognition
DOI: 10.1109/CVPR52734.2025.01859
Citations: 9
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models).
Code: Not found.
Extraction: method/data pending

Abstract

The emergence of vision-language foundation models, such as CLIP, has revolutionized image-text representation, enabling a broad range of applications via prompt learning. Despite its promise, real-world datasets often contain noisy labels that can degrade prompt learning performance. In this paper, we demonstrate that using mean absolute error (MAE) loss in prompt learning, named PromptMAE, significantly enhances robustness against noisy labels while maintaining high accuracy. Though MAE is straightforward and recognized for its robustness, it is rarely used in noisy-label learning due to its slow convergence and poor performance outside prompt learning scenarios. To elucidate the robustness of PromptMAE, we leverage feature learning theory to show that MAE can suppress the influence of noisy samples, thereby improving the signal-to-noise ratio and enhancing overall robustness. Additionally, we introduce PromptOT, a prompt-based optimal transport data purification method to enhance the robustness further. PromptOT employs text features in vision-language models as prototypes to construct an optimal transportation matrix. This matrix effectively partitions datasets into clean and noisy subsets, allowing for the application of cross-entropy loss to the clean subset and MAE loss to the noisy subset. Our Noise-Label Prompt Learning method, named NLPrompt, offers a simple and efficient approach that leverages the expressive representations and precise alignment capabilities of vision-language models for robust prompt learning. We validate NLPrompt through extensive experiments across various noise settings, demonstrating significant performance improvements.

Claim

The emergence of vision-language foundation models, such as CLIP, has revolutionized image-text representation, enabling a broad range of applications via prompt learning.
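
The NLPrompt abstract above contrasts mean absolute error (MAE) with cross-entropy under noisy labels and applies cross-entropy to the clean subset and MAE to the noisy one. A minimal sketch of why MAE is the gentler of the two on a mislabeled sample follows. It assumes softmax outputs and a one-hot target; the `is_clean` flag is a hypothetical stand-in for the partition that the paper obtains from PromptOT, which is not reproduced here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Unbounded: grows without limit as probs[label] approaches 0."""
    return -math.log(probs[label])


def mae(probs, label):
    """L1 distance to the one-hot target; equals 2 * (1 - probs[label]), so at most 2."""
    return sum(abs(p - (1.0 if k == label else 0.0)) for k, p in enumerate(probs))


def split_loss(probs, label, is_clean):
    """Cross-entropy for samples flagged clean, MAE for samples flagged noisy."""
    return cross_entropy(probs, label) if is_clean else mae(probs, label)


probs = softmax([6.0, 0.0, 0.0])   # the model is confident in class 0
wrong_label = 1                    # hypothetical mislabeled sample
ce_penalty = cross_entropy(probs, wrong_label)
mae_penalty = mae(probs, wrong_label)
print(f"CE on wrong label:  {ce_penalty:.3f}")
print(f"MAE on wrong label: {mae_penalty:.3f}")
```

Because the MAE penalty is capped at 2, a confidently contradicted label cannot dominate the total loss the way it can under cross-entropy.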

HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding Paper

Authors: Chenxin Tao, Shiqian Su, Xizhou Zhu, Chenyu Zhang, Zhe Chen, Jiawen Liu, Wenhai Wang, Lewei Lu, Gao Huang, Yu Qiao, Jifeng Dai
Year: 2024
Venue: Computer Vision and Pattern Recognition
DOI: 10.1109/CVPR52734.2025.01357
Citations: 9
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, vlms).
Code: Not found.
Extraction: method/data pending

Abstract

The rapid advance of Large Language Models (LLMs) has catalyzed the development of Vision-Language Models (VLMs). Monolithic VLMs, which avoid modality-specific encoders, offer a promising alternative to the compositional ones but face the challenge of inferior performance. Most existing monolithic VLMs require tuning pre-trained LLMs to acquire vision abilities, which may degrade their language capabilities. To address this dilemma, this paper presents a novel high-performance monolithic VLM named HoVLE. We note that LLMs have been shown to be capable of interpreting images when image embeddings are aligned with text embeddings. The challenge for current monolithic VLMs actually lies in the lack of a holistic embedding module for both vision and language inputs. Therefore, HoVLE introduces a holistic embedding module that converts visual and textual inputs into a shared space, allowing LLMs to process images in the same way as texts. Furthermore, a multi-stage training strategy is carefully designed to empower the holistic embedding module. It is first trained to distill visual features from a pre-trained vision encoder and text embeddings from the LLM, enabling large-scale training with unpaired random images and text tokens. The whole model further undergoes next-token prediction on multi-modal data to align the embeddings. Finally, an instruction-tuning stage is incorporated. Our experiments show that HoVLE achieves performance close to leading compositional models on various benchmarks, outperforming previous monolithic models by a large margin.

Claim

The rapid advance of Large Language Models (LLMs) has catalyzed the development of Vision-Language Models (VLMs).

Semantic Shield: Defending Vision-Language Models Against Backdooring and Poisoning via Fine-Grained Knowledge Alignment Paper

Authors: A. Ishmam, C. Thomas
Year: 2024
Venue: Computer Vision and Pattern Recognition
DOI: 10.1109/CVPR52733.2024.02344
Citations: 8
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models).
Code: Not found.
Extraction: method/data pending

Abstract

In recent years there has been enormous interest in vision-language models trained using self-supervised objectives. However, the use of large-scale datasets scraped from the web for training also makes these models vulnerable to potential security threats, such as backdooring and poisoning attacks. In this paper, we propose a method for mitigating such attacks on contrastively trained vision-language models. Our approach leverages external knowledge extracted from a language model to prevent models from learning correlations between image regions which lack strong alignment with external knowledge. We do this by imposing constraints to enforce that attention paid by the model to visual regions is proportional to the alignment of those regions with external knowledge. We conduct extensive experiments using a variety of recent backdooring and poisoning attacks on multiple datasets and architectures. Our results clearly demonstrate that our proposed approach is highly effective at defending against such attacks across multiple settings, while maintaining model utility and without requiring any changes at inference time.

Claim

In recent years there has been enormous interest in vision-language models trained using self-supervised objectives.

Towards Understanding and Quantifying Uncertainty for Text-to-Image Generation Paper

Authors: Gianni Franchi, D. Trong, Nacim Belkhir, Guoxuan Xia, Andrea Pilzer
Year: 2024
Venue: Computer Vision and Pattern Recognition
DOI: 10.1109/CVPR52734.2025.00755
Citations: 8
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, lvlm, large vision language models).
Code: Not found.
Extraction: method/data pending

Abstract

Uncertainty quantification in text-to-image (T2I) generative models is crucial for understanding model behavior and improving output reliability. In this paper, we are the first to quantify and evaluate the uncertainty of T2I models with respect to the prompt. Alongside adapting existing approaches designed to measure uncertainty in the image space, we also introduce Prompt-based UNCertainty Estimation for T2I models (PUNC), a novel method leveraging Large Vision-Language Models (LVLMs) to better address uncertainties arising from the semantics of the prompt and generated images. PUNC utilizes an LVLM to caption a generated image, and then compares the caption with the original prompt in the more semantically meaningful text space. PUNC also enables the disentanglement of both aleatoric and epistemic uncertainties via precision and recall, which image-space approaches are unable to do. Extensive experiments demonstrate that PUNC outperforms state-of-the-art uncertainty estimation techniques across various settings. Uncertainty quantification in text-to-image generation models can be used in various applications including bias detection, copyright protection, and OOD detection. We also introduce a comprehensive dataset of text prompts and generation pairs to foster further research in uncertainty quantification for generative models. Our findings illustrate that PUNC not only achieves competitive performance but also enables novel applications in evaluating and improving the trustworthiness of text-to-image models. The code is available at https://github.com/ENSTA-U2IS-AI/Uncertainty_diffusion

Claim

Uncertainty quantification in text-to-image (T2I) generative models is crucial for understanding model behavior and improving output reliability.
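
The PUNC abstract above compares a caption of the generated image with the original prompt in text space and reads off precision and recall. The sketch below illustrates that comparison with a deliberately crude stand-in: word-set overlap instead of an LVLM caption and a learned text-similarity score. The helper names, the tokenization, and the example strings are assumptions made for illustration; the abstract does not say how PUNC scores the two texts.

```python
def concept_set(text):
    """Lower-cased words with surrounding punctuation stripped (toy tokenizer)."""
    words = (w.strip(".,;:!?").lower() for w in text.split())
    return {w for w in words if w}


def precision_recall(prompt, caption):
    """Precision: share of caption words found in the prompt.
    Recall: share of prompt words found in the caption."""
    prompt_words = concept_set(prompt)
    caption_words = concept_set(caption)
    overlap = len(prompt_words & caption_words)
    precision = overlap / len(caption_words) if caption_words else 0.0
    recall = overlap / len(prompt_words) if prompt_words else 0.0
    return precision, recall


# Hypothetical prompt and caption, for illustration only.
prompt = "a red car on a beach"
caption = "a red car parked on a road"
precision, recall = precision_recall(prompt, caption)
print(precision, recall)
```

Low recall here signals prompt content that is missing from the image, while low precision signals image content the prompt never asked for.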

Evaluating Vision-Language Models as Evaluators in Path Planning Paper

Authors: Mohamed Aghzal, Xiangyu Yue, E. Plaku, Ziyu Yao
Year: 2024
Venue: Computer Vision and Pattern Recognition
DOI: 10.1109/CVPR52734.2025.00646
Citations: 7
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, vlms).
Code: Not found.
Extraction: method/data pending

Abstract

Despite their promise to perform complex reasoning, large language models (LLMs) have been shown to have limited effectiveness in end-to-end planning. This has inspired an intriguing question: if these models cannot plan well, can they still contribute to the planning framework as a helpful plan evaluator? In this work, we generalize this question to consider LLMs augmented with visual understanding, i.e., Vision-Language Models (VLMs). We introduce PathEval, a novel benchmark evaluating VLMs as plan evaluators in complex path-planning scenarios. Succeeding in the benchmark requires a VLM to be able to abstract traits of optimal paths from the scenario description, demonstrate precise low-level perception on each path, and integrate this information to decide the better path. Our analysis of state-of-the-art VLMs reveals that these models face significant challenges on the benchmark. We observe that the VLMs can precisely abstract given scenarios to identify the desired traits and exhibit mixed performance in integrating the provided information. Yet, their vision component presents a critical bottleneck, with models struggling to perceive low-level details about a path. Our experimental results show that this issue cannot be trivially addressed via end-to-end fine-tuning; rather, task-specific discriminative adaptation of these vision encoders is needed for these VLMs to become effective path evaluators.

Claim

Despite their promise to perform complex reasoning, large language models (LLMs) have been shown to have limited effectiveness in end-to-end planning.

Sonic VisionLM: Playing Sound with Vision Language Models Paper

Authors: Zhifeng Xie, Shengye Yu, Qile He, Mengtian Li
Year: 2024
Venue: Computer Vision and Pattern Recognition
DOI: 10.1109/CVPR52733.2024.02537
Citations: 5
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, vlms).
Code: Not found.
Extraction: method/data pending

Abstract

There has been a growing interest in the task of generating sound for silent videos, primarily because of its practicality in streamlining video post-production. However, existing methods for video-sound generation attempt to directly create sound from visual representations, which can be challenging due to the difficulty of aligning visual representations with audio representations. In this paper, we present Sonic VisionLM, a novel framework aimed at generating a wide range of sound effects by leveraging vision-language models (VLMs). Instead of generating audio directly from video, we use the capabilities of powerful VLMs. When provided with a silent video, our approach first identifies events within the video using a VLM to suggest possible sounds that match the video content. This shift in approach transforms the challenging task of aligning image and audio into more well-studied sub-problems of aligning image-to-text and text-to-audio through the popular diffusion models. To improve the quality of audio recommendations with LLMs, we have collected an extensive dataset that maps text descriptions to specific sound effects and developed a time-controlled audio adapter. Our approach surpasses current state-of-the-art methods for converting video to audio, enhancing synchronization with the visuals, and improving alignment between audio and video components. Project page: https://yusiissy.github.io/SonicVisionLM.github.io/

Claim

There has been a growing interest in the task of generating sound for silent videos, primarily because of its practicality in streamlining video post-production.

Non-autoregressive Sequence-to-Sequence Vision-Language Models Paper

Authors: Kunyu Shi, Qi Dong, Luis Goncalves, Zhuowen Tu, S. Soatto
Year: 2024
Venue: Computer Vision and Pattern Recognition
DOI: 10.1109/CVPR52733.2024.01291
Citations: 5
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, vision language model).
Code: Not found.
Extraction: method/data pending

Abstract

Sequence-to-sequence vision-language models are showing promise, but their applicability is limited by their inference latency due to their autoregressive way of generating predictions. We propose a parallel decoding sequence-to-sequence vision-language model, trained with a Query-CTC loss, that marginalizes over multiple inference paths in the decoder. This allows us to model the joint distribution of tokens, rather than restricting to conditional distribution as in an autoregressive model. The resulting model, NARVL, achieves performance on-par with its state-of-the-art autoregressive counterpart, but is faster at inference time, reducing from the linear complexity associated with the sequential generation of tokens to a paradigm of constant time joint inference.

Claim

Sequence-to-sequence vision-language models are showing promise, but their applicability is limited by their inference latency due to their autoregressive way of generating predictions.
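
The NARVL abstract above mentions a CTC-style loss that marginalizes over multiple inference paths. As a minimal sketch of what "multiple paths" means in standard CTC, the snippet below collapses a per-position token path (merge adjacent repeats, then drop blanks) and counts how many paths map to the same output. It shows plain CTC only; the Query-CTC variant is not described in the abstract and is not reproduced here. The blank symbol and the toy vocabulary are assumptions.

```python
from itertools import groupby, product

BLANK = "_"  # assumed blank symbol for this toy example


def ctc_collapse(path):
    """Standard CTC collapse: merge adjacent repeats, then remove blanks."""
    merged = [token for token, _ in groupby(path)]
    return [token for token in merged if token != BLANK]


def paths_for_target(vocab, length, target):
    """Enumerate every fixed-length path that collapses to `target`."""
    return [p for p in product(vocab, repeat=length) if ctc_collapse(p) == target]


print(ctc_collapse(["a", "a", BLANK, "cat", "cat", BLANK]))

# All length-3 paths over {"a", blank} that decode to the single token "a".
# A CTC loss sums the probabilities of exactly these paths.
matching = paths_for_target(["a", BLANK], 3, ["a"])
print(len(matching))
```

Since every output position is predicted in parallel, the decoder needs one forward pass regardless of the output length, which is the latency advantage the abstract describes.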

Towards Better Vision-Inspired Vision-Language Models Paper

Authors: Yunhao Cao, Kaixiang Ji, Ziyuan Huang, Chuanyang Zheng, Jiajia Liu, Jian Wang, Jingdong Chen, Ming Yang
Year: 2024
Venue: Computer Vision and Pattern Recognition
DOI: 10.1109/CVPR52733.2024.01285
Citations: 5
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models).
Code: Not found.
Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.

Calico: Part-Focused Semantic Co-Segmentation with Large Vision-Language Models Paper

Authors: Kiet A. Nguyen, A. Juvekar, Tianjiao Yu, Muntasir Wahed, Ismini Lourentzou
Year: 2024
Venue: Computer Vision and Pattern Recognition
DOI: 10.1109/CVPR52734.2025.00429
Citations: 5
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, lvlm, large vision language models).
Code: Not found.
Extraction: method/data pending

Abstract

Recent advances in Large Vision-Language Models (LVLMs) have enabled general-purpose vision tasks through visual instruction tuning. While existing LVLMs can generate segmentation masks from text prompts for single images, they struggle with segmentation-grounded reasoning across images, especially at finer granularities such as object parts. In this paper, we introduce the new task of part-focused semantic co-segmentation, which involves identifying and segmenting common objects and their constituent common and unique parts across images. To address this task, we present Calico, the first LVLM designed for multi-image part-level reasoning segmentation. Calico features two key components, a novel Correspondence Extraction Module that identifies semantic part-level correspondences, and Correspondence Adaptation Modules that embed this information into the LVLM to facilitate multi-image understanding in a parameter-efficient manner. To support training and evaluation, we curate MixedParts, a large-scale multi-image segmentation dataset containing ∼2.4M samples across ∼44K images spanning diverse object and part categories. Experimental results demonstrate that Calico, with just 0.3% of its parameters finetuned, achieves strong performance on this challenging task.

Claim

Recent advances in Large Vision-Language Models (LVLMs) have enabled general-purpose vision tasks through visual instruction tuning.

Text2QR: Harmonizing Aesthetic Customization and Scanning Robustness for Text-Guided QR Code Generation Paper

Authors: Guangyang Wu, Xiaohong Liu, Jun Jia, Xuehao Cui, Guangtao Zhai
Year: 2024
Venue: Computer Vision and Pattern Recognition
DOI: 10.1109/CVPR52733.2024.00808
Citations: 4
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: program synthesis (matched: code generation).
Code: Not found.
Extraction: method/data pending

Abstract

In the digital era, QR codes serve as a linchpin connecting virtual and physical realms. Their pervasive integration across various applications highlights the demand for aesthetically pleasing codes without compromised scannability. However, prevailing methods grapple with the intrinsic challenge of balancing customization and scannability. Notably, stable-diffusion models have ushered in an epoch of high-quality, customizable content generation. This paper introduces Text2QR, a pioneering approach leveraging these advancements to address a fundamental challenge: concurrently achieving user-defined aesthetics and scanning robustness. To ensure stable generation of aesthetic QR codes, we introduce the QR Aesthetic Blueprint (QAB) module, generating a blueprint image exerting control over the entire generation process. Subsequently, the Scannability Enhancing Latent Refinement (SELR) process refines the output iteratively in the latent space, enhancing scanning robustness. This approach harnesses the potent generation capabilities of stable-diffusion models, navigating the trade-off between image aesthetics and QR code scannability. Our experiments demonstrate the seamless fusion of visual appeal with the practical utility of aesthetic QR codes, markedly outperforming prior methods. Codes are available at https://github.com/mulns/Text2QR

Claim

In the digital era, QR codes serve as a linchpin connecting virtual and physical realms.

BACON: Improving Clarity of Image Captions via Bag-of-Concept Graphs Paper

Authors: Zhantao Yang, Ruili Feng, Keyu Yan, Huangji Wang, Zhicai Wang, Shangwen Zhu, Han Zhang, Jie Xiao, Ping Wu, Kai Zhu, Jixuan Chen, Chen-Wei Xie, et al.
Year: 2024
Venue: Computer Vision and Pattern Recognition
DOI: 10.1109/CVPR52734.2025.01341
Citations: 4
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, large vision language models).
Code: Not found.
Extraction: method/data pending

Abstract

Advancements in large Vision-Language Models have brought precise, accurate image captioning, vital for advancing multi-modal image understanding and processing. Yet these captions often carry lengthy, intertwined contexts that are difficult to parse and frequently overlook essential cues, posing a great barrier for models like GroundingDINO and SDXL, which lack the strong text encoding and syntax analysis needed to fully leverage dense captions. To address this, we propose BACON, a prompting method that breaks down VLM-generated captions into disentangled, structured elements such as objects, relationships, styles, and themes. This approach not only minimizes confusion from handling complex contexts but also allows for efficient transfer into a JSON dictionary, enabling models without linguistic processing capabilities to easily access key information. We annotated 100,000 image-caption pairs using BACON with GPT-4V and trained an LLaVA captioner on this dataset, enabling it to produce BACON-style captions without relying on costly GPT-4V. Evaluations of overall quality, precision, and recall—as well as user studies—demonstrate that the resulting caption model consistently outperforms other SOTA VLM models in generating high-quality captions. Besides, we show that BACON-style captions exhibit better clarity when applied to various models, enabling them to accomplish previously unattainable tasks or surpass existing SOTA solutions without training. For example, BACON-style captions help GroundingDINO achieve 1.51× higher recall scores on open-vocabulary object detection tasks compared to leading methods.

Claim

Advancements in large Vision-Language Models have brought precise, accurate image captioning, vital for advancing multi-modal image understanding and processing.
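
The BACON abstract above describes breaking a caption into structured elements (objects, relationships, styles, themes) that transfer into a JSON dictionary. A minimal sketch of what such a dictionary might look like, and how a downstream model without language parsing could read it, is shown below. The field names and the example content are hypothetical; the abstract does not specify the actual schema.

```python
import json

# Hypothetical BACON-style caption; keys mirror the element types named
# in the abstract, but the exact schema is an assumption.
caption = {
    "objects": [
        {"name": "dog", "description": "small brown terrier"},
        {"name": "ball", "description": "red rubber ball"},
    ],
    "relationships": [
        {"subject": "dog", "predicate": "chases", "object": "ball"},
    ],
    "styles": ["outdoor photograph"],
    "themes": ["play"],
}


def object_names(structured_caption):
    """Pull the plain object list, e.g. as queries for an open-vocabulary detector."""
    return [obj["name"] for obj in structured_caption["objects"]]


serialized = json.dumps(caption, indent=2)
restored = json.loads(serialized)
print(object_names(restored))
```

Keeping each element in its own field lets a detector consume only the object names while a generator can use the relationships and styles separately.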

F3OCUS - Federated Finetuning of Vision-Language Foundation Models with Optimal Client Layer Updating Strategy via Multi-objective Meta-Heuristics Paper

Authors: Pramit Saha, Felix Wagner, Divyanshu Mishra, Can Peng, A. Thakur, David Clifton, K. Kamnitsas, J. Noble
Year: 2024
Venue: Computer Vision and Pattern Recognition
DOI: 10.1109/CVPR52734.2025.01863
Citations: 4
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, large vision language models).
Code: Not found.
Extraction: method/data pending

Abstract

Effective training of large Vision-Language Models (VLMs) on resource-constrained client devices in Federated Learning (FL) requires the usage of parameter-efficient finetuning (PEFT) strategies. To this end, we demonstrate the impact of two factors viz., client-specific layer importance score that selects the most important VLM layers for finetuning and inter-client layer diversity score that encourages diverse layer selection across clients for optimal VLM layer selection. We first theoretically motivate and leverage the principal eigenvalue magnitude of layerwise Neural Tangent Kernels and show its effectiveness as client-specific layer importance score. Next, we propose a novel layer updating strategy dubbed F3OCUS that jointly optimizes the layer importance and diversity factors by employing a data-free, multi-objective, meta-heuristic optimization on the server. We explore 5 different meta-heuristic algorithms and compare their effectiveness for selecting model layers and adapter layers towards PEFT-FL. Furthermore, we release a new MedVQA-FL dataset involving overall 707,962 VQA triplets and 9 modality-specific clients and utilize it to train and evaluate our method. Overall, we conduct more than 10,000 client-level experiments on 6 Vision-Language FL task settings involving 58 medical image datasets and 4 different VLM architectures of varying sizes to demonstrate the effectiveness of the proposed method. Project Page: https://pramitsaha.github.io/FOCUS/

Claim

Effective training of large Vision-Language Models (VLMs) on resource-constrained client devices in Federated Learning (FL) requires the usage of parameter-efficient finetuning (PEFT) strategies.
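The F3OCUS abstract uses the principal eigenvalue magnitude of layerwise Neural Tangent Kernels as a client-specific layer importance score. The sketch below illustrates that quantity on a toy two-layer model, f(x) = sum_k v[k] * tanh(w[k] * x), with hand-derived gradients and plain power iteration. The model, the function names, and the choice of power iteration are all assumptions made for illustration; this is not the authors' implementation and it omits the inter-client diversity score and the server-side meta-heuristic optimization.

```python
import math
from typing import Dict, List, Sequence


def layer_gradients(w: Sequence[float], v: Sequence[float], x: float) -> Dict[str, List[float]]:
    """Gradients of the toy model f(x) = sum_k v[k] * tanh(w[k] * x), grouped by layer."""
    h = [math.tanh(wk * x) for wk in w]
    return {
        "hidden": [vk * (1.0 - hk * hk) * x for vk, hk in zip(v, h)],  # df/dw
        "output": h,                                                   # df/dv
    }


def gram(rows: List[List[float]]) -> List[List[float]]:
    """Layerwise NTK on a batch: K[i][j] = <grad f(x_i), grad f(x_j)> for one layer."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in rows] for r1 in rows]


def principal_eigenvalue(matrix: List[List[float]], iters: int = 200) -> float:
    """Largest eigenvalue of a symmetric PSD matrix, estimated by power iteration."""
    n = len(matrix)
    vec = [float(i + 1) for i in range(n)]  # deterministic, non-uniform start vector
    norm = math.sqrt(sum(c * c for c in vec))
    vec = [c / norm for c in vec]
    for _ in range(iters):
        nxt = [sum(m * c for m, c in zip(row, vec)) for row in matrix]
        norm = math.sqrt(sum(c * c for c in nxt))
        if norm == 0.0:
            return 0.0  # zero matrix (or start vector in the null space)
        vec = [c / norm for c in nxt]
    # Rayleigh quotient of the final unit vector
    mv = [sum(m * c for m, c in zip(row, vec)) for row in matrix]
    return sum(a * b for a, b in zip(vec, mv))


def layerwise_ntk_scores(w: Sequence[float], v: Sequence[float], xs: Sequence[float]) -> Dict[str, float]:
    """Score each layer by the principal eigenvalue of its NTK on the inputs xs."""
    grads = [layer_gradients(w, v, x) for x in xs]
    return {
        layer: principal_eigenvalue(gram([g[layer] for g in grads]))
        for layer in ("hidden", "output")
    }


if __name__ == "__main__":
    scores = layerwise_ntk_scores(w=[0.5, -1.0], v=[2.0, 0.3], xs=[-1.0, 0.5, 2.0])
    # Under this toy scoring, a client would prefer to finetune the higher-scoring layer.
    print(max(scores, key=scores.get), scores)
```

In a real VLM the per-layer gradients would come from automatic differentiation over a client's local batch rather than closed-form expressions, and the resulting scores would be only one of the two objectives the abstract describes.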

Is ‘Right’ Right? Enhancing Object Orientation Understanding in Multimodal Large Language Models through Egocentric Instruction Tuning Paper

Authors: J. Jung, E. T. Kim, Seo Yeon Kim, Joo-Ho Lee, Bumsoo Kim, Buru Chang
Year: 2024
Venue: Computer Vision and Pattern Recognition
DOI: 10.1109/CVPR52734.2025.01330
Citations: 4
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: mllm, mllms, multimodal large language models).
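Code: Not found (the abstract links a GitHub repository for the instruction data and benchmark: https://github.com/jhCOR/EgoOrientBench).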
Extraction: method/data pending

Abstract

Multimodal large language models (MLLMs) act as essential interfaces, connecting humans with AI technologies in multimodal applications. However, current MLLMs face challenges in accurately interpreting object orientation in images due to inconsistent orientation annotations in training data, hindering the development of a coherent orientation understanding. To overcome this, we propose egocentric instruction tuning, which aligns MLLMs’ orientation understanding with the user’s perspective, based on a consistent annotation standard derived from the user’s egocentric viewpoint. We first generate egocentric instruction data that leverages MLLMs’ ability to recognize object details and applies prior knowledge for orientation understanding. Using this data, we perform instruction tuning to enhance the model’s capability for accurate orientation interpretation. In addition, we introduce EgoOrientBench, a benchmark that evaluates MLLMs’ orientation understanding across three tasks using images collected from diverse domains. Experimental results on this benchmark show that egocentric instruction tuning significantly improves orientation understanding without compromising overall MLLM performance. The instruction data and benchmark dataset are available on our project page at https://github.com/jhCOR/EgoOrientBench.

Claim

Multimodal large language models (MLLMs) act as essential interfaces, connecting humans with AI technologies in multimodal applications.

Olympus: A Universal Task Router for Computer Vision Tasks Paper

Authors: Yuanze Lin, Yunsheng Li, Dongdong Chen, Weijian Xu, Ronald Clark, Philip Torr
Year: 2024
Venue: Computer Vision and Pattern Recognition
DOI: 10.1109/CVPR52734.2025.01328
Citations: 4
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: mllm, mllms, multimodal large language models).
Code: Not found.
Extraction: method/data pending

Abstract

We introduce Olympus, a new approach that transforms Multimodal Large Language Models (MLLMs) into a unified framework capable of handling a wide array of computer vision tasks. Utilizing a controller MLLM, Olympus delegates over 20 specialized tasks across images, videos, and 3D objects to dedicated modules. This instruction-based routing enables complex workflows through chained actions without the need for training heavy generative models. Olympus easily integrates with existing MLLMs, expanding their capabilities with comparable performance. Experimental results demonstrate that Olympus achieves an average routing accuracy of 94.75% across 20 tasks and precision of 91.82% in chained action scenarios, showcasing its effectiveness as a universal task router that can solve a diverse range of computer vision tasks.

Claim

We introduce Olympus, a new approach that transforms Multimodal Large Language Models (MLLMs) into a unified framework capable of handling a wide array of computer vision tasks.

Beyond Sight: Towards Cognitive Alignment in LVLM via Enriched Visual Knowledge Paper

Authors: Yaqi Zhao, Yuanyang Yin, Lin Li, Mingan Lin, Victor Shea-Jay Huang, Siwei Chen, Weipeng Chen, Baoqun Yin, Zenan Zhou, Wentao Zhang
Year: 2024
Venue: Computer Vision and Pattern Recognition
DOI: 10.1109/CVPR52734.2025.02323
Citations: 3
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, lvlm, large vision language models).
Code: Not found.
Extraction: method/data pending

Abstract

Does seeing always mean knowing? Large Vision-Language Models (LVLMs) integrate separately pre-trained vision and language components, often using CLIP-ViT as vision backbone. However, these models frequently encounter a core issue of "cognitive misalignment" between the vision encoder (VE) and the large language model (LLM). Specifically, the VE’s representation of visual information may not fully align with LLM’s cognitive framework, leading to a mismatch where visual features exceed the language model’s interpretive range. To address this, we investigate how variations in VE representations influence LVLM comprehension, especially when the LLM faces VE-Unknown data—images whose ambiguous visual representations challenge the VE’s interpretive precision. Accordingly, we construct a multi-granularity landmark dataset and systematically examine the impact of VE-Known and VE-Unknown data on interpretive abilities. Our results show that VE-Unknown data limits LVLM’s capacity for accurate understanding, while VE-Known data, rich in distinctive features, helps reduce cognitive misalignment. Building on these insights, we propose Entity-Enhanced Cognitive Alignment (EECA), a method that employs multi-granularity supervision to generate visually enriched, well-aligned tokens that not only integrate within the embedding space but also align with the LLM’s cognitive framework. This alignment markedly enhances LVLM performance in landmark recognition. Our findings underscore the challenges posed by VE-Unknown data and highlight the essential role of cognitive alignment in advancing multimodal systems.

Claim

Does seeing always mean knowing? Large Vision-Language Models (LVLMs) integrate separately pre-trained vision and language components, often using CLIP-ViT as vision backbone.

Molecular Data Programming: Towards Molecule Pseudo-labeling with Systematic Weak Supervision Paper

Authors: Xin Juan, Kaixiong Zhou, Ninghao Liu, Tianlong Chen, Xin Wang
Year: 2024
Venue: Computer Vision and Pattern Recognition
DOI: 10.1109/CVPR52733.2024.00037
Citations: 2
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: program synthesis (matched: programming).
Code: Not found.
Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.

ViUniT: Visual Unit Tests for More Robust Visual Programming Paper

Authors: Artemis Panagopoulou, Honglu Zhou, Silvio Savarese, Caiming Xiong, Christopher Callison-Burch, Mark Yatskar, Juan Carlos Niebles
Year: 2024
Venue: Computer Vision and Pattern Recognition
DOI: 10.1109/CVPR52734.2025.02295
Citations: 2
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: program synthesis (matched: programming).
Code: Not found.
Extraction: method/data pending

Abstract

Programming based approaches to reasoning tasks have substantially expanded the types of questions models can answer about visual scenes. Yet on benchmark visual reasoning data, when models answer correctly, they produce incorrect programs 33% of the time. These models are often right for the wrong reasons and risk unexpected failures on new data. Unit tests play a foundational role in ensuring code correctness and could be used to repair such failures. We propose Visual Unit Testing (ViUniT), a framework to improve the reliability of visual programs by automatically generating unit tests. In our framework, a unit test is represented as a novel image and answer pair meant to verify the logical correctness of a program produced for a given query. Our method leverages a language model to create unit tests in the form of image descriptions and expected answers, followed by image synthesis to produce corresponding images. We conduct a comprehensive analysis of what constitutes an effective visual unit test suite, exploring unit test generation, sampling strategies, image generation methods, and varying the number of programs and unit tests. Additionally, we introduce four applications of visual unit tests: best program selection, answer refusal, re-prompting, and unsupervised reward formulations for reinforcement learning. Experiments with two models across three datasets in visual question answering and image-text matching demonstrate that ViUniT improves model performance by 11.4 points in accuracy. Notably, it enables 7B open-source language models to outperform gpt-4o-mini in visual program generation by an average of 7.7 points and reduces the occurrence of programs that are correct for the wrong reasons by 40%.

Claim

Programming based approaches to reasoning tasks have substantially expanded the types of questions models can answer about visual scenes.
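The ViUniT abstract describes a unit test as an image and answer pair, and lists best program selection and answer refusal among its applications. The sketch below shows one plausible reading of those two ideas: score each candidate program by the fraction of unit tests it passes, keep the top scorer, and refuse when even the best score is low. Images are stood in for by plain dicts, and the names `unit_test_score`, `select_best_program`, and the `refuse_below` threshold are hypothetical; the paper's actual scoring and image synthesis pipeline are not specified in this listing.

```python
from typing import Callable, List, Optional, Tuple

# Assumption: an "image" is a dict of detected objects, standing in for a synthesized image.
UnitTest = Tuple[dict, str]          # (image, expected answer)
Program = Callable[[dict], str]      # a candidate visual program


def unit_test_score(program: Program, tests: List[UnitTest]) -> float:
    """Fraction of unit tests the program passes; a crash counts as a failure."""
    if not tests:
        return 0.0
    passed = 0
    for image, expected in tests:
        try:
            if program(image) == expected:
                passed += 1
        except Exception:
            pass  # a program that raises on a test image simply fails that test
    return passed / len(tests)


def select_best_program(
    programs: List[Program], tests: List[UnitTest], refuse_below: float = 0.5
) -> Tuple[Optional[Program], float]:
    """Return (best program, score), or (None, score) to signal answer refusal."""
    if not programs:
        return None, 0.0
    scores = [unit_test_score(p, tests) for p in programs]
    best_idx = max(range(len(programs)), key=lambda i: (scores[i], -i))  # ties: earliest wins
    if scores[best_idx] < refuse_below:
        return None, scores[best_idx]
    return programs[best_idx], scores[best_idx]


# Two candidate programs for the query "Is there a red cup?"
def always_yes(image: dict) -> str:
    return "yes"  # can be right on the original image for the wrong reason


def checks_color(image: dict) -> str:
    found = any(o["name"] == "cup" and o["color"] == "red" for o in image["objects"])
    return "yes" if found else "no"


TESTS: List[UnitTest] = [
    ({"objects": [{"name": "cup", "color": "red"}]}, "yes"),
    ({"objects": [{"name": "cup", "color": "blue"}]}, "no"),
    ({"objects": []}, "no"),
]

if __name__ == "__main__":
    best, score = select_best_program([always_yes, checks_color], TESTS)
    print(best.__name__ if best else "refuse", score)
```

The example mirrors the failure mode the abstract highlights: `always_yes` answers the original query correctly but fails the negative unit tests, so the selection step prefers the program whose logic actually matches the query.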