ML + Vision Top-6 Agent Survey - ICCV 2024 - Page 2 of 2

Venue: IEEE International Conference on Computer Vision
Year: 2024
Page: 2 / 2
Papers: 31-50 / 50

Papers

Multimodal LLM Guided Exploration and Active Mapping Using Fisher Information Paper

Authors: Wen Jiang, Boshu Lei, Katrina Ashton, Kostas Daniilidis
Year: 2024
Venue: IEEE International Conference on Computer Vision
DOI: 10.1109/ICCV51701.2025.00512
Citations: 9
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models, embodied agents (matched: multimodal large language models, embodied agents).
Code: Not found.
Extraction: method/data pending

Abstract

We present an active mapping system which plans for both long-horizon exploration goals and short-term actions using a 3D Gaussian Splatting (3DGS) representation. Existing methods either do not take advantage of recent developments in multimodal Large Language Models (LLM) or do not consider challenges in localization uncertainty, which is critical in embodied agents. We propose employing multimodal LLMs for long-horizon planning in conjunction with detailed motion planning using our information-based objective. By leveraging high-quality view synthesis from our 3DGS representation, our method employs a multimodal LLM as a zero-shot planner for long-horizon exploration goals from the semantic perspective. We also introduce an uncertainty-aware path proposal and selection algorithm that balances the dual objectives of maximizing the information gain for the environment while minimizing the cost of localization errors. Experiments conducted on the Gibson and Habitat-Matterport 3D datasets demonstrate state-of-the-art results of the proposed method.
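
The trade-off described above, maximizing information gain while minimizing the cost of localization errors, can be illustrated with a toy path selector. This is only a sketch under stated assumptions: the abstract does not give the form of the objective, so the linear score `info_gain - cost_weight * localization_cost`, the `CandidatePath` type, and all numbers are hypothetical, and no Fisher information is computed here.

```python
from dataclasses import dataclass


@dataclass
class CandidatePath:
    name: str
    info_gain: float          # assumed: expected information gained about the map (higher is better)
    localization_cost: float  # assumed: expected cost of pose uncertainty along the path (lower is better)


def select_path(candidates, cost_weight=1.0):
    """Return the candidate with the highest gain-minus-weighted-cost score."""
    if not candidates:
        raise ValueError("need at least one candidate path")
    return max(candidates, key=lambda p: p.info_gain - cost_weight * p.localization_cost)


# Hypothetical candidates; the values are made up for illustration.
paths = [
    CandidatePath("corridor", info_gain=5.0, localization_cost=4.0),
    CandidatePath("doorway", info_gain=4.0, localization_cost=1.0),
    CandidatePath("open_room", info_gain=6.0, localization_cost=5.5),
]
best = select_path(paths, cost_weight=1.0)
print(best.name)  # doorway
```

Setting `cost_weight=0.0` ignores localization entirely and picks `open_room`, the path with the largest raw gain, which shows why the second term changes the decision.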

Claim

We present an active mapping system which plans for both long-horizon exploration goals and short-term actions using a 3D Gaussian Splatting (3DGS) representation.

Unbiased Region-Language Alignment for Open-Vocabulary Dense Prediction Paper

Authors: Yunheng Li, Yuxuan Li, Quansheng Zeng, Wenhai Wang, Qibin Hou, Ming-Ming Cheng
Year: 2024
Venue: IEEE International Conference on Computer Vision
DOI: 10.1109/ICCV51701.2025.02209
Citations: 9
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, vlms).
Code: Not found.
Extraction: method/data pending

Abstract

Pre-trained vision-language models (VLMs), such as CLIP, have demonstrated impressive zero-shot recognition capability, but still underperform in dense prediction tasks. Self-distillation has recently emerged as a promising approach for fine-tuning VLMs to better adapt to local regions without requiring extensive annotations. However, previous state-of-the-art approaches often suffer from significant ‘foreground bias’, where models tend to wrongly identify background regions as foreground objects. To alleviate this issue, we propose DenseVLM, a framework designed to learn unbiased region-language alignment from powerful pretrained VLM representations. DenseVLM leverages the pretrained VLM to retrieve categories for unlabeled regions and then decouples the interference between foreground and background features. This separation ensures accurate region-category alignment while maintaining semantic distinctions during training. We show that DenseVLM can directly replace the original VLM in open-vocabulary object detection and image segmentation methods, leading to notable performance improvements. Furthermore, it exhibits promising zero-shot scalability when training on more extensive and diverse datasets. Our code is publicly available at https://github.com/HVision-NKU/DenseVLM.

Claim

Pre-trained vision-language models (VLMs), such as CLIP, have demonstrated impressive zero-shot recognition capability, but still underperform in dense prediction tasks.

Leveraging the Power of MLLMs for Gloss-Free Sign Language Translation Paper

Authors: Jungeun Kim, Hyeongwoo Jeon, Jongseong Bae, Ha Young Kim
Year: 2024
Venue: IEEE International Conference on Computer Vision
DOI: 10.1109/ICCV51701.2025.01956
Citations: 8
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: mllms, multimodal large language models).
Code: Not found.
Extraction: method/data pending

Abstract

Sign language translation (SLT) is a challenging task that involves translating sign language images into spoken language. For SLT models to perform this task successfully, they must bridge the modality gap and identify subtle variations in sign language components to understand their meanings accurately. To address these challenges, we propose a novel gloss-free SLT framework called Multimodal Sign Language Translation (MMSLT), which leverages the representational capabilities of off-the-shelf multimodal large language models (MLLMs). Specifically, we use MLLMs to generate detailed textual descriptions of sign language components. Then, through our proposed multimodal-language pre-training module, we integrate these description features with sign video features to align them within the spoken sentence space. Our approach achieves state-of-the-art performance on benchmark datasets PHOENIX14T and CSL-Daily, highlighting the potential of MLLMs to be utilized effectively in SLT. Code is available at https://github.com/hwjeon98/MMSLT.

Claim

Sign language translation (SLT) is a challenging task that involves translating sign language images into spoken language.

CompCap: Improving Multimodal Large Language Models with Composite Captions Paper

Authors: Xiaohui Chen, Satya Narayan Shukla, Mahmoud Azab, Aashu Singh, Qifan Wang, David Yang, ShengYun Peng, Hanchao Yu, Shen Yan, Xuewen Zhang, Baosheng He
Year: 2024
Venue: IEEE International Conference on Computer Vision
DOI: 10.1109/ICCV51701.2025.02189
Citations: 8
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: mllm, mllms, multimodal large language models).
Code: Not found.
Extraction: method/data pending

Abstract

How well can Multimodal Large Language Models (MLLMs) understand composite images? Composite images (CIs) are synthetic visuals created by merging multiple visual elements, such as charts, posters, or screenshots, rather than being captured directly by a camera. While CIs are prevalent in real-world applications, recent MLLM developments have primarily focused on interpreting natural images (NIs). Our research reveals that current MLLMs face significant challenges in accurately understanding CIs, often struggling to extract information or perform complex reasoning based on these images. We find that existing training data for CIs are mostly formatted for question-answer tasks (e.g., in datasets like ChartQA and ScienceQA), while high-quality image-caption datasets, critical for robust vision-language alignment, are only available for NIs. To bridge this gap, we introduce Composite Captions (CompCap), a flexible framework that leverages Large Language Models (LLMs) and automation tools to synthesize CIs with accurate and detailed captions. Using CompCap, we curate CompCap-118K, a dataset containing 118K image-caption pairs across six CI types. We validate the effectiveness of CompCap-118K by supervised fine-tuning MLLMs of three sizes: xGen-MM-inst.-4B and LLaVA-NeXT-Vicuna-7B/13B. Empirical results show that CompCap-118K significantly enhances MLLMs' understanding of CIs, yielding average gains of 1.7%, 2.0%, and 2.9% across eleven benchmarks, respectively.

Claim

How well can Multimodal Large Language Models (MLLMs) understand composite images? Composite images (CIs) are synthetic visuals created by merging multiple visual elements, such as charts, posters, or screenshots, rather than being captured directly by a camera.

Free²Guide: Training-Free Text-to-Video Alignment Using Image LVLM Paper

Authors: Jaemin Kim, B. Kim, Jong Chul Ye
Year: 2024
Venue: IEEE International Conference on Computer Vision
DOI: 10.1109/ICCV51701.2025.01665
Citations: 7
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, lvlm, large vision language models).
Code: Not found.
Extraction: method/data pending

Abstract

Diffusion models have achieved impressive results in generative tasks for text-to-video (T2V) synthesis. However, achieving accurate text alignment in T2V generation remains challenging due to the complex temporal dependencies across frames. Existing reinforcement learning (RL)-based approaches to enhance text alignment often require differentiable reward functions trained for videos, hindering their scalability and applicability. In this paper, we propose Free²Guide, a novel gradient-free and training-free framework for aligning generated videos with text prompts. Specifically, leveraging principles from path integral control, Free²Guide approximates guidance for diffusion models using non-differentiable reward functions, thereby enabling the integration of powerful black-box Large Vision-Language Models (LVLMs) as reward models. To enable image-trained LVLMs to assess text-to-video alignment, we leverage stitching between video frames and use system prompts to capture sequential attributions. Our framework supports the flexible ensembling of multiple reward models to synergistically enhance alignment without significant computational overhead. Experimental results confirm that Free²Guide using image-trained LVLMs significantly improves text-to-video alignment, thereby enhancing the overall video quality. Our results and code are available at our project page https://free2guide.github.io/.
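
The frame-stitching idea mentioned above, presenting several video frames to an image-only model as a single picture, can be sketched as a simple grid tiling. This is an illustrative assumption rather than the authors' procedure: the abstract does not specify the layout, so the row-major grid, the `columns` parameter, and the `stitch_frames` helper are hypothetical, and frames are plain nested lists instead of real images.

```python
def stitch_frames(frames, columns=2):
    """Tile equally sized frames (2D lists of pixels) row-major into one grid image."""
    if not frames:
        raise ValueError("need at least one frame")
    h, w = len(frames[0]), len(frames[0][0])
    rows = -(-len(frames) // columns)  # ceiling division
    # Unfilled cells (when the frame count is not a multiple of columns) stay 0.
    canvas = [[0] * (w * columns) for _ in range(h * rows)]
    for i, frame in enumerate(frames):
        r, c = divmod(i, columns)
        for y in range(h):
            canvas[r * h + y][c * w:(c + 1) * w] = frame[y]
    return canvas


# Four toy 2x2 "frames", each filled with its own index 1..4.
frames = [[[k] * 2 for _ in range(2)] for k in range(1, 5)]
grid = stitch_frames(frames, columns=2)
for row in grid:
    print(row)
```

Reading the grid left to right and top to bottom preserves the temporal order, which is what a system prompt would need to describe to the reward model.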

Claim

Diffusion models have achieved impressive results in generative tasks for text-to-video (T2V) synthesis.

Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining Paper

Authors: Zhiqi Ge, Juncheng Li, Xin Pang, Minghe Gao, Kaihang Pan, Wang Lin, Hao Fei, Wenqiao Zhang, Siliang Tang, Yueting Zhuang
Year: 2024
Venue: IEEE International Conference on Computer Vision
DOI: 10.1109/ICCV51701.2025.02277
Citations: 7
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: multimodal agents, vision-language models (matched: visual agent, visual agents, mllms, multimodal large language models).
Code: Not found.
Extraction: method/data pending

Abstract

Digital agents are increasingly employed to automate tasks in interactive digital environments such as web pages, software applications, and operating systems. While text-based agents built on Large Language Models (LLMs) often require frequent updates due to platform-specific APIs, visual agents leveraging Multimodal Large Language Models (MLLMs) offer enhanced adaptability by interacting directly with Graphical User Interfaces (GUIs). However, these agents face significant challenges in visual perception, particularly when handling high-resolution, visually complex digital environments. This paper introduces Iris, a foundational visual agent that addresses these challenges through two key innovations: Information-Sensitive Cropping (ISC) and Self-Refining Dual Learning (SRDL). ISC dynamically identifies and prioritizes visually dense regions using an edge detection algorithm, enabling efficient processing by allocating more computational resources to areas with higher information density. SRDL enhances the agent's ability to handle complex tasks by leveraging a dual-learning loop, where improvements in referring (describing UI elements) reinforce grounding (locating elements) and vice versa, all without requiring additional annotated data. Empirical evaluations demonstrate that Iris achieves state-of-the-art performance across multiple benchmarks with only 850K GUI annotations, outperforming methods using 10x more training data. These improvements further translate to significant gains in both web and OS agent downstream tasks.
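
To make the edge-density idea behind ISC concrete, the following sketch ranks fixed-size tiles of a grayscale image by the fraction of edge pixels they contain. It is a minimal illustration under assumptions, not the Iris implementation: the abstract does not name the edge detector or the cropping rule, so the finite-difference detector, the `threshold` and `tile` parameters, and both function names are hypothetical.

```python
def edge_map(image, threshold=50):
    """Mark a pixel as an edge when the intensity jump from its left or upper neighbor exceeds threshold."""
    h, w = len(image), len(image[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx = abs(image[y][x] - image[y][x - 1]) if x > 0 else 0
            dy = abs(image[y][x] - image[y - 1][x]) if y > 0 else 0
            edges[y][x] = 1 if max(dx, dy) > threshold else 0
    return edges


def rank_tiles_by_edge_density(image, tile=4, threshold=50):
    """Split the image into tile x tile blocks and sort them by edge-pixel fraction, densest first."""
    edges = edge_map(image, threshold)
    h, w = len(image), len(image[0])
    scored = []
    for ty in range(0, h, tile):
        for tx in range(0, w, tile):
            block = [edges[y][x]
                     for y in range(ty, min(ty + tile, h))
                     for x in range(tx, min(tx + tile, w))]
            scored.append(((ty, tx), sum(block) / len(block)))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy 8x8 "screenshot": flat background with a checkerboard patch in the bottom-right tile.
image = [[0] * 8 for _ in range(8)]
for y in range(4, 8):
    for x in range(4, 8):
        image[y][x] = 255 if (x + y) % 2 == 0 else 0
ranking = rank_tiles_by_edge_density(image)
print(ranking[0])  # ((4, 4), 1.0)
```

A real pipeline would presumably spend more visual tokens or finer crops on the top-ranked tiles; here the ranking alone shows how flat regions drop to the bottom.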

Claim

Digital agents are increasingly employed to automate tasks in interactive digital environments such as web pages, software applications, and operating systems.

INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance Paper

Authors: Chenwei Lin, Hanjia Lyu, Xian Xu, Jiebo Luo
Year: 2024
Venue: IEEE International Conference on Computer Vision
DOI: 10.1109/ICCV51701.2025.00845
Citations: 6
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, large vision language models, mllms).
Code: Not found.
Extraction: method/data pending

Abstract

Large Vision-Language Models (LVLMs) and Multimodal Large Language Models (MLLMs) have demonstrated outstanding performance in various general multimodal applications and have shown increasing promise in specialized domains. However, their potential in the insurance domain, which is characterized by diverse application scenarios and rich multimodal data, remains largely underexplored. To date, there is no systematic review of multimodal tasks, nor a benchmark specifically designed to assess the capabilities of LVLMs in insurance. This gap hinders the development of LVLMs within the insurance industry. This study systematically reviews and categorizes multimodal tasks for 4 representative types of insurance: auto, property, health, and agricultural. We introduce INS-MMBench, the first hierarchical benchmark tailored for the insurance domain. INS-MMBench encompasses 22 fundamental tasks, 12 meta-tasks and 5 scenario tasks, enabling a comprehensive and progressive assessment from basic capabilities to real-world use cases. We benchmark 11 leading LVLMs, including closed-source models such as GPT-4o and open-source models like LLaVA. Our evaluation validates the effectiveness of INS-MMBench and offers detailed insights into the strengths and limitations of current LVLMs on a variety of insurance-related multimodal tasks. We hope that INS-MMBench will accelerate the integration of LVLMs into the insurance industry and foster interdisciplinary research. Our dataset and evaluation code are available at https://github.com/FDU-INS/INS-MMBench.

Claim

Large Vision-Language Models (LVLMs) and Multimodal Large Language Models (MLLMs) have demonstrated outstanding performance in various general multimodal applications and have shown increasing promise in specialized domains.

Teaching VLMs to Localize Specific Objects from In-Context Examples Paper

Authors: Sivan Doveh, Nimrod Shabtay, Wei Lin, Eli Schwartz, Hildegard Kuehne, Raja Giryes, Rogério Feris, Leonid Karlinsky, James R. Glass, Assaf Arbelle, S. Ullman, M. J. Mirza
Year: 2024
Venue: IEEE International Conference on Computer Vision
DOI: 10.1109/ICCV51701.2025.00893
Citations: 5
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, vlms).
Code: Not found.
Extraction: method/data pending

Abstract

Vision-Language Models (VLMs) have shown remarkable capabilities across diverse visual tasks, including image recognition, video understanding, and Visual Question Answering (VQA) when explicitly trained for these tasks. Despite these advances, we find that present-day VLMs (including the proprietary GPT-4o) lack a fundamental cognitive ability: learning to localize specific objects in a scene by taking into account the context. In this work, we focus on the task of few-shot personalized localization, where a model is given a small set of annotated images (in-context examples) - each with a category label and bounding box - and is tasked with localizing the same object type in a query image. Personalized localization can be particularly important in cases of ambiguity of several related objects that can respond to a text or an object that is hard to describe with words. To provoke personalized localization abilities in models, we present a data-centric solution that fine-tunes them using carefully curated data from video object tracking datasets. By leveraging sequences of frames tracking the same object across multiple shots, we simulate instruction-tuning dialogues that promote context awareness. To reinforce this, we introduce a novel regularization technique that replaces object labels with pseudo-names, ensuring the model relies on visual context rather than prior knowledge. Our method significantly enhances the few-shot localization performance of recent VLMs ranging from 7B to 72B in size, without sacrificing generalization, as demonstrated on several benchmarks tailored towards evaluating personalized localization abilities. This work is the first to explore and benchmark personalized few-shot localization for VLMs - exposing critical weaknesses in present-day VLMs, and laying a foundation for future research in context-driven vision-language applications.
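
The data construction described above, turning tracked frames into an in-context dialogue and swapping the real label for a pseudo-name, might look roughly like the sketch below. Everything concrete here is an assumption for illustration: the turn format, the field names, the nonsense-word generator, and the `build_dialogue` helper are hypothetical and are not taken from the paper.

```python
import random


def make_pseudo_name(rng, length=5):
    """Generate a nonsense token (alternating consonant/vowel) to stand in for the real label."""
    consonants, vowels = "bdfgklmnprstvz", "aeiou"
    return "".join(rng.choice(consonants if i % 2 == 0 else vowels) for i in range(length))


def build_dialogue(frames, label, rng, use_pseudo_name=True):
    """Turn tracked frames into in-context turns plus a final query turn.

    frames: list of (image_id, (x1, y1, x2, y2)) for one tracked object.
    The last frame is the query; its box becomes the target answer.
    """
    name = make_pseudo_name(rng) if use_pseudo_name else label
    *shots, (query_id, query_box) = frames
    turns = [{"image": img, "text": f"This is {name}. Box: {list(box)}"} for img, box in shots]
    turns.append({"image": query_id, "text": f"Where is {name}?"})
    return {"turns": turns, "answer": list(query_box), "name": name}


# Hypothetical track of one object over three frames.
track = [("f0.jpg", (10, 10, 50, 60)), ("f1.jpg", (14, 12, 54, 62)), ("f2.jpg", (20, 15, 60, 66))]
sample = build_dialogue(track, label="dog", rng=random.Random(0))
for turn in sample["turns"]:
    print(turn["image"], "|", turn["text"])
```

Because the word "dog" never appears in the dialogue, a model can only answer the final turn by matching the query image against the in-context examples, which is the intent of the regularization.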

Claim

Vision-Language Models (VLMs) have shown remarkable capabilities across diverse visual tasks, including image recognition, video understanding, and Visual Question Answering (VQA) when explicitly trained for these tasks.

Geminio: Language-Guided Gradient Inversion Attacks in Federated Learning Paper

Authors: Junjie Shan, Ziqi Zhao, Jialin Lu, Rui Zhang, S. Yiu, Ka-Ho Chow
Year: 2024
Venue: IEEE International Conference on Computer Vision
DOI: 10.1109/ICCV51701.2025.00261
Citations: 5
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, vlms).
Code: Not found.
Extraction: method/data pending

Abstract

Foundation models that bridge vision and language have made significant progress. While they have inspired many life-enriching applications, their potential for abuse in creating new threats remains largely unexplored. In this paper, we reveal that vision-language models (VLMs) can be weaponized to enhance gradient inversion attacks (GIAs) in federated learning (FL), where an FL server attempts to reconstruct private data samples from gradients shared by victim clients. Despite recent advances, existing GIAs struggle to reconstruct high-resolution images when the victim has a large local data batch. One promising direction is to focus reconstruction on valuable samples rather than the entire batch, but current methods lack the flexibility to target specific data of interest. To address this gap, we propose Geminio, the first approach to transform GIAs into semantically meaningful, targeted attacks. It enables a brand new privacy attack experience: attackers can describe, in natural language, the data they consider valuable, and Geminio will prioritize reconstruction to focus on those high-value samples. This is achieved by leveraging a pretrained VLM to guide the optimization of a malicious global model that, when shared with and optimized by a victim, retains only gradients of samples that match the attacker-specified query. Geminio can be launched at any FL round and has no impact on normal training (i.e., the FL server can steal clients' data while still producing a high-utility ML model as in benign scenarios). Extensive experiments demonstrate its effectiveness in pinpointing and reconstructing targeted samples, with high success rates across complex datasets and large batch sizes with resilience against defenses.

Claim

Foundation models that bridge vision and language have made significant progress.

AdvDreamer Unveils: Are Vision-Language Models Truly Ready for Real-World 3D Variations? Paper

Authors: Shouwei Ruan, Hanqing Liu, Yao Huang, Xiaoqi Wang, Cai Kang, Hang Su, Yinpeng Dong, Xingxing Wei
Year: 2024
Venue: IEEE International Conference on Computer Vision
DOI: 10.1109/ICCV51701.2025.00740
Citations: 5
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, vlms).
Code: Not found.
Extraction: method/data pending

Abstract

Vision Language Models (VLMs) have exhibited remarkable generalization capabilities, yet their robustness in dynamic real-world scenarios remains largely unexplored. To systematically evaluate VLMs' robustness to real-world 3D variations, we propose AdvDreamer, the first framework capable of generating physically reproducible Adversarial 3D Transformation (Adv-3DT) samples from single-view observations. In AdvDreamer, we integrate three key innovations: Firstly, to characterize real-world 3D variations with limited prior knowledge precisely, we design a zero-shot Monocular Pose Manipulation pipeline built upon generative 3D priors. Secondly, to ensure the visual quality of worst-case Adv-3DT samples, we propose a Naturalness Reward Model that provides continuous naturalness regularization during adversarial optimization, effectively preventing convergence to hallucinated or unnatural elements. Thirdly, to enable systematic evaluation across diverse VLM architectures and visual-language tasks, we introduce the Inverse Semantic Probability loss as the adversarial optimization objective, which solely operates in the fundamental visual-textual alignment space. Based on the captured Adv-3DT samples with high aggressiveness and transferability, we establish MM3DTBench, the first VQA benchmark dataset tailored to evaluate VLM robustness under challenging 3D variations. Extensive evaluations of representative VLMs with varying architectures reveal that real-world 3D variations can pose severe threats to model performance across various tasks.

Claim

Vision Language Models (VLMs) have exhibited remarkable generalization capabilities, yet their robustness in dynamic real-world scenarios remains largely unexplored.

Pruning All-Rounder: Rethinking and Improving Inference Efficiency for Large Vision Language Models Paper

Authors: Wei Suo, Ji Ma, Mengyang Sun, L. Wu, Peng Wang, Yanning Zhang
Year: 2024
Venue: IEEE International Conference on Computer Vision
DOI: 10.1109/ICCV51701.2025.01883
Citations: 4
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, large vision language models, lvlms).
Code: Not found.
Extraction: method/data pending

Abstract

Although Large Vision-Language Models (LVLMs) have achieved impressive results, their high computational costs pose a significant barrier to wide application. To enhance inference efficiency, most existing approaches can be categorized as parameter-dependent or token-dependent strategies to reduce computational demands. However, parameter-dependent methods require retraining LVLMs to recover performance while token-dependent strategies struggle to consistently select the most relevant tokens. In this paper, we systematically analyze the above challenges and provide a series of valuable insights for inference acceleration. Based on these findings, we propose a novel framework, the Pruning All-Rounder (PAR). Different from previous works, PAR develops a meta-router to adaptively organize pruning flows across both tokens and layers. With a self-supervised learning manner, our method achieves a superior balance between performance and efficiency. Notably, PAR is highly flexible, offering multiple pruning versions to address a range of acceleration scenarios. The code for this work is publicly available at https://github.com/ASGO-MM/Pruning-All-Rounder.

Claim

Although Large Vision-Language Models (LVLMs) have achieved impressive results, their high computational costs pose a significant barrier to wide application.

CapeLLM: Support-Free Category-Agnostic Pose Estimation with Multimodal Large Language Models Paper

Authors: Junho Kim, Hyungjin Chung, Byung-Hoon Kim
Year: 2024
Venue: IEEE International Conference on Computer Vision
DOI: 10.1109/ICCV51701.2025.02125
Citations: 3
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: multimodal large language model, mllm, multimodal large language models).
Code: Not found.
Extraction: method/data pending

Abstract

Category-agnostic pose estimation (CAPE) has traditionally relied on support images with annotated keypoints, a process that is often cumbersome and may fail to fully capture the necessary correspondences across diverse object categories. Recent efforts have explored the use of text queries, leveraging their enhanced stability and generalization capabilities. However, existing approaches often remain constrained by their reliance on support queries, their failure to fully utilize the rich priors embedded in pretrained large language models, and the limitations imposed by their parametric distribution assumptions. To address these challenges, we introduce CapeLLM, the first multimodal large language model (MLLM) designed for CAPE. Our method only employs query image and detailed text descriptions as an input to estimate category-agnostic keypoints. Our method encompasses effective training strategies and carefully designed instructions for applying the MLLM to CAPE. Moreover, we propose an inference mechanism that further enhances the reasoning process for unseen keypoints while flexibly modeling their underlying spatial distribution and uncertainty, allowing for adaptive refinement based on contextual cues. We conducted extensive experiments to apply the MLLM to CAPE effectively, focusing not only on the model architecture and prompt design but also on ensuring robustness across input variations. Our approach sets a new state-of-the-art on the MP-100 benchmark in the 1-shot and even 5-shot setting, marking a significant advancement in the field of category-agnostic pose estimation. Code is available here.

Claim

Category-agnostic pose estimation (CAPE) has traditionally relied on support images with annotated keypoints, a process that is often cumbersome and may fail to fully capture the necessary correspondences across diverse object categories.

Few-Shot Image Quality Assessment via Adaptation of Vision-Language Models Paper

Authors: Xudong Li, Zihao Huang, Yan Zhang, Yunhang Shen, Ke Li, Xiawu Zheng, Liujuan Cao, Rongrong Ji
Year: 2024
Venue: IEEE International Conference on Computer Vision
DOI: 10.1109/ICCV51701.2025.00972
Citations: 2
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models).
Code: Not found.
Extraction: method/data pending

Abstract

Image Quality Assessment (IQA) remains an unresolved challenge in computer vision due to complex distortions, diverse image content, and limited data availability. Existing Blind IQA (BIQA) methods largely rely on extensive human annotations, which are labor-intensive and costly due to the demanding nature of creating IQA datasets. To reduce this dependency, we propose the Gradient-Regulated MetaPrompt IQA Framework (GRMP-IQA), designed to efficiently adapt the visual-language pre-trained model, CLIP, to IQA tasks, achieving high accuracy even with limited data. GRMP-IQA consists of two core modules: (i) MetaPrompt Pre-training Module and (ii) Quality-Aware Gradient Regularization. The MetaPrompt Pre-training Module leverages a meta-learning paradigm to pre-train soft prompts with shared meta-knowledge across different distortions, enabling rapid adaptation to various IQA tasks. On the other hand, the Quality-Aware Gradient Regularization is designed to adjust the update gradients during fine-tuning, focusing the model's attention on quality-relevant features and preventing overfitting to semantic information. Extensive experiments on standard BIQA datasets demonstrate the superior performance to the state-of-the-art BIQA methods under limited data setting. Notably, utilizing just 20% of the training data, GRMP-IQA is competitive with most existing fully supervised BIQA approaches. Our code is available via https://github.com/LXDxmu/GRMP-IQA.

Claim

Image Quality Assessment (IQA) remains an unresolved challenge in computer vision due to complex distortions, diverse image content, and limited data availability.

TWIST & SCOUT: Grounding Multimodal LLM-Experts by Forget-Free Tuning Paper

Authors: A. Bhowmik, Mohammad Mahdi Derakhshani, Dennis Koelma, Martin R. Oswald, Yuki Asano, Cees G. M. Snoek
Year: 2024
Venue: IEEE International Conference on Computer Vision
DOI: 10.1109/ICCV51701.2025.00134
Citations: 2
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: mllm, mllms, multimodal large language models).
Code: Not found.
Extraction: method/data pending

Abstract

Spatial awareness is key to enable embodied multimodal AI systems. Yet, without vast amounts of spatial supervision, current Multimodal Large Language Models (MLLMs) struggle at this task. In this paper, we introduce TWIST & SCOUT, a framework that equips pre-trained MLLMs with visual grounding ability without forgetting their existing image and language understanding skills. To this end, we propose TWIST, a twin-expert stepwise tuning module that modifies the decoder of the language model using one frozen module pre-trained on image understanding tasks and another learnable one for visual grounding tasks. This allows the MLLM to retain previously learned knowledge and skills, while acquiring what is missing. To fine-tune the model effectively, we generate a high-quality synthetic dataset we call SCOUT, which mimics human reasoning in visual grounding. This dataset provides rich supervision signals, describing a step-by-step multimodal reasoning process, thereby simplifying the task of visual grounding. We evaluate our approach on several standard benchmark datasets, encompassing grounded image captioning, zero-shot localization, and visual grounding tasks. Our method consistently delivers strong performance across all tasks, while retaining the pre-trained image understanding capabilities.

Claim

Spatial awareness is key to enable embodied multimodal AI systems.

Trust but Verify: Programmatic VLM Evaluation in the Wild Paper

Authors: Viraj Prabhu, Senthil Purushwalkam, An Yan, Caiming Xiong, Ran Xu
Year: 2024
Venue: IEEE International Conference on Computer Vision
DOI: 10.1109/ICCV51701.2025.00312
Citations: 2
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, vlms).
Code: Not found.
Extraction: method/data pending

Abstract

Vision-Language Models (VLMs) frequently hallucinate responses to visual queries, undermining their reliability for critical applications. However, quantifying the effect of such hallucinations in free-form responses to open-ended queries requires visually verifying each claim within the response, which is highly challenging. We propose Programmatic VLM Evaluation (PROVE), a new benchmarking paradigm for evaluating VLM responses to open-ended queries. To construct PROVE, we provide a large language model with a high-fidelity scene-graph representation constructed from a detailed image caption, and prompt it to generate i) diverse and challenging question-answer (QA) pairs that test a range of image understanding capabilities, and ii) programs that can be executed over the scene graph object to verify each QA pair. We thus construct a benchmark of 10.6k challenging but grounded visual QA pairs. Next, we propose a scene graph-based evaluation framework to programmatically measure both the helpfulness and truthfulness of free-form VLM responses to questions from our benchmark that does not rely on subjective LLM judgments. We extensively benchmark a range of VLMs on PROVE, and uncover a concerning tradeoff where models that provide more helpful responses often hallucinate more, whereas truthful models tend to be less informative. PROVE serves as a foundation for developing next-generation VLMs that balance helpfulness with truthfulness. Project page: prove-explorer.netlify.app
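
The verification step described above, running a program over a scene graph to check each generated QA pair, can be sketched as follows. This is a hedged toy example: the abstract does not describe the scene-graph object or its API, so the dictionary layout, the `get_attribute` and `has_relation` helpers, and the sample QA pairs are all hypothetical.

```python
# Assumed toy scene graph: entities with attributes plus (subject, predicate, object) relations.
scene_graph = {
    "entities": {
        "dog": {"attributes": {"color": "brown"}},
        "ball": {"attributes": {"color": "red"}},
    },
    "relations": [("dog", "chasing", "ball")],
}


def get_attribute(graph, entity, attribute):
    """Return the attribute value, or None when the entity or attribute is absent."""
    return graph["entities"].get(entity, {}).get("attributes", {}).get(attribute)


def has_relation(graph, subject, predicate, obj):
    """Check whether the exact relation triple exists in the graph."""
    return (subject, predicate, obj) in graph["relations"]


# Each QA pair carries a small program that should reproduce its answer from the graph.
qa_pairs = [
    {"question": "What color is the ball?", "answer": "red",
     "program": lambda g: get_attribute(g, "ball", "color")},
    {"question": "Is the dog chasing the ball?", "answer": True,
     "program": lambda g: has_relation(g, "dog", "chasing", "ball")},
    {"question": "What color is the cat?", "answer": "black",
     "program": lambda g: get_attribute(g, "cat", "color")},  # no cat in the graph
]


def verify(graph, pairs):
    """Keep only QA pairs whose program output matches the stated answer."""
    return [p for p in pairs if p["program"](graph) == p["answer"]]


grounded = verify(scene_graph, qa_pairs)
print([p["question"] for p in grounded])
```

The third pair is dropped because its program finds no cat in the graph, which is the sense in which surviving pairs are grounded.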

Claim

Vision-Language Models (VLMs) frequently hallucinate responses to visual queries, undermining their reliability for critical applications.

Understanding Museum Exhibits using Vision-Language Reasoning Paper

Authors: Ada-Astrid Balauca, Sanjana Garai, Stefan Balauca, Rasesh Udayakumar Shetty, Naitik Agrawal, D. Shah, Yuqian Fu, Xi Wang, Kristina Toutanova, D. Paudel, Luc Van Gool
Year: 2024
Venue: IEEE International Conference on Computer Vision
DOI: 10.1109/ICCV51701.2025.00215
Citations: 2
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, large vision language models, vlms).
Code: Not found.
Extraction: method/data pending

Abstract

Museums serve as repositories of cultural heritage and historical artifacts from diverse epochs, civilizations, and regions, preserving well-documented collections that encapsulate vast knowledge, which, when systematically structured into large-scale datasets, can train specialized models. Visitors engage with exhibits through curiosity and questions, making expert domain-specific models essential for interactive query resolution and gaining historical insights. Understanding exhibits from images requires analyzing visual features and linking them to historical knowledge to derive meaningful correlations. We facilitate such reasoning by (a) collecting and curating a large-scale dataset of 65 M images and 200 M question-answer pairs for exhibits from all around the world; (b) training large vision-language models (VLMs) on the collected dataset; (c) benchmarking their ability on five visual question answering tasks, specifically designed to reflect real-world inquiries and challenges observed in museum settings. The complete dataset is labeled by museum experts, ensuring the quality and the practical significance of the labels. We train two VLMs from different categories: BLIP [41] with vision-language aligned embeddings, but lacking the expressive power of large language models, and the LLaVA [46] model, a powerful instruction-tuned LLM enriched with vision-language reasoning capabilities. Through extensive experiments, we find that while both model types effectively answer visually grounded questions, large vision-language models excel in queries requiring deeper historical context and reasoning. We further demonstrate the necessity of finetuning models on large-scale domain-specific datasets by showing that our fine-tuned models significantly outperform current SOTA VLMs in answering questions related to specific attributes, highlighting their limitations in handling complex, nuanced queries.
Our dataset, benchmarks, and source code are available at: insait-institute/Museum-65.

Claim

Museums serve as repositories of cultural heritage and historical artifacts from diverse epochs, civilizations, and regions, preserving well-documented collections that encapsulate vast knowledge, which, when systematically structured into large-scale datasets, can train specialized models.

TAB: Transformer Attention Bottlenecks Enable User Intervention and Debugging in Vision-Language Models Paper

Authors: Pooyan Rahmanzadehgervi, Hung Huy Nguyen, Rosanne Liu, Long Mai, Anh Totti Nguyen
Year: 2024
Venue: IEEE International Conference on Computer Vision
DOI: 10.1109/ICCV51701.2025.02094
Citations: 2
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, vision language model).
Code: Not found.
Extraction: method/data pending

Abstract

Multi-head self-attention (MHSA) is a key component of Transformers [70], a widely popular architecture in both language and vision. Multiple heads intuitively enable different parallel processes over the same input. Yet, they also obscure the attribution of each input patch to the output of a model [9], [17]. We propose a novel 1-head Transformer Attention Bottleneck (TAB) layer, inserted after the traditional MHSA architecture, to serve as an attention bottleneck for interpretability and intervention. Unlike standard self-attention [70], TAB constrains the total attention over all patches to \(\in [0, 1]\). That is, when the total attention is 0, no visual information is propagated further into the network, and the vision-language model (VLM) would default to a generic, image-independent response (Fig. 1j). To demonstrate the advantages of TAB, we train VLMs with TAB to perform image-difference captioning. Over three datasets, our models perform similarly to baseline VLMs in captioning but the bottleneck is superior in localizing changes and in identifying when no changes occur. TAB is the first architecture to enable users to debug by editing attention, which often produces expected outputs by VLMs.
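
The bottleneck behaviour described above (total attention anywhere in \([0, 1]\), and zero attention passing no visual information) can be sketched in plain Python. The abstract does not say how TAB enforces the constraint, so the mechanism here is an assumption: softmax weights scaled by a scalar `gate` in \([0, 1]\). The function names and toy values are hypothetical too.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of patch scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def bottleneck_attention(scores, gate):
    """Per-patch weights that sum to `gate` (assumed mechanism, not the paper's)."""
    if not 0.0 <= gate <= 1.0:
        raise ValueError("gate must lie in [0, 1]")
    return [gate * w for w in softmax(scores)]


def attend(weights, values):
    """Single-head readout: weighted sum of the patch value vectors."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy value vectors for three patches
scores = [2.0, 0.5, -1.0]                      # toy attention logits

weights = bottleneck_attention(scores, gate=0.8)   # total attention is 0.8
blocked = bottleneck_attention(scores, gate=0.0)   # total attention is 0

# User intervention: overwrite the weights so that only patch 1 is attended.
edited = [0.0, 1.0, 0.0]

blocked_out = attend(blocked, values)  # all zeros: nothing visual passes through
edited_out = attend(edited, values)    # exactly the value vector of patch 1
```

Because there is a single head, the edited weights map directly onto the readout, which is the property that makes attention editing usable as a debugging handle.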

Claim

Multi-head self-attention (MHSA) is a key component of Transformers [70], a widely popular architecture in both language and vision.

From Holistic to Localized: Local Enhanced Adapters for Efficient Visual Instruction Fine-Tuning Paper

Authors: Pengkun Jiao, Bin Zhu, Jingjing Chen, C. Ngo, Yu-Gang Jiang
Year: 2024
Venue: IEEE International Conference on Computer Vision
DOI: 10.1109/ICCV51701.2025.00262
Citations: 1
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: mllm, mllms, multimodal large language models).
Code: Not found.
Extraction: method/data pending

Abstract

Efficient Visual Instruction Fine-Tuning (EVIT) seeks to adapt Multimodal Large Language Models (MLLMs) to downstream tasks with minimal computational overhead. However, as task diversity and complexity increase, EVIT faces significant challenges in resolving data conflicts. To address this limitation, we propose the Dual Low-Rank Adaptation (Dual-LoRA), a holistic-to-local framework that enhances the adapter's capacity to address data conflict through dual structural optimization. Specifically, we utilize two subspaces: a skill space for stable, holistic knowledge retention, and a rank-rectified task space that locally activates the holistic knowledge. Additionally, we introduce Visual Cue Enhancement (VCE), a multi-level local feature aggregation module designed to enrich the vision-language projection with local details. Our approach is both memory- and time-efficient, requiring only \(1.16 \times\) the inference time of the standard LoRA method (with injection into the query and value projection layers), and just 73% of the inference time of a 4-expert LoRA-MoE. Extensive experiments on various downstream tasks and general MLLM benchmarks validate the effectiveness of our proposed methods. Our project page is available at https://github.com/pengkun-jiao/Dual-LoRA.
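
The two inference-time ratios quoted above can be combined to relate the two baselines to each other. This is a back-of-the-envelope sketch that assumes both ratios were measured under the same setup; the implied figure is derived here and is not reported in the abstract.

```python
# Ratios as stated in the abstract.
dual_vs_lora = 1.16   # Dual-LoRA inference time / standard LoRA inference time
dual_vs_moe = 0.73    # Dual-LoRA inference time / 4-expert LoRA-MoE inference time

# Implied ratio (derived, assuming a shared measurement setup):
# 4-expert LoRA-MoE inference time / standard LoRA inference time.
moe_vs_lora = dual_vs_lora / dual_vs_moe

print(f"4-expert LoRA-MoE is about {moe_vs_lora:.2f}x standard LoRA")
```

Under that assumption, the 4-expert LoRA-MoE would cost roughly 1.59 times the inference time of standard LoRA, so Dual-LoRA sits between the two baselines.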

Claim

Efficient Visual Instruction Fine-Tuning (EVIT) seeks to adapt Multimodal Large Language Models (MLLMs) to downstream tasks with minimal computational overhead.