ML + Vision Top-6 Agent Survey - ICCV 2025 - Page 2 of 6

  • Venue: IEEE International Conference on Computer Vision
  • Year: 2025
  • Page: 2 / 6
  • Papers: 31-60 / 153
Does Your Vision-Language Model Get Lost in the Long Video Sampling Dilemma? Paper
  • Authors: Tianyuan Qu, Longxiang Tang, Bohao Peng, Senqiao Yang, Bei Yu, Jiaya Jia
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.01942
  • Citations: 18
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, large vision language models, vision language model).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

The rise of Large Vision-Language Models (LVLMs) has significantly advanced video understanding. However, efficiently processing long videos remains a challenge due to the “Sampling Dilemma”: low-density sampling risks missing critical information, while high-density sampling introduces redundancy. To address this issue, we introduce LSDBench, the first benchmark designed to evaluate LVLMs on long-video tasks by constructing high Necessary Sampling Density (NSD) questions-where NSD represents the minimum sampling density required to accurately answer a given question. LSDBench focuses on dense, shortduration actions to rigorously assess the sampling strate-gies employed by LVLMs. To tackle the challenges posed by high-NSD questions, we propose a novel Reasoning-Driven Hierarchical Sampling (RHS) framework, which combines global localization of question-relevant cues with local dense sampling for precise inference. Additionally, we develop a lightweight Semantic-Guided Frame Selector to prioritize informative frames, enabling RHS to achieve comparable or superior performance with significantly fewer sampled frames. Together, our LSDBench and RHS framework address the unique challenges of high-NSD long-video tasks, setting a new standard for evaluating and improving LVLMs in this domain. Our benchmark and evaluation codes has been released at: https://github.com/dvlabresearch/LSDBench

Claim

The rise of Large Vision-Language Models (LVLMs) has significantly advanced video understanding.

SuperEdit: Rectifying and Facilitating Supervision for Instruction-Based Image Editing Paper
  • Authors: Ming Li, Xin Gu, Fan Chen, Xiaoying Xing, Longyin Wen, Chen Chen, Sijie Zhu
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.01785
  • Citations: 17
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, vlms).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Due to the challenges of manually collecting accurate editing data, existing datasets are typically constructed using various automated methods, leading to noisy supervision signals caused by the mismatch between editing instructions and original-edited image pairs. Recent efforts attempt to improve editing models through generating higher-quality edited images, pre-training on recognition tasks, or introducing vision-language models (VLMs) but fail to resolve this fundamental issue. In this paper, we offer a novel solution by constructing more effective editing instructions for given image pairs. This includes rectifying the editing instructions to better align with the original-edited image pairs and using contrastive editing instructions to further enhance their effectiveness. Specifically, we find that editing models exhibit specific generation attributes at different inference steps, independent of the text. Based on these prior attributes, we define a unified guide for VLMs to rectify editing instructions. However, there are some challenging editing scenarios that cannot be resolved solely with rectified instructions. To this end, we further construct contrastive supervision signals with positive and negative instructions and introduce them into the model training using triplet loss, thereby further facilitating supervision effectiveness. Our method does not require the VLM modules or pre-training tasks used in previous work, offering a more direct and efficient way to provide better supervision signals, and providing a novel, simple, and effective solution for instruction-based image editing. Results on multiple benchmarks demonstrate that our method significantly outperforms existing approaches. Compared with previous SOTA SmartEdit, we achieve 9.19% improvements on the Real-Edit benchmark with \(30 \times\) less training data and \(13 \times\) smaller model size. All data and models are open-sourced on Github for future research.

Claim

Due to the challenges of manually collecting accurate editing data, existing datasets are typically constructed using various automated methods, leading to noisy supervision signals caused by the mismatch between editing instructions and original-edited image pairs.

Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs Paper
  • Authors: Yikang Zhou, Tao Zhang, Shilin Xu, Shihao Chen, Qianyu Zhou, Yunhai Tong, Shunping Ji, Jiangning Zhang, Xiangtai Li, Lu Qi
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.01641
  • Citations: 16
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: mllm, mllms, multimodal large language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Recent advancements in multimodal large language models (MLLM) have shown a strong ability in visual perception, reasoning abilities, and vision-language understanding. However, the visual matching ability of MLLMs is rarely studied, despite finding the visual correspondence of objects is essential in computer vision. Our research reveals that the matching capabilities in recent MLLMs still exhibit systematic shortcomings, even with current strong MLLMs models, GPT-4o. In particular, we construct a Multimodal Visual Matching (MMVM) benchmark to fairly benchmark over 30 different MLLMs. The MMVM benchmark is built from 15 open-source datasets and Internet videos with manual annotation. In addition, we have designed an automatic annotation pipeline to generate the MMVM SFT dataset, including 220 K visual matching data with reasoning annotation. To our knowledge, this is the first visual corresponding dataset and benchmark for the MLLM community. Finally, we present CoLVA, a novel contrastive MLLM with two novel technical designs: fine-grained vision expert with object-level contrastive learning and instruction augmentation strategy. The former learns instance discriminative tokens, while the latter further improves instruction following ability. CoLVA-InternVL2-4B achieves an overall accuracy (OA) of 49.80% on the MMVM benchmark, surpassing GPT-4o and the best open-source MLLM, Qwen2VL72B, by 7.15 % and \(11.72 % O A\), respectively. These results demonstrate the effectiveness of our MMVM SFT dataset and our novel technical designs. Code, benchmark, dataset, and models will be released.

Claim

Recent advancements in multimodal large language models (MLLM) have shown a strong ability in visual perception, reasoning abilities, and vision-language understanding.

PlanGen: Towards Unified Layout Planning and Image Generation in Auto-Regressive Vision Language Models Paper
  • Authors: Runze He, Bo Cheng, Yuhang Ma, Qingxian Jia, Shanyuan Liu, Ao Ma, Xiaoyu Wu, Liebucha Wu, D. Leng, Yuhui Yin
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.01686
  • Citations: 16
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

In this paper, we propose a unified layout planning and image generation model, PlanGen, which can pre-plan spatial layout conditions before generating images as shown in Figure 1. Unlike previous diffusion-based models that treat layout planning and layout-to-image as two separate models, PlanGen jointly models the two tasks into one autoregressive transformer using only next-token prediction. PlanGen integrates layout conditions into the model as context without requiring specialized encoding of local captions and bounding box coordinates, which provides significant advantages over the previous embed-and-pool operations on layout conditions, particularly when dealing with complex layouts. Unified prompting allows PlanGen to perform multitasking training related to layout, including layout planning, layout-to-image generation, image layout understanding, etc. In addition, PlanGen can be seamlessly expanded to layout-guided image manipulation thanks to the well-designed modeling, with teacher-forcing content manipulation policy and negative layout guidance. Extensive experiments verify the effectiveness of our PlanGen in multiple layout-related tasks, showing its great potential. The code is available at: https://github.com/360CVGroup/PlanGen.

Claim

In this paper, we propose a unified layout planning and image generation model, PlanGen, which can pre-plan spatial layout conditions before generating images as shown in Figure 1.

Tikzero: Zero-Shot Text-Guided Graphics Program Synthesis Paper
  • Authors: Jonas Belouadi, Eddy Ilg, Margret Keuper, Hideki Tanaka, Masao Utiyama, Raj Dabre, Steffen Eger, Simone Paolo Ponzetto
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.01653
  • Citations: 16
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: program synthesis (matched: program synthesis).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Automatically synthesizing figures from text captions is a compelling capability. However, achieving high geometric precision and editability requires representing figures as graphics programs in languages like TikZ, and aligned training data (i.e., graphics programs with captions) remains scarce. Meanwhile, large amounts of unaligned graphics programs and captioned raster images are more readily available. We reconcile these disparate data sources by presenting TikZero, which decouples graphics program generation from text understanding by using image representations as an intermediary bridge. It enables independent training on graphics programs and captioned images and allows for zero-shot text-guided graphics program synthesis during inference. We show that our method substantially outperforms baselines that can only operate with caption-aligned graphics programs. Furthermore, when leveraging caption-aligned graphics programs as a complementary training signal, \(T_{\text {IK }} Z_{\text {ERO matches }}\) or exceeds the performance of much larger models, including commercial systems like GPT-4o. Our code, datasets, and select models are publicly available.11https://github.com/potamides/DeTikZify

Claim

Automatically synthesizing figures from text captions is a compelling capability.

Vision-Language Models Can't See the Obvious Paper
  • Authors: Yasser Dahou, Ngoc Dung Huynh, Phúc H. Lê Khắc, W. Para, Ankit Singh, Sanath Narayan
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.02239
  • Citations: 15
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, lvlm, large vision language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

We present Saliency Benchmark (SalBench), a novel benchmark designed to assess the capability of Large Vision-Language Models (LVLM) in detecting visually salient features that are readily apparent to humans, such as a large circle amidst a grid of smaller ones. This benchmark focuses on low-level features including color, intensity, and orientation, which are fundamental to human visual processing. Our SalBench consists of images that highlight rare, unusual, or unexpected elements within scenes, and naturally draw human attention. It comprises three novel tasks for evaluating the perceptual capabilities of LVLM: Odd-One-Out Detection, Referring Odd-One-Out, and Visual Referring Odd-One-Out. We perform a comprehensive evaluation of state-of-the-art LVLM using SalBench and our findings reveal a surprising limitation: LVLM struggle to identify seemingly obvious visual anomalies, with even the advanced GPT-4o achieving only 47.6% accuracy on such a simple task. SalBench will be an important step in measuring the capabilities of LVLM that align with the subtle definition of human attention.

Claim

We present Saliency Benchmark (SalBench), a novel benchmark designed to assess the capability of Large Vision-Language Models (LVLM) in detecting visually salient features that are readily apparent to humans, such as a large circle amidst a grid of smaller ones.

ReAL-AD: Towards Human-Like Reasoning in End-to-End Autonomous Driving Paper
  • Authors: Yuhang Lu, Jiadong Tu, Yuexin Ma, Xinge Zhu
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.02579
  • Citations: 15
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, vlms).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

End-to-end autonomous driving has emerged as a promising approach to unify perception, prediction, and planning within a single framework, reducing information loss and improving adaptability. However, existing methods often rely on fixed and sparse trajectory supervision, limiting their ability to capture the hierarchical reasoning process that human drivers naturally employ. To bridge this gap, we propose ReAL-AD, a Reasoning-Augmented Learning framework that structures decision-making in autonomous driving based on the three-tier human cognitive model: Driving Strategy, Driving Decision, and Driving Operation, where Vision-Language Models (VLMs) are incorporated to enhance situational awareness and structured reasoning across these levels. Specifically, we introduce: (1) the Strategic Reasoning Injector, which formulates highlevel driving strategies by interpreting complex traffic contexts from VLM-generated insights; (2) the Tactical Reasoning Integrator, which refines strategic intent into interpretable tactical choices such as lane changes, overtaking, and speed adjustments; and (3) the Hierarchical Trajectory Decoder, which progressively translates tactical decisions into precise control actions for smooth and humanlike trajectory execution. Extensive evaluations show that integrating our framework improves planning accuracy and safety by over 30%, making end-to-end autonomous driving more interpretable and aligned with human-like hierarchical reasoning. The project page can be found at: 4dvlab.github.io/project_page/realad

Claim

End-to-end autonomous driving has emerged as a promising approach to unify perception, prediction, and planning within a single framework, reducing information loss and improving adaptability.

DASH: Detection and Assessment of Systematic Hallucinations of VLMs Paper
  • Authors: Maximilian Augustin, Yannic Neuhaus, Matthias Hein
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.02112
  • Citations: 14
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, vlms).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Vision-language models (VLMs) are prone to object hallucinations, where they erroneously indicate the presence of certain objects in an image. Existing benchmarks quantify hallucinations using relatively small, labeled datasets. However, this approach is i) insufficient to assess hallucinations that arise in open-world settings, where VLMs are widely used, and ii) inadequate for detecting systematic errors in VLMs. We propose DASH (Detection and Assessment of Systematic Hallucinations), an automatic, largescale pipeline designed to identify systematic hallucinations of VLMs on real-world images in an open-world setting. A key component is DASH-OPT for image-based retrieval, where we optimize over the “natural image manifold” to generate images that mislead the VLM. The output of DASH consists of clusters of real and semantically similar images for which the VLM hallucinates an object. We apply DASH to PaliGemma and two LLaVA-NeXT models across 380 object classes and, in total, find more than \(19 k\) clusters with \(950 k\) images. We study the transfer of the identified systematic hallucinations to other VLMs and show that fine-tuning PaliGemma with the model-specific images obtained with DASH mitigates object hallucinations. Code and data are available at https://YanNeu.github.io/DASH.

Claim

Vision-language models (VLMs) are prone to object hallucinations, where they erroneously indicate the presence of certain objects in an image.

MMReason: An Open-Ended Multi-Modal Multi-Step Reasoning Benchmark for MLLMs Toward AGI Paper
  • Authors: Huanjin Yao, Jiaxing Huang, Yawen Qiu, Michael Chen, Wenzheng Liu, Wei Zhang, Wenjie Zeng, Xikun Zhang, Jingyi Zhang, Yuxin Song, Wenhao Wu, Dacheng Tao
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.00033
  • Citations: 14
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: mllm, mllms, multimodal large language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Reasoning plays a crucial role in advancing Multimodal Large Language Models (MLLMs) toward Artificial General Intelligence. However, existing MLLM benchmarks often fall short in precisely and comprehensively evaluating long-chain reasoning abilities from three key aspects: (1) lack of difficulty and diversity, (2) susceptibility to guessability and memorization, (3) inadequate assessment of intermediate reasoning steps. To fill this gap, we introduce MMReason, a new benchmark designed to precisely and comprehensively evaluate MLLM long-chain reasoning capability with diverse, open-ended, challenging questions. First, we curate challenging questions requiring multi-step reasoning from various fields (i.e., 6 disciplines) and multiple difficulty levels (i.e., from pre-university to university, and from foundational to competition tiers). Second, these questions are reformulated into an open-ended format and filtered using a multi-model voting technique to eliminate shortcut cases related to guessing and memorization, ensuring robust reasoning evaluations. Third, we annotate the questions with detailed step-by-step solutions, and design a reference-based ternary scoring mechanism to reliably assess intermediate reasoning steps. With MMReason, we benchmark popular leading MLLMs and provide an in-depth analysis of their reasoning capabilities. We hope MMReason will serve as a valuable resource for advancing MLLM reasoning research.

Claim

Reasoning plays a crucial role in advancing Multimodal Large Language Models (MLLMs) toward Artificial General Intelligence.

Dynamic Multimodal Prototype Learning in Vision-Language Models Paper
  • Authors: Xingyu Zhu, Shuo Wang, Beier Zhu, Miaoge Li, Yunfan Li, Junfeng Fang, Zhicai Wang, Dongsheng Wang, H. Zhang
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.00241
  • Citations: 14
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, vlms).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

With the increasing attention to pre-trained visionlanguage models (VLMs), e.g., CLIP, substantial efforts have been devoted to many downstream tasks, especially in testtime adaptation (TTA). However, previous works focus on learning prototypes only in the textual modality while overlooking the ambiguous semantics in class names. These ambiguities lead to textual prototypes that are insufficient to capture visual concepts, resulting in limited performance. To address this issue, we introduce ProtoMM, a training-free framework that constructs multimodal prototypes to adapt VLMs during the test time. By viewing the prototype as a discrete distribution over the textual descriptions and visual particles, ProtoMM has the ability to combine the multimodal features for comprehensive prototype learning. More importantly, the visual particles are dynamically updated as the testing stream flows. This allows our multimodal prototypes to continually learn from the data, enhancing their generalizability in unseen scenarios. In addition, we quantify the importance of the prototypes and test images by formulating their semantic distance as an optimal transport problem. Extensive experiments on 15 zero-shot benchmarks demonstrate the effectiveness of our method, achieving a 1.03% average accuracy improvement over state-of-the-art methods on ImageNet and its variant datasets.

Claim

With the increasing attention to pre-trained visionlanguage models (VLMs), e.g., CLIP, substantial efforts have been devoted to many downstream tasks, especially in testtime adaptation (TTA).

SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs Paper
  • Authors: Jiahui Wang, Zuyan Liu, Yongming Rao, Jiwen Lu
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.02152
  • Citations: 13
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: mllms, multimodal large language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Multimodal Large Language Models (MLLMs) are commonly derived by extending pre-trained Large Language Models (LLMs) with visual capabilities. In this work, we investigate how MLLMs process visual inputs by analyzing their attention mechanisms. We reveal a surprising sparsity phenomenon: only a small subset (approximately less than 5%) of attention heads in LLMs actively contribute to visual understanding, termed visual heads. To identify these heads efficiently, we design a training-free framework that quantifies head-level visual relevance through targeted response analysis. Building on this discovery, we introduce SparseMM, a KV-Cache optimization strategy that allocates asymmetric computation budgets to heads in LLMs based on their visual scores, leveraging the sparity of visual heads for accelerating the inference of MLLMs. Compared with prior KV-Cache acceleration methods that ignore the particularity of visual, SparseMM prioritizes stress and retaining visual semantics during decoding. Extensive evaluations across mainstream multimodal benchmarks demonstrate that SparseMM achieves superior accuracy-efficiency trade-offs. Notably, SparseMM delivers 1.38× real-time acceleration and 52% memory reduction during generation while maintaining performance parity on efficiency test. Our project is open sourced at https://github.com/CR400AFA/SparseMM.

Claim

Multimodal Large Language Models (MLLMs) are commonly derived by extending pre-trained Large Language Models (LLMs) with visual capabilities.

Hierarchical Cross-Modal Prompt Learning for Vision-Language Models Paper
  • Authors: Hao Zheng, Shunzhi Yang, Zhuoxin He, Jinfeng Yang, Zhenhua Huang
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.00184
  • Citations: 13
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, vlms).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Pre-trained Vision-Language Models (VLMs) such as CLIP have shown excellent generalization abilities. However, adapting these large-scale models to downstream tasks while preserving their generalization capabilities remains challenging. Although prompt learning methods have shown promise, they suffer from two fundamental bottle-necks that limit generalization: (a) modality isolation, and (b) hierarchical semantic decay. To address these limitations, we propose HiCroPL, a Hierarchical Crossmodal Prompt Learning framework that establishes bidirectional knowledge flow between text and vision modalities, enabling them to refine their semantics mutually. HiCroPL routes knowledge flows by leveraging the complementary strengths of text and vision. In early layers, text prompts inject relatively clear semantics into visual prompts through a hierarchical knowledge mapper, enhancing the representation of low-level visual semantics. In later layers, visual prompts encoding specific task-relevant objects flow back to refine text prompts, enabling deeper alignment. Crucially, our hierarchical knowledge mapper allows representations at multi-scales to be fused, ensuring that deeper representations retain transferable shallow semantics thereby enhancing generalization. We further introduce a lightweight layer-specific knowledge proxy to enable efficient cross-modal interactions. Extensive evaluations across four tasks demonstrate HiCroPL's superior performance, achieving state-of-the-art results on 11 benchmarks with significant improvements. Code is available at: https://github.com/zzeoZheng/HiCroPL.

Claim

Pre-trained Vision-Language Models (VLMs) such as CLIP have shown excellent generalization abilities.

ChatReID: Open-Ended Interactive Person Retrieval via Hierarchical Progressive Tuning for Vision Language Models Paper
  • Authors: Ke Niu, Haiyang Yu, Mengyang Zhao, Teng Fu, Siyang Yi, Wei Lu, Bin Li, Xuelin Qian, Xiangyang Xue
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.02247
  • Citations: 12
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, vlms).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Person re-identification (Re-ID) is a fundamental task in computer vision that aims to match individuals across non-overlapping camera views. Although recent visionlanguage models (VLMs) have shown strong capabilities in logical reasoning and general-purpose learning, their performance in Re-ID remains suboptimal. Existing methods either struggle to perform accurate matching based on identity-relevant features or assist image-dominated branches as auxiliary semantics. In this paper, we present ChatReID, a novel framework that pioneers a text-sidedominated retrieval paradigm, enabling flexible and interactive Re-ID across diverse scenarios. To effectively integrate the reasoning power of language models into ReID pipelines, we first construct a large-scale instruction dataset containing over 8 million prompts to guide model adaptation. We then propose a hierarchical progressive tuning (HPT) strategy, consisting of three stages: person attribute understanding, fine-grained image retrieval, and multi-modal reasoning. Extensive experiments on ten widely-used benchmarks demonstrate that ChatReID consistently outperforms existing methods, achieving state-of-the-art results across all Re-ID tasks. Moreover, ChatReID exhibits strong reasoning ability by accurately extracting and integrating fine-grained identity cues in a coherent, multi-modal manner.

Claim

Person re-identification (Re-ID) is a fundamental task in computer vision that aims to match individuals across non-overlapping camera views.

TruthPrInt: Mitigating Large Vision-Language Models Object Hallucination via Latent Truthful-Guided Pre-Intervention Paper
  • Authors: Jinhao Duan, Fei Kong, Hao Cheng, James Diffenderfer, B. Kailkhura, Lichao Sun, Xiaofeng Zhu, Xiaoshuang Shi, Kaidi Xu
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.00692
  • Citations: 12
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, lvlm, large vision language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Object Hallucination \((\text{OH})\) has been acknowledged as one of the major trustworthy challenges in Large VisionLanguage Models (LVLMs). Recent advancements in Large Language Models (LLMs) indicate that internal states, such as hidden states, encode the “overall truthfulness” of generated responses. However, it remains under-explored how internal states in LVLMs function and whether they could serve as “per-token” hallucination indicators, which is essential for mitigating \(O H\). In this paper, we first conduct an in-depth exploration of LVLM internal states with OH issues and discover that LVLM internal states are high-specificity per-token indicators of hallucination behaviors. Moreover, different LVLMs encode universal patterns of hallucinations in common latent subspaces, indicating that there exist “generic truthful directions” shared by various LVLMs. Based on these discoveries, we propose Truthful-Guided Pre-Intervention (TruthPrInt) that first learns the truthful direction of LVLM decoding and then applies truthful-guided inferencetime intervention during LVLM decoding. We further propose ComnHallu to enhance both cross-LVLM and crossdata hallucination detection transferability by constructing and aligning hallucination latent subspaces. We evaluate TruthPrInt in extensive experimental settings, including in-domain and out-of-domain scenarios, over popular LVLMs and OH benchmarks. Experimental results indicate that TruthPrInt significantly outperforms state-of-the-art methods. Codes will be available at https://github.com/jinhaoduan/TruthPrInt.

Claim

Object Hallucination \((\text{OH})\) has been acknowledged as one of the major trustworthy challenges in Large VisionLanguage Models (LVLMs).

RoBridge: A Hierarchical Architecture Bridging Cognition and Execution for General Robotic Manipulation Paper
  • Authors: Kaidong Zhang, Rongtao Xu, Pengzhen Ren, Junfan Lin, Hefeng Wu, Liang Lin, Xiaodan Liang
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.48550/arXiv.2505.01709
  • Citations: 12
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models, embodied agents (matched: vlm, embodied agent).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Operating robots in open-ended scenarios with diverse tasks is a crucial research and application direction in robotics. While recent progress in natural language processing and large multimodal models has enhanced robots' ability to understand complex instructions, robot manipulation still faces the procedural skill dilemma and the declarative skill dilemma in open environments. Existing methods often compromise cognitive and executive capabilities. To address these challenges, in this paper, we propose RoBridge, a hierarchical intelligent architecture for general robotic manipulation. It consists of a high-level cognitive planner (HCP) based on a large-scale pre-trained visionlanguage model (VLM), an invariant operable representation (IOR) serving as a symbolic bridge, and a guided embodied agent (GEA). RoBridge maintains the declarative skill of VLM and unleashes the procedural skill of reinforcement learning, effectively bridging the gap between cognition and execution. RoBridge demonstrates significant performance improvements over existing baselines, achieving a 75% success rate on new tasks and an 83% average success rate in sim-to-real generalization using only five real-world data samples per task. This work represents a significant step towards integrating cognitive reasoning with physical execution in robotic systems, offering a new paradigm for general robotic manipulation.

Claim

Operating robots in open-ended scenarios with diverse tasks is a crucial research and application direction in robotics.

HOLa: Zero-Shot HOI Detection with Low-Rank Decomposed VLM Feature Adaptation Paper
  • Authors: Qinqian Lei, Bo Wang, Robby T. Tan
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.00178
  • Citations: 12
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, vlms).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Zero-shot human-object interaction (HOI) detection remains a challenging task, particularly in generalizing to unseen actions. Existing methods address this challenge by tapping Vision-Language Models (VLMs) to access knowledge beyond the training data. However, they either struggle to distinguish actions involving the same object or demonstrate limited generalization to unseen classes. In this paper, we introduce HOLa (Zero-Shot HOI Detection with Low-Rank Decomposed VLM Feature Adaptation), a novel approach that both enhances generalization to unseen classes and improves action distinction. In training, HOLa decomposes VLM text features for given HOI classes via low-rank factorization, producing class-shared basis features and adaptable weights. These features and weights form a compact HOI representation that preserves shared information across classes, enhancing generalization to unseen classes. Subsequently, we refine action distinction by adapting weights for each HOI class and introducing human-object tokens to enrich visual interaction representations. To further distinguish unseen actions, we guide the weight adaptation with LLM-derived action regularization. Experimental results show that our method sets a new state-of-the-art across zero-shot HOI settings on HICO-DET, achieving an unseen-class mAP of 27.91 in the unseen-verb setting. Our code is available at https://github.com/ChelsieLei/HOLa.

Claim

Zero-shot human-object interaction (HOI) detection remains a challenging task, particularly in generalizing to unseen actions.

DocThinker: Explainable Multimodal Large Language Models with Rule-Based Reinforcement Learning for Document Understanding Paper
  • Authors: Wenwen Yu, Zhibo Yang, Yuliang Liu, Xiang Bai
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.00086
  • Citations: 12
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: mllms, multimodal large language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in document understanding. However, their reasoning processes remain largely black-box, making it difficult to ensure reliability and trustworthiness, especially in high-stakes domains such as legal, financial, and medical document analysis. Existing methods use fixed Chain-of-Thought (CoT) reasoning with supervised fine-tuning (SFT) but suffer from catastrophic forgetting, poor adaptability, and limited generalization across domain tasks. In this paper, we propose DocThinker, a rule-based Reinforcement Learning (RL) framework for dynamic inference-time reasoning. Instead of relying on static CoT templates, DocThinker autonomously refines reasoning strategies via policy learning, generating explainable intermediate results, including structured reasoning processes, rephrased questions, regions of interest (RoI) supporting the answer, and the final answer. By integrating multi-objective rule-based rewards and KL-constrained optimization, our method mitigates catastrophic forgetting and enhances both adaptability and transparency. Extensive experiments on multiple benchmarks demonstrate that DocThinker significantly improves generalization while producing more explainable and human-understandable reasoning steps. Our findings highlight RL as a powerful alternative for enhancing explainability and adaptability in MLLMbased document understanding. Code will be available at https://github.com/wenwenyu/DocThinker.

Claim

Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in document understanding.

SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference Paper
  • Authors: Samir Khaki, Junxian Guo, Jiaming Tang, Shang Yang, Yukang Chen, Konstantinos N. Plataniotis, Yao Lu, Song Han, Zhijian Liu
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.02208
  • Citations: 12
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Vision language models have received increasing attention for their ability to integrate visual and textual understanding, with some capable of processing native-resolution images and long videos. While the capacity to process large visual data unlocks numerous downstream applications, it often introduces significant latency challenges, as the visual tokens dominate the resource consumption. In this work, we introduce SparseVILA, a novel method of query-aware token retrieval to dynamically accelerate the underlying LLM, by pruning tokens in the context stage while attending to a sparse subset of visual tokens during the generation phase. By decoupling the context and generation compression, we can migrate the majority of sparsity into the generation stage, enabling query-aware support for multi-turn conversation while achieving a \(\mathbf{1. 4} \times\) speedup on image benchmarks. This approach leads to \(+5.9 %\) average accuracy improvements on image-centric benchmarks over previous works. Finally, SparseVILA enables efficient long-context/long-generation tasks by achieving a \(\mathbf{3. 6}\) × and \(\mathbf{1. 7} \times\) speedup in prefill and decoding, respectively.

Claim

Vision language models have received increasing attention for their ability to integrate visual and textual understanding, with some capable of processing native-resolution images and long videos.

Hallucinatory Image Tokens: A Training-Free EAZY Approach to Detecting and Mitigating Object Hallucinations in LVLMs Paper
  • Authors: Liwei Che, Tony Liu, Jinghui Jia, Weiyi Qin, Ruixiang Tang, V. Pavlovic
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.02009
  • Citations: 11
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, lvlm, large vision language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Despite their remarkable potential, Large Vision-Language Models (LVLMs) still face challenges with object hallucination, a problem where their generated outputs mistakenly incorporate objects that do not actually exist. Although most works focus on addressing this issue within the language-model backbone, our work shifts the focus to the image input source, investigating how specific image tokens contribute to hallucinations. Our analysis reveals a striking finding: a small subset of image tokens with high attention scores are the primary drivers of object hallucination. By removing these hallucinatory image tokens (only 1.5% of all image tokens), the issue can be effectively mitigated. This finding holds consistently across different models and datasets. Building on this insight, we introduce EAZY, a novel, training-free method that automatically identifies and Eliminate hAllucinations by Zeroing out hallucinatorY image tokens. We utilize EAZY for unsupervised object hallucination detection, achieving 15 % improvement compared to previous methods. Additionally, EAZY demonstrates remarkable effectiveness in mitigating hallucinations while preserving model utility and seamlessly adapting to various LVLM architectures.

Claim

Despite their remarkable potential, Large Vision-Language Models (LVLMs) still face challenges with object hallucination, a problem where their generated outputs mistakenly incorporate objects that do not actually exist.

LangBridge: Interpreting Image as a Combination of Language Embeddings Paper
  • Authors: Jiaqi Liao, Yuwei Niu, Fanqing Meng, Hao Li, Changyao Tian, Yinuo Du, Yuwen Xiong, Dianqi Li, Xizhou Zhu, Li Yuan, Jifeng Dai, Yu Cheng
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.02205
  • Citations: 11
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, large vision language models, lvlms).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Recent years have witnessed remarkable advances in Large Vision-Language Models (LVLMs), which have achieved human-level performance across various complex vision-language tasks. Following LLaVA's paradigm, mainstream LVLMs typically employ a shallow MLP for visuallanguage alignment through a two-stage training process: pretraining for cross-modal alignment followed by instruction tuning. While this approach has proven effective, the underlying mechanisms of how MLPs bridge the modality gap remain poorly understood. Although some research has explored how LLMs process transformed visual tokens, few studies have investigated the fundamental alignment mechanism. Furthermore, the MLP adapter requires retraining whenever switching LLM backbones. To address these limitations, we first investigate the working principles of MLP adapters and discover that they learn to project visual embeddings into subspaces spanned by corresponding text embeddings progressively. Based on this insight, we propose LangBridge, a novel adapter that explicitly maps visual tokens to linear combinations of LLM vocabulary embeddings. This innovative design enables pretraining-free adapter transfer across different LLMs while maintaining performance. Our experimental results demonstrate that a LangBridge adapter pre-trained on Qwen2-0.5B can be directly applied to larger models such as LLaMA3-8B or Qwen2.5-14B while maintaining competitive performance. Overall, LangBridge enables interpretable vision-language alignment by grounding visual representations in LLM vocab embedding, while its plug-and-play design ensures efficient reuse across multiple LLMs with nearly no performance degradation. See our project page at https://curryx-001.github.io/LangBridge.github.io/.

Claim

Recent years have witnessed remarkable advances in Large Vision-Language Models (LVLMs), which have achieved human-level performance across various complex vision-language tasks.

ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Models Paper
  • Authors: Zifu Wan, Ce Zhang, Silong Yong, Martin Ma, Simon Stepputtis, Louis-philippe Morency, Deva Ramanan, Katia P. Sycara, Yaqi Xie
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.00309
  • Citations: 11
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, lvlm, large vision language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Recent Large Vision-Language Models (LVLMs) have introduced a new paradigm for understanding and reasoning about image input through textual responses. Although they have achieved remarkable performance across a range of multi-modal tasks, they face the persistent challenge of hallucination, which introduces practical weaknesses and raises concerns about their reliable deployment in realworld applications. Existing work has explored contrastive decoding approaches to mitigate this issue, where the output of the original LVLM is compared and contrasted with that of a perturbed version. However, these methods require two or more queries that slow down LVLM response generation, making them less suitable for real-time applications. To overcome this limitation, we propose ONLY, a trainingfree decoding approach that requires only a single query and a one-layer intervention during decoding, enabling efficient real-time deployment. Specifically, we enhance textual outputs by selectively amplifying crucial textual information using a text-to-visual entropy ratio for each token. Extensive experimental results demonstrate that our proposed ONLY consistently outperforms state-of-the-art methods across various benchmarks while requiring minimal implementation effort and computational cost. Code is available at https://github.com/zifuwan/ONLY.

Claim

Recent Large Vision-Language Models (LVLMs) have introduced a new paradigm for understanding and reasoning about image input through textual responses.

Corvid: Improving Multimodal Large Language Models Towards Chain-of-Thought Reasoning Paper
  • Authors: Jingjing Jiang, Chao Ma, Xurui Song, Hanwang Zhang, Jun Luo
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.00291
  • Citations: 11
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: mllm, mllms, multimodal large language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Recent advancements in multimodal large language models (MLLMs) have demonstrated exceptional performance in multimodal perception and understanding. However, leading open-source MLLMs exhibit significant limitations in complex and structured reasoning, particularly in tasks requiring deep reasoning for decision-making and problem-solving. In this work, we present Corvid, an MLLM with enhanced chain-of-thought (CoT) reasoning capabilities. Architecturally, Corvid incorporates a hybrid vision encoder for informative visual representation and a meticulously designed connector (GateMixer) to facilitate cross-modal alignment. To enhance Corvid's CoT reasoning capabilities, we introduce MCoT-Instruct-287K, a high-quality multimodal CoT instruction-following dataset, refined and standardized from diverse public reasoning sources. Leveraging this dataset, we fine-tune Corvid with a two-stage CoT-formatted training approach to progressively enhance its step-by-step reasoning abilities. Furthermore, we propose an effective inferencetime scaling strategy that enables Corvid to mitigate overreasoning and under-reasoning through self-verification. Extensive experiments demonstrate that Corvid outperforms existing ol-like MLLMs and state-of-the-art MLLMs with similar parameter scales, with notable strengths in mathematical reasoning and science problem-solving. Project page: https://mm-vl.qithub.io/corvid.

Claim

Recent advancements in multimodal large language models (MLLMs) have demonstrated exceptional performance in multimodal perception and understanding.

Why LVLMs are More Prone to Hallucinations in Longer Responses: The Role of Context Paper
  • Authors: Ge Zheng, Jiaye Qian, Jiajin Tang, Sibei Yang
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.00391
  • Citations: 11
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, large vision language models, lvlms).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Large Vision-Language Models (LVLMs) have made significant progress in recent years but are also prone to hallucination issues. They exhibit more hallucinations in longer, free-form responses, often attributed to accumulated uncertainties. In this paper, we ask: Does increased hallucination result solely from length-induced errors, or is there a deeper underlying mechanism? After a series of preliminary experiments and findings, we suggest that the risk of hallucinations is not caused by length itself but by the increased reliance on context for coherence and completeness in longer responses. Building on these insights, we propose a novel “induce-detect-suppress” framework that actively induces hallucinations through deliberately designed contexts, leverages induced instances for early detection of high-risk cases, and ultimately suppresses potential objectlevel hallucinations during actual decoding. Our approach achieves consistent, significant improvements across all benchmarks, demonstrating its efficacy. The strong detection and improved hallucination mitigation not only validate our framework but, more importantly, re-validate our hypothesis on context. Rather than solely pursuing performance gains, this study aims to provide new insights and serves as a first step toward a deeper exploration of hallucinations in LVLMs' longer responses.

Claim

Large Vision-Language Models (LVLMs) have made significant progress in recent years but are also prone to hallucination issues.

Robustifying Zero-Shot Vision Language Models by Subspaces Alignment Paper
  • Authors: Junhao Dong, Piotr Koniusz, Liaoyuan Feng, Yifei Zhang, Hao Zhu, Weiming Liu, Xinghua Qu, Y. Ong
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.01955
  • Citations: 11
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.

CoTMR: Chain-of-Thought Multi-Scale Reasoning for Training-Free Zero-Shot Composed Image Retrieval Paper
  • Authors: Zelong Sun, Dong Jing, Zhiwu Lu
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.48550/arXiv.2502.20826
  • Citations: 10
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: lvlm, large vision language model, vision language model).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images by integrating information from a composed query (reference image and modification text) without training samples. Existing methods primarily combine caption models and Large Language Models (LLMs) to generate target captions from composed queries but face various issues such as incompatibility, visual information loss, and insufficient reasoning. In this work, we propose CoTMR, a training-free framework with novel Chain-of-thought (CoT) and Multi-scale Reasoning. Instead of relying on caption models for modality transformation, CoTMR directly employs the Large Vision-Language Model (LVLM) to achieve unified understanding and reasoning of composed queries. To enhance reasoning reliability, we devise CIRCoT, which guides the LVLM to perform step-by-step reasoning by following predefined subtasks. Additionally, while most existing approaches focus solely on global-level reasoning, CoTMR introduces fine-grained predictions about the presence or absence of key elements at the object scale for more comprehensive reasoning. Furthermore, we design a MultiGrained Scoring (MGS) mechanism, which integrates CLIP similarity scores of the above reasoning outputs with candidate images to realize precise retrieval. Extensive experiments demonstrate that our CoTMR not only drastically outperforms previous methods across four prominent benchmarks but also offers appealing interpretability.

Claim

Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images by integrating information from a composed query (reference image and modification text) without training samples.

Latte: Collaborative Test-Time Adaptation of Vision-Language Models in Federated Learning Paper
  • Authors: Wenxuan Bao, Ruxi Deng, Ruizhong Qiu, Tianxin Wei, Hanghang Tong, Jingrui He
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.00020
  • Citations: 10
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Test-time adaptation with pre-trained vision-language models has gained increasing attention for addressing distribution shifts during testing. Among these approaches, memory-based algorithms stand out due to their trainingfree nature and ability to leverage historical test data. However, existing test-time adaptation methods are typically designed for a single domain with abundant data. In decentralized settings such as federated learning, applying these methods individually to each client suffers from limited test data, while directly sharing a single global memory via the server prevents proper personalization to each client's unique distribution. To address this, we propose Latte, a novel framework where each client maintains a local memory to store embeddings from its own historical test data and an external memory to store class prototypes from other relevant clients. During communication, each client retrieves prototypes from similar clients under the server's coordination to expand its memory. For local adaptation, Latte utilizes both embedding similarity and uncertainty to enhance model performance. Our theoretical analysis shows that Latte effectively leverages in-distribution clients while remaining robust to out-of-distribution clients. Extensive experiments on domain adaptation and corruption benchmarks validate that Latte achieves superior performance in decentralized settings, while introducing only negligible communication and computation costs. Our code is available at https://github.com/baowenxuan/Latte.

Claim

Test-time adaptation with pre-trained vision-language models has gained increasing attention for addressing distribution shifts during testing.

VisNumBench: Evaluating Number Sense of Multimodal Large Language Models Paper
  • Authors: Tengjin Weng, Jingyi Wang, Wenhao Jiang, Zhong Ming
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.00365
  • Citations: 9
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: mllms, multimodal large language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Can Multimodal Large Language Models (MLLMs) develop an intuitive number sense similar to humans? Targeting this problem, we introduce Visual Number Benchmark (VisNumBench) to evaluate the number sense abilities of MLLMs across a wide range of visual numerical tasks. VisNumBench consists of about 1,900 multiplechoice question-answer pairs derived from both synthetic and real-world visual data, covering seven visual numerical attributes and four types of visual numerical estimation tasks. Our experiments on VisNumBench led to the following key findings: (i) The 17 MLLMs we tested-including open-source models such as Qwen2.5-VL and InternVL2.5, as well as proprietary models like GPT-4o and Gemini 2.0 Flash-perform significantly below human levels in number sense-related tasks. (ii) Multimodal mathematical models and multimodal chain-of-thought (CoT) models did not exhibit significant improvements in number sense abilities. (iii) Stronger MLLMs with larger parameter sizes and broader general abilities demonstrate modest gains in number sense abilities. We believe VisNumBench will serve as a valuable resource for the research community, encouraging further advancements in enhancing MLLMs' number sense abilities. Code and dataset are available at https://wwwtttjjj.github.io/VisNumBench/.

Claim

Can Multimodal Large Language Models (MLLMs) develop an intuitive number sense similar to humans? Targeting this problem, we introduce Visual Number Benchmark (VisNumBench) to evaluate the number sense abilities of MLLMs across a wide range of visual numerical tasks.

ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers Paper
  • Authors: Qianhao Yuan, Qingyu Zhang, Yanjiang Liu, Jiawei Chen, Yaojie Lu, Hongyu Lin, Jia Zheng, Xianpei Han, Le Sun
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.48550/arXiv.2504.00502
  • Citations: 9
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: mllm, mllms, multimodal large language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Multimodal Large Language Models (MLLMs) suffer from high computational costs due to their massive size and the large number of visual tokens. In this paper, we investigate layer-wise redundancy in MLLMs by introducing a novel metric, Layer Contribution (LC), which quantifies the impact of a layer's transformations on visual and text tokens, respectively. The calculation of LC involves measuring the divergence in model output that results from removing the layer's transformations on the specified tokens. Our pilot experiment reveals that many layers of MLLMs exhibit minimal contribution during the processing of visual tokens. Motivated by this observation, we propose ShortV, a training-free method that leverages \(L C\) to identify ineffective layers, and freezes visual token updates in these layers. Experiments show that ShortV can freeze visual token in approximately 60% of the MLLM layers, thereby dramatically reducing computational costs related to updating visual tokens. For example, it achieves a 50% reduction in FLOPs on LLaVA-NeXT-13B while maintaining superior performance. The code is publicly available at https://github.com/icip-cas/ShortV.

Claim

Multimodal Large Language Models (MLLMs) suffer from high computational costs due to their massive size and the large number of visual tokens.

AGO: Adaptive Grounding for Open World 3D Occupancy Prediction Paper
  • Authors: Peizheng Li, Shuxiao Ding, You Zhou, Qingwen Zhang, Onat Inak, Larissa Triess, Niklas Hanselmann, Marius Cordts, Andreas Zell
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.00809
  • Citations: 9
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, vlms).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Open-world 3D semantic occupancy prediction aims to generate a voxelized 3D representation from sensor inputs while recognizing both known and unknown objects. Transferring open-vocabulary knowledge from vision-language models (VLMs) offers a promising direction but remains challenging. However, methods based on VLM-derived 2D pseudo-labels with traditional supervision are limited by a predefined label space and lack general prediction capabilities. Direct alignment with pretrained image embeddings, on the other hand, often fails to achieve reliable performance because of inconsistent image and text representations in VLMs. To address these challenges, we propose AGO, a novel 3D occupancy prediction framework with adaptive grounding to handle diverse open-world scenarios. AGO first encodes surrounding images and class prompts into 3D and text embeddings, respectively, lever-aging similarity-based grounding training with 3D pseudolabels. Additionally, a modality adapter maps 3D embeddings into a space aligned with VLM-derived image embeddings, reducing modality gaps. Experiments on Occ3DnuScenes show that AGO improves unknown object prediction in zero-shot and few-shot transfer while achieving state-of-the-art closed-world self-supervised performance, surpassing prior methods by 4.09 mIoU. Code is available at: https://github.com/EdwardLeeLPZ/AGO.

Claim

Open-world 3D semantic occupancy prediction aims to generate a voxelized 3D representation from sensor inputs while recognizing both known and unknown objects.

Multimodal LLMs as Customized Reward Models for Text-to-Image Generation Paper
  • Authors: Shijie Zhou, Ruiyi Zhang, Huaisheng Zhu, B. Kveton, Yufan Zhou, Jiuxiang Gu, Jian Chen, Changyou Chen
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.01826
  • Citations: 9
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: mllm, mllms, multimodal large language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

We introduce LLaVA-Reward11Project page: https://github.com/sjz5202/LLaVAReward., an efficient reward model designed to automatically evaluate text-to-image (T2I) generations across multiple perspectives, leveraging pretrained multimodal large language models (MLLMs). Existing MLLM-based approaches require instruction-following data for supervised fine-tuning and evaluate generation quality on analyzing text response, which is time-consuming and difficult to train. To address this problem, we propose LLaVA-Reward, which directly utilizes the hidden states of MLLMs given text-image pairs. To enhance the bidirectional interaction between visual and textual representations in decoder-only MLLMs, we further propose adding a Skip-connection Cross Attention (SkipCA) module. This design enhances text-image correlation reasoning by connecting early-layer visual features with later-layer hidden representations. In addition, LLaVA-Reward supports different types of preference data for efficient fine-tuning, including paired preference data and unpaired data. We train LLaVA-Reward on four evaluation perspectives: textimage alignment, fidelity/artifact, safety, and overall ranking. Empirical results demonstrate that LLaVA-Reward outperforms conventional and MLLM-based methods in generating human-aligned scores for automatic evaluations and inference-time scaling in text-to-image generations.

Claim

We introduce LLaVA-Reward11Project page: https://github.com/sjz5202/LLaVAReward., an efficient reward model designed to automatically evaluate text-to-image (T2I) generations across multiple perspectives, leveraging pretrained multimodal large language models (MLLMs).