ML + Vision Top-6 Agent Survey - ICCV 2025 - Page 5 of 6

  • Venue: IEEE International Conference on Computer Vision
  • Year: 2025
  • Page: 5 / 6
  • Papers: 121-150 / 153
Causality-Guided Prompt Learning for Vision-Language Models via Visual Granulation Paper
  • Authors: Mengyu Gao, Qiulei Dong
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.00114
  • Citations: 2
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Prompt learning has recently attracted much attention for adapting pre-trained vision-language models (e.g., CLIP) to downstream recognition tasks. However, most of the existing CLIP-based prompt learning methods only show a limited ability for handling fine-grained datasets. To address this issue, we propose a causality-guided text prompt learning method via visual granulation for CLIP, called CaPL, where the explored visual granulation technique could construct sets of visual granules for the text prompt to capture subtle discrepancies among different fine-grained classes through casual inference. The CaPL method contains the following two modules: (1) An attribute disentanglement module is proposed to decompose visual features into non-individualized attributes (shared by some classes) and individualized attributes (specific to single classes) using a Brownian Bridge Diffusion Model; (2) A granule learning module is proposed to construct visual granules by integrating the aforementioned attributes for recognition under two causal inference strategies. Thanks to the learned visual granules, more discriminative text prompt is expected to be learned. Extensive experimental results on 15 datasets demonstrate that our CaPL method significantly outperforms the state-of-the-art prompt learning methods, especially on fine-grained datasets. Code is available at https://github.com/GaoMY-521/CaPL_Code.

Claim

Prompt learning has recently attracted much attention for adapting pre-trained vision-language models (e.g., CLIP) to downstream recognition tasks.

Dynamic Group Detection using VLM-augmented Temporal Groupness Graph Paper
  • Authors: Kaname Yokoyama, Chihiro Nakatani, N. Ukita
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.00975
  • Citations: 2
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language model).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

This paper proposes dynamic human group detection in videos. For detecting complex groups, not only the local appearance features of in-group members but also the global context of the scene are important. Such local and global appearance features in each frame are extracted using a Vision-Language Model (VLM) augmented for group detection in our method. For further improvement, the group structure should be consistent over time. While previous methods are stabilized on the assumption that groups are not changed in a video, our method detects dynamically changing groups by global optimization using a graph with all frames' groupness probabilities estimated by our groupness-augmented CLIP features. Our experimental results demonstrate that our method outperforms state-of-the-art group detection methods on public datasets. Code: https://github.com/irajisamurai/VLM-GroupDetection.git

Claim

This paper proposes dynamic human group detection in videos.

Enhancing Numerical Prediction of MLLMS With Soft Labeling Paper
  • Authors: Pei Wang, Zhaowei Cai, Hao Yang, Davide Modolo, Ashwin Swaminathan
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.00327
  • Citations: 2
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: mllms).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.

Auxiliary Prompt Tuning of Vision-Language Models for Few-Shot Out-of-Distribution Detection Paper
  • Authors: Wenjun Miao, Guansong Pang, Zihang Wang, Jingyi Zheng, Xiaolong Bai
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.00454
  • Citations: 2
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.

Learning Beyond Still Frames: Scaling Vision-Language Models with Video Paper
  • Authors: Yiyuan Zhang, Handong Li, Jing Liu, Xiangyu Yue, Cuhk MMLab
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.02082
  • Citations: 2
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.

Layer-Wise Vision Injection With Disentangled Attention for Efficient LVLMs Paper
  • Authors: Xuange Zhang, Dengjie Li, Bo Liu, Zenghao Bao, Yao Zhou, Baisong Yang, Zhongying Liu, Yujie Zhong, Tongtong Yuan
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.00658
  • Citations: 2
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: lvlms).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.

Auto-Controlled Image Perception in MLLMs via Visual Perception Tokens Paper
  • Authors: Runpeng Yu, Xinyin Ma, Xinchao Wang
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.02026
  • Citations: 2
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: mllms).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.

GenieBlue: Integrating Both Linguistic and Multimodal Capabilities for Large Language Models on Mobile Devices Paper
  • Authors: Xudong Lu, Yinghao Chen, Renshou Wu, Haohao Gao, Xi Chen, Xue Yang, Xiangyu Zhao, Aojun Zhou, Fangyuan Li, Yafei Wen, Xiaoxin Chen, Shuai Ren, et al.
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.00400
  • Citations: 1
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: mllm, mllms, multimodal large language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have enabled their deployment on mobile devices. However, challenges persist in maintaining strong language capabilities and ensuring hardware compatibility, both of which are crucial for user experience and practical deployment efficiency. In our deployment process, we observe that existing MLLMs often face performance degradation on pure language tasks, and the current NPU platforms on smartphones do not support the MoE architecture, which is commonly used to preserve pure language capabilities during multimodal training. To address these issues, we systematically analyze methods to maintain pure language capabilities during the training of MLLMs, focusing on both training data and model architecture aspects. Based on these analyses, we propose GenieBlue, an efficient MLLM structural design that integrates both linguistic and multimodal capabilities for LLMs on mobile devices. GenieBlue freezes the original LLM parameters during MLLM training to maintain pure language capabilities. It acquires multimodal capabilities by duplicating specific transformer blocks for full fine-tuning and integrating lightweight LoRA modules. This approach preserves language capabilities while achieving comparable multimodal performance through extensive training. Deployed on smartphone NPUs, GenieBlue demonstrates efficiency and practicality for applications on mobile devices.

Claim

Recent advancements in Multimodal Large Language Models (MLLMs) have enabled their deployment on mobile devices.

Instruction-Oriented Preference Alignment for Enhancing Multi-Modal Comprehension Capability of MLLMs Paper
  • Authors: Zitian Wang, Yue Liao, Kang Rong, Fengyun Rao, Yibo Yang, Si Liu
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.00195
  • Citations: 1
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: mllms, multimodal large language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Preference alignment has emerged as an effective strategy to enhance the performance of Multimodal Large Language Models (MLLMs) following supervised fine-tuning. While existing preference alignment methods predominantly target hallucination factors, they overlook the factors essential for multi-modal comprehension capabilities, often narrowing their improvements on hallucination mitigation. To bridge this gap, we propose Instruction-oriented Preference Alignment (IPA), a scalable framework designed to automatically construct alignment preferences grounded in instruction fulfillment efficacy. Our method involves an automated preference construction coupled with a dedicated verification process that identifies instruction-oriented factors, avoiding significant variability in response representations. Additionally, IPA incorporates a progressive preference collection pipeline, further recalling challenging samples through model self-evolution and reference-guided refinement. Experiments conducted on Qwen2VL-7B demonstrate IPA's effectiveness across multiple benchmarks, including hallucination evaluation, visual question answering, and text understanding tasks, highlighting its capability to enhance general comprehension.

Claim

Preference alignment has emerged as an effective strategy to enhance the performance of Multimodal Large Language Models (MLLMs) following supervised fine-tuning.

Safeguarding Vision-Language Models: Mitigating Vulnerabilities to Gaussian Noise in Perturbation-Based Attacks Paper
  • Authors: Jiawei Wang, Yushen Zuo, Yuanjun Chai, Liu Zhendong, Yicheng Fu, Yichun Feng, Kin-Man Lam
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.00266
  • Citations: 1
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, vlms).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Vision-Language Models (VLMs) extend the capabilities of Large Language Models (LLMs) by incorporating visual information, yet they remain vulnerable to jailbreak attacks, especially when processing noisy or corrupted images. Although existing VLMs adopt security measures during training to mitigate such attacks, vulnerabilities associated with noise-augmented visual inputs are overlooked. In this work, we identify that missing noise-augmented training causes critical security gaps: many VLMs are susceptible to even simple perturbations such as Gaussian noise. To address this challenge, we propose Robust-VLGuard, a multimodal safety dataset with aligned / misaligned imagetext pairs, combined with noise-augmented fine-tuning that reduces attack success rates while preserving functionality of VLM. For stronger optimization-based visual perturbation attacks, we propose DiffPure-VLM, leveraging diffusion models to convert adversarial perturbations into Gaussian-like noise, which can be defended by VLMs with noise-augmented safety fine-tuning. Experimental results demonstrate that the distribution-shifting property of diffusion model aligns well with our fine-tuned VLMs, significantly mitigating adversarial perturbations across varying intensities. The dataset and code are available at https://github.com/JarvisUSTC/DiffPureRobustVLM

Claim

Vision-Language Models (VLMs) extend the capabilities of Large Language Models (LLMs) by incorporating visual information, yet they remain vulnerable to jailbreak attacks, especially when processing noisy or corrupted images.

FA: Forced Prompt Learning of Vision-Language Models for Out-of-Distribution Detection Paper
  • Authors: Xinhua Lu, Runhe Lai, Yanqi Wu, Kanghao Chen, Weishi Zheng, Ruixuan Wang
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.00115
  • Citations: 1
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, vlms).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Pre-trained vision-language models (VLMs) have advanced out-of-distribution (OOD) detection recently. However, existing CLIP-based methods often focus on learning OOD-related knowledge to improve OOD detection, showing limited generalization or reliance on external largescale auxiliary datasets. In this study, instead of delving into the intricate OOD-related knowledge, we propose an innovative CLIP-based framework based on Forced prompt leArning (FA), designed to make full use of the InDistribution (ID) knowledge and ultimately boost the effectiveness of OOD detection. Our key insight is to learn a prompt (i.e. forced prompt) that contains more diversified and richer descriptions of the ID classes beyond the textual semantics of class labels. Specifically, it promotes better discernment for ID images, by forcing more notable semantic similarity between ID images and the learnable forced prompt. Moreover, we introduce a forced coefficient, encouraging the forced prompt to learn more comprehensive and nuanced descriptions of the ID classes. In this way, FA is capable of achieving notable improvements in OOD detection, even when trained without any external auxiliary datasets, while maintaining an identical number of trainable parameters as CoOp. Extensive empirical evaluations confirm our method consistently outperforms current state-of-the-art methods. Code is available at https://github.com/OxFAFA/FA.

Claim

Pre-trained vision-language models (VLMs) have advanced out-of-distribution (OOD) detection recently.

One Object, Multiple Lies: A Benchmark for Cross-Task Adversarial Attack on Unified Vision-Language Models Paper
  • Authors: Jiale Zhao, Xinyang Jiang, Junyao Gao, Yuhao Xue, Cairong Zhao
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.00025
  • Citations: 1
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, vlms).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Unified vision-language models (VLMs) have recently shown remarkable progress, enabling a single model to flexibly address diverse tasks through different instructions within a shared computational architecture. This instruction-based control mechanism creates unique security challenges, as adversarial inputs must remain effective across multiple task instructions that may be unpredictably applied to process the same malicious content. In this paper, we introduce CrossVLAD, a new benchmark dataset carefully curated from MSCOCO with GPT-4-assisted annotations for systematically evaluating cross-task adversarial attacks on unified VLMs. CrossVLAD centers on the objectchange objective-consistently manipulating a target object's classification across four downstream tasks-and proposes a novel success rate metric that measures simultaneous misclassification across all tasks, providing a rigorous evaluation of adversarial transferability. To tackle this challenge, we present CRAFT (Cross-task Region-based Attack Framework with Token-alignment), an efficient regioncentric attack method. Extensive experiments on Florence2 and other popular unified VLMs demonstrate that our method outperforms existing approaches in both overall cross-task attack performance and targeted object-change success rates, highlighting its effectiveness in adversarially influencing unified VLMs across diverse tasks. Code is available at https://github.com/Gwill-Z/CRAFT.

Claim

Unified vision-language models (VLMs) have recently shown remarkable progress, enabling a single model to flexibly address diverse tasks through different instructions within a shared computational architecture.

One Last Attention for Your Vision-Language Model Paper
  • Authors: Liang Chen, Ghazi Shazan Ahmad, Tianjun Yao, Lingqiao Liu, Zhiqiang Shen
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.00144
  • Citations: 1
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, vision language model).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Pretrained vision-language models (VLMs), such as CLIP, achieve remarkable zero-shot performance, yet their downstream potential hinges on effective fine-tuning. Most adaptation methods typically focus on refining representation from separate modalities (text or vision) but neglect the critical role of their fused representations in the decisionmaking process, i.e. rational matrix that drives the final prediction [5]. To bridge the gap, we propose a simple yet effective Rational Adaptaion (RAda) to explicitly exploit the final fused representation during fine-tuning. RAda employs a learned mask, obtained from a lightweight attention layer attached at the end of a VLM, to dynamically calibrate the contribution of each element in the rational matrix, enabling targeted adjustments to the final cross-modal interactions without incurring costly modifications to intermediate features. Experiments in different settings (i.e. updating, or freezing pretrained encoders in adaptation, and testtime training that can only access the unlabeled test data) show that RAda serves as a versatile fine-tuning technique, improving the baseline with minimal code and performing comparably against current arts in most settings. Code is available at github.com/khufia/RAda.

Claim

Pretrained vision-language models (VLMs), such as CLIP, achieve remarkable zero-shot performance, yet their downstream potential hinges on effective fine-tuning.

PASG: A Closed-Loop Framework for Automated Geometric Primitive Extraction and Semantic Anchoring in Robotic Manipulation Paper
  • Authors: Zhihao Zhu, Yifan Zheng, Siyu Pan, Yaohui Jin, Yao Mu
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.00837
  • Citations: 1
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, vlms).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

The fragmentation between high-level task semantics and low-level geometric features remains a persistent challenge in robotic manipulation. While vision-language models (VLMs) have shown promise in generating affordanceaware visual representations, the lack of semantic grounding in canonical spaces and reliance on manual annotations severely limit their ability to capture dynamic semanticaffordance relationships. To address these, we propose Primitive-Aware Semantic Grounding (PASG), a closed-loop framework that introduces: (1) Automatic primitive extraction through geometric feature aggregation, enabling cross-category detection of keypoints and axes; (2) VLM-driven semantic anchoring that dynamically couples geometric primitives with functional affordances and task-relevant description; (3) A spatial-semantic reasoning benchmark and a fine-tuned VLM (Qwen2.5VL-PA). We demonstrate PASG's effectiveness in practical robotic manipulation tasks across diverse scenarios, achieving performance comparable to manual annotations. PASG achieves a finer-grained semantic-affordance understanding of objects, establishing a unified paradigm for bridging geometric primitives with task semantics in robotic manipulation.

Claim

The fragmentation between high-level task semantics and low-level geometric features remains a persistent challenge in robotic manipulation.

Controlling Multimodal Llms Via Reward-Guided Decoding Paper
  • Authors: Oscar Mañas, P. D'Oro, Koustuv Sinha, Adriana Romero-Soriano, M. Drozdzal, Aishwarya Agrawal
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.00137
  • Citations: 1
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: mllm, mllms, multimodal large language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

As Multimodal Large Language Models (MLLMs) gain widespread applicability, it is becoming increasingly desirable to adapt them for diverse user needs. In this paper, we study the adaptation of MLLMs through controlled decoding. To achieve this, we introduce the first method for reward-guided decoding of MLLMs and demonstrate its application in improving their visual grounding. Our method involves building reward models for visual grounding and using them to guide the MLLM's decoding process. Concretely, we build two separate reward models to independently control the degree of object precision and recall in the model's output. Our approach enables on-the-fly controllability of an MLLM's inference process in two ways: first, by giving control over the relative importance of each reward function during decoding, allowing a user to dynamically trade off object precision for recall in image captioning tasks; second, by giving control over the breadth of the search during decoding, allowing the user to control the trade-off between the amount of test-time compute and the degree of visual grounding. We evaluate our method on standard object hallucination benchmarks, showing that it provides significant controllability over MLLM inference, while consistently outperforming existing hallucination mitigation methods.

Claim

As Multimodal Large Language Models (MLLMs) gain widespread applicability, it is becoming increasingly desirable to adapt them for diverse user needs.

Normal and Abnormal Pathology Knowledge-Augmented Vision-Language Model for Anomaly Detection in Pathology Images Paper
  • Authors: J. Song, Jiamu Wang, A. Nguyen, Keunho Byeon, Sangjeong Ahn, Sung Hak Lee, Jin Tae Kwak
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.02049
  • Citations: 1
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language model).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Anomaly detection in computational pathology aims to identify rare and scarce anomalies where disease-related data are often limited or missing. Existing anomaly detection methods, primarily designed for industrial settings, face limitations in pathology due to computational constraints, diverse tissue structures, and lack of interpretability. To address these challenges, we propose Ano-NAViLa, a Normal and Abnormal pathology knowledge-augmented Vision-Language model for Anomaly detection in pathology images. Ano-NAViLa is built on a pre-trained visionlanguage model with a lightweight trainable MLP. By incorporating both normal and abnormal pathology knowledge, Ano-NAViLa enhances accuracy and robustness to variability in pathology images and provides interpretability through image-text associations. Evaluated on two lymph node datasets from different organs, Ano-NAViLa achieves the state-of-the-art performance in anomaly detection and localization, outperforming competing models.

Claim

Anomaly detection in computational pathology aims to identify rare and scarce anomalies where disease-related data are often limited or missing.

Calibrating MLLM-as-a-judge via Multimodal Bayesian Prompt Ensembles Paper
  • Authors: Eric Slyman, Md Mehrab Tanjim, Kushal Kafle, Stefan Lee
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.01600
  • Citations: 1
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: mllm, mllms, multimodal large language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Multimodal large language models (MLLMs) are increasingly used to evaluate text-to-image (TTI) generation systems, providing automated judgments based on visual and textual context. However, these “judge” models often suffer from biases, overconfidence, and inconsistent performance across diverse image domains. While prompt ensembling has shown promise for mitigating these issues in unimodal, text-only settings, our experiments reveal that standard ensembling methods fail to generalize effectively for TTI tasks. To address these limitations, we propose a new multimodal-aware method called Multimodal Mixture-of-Bayesian Prompt Ensembles (MMB). Our method uses a Bayesian prompt ensemble approach augmented by image clustering, allowing the judge to dynamically assign prompt weights based on the visual characteristics of each sample. We show that MMB improves accuracy in pairwise preference judgments and greatly enhances calibration, making it easier to gauge the judge's true uncertainty. In evaluations on two TTI benchmarks, HPSv2 and MJBench, MMB outperforms existing baselines in alignment with human annotations and calibration across varied image content. Our findings highlight the importance of multimodal-specific strategies for judge calibration and suggest a promising path forward for reliable large-scale TTI evaluation.

Claim

Multimodal large language models (MLLMs) are increasingly used to evaluate text-to-image (TTI) generation systems, providing automated judgments based on visual and textual context.

Automated Red Teaming for Text-to-Image Models Through Feedback-Guided Prompt Iteration with Vision-Language Models Paper
  • Authors: Wei Xu, Kangjie Chen, Jiawei Qiu, Yuyang Zhang, Run Wang, Jin Mao, Tianwei Zhang, Lina Wang
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.01726
  • Citations: 1
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.

Structured Policy Optimization: Enhance Large Vision-Language Model via Self-Referenced Dialogue Paper
  • Authors: Guohao Sun, Can Qin, Yihao Feng, Zeyuan Chen, Ran Xu, S. Dianat, Majid Rabbani, Raghuveer Rao, Zhiqiang Tao
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.00077
  • Citations: 1
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: large vision language model, vision language model).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.

Multimodal Large Language Model-Guided ISP Hyperparameter Optimization with Dynamic Preference Learning Paper
  • Authors: Xinyu Sun, Zhikun Zhao, Congyan Lang, Bing Li, Juan Wang
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.00048
  • Citations: 1
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: multimodal large language model).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.

ZipVL: Accelerating Vision-Language Models Through Dynamic Token Sparsity Paper
  • Authors: Yefei He, Feng Chen, Jing Liu, Wenqi Shao, Hong Zhou, Kaipeng Zhang, Bohan Zhuang
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.01904
  • Citations: 1
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.

LIRA: Reasoning Reconstruction via Multimodal Large Language Models Paper
  • Authors: Zhen Zhou, Tong Wang, Yunkai Ma, Xiaoyu Tan, Fengshui Jing
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.00172
  • Citations: 1
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: multimodal large language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.

CalliReader: Contextualizing Chinese Calligraphy via an Embedding-Aligned Vision-Language Model Paper
  • Authors: Yuxuan Luo, Jiaqi Tang, Chenyi Huang, Feiyang Hao, Zhouhui Lian
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.02138
  • Citations: 0
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language model).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Chinese calligraphy, a UNESCO Heritage, remains computationally challenging due to visual ambiguity and cultural complexity. Existing AI systems fail to contextualize their intricate scripts, because of limited annotated data and poor visual-semantic alignment. We propose CalliReader, a vision-language model (VLM) that solves the Chinese Calligraphy Contextualization (\(C C^{2}\)) problem through three innovations: (1) character-wise slicing for precise character extraction and sorting, (2) CalliAlign for visual-text token compression and alignment, (3) embedding instruction tuning (e-IT) for improving alignment and addressing data scarcity. We also build CalliBench, the first benchmark for full-page calligraphic contextualization, addressing three critical issues in previous \(O C R\) and VQA approaches: fragmented context, shallow reasoning, and hallucination. Extensive experiments including user studies have been conducted to verify our CalliReader's superiority to other state-of-the-art methods and even human professionals in page-level calligraphy recognition and interpretation, achieving higher accuracy while reducing hallucination. Comparisons with reasoning models highlight the importance of accurate recognition as a prerequisite for reliable comprehension. Quantitative analyses validate CalliReader's efficiency; evaluations on document and realworld benchmarks confirm its robust generalization ability.

Claim

Chinese calligraphy, a UNESCO Heritage, remains computationally challenging due to visual ambiguity and cultural complexity.

HRScene: How Far are VLMs from Effective High-Resolution Image Understanding? Paper
  • Authors: Yusen Zhang, Wenliang Zheng, Aashrith Madasu, Peng Shi, Ryo Kamoi, Hao Zhou, Zhuoyang Zou, Shu Zhao, Sarkar Snigdha Sarathi Das, Vipul Gupta, Xiaoxin Lu, Nan Zhang, et al.
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.02128
  • Citations: 0
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlms).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

High-resolution image (HRI) understanding aims to process images with a large number of pixels, such as pathological images and agricultural aerial images, both of which can exceed 1 million pixels. Vision Large Language Models (VLMs) can allegedly handle HRIs, however, there is a lack of a comprehensive benchmark for VLMs to evaluate HRI understanding. To address this gap, we introduce HRScene, a novel unified benchmark for HRI understanding with rich scenes. HRScene incorporates 25 real-world datasets and 2 synthetic diagnostic datasets with resolutions ranging from \(1,024 \times 1,024\) to \(35,503 \times 26,627\). HRScene is collected and re-annotated by 10 graduate-level annotators, covering 25 scenarios, ranging from microscopic to radiology images, street views, long-range pictures, and telescope images. It includes HRIs of real-world objects, scanned documents, and composite multi-image. The two diagnostic evaluation datasets are synthesized by combining the target image with the gold answer and distracting images in different orders, assessing how well models utilize regions in HRI. We conduct extensive experiments involving 28 VLMs, including Gemini 2.0 Flash and GPT-4o. Experiments on HRScene show that current VLMs achieve an average accuracy of around 50% on real-world tasks, revealing significant gaps in HRI understanding. Results on synthetic datasets reveal that VLMs struggle to effectively utilize HRI regions, showing significant Regional Divergence and lost-in-middle, shedding light on future research.

Claim

High-resolution image (HRI) understanding aims to process images with a large number of pixels, such as pathological images and agricultural aerial images, both of which can exceed 1 million pixels.

Exploiting Vision Language Model for Training-Free 3D Point Cloud OOD Detection via Graph Score Propagation Paper
  • Authors: Tiankai Chen, Yushu Li, Adam Goodge, Fei Teng, Xulei Yang, Tianrui Li, Xun Xu
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.02674
  • Citations: 0
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, vision language model).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Out-of-distribution (OOD) detection in 3D point cloud data remains a challenge, particularly in applications where safe and robust perception is critical. While existing OOD detection methods have shown progress for \(2 D\) image data, extending these to 3D environments involves unique obstacles. This paper introduces a training-free framework that leverages Vision-Language Models (VLMs) for effective OOD detection in 3D point clouds. By constructing a graph based on class prototypes and testing data, we exploit the data manifold structure to enhancing the effectiveness of VLMs for 3D OOD detection. We propose a novel Graph Score Propagation (GSP) method that incorporates prompt clustering and self-training negative prompting to improve OOD scoring with VLM. Our method is also adaptable to few-shot scenarios, providing options for practical applications. We demonstrate that GSP consistently outperforms state-of-the-art methods across synthetic and realworld datasets for 3D point cloud OOD detection.

Claim

Out-of-distribution (OOD) detection in 3D point cloud data remains a challenge, particularly in applications where safe and robust perception is critical.

DisCo: Towards Distinct and Coherent Visual Encapsulation in Video MLLMs Paper
  • Authors: Jiahe Zhao, Rongkun Zheng, Yi Wang, Helin Wang, Hengshuang Zhao
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.02016
  • Citations: 0
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: mllm, mllms, multimodal large language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

In video Multimodal Large Language Models (video MLLMs), the visual encapsulation process plays a pivotal role in converting video contents into representative tokens for LLM input. While linear projectors are widely employed for encapsulation, they introduce semantic indistinctness and temporal incoherence when applied to videos. Conversely, the structure of resamplers shows promise in tackling these challenges, but an effective solution remains unexplored. Drawing inspiration from resampler structures, we introduce DisCo, a novel visual encapsulation method designed to yield semantically distinct and temporally coherent visual tokens for video MLLMs. DisCo integrates two key components: (1) A Visual Concept Discriminator (VCD) module, assigning unique semantics for visual tokens by associating them in pair with discriminative concepts in the video. (2) A Temporal Focus Calibrator (TFC) module, ensuring consistent temporal focus of visual tokens to video elements across every video frame. Through extensive experiments on multiple video MLLM frameworks, we demonstrate that DisCo remarkably outperforms previous state-of-the-art methods across a variety of video understanding benchmarks, while also achieving higher token efficiency thanks to the reduction of semantic indistinctness. The codes will be available at https://github.com/ZJHTerry18/DisCo.

Claim

In video Multimodal Large Language Models (video MLLMs), the visual encapsulation process plays a pivotal role in converting video contents into representative tokens for LLM input.

ART: Adaptive Relation Tuning for Generalized Relation Prediction Paper
  • Authors: Gopika Sudhakaran, Hikaru Shindo, P. Schramowski, Simone Schaub-Meyer, K. Kersting, Stefan Roth
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.01515
  • Citations: 0
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, vlms).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Visual relation detection (VRD) is the task of identifying the relationships between objects in a scene. VRD models trained solely on relation detection data struggle to generalize beyond the relations on which they are trained. While prompt tuning has been used to adapt vision-language models (VLMs) for VRD, it uses handcrafted prompts and struggles with novel or complex relations. We argue that instruction tuning offers a more effective solution by fine-tuning VLMs on diverse instructional data. We thus introduce ART, an Adaptive Relation Tuning framework that adapts VLMs for VRD through instruction tuning and strategic instance selection. By converting VRD datasets into an instructiontuning format and employing an adaptive sampling algorithm, ART directs the VLM to focus on informative relations while maintaining generalizability. Specifically, we focus on the relation classification, where subject-object boxes are given and the model predicts the predicate between them. We tune on a held-in set and evaluate across multiple held-out datasets of varying complexity. Our approach strongly improves over its baselines and can infer unseen relation concepts, a capability absent in mainstream VRD methods. We demonstrate ART's practical value by using the predicted relations for segmenting complex scenes.

Claim

Visual relation detection (VRD) is the task of identifying the relationships between objects in a scene.

ROVI: A VLM-LLM Re-Captioned Dataset for Open-Vocabulary Instance-Grounded Text-to-Image Generation Paper
  • Authors: Cihang Peng, Qiming Hou, Zhong Ren, Kun Zhou
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.01879
  • Citations: 0
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language model).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

We present ROVI, a high-quality synthetic dataset for instance-grounded text-to-image generation, created by labeling 1M curated web images. Our key innovation is a strategy called re-captioning, focusing on the pre-detection stage, where a VLM (Vision-Language Model) generates comprehensive visual descriptions that are then processed by an LLM (Large Language Model) to extract a flat list of potential categories for OVDs (Open-Vocabulary Detectors) to detect. This approach yields a global prompt inherently linked to instance annotations while capturing secondary visual elements humans typically overlook. Evaluations show that ROVI exceeds existing detection datasets in image quality and resolution while containing two orders of magnitude more categories with an open-vocabulary nature. For demonstrative purposes, a text-to-image model GLIGEN [25] trained on ROVI significantly outperforms state-of-the-art alternatives in instance grounding accuracy, prompt fidelity, and aesthetic quality. Our dataset and reproducible pipeline are available at https://github.com/CihangPeng/ROVI.

Claim

We present ROVI, a high-quality synthetic dataset for instance-grounded text-to-image generation, created by labeling 1M curated web images.

Benchmarking Multimodal Large Language Models Against Image Corruptions Paper
  • Authors: Xinkuan Qiu, Meina Kan, Yongbin Zhou, Shiguang Shan
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.00843
  • Citations: 0
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: multimodal large language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.

Feature Decomposition-Recomposition in Large Vision-Language Model for Few-Shot Class-Incremental Learning Paper
  • Authors: Zongyao Xue, Meina Kan, Shiguang Shan, Xilin Chen
  • Year: 2025
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.00302
  • Citations: 0
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: large vision language model, vision language model).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.