ML + Vision Top-6 Agent Survey - ICCV 2024 - Page 1 of 2

  • Venue: IEEE International Conference on Computer Vision
  • Year: 2024
  • Page: 1 / 2
  • Papers: 1-30 / 50
MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs Paper
  • Authors: Yunqiu Xu, Linchao Zhu, Yi Yang
  • Year: 2024
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.01642
  • Citations: 44
  • Relevance: 5 / 5
  • Why selected: Heuristic keyword/alias matches: LLM agents, vision-language models (matched: agentic, mllms, multimodal large language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

While multimodal large language models (MLLMs) have demonstrated extraordinary vision-language understanding capabilities, their abilities to solve instance-level visual-language problems beyond a single image warrant further exploration. To assess these unproven abilities of MLLMs, this paper proposes a new visual grounding task called multi-context visual grounding, which aims to localize instances of interest across multiple images based on open-ended text prompts. In order to facilitate this research, we construct a new dataset MC-Bench that features 2K high-quality and manually annotated samples. Each sample consists of an instance-level labeled image pair and a corresponding text prompt that indicates the target instances in the images. These text prompts are highly open-ended and follow three distinct styles, covering 20 practical skills. We benchmark over 20 state-of-the-art MLLMs and foundation models with potential multi-context visual grounding capabilities, along with our developed simple yet effective agentic baseline and a finetuned baseline by multi-context instruction tuning. Our evaluation reveals a non-trivial performance gap between existing MLLMs and humans, along with some insightful observations that suggest potential future directions. We hope that MC-Bench and our empirical findings encourage the research community to further advance the untapped potentials of MLLMs in instance-level tasks, particularly in multi-image contexts. Project page: https://xuyunqiu.github.io/MC-Bench.

Claim

While multimodal large language models (MLLMs) have demonstrated extraordinary vision-language understanding capabilities, their abilities to solve instance-level visual-language problems beyond a single image warrant further exploration.

LLaVA-CoT: Let Vision Language Models Reason Step-By-Step Paper
  • Authors: Guowei Xu, Peng Jin, Hao Li, Yibing Song, Lichao Sun, Li Yuan
  • Year: 2024
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.00202
  • Citations: 479
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, vlms).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Large language models have demonstrated substantial advancements in reasoning capabilities. However, current Vision-Language Models (VLMs) often struggle to perform systematic and structured reasoning, especially when handling complex visual question-answering tasks. In this work, we introduce LLaVA-CoT (built upon the Llama-3.2-Vision model [43]), a large VLM designed to conduct autonomous multistage reasoning. Unlike chain-of-thought prompting, LLaVA-CoT independently engages in sequential stages of summarization, visual interpretation, logical reasoning, and conclusion generation. This structured approach enables LLaVA-CoT to achieve marked improvements on reasoning-intensive tasks. To accomplish this, we construct the LLaVA-CoT-100k dataset, integrating samples from various visual question answering sources and providing structured reasoning annotations. Besides, we propose a test-time stage-wise retracing search method (SWIRES), which enables effective and efficient test-time scaling. Remarkably, with only 100k training samples and test-time scaling, LLaVA-CoT not only outperforms its base model by 9.4% on a wide range of multimodal reasoning benchmarks, but also surpasses the performance of larger and even closed-source models, such as Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct. The code, dataset, and pre-trained weights are publicly available at https://github.com/PKU-YuanGroup/LLaVA-CoT.

Claim

Large language models have demonstrated substantial advancements in reasoning capabilities.

GUIOdyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices Paper
  • Authors: Quanfeng Lu, Wenqi Shao, Zitao Liu, Fanqing Meng, Boxuan Li, Botong Chen, Siyuan Huang, Kaipeng Zhang, Yu Qiao, Ping Luo
  • Year: 2024
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.02080
  • Citations: 158
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: multimodal agents, computer-use agents (matched: multimodal agent, gui agents).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Autonomous Graphical User Interface (GUI) navigation agents can enhance user experience in communication, entertainment, and productivity by streamlining workflows and reducing manual intervention. However, prior GUI agents are often trained with datasets comprising tasks that can be completed within a single app, leading to poor performance in cross-app navigation. To address this problem, we present GUIOdyssey, a comprehensive dataset for cross-app mobile GUI navigation. GUIOdyssey comprises 8,334 episodes with an average of 15.3 steps per episode, covering 6 mobile devices, 212 distinct apps, and 1,357 app combinations. Each step is enriched with detailed semantic reasoning annotations, which aid the model in building cognitive processes and enhancing its reasoning abilities for complex cross-app tasks. Building on GUIOdyssey, we develop OdysseyAgent, an exploratory multimodal agent for long-step cross-app navigation equipped with a history resampler module that efficiently attends to historical screenshot tokens, balancing performance and inference speed. Extensive experiments conducted in both in-domain and out-of-domain scenarios validate the effectiveness of our approach. Moreover, we demonstrate that historical information involving actions, screenshots and context in our dataset can significantly enhance OdysseyAgent's performance on complex cross-app tasks.

Claim

Autonomous Graphical User Interface (GUI) navigation agents can enhance user experience in communication, entertainment, and productivity by streamlining workflows and reducing manual intervention.

VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks Paper
  • Authors: Shiduo Zhang, Zhe Xu, Peiju Liu, Xiaopeng Yu, Yuan Li, Qinghui Gao, Zhaoye Fei, Zhangyue Yin, Zuxuan Wu, Yu-Gang Jiang, Xipeng Qiu
  • Year: 2024
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.01037
  • Citations: 118
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models, embodied agents (matched: vlms, embodied agents).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

General-purposed embodied agents are designed to understand the users' natural instructions or intentions and act precisely to complete universal tasks. Recently, methods based on foundation models, especially Vision-Language-Action models (VLAs), have shown a substantial potential to solve language-conditioned manipulation (LCM) tasks well. However, existing benchmarks do not adequately meet the needs of VLAs and related algorithms. To better define such general-purpose tasks in the context of LLMs and advance the research in VLAs, we present VLABench, an open-source benchmark for evaluating universal LCM task learning. VLABench provides 100 carefully designed categories of tasks, with strong randomization in each category of task and a total of 2000+ objects. VLABench stands out from previous benchmarks in four key aspects: 1) tasks requiring world knowledge and common sense transfer, 2) natural language instructions with implicit human intentions rather than templates, 3) long-horizon tasks demanding multi-step reasoning, and 4) evaluation of both action policies and language model capabilities. The benchmark assesses multiple competencies including understanding of mesh & texture, spatial relationship, semantic instruction, physical laws, knowledge transfer and reasoning, etc. To support the downstream finetuning, we provide high-quality training data collected via an automated framework incorporating heuristic skills and prior information. The experimental results indicate that both the current state-of-the-art pretrained VLAs and the workflow based on VLMs face challenges in our tasks. Codes and more videos are available at https://vlabench.github.io/. Correspondence to: sdzhang23@m.fudan.edu.cn, xpqiu@fudan.edu.cn.

Claim

General-purposed embodied agents are designed to understand the users' natural instructions or intentions and act precisely to complete universal tasks.

Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs Paper
  • Authors: Qizhe Zhang, Aosong Cheng, Ming Lu, Renrui Zhang, Zhiyong Zhuo, Jiajun Cao, Shaobo Guo, Qi She, Shanghang Zhang
  • Year: 2024
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.01939
  • Citations: 97
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, large vision language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Large Vision-Language Models (LVLMs) generally contain significantly more visual tokens than their textual counterparts, resulting in a considerable computational burden. Recent efforts have been made to tackle this issue by pruning visual tokens early within the language model. Most existing works use attention scores between text and visual tokens to assess the importance of visual tokens. However, in this study, we first analyze the text-visual attention in the language model and find that this score is not an ideal indicator for token pruning. Based on the analysis, we propose VisPruner, a plug-and-play method that utilizes visual cues for more effective token pruning in LVLMs. Specifically, we first use visual attention to select a limited number of significant tokens. Then, we remove duplicate tokens from the remaining ones based on their similarity. By retaining diverse tokens alongside the initially selected important tokens, we maximally preserve the visual information of the input image. Experimental results demonstrate that our VisPruner sustains strong performance across various VLM architectures and reduction ratios, significantly outperforming existing methods based on text-visual attention. Notably, without any training, VisPruner can reduce the FLOPs of LLaVA-1.5-7B by 91% and inference latency by 75%, while maintaining comparable performance. Our code is available at https://github.com/Theia-4869/VisPruner.
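
The two-step selection described in the abstract (keep the most attended tokens, then drop near-duplicates among the rest) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `prune_tokens`, the greedy de-duplication order, and the `sim_threshold` parameter are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def prune_tokens(features, attention, n_important, n_diverse, sim_threshold=0.9):
    """Hypothetical helper: return the sorted indices of the tokens to keep.

    Step 1 keeps the n_important tokens with the highest attention score.
    Step 2 walks the remaining tokens (assumed order: by attention) and keeps up to
    n_diverse of them, skipping any token too similar to one kept in step 2.
    """
    order = sorted(range(len(features)), key=lambda i: attention[i], reverse=True)
    important = order[:n_important]
    diverse = []
    for i in order[n_important:]:
        if len(diverse) == n_diverse:
            break
        # Keep the token only if it is not a near-duplicate of an already kept one.
        if all(cosine(features[i], features[j]) < sim_threshold for j in diverse):
            diverse.append(i)
    return sorted(important + diverse)


# Token 2 duplicates token 1, so it is dropped in favour of token 3.
features = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 1.0]]
attention = [0.9, 0.5, 0.4, 0.1]
kept = prune_tokens(features, attention, n_important=1, n_diverse=2)
```

In this toy input the second stage prefers a less attended but distinct token over a duplicate, which is the behaviour the abstract attributes to retaining diverse tokens.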

Claim

Large Vision-Language Models (LVLMs) generally contain significantly more visual tokens than their textual counterparts, resulting in a considerable computational burden.

ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance Paper
  • Authors: Chunwei Wang, Guansong Lu, Junwei Yang, Runhu Huang, Jianhua Han, Lu Hou, Wei Zhang, Hang Xu
  • Year: 2024
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.02007
  • Citations: 61
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: multimodal large language model, mllm, mllms).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

In this paper, we introduce ILLUME, a unified multimodal large language model (MLLM) that seamlessly integrates multimodal understanding and generation capabilities within a single large language model through a unified next-token prediction formulation. To address the large dataset size typically required for image-text alignment, we propose to enhance data efficiency through the design of a vision tokenizer that incorporates semantic information and a progressive multi-stage training procedure. This approach reduces the dataset size to just 15M for pretraining, over four times fewer than what is typically needed, while achieving competitive or even superior performance with existing unified MLLMs, such as Janus. Additionally, to promote synergistic enhancement between understanding and generation capabilities, which is under-explored in previous works, we introduce a novel self-enhancing multimodal alignment scheme. This scheme supervises the MLLM to self-assess the consistency between text descriptions and self-generated images, facilitating the model to interpret images more accurately and avoid unrealistic and incorrect predictions caused by misalignment in image generation. Based on our extensive experiments, our proposed ILLUME stands out and competes with state-of-the-art unified MLLMs and specialized models across various benchmarks for multimodal understanding, generation, and editing.

Claim

In this paper, we introduce ILLUME, a unified multimodal large language model (MLLM) that seamlessly integrates multimodal understanding and generation capabilities within a single large language model through a unified next-token prediction formulation.

FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Vision Language Models Paper
  • Authors: Tianyu Fu, Tengxuan Liu, Qinghao Han, Guohao Dai, Shengen Yan, Huazhong Yang, Xuefei Ning, Yu Wang
  • Year: 2024
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.02103
  • Citations: 51
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, large vision language models, lvlms).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

The increasing demand to process long and high-resolution videos significantly burdens Large Vision-Language Models (LVLMs) due to the enormous number of visual tokens. Existing token reduction methods primarily prune tokens based on importance metrics, such as cumulative attention scores. However, even important tokens may exhibit high redundancy caused by similarity among adjacent video frames and repetitive visual elements. To address this limitation, we propose FrameFusion, a novel token reduction approach integrating similarity-based merging with importance-based pruning. We conduct a thorough study on token similarity characteristics, revealing three key insights: (1) spatially corresponding visual tokens between adjacent frames have higher cosine similarities compared to other token pairs; (2) high token similarities prominently decrease in deeper model layers; and (3) token similarity rankings are highly consistent across different layers. Guided by these observations, FrameFusion computes token similarities exclusively between corresponding visual tokens from adjacent frames, applies token merging at initial successive layers followed by pruning in deeper layers, and adopts a cascaded merging strategy to further enhance efficiency. We evaluate FrameFusion comprehensively across six diverse LVLMs, ranging from 2B to 72B parameters, using five video benchmarks encompassing video retrieval, question-answering, and spatial-temporal understanding tasks. Experiments show that FrameFusion reduces visual tokens by 70%, achieving 1.6-3.6× end-to-end speedups, with an average performance impact of less than 3%. Our code is available at https://github.com/thu-nics/FrameFusion.
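
As a rough illustration of the merge-then-prune idea in the abstract, the sketch below merges each token into the token at the same spatial position in the previous frame when their cosine similarity is high, then keeps the most important survivors. It is a simplified, layer-agnostic approximation and not the released FrameFusion code; the function names, the averaging rule, and the `threshold` value are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def merge_adjacent_frames(frames, threshold=0.9):
    """Hypothetical helper. frames[t][p] is the feature at position p of frame t.

    A token joins the group of the token at the same position in frame t-1 when
    the two are similar enough; otherwise it starts a new group. Each group is
    returned as the mean of its members.
    """
    groups = []        # each entry: [running_sum_vector, member_count]
    last_group = {}    # position -> group index used by the previous frame
    for t, frame in enumerate(frames):
        current = {}
        for p, tok in enumerate(frame):
            g = last_group.get(p)
            if g is not None and cosine(tok, frames[t - 1][p]) >= threshold:
                groups[g][0] = [a + b for a, b in zip(groups[g][0], tok)]
                groups[g][1] += 1
                current[p] = g
            else:
                groups.append([list(tok), 1])
                current[p] = len(groups) - 1
        last_group = current
    return [[x / n for x in total] for total, n in groups]


def prune_by_importance(tokens, importance, keep):
    """Keep the `keep` tokens with the highest importance, in original order."""
    top = sorted(range(len(tokens)), key=lambda i: importance[i], reverse=True)[:keep]
    return [tokens[i] for i in sorted(top)]


# Two frames, two positions: position 0 is static, position 1 changes.
frames = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [1.0, 0.0]]]
merged = merge_adjacent_frames(frames)
pruned = prune_by_importance(merged, importance=[0.2, 0.9, 0.5], keep=2)
```

Because a group can keep absorbing tokens from later frames, a static region collapses to a single token across many frames, which loosely mirrors the cascaded merging mentioned in the abstract.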

Claim

The increasing demand to process long and high-resolution videos significantly burdens Large Vision-Language Models (LVLMs) due to the enormous number of visual tokens.

GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks Paper
  • Authors: M. S. Danish, Muhammad Akhtar Munir, Syed Roshaan Ali Shah, Kartik Kuckreja, F. Khan, Paolo Fraccaro, Alexandre Lacoste, Salman H. Khan
  • Year: 2024
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.00670
  • Citations: 48
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, vlms).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

While numerous recent benchmarks focus on evaluating generic Vision-Language Models (VLMs), they do not effectively address the specific challenges of geospatial applications. Generic VLM benchmarks are not designed to handle the complexities of geospatial data, an essential component for applications such as environmental monitoring, urban planning, and disaster management. Key challenges in the geospatial domain include temporal change detection, large-scale object counting, tiny object detection, and understanding relationships between entities in remote sensing imagery. To bridge this gap, we present GEOBench-VLM, a comprehensive benchmark specifically designed to evaluate VLMs on geospatial tasks, including scene understanding, object counting, localization, fine-grained categorization, segmentation, and temporal analysis. Our benchmark features over 10,000 manually verified instructions spanning diverse visual conditions, object types, and scales. We evaluate several state-of-the-art VLMs to assess performance on geospatial-specific challenges. The results indicate that although existing VLMs demonstrate potential, they face challenges when dealing with geospatial-specific tasks, highlighting the room for further improvements. Notably, the best-performing LLaVa-OneVision achieves only 41.7% accuracy on MCQs, slightly more than GPT-4o, which is approximately double the random guess performance. Our benchmark is publicly available at https://github.com/The-AI-Alliance/GEO-Bench-VLM.

Claim

While numerous recent benchmarks focus on evaluating generic Vision-Language Models (VLMs), they do not effectively address the specific challenges of geospatial applications.

LLaVA-KD: A Framework of Distilling Multimodal Large Language Models Paper
  • Authors: Yuxuan Cai, Jiangning Zhang, Haoyang He, Xinwei He, Ao Tong, Zhenye Gan, Chengjie Wang, Xiang Bai
  • Year: 2024
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.00030
  • Citations: 44
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: mllm, mllms, multimodal large language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

The success of Large Language Models (LLMs) has inspired the development of Multimodal Large Language Models (MLLMs) for unified understanding of vision and language. However, the increasing model size and computational complexity of large-scale MLLMs (l-MLLMs) limit their use in resource-constrained scenarios. Although small-scale MLLMs (s-MLLMs) are designed to reduce computational costs, they typically suffer from performance degradation. To mitigate this limitation, we propose a novel LLaVA-KD framework to transfer knowledge from l-MLLMs to s-MLLMs. Specifically, we introduce Multimodal Distillation (MDist) to transfer teacher model's robust representations across both visual and linguistic modalities, and Relation Distillation (RDist) to transfer teacher model's ability to capture visual token relationships. Additionally, we propose a three-stage training scheme to fully exploit the potential of the proposed distillation strategy: 1) Distilled Pre-Training to strengthen the alignment between visual-linguistic representations in s-MLLMs, 2) Supervised Fine-Tuning to equip the s-MLLMs with multimodal understanding capacity, and 3) Distilled Fine-Tuning to refine s-MLLM's knowledge. Our approach significantly improves s-MLLMs performance without altering the model architecture. Extensive experiments and ablation studies validate the effectiveness of each proposed component. Code will be available at https://github.com/Fantasyele/LLaVA-KD.

Claim

The success of Large Language Models (LLMs) has inspired the development of Multimodal Large Language Models (MLLMs) for unified understanding of vision and language.

MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding Paper
  • Authors: Rongchang Xie, Chengyi Du, Pinghao Song, Chang Liu
  • Year: 2024
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.02237
  • Citations: 39
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language model).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

We introduce MUSE-VL, a Unified Vision-Language Model through Semantic discrete Encoding for multimodal understanding and generation. Recently, the research community has begun exploring unified models for visual generation and understanding. However, existing vision tokenizers (e.g., VQGAN) only consider low-level information, which makes it difficult to align with language tokens. This results in high training complexity and necessitates a large amount of training data to achieve optimal performance. Additionally, their performance is still far from dedicated understanding models. This paper proposes Semantic Discrete Encoding (SDE), which effectively aligns the information of visual tokens and language tokens by adding semantic constraints to the visual tokenizer. This greatly reduces the amount of training data and improves the performance of the unified model. With the same LLM size, our method improved the understanding performance by 4.8% compared to the previous SOTA Emu3 and surpassed the dedicated understanding model LLaVA-NeXT 34B by 3.7%. For visual generation, our model achieves a FID score of 7.73 on MJHQ-30k, surpassing the existing unified models.

Claim

We introduce MUSE-VL, a Unified Vision-Language Model through Semantic discrete Encoding for multimodal understanding and generation.

Harnessing Vision Foundation Models for High-Performance, Training-Free Open Vocabulary Segmentation Paper
  • Authors: Yuheng Shi, Minjing Dong, Chang Xu
  • Year: 2024
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.02180
  • Citations: 34
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, large vision language models, lvlms).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

While CLIP has advanced open-vocabulary predictions, its performance on semantic segmentation remains suboptimal. This shortfall primarily stems from its spatial-invariant semantic features and constrained resolution. While previous adaptations addressed the spatial invariance of semantic features by modifying the self-attention in CLIP's image encoder, the issue of limited resolution remains unexplored. Different from previous segment-then-splice methods that segment sub-images via a sliding window and splice the results, we introduce a splice-then-segment paradigm that incorporates Segment-Anything Model (SAM) to tackle the resolution issue since SAM excels at extracting fine-grained semantic correlations from high-resolution images. Specifically, we introduce Trident, a training-free framework that first splices features extracted by CLIP and DINO from sub-images, then leverages SAM's encoder to create a correlation matrix for global aggregation, enabling a broadened receptive field. Besides, we propose a refinement strategy for CLIP's coarse segmentation outputs by transforming them into prompts for SAM. Trident achieves a significant improvement in the mIoU across eight popular benchmarks compared with the previous SOTA. Furthermore, it can also be utilized to generate visual prompts that enhance the performance of Large Vision-Language Models (LVLMs).

Claim

While CLIP has advanced open-vocabulary predictions, its performance on semantic segmentation remains suboptimal.

Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration Paper
  • Authors: Mark Endo, Xiaohan Wang, S. Yeung-Levy
  • Year: 2024
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.02119
  • Citations: 32
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, vision language model).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Recent works on accelerating Vision-Language Models achieve strong performance across a variety of vision-language tasks despite highly compressing visual information. In this work, we examine the popular acceleration approach of early pruning of visual tokens inside the language model. Surprisingly, we find that while strong performance is maintained across many tasks, it exhibits drastically different behavior for a subset of vision-centric tasks such as localization. Upon further investigation, we uncover a core issue with the acceleration approach where most tokens towards the top of the image are pruned away. Yet, on many benchmarks aiming to evaluate vision-centric capabilities, strong performance persists with the flawed pruning strategy, highlighting these benchmarks' limited ability to assess fine-grained visual capabilities. Based on these findings, we propose FEATHER (Fast and Effective Acceleration wiTH Ensemble cRiteria), a straightforward approach that resolves the discovered early-layer pruning issue and further enhances the preservation of relevant tokens via multistage pruning with early uniform sampling to ensure broad image coverage. With comparable computational savings, we find that FEATHER achieves more than 5× performance improvement on the vision-centric localization benchmarks compared to the original acceleration approach.
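
The idea of combining a score-based criterion with uniform sampling can be sketched as follows. This is only an illustration of the general principle of an ensemble of keep criteria, not the FEATHER algorithm itself; the function name `ensemble_keep`, the single-stage structure, and the `stride` parameter are assumptions.

```python
def ensemble_keep(scores, k, stride):
    """Hypothetical helper: indices kept by either of two criteria.

    Criterion 1 keeps the k tokens with the highest score.
    Criterion 2 keeps every `stride`-th token, which guarantees that all image
    regions stay represented even if the scores favour one region.
    """
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    uniform = range(0, len(scores), stride)
    return sorted(set(top) | set(uniform))


# Tokens are in row-major order, so low indices are the top of the image.
# The scores here favour the bottom rows, as in the failure case the abstract reports.
scores = [0.0, 0.1, 0.0, 0.1, 0.0, 0.1, 0.9, 0.8]
score_only = ensemble_keep(scores, k=2, stride=len(scores) + 1)
with_uniform = ensemble_keep(scores, k=2, stride=4)
```

With a stride larger than the sequence, only token 0 is added by the uniform criterion, so the kept set is dominated by the bottom of the image; a smaller stride spreads the kept tokens across the whole grid.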

Claim

Recent works on accelerating Vision-Language Models achieve strong performance across a variety of vision-language tasks despite highly compressing visual information.

Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring Paper
  • Authors: Yufei Zhan, Yousong Zhu, Hongyin Zhao, Fan Yang, Ming Tang, Jinqiao Wang
  • Year: 2024
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.02130
  • Citations: 31
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models, computer-use agents (matched: vision language models, large vision language models, gui agents).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Large Vision Language Models have achieved fine-grained object perception, but the limitation of image resolution remains a significant obstacle to surpassing the performance of task-specific experts in complex and dense scenarios. Such limitation further restricts the model's potential to achieve nuanced visual and language referring in domains such as GUI Agents, counting, etc. To address this issue, we introduce a unified high-resolution generalist model, Griffon v2, enabling flexible object referring with visual and textual prompts. To efficiently scale up image resolution, we design a simple and lightweight down-sampling projector to overcome the input tokens constraint in Large Language Models. This design inherently preserves the complete contexts and fine details and significantly improves multimodal perception ability, especially for small objects. Building upon this, we further equip the model with visual-language co-referring capabilities through a plug-and-play visual tokenizer. It enables user-friendly interaction with flexible target images, free-form texts, and even coordinates. Experiments demonstrate that Griffon v2 can localize objects of interest with visual and textual referring, achieve state-of-the-art performance on REC and phrase grounding, and outperform expert models in object detection, object counting, and REG. Data and codes are released at https://github.com/jefferyZhan/Griffon.

Claim

Large Vision Language Models have achieved fine-grained object perception, but the limitation of image resolution remains a significant obstacle to surpassing the performance of task-specific experts in complex and dense scenarios.

Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models Paper
  • Authors: M. Teng, Xiaojun Jia, Ranjie Duan, Xinfeng Li, Yihao Huang, Zhixuan Chu, Yang Liu, Wenqi Ren
  • Year: 2024
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.00258
  • Citations: 31
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: mllms, multimodal large language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

With the rapid advancement of multimodal large language models (MLLMs), concerns regarding their security have increasingly captured the attention of both academia and industry. Although MLLMs are vulnerable to jailbreak attacks, designing effective jailbreak attacks poses unique challenges, especially given the highly constrained adversarial capabilities in real-world deployment scenarios. Previous works concentrate risks into a single modality, resulting in limited jailbreak performance. In this paper, we propose a heuristic-induced multimodal risk distribution jailbreak attack method, called HIMRD, which is black-box and consists of two elements: multimodal risk distribution strategy and heuristic-induced search strategy. The multimodal risk distribution strategy is used to distribute harmful semantics into multiple modalities to effectively circumvent the single-modality protection mechanisms of MLLMs. The heuristic-induced search strategy identifies two types of prompts: the understanding-enhancing prompt, which helps MLLMs reconstruct the malicious prompt, and the inducing prompt, which increases the likelihood of affirmative outputs over refusals, enabling a successful jailbreak attack. HIMRD achieves an average attack success rate (ASR) of 90% across seven open-source MLLMs and an average ASR of around 68% in three closed-source MLLMs. HIMRD reveals cross-modal security vulnerabilities in current MLLMs and underscores the imperative for developing defensive strategies to mitigate such emerging risks. Code is available at https://github.com/MaTengSYSU/HIMRDjailbreak.

Claim

With the rapid advancement of multimodal large language models (MLLMs), concerns regarding their security have increasingly captured the attention of both academia and industry.

V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding Paper
  • Authors: Junqi Ge, Ziyi Chen, Jintao Lin, Jinguo Zhu, Xihui Liu, Jifeng Dai, Xizhou Zhu
  • Year: 2024
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.01958
  • Citations: 29
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, vlms).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Vision-Language Models (VLMs) have shown promising capabilities in handling various multimodal tasks, yet they struggle in long-context scenarios, particularly tasks involving videos, high-resolution images, or lengthy image-text documents. In our work, we first conduct an empirical analysis of VLMs' long-context capabilities using our augmented long-context multimodal datasets. Our findings reveal that directly applying the positional encoding mechanism used for textual tokens to visual tokens is suboptimal, and VLM performance degrades sharply when the position encoding exceeds the model's context window. To address this, we propose Variable Visual Position Encoding (V2PE), a novel positional encoding approach that employs variable and smaller increments for visual tokens, enabling more efficient management of long multimodal sequences. Our experiments demonstrate the effectiveness of V2PE in enhancing VLMs' ability to effectively understand and reason over long multimodal contexts. We further integrate V2PE with our augmented long-context multimodal datasets to finetune the open-source VLMs. The fine-tuned model achieves strong performance on both standard and long-context multimodal tasks. Notably, when the sequence length of the training dataset is increased to 256K tokens, the model is capable of processing multimodal sequences up to 1M tokens, highlighting its potential for real-world long-context applications. We shall release the code, model weights, and datasets to facilitate further research.
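
A minimal sketch of the core idea, assuming position indices are assigned sequentially and that a token's index is fixed before the counter advances: text tokens advance the position by 1, while visual tokens advance it by a smaller step. The function name `position_ids` and the fixed `delta` value are hypothetical; the abstract describes the increment as variable and does not specify how it is chosen.

```python
def position_ids(is_visual, delta=0.25):
    """Hypothetical helper: assign a position index to every token.

    is_visual[i] is True for a visual token and False for a text token.
    Text tokens advance the position by 1.0, visual tokens by `delta` (< 1).
    """
    ids = []
    pos = 0.0
    for visual in is_visual:
        ids.append(pos)
        pos += delta if visual else 1.0
    return ids


# One text token, four visual tokens, one text token.
mask = [False, True, True, True, True, False]
compact = position_ids(mask, delta=0.25)
standard = position_ids(mask, delta=1.0)
```

With `delta=0.25` the four visual tokens consume a single unit of position range instead of four, so the final index is 2.0 rather than 5.0; this is how smaller increments keep long visual sequences inside the position range the model was trained on.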

Claim

Vision-Language Models (VLMs) have shown promising capabilities in handling various multimodal tasks, yet they struggle in long-context scenarios, particularly tasks involving videos, high-resolution images, or lengthy image-text documents.

Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM Paper
  • Authors: Hang Wang, Yuxiang Nie, Yongjie Ye, Guanyu Deng, Yanjie Wang, Shuai Li, Haiyang Yu, Jinghui Lu, Can Huang
  • Year: 2024
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.01935
  • Citations: 25
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, large vision language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

The application of Large Vision-Language Models (LVLMs) for analyzing images and videos is an exciting and rapidly evolving field. In recent years, we've seen significant growth in high-quality image-text datasets for fine-tuning image understanding, but there is still a lack of comparable datasets for videos. Additionally, many VideoLLMs are extensions of single-image VLMs, which may not efficiently handle the complexities of longer videos. In this study, we introduce a large-scale synthetic dataset created from proprietary models, using carefully designed prompts to tackle a wide range of questions. We also explore a dynamic visual token compression architecture that strikes a balance between computational efficiency and performance. Our proposed Dynamic-VLM achieves state-of-the-art results across various video tasks and shows impressive generalization, setting new baselines in multi-image understanding. Notably, Dynamic-VLM delivers an absolute improvement of 2.7% over LLaVA-OneVision on VideoMME and 10.7% on MuirBench.

Claim

The application of Large Vision-Language Models (LVLMs) for analyzing images and videos is an exciting and rapidly evolving field.

UnrealZoo: Enriching Photo-Realistic Virtual Worlds for Embodied AI Paper
  • Authors: Fangwei Zhong, Kui Wu, Chu-ran Wang, Hao Chen, Hai Ci, Zhoujun Li, Yizhou Wang
  • Year: 2024
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.00546
  • Citations: 24
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: embodied agents (matched: embodied agents, embodied ai).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

We introduce UnrealZoo, a collection of over 100 photorealistic 3D virtual worlds built on Unreal Engine, designed to reflect the complexity and variability of open-world environments. We also provide a rich variety of playable entities, including humans, animals, robots, and vehicles for embodied AI research. We extend UnrealCV with optimized APIs and tools for data collection, environment augmentation, distributed training, and benchmarking. These enhancements significantly improve the efficiency of rendering and communication, enabling advanced applications such as multi-agent interactions. Our experimental evaluation across visual navigation and tracking tasks reveals two key insights: 1) environmental diversity provides substantial benefits for developing generalizable reinforcement learning (RL) agents, and 2) current embodied agents face persistent challenges in open-world scenarios, including navigation in unstructured terrain, adaptation to unseen morphologies, and managing latency in the closed-loop control systems for interacting with highly dynamic objects. UnrealZoo thus serves as both a comprehensive testing ground and a pathway toward developing more capable embodied AI systems for real-world deployment.

Claim

We introduce UnrealZoo, a collection of over 100 photorealistic 3D virtual worlds built on Unreal Engine, designed to reflect the complexity and variability of open-world environments.

Ideator: Jailbreaking and Benchmarking Large Vision-Language Models Using Themselves Paper
  • Authors: Ruofan Wang, Juncheng Li, Yixu Wang, Bo Wang, Xiaosen Wang, Yan Teng, Yingchun Wang, Xingjun Ma, Yu-Gang Jiang
  • Year: 2024
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.00830
  • Citations: 23
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, large vision language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

As large Vision-Language Models (VLMs) gain prominence, ensuring their safe deployment has become critical. Recent studies have explored VLM robustness against jailbreak attacks, techniques that exploit model vulnerabilities to elicit harmful outputs. However, the limited availability of diverse multimodal data has constrained current approaches to rely heavily on adversarial or manually crafted images derived from harmful text datasets, which often lack effectiveness and diversity across different contexts. In this paper, we propose IDEATOR, a novel jailbreak method that autonomously generates malicious image-text pairs for black-box jailbreak attacks. IDEATOR is grounded in the insight that VLMs themselves could serve as powerful red team models for generating multimodal jailbreak prompts. Specifically, IDEATOR leverages a VLM to create targeted jailbreak texts and pairs them with jailbreak images generated by a state-of-the-art diffusion model. Extensive experiments demonstrate IDEATOR's high effectiveness and transferability, achieving a 94% attack success rate (ASR) in jailbreaking MiniGPT-4 with an average of only 5.34 queries, and high ASRs of 82%, 88%, and 75% when transferred to LLaVA, InstructBLIP, and Chameleon, respectively. Building on IDEATOR's strong transferability and automated process, we introduce the VLJailbreakBench, a safety benchmark comprising 3,654 multimodal jailbreak samples. Our benchmark results on 11 recently released VLMs reveal significant gaps in safety alignment. For instance, our challenge set achieves ASRs of 46.31% on GPT-4o and 19.65% on Claude-3.5-Sonnet, underscoring the urgent need for stronger defenses. Disclaimer: This paper contains content that may be disturbing or offensive.

Claim

As large Vision-Language Models (VLMs) gain prominence, ensuring their safe deployment has become critical.

GEMeX: A Large-Scale, Groundable, and Explainable Medical VQA Benchmark for Chest X-Ray Diagnosis Paper
  • Authors: Bo Liu, Ke Zou, Li-Ming Zhan, Zexin Lu, Xiaoyu Dong, Yidi Chen, Chengqiang Xie, Jiannong Cao, Xiao-Ming Wu, Huazhu Fu
  • Year: 2024
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.01979
  • Citations: 23
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, lvlm, large vision language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Medical Visual Question Answering (Med-VQA) combines computer vision and natural language processing to automatically answer clinical inquiries about medical images. However, current Med-VQA datasets exhibit two significant limitations: (1) they often lack visual and textual explanations for answers, hindering comprehension for patients and junior doctors; (2) they typically offer a narrow range of question formats, inadequately reflecting the diverse requirements in practical scenarios. These limitations pose significant challenges to the development of a reliable and user-friendly Med-VQA system. To address these challenges, we introduce a large-scale, Groundable, and Explainable Medical VQA benchmark for chest X-ray diagnosis (GEMeX), featuring several innovative components: (1) a multi-modal explainability mechanism that offers detailed visual and textual explanations for each question-answer pair, thereby enhancing answer comprehensibility; (2) four question types (open-ended, closed-ended, single-choice, and multiple-choice) to better reflect practical needs. With 151,025 images and 1,605,575 questions, GEMeX is the currently largest chest X-ray VQA dataset. Evaluation of 12 representative large vision language models (LVLMs) on GEMeX reveals suboptimal performance, underscoring the dataset's complexity. Meanwhile, we propose a strong model by fine-tuning an existing LVLM on the GEMeX training set. The substantial performance improvement showcases the dataset's effectiveness. The benchmark is available at www.med-vqa.com/GEMeX.

Claim

Medical Visual Question Answering (Med-VQA) combines computer vision and natural language processing to automatically answer clinical inquiries about medical images.

Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension Paper
  • Authors: Wang Xiyao, Zhengyuan Yang, Linjie Li, Hongjin Lu, Yuancheng Xu, Lin Lin, L. Kevin, Furong Huang, Lijuan Wang
  • Year: 2024
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.00117
  • Citations: 23
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, vlms).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Despite significant advancements in vision-language models (VLMs), there is a lack of effective approaches to enhance response quality by scaling inference-time computation. This capability is known to be a core step towards the self-improving models in recent large language model studies. In this paper, we present Vision Value Model (VisVM) that can guide VLM inference-time search to generate responses with better visual comprehension. Specifically, VisVM not only evaluates the generated sentence quality in the current search step, but also anticipates the quality of subsequent sentences that may result from the current step, thus providing a long-term value. In this way, VisVM steers VLMs away from generating sentences prone to hallucinations or insufficient detail, thereby producing higher quality responses. Experimental results demonstrate that VisVM-guided search significantly enhances VLMs' ability to generate descriptive captions with richer visual details and fewer hallucinations, compared with greedy decoding and search methods with other visual reward signals. Furthermore, we find that self-training the model with the VisVM-guided captions improves VLM's performance across a wide range of multimodal benchmarks, indicating the potential for developing self-improving VLMs.
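
The sentence-level search described in the abstract can be sketched as a loop that samples several candidate next sentences and keeps the one a value function scores highest. The sketch below is a generic illustration under stated assumptions, not VisVM itself: `value_guided_search`, `toy_propose`, and `toy_value` are hypothetical names, and the toy scorer (word count) merely stands in for a learned model that would estimate long-term response quality.

```python
from typing import Callable, List

# Hypothetical interfaces: a proposer returns up to k candidate next
# sentences; a scorer returns a value estimate for one candidate.
Proposer = Callable[[str, List[str], int], List[str]]
Scorer = Callable[[str, List[str], str], float]


def value_guided_search(
    prompt: str,
    propose: Proposer,
    value: Scorer,
    num_steps: int = 3,
    num_candidates: int = 4,
) -> List[str]:
    """Build a response sentence by sentence, keeping the top-valued candidate."""
    response: List[str] = []
    for _ in range(num_steps):
        candidates = propose(prompt, response, num_candidates)
        if not candidates:
            break  # the proposer has nothing more to say
        best = max(candidates, key=lambda s: value(prompt, response, s))
        response.append(best)
    return response


def toy_propose(prompt: str, so_far: List[str], k: int) -> List[str]:
    """Stand-in for sampling k sentences from a VLM."""
    step = len(so_far)
    pool = [
        f"Sentence {step}: a vague scene.",
        f"Sentence {step}: a red car parked beside a tree.",
    ]
    return pool[:k]


def toy_value(prompt: str, so_far: List[str], candidate: str) -> float:
    """Stand-in for a learned value model: prefers more detailed sentences."""
    return float(len(candidate.split()))


caption = value_guided_search(
    "Describe the image.", toy_propose, toy_value, num_steps=2, num_candidates=2
)
print(" ".join(caption))
```

Swapping `toy_value` for a scorer that only rates the current sentence would give a plain reward-guided search; the abstract's point is that the value signal should also anticipate the quality of the sentences that follow.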

Claim

Despite significant advancements in vision-language models (VLMs), there is a lack of effective approaches to enhance response quality by scaling inference-time computation.

WSI-LLaVA: A Multimodal Large Language Model for Whole Slide Image Paper
  • Authors: Yuci Liang, Xinheng Lyu, Meidan Ding, Wenting Chen, Jipeng Zhang, Yuexiang Ren, Xiangjian He, Song Wu, Sen Yang, Xiyue Wang, Xiaohan Xing, Linlin Shen
  • Year: 2024
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.02109
  • Citations: 22
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: multimodal large language model, mllm, mllms).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Recent advances in computational pathology have introduced whole slide image (WSI)-level multimodal large language models (MLLMs) for automated pathological analysis. However, current WSI-level MLLMs face two critical challenges: limited explainability in their decision-making process and insufficient attention to morphological features crucial for accurate diagnosis. To address these challenges, we first introduce WSI-Bench, a large-scale morphology-aware benchmark containing 180k VQA pairs from 9,850 WSIs across 30 cancer types, specifically designed to evaluate MLLMs' understanding of morphological characteristics crucial for accurate diagnosis. To the best of our knowledge, WSI-Bench presents the first benchmark to systematically evaluate morphological understanding capabilities in WSI analysis. To enhance the model explainability, we present WSI-LLaVA, an MLLM framework for gigapixel WSI understanding with a three-stage training strategy, which can provide detailed morphological findings to explain its final answer. For more precise model assessment in pathological contexts, we develop two specialized WSI metrics: WSI-Precision and WSI-Relevance, focusing on clinical accuracy. Extensive evaluation on WSI-Bench reveals both the capabilities and limitations of current WSI MLLMs in morphological analysis and various pathology tasks, while demonstrating WSI-LLaVA's superior performance across all capabilities on both internal and external datasets. Source code and data are released.

Claim

Recent advances in computational pathology have introduced whole slide image (WSI)-level multimodal large language models (MLLMs) for automated pathological analysis.

PUMA: Empowering Unified MLLM with Multi-Granular Visual Generation Paper
  • Authors: Rongyao Fang, Chengqi Duan, Kun Wang, Hao Li, Hao Tian, Xingyu Zeng, Rui Zhao, Jifeng Dai, Hongsheng Li, Xihui Liu
  • Year: 2024
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.01433
  • Citations: 19
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: mllm, mllms, multimodal large language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Recent advancements in multimodal foundation models have yielded significant progress in vision-language understanding. Initial attempts have also explored the potential of multimodal large language models for visual content generation. However, existing approaches face a trade-off between generation diversity and controllability, struggling to meet the varying granularity demands of different image generation tasks within a unified MLLM framework. In this work, we propose PUMA, emPowering Unified MLLM with Multi-grAnular visual generation, a novel paradigm that tackles the diversity-controllability trade-off. PUMA achieves this by unifying multi-granular visual features as both inputs and outputs of MLLMs, thus effectively meeting the distinct granularity needs for diverse generation and precise manipulation within a single framework. Following multimodal pretraining and instruction tuning, PUMA demonstrates remarkable capabilities in a wide range of multimodal tasks, including image understanding, diverse text-to-image generation, editing, inpainting, colorization, and conditional generation. This work marks a significant stride towards realizing truly unified MLLMs capable of seamlessly adapting to the diverse granularity demands and task requirements inherent in various visual tasks.

Claim

Recent advancements in multimodal foundation models have yielded significant progress in vision-language understanding.

PhysSplat: Efficient Physics Simulation for 3D Scenes via MLLM-Guided Gaussian Splatting Paper
  • Authors: Haoyu Zhao, Hao Wang, Xingyue Zhao, Hao Fei, Hong-qiang Wang, Chengjiang Long, Hua Zou
  • Year: 2024
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.00498
  • Citations: 18
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: mllm).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Recent advancements in 3D generation models have opened new possibilities for simulating dynamic 3D object movements and customizing behaviors, yet creating this content remains challenging. Current methods often require manual assignment of precise physical properties for simulations or rely on video generation models to predict them, which is computationally intensive. In this paper, we rethink the usage of multi-modal large language model (MLLM) in physics-based simulation, and present PhysSplat, a physics-based approach that efficiently endows static 3D objects with interactive dynamics. We begin with detailed scene reconstruction and object-level 3D open-vocabulary segmentation, progressing to multi-view image in-painting. Inspired by human visual reasoning, we propose MLLM-based Physical Property Perception (MLLM-P3) to predict the mean physical properties of objects in a zero-shot manner. The Material Property Distribution Prediction model (MPDP) then estimates physical property distributions via geometry-conditioned probabilistic sampling of MLLM-P3 outputs, reformulating the problem as probability distribution estimation to reduce computational costs. Finally, we simulate objects in 3D scenes with particles sampled via the Physical-Geometric Adaptive Sampling (PGAS) strategy, efficiently capturing complex deformations and significantly reducing computational costs. Extensive experiments and user studies demonstrate that our PhysSplat achieves more realistic motion than state-of-the-art methods within 2 minutes on a single GPU. Here is our project page.

Claim

Recent advancements in 3D generation models have opened new possibilities for simulating dynamic 3D object movements and customizing behaviors, yet creating this content remains challenging.

Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding Paper
  • Authors: Yue Fan, Xiaojian Ma, Rongpeng Su, Jun Guo, Rujie Wu, Xi Chen, Qing Li
  • Year: 2024
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.00598
  • Citations: 16
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models, embodied agents (matched: vlm, embodied ai).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

This paper investigates the problem of understanding dynamic 3D scenes from egocentric observations, a key challenge in robotics and embodied AI. Unlike prior studies that explored this as long-form video understanding and utilized egocentric video only, we instead propose an LLM-based agent, Embodied VideoAgent, which constructs scene memory from both egocentric video and embodied sensory inputs (e.g. depth and pose sensing). We further introduce a VLM-based approach to automatically update the memory when actions or activities over objects are perceived. Embodied VideoAgent attains significant advantages over counterparts in challenging reasoning and planning tasks in 3D scenes, achieving gains of 6.5% on Ego4D-VQ3D, 2.6% on OpenEQA, and 15.3% on EnvQA. We have also demonstrated its potential in various embodied AI tasks including generating embodied interactions and perception for robot manipulation. The code and demo will be made public.

Claim

This paper investigates the problem of understanding dynamic 3D scenes from egocentric observations, a key challenge in robotics and embodied AI.

LoRA.rar: Learning to Merge LoRAs via Hypernetworks for Subject-Style Conditioned Image Generation Paper
  • Authors: Donald Shenaj, Ondrej Bohdal, Mete Ozay, Pietro Zanuttigh, Umberto Michieli
  • Year: 2024
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.01497
  • Citations: 14
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: mllm, mllms, multimodal large language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Recent advancements in image generation models have enabled personalized image creation with both user-defined subjects (content) and styles. Prior works achieved personalization by merging corresponding low-rank adapters (LoRAs) through optimization-based methods, which are computationally demanding and unsuitable for real-time use on resource-constrained devices like smartphones. To address this, we introduce LoRA.rar, a method that not only improves image quality but also achieves a remarkable speedup of over 4000× in the merging process. We collect a dataset of style and subject LoRAs and pre-train a hypernetwork on a diverse set of content-style LoRA pairs, learning an efficient merging strategy that generalizes to new, unseen content-style pairs, enabling fast, high-quality personalization. Moreover, we identify limitations in existing evaluation metrics for content-style quality and propose a new protocol using multimodal large language models (MLLMs) for more accurate assessment. Our method significantly outperforms the current state of the art in both content and style fidelity, as validated by MLLM assessments and human evaluations.

Claim

Recent advancements in image generation models have enabled personalized image creation with both user-defined subjects (content) and styles.

p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay Paper
  • Authors: Jun Zhang, Desen Meng, Ji Qi, Zhenpeng Huang, Tao Wu, Limin Wang
  • Year: 2024
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.00353
  • Citations: 13
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: mllm, mllms).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Despite the remarkable performance of multi-modal large language models (MLLMs) across diverse tasks, the substantial training and inference costs impede their advancement. In this paper, we propose p-MoD, an efficient MLLM architecture that significantly reduces training and inference costs while maintaining model performance. The majority of computation in MLLMs stems from the overwhelming volume of vision tokens processed by the transformer-based LLM. Accordingly, we leverage the Mixture-of-Depths (MoD) mechanism, where each LLM layer selects essential vision tokens to process while skipping redundant ones. However, integrating MoD into MLLMs is non-trivial. To address the challenges of training and inference stability as well as limited training data, we adapt the MoD module with two novel designs: tanh-gated weight normalization (TanhNorm) and symmetric token reweighting (STRing). Moreover, we observe that vision tokens exhibit higher redundancy in deeper layers and thus design a progressive ratio decay (PRD) strategy, which gradually reduces the token retention ratio layer by layer, employing a shifted cosine schedule. This crucial design fully unleashes the potential of MoD, significantly boosting the efficiency and performance of our models. Extensive experiments on two baseline models across 15 benchmarks show that our model matches or even surpasses the performance of corresponding baselines, while requiring only 55.6% TFLOPs and 53.7% KV cache storage during inference, and 77.7% GPU hours during training.
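
The progressive ratio decay idea, keeping fewer vision tokens in deeper layers along a shifted cosine curve, can be sketched as follows. The abstract does not give the schedule's formula, so the form used here (half-cosine from 1 toward 0 over the layers, plus a constant `shift`, clipped to `[floor, 1]`) is an assumption, as are the names `retention_ratios` and `tokens_kept` and all parameter values.

```python
import math
from typing import List


def retention_ratios(num_layers: int, shift: float = 0.0, floor: float = 0.0) -> List[float]:
    """Per-layer fraction of vision tokens to keep (assumed schedule, not the paper's)."""
    if num_layers <= 0:
        raise ValueError("num_layers must be positive")
    ratios: List[float] = []
    for layer in range(num_layers):
        # Half-cosine decay: 1.0 at the first layer, approaching 0 at depth.
        base = 0.5 * math.cos(math.pi * layer / num_layers) + 0.5
        # Shift the curve, then clip to a valid ratio.
        ratios.append(min(1.0, max(floor, base + shift)))
    return ratios


def tokens_kept(num_tokens: int, ratios: List[float]) -> List[int]:
    """Number of vision tokens each layer would process, at least one per layer."""
    return [max(1, round(num_tokens * r)) for r in ratios]


# Hypothetical configuration: 8 layers, 576 vision tokens, small negative shift.
schedule = retention_ratios(num_layers=8, shift=-0.05, floor=0.05)
for layer, (ratio, kept) in enumerate(zip(schedule, tokens_kept(576, schedule))):
    print(f"layer {layer}: keep {ratio:.3f} of tokens ({kept})")
```

In a Mixture-of-Depths layer the ratio would determine how many of the highest-scoring vision tokens are routed through the layer, while the rest skip it; the token-scoring router itself is outside the scope of this sketch.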

Claim

Despite the remarkable performance of multi-modal large language models (MLLMs) across diverse tasks, the substantial training and inference costs impede their advancement.

WalkVLM: Aid Visually Impaired People Walking by Vision Language Model Paper
  • Authors: Zhiqiang Yuan, Ting Zhang, Jiapei Zhang, Jie Zhou, Jinchao Zhang
  • Year: 2024
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.00918
  • Citations: 12
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language model, vlms).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Approximately 200 million individuals around the world suffer from varying degrees of visual impairment, making it crucial to leverage AI technology to offer walking assistance for these people. With the recent progress of vision-language models (VLMs), applying VLMs to offer walking guidance has become popular. However, the existing methods of walking guidance are mainly based on self-curated question-answering datasets that are not publicly accessible, without a standardized benchmark for training or evaluation. Moreover, walking assistance often requires real-time streaming video analysis and the generation of concise yet informative reminders, making VLMs struggle due to excessive responses and low efficiency in inferences. In this paper, we introduce the first large-scale dataset dedicated to walking assistance, comprising 12,000 video-annotation pairs, to provide a unified benchmark for training and evaluating systems to help visually-impaired individuals walk. Furthermore, a WalkVLM model is proposed, which employs chain of thought for hierarchical planning to generate concise but informative reminders and utilizes temporal-aware adaptive prediction to reduce the temporal redundancy of reminders. Finally, we have established a solid benchmark for blind walking task and verified the advantages of WalkVLM in stream video processing for this task compared to other VLMs. Dataset and code are available at https://walkvlm2024.github.io.

Claim

Approximately 200 million individuals around the world suffer from varying degrees of visual impairment, making it crucial to leverage AI technology to offer walking assistance for these people.

Visual-Oriented Fine-Grained Knowledge Editing for MultiModal Large Language Models Paper
  • Authors: Zhen Zeng, Leijiang Gu, Xun Yang, Zhangling Duan, Zenglin Shi, Meng Wang
  • Year: 2024
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.00240
  • Citations: 11
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: multimodal large language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Existing knowledge editing works for MultiModal Large Language Models primarily focus on text-oriented, coarse-grained scenarios, where modifying textual content alone is sufficient. As a result, they fail to capture the unique challenges of multi-modal editing, particularly when visual information is central to knowledge representation. In this paper, we introduce a visual-oriented, fine-grained multi-modal knowledge editing task that targets precise modifications in images containing multiple interacting entities. To support this, we propose the Fine-Grained Visual Knowledge Editing (FGVEdit) benchmark, designed to evaluate the accuracy and effectiveness of multi-modal editing at a granular level. To address this challenge, we present the Multimodal Scope Classifier-based Knowledge Editor (MSCKE), a new framework that leverages a multi-modal scope classifier to integrate both textual and visual information. By accurately identifying and updating knowledge localized within images, MSCKE ensures precise editing while preserving unrelated content. Extensive experiments on the FGVEdit benchmark highlight the complexity of this new task and demonstrate that existing methods struggle with fine-grained multi-modal editing. Our results highlight MSCKE as a scalable and promising framework for advancing multi-modal knowledge editing. Code is available at https://github.com/zeng-zhen/FGVEdit.

Claim

Existing knowledge editing works for MultiModal Large Language Models primarily focus on text-oriented, coarse-grained scenarios, where modifying textual content alone is sufficient.

Collaborative Instance Object Navigation: Leveraging Uncertainty-Awareness to Minimize Human-Agent Dialogues Paper
  • Authors: Francesco Taioli, Edoardo Zorzi, Gianni Franchi, A. Castellini, Alessandro Farinelli, Marco Cristani, Yiming Wang
  • Year: 2024
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.01745
  • Citations: 10
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models, embodied agents (matched: vision language models, vlms, embodied agent).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Language-driven instance object navigation assumes that a human initiates the task by providing a detailed description of the target to the embodied agent. While this description is crucial for distinguishing the target from other visually similar instances, providing it prior to navigation can be demanding for humans. We thus introduce Collaborative Instance object Navigation (CoIN), a new task setting where the agent actively resolves uncertainties about the target instance during navigation in natural, template-free and open-ended dialogues with the human, minimizing user input. We propose a novel training-free method, Agent-user Interaction with UncerTainty Awareness (AIUTA), which operates independently from the navigation policy, and focuses on the human-agent interaction reasoning using Vision-Language Models (VLMs) and Large Language Models (LLMs). First, upon object detection, a Self-Questioner model initiates internal self-dialogues within the agent to obtain a complete and accurate observation with a novel uncertainty estimation technique. Then, an Interaction Trigger module determines whether to ask a question to the human, continue, or halt navigation. For evaluation, we introduce CoIN-Bench, with a curated dataset designed for challenging multi-instance scenarios. CoIN-Bench supports both online evaluation with humans and reproducible experiments with simulated user-agent interactions. On CoIN-Bench, we show that AIUTA serves as a competitive baseline, whereas existing language-driven instance navigation methods struggle in multi-instance scenes.

Claim

Language-driven instance object navigation assumes that a human initiates the task by providing a detailed description of the target to the embodied agent.

Physics Context Builders: A Modular Framework for Physical Reasoning in Vision-Language Models Paper
  • Authors: Vahid Balazadeh, Mohammadmehdi Ataei, Hyunmin Cheong, A. Khasahmadi, Rahul G. Krishnan
  • Year: 2024
  • Venue: IEEE International Conference on Computer Vision
  • DOI: 10.1109/ICCV51701.2025.00687
  • Citations: 10
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, vlms).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Physical reasoning remains a significant challenge for Vision-Language Models (VLMs). This limitation arises from an inability to translate learned knowledge into predictions about physical behavior. Although continual fine-tuning can mitigate this issue, it is expensive for large models and impractical to perform repeatedly for every task. This necessitates the creation of modular and scalable ways to teach VLMs about physical reasoning. To that end, we introduce Physics Context Builders (PCBs), a modular framework where specialized smaller VLMs are fine-tuned to generate detailed physical scene descriptions. These can be used as physical contexts to enhance the reasoning capabilities of larger VLMs. PCBs enable the separation of visual perception from reasoning, allowing us to analyze their relative contributions to physical understanding. We perform experiments on CLEVRER and on Falling Tower, a stability detection dataset with both simulated and real-world scenes, to demonstrate that PCBs provide substantial performance improvements, increasing average accuracy by up to 13.8% on complex physical reasoning tasks. Notably, PCBs also show strong Sim2Real transfer, successfully generalizing from simulated training data to real-world scenes.

Claim

Physical reasoning remains a significant challenge for Vision-Language Models (VLMs).