ML + Vision Top-6 Agent Survey - CVPR 2025 - Page 4 of 5¶
Overview | Previous: CVPR 2025 p3 | Page 4 / 5 | Next: CVPR 2025 p5
- Venue: Computer Vision and Pattern Recognition
- Year: 2025
- Page: 4 / 5
- Papers: 91-120 / 124
Papers
Antidote: A Unified Framework for Mitigating LVLM Hallucinations in Counterfactual Presupposition and Object Perception Paper
Abstract
Large Vision-Language Models (LVLMs) have achieved impressive results across various cross-modal tasks. However, hallucinations, i.e., the models generating counterfactual responses, remain a challenge. Though recent studies have attempted to alleviate object perception hallucinations, they focus on the models’ response generation, and overlooking the task question itself. This paper discusses the vulnerability of LVLMs in solving counterfactual presupposition questions (CPQs), where the models are prone to accept the presuppositions of counterfactual objects and produce severe hallucinatory responses. To this end, we introduce "Antidote", a unified, synthetic data-driven post-training framework for mitigating both types of hallucination above. It leverages synthetic data to incorporate factual priors into questions to achieve self-correction, and decouple the mitigation process into a preference optimization problem. Furthermore, we construct "CP-Bench", a novel benchmark to evaluate LVLMs’ ability to correctly handle CPQs and produce factual responses. Applied to the LLaVA series, Antidote can simultaneously enhance performance on CP-Bench by over 50%, POPE by 1.8-3.3%, and CHAIR & SHR by 30-50%, all without relying on external supervision from stronger LVLMs or human feedback and introducing noticeable catastrophic forgetting issues.
Claim
Large Vision-Language Models (LVLMs) have achieved impressive results across various cross-modal tasks.
The Photographer’s Eye: Teaching Multimodal Large Language Models to See and Critique like Photographers Paper
Abstract
"While editing directly from life, photographers have found it too difficult to see simultaneously both the blue and the sky."John Szarkowski, William Eggleston’s Guide 1Photographer and curator, Szarkowski insightfully revealed one of the notable gaps between general and aesthetic visual understanding: while the former focuses on identifying the factual element in an image (sky), the latter transcends such object identification, viewing it instead as an aesthetic component—a pure color block (blue). Such fundamental distinctions between general (detection, localization, etc.) and aesthetic (color, lighting, composition, etc.) visual understanding present a significant challenge for Multimodal Large Language Models (MLLMs). Although some recent works have made initial explorations, they are often limited to general and basic aesthetic commonsense. As a result, they frequently fall short in real-world scenarios (Fig. 1), which require extensive expertise—including photographic techniques, photo pre/post-processing knowledge, and more, to provide a detailed analysis and description. To fundamentally enhance the aesthetics understanding of MLLMs, we first introduce a novel dataset, PhotoCritique, derived from extensive discussions among professional photographers and enthusiasts, and characterized by the large scale, expertise, and diversity. Then, to better learn visual aesthetics from PhotoCritique, we furthur propose a novel model, PhotoEye, featuring a language-guided multi-view vision fusion mechanism to understand image aesthetics from multiple perspectives. Finally, we present a novel benchmark, PhotoBench, a comprehensive and professional benchmark for aesthetic visual understanding. On existing benchmarks and PhotoBench, our model demonstrates clear advantages over existing models.
Claim
"While editing directly from life, photographers have found it too difficult to see simultaneously both the blue and the sky."John Szarkowski, William Eggleston’s Guide 1Photographer and curator, Szarkowski insightfully revealed one of the notable gaps between general and aesthetic visual understanding: while the former focuses on identifying the factual element in an image (sky), the latter transcends such object identification, viewing it instead as an aesthetic component—a pure color block (blue).
Not Only Text: Exploring Compositionality of Visual Representations in Vision-Language Models Paper
Abstract
Vision-Language Models (VLMs) learn a shared feature space for text and images, enabling the comparison of inputs of different modalities. While prior works demonstrated that VLMs organize natural language representations into regular structures encoding composite meanings, it remains unclear if compositional patterns also emerge in the visual embedding space. In this work, we investigate compositionality in the image domain, where the analysis of compositional properties is challenged by noise and sparsity of visual data. We address these problems and propose a framework, called Geodesically Decomposable Embeddings (GDE), that approximates image representations with geometry-aware compositional structures in the latent space. We demonstrate that visual embeddings of pre-trained VLMs exhibit a compositional arrangement, and evaluate the effectiveness of this property in the tasks of compositional classification and group robustness. GDE achieves stronger performance in compositional classification compared to its counterpart method that assumes linear geometry of the latent space. Notably, it is particularly effective for group robustness, where we achieve higher results than task-specific solutions. Our results indicate that VLMs can automatically develop a human-like form of compositional reasoning in the visual domain, making their underlying processes more interpretable. Code is available at https://github.com/BerasiDavide/vlm_image_compositionality.
Claim
Vision-Language Models (VLMs) learn a shared feature space for text and images, enabling the comparison of inputs of different modalities.
Taxonomy-Aware Evaluation of Vision–Language Models Paper
Abstract
When a vision–language model (VLM) is prompted to identify an entity depicted in an image, it may answer "I see a conifer," rather than the specific label NORWAY SPRUCE. This raises two issues for evaluation: Firstly, the unconstrained generated text needs to be mapped to the evaluation label space (i.e., CONIFER). Secondly, a useful classification measure should give partial credit to lessspecific, but not incorrect, answers (NORWAY SPRUCE being a type of CONIFER). To meet these requirements, we propose a framework for evaluating unconstrained text predictions such as those generated from a vision–language model against a taxonomy. Specifically, we propose the use of hierarchical precision and recall measures to assess the level of correctness and specificity of predictions with regard to a taxonomy. Experimentally, we first show that existing text similarity measures do not capture taxonomic similarity well. We then develop and compare different methods to map textual VLM predictions onto a taxonomy. This allows us to compute hierarchical similarity measures between the generated text and the ground truth labels. Finally, we analyze modern VLMs on fine-grained visual classification tasks based on our proposed taxonomic evaluation scheme. Data and code are made available at https://github.com/vesteinn/vlm-eval.
Claim
When a vision–language model (VLM) is prompted to identify an entity depicted in an image, it may answer "I see a conifer," rather than the specific label NORWAY SPRUCE.
Period-LLM: Extending the Periodic Capability of Multimodal Large Language Model Paper
Abstract
Periodic or quasi-periodic phenomena reveal intrinsic characteristics in various natural processes, such as weather patterns, movement behaviors, traffic flows, and biological signals. Given that these phenomena span multiple modalities, the capabilities of Multimodal Large Language Models (MLLMs) offer promising potential to effectively capture and understand their complex nature. However, current MLLMs struggle with periodic tasks due to limitations in: 1) lack of temporal modelling and 2) conflict between short and long periods. This paper introduces Period-LLM, a multimodal large language model designed to enhance the performance of periodic tasks across various modalities, and constructs a benchmark of various difficulty for evaluating the cross-modal periodic capabilities of large models. Specially, We adopt an "Easy to Hard Generalization" paradigm, starting with relatively simple text-based tasks and progressing to more complex visual and multimodal tasks, ensuring that the model gradually builds robust periodic reasoning capabilities. Additionally, we propose a "Resisting Logical Oblivion" optimization strategy to maintain periodic reasoning abilities during semantic alignment. Extensive experiments demonstrate the superiority of the proposed Period-LLM over existing MLLMs in periodic tasks. The code is available at https: //github.com/keke-nice/Period-LLM.
Claim
Periodic or quasi-periodic phenomena reveal intrinsic characteristics in various natural processes, such as weather patterns, movement behaviors, traffic flows, and biological signals.
PARC: A Quantitative Framework Uncovering the Symmetries within Vision Language Models Paper
Abstract
Vision language models (VLMs) respond to user-crafted text prompts and visual inputs, and are applied to numerous real-world problems. VLMs integrate visual modalities with large language models (LLMs), which are well known to be prompt-sensitive. Hence, it is crucial to determine whether VLMs inherit this instability to varying prompts. We therefore investigate which prompt variations VLMs are most sensitive to and which VLMs are most agnostic to prompt variations. To this end, we introduce PARC (Prompt Analysis via Reliability and Calibration), a VLM prompt sensitivity analysis framework built on three pillars: (1) plausible prompt variations in both the language and vision domain, (2) a novel model reliability score with built-in guarantees, and (3) a calibration step that enables dataset-and prompt-spanning prompt variation analysis. Regarding prompt variations, PARC’s evaluation shows that VLMs mirror LLM language prompt sensitivity in the vision domain, and most destructive variations change the expected answer. Regarding models, outstandingly robust VLMs among 22 evaluated models come from the InternVL2 family. We further find indications that prompt sensitivity is linked to training data. https://github.com/NVlabs/PARC
Claim
Vision language models (VLMs) respond to user-crafted text prompts and visual inputs, and are applied to numerous real-world problems.
Conformal Prediction and MLLM aided Uncertainty Quantification in Scene Graph Generation Paper
Abstract
Scene Graph Generation (SGG) aims to represent visual scenes by identifying objects and their pairwise relationships, providing a structured understanding of image content. However, inherent challenges like long-tailed class distributions and prediction variability necessitate uncertainty quantification in SGG for its practical viability. In this paper, we introduce a novel Conformal Prediction based framework, adaptive to any existing SGG method, for quantifying their predictive uncertainty by constructing well-calibrated prediction sets over their generated scene graphs. These scene graph prediction sets are designed to achieve statistically rigorous coverage guarantees under exchangeability assumptions. Additionally, to ensure the prediction sets contain the most practically interpretable scene graphs, we propose an effective MLLM-based post-processing strategy for selecting the most visually and semantically plausible scene graphs within each set. We show that our proposed approach can produce diverse possible scene graphs from an image, assess the reliability of SGG methods, and improve overall SGG performance.
Claim
Scene Graph Generation (SGG) aims to represent visual scenes by identifying objects and their pairwise relationships, providing a structured understanding of image content.
UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation Paper
Abstract
Multimodal Large Language Models (MLLMs) have emerged to tackle the challenges of Visual Question Answering (VQA), sparking a new research focus on conducting objective evaluations of these models. Existing evaluation methods face limitations due to the significant human workload required to design Q&A pairs for visual images, which inherently restricts the scale and scope of evaluations. Although automated MLLM-as-judge approaches attempt to reduce the human workload through automatic evaluations, they often introduce biases. To address these problems, we propose an Unsupervised Peer review MLLM Evaluation framework. It utilizes only image data, allowing models to automatically generate questions and conduct peer review assessments of answers from other models, effectively alleviating the reliance on human workload. Additionally, we introduce the vision-language scoring system to mitigate the bias issues, which focuses on three aspects: (i) response correctness; (ii) visual understanding and reasoning; and (iii) image-text correlation. Experimental results demonstrate that UPME achieves a Pearson correlation of 0.944 with human evaluations on the MMstar dataset and 0.814 on the ScienceQA dataset, indicating that our framework closely aligns with human-designed benchmarks and inherent human preferences.
Claim
Multimodal Large Language Models (MLLMs) have emerged to tackle the challenges of Visual Question Answering (VQA), sparking a new research focus on conducting objective evaluations of these models.
SnowMaster: Comprehensive Real-world Image Desnowing via MLLM with Multi-Model Feedback Optimization Paper
Abstract
Snowfall presents significant challenges for visual data processing, necessitating specialized desnowing algorithms. However, existing models often fail to generalize effectively due to their heavy reliance on synthetic datasets. Furthermore, current real-world snowfall datasets are limited in scale and lack dedicated evaluation metrics designed specifically for snowfall degradation, thus hindering the effective integration of real snowy images into model training to reduce domain gaps. To address these challenges, we first introduce RealSnow10K, a large-scale, high-quality dataset consisting of over 10,000 annotated real-world snowy images. In addition, we curate a preference dataset comprising 36,000 expert-ranked image pairs, enabling the adaptation of multimodal large language models (MLLMs) to better perceive snowy image quality through our innovative Multi-Model Preference Optimization (MMPO). Finally, we propose the SnowMaster, which employs MMPO-enhanced MLLM to perform accurate snowy image evaluation and pseudo-label filtering for semi-supervised training. Experiments demonstrate that SnowMaster delivers superior desnowing performance under real-world conditions.
Claim
Snowfall presents significant challenges for visual data processing, necessitating specialized desnowing algorithms.
Language-Guided Salient Object Ranking Paper
Abstract
Salient Object Ranking (SOR) aims to study human attention shifts across different objects in the scene. It is a challenging task, as it requires comprehension of the relations among the salient objects in the scene. However, existing works often overlook such relations or model them implicitly. In this work, we observe that when Large Vision-Language Models (LVLMs) describe a scene, they usually focus on the most salient object first, and then discuss the relations as they move on to the next (less salient) one. Based on this observation, we propose a novel Language-Guided Salient Object Ranking approach (named LG-SOR), which utilizes the internal knowledge within the LVLM-generated language descriptions, i.e., semantic relation cues and the implicit entity order cues, to facilitate saliency ranking. Specifically, we first propose a novel Text-Guided Visual Modulation (TGVM) module to incorporate semantic information in the description for saliency ranking. TGVM controls the flow of linguistic information to the visual features, suppresses noisy background image features, and enables the propagation of useful textual features. We then propose a novel Text-Aware Visual Reasoning (TAVR) module to enhance model reasoning in object ranking, by explicitly learning a multimodal graph based on the entity and relation cues derived from the description. Extensive experiments demonstrate superior performances of our model on two SOR benchmarks.
Claim
Salient Object Ranking (SOR) aims to study human attention shifts across different objects in the scene.
STING-BEE : Towards Vision-Language Model for Real-World X-ray Baggage Security Inspection Paper
Abstract
Advancements in Computer-Aided Screening (CAS) systems are essential for improving the detection of security threats in X-ray baggage scans. However, current datasets are limited in representing real-world, sophisticated threats and concealment tactics, and existing approaches are constrained by a closed-set paradigm with predefined labels. To address these challenges, we introduce STCray, the first multimodal X-ray baggage security dataset, comprising 46,642 image-caption paired scans across 21 threat categories, generated using an X-ray scanner for airport security. STCray is meticulously developed with our specialized protocol that ensures domain-aware, coherent captions, that lead to the multi-modal instruction following data in X-ray baggage security. This allows us to train a domain-aware visual AI assistant named STING-BEE that supports a range of vision-language tasks, including scene comprehension, referring threat localization, visual grounding, and visual question answering (VQA), establishing novel baselines for multi-modal learning in X-ray baggage security. Further, STING-BEE shows state-of-the-art generalization in cross-domain settings. Code, data, and models are available at https://divs1159.github.io/STING-BEE/.
Claim
Advancements in Computer-Aided Screening (CAS) systems are essential for improving the detection of security threats in X-ray baggage scans.
Task-Aware Clustering for Prompting Vision-Language Models Paper
Abstract
Prompt learning has attracted widespread attention in adapting vision-language models to downstream tasks. Existing methods largely rely on optimization strategies to ensure the task-awareness of learnable prompts. Due to the scarcity of task-specific data, overfitting is prone to occur. The resulting prompts often do not generalize well or exhibit limited task-awareness. To address this issue, we propose a novel Task-Aware Clustering (TAC) framework for prompting vision-language models, which increases the task-awareness of learnable prompts by introducing task-aware pre-context. The key ingredients are as follows: (a) generating task-aware pre-context based on task-aware clustering that can preserve the backbone structure of a downstream task with only a few clustering centers, (b) enhancing the task-awareness of learnable prompts by enabling them to interact with task-aware pre-context via the well-pretrained encoders, and © preventing the visual task-aware pre-context from interfering the interaction between patch embeddings by masked attention mechanism. Extensive experiments are conducted on benchmark datasets, covering the base-to-novel, domain generalization, and cross-dataset transfer settings. Ablation studies validate the effectiveness of key ingredients. Comparative results show the superiority of our TAC over competitive counterparts. The code is available at https://github.com/FushengHao/TAC.
Claim
Prompt learning has attracted widespread attention in adapting vision-language models to downstream tasks.
HalLoc: Token-level Localization of Hallucinations for Vision Language Models Paper
Abstract
Hallucinations pose a significant challenge to the reliability of large vision-language models, making their detection essential for ensuring accuracy in critical applications. Current detection methods often rely on computationally intensive models, leading to high latency and resource demands. Their definitive outcomes also fail to account for real-world scenarios where the line between hallucinated and truthful information is unclear. To address these issues, we propose HalLoc, a dataset designed for efficient, probabilistic hallucination detection. It features 150K token-level annotated samples, including hallucination types, across Visual Question Answering (VQA), instruction-following, and image captioning tasks. This dataset facilitates the development of models that detect hallucinations with graded confidence, enabling more informed user interactions. Additionally, we introduce a baseline model trained on HalLoc, offering low-overhead, concurrent hallucination detection during generation. The model can be seamlessly integrated into existing VLMs, improving reliability while preserving efficiency. The prospect of a robust plug-and-play hallucination detection module opens new avenues for enhancing the trustworthiness of vision-language models in real-world applications. The HalLoc dataset and code are publicly available at: https://github.com/dbsltm/cvpr25_halloc.
Claim
Hallucinations pose a significant challenge to the reliability of large vision-language models, making their detection essential for ensuring accuracy in critical applications.
Seeing the Abstract: Translating the Abstract Language for Vision Language Models Paper
Abstract
Natural language goes beyond dryly describing visual content. It contains rich abstract concepts to express feeling, creativity and properties that cannot be directly perceived. Yet, current research in Vision Language Models (VLMs) has not shed light on abstract-oriented language. Our research breaks new ground by uncovering its wide presence and under-estimated value, with extensive analysis. Particularly, we focus our investigation on the fashion domain, a highly-representative field with abstract expressions. By analyzing recent large-scale multimodal fashion datasets, we find that abstract terms have a dominant presence, rivaling the concrete ones, providing novel information, and being useful in the retrieval task. However, a critical challenge emerges: current general-purpose or fashion-specific VLMs are pre-trained with databases that lack sufficient abstract words in their text corpora, thus hindering their ability to effectively represent abstract-oriented language. We propose a training-free and model-agnostic method, Abstract-to-Concrete Translator (ACT), to shift abstract representations towards well-represented concrete ones in the VLM latent space, using pre-trained models and existing multi-modal databases. On the text-to-image retrieval task, despite being training-free, ACT outperforms the fine-tuned VLMs in both same- and cross-dataset settings, exhibiting its effectiveness with a strong generalization capability. Moreover, the improvement introduced by ACT is consistent with various VLMs, making it a plug-and-play solution.
Claim
Natural language goes beyond dryly describing visual content.
Notes-guided MLLM Reasoning: Enhancing MLLM with Knowledge and Visual Notes for Visual Question Answering Paper
Abstract
The knowledge-based visual question answering (KB-VQA) task involves using external knowledge about the image to assist reasoning. Building on the impressive performance of multimodal large language model (MLLM), recent methods have commenced leveraging MLLM as an implicit knowledge base for reasoning. However, the direct employment of MLLM with raw external knowledge might result in reasoning errors due to misdirected knowledge information. Additionally, MLLM may lack fine-grained perception of visual features, which can result in hallucinations during reasoning. To address these challenges, we propose Notes-guided MLLM Reasoning (NoteMR), a novel framework that guides MLLM in better reasoning by utilizing knowledge notes and visual notes. Specifically, we initially obtain explicit knowledge from an external knowledge base. Then, this explicit knowledge, combined with images, is used to assist the MLLM in generating knowledge notes. These notes are designed to filter explicit knowledge and identify relevant internal implicit knowledge within the MLLM. We then identify highly correlated regions between the images and knowledge notes, retaining them as image notes to enhance the model’s fine-grained perception, thereby mitigating MLLM induced hallucinations. Finally, both notes are fed into the MLLM, enabling a more comprehensive understanding of the image-question pair and enhancing the model’s reasoning capabilities. Our method achieves state-of-the-art performance on the OK-VQA and A-OKVQA datasets, demonstrating its robustness and effectiveness across diverse VQA scenarios.
Claim
The knowledge-based visual question answering (KB-VQA) task involves using external knowledge about the image to assist reasoning.
Dual-Granularity Semantic Guided Sparse Routing Diffusion Model for General Pansharpening Paper
Abstract
Pansharpening aims at integrating complementary information from panchromatic and multispectral images. Available deep-learning based pansharpening methods typically perform exceptionally with particular satellite datasets. At the same time, it has been observed that these models also exhibit scene dependence, for example, if the majority of the training samples come from the urban scenes, the model’s performance may decline in the river scene. To address the domain gap produced by varying satellite sensors and distinct scenes, we propose a dual-granularity semantic guided sparse routing diffusion model for general pansharpening. By utilizing the large Vision-Language Models (VLMs) in the field of geoscience, e.g, GeoChat, we introduce the dual granularity semantics to generate dynamic sparse routing scores for adaptation of different satellite sensors and scenes. This scene-level and region-level dual-granularity semantic information serves as guidance for dynamically activating specialized experts within the diffusion model. Extensive experiments on WorldView-3, QuickBird, and GaoFen-2 datasets show the effectiveness of our proposed method. Notably, the proposed method outperforms the comparison approaches in adapting to new satellite sensors and scenes. The codes are available at https://github.com/codgodtao/SGDiff.
Claim
Pansharpening aims at integrating complementary information from panchromatic and multispectral images.
Non-Natural Image Understanding with Advancing Frequency-based Vision Encoders Paper
Abstract
Large language models (LLMs) have significantly enhanced cross-modal understanding capabilities by integrating visual encoders with textual embeddings, giving rise to multimodal large language models (MLLMs). However, these models struggle with non-natural images such as geometric and charts, particularly in fields like education and finance. Despite efforts to collect datasets and fine-tune the MLLMs, the gap with natural image understanding is still evident, and the cost of collecting large and diverse non-natural image datasets is high. To address this, we analyzed the limitations of transformer-based vision encoders(ViT) within existing MLLMs from a frequency perspective. Studies have shown that ViT models are less effective at capturing high-frequency information, impairing their ability to capture elements like points, lines, and angles in non-natural images. In response, we introduced FM-ViT, a frequency-modulated vision encoder that utilizes Fourier decomposition to extract high and low frequency components from self-attention features and re-weight them during tuning to non-natural images. In addition, we combine the features of CNN models with FM-ViT and propose EDGE, an MLLM with enhanced graphical encoders tailored for understanding non-natural images. Extensive experiments have confirmed the effectiveness of our FM-ViT and EDGE in 4 types. Project page.
Claim
Large language models (LLMs) have significantly enhanced cross-modal understanding capabilities by integrating visual encoders with textual embeddings, giving rise to multimodal large language models (MLLMs).
Galaxy Walker: Geometry-aware VLMs For Galaxy-scale Understanding Paper
Abstract
Modern vision-language models (VLMs) develop patch embedding and convolution backbone within vector space, especially Euclidean ones, at the very founding. When expanding VLMs to a galaxy scale for understanding astronomical phenomena, the integration of spherical space for planetary orbits and hyperbolic spaces for black holes raises two formidable challenges. a) The current pretraining model is confined to Euclidean space rather than a comprehensive geometric embedding. b) The predominant architecture lacks suitable backbones for anisotropic physical geometries. In this paper, we introduced Galaxy-Walker, a geometry-aware VLM, for the universe-level vision understanding tasks. We proposed the geometry prompt that generates geometry tokens by random walks across diverse spaces on a multi-scale physical graph, along with a geometry adapter that compresses and reshapes the space anisotropy in a mixture-of-experts manner. Extensive experiments demonstrate the effectiveness of our approach, with Galaxy-Walker achieving state-of-the-art performance in both galaxy property estimation (R2 scores up to 0.91 ) and morphology classification tasks (up to +0.17 F1 improvement in challenging features), significantly outperforming both domain-specific models and general-purpose VLMs.
Claim
Modern vision-language models (VLMs) develop patch embedding and convolution backbone within vector space, especially Euclidean ones, at the very founding.
Ground-V: Teaching VLMs to Ground Complex Instructions in Pixels Paper
Abstract
This work presents a simple yet effective workflow for automatically scaling instruction-following data to elicit pixel-level grounding capabilities of VLMs under complex instructions. In particular, we address five critical real-world challenges in text-instruction-based grounding: hallucinated references, multi-object scenarios, reasoning, multi-granularity, and part-level references. By leveraging knowledge distillation from a pre-trained teacher model, our approach generates high-quality instruction-response pairs linked to existing pixel-level annotations, minimizing the need for costly human annotation. The resulting dataset, Ground-V, captures rich object localization knowledge and nuanced pixel-level referring expressions. Experiment results show that models trained on Ground-V exhibit substantial improvements across diverse grounding tasks. Specifically, incorporating Ground-V during training directly achieve an average accuracy boost of 4.4% for LISA and a 7.9% for PSALM across six benchmarks on the gIoU metric. It also sets new state-of-the-art results on standard benchmarks such as RefCOCO/+/g. Notably, on gRefCOCO, we achieve an N-Acc of 83.3%, exceeding the previous state-of-the-art by more than 20%.
Claim
This work presents a simple yet effective workflow for automatically scaling instruction-following data to elicit pixel-level grounding capabilities of VLMs under complex instructions.
Reproducible Vision-Language Models Meet Concepts Out of Pre-Training Paper
Abstract
Contrastive Language-Image Pre-training (CLIP) models as a milestone of modern multimodal intelligence, its gener-alization mechanism grasped massive research interests in the community. While existing studies limited in the scope of pre-training knowledge, hardly underpinned its generalization to countless open-world concepts absent from the pre-training regime. This paper dives into such Out-of-Pre-training (OOP) generalization problem from a holistic perspective. We propose LAION-Beyond benchmark to isolate the evaluation of OOP concepts from pre-training knowledge, with regards to OpenCLIP and its reproducible variants derived from LAION datasets. Empirical analysis evidences that despite image features of OOP concepts born with significant category margins, their zero-shot transfer significantly fails due to the poor image-text alignment. To this, we elaborate the "name-tuning" methodology with its theoretical merits in terms of OOP generalization, then propose few-shot name learning (FSNL) and zero-shot name learning (ZSNL) algorithms to achieve OOP generalization in a data-efficient manner. LAION-Beyond dataset and codes: http://m-huangx.github.io/laion_beyond/.
Claim
Contrastive Language-Image Pre-training (CLIP) models as a milestone of modern multimodal intelligence, its gener-alization mechanism grasped massive research interests in the community.
SPARC: Score Prompting and Adaptive Fusion for Zero-Shot Multi-Label Recognition in Vision-Language Models Paper
Abstract
Zero-shot multi-label recognition (MLR) with Vision-Language Models (VLMs) faces significant challenges without training data, model tuning, or architectural modifications. Existing approaches require prompt tuning or architectural adaptations, limiting zero-shot applicability. Our work proposes a novel solution treating VLMs as black boxes, leveraging scores without training data or ground truth. We make two contributions. First, we find that VLM scores suffer from image- and prompt-specific biases, and that simple standardization is surprisingly effective at removing these and boosting MLR performance. And second, we introduce compound prompts grounded in realistic object combinations. Our analysis reveals "AND"/"OR" signal ambiguities that cause maximum compound scores to be surprisingly suboptimal compared to second-highest scores. We introduce an adaptive fusion method to address this issue. Our method enhances other zero-shot approaches, consistently improving their results. Experiments show superior mean Average Precision (mAP) compared to methods requiring training data, achieved through refined object ranking for robust zero-shot MLR. Code can be found at https://github.com/kjmillerCURIS/SPARC.
Claim
Zero-shot multi-label recognition (MLR) with Vision-Language Models (VLMs) faces significant challenges without training data, model tuning, or architectural modifications.
Vision-Language Model IP Protection via Prompt-based Learning Paper
Abstract
Vision-language models (VLMs) like CLIP (Contrastive Language-Image Pre-Training) have seen remarkable success in visual recognition, highlighting the increasing need to safeguard the intellectual property (IP) of well-trained models. Effective IP protection extends beyond ensuring authorized usage; it also necessitates restricting model deployment to authorized data domains, particularly when the model is fine-tuned for specific target domains. However, current IP protection methods often rely solely on the visual backbone, which may lack sufficient semantic richness. To bridge this gap, we introduce IP-CLIP, a lightweight IP protection strategy tailored to CLIP, employing a prompt-based learning approach. By leveraging the frozen visual backbone of CLIP, we extract both image style and content information, incorporating them into the learning of IP prompt. This strategy acts as a robust barrier, effectively preventing the unauthorized transfer of features from authorized domains to unauthorized ones. Additionally, we propose a style-enhancement branch that constructs feature banks for both authorized and unauthorized domains. This branch integrates self-enhanced and cross-domain features, further strengthening IP-CLIP’s capability to block features from unauthorized domains. Finally, we present new three metrics designed to better balance the performance degradation of authorized and unauthorized domains. Comprehensive experiments in various scenarios demonstrate its promising potential for application in IP protection tasks for VLMs.
Claim
Vision-language models (VLMs) like CLIP (Contrastive Language-Image Pre-Training) have seen remarkable success in visual recognition, highlighting the increasing need to safeguard the intellectual property (IP) of well-trained models.
Provoking Multi-modal Few-Shot LVLM via Exploration-Exploitation In-Context Learning Paper
Abstract
In-context learning (ICL), a predominant trend in instruction learning, aims at enhancing the performance of large language models by providing clear task guidance and examples, improving their capability in task understanding and execution. This paper investigates ICL on Large Vision-Language Models (LVLMs) and explores the policies of multi-modal demonstration selection. Existing research efforts in ICL face significant challenges: First, they rely on pre-defined demonstrations or heuristic selecting strategies based on human intuition, which are usually inadequate for covering diverse task requirements, leading to sub-optimal solutions; Second, individually selecting each demonstration fails in modeling the interactions between them, resulting in information redundancy. Unlike these prevailing efforts, we propose a new exploration-exploitation reinforcement learning framework, which explores policies to fuse multi-modal information and adaptively select adequate demonstrations as an integrated whole. The framework allows LVLMs to optimize themselves by continually refining their demonstrations through self-exploration, enabling the ability to autonomously identify and generate the most effective selection policies for in-context learning. Experimental results verify the superior performance of our approach on four Visual Question-Answering (VQA) datasets, demonstrating its effectiveness in enhancing the generalization capability of few-shot LVLMs.
Claim
In-context learning (ICL), a predominant trend in instruction learning, aims at enhancing the performance of large language models by providing clear task guidance and examples, improving their capability in task understanding and execution.
Overcoming Shortcut Problem in VLM for Robust Out-of-Distribution Detection Paper
Abstract
Vision-language models (VLMs), such as CLIP, have shown remarkable capabilities in downstream tasks. However, the coupling of semantic information between the foreground and the background in images leads to significant shortcut issues that adversely affect out-of-distribution (OOD) detection abilities. When confronted with a background OOD sample, VLMs are prone to misidentifying it as in-distribution (ID) data. In this paper, we analyze the OOD problem from the perspective of shortcuts in VLMs and propose OSPCoOp which includes background decoupling and mask-guided region regularization. We first decouple images into ID-relevant and ID-irrelevant regions and utilize the latter to generate a large number of augmented OOD background samples as pseudo-OOD supervision. We then use the masks from background decoupling to adjust the model’s attention, minimizing its focus on ID-irrelevant regions. To assess the model’s robustness against background interference, we introduce a new OOD evaluation dataset, ImageNet-Bg, which solely consists of background images with all ID-relevant regions removed. Our method demonstrates exceptional performance in few-shot scenarios, achieving strong results even in one-shot setting, and outperforms existing methods. The code and proposed ImageNet-Bg are available at https://github.com/HAIV-Lab/OSPCoOp_Imagenet-bg.
Claim
Vision-language models (VLMs), such as CLIP, have shown remarkable capabilities in downstream tasks.
Libra-Merging: Importance-Redundancy and Pruning-Merging Trade-Off for Acceleration Plug-In in Large Vision-Language Model Paper
Abstract
Large Vision-Language Models (LVLMs) have achieved significant progress in recent years. However, the expensive inference cost limits the realistic deployment of LVLMs. Some works find that visual tokens are redundant and compress tokens to reduce the inference cost. These works identify important non-redundant tokens as target tokens, then prune the remaining tokens (non-target tokens) or merge them into target tokens. However, target token identification faces the token importance-redundancy dilemma. Besides, token merging and pruning face a dilemma between disrupting target token information and losing non-target token information. To solve these problems, we propose a novel visual token compression scheme, named Libra-Merging. In target token identification, Libra-Merging selects the most important tokens from spatially discrete intervals, achieving a more robust token importance-redundancy trade-off than relying on a hyper-parameter. In token compression, when non-target tokens are dissimilar to target tokens, Libra-Merging does not merge them into the target tokens, thus avoiding disrupting target token information. Meanwhile, Libra-Merging condenses these non-target tokens into an information compensation token to prevent losing important non-target token information. Our method can serve as a plug-in for diverse LVLMs, and extensive experimental results demonstrate its effectiveness. The code will be publicly available at https://github.com/longrongyang/Libra-Merging.
Claim
Large Vision-Language Models (LVLMs) have achieved significant progress in recent years.
Adaptive Parameter Selection for Tuning Vision-Language Models Paper
Abstract
Vision-language models (VLMs) like CLIP have been widely used in various specific tasks. Parameter-efficient fine-tuning (PEFT) methods, such as prompt and adapter tuning, have become key techniques for adapting these models to specific domains. However, existing approaches rely on prior knowledge to manually identify the locations requiring fine-tuning. Adaptively selecting which parameters in VLMs should be tuned remains unexplored. In this paper, we propose CLIP with Adaptive Selective Tuning (CLIP-AST), which can be used to automatically select critical parameters in VLMs for fine-tuning for specific tasks. It opportunely leverages the adaptive learning rate in the optimizer and improves model performance without extra parameter overhead. We conduct extensive experiments on 13 benchmarks, such as ImageNet, Food101, Flowers102, etc, with different settings, including few-shot learning, base-to-novel class generalization, and out-of-distribution. The results show that CLIP-AST consistently outperforms the original CLIP model as well as its variants and achieves state-of-the-art (SOTA) performance in all cases. For example, with the 16-shot learning, CLIP-AST surpasses GraphAdapter and PromptSRC by 3.56% and 2.20% in average accuracy on 11 datasets, respectively. Code will be publicly available.
Claim
Vision-language models (VLMs) like CLIP have been widely used in various specific tasks.
CoSpace: Benchmarking Continuous Space Perception Ability for Vision-Language Models Paper
Abstract
Vision-Language Models (VLMs) have recently witnessed significant progress in visual comprehension. As the permit-ting length of image context grows, VLMs can now comprehend a broader range of views and spaces. Current bench-marks provide insightful analysis of VLMs in tasks involving complex visual instructions following, multi-image understanding and spatial reasoning. However, they usually focus on spatially irrelevant images or discrete images captured from varied viewpoints. The compositional characteristic of images captured from a static viewpoint remains underestimated. We term this characteristic as Continuous Space Perception. When observing a scene from a static viewpoint while shifting orientations, it produces a series of spatially continuous images, enabling the reconstruction of the entire space. In this paper, we present CoSpace, a multi-image visual understanding benchmark designed to assess the Continuous Space perception ability for VLMs. CoSpace contains 2,918 images and 1,626 question-answer pairs, covering seven types of tasks. We conduct evaluation across 19 proprietary and open-source VLMs. Results re-veal that there exist pitfalls on the continuous space perception ability for most of the evaluated models, including proprietary ones. Interestingly, we find that the main discrepancy between open-source and proprietary models lies not in accuracy but in the consistency of responses. We believe that enhancing the ability of continuous space perception is essential for VLMs to perform effectively in real-world tasks and encourage further research to advance this capability.
Claim
Vision-Language Models (VLMs) have recently witnessed significant progress in visual comprehension.
COUNTS: Benchmarking Object Detectors and Multimodal Large Language Models under Distribution Shifts Paper
Abstract
Current object detectors often suffer significant performance degradation in real-world applications when encountering distributional shifts. Consequently, the out-of-distribution (OOD) generalization capability of object detectors has garnered increasing attention from researchers. Despite this growing interest, there remains a lack of a large-scale, comprehensive dataset and evaluation benchmark with fine-grained annotations tailored to assess the OOD generalization on more intricate tasks like object detection and grounding. To address this gap, we introduce COUNTS, a large-scale OOD dataset with object-level annotations. COUNTS encompasses 14 natural distributional shifts, over 222K samples, and more than 1,196K labeled bounding boxes. Leveraging COUNTS, we introduce two novel benchmarks: O(OD)1 and OODG. O(OD)1 is designed to comprehensively evaluate the OOD generalization capabilities of object detectors by utilizing controlled distribution shifts between training and testing data. OODG, on the other hand, aims to assess the OOD generalization of grounding abilities in multimodal large language models (MLLMs). Our findings reveal that, while large models and extensive pre-training data substantially enhance performance in in-distribution (IID) scenarios, significant limitations and opportunities for improvement persist in OOD contexts for both object detectors and MLLMs. In visual grounding tasks, even the advanced GPT-4o and Gemini-1.5 only achieve 56.7% and 28.0% accuracy, respectively. We hope COUNTS facilitates advancements in the development and assessment of robust object detectors and MLLMs capable of maintaining high performance under distributional shifts.
Claim
Current object detectors often suffer significant performance degradation in real-world applications when encountering distributional shifts.
Chain of Semantics Programming in 3D Gaussian Splatting Representation for 3D Vision Grounding Paper
Abstract
3D Vision Grounding (3DVG) is a fundamental research area that enables agents to perceive and interact with the 3D world. The challenge of the 3DVG task lies in understanding fine-grained semantics and spatial relationships within both the utterance and 3D scene. To address this challenge, we propose a zero-shot neuro-symbolic framework that utilizes a large language model (LLM) as neuro-symbolic functions to ground the object within the 3D Gaussian Splatting (3DGS) representation. By utilizing 3DGS representation, we can dynamically render high-quality 2D images from various viewpoints to enrich the semantic information. Given the complexity of spatial relationships, we construct a relationship graph and chain of semantics that decouple spatial relationships and facilitate step-by-step reasoning within 3DGS representation. Additionally, we employ a grounded-aware self-check mechanism to enable the LLM to reflect on its responses and mitigate the effects of ambiguity in spatial reasoning. We evaluate our method using two publicly available datasets, Nr3D and Sr3D, achieving accuracies of 60.8% and 91.4%, respectively. Notably, our method surpasses current state-of-the-art zero-shot methods on the Nr3D dataset. In addition, it outperforms the recent supervised models on the Sr3D dataset.
Claim
3D Vision Grounding (3DVG) is a fundamental research area that enables agents to perceive and interact with the 3D world.
SLADE: Shielding against Dual Exploits in Large Vision-Language Models Paper
Abstract
Large Vision-Language Models (LVLMs) have emerged as transformative tools in multimodal tasks, seamlessly integrating pretrained vision encoders to align visual and textual modalities. Prior works have highlighted the susceptibility of LVLMs to dual exploits (gradient-based and optimization-based jailbreak attacks), which leverage the expanded attack surface introduced by the image modality. Despite advancements in enhancing robustness, existing methods fall short in their ability to defend against dual exploits while preserving fine-grained semantic details and overall semantic coherence under intense adversarial perturbations. To bridge this gap, we introduce SheiLding against Dual Exploits (SLADE), a novel unsupervised adversarial fine-tuning scheme that enhances the resilience of CLIP-based vision encoders. SLADE’s dual-level contrastive learning approach balances the granular and the holistic, capturing fine-grained image details without losing sight of high-level semantic coherence. Extensive experiments demonstrate that SLADE-equipped LVLMs set a new benchmark for robustness against dual exploits while preserving fine-grained semantic details of perturbed images. Notably, SLADE achieves these results without compromising the core functionalities of LVLMs, such as instruction following, or requiring the computational overhead (e.g., large batch sizes, momentum encoders) commonly associated with traditional contrastive learning methods.
Claim
Large Vision-Language Models (LVLMs) have emerged as transformative tools in multimodal tasks, seamlessly integrating pretrained vision encoders to align visual and textual modalities.
Overview | Previous: CVPR 2025 p3 | Page 4 / 5 | Next: CVPR 2025 p5