ML + Vision Top-6 Agent Survey - ICLR 2024 - Page 2 of 5

  • Venue: International Conference on Learning Representations
  • Year: 2024
  • Page: 2 / 5
  • Papers: 31-60 / 124
Self-Introspective Decoding: Alleviating Hallucinations for Large Vision-Language Models Paper
  • Authors: Fushuo Huo, Wenchao Xu, Zhong Zhang, Haozhao Wang, Zhicheng Chen, Peilin Zhao
  • Year: 2024
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2408.02032
  • Citations: 90
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, large vision language models, lvlms).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

While Large Vision-Language Models (LVLMs) have rapidly advanced in recent years, the prevalent issue known as the `hallucination' problem has emerged as a significant bottleneck, hindering their real-world deployments. Existing methods mitigate this issue mainly from two perspectives: One approach leverages extra knowledge like robust instruction tuning LVLMs with curated datasets or employing auxiliary analysis networks, which inevitable incur additional costs. Another approach, known as contrastive decoding, induces hallucinations by manually disturbing the vision or instruction raw inputs and mitigates them by contrasting the outputs of the disturbed and original LVLMs. However, these approaches rely on empirical holistic input disturbances and double the inference cost. To avoid these issues, we propose a simple yet effective method named Self-Introspective Decoding (SID). Our empirical investigation reveals that pretrained LVLMs can introspectively assess the importance of vision tokens based on preceding vision and text (both instruction and generated) tokens. We develop the Context and Text-aware Token Selection (CT2S) strategy, which preserves only unimportant vision tokens after early layers of LVLMs to adaptively amplify text-informed hallucination during the auto-regressive decoding. This approach ensures that multimodal knowledge absorbed in the early layers induces multimodal contextual rather than aimless hallucinations. Subsequently, the original token logits subtract the amplified vision-and-text association hallucinations, guiding LVLMs decoding faithfully. Extensive experiments illustrate SID generates less-hallucination and higher-quality texts across various metrics, without extra knowledge and much additional computation burdens.

Claim

While Large Vision-Language Models (LVLMs) have rapidly advanced in recent years, the prevalent issue known as the `hallucination' problem has emerged as a significant bottleneck, hindering their real-world deployments.

TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning Paper
  • Authors: Xiangyun Zeng, Kunchang Li, Chenting Wang, Xinhao Li, Tianxiang Jiang, Ziang Yan, Songze Li, Yansong Shi, Zhengrong Yue, Yi Wang, Yali Wang, Yu Qiao, et al.
  • Year: 2024
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2410.19702
  • Citations: 89
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: mllm, mllms, multimodal large language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated impressive performance in short video understanding. However, understanding long-form videos still remains challenging for MLLMs. This paper proposes TimeSuite, a collection of new designs to adapt the existing short-form video MLLMs for long video understanding, including a simple yet efficient framework to process long video sequence, a high-quality video dataset for grounded tuning of MLLMs, and a carefully-designed instruction tuning task to explicitly incorporate the grounding supervision in the traditional QA format. Specifically, based on VideoChat, we propose our long-video MLLM, coined as VideoChat-T, by implementing a token shuffling to compress long video tokens and introducing Temporal Adaptive Position Encoding (TAPE) to enhance the temporal awareness of visual representation. Meanwhile, we introduce the TimePro, a comprehensive grounding-centric instruction tuning dataset composed of 9 tasks and 349k high-quality grounded annotations. Notably, we design a new instruction tuning task type, called Temporal Grounded Caption, to peform detailed video descriptions with the corresponding time stamps prediction. This explicit temporal location prediction will guide MLLM to correctly attend on the visual content when generating description, and thus reduce the hallucination risk caused by the LLMs. Experimental results demonstrate that our TimeSuite provides a successful solution to enhance the long video understanding capability of short-form MLLM, achieving improvement of 5.6% and 6.8% on the benchmarks of Egoschema and VideoMME, respectively. In addition, VideoChat-T exhibits robust zero-shot temporal grounding capabilities, significantly outperforming the existing state-of-the-art MLLMs. After fine-tuning, it performs on par with the traditional supervised expert models.

Claim

Multimodal Large Language Models (MLLMs) have demonstrated impressive performance in short video understanding.

Towards Interpreting Visual Information Processing in Vision-Language Models Paper
  • Authors: Clement Neo, Luke Ong, Philip Torr, Mor Geva, David Krueger, Fazl Barez
  • Year: 2024
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2410.07149
  • Citations: 88
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, vlms).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Vision-Language Models (VLMs) are powerful tools for processing and understanding text and images. We study the processing of visual tokens in the language model component of LLaVA, a prominent VLM. Our approach focuses on analyzing the localization of object information, the evolution of visual token representations across layers, and the mechanism of integrating visual information for predictions. Through ablation studies, we demonstrated that object identification accuracy drops by over 70% when object-specific tokens are removed. We observed that visual token representations become increasingly interpretable in the vocabulary space across layers, suggesting an alignment with textual tokens corresponding to image content. Finally, we found that the model extracts object information from these refined representations at the last token position for prediction, mirroring the process in text-only language models for factual association tasks. These findings provide crucial insights into how VLMs process and integrate visual information, bridging the gap between our understanding of language and vision models, and paving the way for more interpretable and controllable multimodal systems.

Claim

Vision-Language Models (VLMs) are powerful tools for processing and understanding text and images.

C-TPT: Calibrated Test-Time Prompt Tuning for Vision-Language Models via Text Feature Dispersion Paper
  • Authors: Hee Suk Yoon, Eunseop Yoon, Joshua Tian Jin Tee, M. Hasegawa-Johnson, Yingzhen Li, C. Yoo
  • Year: 2024
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2403.14119
  • Citations: 87
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

In deep learning, test-time adaptation has gained attention as a method for model fine-tuning without the need for labeled data. A prime exemplification is the recently proposed test-time prompt tuning for large-scale vision-language models such as CLIP. Unfortunately, these prompts have been mainly developed to improve accuracy, overlooking the importance of calibration, which is a crucial aspect for quantifying prediction uncertainty. However, traditional calibration methods rely on substantial amounts of labeled data, making them impractical for test-time scenarios. To this end, this paper explores calibration during test-time prompt tuning by leveraging the inherent properties of CLIP. Through a series of observations, we find that the prompt choice significantly affects the calibration in CLIP, where the prompts leading to higher text feature dispersion result in better-calibrated predictions. Introducing the Average Text Feature Dispersion (ATFD), we establish its relationship with calibration error and present a novel method, Calibrated Test-time Prompt Tuning (C-TPT), for optimizing prompts during test-time with enhanced calibration. Through extensive experiments on different CLIP architectures and datasets, we show that C-TPT can effectively improve the calibration of test-time prompt tuning without needing labeled data. The code is publicly accessible at https://github.com/hee-suk-yoon/C-TPT.

Claim

In deep learning, test-time adaptation has gained attention as a method for model fine-tuning without the need for labeled data.

RepoGraph: Enhancing AI Software Engineering with Repository-level Code Graph Paper
  • Authors: Siru Ouyang, Wenhao Yu, Kaixin Ma, Zi-Qiang Xiao, Zhihan Zhang, Mengzhao Jia, Jiawei Han, Hongming Zhang, Dong Yu
  • Year: 2024
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2410.14684
  • Citations: 84
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: program synthesis (matched: code generation, software engineering).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Large Language Models (LLMs) excel in code generation yet struggle with modern AI software engineering tasks. Unlike traditional function-level or file-level coding tasks, AI software engineering requires not only basic coding proficiency but also advanced skills in managing and interacting with code repositories. However, existing methods often overlook the need for repository-level code understanding, which is crucial for accurately grasping the broader context and developing effective solutions. On this basis, we present RepoGraph, a plug-in module that manages a repository-level structure for modern AI software engineering solutions. RepoGraph offers the desired guidance and serves as a repository-wide navigation for AI software engineers. We evaluate RepoGraph on the SWE-bench by plugging it into four different methods of two lines of approaches, where RepoGraph substantially boosts the performance of all systems, leading to a new state-of-the-art among open-source frameworks. Our analyses also demonstrate the extensibility and flexibility of RepoGraph by testing on another repo-level coding benchmark, CrossCodeEval. Our code is available at https://github.com/ozyyshr/RepoGraph.

Claim

Large Language Models (LLMs) excel in code generation yet struggle with modern AI software engineering tasks.

MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation Paper
  • Authors: Chenxi Wang, Xiang Chen, Ningyu Zhang, Bo Tian, Haoming Xu, Shumin Deng, Huajun Chen
  • Year: 2024
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2410.11779
  • Citations: 82
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: mllm, mllms, multimodal large language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Multimodal Large Language Models (MLLMs) frequently exhibit hallucination phenomena, but the underlying reasons remain poorly understood. In this paper, we present an empirical analysis and find that, although MLLMs incorrectly generate the objects in the final output, they are actually able to recognize visual objects in the preceding layers. We speculate that this may be due to the strong knowledge priors of the language model suppressing the visual information, leading to hallucinations. Motivated by this, we propose a novel dynamic correction decoding method for MLLMs DeCo, which adaptively selects the appropriate preceding layers and proportionally integrates knowledge into the final layer to adjust the output logits. Note that DeCo is model agnostic and can be seamlessly incorporated with various classic decoding strategies and applied to different MLLMs. We evaluate DeCo on widely-used benchmarks, demonstrating that it can reduce hallucination rates by a large margin compared to baselines, highlighting its potential to mitigate hallucinations. Code is available at https://github.com/zjunlp/DeCo.

Claim

Multimodal Large Language Models (MLLMs) frequently exhibit hallucination phenomena, but the underlying reasons remain poorly understood.

Inducing High Energy-Latency of Large Vision-Language Models with Verbose Images Paper
  • Authors: Kuofeng Gao, Yang Bai, Jindong Gu, Shu-Tao Xia, Philip H. S. Torr, Zhifeng Li, Wei Liu
  • Year: 2024
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2401.11170
  • Citations: 80
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, large vision language models, vlms).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Large vision-language models (VLMs) such as GPT-4 have achieved exceptional performance across various multi-modal tasks. However, the deployment of VLMs necessitates substantial energy consumption and computational resources. Once attackers maliciously induce high energy consumption and latency time (energy-latency cost) during inference of VLMs, it will exhaust computational resources. In this paper, we explore this attack surface about availability of VLMs and aim to induce high energy-latency cost during inference of VLMs. We find that high energy-latency cost during inference of VLMs can be manipulated by maximizing the length of generated sequences. To this end, we propose verbose images, with the goal of crafting an imperceptible perturbation to induce VLMs to generate long sentences during inference. Concretely, we design three loss objectives. First, a loss is proposed to delay the occurrence of end-of-sequence (EOS) token, where EOS token is a signal for VLMs to stop generating further tokens. Moreover, an uncertainty loss and a token diversity loss are proposed to increase the uncertainty over each generated token and the diversity among all tokens of the whole generated sequence, respectively, which can break output dependency at token-level and sequence-level. Furthermore, a temporal weight adjustment algorithm is proposed, which can effectively balance these losses. Extensive experiments demonstrate that our verbose images can increase the length of generated sequences by 7.87 times and 8.56 times compared to original images on MS-COCO and ImageNet datasets, which presents potential challenges for various applications. Our code is available at https://github.com/KuofengGao/Verbose_Images.

Claim

Large vision-language models (VLMs) such as GPT-4 have achieved exceptional performance across various multi-modal tasks.

Negative Label Guided OOD Detection with Pretrained Vision-Language Models Paper
  • Authors: Xue Jiang, Feng Liu, Zhengfeng Fang, Hong Chen, Tongliang Liu, Feng Zheng, Bo Han
  • Year: 2024
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2403.20078
  • Citations: 80
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, vlms).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Out-of-distribution (OOD) detection aims at identifying samples from unknown classes, playing a crucial role in trustworthy models against errors on unexpected inputs. Extensive research has been dedicated to exploring OOD detection in the vision modality. Vision-language models (VLMs) can leverage both textual and visual information for various multi-modal applications, whereas few OOD detection methods take into account information from the text modality. In this paper, we propose a novel post hoc OOD detection method, called NegLabel, which takes a vast number of negative labels from extensive corpus databases. We design a novel scheme for the OOD score collaborated with negative labels. Theoretical analysis helps to understand the mechanism of negative labels. Extensive experiments demonstrate that our method NegLabel achieves state-of-the-art performance on various OOD detection benchmarks and generalizes well on multiple VLM architectures. Furthermore, our method NegLabel exhibits remarkable robustness against diverse domain shifts. The codes are available at https://github.com/tmlr-group/NegLabel.

Claim

Out-of-distribution (OOD) detection aims at identifying samples from unknown classes, playing a crucial role in trustworthy models against errors on unexpected inputs.

Vision Language Models are In-Context Value Learners Paper
  • Authors: Y. Ma, Joey Hejna, Ayzaan Wahid, Chuyuan Fu, Dhruv Shah, Jacky Liang, Zhuo Xu, Sean Kirmani, Peng Xu, Danny Driess, Ted Xiao, Jonathan Tompson, et al.
  • Year: 2024
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2411.04549
  • Citations: 74
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, vlms).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Predicting temporal progress from visual trajectories is important for intelligent robots that can learn, adapt, and improve. However, learning such progress estimator, or temporal value function, across different tasks and domains requires both a large amount of diverse data and methods which can scale and generalize. To address these challenges, we present Generative Value Learning (\GVL), a universal value function estimator that leverages the world knowledge embedded in vision-language models (VLMs) to predict task progress. Naively asking a VLM to predict values for a video sequence performs poorly due to the strong temporal correlation between successive frames. Instead, GVL poses value estimation as a temporal ordering problem over shuffled video frames; this seemingly more challenging task encourages VLMs to more fully exploit their underlying semantic and temporal grounding capabilities to differentiate frames based on their perceived task progress, consequently producing significantly better value predictions. Without any robot or task specific training, GVL can in-context zero-shot and few-shot predict effective values for more than 300 distinct real-world tasks across diverse robot platforms, including challenging bimanual manipulation tasks. Furthermore, we demonstrate that GVL permits flexible multi-modal in-context learning via examples from heterogeneous tasks and embodiments, such as human videos. The generality of GVL enables various downstream applications pertinent to visuomotor policy learning, including dataset filtering, success detection, and advantage-weighted regression -- all without any model training or finetuning.

Claim

Predicting temporal progress from visual trajectories is important for intelligent robots that can learn, adapt, and improve.

Proactive Agent: Shifting LLM Agents from Reactive Responses to Active Assistance Paper
  • Authors: Ya-Ting Lu, Shenzhi Yang, Cheng Qian, Gui-Fang Chen, Qinyu Luo, Yesai Wu, Huadong Wang, X. Cong, Zhong Zhang, Yankai Lin, Weiwen Liu, Yasheng Wang, et al.
  • Year: 2024
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2410.12361
  • Citations: 72
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: LLM agents (matched: llm agents).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Agents powered by large language models have shown remarkable abilities in solving complex tasks. However, most agent systems remain reactive, limiting their effectiveness in scenarios requiring foresight and autonomous decision-making. In this paper, we tackle the challenge of developing proactive agents capable of anticipating and initiating tasks without explicit human instructions. We propose a novel data-driven approach for this problem. Firstly, we collect real-world human activities to generate proactive task predictions. These predictions are then labeled by human annotators as either accepted or rejected. The labeled data is used to train a reward model that simulates human judgment and serves as an automatic evaluator of the proactiveness of LLM agents. Building on this, we develop a comprehensive data generation pipeline to create a diverse dataset, ProactiveBench, containing 6,790 events. Finally, we demonstrate that fine-tuning models with the proposed ProactiveBench can significantly elicit the proactiveness of LLM agents. Experimental results show that our fine-tuned model achieves an F1-Score of 66.47% in proactively offering assistance, outperforming all open-source and close-source models. These results highlight the potential of our method in creating more proactive and effective agent systems, paving the way for future advancements in human-agent collaboration.

Claim

Agents powered by large language models have shown remarkable abilities in solving complex tasks.

Towards Semantic Equivalence of Tokenization in Multimodal LLM Paper
  • Authors: Shengqiong Wu, Hao Fei, Xiangtai Li, Jiayi Ji, Hanwang Zhang, Tat-Seng Chua, Shuicheng Yan
  • Year: 2024
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2406.05127
  • Citations: 64
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: mllm, mllms, multimodal large language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in processing vision-language tasks. One of the crux of MLLMs lies in vision tokenization, which involves efficiently transforming input visual signals into feature representations that are most beneficial for LLMs. However, existing vision tokenizers, essential for semantic alignment between vision and language, remain problematic. Existing methods aggressively fragment visual input, corrupting the visual semantic integrity. To address this, this paper proposes a novel dynamic Semantic-Equivalent Vision Tokenizer (SeTok), which groups visual features into semantic units via a dynamic clustering algorithm, flexibly determining the number of tokens based on image complexity. The resulting vision tokens effectively preserve semantic integrity and capture both low-frequency and high-frequency visual features. The proposed MLLM (Setokim) equipped with SeTok significantly demonstrates superior performance across various tasks, as evidenced by our experimental results. The project page is at https://chocowu.github.io/SeTok-web/.

Claim

Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in processing vision-language tasks.

LLaRA: Supercharging Robot Learning Data for Vision-Language Policy Paper
  • Authors: Xiang Li, Cristina Mata, Jong Sung Park, Kumara Kahatapitiya, Y. Jang, Jinghuan Shang, Kanchana Ranasinghe, R. Burgert, Mu Cai, Yong Jae Lee, M. Ryoo
  • Year: 2024
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2406.20095
  • Citations: 64
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, vlms).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Vision Language Models (VLMs) have recently been leveraged to generate robotic actions, forming Vision-Language-Action (VLA) models. However, directly adapting a pretrained VLM for robotic control remains challenging, particularly when constrained by a limited number of robot demonstrations. In this work, we introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as visuo-textual conversations and enables an efficient transfer of a pretrained VLM into a powerful VLA, motivated by the success of visual instruction tuning in Computer Vision. First, we present an automated pipeline to generate conversation-style instruction tuning data for robots from existing behavior cloning datasets, aligning robotic actions with image pixel coordinates. Further, we enhance this dataset in a self-supervised manner by defining six auxiliary tasks, without requiring any additional action annotations. We show that a VLM finetuned with a limited amount of such datasets can produce meaningful action decisions for robotic control. Through experiments across multiple simulated and real-world tasks, we demonstrate that LLaRA achieves state-of-the-art performance while preserving the generalization capabilities of large language models. The code, datasets, and pretrained models are available at https://github.com/LostXine/LLaRA.

Claim

Vision Language Models (VLMs) have recently been leveraged to generate robotic actions, forming Vision-Language-Action (VLA) models.

Real2Code: Reconstruct Articulated Objects via Code Generation Paper
  • Authors: Zhao Mandi, Yijia Weng, Dominik Bauer, Shuran Song
  • Year: 2024
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2406.08474
  • Citations: 63
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: program synthesis (matched: code generation).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

We present Real2Code, a novel approach to reconstructing articulated objects via code generation. Given visual observations of an object, we first reconstruct its part geometry using an image segmentation model and a shape completion model. We then represent the object parts with oriented bounding boxes, which are input to a fine-tuned large language model (LLM) to predict joint articulation as code. By leveraging pre-trained vision and language models, our approach scales elegantly with the number of articulated parts, and generalizes from synthetic training data to real world objects in unstructured environments. Experimental results demonstrate that Real2Code significantly outperforms previous state-of-the-art in reconstruction accuracy, and is the first approach to extrapolate beyond objects' structural complexity in the training set, and reconstructs objects with up to 10 articulated parts. When incorporated with a stereo reconstruction model, Real2Code also generalizes to real world objects from a handful of multi-view RGB images, without the need for depth or camera information.

Claim

We present Real2Code, a novel approach to reconstructing articulated objects via code generation.

Visual Agents as Fast and Slow Thinkers Paper
  • Authors: Guangyan Sun, Mingyu Jin, Zhenting Wang, Cheng-Long Wang, Siqi Ma, Qifan Wang, Ying Nian Wu, Yongfeng Zhang, Dongfang Liu
  • Year: 2024
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2408.08862
  • Citations: 59
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: multimodal agents (matched: visual agents).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Achieving human-level intelligence requires refining cognitive distinctions between System 1 and System 2 thinking. While contemporary AI, driven by large language models, demonstrates human-like traits, it falls short of genuine cognition. Transitioning from structured benchmarks to real-world scenarios presents challenges for visual agents, often leading to inaccurate and overly confident responses. To address the challenge, we introduce FaST, which incorporates the Fast and Slow Thinking mechanism into visual agents. FaST employs a switch adapter to dynamically select between System ½ modes, tailoring the problem-solving approach to different task complexity. It tackles uncertain and unseen objects by adjusting model confidence and integrating new contextual data. With this novel design, we advocate a flexible system, hierarchical reasoning capabilities, and a transparent decision-making pipeline, all of which contribute to its ability to emulate human-like cognitive processes in visual intelligence. Empirical results demonstrate that FaST outperforms various well-known baselines, achieving 80.8% accuracy over VQA^{v2} for visual question answering and 48.7% GIoU score over ReasonSeg for reasoning segmentation, demonstrate FaST's superior performance. Extensive testing validates the efficacy and robustness of FaST's core components, showcasing its potential to advance the development of cognitive visual agents in AI systems. The code is available at ttps://github.com/GuangyanS/Sys2-LLaVA.

Claim

Achieving human-level intelligence requires refining cognitive distinctions between System 1 and System 2 thinking.

Compositional Entailment Learning for Hyperbolic Vision-Language Models Paper
  • Authors: Avik Pal, Max van Spengler, G. M. D. D. Melendugno, Alessandro Flaborea, Fabio Galasso, Pascal Mettes
  • Year: 2024
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2410.06912
  • Citations: 59
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, vision language model).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Image-text representation learning forms a cornerstone in vision-language models, where pairs of images and textual descriptions are contrastively aligned in a shared embedding space. Since visual and textual concepts are naturally hierarchical, recent work has shown that hyperbolic space can serve as a high-potential manifold to learn vision-language representation with strong downstream performance. In this work, for the first time we show how to fully leverage the innate hierarchical nature of hyperbolic embeddings by looking beyond individual image-text pairs. We propose Compositional Entailment Learning for hyperbolic vision-language models. The idea is that an image is not only described by a sentence but is itself a composition of multiple object boxes, each with their own textual description. Such information can be obtained freely by extracting nouns from sentences and using openly available localized grounding models. We show how to hierarchically organize images, image boxes, and their textual descriptions through contrastive and entailment-based objectives. Empirical evaluation on a hyperbolic vision-language model trained with millions of image-text pairs shows that the proposed compositional learning approach outperforms conventional Euclidean CLIP learning, as well as recent hyperbolic alternatives, with better zero-shot and retrieval generalization and clearly stronger hierarchical performance.

Claim

Image-text representation learning forms a cornerstone in vision-language models, where pairs of images and textual descriptions are contrastively aligned in a shared embedding space.

MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs Paper
  • Authors: Yusu Qian, Hanrong Ye, J. Fauconnier, Peter Grasch, Yinfei Yang, Zhe Gan
  • Year: 2024
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2407.01509
  • Citations: 58
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: mllm, mllms, multimodal large language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

We introduce MIA-Bench, a new benchmark designed to evaluate multimodal large language models (MLLMs) on their ability to strictly adhere to complex instructions. Our benchmark comprises a diverse set of 400 image-prompt pairs, each crafted to challenge the models' compliance with layered instructions in generating accurate responses that satisfy specific requested patterns. Evaluation results from a wide array of state-of-the-art MLLMs reveal significant variations in performance, highlighting areas for improvement in instruction fidelity. Additionally, we create extra training data and explore supervised fine-tuning to enhance the models' ability to strictly follow instructions without compromising performance on other tasks. We hope this benchmark not only serves as a tool for measuring MLLM adherence to instructions, but also guides future developments in MLLM training methods.

Claim

We introduce MIA-Bench, a new benchmark designed to evaluate multimodal large language models (MLLMs) on their ability to strictly adhere to complex instructions.

Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference Under Ambiguities Paper
  • Authors: Zheyuan Zhang, Fengyuan Hu, Jayjun Lee, Freda Shi, Parisa Kordjamshidi, Joyce Chai, Ziqiao Ma
  • Year: 2024
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2410.17385
  • Citations: 58
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, vlms).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Spatial expressions in situated communication can be ambiguous, as their meanings vary depending on the frames of reference (FoR) adopted by speakers and listeners. While spatial language understanding and reasoning by vision-language models (VLMs) have gained increasing attention, potential ambiguities in these models are still under-explored. To address this issue, we present the COnsistent Multilingual Frame Of Reference Test (COMFORT), an evaluation protocol to systematically assess the spatial reasoning capabilities of VLMs. We evaluate nine state-of-the-art VLMs using COMFORT. Despite showing some alignment with English conventions in resolving ambiguities, our experiments reveal significant shortcomings of VLMs: notably, the models (1) exhibit poor robustness and consistency, (2) lack the flexibility to accommodate multiple FoRs, and (3) fail to adhere to language-specific or culture-specific conventions in cross-lingual tests, as English tends to dominate other languages. With a growing effort to align vision-language models with human cognitive intuitions, we call for more attention to the ambiguous nature and cross-cultural diversity of spatial reasoning.

Claim

Spatial expressions in situated communication can be ambiguous, as their meanings vary depending on the frames of reference (FoR) adopted by speakers and listeners.

LoTa-Bench: Benchmarking Language-oriented Task Planners for Embodied Agents Paper
  • Authors: Jae-Woo Choi, Youngwoo Yoon, Hyobin Ong, Jaehong Kim, Minsu Jang
  • Year: 2024
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2402.08178
  • Citations: 57
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: embodied agents (matched: embodied agents).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Large language models (LLMs) have recently received considerable attention as alternative solutions for task planning. However, comparing the performance of language-oriented task planners becomes difficult, and there exists a dearth of detailed exploration regarding the effects of various factors such as pre-trained model selection and prompt construction. To address this, we propose a benchmark system for automatically quantifying performance of task planning for home-service embodied agents. Task planners are tested on two pairs of datasets and simulators: 1) ALFRED and AI2-THOR, 2) an extension of Watch-And-Help and VirtualHome. Using the proposed benchmark system, we perform extensive experiments with LLMs and prompts, and explore several enhancements of the baseline planner. We expect that the proposed benchmark tool would accelerate the development of language-oriented task planners.

Claim

Large language models (LLMs) have recently received considerable attention as alternative solutions for task planning.

Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification Paper
  • Authors: Wenxuan Huang, Zijie Zhai, Yunhang Shen, Shaoshen Cao, Fei Zhao, Xiangfeng Xu, Zheyu Ye, Shaohui Lin
  • Year: 2024
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2412.00876
  • Citations: 57
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: mllms, multimodal large language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable success in vision understanding, reasoning, and interaction. However, the inference computation and memory increase progressively with the generation of output tokens during decoding, directly affecting the efficacy of MLLMs. Existing methods attempt to reduce the vision context redundancy to achieve efficient MLLMs. Unfortunately, the efficiency benefits of the vision context reduction in the prefill stage gradually diminish during the decoding stage. To address this problem, we proposed a dynamic vision-language context sparsification framework Dynamic-LLaVA, which dynamically reduces the redundancy of vision context in the prefill stage and decreases the memory and computation overhead of the generated language context during decoding. Dynamic-LLaVA designs a tailored sparsification inference scheme for different inference modes, i.e., prefill, decoding with and without KV cache, to achieve efficient inference of MLLMs. In practice, Dynamic-LLaVA can reduce computation consumption by \(\sim\)75% in the prefill stage. Meanwhile, throughout the entire generation process of MLLMs, Dynamic-LLaVA reduces the \(\sim\)50% computation consumption under decoding without KV cache, while saving \(\sim\)50% GPU memory overhead when decoding with KV cache, due to the vision-language context sparsification. Extensive experiments also demonstrate that Dynamic-LLaVA achieves efficient inference for MLLMs with negligible understanding and generation ability degradation or even performance gains compared to the full-context inference baselines. Code is available at https://github.com/Osilly/dynamic_llava .

Claim

Multimodal Large Language Models (MLLMs) have achieved remarkable success in vision understanding, reasoning, and interaction.

An Image Is Worth 1000 Lies: Transferability of Adversarial Images across Prompts on Vision-Language Models Paper
  • Authors: Haochen Luo, Jindong Gu, Fengyuan Liu, Philip H. S. Torr
  • Year: 2024
  • Venue: International Conference on Learning Representations
  • DOI: Not stated.
  • Citations: 55
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.

ETA: Evaluating Then Aligning Safety of Vision Language Models at Inference Time Paper
  • Authors: Yi Ding, Bolian Li, Ruqi Zhang
  • Year: 2024
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2410.06625
  • Citations: 54
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, vlms).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Vision Language Models (VLMs) have become essential backbones for multimodal intelligence, yet significant safety challenges limit their real-world application. While textual inputs are often effectively safeguarded, adversarial visual inputs can easily bypass VLM defense mechanisms. Existing defense methods are either resource-intensive, requiring substantial data and compute, or fail to simultaneously ensure safety and usefulness in responses. To address these limitations, we propose a novel two-phase inference-time alignment framework, Evaluating Then Aligning (ETA): 1) Evaluating input visual contents and output responses to establish a robust safety awareness in multimodal settings, and 2) Aligning unsafe behaviors at both shallow and deep levels by conditioning the VLMs' generative distribution with an interference prefix and performing sentence-level best-of-N to search the most harmless and helpful generation paths. Extensive experiments show that ETA outperforms baseline methods in terms of harmlessness, helpfulness, and efficiency, reducing the unsafe rate by 87.5% in cross-modality attacks and achieving 96.6% win-ties in GPT-4 helpfulness evaluation. The code is publicly available at https://github.com/DripNowhy/ETA.

Claim

Vision Language Models (VLMs) have become essential backbones for multimodal intelligence, yet significant safety challenges limit their real-world application.

LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-grained Descriptors Paper
  • Authors: Sheng Jin, Xueying Jiang, Jiaxing Huang, Lewei Lu, Shijian Lu
  • Year: 2024
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2402.04630
  • Citations: 51
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, vlms).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Inspired by the outstanding zero-shot capability of vision language models (VLMs) in image classification tasks, open-vocabulary object detection has attracted increasing interest by distilling the broad VLM knowledge into detector training. However, most existing open-vocabulary detectors learn by aligning region embeddings with categorical labels (e.g., bicycle) only, disregarding the capability of VLMs on aligning visual embeddings with fine-grained text description of object parts (e.g., pedals and bells). This paper presents DVDet, a Descriptor-Enhanced Open Vocabulary Detector that introduces conditional context prompts and hierarchical textual descriptors that enable precise region-text alignment as well as open-vocabulary detection training in general. Specifically, the conditional context prompt transforms regional embeddings into image-like representations that can be directly integrated into general open vocabulary detection training. In addition, we introduce large language models as an interactive and implicit knowledge repository which enables iterative mining and refining visually oriented textual descriptors for precise region-text alignment. Extensive experiments over multiple large-scale benchmarks show that DVDet outperforms the state-of-the-art consistently by large margins.

Claim

Inspired by the outstanding zero-shot capability of vision language models (VLMs) in image classification tasks, open-vocabulary object detection has attracted increasing interest by distilling the broad VLM knowledge into detector training.

ExACT: Teaching AI Agents to Explore with Reflective-MCTS and Exploratory Learning Paper
  • Authors: Xiao Yu, Baolin Peng, Vineeth Vajipey, Hao Cheng, Michel Galley, Jianfeng Gao, Zhou Yu
  • Year: 2024
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2410.02052
  • Citations: 51
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: LLM agents, vision-language models (matched: agentic, vision language models, vlms).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Autonomous agents have demonstrated significant potential in automating complex multistep decision-making tasks. However, even state-of-the-art vision-language models (VLMs), such as GPT-4o, still fall short of human-level performance, particularly in intricate web environments and long-horizon tasks. To address these limitations, we present ExACT, an approach to combine test-time search and self-learning to build o1-like models for agentic applications. We first introduce Reflective Monte Carlo Tree Search (R-MCTS), a novel test time algorithm designed to enhance AI agents' ability to explore decision space on the fly. R-MCTS extends traditional MCTS by 1) incorporating contrastive reflection, allowing agents to learn from past interactions and dynamically improve their search efficiency; and 2) using multi-agent debate for reliable state evaluation. Next, we introduce Exploratory Learning, a novel learning strategy to teach agents to search at inference time without relying on any external search algorithms. On the challenging VisualWebArena benchmark, our GPT-4o based R-MCTS agent achieves a 6% to 30% relative improvement across various tasks compared to the previous state-of-the-art. Additionally, we show that the knowledge and experience gained from test-time search can be effectively transferred back to GPT-4o via fine-tuning. After Exploratory Learning, GPT-4o 1) demonstrates the ability to explore the environment, evaluate a state, and backtrack to viable ones when it detects that the current state cannot lead to success, and 2) matches 87% of R-MCTS's performance while using significantly less compute. Notably, our work demonstrates the compute scaling properties in both training - data collection with R-MCTS - and testing time. These results suggest a promising research direction to enhance VLMs' capabilities for agentic applications via test-time search and self-learning.

Claim

Autonomous agents have demonstrated significant potential in automating complex multistep decision-making tasks.

COMBO: Compositional World Models for Embodied Multi-Agent Cooperation Paper
  • Authors: Hongxin Zhang, Zeyuan Wang, Qiushi Lyu, Zheyuan Zhang, Sunli Chen, Tianmin Shu, Yilun Du, Chuang Gan
  • Year: 2024
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2404.10775
  • Citations: 48
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models, embodied agents (matched: vision language models, embodied agents).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

In this paper, we investigate the problem of embodied multi-agent cooperation, where decentralized agents must cooperate given only egocentric views of the world. To effectively plan in this setting, in contrast to learning world dynamics in a single-agent scenario, we must simulate world dynamics conditioned on an arbitrary number of agents' actions given only partial egocentric visual observations of the world. To address this issue of partial observability, we first train generative models to estimate the overall world state given partial egocentric observations. To enable accurate simulation of multiple sets of actions on this world state, we then propose to learn a compositional world model for multi-agent cooperation by factorizing the naturally composable joint actions of multiple agents and compositionally generating the video conditioned on the world state. By leveraging this compositional world model, in combination with Vision Language Models to infer the actions of other agents, we can use a tree search procedure to integrate these modules and facilitate online cooperative planning. We evaluate our methods on three challenging benchmarks with 2-4 agents. The results show our compositional world model is effective and the framework enables the embodied agents to cooperate efficiently with different agents across various tasks and an arbitrary number of agents, showing the promising future of our proposed methods. More videos can be found at https://umass-embodied-agi.github.io/COMBO/.

Claim

In this paper, we investigate the problem of embodied multi-agent cooperation, where decentralized agents must cooperate given only egocentric views of the world.

Diffusion Feedback Helps CLIP See Better Paper
  • Authors: Wenxuan Wang, Quan Sun, Fan Zhang, Yepeng Tang, Jing Liu, Xinlong Wang
  • Year: 2024
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2407.20171
  • Citations: 48
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, mllms, multimodal large language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Contrastive Language-Image Pre-training (CLIP), which excels at abstracting open-world representations across domains and modalities, has become a foundation for a variety of vision and multimodal tasks. However, recent studies reveal that CLIP has severe visual shortcomings, such as which can hardly distinguish orientation, quantity, color, structure, etc. These visual shortcomings also limit the perception capabilities of multimodal large language models (MLLMs) built on CLIP. The main reason could be that the image-text pairs used to train CLIP are inherently biased, due to the lack of the distinctiveness of the text and the diversity of images. In this work, we present a simple post-training approach for CLIP models, which largely overcomes its visual shortcomings via a self-supervised diffusion process. We introduce DIVA, which uses the DIffusion model as a Visual Assistant for CLIP. Specifically, DIVA leverages generative feedback from text-to-image diffusion models to optimize CLIP representations, with only images (without corresponding text). We demonstrate that DIVA improves CLIP's performance on the challenging MMVP-VLM benchmark which assesses fine-grained visual abilities to a large extent (e.g., 3-7%), and enhances the performance of MLLMs and vision models on multimodal understanding and segmentation tasks. Extensive evaluation on 29 image classification and retrieval benchmarks confirms that our framework preserves CLIP's strong zero-shot capabilities. The code is available at https://github.com/baaivision/DIVA.

Claim

Contrastive Language-Image Pre-training (CLIP), which excels at abstracting open-world representations across domains and modalities, has become a foundation for a variety of vision and multimodal tasks.

Text4Seg: Reimagining Image Segmentation as Text Generation Paper
  • Authors: Mengcheng Lan, Chaofeng Chen, Yue Zhou, Jiaxing Xu, Yiping Ke, Xinjiang Wang, Litong Feng, Wayne Zhang
  • Year: 2024
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2410.09855
  • Citations: 47
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: mllm, mllms, multimodal large language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Multimodal Large Language Models (MLLMs) have shown exceptional capabilities in vision-language tasks; however, effectively integrating image segmentation into these models remains a significant challenge. In this paper, we introduce Text4Seg, a novel text-as-mask paradigm that casts image segmentation as a text generation problem, eliminating the need for additional decoders and significantly simplifying the segmentation process. Our key innovation is semantic descriptors, a new textual representation of segmentation masks where each image patch is mapped to its corresponding text label. This unified representation allows seamless integration into the auto-regressive training pipeline of MLLMs for easier optimization. We demonstrate that representing an image with \(16\times16\) semantic descriptors yields competitive segmentation performance. To enhance efficiency, we introduce the Row-wise Run-Length Encoding (R-RLE), which compresses redundant text sequences, reducing the length of semantic descriptors by 74% and accelerating inference by \(3\times\), without compromising performance. Extensive experiments across various vision tasks, such as referring expression segmentation and comprehension, show that Text4Seg achieves state-of-the-art performance on multiple datasets by fine-tuning different MLLM backbones. Our approach provides an efficient, scalable solution for vision-centric tasks within the MLLM framework.

Claim

Multimodal Large Language Models (MLLMs) have shown exceptional capabilities in vision-language tasks; however, effectively integrating image segmentation into these models remains a significant challenge.

Towards Unified Multi-Modal Personalization: Large Vision-Language Models for Generative Recommendation and Beyond Paper
  • Authors: Tianxin Wei, Bowen Jin, Ruirui Li, Hansi Zeng, Zhengyang Wang, Jianhui Sun, Qingyu Yin, Hanqing Lu, Suhang Wang, Jingrui He, Xianfeng Tang
  • Year: 2024
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2403.10667
  • Citations: 46
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, large vision language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Developing a universal model that can effectively harness heterogeneous resources and respond to a wide range of personalized needs has been a longstanding community aspiration. Our daily choices, especially in domains like fashion and retail, are substantially shaped by multi-modal data, such as pictures and textual descriptions. These modalities not only offer intuitive guidance but also cater to personalized user preferences. However, the predominant personalization approaches mainly focus on the ID or text-based recommendation problem, failing to comprehend the information spanning various tasks or modalities. In this paper, our goal is to establish a Unified paradigm for Multi-modal Personalization systems (UniMP), which effectively leverages multi-modal data while eliminating the complexities associated with task- and modality-specific customization. We argue that the advancements in foundational generative modeling have provided the flexibility and effectiveness necessary to achieve the objective. In light of this, we develop a generic and extensible personalization generative framework, that can handle a wide range of personalized needs including item recommendation, product search, preference prediction, explanation generation, and further user-guided image generation. Our methodology enhances the capabilities of foundational language models for personalized tasks by seamlessly ingesting interleaved cross-modal user history information, ensuring a more precise and customized experience for users. To train and evaluate the proposed multi-modal personalized tasks, we also introduce a novel and comprehensive benchmark covering a variety of user requirements. Our experiments on the real-world benchmark showcase the model's potential, outperforming competitive methods specialized for each task.

Claim

Developing a universal model that can effectively harness heterogeneous resources and respond to a wide range of personalized needs has been a longstanding community aspiration.

CogCoM: A Visual Language Model with Chain-of-Manipulations Reasoning Paper
  • Authors: Ji Qi, Ming Ding, Weihan Wang, Yushi Bai, Qingsong Lv, Wenyi Hong, Bin Xu, Lei Hou, Juanzi Li, Yuxiao Dong, Jie Tang
  • Year: 2024
  • Venue: International Conference on Learning Representations
  • DOI: Not stated.
  • Citations: 45
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, vlms).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Vision-Language Models (VLMs) have demonstrated their broad effectiveness thanks to extensive training in aligning visual instructions to responses. However, such training of conclusive alignment leads models to ignore essential visual reasoning, further resulting in failures in meticulous visual problems and unfaithful responses. Drawing inspiration from human cognition in solving visual problems (e.g., marking, zoom in), this paper introduces Chain of Manipulations, a mechanism that enables VLMs to solve problems step-by-step with evidence. After training, models can solve various visual problems by eliciting intrinsic manipulations (e.g., grounding, zoom in) with results (e.g., boxes, image) actively without involving external tools, while also allowing users to trace error causes. We study the roadmap to implement this mechanism, including (1) a flexible design of manipulations upon extensive analysis, (2) an efficient automated data generation pipeline, (3) a compatible VLM architecture capable of multi-turn multi-image, and (4) a model training process for versatile capabilities. With the design, we also manually annotate 6K high-quality samples for the challenging graphical mathematical problems. Our trained model, CogCoM, equipped with this mechanism with 17B parameters achieves state-of-the-art performance across 9 benchmarks from 4 categories, demonstrating the effectiveness while preserving the interpretability. Our code, model weights, and collected data are publicly available at https://github.com/THUDM/CogCoM.

Claim

Vision-Language Models (VLMs) have demonstrated their broad effectiveness thanks to extensive training in aligning visual instructions to responses.

MMAD: A Comprehensive Benchmark for Multimodal Large Language Models in Industrial Anomaly Detection Paper
  • Authors: Xi Jiang, Jian Li, Hanqiu Deng, Yong Liu, Bin-Bin Gao, Yifeng Zhou, Jialin Li, Chengjie Wang, Feng Zheng
  • Year: 2024
  • Venue: International Conference on Learning Representations
  • DOI: Not stated.
  • Citations: 43
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: mllms, multimodal large language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

In the field of industrial inspection, Multimodal Large Language Models (MLLMs) have a high potential to renew the paradigms in practical applications due to their robust language capabilities and generalization abilities. However, despite their impressive problem-solving skills in many domains, MLLMs' ability in industrial anomaly detection has not been systematically studied. To bridge this gap, we present MMAD, the first-ever full-spectrum MLLMs benchmark in industrial Anomaly Detection. We defined seven key subtasks of MLLMs in industrial inspection and designed a novel pipeline to generate the MMAD dataset with 39,672 questions for 8,366 industrial images. With MMAD, we have conducted a comprehensive, quantitative evaluation of various state-of-the-art MLLMs. The commercial models performed the best, with the average accuracy of GPT-4o models reaching 74.9%. However, this result falls far short of industrial requirements. Our analysis reveals that current MLLMs still have significant room for improvement in answering questions related to industrial anomalies and defects. We further explore two training-free performance enhancement strategies to help models improve in industrial scenarios, highlighting their promising potential for future research.

Claim

In the field of industrial inspection, Multimodal Large Language Models (MLLMs) have a high potential to renew the paradigms in practical applications due to their robust language capabilities and generalization abilities.

Mitigating Modality Prior-Induced Hallucinations in Multimodal Large Language Models via Deciphering Attention Causality Paper
  • Authors: Guanyu Zhou, Yibo Yan, Xin Zou, Kun Wang, Aiwei Liu, Xuming Hu
  • Year: 2024
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2410.04780
  • Citations: 40
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: mllm, mllms, multimodal large language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Multimodal Large Language Models (MLLMs) have emerged as a central focus in both industry and academia, but often suffer from biases introduced by visual and language priors, which can lead to multimodal hallucination. These biases arise from the visual encoder and the Large Language Model (LLM) backbone, affecting the attention mechanism responsible for aligning multimodal inputs. Existing decoding-based mitigation methods focus on statistical correlations and overlook the causal relationships between attention mechanisms and model output, limiting their effectiveness in addressing these biases. To tackle this issue, we propose a causal inference framework termed CausalMM that applies structural causal modeling to MLLMs, treating modality priors as a confounder between attention mechanisms and output. Specifically, by employing backdoor adjustment and counterfactual reasoning at both the visual and language attention levels, our method mitigates the negative effects of modality priors and enhances the alignment of MLLM's inputs and outputs, with a maximum score improvement of 65.3% on 6 VLind-Bench indicators and 164 points on MME Benchmark compared to conventional methods. Extensive experiments validate the effectiveness of our approach while being a plug-and-play solution. Our code is available at: https://github.com/The-Martyr/CausalMM

Claim

Multimodal Large Language Models (MLLMs) have emerged as a central focus in both industry and academia, but often suffer from biases introduced by visual and language priors, which can lead to multimodal hallucination.