ML + Vision Top-6 Agent Survey - ICLR 2025 - Page 1 of 3

Venue: International Conference on Learning Representations
Year: 2025
Page: 1 / 3
Papers: 1-30 / 74

Papers

PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding

Authors: Wei Chow, Jiageng Mao, Boyi Li, Daniel Seita, V. Guizilini, Yue Wang
Year: 2025
Venue: International Conference on Learning Representations
DOI: 10.48550/arXiv.2501.16411
Citations: 112
Relevance: 5 / 5
Why selected: Heuristic keyword/alias matches: vision-language models, embodied agents (matched: vision language models, vlms, embodied agents, embodied ai).
Code: Not found.
Extraction: method/data pending

Abstract

Understanding the physical world is a fundamental challenge in embodied AI, critical for enabling agents to perform complex tasks and operate safely in real-world environments. While Vision-Language Models (VLMs) have shown great promise in reasoning and task planning for embodied agents, their ability to comprehend physical phenomena remains extremely limited. To close this gap, we introduce PhysBench, a comprehensive benchmark designed to evaluate VLMs' physical world understanding capability across a diverse set of tasks. PhysBench contains 10,002 entries of interleaved video-image-text data, categorized into four major domains: physical object properties, physical object relationships, physical scene understanding, and physics-based dynamics, further divided into 19 subclasses and 8 distinct capability dimensions. Our extensive experiments, conducted on 75 representative VLMs, reveal that while these models excel in common-sense reasoning, they struggle with understanding the physical world -- likely due to the absence of physical knowledge in their training data and the lack of embedded physical priors. To tackle the shortfall, we introduce PhysAgent, a novel framework that combines the generalization strengths of VLMs with the specialized expertise of vision models, significantly enhancing VLMs' physical understanding across a variety of tasks, including an 18.4% improvement on GPT-4o. Furthermore, our results demonstrate that enhancing VLMs' physical world understanding capabilities can help embodied agents such as MOKA. We believe that PhysBench and PhysAgent offer valuable insights and contribute to bridging the gap between VLMs and physical world understanding.

Claim

Understanding the physical world is a fundamental challenge in embodied AI, critical for enabling agents to perform complex tasks and operate safely in real-world environments.

Re-Aligning Language to Visual Objects with an Agentic Workflow

Authors: Yuming Chen, Jiangyan Feng, Haodong Zhang, Lijun Gong, Feng Zhu, Rui Zhao, Qibin Hou, Ming-Ming Cheng, Yibing Song
Year: 2025
Venue: International Conference on Learning Representations
DOI: 10.48550/arXiv.2503.23508
Citations: 4
Relevance: 5 / 5
Why selected: Heuristic keyword/alias matches: LLM agents, vision-language models (matched: tool use, agentic, vlm, vision language models, vlms).
Code: Not found.
Extraction: method/data pending

Abstract

Language-based object detection (LOD) aims to align visual objects with language expressions. A large amount of paired data is utilized to improve LOD model generalizations. During the training process, recent studies leverage vision-language models (VLMs) to automatically generate human-like expressions for visual objects, facilitating training data scaling up. In this process, we observe that VLM hallucinations bring inaccurate object descriptions (e.g., object name, color, and shape) to deteriorate VL alignment quality. To reduce VLM hallucinations, we propose an agentic workflow controlled by an LLM to re-align language to visual objects via adaptively adjusting image and text prompts. We name this workflow Real-LOD, which includes planning, tool use, and reflection steps. Given an image with detected objects and VLM raw language expressions, Real-LOD reasons its state automatically and arranges action based on our neural symbolic designs (i.e., planning). The action will adaptively adjust the image and text prompts and send them to VLMs for object re-description (i.e., tool use). Then, we use another LLM to analyze these refined expressions for feedback (i.e., reflection). These steps are conducted in a cyclic form to gradually improve language descriptions for re-aligning to visual objects. We construct a dataset that contains a tiny amount of 0.18M images with re-aligned language expression and train a prevalent LOD model to surpass existing LOD methods by around 50% on the standard benchmarks. Our Real-LOD workflow, with automatic VL refinement, reveals a potential to preserve data quality along with scaling up data quantity, which further improves LOD performance from a data-alignment perspective.
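The plan / tool-use / reflection cycle described in this abstract can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not Real-LOD's implementation: `plan`, `redescribe`, and `reflect` are hypothetical stand-ins for the LLM planner, the VLM, and the second LLM, and the stopping rule and `max_rounds` are invented for the example.

```python
from typing import Callable, Tuple


def realign(expression: str,
            plan: Callable[[str, str], str],
            redescribe: Callable[[str, str], str],
            reflect: Callable[[str], Tuple[bool, str]],
            max_rounds: int = 3) -> str:
    """Cyclically refine one object description until the reviewer accepts it."""
    for _ in range(max_rounds):
        ok, feedback = reflect(expression)            # reflection: critique the current text
        if ok:
            break
        action = plan(expression, feedback)           # planning: choose a prompt adjustment
        expression = redescribe(expression, action)   # tool use: ask the describer again
    return expression


# Toy stand-ins (hypothetical): the "hallucination" is a wrong color word.
def toy_reflect(text: str) -> Tuple[bool, str]:
    return ("red" not in text, "color looks wrong")


def toy_plan(text: str, feedback: str) -> str:
    return "crop_and_ask_color"


def toy_redescribe(text: str, action: str) -> str:
    return text.replace("red", "blue")


print(realign("a red car on the left", toy_plan, toy_redescribe, toy_reflect))
# a blue car on the left
```

Passing the three roles in as callables keeps the loop independent of any particular model API, which the abstract does not specify.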

Claim

Language-based object detection (LOD) aims to align visual objects with language expressions.

MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs

Authors: Jiarui Zhang, Mahyar Khayatkhoei, P. Chhikara, Filip Ilievski
Year: 2025
Venue: International Conference on Learning Representations
DOI: 10.48550/arXiv.2502.17422
Citations: 135
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: mllm, mllms, multimodal large language models).
Code: Not found.
Extraction: method/data pending

Abstract

Multimodal Large Language Models (MLLMs) have experienced rapid progress in visual recognition tasks in recent years. Given their potential integration into many critical applications, it is important to understand the limitations of their visual perception. In this work, we study whether MLLMs can perceive small visual details as effectively as large ones when answering questions about images. We observe that their performance is very sensitive to the size of the visual subject of the question, and further show that this effect is in fact causal by conducting an intervention study. Next, we study the attention patterns of MLLMs when answering visual questions, and intriguingly find that they consistently know where to look, even when they provide the wrong answer. Based on these findings, we then propose training-free visual intervention methods that leverage the internal knowledge of any MLLM itself, in the form of attention and gradient maps, to enhance its perception of small visual details. We evaluate our proposed methods on two widely-used MLLMs and seven visual question answering benchmarks and show that they can significantly improve MLLMs' accuracy without requiring any training. Our results elucidate the risk of applying MLLMs to visual recognition tasks concerning small details and indicate that visual intervention using the model's internal state is a promising direction to mitigate this risk.
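One way to picture an attention-guided visual intervention is to crop the image around the region the model already attends to. The sketch below is an assumption-laden illustration, not the authors' procedure: the thresholding rule, the `keep` fraction, and the `patch` size of 14 pixels are all hypothetical choices.

```python
from typing import List, Tuple


def attention_bbox(attn: List[List[float]], keep: float = 0.5,
                   patch: int = 14) -> Tuple[int, int, int, int]:
    """Pixel box (left, top, right, bottom) covering cells with attention >= keep * max."""
    peak = max(max(row) for row in attn)
    hot = [(r, c) for r, row in enumerate(attn)
           for c, v in enumerate(row) if v >= keep * peak]
    rows = [r for r, _ in hot]
    cols = [c for _, c in hot]
    # Convert patch-grid indices to pixel coordinates (right/bottom are exclusive).
    return (min(cols) * patch, min(rows) * patch,
            (max(cols) + 1) * patch, (max(rows) + 1) * patch)


# A 4x4 patch-level attention map with a small hot region (made-up values).
attn = [
    [0.0, 0.1, 0.0, 0.0],
    [0.0, 0.2, 0.9, 0.0],
    [0.0, 0.1, 0.8, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
print(attention_bbox(attn))  # (28, 14, 42, 42)
```

The returned box could then be used to crop and upsample the region before re-asking the question, which requires no training of the model itself.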

Claim

Multimodal Large Language Models (MLLMs) have experienced rapid progress in visual recognition tasks in recent years.

HAMSTER: Hierarchical Action Models For Open-World Robot Manipulation

Authors: Yi Li, Yuquan Deng, Jesse Zhang, J. Jang, Marius Memmel, Caelan Reed Garrett, Fabio Ramos, Dieter Fox, Anqi Li, Abhishek Gupta, Ankit Goyal
Year: 2025
Venue: International Conference on Learning Representations
DOI: 10.48550/arXiv.2502.05485
Citations: 107
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, vlms).
Code: https://hamster-robot.github.io/ (project page, per abstract)
Extraction: method/data pending

Abstract

Large foundation models have shown strong open-world generalization to complex problems in vision and language, but similar levels of generalization have yet to be achieved in robotics. One fundamental challenge is the lack of robotic data, which are typically obtained through expensive on-robot operation. A promising remedy is to leverage cheaper, off-domain data such as action-free videos, hand-drawn sketches or simulation data. In this work, we posit that hierarchical vision-language-action (VLA) models can be more effective in utilizing off-domain data than standard monolithic VLA models that directly finetune vision-language models (VLMs) to predict actions. In particular, we study a class of hierarchical VLA models, where the high-level VLM is finetuned to produce a coarse 2D path indicating the desired robot end-effector trajectory given an RGB image and a task description. The intermediate 2D path prediction is then served as guidance to the low-level, 3D-aware control policy capable of precise manipulation. Doing so alleviates the high-level VLM from fine-grained action prediction, while reducing the low-level policy's burden on complex task-level reasoning. We show that, with the hierarchical design, the high-level VLM can transfer across significant domain gaps between the off-domain finetuning data and real-robot testing scenarios, including differences on embodiments, dynamics, visual appearances and task semantics, etc. In the real-robot experiments, we observe an average of 20% improvement in success rate across seven different axes of generalization over OpenVLA, representing a 50% relative gain. Visual results, code, and dataset are provided at: https://hamster-robot.github.io/
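The two figures quoted at the end of this abstract (a 20% average improvement that represents a 50% relative gain) jointly imply approximate success rates, assuming the 20% is an absolute difference in percentage points. The values computed below are an inference from those rounded numbers, not results reported in the abstract.

```python
from typing import Tuple


def implied_rates(abs_gain_pts: float, rel_gain: float) -> Tuple[float, float]:
    """Return (baseline, improved) success rates implied by absolute and relative gains."""
    # rel_gain = abs_gain / baseline, so baseline = abs_gain / rel_gain.
    baseline = abs_gain_pts / rel_gain
    return baseline, baseline + abs_gain_pts


print(implied_rates(20.0, 0.5))  # (40.0, 60.0)
```

Under this reading, the baseline would sit near 40% average success and the hierarchical model near 60%.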

Claim

Large foundation models have shown strong open-world generalization to complex problems in vision and language, but similar levels of generalization have yet to be achieved in robotics.

Learn-by-interact: A Data-Centric Framework for Self-Adaptive Agents in Realistic Environments

Authors: Hongjin Su, Ruoxi Sun, Jinsung Yoon, Pengcheng Yin, Tao Yu, Sercan Ö. Arik
Year: 2025
Venue: International Conference on Learning Representations
DOI: 10.48550/arXiv.2501.10893
Citations: 77
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: LLM agents, computer-use agents (matched: agentic, llm agents, osworld).
Code: Not found.
Extraction: method/data pending

Abstract

Autonomous agents powered by large language models (LLMs) have the potential to enhance human capabilities, assisting with digital tasks from sending emails to performing data analysis. The abilities of existing LLMs at such tasks are often hindered by the lack of high-quality agent data from the corresponding environments they interact with. We propose Learn-by-interact, a data-centric framework to adapt LLM agents to any given environments without human annotations. Learn-by-interact synthesizes trajectories of agent-environment interactions based on documentations, and constructs instructions by summarizing or abstracting the interaction histories, a process called backward construction. We assess the quality of our synthetic data by using them in both training-based scenarios and training-free in-context learning (ICL), where we craft innovative retrieval approaches optimized for agents. Extensive experiments on SWE-bench, WebArena, OSWorld and Spider2-V spanning across realistic coding, web, and desktop environments show the effectiveness of Learn-by-interact in various downstream agentic tasks -- baseline results are improved by up to 12.2% for ICL with Claude-3.5 and 19.5% for training with Codestral-22B. We further demonstrate the critical role of backward construction, which provides up to 14.0% improvement for training. Our ablation studies demonstrate the efficiency provided by our synthesized data in ICL and the superiority of our retrieval pipeline over alternative approaches like conventional retrieval-augmented generation (RAG). We expect that Learn-by-interact will serve as a foundation for agent data synthesis as LLMs are increasingly deployed at real-world environments.
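Backward construction, as described in this abstract, derives the instruction from what the agent actually did rather than the other way round. The sketch below illustrates that idea only: `summarize` is a hypothetical stand-in for an LLM call, and labeling every trajectory prefix is an assumption made for the example, not a detail given in the abstract.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, str]  # (action, observation)


def backward_construct(trajectory: List[Step],
                       summarize: Callable[[List[Step]], str],
                       min_len: int = 1) -> List[Tuple[str, List[Step]]]:
    """Pair each trajectory prefix with an instruction derived from its own steps."""
    data = []
    for end in range(min_len, len(trajectory) + 1):
        prefix = trajectory[:end]
        data.append((summarize(prefix), prefix))
    return data


# Toy summarizer (hypothetical): joins the action names into a task description.
def toy_summarize(steps: List[Step]) -> str:
    return "Task: " + ", then ".join(action for action, _ in steps)


traj = [("open settings", "settings page shown"),
        ("click 'Dark mode'", "theme changed")]
for instruction, steps in backward_construct(traj, toy_summarize):
    print(instruction, "|", len(steps), "step(s)")
```

Because each instruction is written after the fact, it matches its trajectory by construction, which is the property that makes the synthetic pairs usable for training or retrieval.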

Claim

Autonomous agents powered by large language models (LLMs) have the potential to enhance human capabilities, assisting with digital tasks from sending emails to performing data analysis.

ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery

Authors: Ziru Chen, Shijie Chen, Yuting Ning, Qianheng Zhang, Boshi Wang, Botao Yu, Yifei Li, Zeyi Liao, Chen Wei, Zitong Lu, Vishal Dey, Mingyi Xue, et al.
Year: 2025
Venue: International Conference on Learning Representations
DOI: Not stated.
Citations: 62
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: LLM agents (matched: language agents).
Code: Not found.
Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.

Reducing Hallucinations in Large Vision-Language Models via Latent Space Steering

Authors: Sheng Liu, Haotian Ye, James Zou
Year: 2025
Venue: International Conference on Learning Representations
DOI: Not stated.
Citations: 57
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, large vision language models).
Code: Not found.
Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.

SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding

Authors: Zhenyu Yang, Yuhang Hu, Zemin Du, Dizhan Xue, Shengsheng Qian, Jiahong Wu, Fan Yang, Weiming Dong, Changsheng Xu
Year: 2025
Venue: International Conference on Learning Representations
DOI: 10.48550/arXiv.2502.10810
Citations: 42
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, large vision language models, lvlms).
Code: https://github.com/sotayang/SVBench (per abstract)
Extraction: method/data pending

Abstract

Despite the significant advancements of Large Vision-Language Models (LVLMs) on established benchmarks, there remains a notable gap in suitable evaluation regarding their applicability in the emerging domain of long-context streaming video understanding. Current benchmarks for video understanding typically emphasize isolated single-instance text inputs and fail to evaluate the capacity to sustain temporal reasoning throughout the entire duration of video streams. To address these limitations, we introduce SVBench, a pioneering benchmark with temporal multi-turn question-answering chains specifically designed to thoroughly assess the capabilities of streaming video understanding of current LVLMs. We design a semi-automated annotation pipeline to obtain 49,979 Question-Answer (QA) pairs of 1,353 streaming videos, which includes generating QA chains that represent a series of consecutive multi-turn dialogues over video segments and constructing temporal linkages between successive QA chains. Our experimental results, obtained from 14 models in dialogue and streaming evaluations, reveal that while the closed-source GPT-4o outperforms others, most open-source LVLMs struggle with long-context streaming video understanding. We also construct a StreamingChat model, which significantly outperforms open-source LVLMs on our SVBench and achieves comparable performance on diverse vision-language benchmarks. We expect SVBench to advance the research of streaming video understanding by providing a comprehensive and in-depth analysis of current LVLMs. Our benchmark and model can be accessed at https://github.com/sotayang/SVBench.

Claim

Despite the significant advancements of Large Vision-Language Models (LVLMs) on established benchmarks, there remains a notable gap in suitable evaluation regarding their applicability in the emerging domain of long-context streaming video understanding.

Self-Correcting Decoding with Generative Feedback for Mitigating Hallucinations in Large Vision-Language Models

Authors: Ce Zhang, Zifu Wan, Zhehan Kan, Martin Ma, Simon Stepputtis, Deva Ramanan, Russ Salakhutdinov, Louis-Philippe Morency, Katia P. Sycara, Yaqi Xie
Year: 2025
Venue: International Conference on Learning Representations
DOI: 10.48550/arXiv.2502.06130
Citations: 41
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, large vision language models, lvlms).
Code: https://github.com/zhangce01/DeGF (per abstract)
Extraction: method/data pending

Abstract

While recent Large Vision-Language Models (LVLMs) have shown remarkable performance in multi-modal tasks, they are prone to generating hallucinatory text responses that do not align with the given visual input, which restricts their practical applicability in real-world scenarios. In this work, inspired by the observation that the text-to-image generation process is the inverse of image-conditioned response generation in LVLMs, we explore the potential of leveraging text-to-image generative models to assist in mitigating hallucinations in LVLMs. We discover that generative models can offer valuable self-feedback for mitigating hallucinations at both the response and token levels. Building on this insight, we introduce self-correcting Decoding with Generative Feedback (DeGF), a novel training-free algorithm that incorporates feedback from text-to-image generative models into the decoding process to effectively mitigate hallucinations in LVLMs. Specifically, DeGF generates an image from the initial response produced by LVLMs, which acts as an auxiliary visual reference and provides self-feedback to verify and correct the initial response through complementary or contrastive decoding. Extensive experimental results validate the effectiveness of our approach in mitigating diverse types of hallucinations, consistently surpassing state-of-the-art methods across six benchmarks. Code is available at https://github.com/zhangce01/DeGF.
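The abstract mentions correcting a response "through complementary or contrastive decoding" but does not give the rules. The sketch below shows generic textbook forms of the two combinations and one plausible way to choose between them; the Jensen-Shannon threshold `tau`, the weight `alpha`, and the switching rule are assumptions for illustration and should not be read as DeGF's actual formulas.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def js_divergence(p: List[float], q: List[float]) -> float:
    """Jensen-Shannon divergence in nats between two distributions."""
    mid = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x: List[float], y: List[float]) -> float:
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)


def combine(logits_orig: List[float], logits_gen: List[float],
            alpha: float = 1.0, tau: float = 0.1) -> List[float]:
    """Mix next-token logits from the original image and a generated reference image."""
    p, q = softmax(logits_orig), softmax(logits_gen)
    if js_divergence(p, q) < tau:
        # Views agree: complementary decoding reinforces the shared prediction.
        mixed = [a + alpha * b for a, b in zip(logits_orig, logits_gen)]
    else:
        # Views disagree: contrastive decoding pushes away from the second view.
        mixed = [(1 + alpha) * a - alpha * b for a, b in zip(logits_orig, logits_gen)]
    return softmax(mixed)


print(combine([2.0, 0.0], [2.0, 0.0]))  # agreement: confidence in token 0 increases
print(combine([2.0, 0.0], [0.0, 2.0]))  # disagreement: contrastive branch is used
```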

Claim

While recent Large Vision-Language Models (LVLMs) have shown remarkable performance in multi-modal tasks, they are prone to generating hallucinatory text responses that do not align with the given visual input, which restricts their practical applicability in real-world scenarios.

Understanding and Mitigating Hallucination in Large Vision-Language Models via Modular Attribution and Intervention

Authors: Tianyun Yang, Ziniu Li, Juan Cao, Chang Xu
Year: 2025
Venue: International Conference on Learning Representations
DOI: Not stated.
Citations: 40
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, large vision language models).
Code: Not found.
Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.

Intervening Anchor Token: Decoding Strategy in Alleviating Hallucinations for MLLMs

Authors: Feilong Tang, Zile Huang, Chengzhi Liu, Qiang Sun, Harry Yang, Sernam Lim
Year: 2025
Venue: International Conference on Learning Representations
DOI: Not stated.
Citations: 32
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: mllms).
Code: Not found.
Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.

C-CLIP: Multimodal Continual Learning for Vision-Language Model

Authors: Wenzhuo Liu, Fei Zhu, Longhui Wei, Qi Tian
Year: 2025
Venue: International Conference on Learning Representations
DOI: Not stated.
Citations: 32
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language model).
Code: Not found.
Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.

Flow: Modularized Agentic Workflow Automation

Authors: Boye Niu, Yiliao Song, Kai Lian, Yifan Shen, Yu Yao, Kun Zhang, Tongliang Liu
Year: 2025
Venue: International Conference on Learning Representations
DOI: Not stated.
Citations: 31
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: LLM agents (matched: agentic, llm agents).
Code: https://github.com/tmllab/2025_ICLR_FLOW (per abstract)
Extraction: method/data pending

Abstract

Multi-agent frameworks powered by large language models (LLMs) have demonstrated great success in automated planning and task execution. However, the effective adjustment of agentic workflows during execution has not been well studied. An effective workflow adjustment is crucial in real-world scenarios, as the initial plan must adjust to unforeseen challenges and changing conditions in real time to ensure the efficient execution of complex tasks. In this paper, we define workflows as an activity-on-vertex (AOV) graph, which allows continuous workflow refinement by LLM agents through dynamic subtask allocation adjustment based on historical performance and previous AOVs. To further enhance framework performance, we emphasize modularity in workflow design based on evaluating parallelism and dependency complexity. With this design, our proposed multi-agent framework achieves efficient concurrent execution of subtasks, effective goal achievement, and enhanced error tolerance. Empirical results across various practical tasks demonstrate significant improvements in the efficiency of multi-agent frameworks through dynamic workflow refinement and modularization. The code is available at: https://github.com/tmllab/2025_ICLR_FLOW.
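An activity-on-vertex (AOV) graph makes it easy to see which subtasks can run concurrently. The sketch below uses a standard layered topological sort (Kahn's algorithm) to group subtasks into parallel batches; the task names are hypothetical, and this is a generic illustration of the data structure rather than the scheduling logic of the Flow framework.

```python
from typing import Dict, List


def parallel_batches(deps: Dict[str, List[str]]) -> List[List[str]]:
    """Group AOV-graph vertices into batches; each batch depends only on earlier ones.

    `deps` maps every task to its prerequisites. Every prerequisite must itself
    appear as a key, otherwise it is reported as an unsatisfiable dependency.
    """
    remaining = {task: set(pre) for task, pre in deps.items()}
    batches: List[List[str]] = []
    while remaining:
        ready = sorted(t for t, pre in remaining.items() if not pre)
        if not ready:
            raise ValueError("cycle or missing prerequisite: not a valid AOV graph")
        batches.append(ready)
        for t in ready:
            del remaining[t]
        for pre in remaining.values():
            pre.difference_update(ready)
    return batches


workflow = {
    "spec": [],
    "frontend": ["spec"],
    "backend": ["spec"],
    "integrate": ["frontend", "backend"],
}
print(parallel_batches(workflow))
# [['spec'], ['backend', 'frontend'], ['integrate']]
```

Fewer, wider batches indicate more parallelism, which is one simple way to compare candidate workflows for the same goal.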

Claim

Multi-agent frameworks powered by large language models (LLMs) have demonstrated great success in automated planning and task execution.

Robotouille: An Asynchronous Planning Benchmark for LLM Agents

Authors: Gonzalo Gonzalez-Pumariega, Leong Su Yean, Neha Sunkara, Sanjiban Choudhury
Year: 2025
Venue: International Conference on Learning Representations
DOI: 10.48550/arXiv.2502.05227
Citations: 30
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: LLM agents (matched: llm agents).
Code: https://github.com/portal-cornell/robotouille (per abstract)
Extraction: method/data pending

Abstract

Effective asynchronous planning, or the ability to efficiently reason and plan over states and actions that must happen in parallel or sequentially, is essential for agents that must account for time delays, reason over diverse long-horizon tasks, and collaborate with other agents. While large language model (LLM) agents show promise in high-level task planning, current benchmarks focus primarily on short-horizon tasks and do not evaluate such asynchronous planning capabilities. We introduce Robotouille, a challenging benchmark environment designed to test LLM agents' ability to handle long-horizon asynchronous scenarios. Our synchronous and asynchronous datasets capture increasingly complex planning challenges that go beyond existing benchmarks, requiring agents to manage overlapping tasks and interruptions. Our results show that ReAct (GPT-4o) achieves 47% on synchronous tasks but only 11% on asynchronous tasks, highlighting significant room for improvement. We further analyze failure modes, demonstrating the need for LLM agents to better incorporate long-horizon feedback and self-audit their reasoning during task execution. Code is available at https://github.com/portal-cornell/robotouille.

Claim

Effective asynchronous planning, or the ability to efficiently reason and plan over states and actions that must happen in parallel or sequentially, is essential for agents that must account for time delays, reason over diverse long-horizon tasks, and collaborate with other agents.

RefactorBench: Evaluating Stateful Reasoning in Language Agents Through Code

Authors: Dhruv Gautam, Spandan Garg, Jinu Jang, Neel Sundaresan, Roshanak Zilouchian Moghaddam
Year: 2025
Venue: International Conference on Learning Representations
DOI: 10.48550/arXiv.2503.07832
Citations: 25
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: LLM agents (matched: lm agents, language agents).
Code: Not found.
Extraction: method/data pending

Abstract

Recent advances in language model (LM) agents and function calling have enabled autonomous, feedback-driven systems to solve problems across various digital domains. To better understand the unique limitations of LM agents, we introduce RefactorBench, a benchmark consisting of 100 large handcrafted multi-file refactoring tasks in popular open-source repositories. Solving tasks within RefactorBench requires thorough exploration of dependencies across multiple files and strong adherence to relevant instructions. Every task is defined by 3 natural language instructions of varying specificity and is mutually exclusive, allowing for the creation of longer combined tasks on the same repository. Baselines on RefactorBench reveal that current LM agents struggle with simple compositional tasks, solving only 22% of tasks with base instructions, in contrast to a human developer with short time constraints solving 87%. Through trajectory analysis, we identify various unique failure modes of LM agents, and further explore the failure mode of tracking past actions. By adapting a baseline agent to condition on representations of state, we achieve a 43.9% improvement in solving RefactorBench tasks. We further extend our state-aware approach to encompass entire digital environments and outline potential directions for future research. RefactorBench aims to support the study of LM agents by providing a set of real-world, multi-hop tasks within the realm of code.

Claim

Recent advances in language model (LM) agents and function calling have enabled autonomous, feedback-driven systems to solve problems across various digital domains.

Painting with Words: Elevating Detailed Image Captioning with Benchmark and Alignment Learning

Authors: Qinghao Ye, Xianhan Zeng, Fu Li, Chunyuan Li, Haoqi Fan
Year: 2025
Venue: International Conference on Learning Representations
DOI: 10.48550/arXiv.2503.07906
Citations: 24
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, vlms).
Code: Not found.
Extraction: method/data pending

Abstract

Image captioning has long been a pivotal task in visual understanding, with recent advancements in vision-language models (VLMs) significantly enhancing the ability to generate detailed image captions. However, the evaluation of detailed image captioning remains underexplored due to outdated evaluation metrics and coarse annotations. In this paper, we introduce DeCapBench along with a novel metric, DCScore, specifically designed for detailed captioning tasks. DCScore evaluates hallucinations and fine-grained comprehensiveness by deconstructing responses into the smallest self-sufficient units, termed primitive information units, and assessing them individually. Our evaluation shows that DCScore aligns more closely with human judgment than other rule-based or model-based metrics. Concurrently, DeCapBench exhibits a high correlation with VLM arena results on descriptive tasks, surpassing existing benchmarks for vision-language models. Additionally, we present an automatic fine-grained feedback collection method, FeedQuill, for preference optimization based on our advanced metric, showing robust generalization capabilities across auto-generated preference data. Extensive experiments on multiple VLMs demonstrate that our method not only significantly reduces hallucinations but also enhances performance across various benchmarks, achieving superior detail captioning performance while surpassing GPT-4o.
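The idea of scoring a caption unit by unit, penalizing hallucinated units and rewarding coverage, can be sketched as a precision/recall computation over sets of primitive information units. The abstract does not give DCScore's formula, so the F1 combination and the exact-match comparison below are assumptions; in practice a model would judge each unit rather than string equality.

```python
from typing import Set, Tuple


def unit_f1(caption_units: Set[str], reference_units: Set[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) over primitive information units."""
    correct = caption_units & reference_units
    # Precision falls when the caption contains unsupported (hallucinated) units.
    precision = len(correct) / len(caption_units) if caption_units else 0.0
    # Recall falls when the caption omits units that are true of the image.
    recall = len(correct) / len(reference_units) if reference_units else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical units for one image.
caption = {"a dog", "dog is brown", "dog wears a hat"}
reference = {"a dog", "dog is brown", "grass field", "frisbee in mouth"}
p, r, f1 = unit_f1(caption, reference)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.667 0.5 0.571
```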

Claim

Image captioning has long been a pivotal task in visual understanding, with recent advancements in vision-language models (VLMs) significantly enhancing the ability to generate detailed image captions.

Planning in Natural Language Improves LLM Search for Code Generation

Authors: Evan Z. Wang, Federico Cassano, Catherine Wu, Yunfeng Bai, Will Song, Vaskar Nath, Ziwen Han, Sean M. Hendryx, Summer Yue, Hugh Zhang
Year: 2025
Venue: International Conference on Learning Representations
DOI: Not stated.
Citations: 23
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: program synthesis (matched: code generation).
Code: Not found.
Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.

Apollo-MILP: An Alternating Prediction-Correction Neural Solving Framework for Mixed-Integer Linear Programming

Authors: Haoyang Liu, Jie Wang, Zijie Geng, Xijun Li, Yuxuan Zong, Fangzhou Zhu, Jianye Hao, Feng Wu
Year: 2025
Venue: International Conference on Learning Representations
DOI: 10.48550/arXiv.2503.01129
Citations: 21
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: program synthesis (matched: programming).
Code: Not found.
Extraction: method/data pending

Abstract

Leveraging machine learning (ML) to predict an initial solution for mixed-integer linear programming (MILP) has gained considerable popularity in recent years. These methods predict a solution and fix a subset of variables to reduce the problem dimension. Then, they solve the reduced problem to obtain the final solutions. However, directly fixing variable values can lead to low-quality solutions or even infeasible reduced problems if the predicted solution is not accurate enough. To address this challenge, we propose an Alternating prediction-correction neural solving framework (Apollo-MILP) that can identify and select accurate and reliable predicted values to fix. In each iteration, Apollo-MILP conducts a prediction step for the unfixed variables, followed by a correction step to obtain an improved solution (called reference solution) through a trust-region search. By incorporating the predicted and reference solutions, we introduce a novel Uncertainty-based Error upper BOund (UEBO) to evaluate the uncertainty of the predicted values and fix those with high confidence. A notable feature of Apollo-MILP is the superior ability for problem reduction while preserving optimality, leading to high-quality final solutions. Experiments on commonly used benchmarks demonstrate that our proposed Apollo-MILP significantly outperforms other ML-based approaches in terms of solution quality, achieving over a 50% reduction in the solution gap.
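The step of fixing only trustworthy variables can be illustrated with a simple agreement rule: fix a binary variable when the rounded prediction matches the reference solution and the prediction is far from 0.5. This rule and its `margin` parameter are a simplified stand-in chosen for the example; the abstract names an uncertainty bound (UEBO) but does not define it, and the code does not attempt to reproduce it.

```python
from typing import Dict


def select_fixings(pred: Dict[str, float], ref: Dict[str, int],
                   margin: float = 0.4) -> Dict[str, int]:
    """Choose binary variables to fix before solving the reduced problem.

    pred: predicted probability that each variable equals 1.
    ref:  value of each variable in the reference (corrected) solution.
    """
    fixed = {}
    for name, prob in pred.items():
        rounded = 1 if prob >= 0.5 else 0
        confident = abs(prob - 0.5) >= margin
        if rounded == ref[name] and confident:
            fixed[name] = rounded
    return fixed


# Hypothetical predictions and reference solution for four binary variables.
pred = {"x1": 0.97, "x2": 0.55, "x3": 0.04, "x4": 0.92}
ref = {"x1": 1, "x2": 1, "x3": 0, "x4": 0}
print(select_fixings(pred, ref))  # {'x1': 1, 'x3': 0}
```

Here `x2` is left free because the prediction is uncertain, and `x4` is left free because the prediction and the reference disagree; both remain decision variables in the next iteration.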

Claim

Leveraging machine learning (ML) to predict an initial solution for mixed-integer linear programming (MILP) has gained considerable popularity in recent years.

Are Large Vision Language Models Good Game Players?

Authors: Xinyu Wang, Bohan Zhuang, Qi Wu
Year: 2025
Venue: International Conference on Learning Representations
DOI: 10.48550/arXiv.2503.02358
Citations: 19
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, lvlm, large vision language models).
Code: https://github.com/xinke-wang/LVLM-Playground (per abstract)
Extraction: method/data pending

Abstract

Large Vision Language Models (LVLMs) have demonstrated remarkable abilities in understanding and reasoning about both visual and textual information. However, existing evaluation methods for LVLMs, primarily based on benchmarks like Visual Question Answering and image captioning, often fail to capture the full scope of LVLMs' capabilities. These benchmarks are limited by issues such as inadequate assessment of detailed visual perception, data contamination, and a lack of focus on multi-turn reasoning. To address these challenges, we propose LVLM-Playground, a game-based evaluation framework designed to provide a comprehensive assessment of LVLMs' cognitive and reasoning skills in structured environments. LVLM-Playground uses a set of games to evaluate LVLMs on four core tasks: Perceiving, Question Answering, Rule Following, and End-to-End Playing, with each target task designed to assess specific abilities, including visual perception, reasoning, decision-making, etc. Based on this framework, we conduct extensive experiments that explore the limitations of current LVLMs, such as handling long structured outputs and perceiving detailed and dense elements. Code and data are publicly available at https://github.com/xinke-wang/LVLM-Playground.

Claim

Large Vision Language Models (LVLMs) have demonstrated remarkable abilities in understanding and reasoning about both visual and textual information.

Learning to Contextualize Web Pages for Enhanced Decision Making by LLM Agents

Authors: Dongjun Lee, Juyong Lee, Kyuyoung Kim, Jihoon Tack, Jinwoo Shin, Yee Whye Teh, Kimin Lee
Year: 2025
Venue: International Conference on Learning Representations
DOI: 10.48550/arXiv.2503.10689
Citations: 17
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: LLM agents (matched: llm agents).
Code: https://lcowiclr2025.github.io (project page)
Extraction: method/data pending

Abstract

Recent advances in large language models (LLMs) have led to a growing interest in developing LLM-based agents for automating web tasks. However, these agents often struggle with even simple tasks on real-world websites due to their limited capability to understand and process complex web page structures. In this work, we introduce LCoW, a framework for Learning language models to Contextualize complex Web pages into a more comprehensible form, thereby enhancing decision making by LLM agents. LCoW decouples web page understanding from decision making by training a separate contextualization module to transform complex web pages into a comprehensible format, which is then utilized by the decision-making agent. We demonstrate that our contextualization module effectively integrates with LLM agents of various scales to significantly enhance their decision-making capabilities in web automation tasks. Notably, LCoW improves the success rates of closed-source LLMs (e.g., Gemini-1.5-flash, GPT-4o, Claude-3.5-Sonnet) by an average of 15.6%, and demonstrates a 23.7% average improvement in success rates for open-source LMs (e.g., Llama-3.1-8B, Llama-3.1-70B) on the WorkArena benchmark. Moreover, the Gemini-1.5-flash agent with LCoW achieves state-of-the-art results on the WebShop benchmark, outperforming human experts. The relevant code materials are available at our project page: https://lcowiclr2025.github.io.

Claim

Recent advances in large language models (LLMs) have led to a growing interest in developing LLM-based agents for automating web tasks.

DAMO: Decoding by Accumulating Activations Momentum for Mitigating Hallucinations in Vision-Language Models Paper

Authors: Kaishen Wang, Hengrui Gu, Meijun Gao, Kaixiong Zhou
Year: 2025
Venue: International Conference on Learning Representations
DOI: Not stated.
Citations: 16
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models).
Code: Not found.
Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.

OS-ATLAS: Foundation Action Model for Generalist GUI Agents Paper

Authors: Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, P. Liang, Yu Qiao
Year: 2025
Venue: International Conference on Learning Representations
DOI: Not stated.
Citations: 16
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: computer-use agents (matched: gui agents).
Code: Not found.
Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.

REMEDY: Recipe Merging Dynamics in Large Vision-Language Models Paper

Authors: Didi Zhu, Yibing Song, Tao Shen, Ziyu Zhao, Jinluan Yang, Min Zhang, Chao Wu
Year: 2025
Venue: International Conference on Learning Representations
DOI: Not stated.
Citations: 14
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, large vision language models).
Code: Not found.
Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.

ConvCodeWorld: Benchmarking Conversational Code Generation in Reproducible Feedback Environments Paper

Authors: Hojae Han, Seung-won Hwang, Rajhans Samdani, Yuxiong He
Year: 2025
Venue: International Conference on Learning Representations
DOI: 10.48550/arXiv.2502.19852
Citations: 13
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: program synthesis (matched: code generation).
Code: https://huggingface.co/spaces/ConvCodeWorld/ConvCodeWorld
Extraction: method/data pending

Abstract

Large language models (LLMs) have proven invaluable for code generation, particularly in interactive settings. However, existing code generation benchmarks fail to capture the diverse feedback encountered in multi-turn interactions, limiting our ability to evaluate LLMs in these contexts. To address this gap, we present a set of novel benchmarks that explicitly model the quality of feedback provided to code generation LLMs. Our contributions are threefold: First, we introduce CONVCODEWORLD, a novel and reproducible environment for benchmarking interactive code generation. CONVCODEWORLD simulates 9 distinct interactive code generation scenarios while systematically combining three types of feedback: (a) compilation feedback; (b) execution feedback with varying test coverage; (c) verbal feedback generated by GPT-4o with different levels of expertise. Second, we introduce CONVCODEBENCH, a fast, static version of the benchmark that uses pre-generated feedback logs, eliminating the need for costly dynamic verbal feedback generation while maintaining strong Spearman's rank correlations (0.82 to 0.99) with CONVCODEWORLD. Third, extensive evaluations of both closed-source and open-source LLMs including R1-Distill on CONVCODEWORLD reveal key insights: (a) LLM performance varies significantly based on the feedback provided; (b) Weaker LLMs, with sufficient feedback, can outperform single-turn results of state-of-the-art LLMs without feedback; (c) Training on a specific feedback combination can limit an LLM's ability to utilize unseen combinations; (d) LLMs that solve problems in fewer turns (high MRR) may not solve as many problems overall (high Recall), and vice versa. All implementations and benchmarks will be made publicly available at https://huggingface.co/spaces/ConvCodeWorld/ConvCodeWorld
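Survey note (illustrative, not from the paper): insight (d) above contrasts MRR with Recall. The sketch below uses the common textbook definitions, which are an assumption here, since the abstract does not spell out the exact formulas. Each problem is represented by the 1-indexed turn at which it was first solved, or None if it was never solved within the turn budget. The names `mrr`, `recall`, `fast_model`, and `slow_model` are hypothetical.

```python
from typing import Optional, Sequence


def mrr(first_solved_turn: Sequence[Optional[int]]) -> float:
    """Mean reciprocal rank: average of 1/turn over all problems.

    Unsolved problems (None) contribute 0 to the sum.
    """
    total = sum(1.0 / t for t in first_solved_turn if t is not None)
    return total / len(first_solved_turn)


def recall(first_solved_turn: Sequence[Optional[int]]) -> float:
    """Fraction of problems solved at any turn within the budget."""
    solved = sum(1 for t in first_solved_turn if t is not None)
    return solved / len(first_solved_turn)


# Hypothetical outcomes on four problems.
fast_model = [1, 1, None, None]  # solves few problems, but on the first turn
slow_model = [3, 3, 3, None]     # solves more problems, but only on turn 3

print(mrr(fast_model), recall(fast_model))
print(mrr(slow_model), recall(slow_model))
```

Under these definitions the fast model has the higher MRR and the slow model has the higher Recall, which is the kind of trade-off the abstract describes.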

Claim

Large language models (LLMs) have proven invaluable for code generation, particularly in interactive settings.

BadRobot: Jailbreaking Embodied LLM Agents in the Physical World Paper

Authors: Hangtao Zhang, Chenyu Zhu, Xianlong Wang, Ziqi Zhou, Changgan Yin, Minghui Li, Lulu Xue, Yichen Wang, Shengshan Hu, Aishan Liu, Peijin Guo, Leo Yu Zhang
Year: 2025
Venue: International Conference on Learning Representations
DOI: Not stated.
Citations: 13
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: LLM agents (matched: llm agents).
Code: Not found.
Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.

RA-TTA: Retrieval-Augmented Test-Time Adaptation for Vision-Language Models Paper

Authors: Youngjun Lee, Doyoung Kim, Junhyeok Kang, Jihwan Bang, Hwanjun Song, Jae-Gil Lee
Year: 2025
Venue: International Conference on Learning Representations
DOI: Not stated.
Citations: 13
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models).
Code: Not found.
Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.

Anyprefer: An Agentic Framework for Preference Data Synthesis Paper

Authors: Yiyang Zhou, Zhaoyang Wang, Tianle Wang, Shangyu Xing, Peng Xia, Bo Li, Kaiyuan Zheng, Zijian Zhang, Zhaorun Chen, Wenhao Zheng, Xuchao Zhang, Chetan Bansal, et al.
Year: 2025
Venue: International Conference on Learning Representations
DOI: 10.48550/arXiv.2504.19276
Citations: 11
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: LLM agents (matched: agentic).
Code: Not found.
Extraction: method/data pending

Abstract

High-quality preference data is essential for aligning foundation models with human values through preference learning. However, manual annotation of such data is often time-consuming and costly. Recent methods often adopt a self-rewarding approach, where the target model generates and annotates its own preference data, but this can lead to inaccuracies since the reward model shares weights with the target model, thereby amplifying inherent biases. To address these issues, we propose Anyprefer, a framework designed to synthesize high-quality preference data for aligning the target model. Anyprefer frames the data synthesis process as a cooperative two-player Markov Game, where the target model and the judge model collaborate together. Here, a series of external tools are introduced to assist the judge model in accurately rewarding the target model's responses, mitigating biases in the rewarding process. In addition, a feedback mechanism is introduced to optimize prompts for both models, enhancing collaboration and improving data quality. The synthesized data is compiled into a new preference dataset, Anyprefer-V1, consisting of 58K high-quality preference pairs. Extensive experiments show that Anyprefer significantly improves model alignment performance across four main applications, covering 21 datasets, achieving average improvements of 18.55% in five natural language generation datasets, 3.66% in nine vision-language understanding datasets, 30.05% in three medical image analysis datasets, and 16.00% in four visuo-motor control tasks.

Claim

High-quality preference data is essential for aligning foundation models with human values through preference learning.

Digi-Q: Learning VLM Q-Value Functions for Training Device-Control Agents Paper

Authors: Hao Bai, Yifei Zhou, Erran L. Li, Sergey Levine, Aviral Kumar
Year: 2025
Venue: International Conference on Learning Representations
DOI: Not stated.
Citations: 11
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm).
Code: Not found.
Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.

Privacy-Preserving Personalized Federated Prompt Learning for Multimodal Large Language Models Paper

Authors: Linh Tran, Wei Sun, Stacy Patterson, Ana Milanova
Year: 2025
Venue: International Conference on Learning Representations
DOI: 10.48550/arXiv.2501.13904
Citations: 10
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, multimodal large language models).
Code: Not found.
Extraction: method/data pending

Abstract

Multimodal Large Language Models (LLMs) are pivotal in revolutionizing customer support and operations by integrating multiple modalities such as text, images, and audio. Federated Prompt Learning (FPL) is a recently proposed approach that combines pre-trained multimodal LLMs such as vision-language models with federated learning to create personalized, privacy-preserving AI systems. However, balancing the competing goals of personalization, generalization, and privacy remains a significant challenge. Over-personalization can lead to overfitting, reducing generalizability, while stringent privacy measures, such as differential privacy, can hinder both personalization and generalization. In this paper, we propose a Differentially Private Federated Prompt Learning (DP-FPL) approach to tackle this challenge by leveraging a low-rank factorization scheme to capture generalization while maintaining a residual term that preserves expressiveness for personalization. To ensure privacy, we introduce a novel method where we apply local differential privacy to the two low-rank components of the local prompt, and global differential privacy to the global prompt. Our approach mitigates the impact of privacy noise on the model performance while balancing the tradeoff between personalization and generalization. Extensive experiments demonstrate the effectiveness of our approach over other benchmarks.
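Survey note (illustrative, not from the paper): the abstract describes a local prompt built from two low-rank components plus a residual term, with differential privacy applied to the low-rank components. The sketch below shows one plausible reading, assuming the prompt is reconstructed as U·V + R and that privacy is enforced with the standard clip-then-add-Gaussian-noise step on the updates of U and V. The function names, the choice to noise gradients rather than parameters, and the noise scale are all assumptions. Calibrating `noise_std` to a concrete (epsilon, delta) budget is not shown.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def add(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def local_prompt(u: Matrix, v: Matrix, residual: Matrix) -> Matrix:
    """Assumed reconstruction: low-rank part U @ V plus a residual term."""
    return add(matmul(u, v), residual)


def clip_and_noise(grad: Matrix, clip_norm: float, noise_std: float,
                   rng: random.Random) -> Matrix:
    """Gaussian mechanism: clip the Frobenius norm, then add per-entry noise."""
    norm = math.sqrt(sum(x * x for row in grad for x in row))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [[x * scale + rng.gauss(0.0, noise_std) for x in row]
            for row in grad]


rng = random.Random(0)
u = [[1.0], [2.0]]            # 2 x 1 factor (hypothetical values)
v = [[1.0, 0.0]]              # 1 x 2 factor
r = [[0.0, 1.0], [0.0, 0.0]]  # residual kept for personalization
print(local_prompt(u, v, r))
print(clip_and_noise([[3.0, 4.0]], clip_norm=1.0, noise_std=0.1, rng=rng))
```

Clipping bounds each update's sensitivity, and that bound is what lets the added Gaussian noise translate into a privacy guarantee.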

Claim

Multimodal Large Language Models (LLMs) are pivotal in revolutionizing customer support and operations by integrating multiple modalities such as text, images, and audio.

Do You Keep an Eye on What I Ask? Mitigating Multimodal Hallucination via Attention-Guided Ensemble Decoding Paper

Authors: Yeongjae Cho, Keonwoo Kim, Taebaek Hwang, Sungzoon Cho
Year: 2025
Venue: International Conference on Learning Representations
DOI: 10.48550/arXiv.2505.17529
Citations: 10
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, large vision language models, lvlms).
Code: Not found.
Extraction: method/data pending

Abstract

Recent advancements in Large Vision-Language Models (LVLMs) have significantly expanded their utility in tasks like image captioning and visual question answering. However, they still struggle with object hallucination, where models generate descriptions that inaccurately reflect the visual content by including nonexistent objects or misrepresenting existing ones. While previous methods, such as data augmentation and training-free approaches, strive to tackle this issue, they still encounter scalability challenges and often depend on additional external modules. In this work, we propose Ensemble Decoding (ED), a novel strategy that splits the input image into sub-images and combines logit distributions by assigning weights through the attention map. Furthermore, we introduce ED adaptive plausibility constraint to calibrate logit distribution and FastED, a variant designed for speed-critical applications. Extensive experiments across hallucination benchmarks demonstrate that our proposed method achieves state-of-the-art performance, validating the effectiveness of our approach.
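Survey note (illustrative, not from the paper): the abstract says Ensemble Decoding splits the image into sub-images and combines their logit distributions with weights taken from the attention map. The sketch below shows one simple way to do that, assuming each sub-image yields a next-token logit vector and a scalar attention mass, and that the combination is a weighted sum of log-probabilities. The paper's exact weighting and its adaptive plausibility constraint are not reproduced, and all names are hypothetical.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one logit vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def ensemble_log_probs(sub_image_logits: Sequence[Sequence[float]],
                       attention_mass: Sequence[float]) -> List[float]:
    """Weighted sum of per-sub-image log-probabilities.

    Weights are the attention masses normalized to sum to 1 (an assumption).
    """
    total = sum(attention_mass)
    if total <= 0:
        raise ValueError("attention_mass must have a positive sum")
    weights = [a / total for a in attention_mass]
    log_probs = [log_softmax(l) for l in sub_image_logits]
    vocab_size = len(sub_image_logits[0])
    return [sum(w * lp[v] for w, lp in zip(weights, log_probs))
            for v in range(vocab_size)]


# Two sub-images over a toy 3-token vocabulary.
logits = [[2.0, 0.0, 0.0],   # sub-image A favors token 0
          [0.0, 2.0, 0.0]]   # sub-image B favors token 1
combined = ensemble_log_probs(logits, attention_mass=[0.9, 0.1])
best_token = max(range(len(combined)), key=lambda v: combined[v])
print(best_token)
```

With most of the attention mass on sub-image A, the combined distribution follows A's preferred token. Swapping the weights shifts the choice to B's.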

Claim

Recent advancements in Large Vision-Language Models (LVLMs) have significantly expanded their utility in tasks like image captioning and visual question answering.