ICML 2026 Spotlight: Finance × Agent × Vision
ICML 2026 (San Francisco, July 2026) is not yet indexed on Semantic Scholar, so the regular top-6 generator cannot reach it. This page is hand-curated from the icml.cc virtual paper index (6636 accepted papers) along three lenses relevant to the broader Alpha-Search agenda:
- Finance (8): financial agents, financial LLMs, portfolio / forecasting / explainability.
- Agent (11): LLM agents, multi-agent systems, code agents, language agents, web agents.
- Vision (10): vision-language(-action) models, multimodal LLMs, embodied perception, vision benchmarks.
Each card links to the canonical ICML poster page. Titles, authors, and abstracts were scraped from icml.cc on 2026-06-06; data may be refreshed as the venue updates metadata. Total: 29 papers.
Finance (8 papers)
Financial Agents & Benchmarks
Towards Professional-Grade Financial Agents: Benchmarking, Tooling, and Structured Reasoning Poster
Abstract
Financial reasoning requires precise execution. While Large Language Model (LLM) agents have shown encouraging progress in financial reasoning, their effectiveness in realistic financial workflows is severely hindered by the lack of holistic benchmarks and the fragility of unstructured reasoning. To evaluate these capabilities, we introduce ProFinR, the first Professional Finance Reasoning benchmark that covers four types of financial tasks, comprising 528 expert-designed tasks. To solve these complex financial reasoning questions, we construct Financial Tool Universe, a tool library containing 53 domain-specific tools organized into 13 categories. Building on the tool library, we introduce ProFinAgents, a structured agent framework based on Directed Acyclic Graph (DAG) and Case-Based Memory (CBM). Compared with strictly sequential workflows, ProFinAgent coordinates tool execution through DAG. This allows for parallel execution and reduces latency compared to serial pipelines. Furthermore, the CBM component refines decision-making over time by retrieving prior cases to mitigate reasoning failures. Experimental results demonstrate that ProFinAgent achieves a 49.81% performance gain over state-of-the-art baselines with a 47.1% reduction in inference latency.
Claim
Financial reasoning requires precise execution.
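The latency argument in this abstract can be illustrated with a small scheduling sketch. It is not ProFinAgent's implementation: the tool names, dependencies, and durations below are hypothetical, and the sketch assumes unlimited parallel workers. It only compares the cost of a serial pipeline with the critical-path cost of a DAG in which every tool starts as soon as its dependencies finish.

```python
from graphlib import TopologicalSorter

# Hypothetical tool graph: each tool maps to the set of tools it depends on.
deps = {
    "fetch_prices": set(),
    "fetch_filings": set(),
    "compute_returns": {"fetch_prices"},
    "compute_ratios": {"fetch_filings"},
    "write_report": {"compute_returns", "compute_ratios"},
}

# Hypothetical per-tool latency in seconds.
cost = {
    "fetch_prices": 2.0,
    "fetch_filings": 3.0,
    "compute_returns": 1.0,
    "compute_ratios": 1.0,
    "write_report": 0.5,
}


def serial_latency(cost):
    """Total time when tools run strictly one after another."""
    return sum(cost.values())


def dag_latency(deps, cost):
    """Finish time when each tool starts as soon as all its dependencies are done."""
    finish = {}
    # static_order() yields every tool after all of its dependencies.
    for tool in TopologicalSorter(deps).static_order():
        start = max((finish[d] for d in deps[tool]), default=0.0)
        finish[tool] = start + cost[tool]
    return max(finish.values())


print(serial_latency(cost), dag_latency(deps, cost))
```

In this toy graph the two fetches overlap, so the end-to-end time is set by the longest dependency chain rather than by the sum of all tool latencies.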
BizFinBench.v2: Towards Reliable LLMs in Finance via Real-User Data and Offline/Online Bilingual Evaluation Poster
Abstract
Large language models are becoming increasingly significant in financial applications. Nevertheless, prevailing benchmarks are largely dependent on simulated or generic data, which leads to a significant gap between reported performance and actual efficacy in real-world scenarios. To tackle this challenge, we present BizFinBench.v2, the first integrated offline and online benchmark built upon authentic user query-response data from both Chinese and U.S. equity markets. It comprises 28,860 questions across eight offline and two online tasks. Experimental results show that GPT-5 achieves a mere 61.5% accuracy, still failing to meet the practical business requirement (84.8%). Among the evaluated commercial models, DeepSeek-R1 exhibits superior investment efficacy. Error analysis grounded in real financial practice reveals persistent limitations in existing models. By overcoming the constraints of prior benchmarks, BizFinBench.v2 provides a substantiated foundation for advancing LLM deployment in the financial sector.
Claim
Large language models are becoming increasingly significant in financial applications.
Financial Forecasting & Supervision
The Label Horizon Paradox: Rethinking Supervision Targets in Financial Forecasting Poster
Abstract
While deep learning has revolutionized financial forecasting through sophisticated architectures, the design of the supervision signal itself is rarely scrutinized. We challenge the canonical assumption that training labels must strictly mirror inference targets, uncovering the Label Horizon Paradox: the optimal supervision signal often deviates from the prediction goal, shifting across intermediate horizons governed by market dynamics. We theoretically ground this phenomenon in a dynamic signal-noise trade-off, demonstrating that generalization hinges on the competition between marginal signal realization and noise accumulation. To operationalize this insight, we propose a bi-level optimization framework that autonomously identifies the optimal proxy label within a single training run. Extensive experiments on large-scale financial datasets demonstrate consistent improvements over conventional baselines, thereby opening new avenues for label-centric research in financial forecasting.
Claim
While deep learning has revolutionized financial forecasting through sophisticated architectures, the design of the supervision signal itself is rarely scrutinized.
Position Papers: Financial AI Evaluation & Explainability
Position: Evaluating LLMs in Finance Requires Explicit Bias Consideration Poster
Abstract
Large Language Models (LLMs) are increasingly integrated into financial workflows, but evaluation practice has not kept up. Finance-specific biases can inflate performance, contaminate backtests, and make reported results useless for any deployment claim. We identify five recurring biases in financial LLM applications. They include look-ahead bias, survivorship bias, narrative bias, objective bias, and cost bias. These biases break financial tasks in distinct ways and they often compound to create an illusion of validity. We reviewed 164 papers from 2023 to 2025 and found that no single bias is discussed in more than 28 percent of studies. This position paper argues that bias in financial LLM systems requires explicit attention and that structural validity should be enforced before any result is used to support a deployment claim. We propose a Structural Validity Framework and an evaluation checklist with minimal requirements for bias diagnosis and future system design. The material is available at https://anonymous.4open.science/r/Fin-LLM-Checklists-8557/.
Claim
Large Language Models (LLMs) are increasingly integrated into financial workflows, but evaluation practice has not kept up.
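Of the five biases named in this abstract, look-ahead bias is the easiest to show in code. The sketch below is an illustration only, not the paper's checklist: the record layout, the `published` field, and both helper functions are hypothetical. It filters a prompt context down to what was public before the decision date and flags a context that contains later material.

```python
from datetime import date

# Hypothetical news records; "published" is the date the text became public.
records = [
    {"ticker": "AAA", "published": date(2024, 1, 10), "text": "guidance raised"},
    {"ticker": "AAA", "published": date(2024, 2, 1), "text": "earnings beat"},
    {"ticker": "AAA", "published": date(2024, 2, 20), "text": "CEO resigns"},
]


def point_in_time(records, as_of):
    """Return only the records that were public strictly before `as_of`."""
    return [r for r in records if r["published"] < as_of]


def leaks_future(records, as_of):
    """True if any record in the context is dated on or after the decision date."""
    return any(r["published"] >= as_of for r in records)


decision_day = date(2024, 2, 1)
context = point_in_time(records, decision_day)
print(len(context), leaks_future(records, decision_day), leaks_future(context, decision_day))
```

A filter like this only controls what is placed in the prompt. It does not address information the model may already have absorbed from text published after the decision date.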
Position: Current XAI Methods Cannot Satisfy Financial AI Explainability Requirements Poster
Abstract
This position paper argues that current explainable AI (XAI) methods cannot satisfy regulatory explainability requirements for LLM-based financial systems, creating a fundamental incompatibility between technological capability and legal mandate that threatens both consumer protection and financial stability. We demonstrate through systematic analysis across six regulatory frameworks (EU AI Act, US FSOC/CFPB, UK FCA, BIS, MAS, HKMA) that post-hoc explanation techniques fail systematically when applied to large language models. Exact SHAP computation exhibits \(O(2^F)\) complexity at token-level granularity—rendering it infeasible for transformer architectures. LIME demonstrates substantial instability, with explanation rankings varying significantly across repeated evaluations of identical inputs. Chain-of-thought prompting generates unfaithful rationalizations: in controlled experiments, only 1 of 426 biased model outputs explicitly acknowledged the biasing feature in its explanation. When models learned to exploit reward hacks, they verbalized this exploitation less than 2% of the time. With 72% of UK financial firms now using AI and over $5 trillion in US consumer credit outstanding requiring adverse action explanations, this gap creates systemic risk affecting millions of consumers who may receive inadequate explanations for consequential financial decisions. We analyze three high-stakes domains—credit, trading, advisory—with documented regulatory enforcement cases, examine six counterarguments including hybrid architectures and outcome-based regulation, and propose prioritized recommendations with quarterly timelines. The status quo constitutes regulatory compliance theater; we call for either fundamental advances in LLM interpretability or deployment constraints matching current capabilities.
Claim
This position paper argues that current explainable AI (XAI) methods cannot satisfy regulatory explainability requirements for LLM-based financial systems, creating a fundamental incompatibility between technological capability and legal mandate that threatens both consumer protection and financial stability.
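The \(O(2^F)\) cost quoted for exact SHAP comes from enumerating every coalition of the \(F\) features. The following sketch makes that count concrete for a toy case. It is a plain textbook Shapley computation, not the SHAP library's API, and the additive value function is a hypothetical stand-in for a model.

```python
from itertools import combinations
from math import factorial


def exact_shapley(value, n_features):
    """Exact Shapley values by enumerating every coalition; feasible only for tiny F."""
    cache = {}

    def v(coalition):
        # Memoise so that each distinct coalition is evaluated exactly once.
        if coalition not in cache:
            cache[coalition] = value(coalition)
        return cache[coalition]

    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(n_features):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = (
                factorial(size) * factorial(n_features - size - 1) / factorial(n_features)
            )
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (v(s | {i}) - v(s))
    return phi, len(cache)


# Hypothetical additive "model": a coalition's value is the sum of its feature weights.
weights = [1.0, 2.0, 3.0]
phi, evaluations = exact_shapley(lambda s: sum(weights[j] for j in s), 3)
print(phi, evaluations)
```

With three features the function evaluates \(2^3 = 8\) coalitions. The count doubles with every additional feature, which is why token-level exact computation on long inputs is out of reach.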
Portfolio Optimization & Conformal Methods
A Linearly Convergent Proximal Subgradient Algorithm for Sparse Portfolio Optimization with Transaction Cost Poster
Abstract
Transaction cost optimization (TCO) of online portfolio selection is crucial in computing science, due to the significant impact of transaction costs in practical short-term trading. Moreover, sparsity of the portfolio vector is often desired to enhance stability and decrease risk. However, the literature lacks models that consider transaction costs and sparsity simultaneously. In this paper, we first propose a \(K\)-sparse TCO model that minimizes the negative return and transaction costs while keeping the portfolio vector \(K\)-sparse. Noting that the model is NP-hard due to the \(K\)-sparse constraint, we bypass this difficulty by reformulating the sparse model as a nonsmooth difference of convex (DC) optimization problem. We show that the two problems are equivalent when the penalty parameter is large enough. Then, to overcome the difficulty caused by the nonsmoothness and the simplex constraint of the model, we develop a proximal subgradient algorithm (PSGA) to solve the DC problem and apply the alternating direction method of multipliers (ADMM) to compute the proximity operator of the corresponding function. Furthermore, we establish the global convergence of the entire sequence generated by PSGA by showing that the surrogate function satisfies the Kurdyka-Łojasiewicz (KL) property. In addition, by showing that the KL exponent of the surrogate function is \(1/2\), we establish the R-linear convergence rate of PSGA for an arbitrary initial point. Finally, we compare our proposed algorithm with other state-of-the-art strategies on four benchmark real-market data sets, with the numerical results showing that the proposed algorithm achieves lower risk while keeping higher return than classical TCO models.
Claim
Transaction cost optimization (TCO) of online portfolio selection is crucial in computing science, due to the significant impact of transaction costs in practical short-term trading.
Decision-focused Sparse Tangent Portfolio Optimization Poster
Abstract
Sparse tangent portfolio optimization aims to learn an interpretable, low-cardinality portfolio in the tangency direction of the mean–variance frontier, yet the associated cardinality-constrained formulation is NP-hard and standard predict-then-optimize pipelines often misalign forecasting accuracy with downstream portfolio quality. We propose an end-to-end decision-focused learning framework that reformulates Sharpe-ratio maximization as a Disciplined Parametrized Programming (DPP)-compliant convex programming layer and replaces discrete selection with a smooth top-k operator enforcing an exact sum-to-k sparsity budget. This enables gradient flow through prediction, asset selection, and re-optimization, allowing the predictive model to directly optimize the portfolio performance. Across five major equity markets, our method consistently delivers higher out-of-sample Sharpe ratios than historical and prediction-focused baselines while producing meaningful sparse selections.
Claim
Sparse tangent portfolio optimization aims to learn an interpretable, low-cardinality portfolio in the tangency direction of the mean–variance frontier, yet the associated cardinality-constrained formulation is NP-hard and standard predict-then-optimize pipelines often misalign forecasting accuracy with downstream portfolio quality.
Online Conformal Prediction via Universal Portfolio Algorithms Poster
Abstract
Online conformal prediction (OCP) seeks prediction intervals that achieve long-run \(1-\alpha\) coverage for arbitrary (possibly adversarial) data streams, while remaining as informative as possible. Existing OCP methods often require manual learning-rate tuning to work well, and may also require algorithm-specific analyses. Here, we develop a general regret-to-coverage theory for interval-valued OCP based on the \((1-\alpha)\)-pinball loss. Our first contribution is to identify linearized regret as a key notion, showing that controlling it implies coverage bounds for any online algorithm. This relies on a black-box reduction that depends only on the Fenchel conjugate of an upper bound on the linearized regret. Building on this theory, we propose UP-OCP, a parameter-free method for OCP, via a reduction to a two-asset portfolio selection problem, leveraging universal portfolio algorithms. We show strong finite-time bounds on the miscoverage of UP-OCP, even for polynomially growing predictions. Extensive experiments support that UP-OCP delivers consistently better size/coverage trade-offs than prior online conformal baselines.
Claim
Online conformal prediction (OCP) seeks prediction intervals that achieve long-run \(1-\alpha\) coverage for arbitrary (possibly adversarial) data streams, while remaining as informative as possible.
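For orientation, the sketch below shows the \((1-\alpha)\)-pinball loss together with a plain online subgradient update on it. This is the kind of learning-rate-dependent baseline the abstract contrasts against, not UP-OCP itself. The one-sided threshold, the step size `lr`, and the synthetic stream are assumptions made for illustration.

```python
def pinball_loss(q, y, tau):
    """Pinball (quantile) loss at level tau for threshold q and outcome y."""
    diff = y - q
    return tau * diff if diff >= 0 else (tau - 1.0) * diff


def online_quantile(stream, alpha, lr, q0=0.0):
    """Track a (1 - alpha) threshold with subgradient steps on the pinball loss.

    Returns the final threshold and the empirical miscoverage rate.
    """
    q, misses = q0, 0
    for y in stream:
        miss = y > q  # the one-sided prediction set (-inf, q] failed to cover y
        misses += miss
        # Subgradient step: move up by lr * (1 - alpha) on a miss, down by lr * alpha otherwise.
        q += lr * ((1.0 if miss else 0.0) - alpha)
    return q, misses / len(stream)


# Hypothetical bounded stream of scores in [0, 99].
stream = [(i * 37) % 100 for i in range(2000)]
q_final, miss_rate = online_quantile(stream, alpha=0.1, lr=1.0)
print(round(q_final, 2), round(miss_rate, 3))
```

Because each step changes `q` by `lr * (miss - alpha)`, the gap between the miss rate and `alpha` equals the total drift of `q` divided by `lr * T`. That makes long-run coverage easy to argue for bounded scores, but the quality of the intervals still depends on the hand-picked `lr`.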
Agent (11 papers)
Code & Repository Agents
Evaluating Agentic Optimization on Large Codebases Poster
Abstract
Large language model (LLM) coding agents increasingly operate at the repository level, motivating benchmarks that evaluate their ability to optimize entire codebases under realistic constraints. Existing code benchmarks largely rely on synthetic tasks, binary correctness signals, or single-objective evaluation, limiting their ability to assess holistic optimization behavior. We introduce FormulaCode, a benchmark for evaluating agentic optimization on large, real-world codebases with fine-grained, multi-objective performance metrics. FormulaCode comprises 957 performance bottlenecks mined from scientific Python repositories on GitHub, each paired with expert-authored patches and an average of 264.6 community-maintained performance workloads per task, enabling evaluation of the full optimization lifecycle (triage, diagnosis, and resolution) under realistic correctness and performance constraints. Our evaluations reveal that repository-scale, multi-objective optimization remains a major challenge for frontier LLM agents.
Claim
Large language model (LLM) coding agents increasingly operate at the repository level, motivating benchmarks that evaluate their ability to optimize entire codebases under realistic constraints.
Agora: Toward Autonomous Bug Detection in Production-Level Consensus Protocols with LLM Agents Poster
Abstract
Consensus protocols form the backbone of distributed systems and blockchains, where implementation bugs can cause data corruption and financial losses. While LLM-based approaches show promise in code analysis, they struggle with deep protocol-level logic bugs involving complex state-dependent behaviors across multiple execution stages. We present Agora, a domain-aware multi-agent framework that integrates hypothesis-driven testing with LLM capabilities for systematic protocol verification. Agora employs specialized agents that collaboratively explore protocol state spaces, synthesize attack scenarios using domain-specific constraints, and validate findings through iterative refinement. This explicit role separation enables reasoning about global protocol invariants beyond single-function code analysis. We evaluate Agora on four consensus implementations (Raft, EPaxos, HotStuff, BullShark) using four state-of-the-art LLMs. Agora discovers 15 previously unknown protocol-level logic bugs that violate safety properties, while existing LLM-based agents fail to detect any such protocol-level logic bugs. Our results demonstrate that domain-aware multi-agent collaboration is essential for detecting deep logic bugs in complex protocols.
Claim
Consensus protocols form the backbone of distributed systems and blockchains, where implementation bugs can cause data corruption and financial losses.
CoDA-Bench: Can Code Agents Handle Data-Intensive Tasks? Poster
Abstract
Advanced agents are increasingly demonstrating the potential to operate as autonomous engineers, creating a growing demand for evaluation benchmarks that capture the complexity of real-world development. Such environments typically involve both complex code and large-scale data (i.e., file system). However, existing benchmarks usually evaluate code-centric or data-centric capabilities in isolation, leaving a clear gap with real development scenarios. In this paper, we bridge this gap by introducing CoDA-Bench, the first benchmark to jointly evaluate code and data intelligence in a data-intensive environment. We construct a data-intensive Linux sandbox based on the Kaggle ecosystem (containing hundreds of datasets), where agents must actively explore complex file hierarchies to identify relevant resources and generate code for data-driven analytical tasks. CoDA-Bench comprises 1,202 tasks spanning 53 communities, with each task environment containing an average of 700 files, simulating realistic data scale and noise. Evaluations of advanced agents reveal that even top-performing systems struggle to effectively integrate data discovery with code execution, achieving a success rate of only 56.1%. These results highlight a substantial gap in current agentic capabilities for data-intensive tasks and point to promising directions for future research.
Claim
Advanced agents are increasingly demonstrating the potential to operate as autonomous engineers, creating a growing demand for evaluation benchmarks that capture the complexity of real-world development.
Multi-Agent, Memory & Long-Horizon Patterns
CausalGame: Benchmarking Causal Thinking of LLM Agents in Games Poster
Abstract
Recently, building AI Scientist agents with Large Language Models (LLMs) has received growing attention. Since scientific discovery fundamentally relies on uncovering causal relationships from observations, the capability of causal thinking, which distinguishes causation from correlation and hidden biases, is essential to LLM agents. Despite a number of existing benchmarks for AI scientists, none of them is designed with hidden biases and confounders in mind, although both are widespread in real-world scientific discovery. To this end, we present CausalGame, a benchmark that evaluates the causal thinking capabilities of LLM agents through interactive games. More specifically, we ask LLM agents to actively design experimental protocols, collect observation data, and derive a final solution with an explanation report. To emulate realistic scientific discovery challenges, we design 14 game settings that incorporate selection bias, noisy measurements, and hidden confounders. The results with 16 frontier LLM agents show that they consistently fail to reason about and recover the underlying causal relationships required to solve the games. CausalGame provides a rigorous measurement of capabilities essential to AI Scientist agents.
Claim
Recently, building AI Scientist agents with Large Language Models (LLMs) has received growing attention.
MAS-ProVe: Understanding the Process Verification of Multi-Agent Systems Poster
Abstract
Multi-Agent Systems (MAS) built on Large Language Models (LLMs) often exhibit high variance in their reasoning trajectories. Process verification, which evaluates intermediate steps in trajectories, has shown promise in general reasoning settings, and has been suggested as a potential tool for guiding coordination of MAS; however, its actual effectiveness in MAS remains unclear. To fill this gap, we present MAS-ProVe, a systematic empirical study of process verification for multi-agent systems (MAS). Our study spans three verification paradigms (LLM-as-a-Judge, reward models, and process reward models), evaluated across two levels of verification granularity (agent-level and iteration-level). We further examine five representative verifiers and four context management strategies, and conduct experiments over six diverse MAS frameworks on multiple reasoning benchmarks. We find that process-level verification does not consistently improve performance and frequently exhibits high variance, highlighting the difficulty of reliably evaluating partial multi-agent trajectories. Among the methods studied, LLM-as-a-Judge generally outperforms reward-based approaches, with trained judges surpassing general-purpose LLMs. We further observe a small performance gap between LLMs acting as judges and as single agents, and identify a context-length-performance trade-off in verification. Overall, our results suggest that effective and robust process verification for MAS remains an open challenge, requiring further advances beyond current paradigms.
Claim
Multi-Agent Systems (MAS) built on Large Language Models (LLMs) often exhibit high variance in their reasoning trajectories.
Training-Free Hierarchical Working Memory for Small Language Model Agents Poster
Abstract
Small language models (SLMs) are attractive for agent deployment, but they struggle to reliably retain and reuse decision-relevant state information over long interactions. This issue is exacerbated when working memory is maintained via unstructured natural-language summarization. Some recent work addresses this limitation by fine-tuning or distilling smaller models to better construct and utilize working memory, but such approaches typically incur substantial additional training cost and require continuous data construction. We present a training-free working-memory framework for SLM-based agents that makes decision-relevant state explicit: conditioned on the active (sub)goal, the agent maintains a compact information state needed for progress assessment and the currently effective action set. Our approach decomposes tasks into subgoals and organizes memory hierarchically into task-level global memory and subtask-level local memory, where local memory directly conditions SLM action selection and is updated from new observations. To instantiate goal-conditioned memories without parameter updates, we introduce an offline LLM-based induction pipeline that builds a reusable schema once per task family from a small number of representative traces. Training-free refers to no parameter updates of the deployed SLM and no online LLM calls; we only use a one-time offline LLM-based schema induction per task family. On ALFWorld valid_unseen, a 4B SLM achieves 0.910 success, while representative prompting and prior working-memory baselines under the same setting remain below 0.320.
Claim
Small language models (SLMs) are attractive for agent deployment, but they struggle to reliably retain and reuse decision-relevant state information over long interactions.
From Outcomes to Actions: Leveraging Hindsight for Long-Horizon Language Agent Training Poster
Abstract
Reinforcement learning (RL) has become a widely adopted technique for improving large language models (LLMs) on complex tasks. Despite this progress, existing RL methods still face challenges in training agents with longer-horizon interactions. One major bottleneck is distinguishing the contribution of different actions in long-horizon interaction, leading to high optimization variance. To address this, we introduce a novel policy gradient method, Hindsight Policy Optimization (HPO), that projects both the current policy distribution and the hindsight distribution into an intent space and extracts low-variance learning signals from the Wasserstein distance between them. We theoretically and empirically show that aggregating semantically similar states and actions in the intent space yields a bounded-variance estimator and improves policy performance stably. Our code is available online.
Claim
Reinforcement learning (RL) has become a widely adopted technique for improving large language models (LLMs) on complex tasks.
Specialized & Cross-Domain Agents
PDAgent: An LLM-Driven Autonomous Agent Framework Towards *In Silico* Protein Design via Directed Mutation Poster
Abstract
Computational protein design holds immense promise across diverse domains, but existing approaches face significant challenges: traditional physics-based methods require substantial domain expertise, while emerging deep learning methods often rely on restricted functional ontologies, struggle to bridge the semantic gap between text and protein sequences, or lack closed-loop optimization mechanisms. In this paper, we present PDAgent, an LLM-driven autonomous agent framework that enables in silico protein design through template-based directed mutation. Our framework accepts natural language specifications of desired protein properties and employs a ReAct-style reasoning loop comprising five phases: THINK, PLAN, ACT, OBSERVE, and REFLECT. PDAgent integrates template retrieval, conservation-aware mutation strategies, and domain-specific computational tools for property optimization across eight biophysical dimensions. Experiments on 100 diverse protein design tasks demonstrate that PDAgent achieves a 91.34% average constraint satisfaction rate with high structural quality (mean pLDDT 87.69), substantially outperforming both direct LLM generation and specialized deep learning methods.
Claim
Computational protein design holds immense promise across diverse domains, but existing approaches face significant challenges: traditional physics-based methods require substantial domain expertise, while emerging deep learning methods often rely on restricted functional ontologies, struggle to bridge the semantic gap between text and protein sequences, or lack closed-loop optimization mechanisms.
It's a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents Poster
Abstract
Web-based agents powered by large language models are increasingly used for tasks such as email management or professional networking. Their reliance on dynamic web content, however, makes them vulnerable to prompt injection attacks: adversarial instructions hidden in interface elements that persuade the agent to divert from its original task. We introduce the Task-Redirecting Agent Persuasion Benchmark (TRAP), an evaluation for studying how persuasion techniques misguide autonomous web agents on realistic tasks. Across six frontier models, agents are susceptible to prompt injection in 25% of tasks on average (13% for GPT-5 to 43% for DeepSeek-R1), with small interface or contextual changes often doubling success rates and revealing systemic, psychologically driven vulnerabilities in web-based agents. We also provide a modular social-engineering injection framework with controlled experiments on high-fidelity website clones, allowing for further benchmark expansion.
Claim
Web-based agents powered by large language models are increasingly used for tasks such as email management or professional networking.
De-Linearizing Agent Traces: Bayesian Inference of Latent Partial Orders for Efficient Execution Poster
Abstract
AI agents increasingly execute procedural workflows as sequential action traces, which obscures latent concurrency and induces repeated step-by-step reasoning. We introduce BPOP, a Bayesian framework that infers a latent dependency partial order from noisy linearized traces. BPOP models traces as stochastic linear extensions of an underlying graph and performs efficient MCMC inference via a tractable frontier-softmax likelihood that avoids #P-hard marginalization over linear extensions. We evaluate on our open-sourced Cloud-IaC-6, a suite of cloud provisioning tasks with heterogeneous LLM-generated traces, and WFCommons scientific workflows. BPOP recovers dependency structure more accurately than trace-only and process-mining baselines, and the inferred graphs support a compiled executor that prunes irrelevant context, yielding substantial reductions in token usage and execution time.
Claim
AI agents increasingly execute procedural workflows as sequential action traces, which obscures latent concurrency and induces repeated step-by-step reasoning.
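The notion of a trace as a linear extension of a partial order can be made concrete with a brute-force sketch. It is not BPOP's inference procedure. The provisioning steps and dependency edges are hypothetical, and enumeration over all permutations is used only because the example has four steps.

```python
from itertools import permutations


def is_linear_extension(trace, edges):
    """True if the trace places every prerequisite before the step that needs it."""
    position = {step: i for i, step in enumerate(trace)}
    return all(position[before] < position[after] for before, after in edges)


def linear_extensions(steps, edges):
    """All orderings of `steps` consistent with the partial order (factorial cost)."""
    return [p for p in permutations(steps) if is_linear_extension(p, edges)]


# Hypothetical provisioning workflow: (a, b) means a must finish before b starts.
steps = ["create_vpc", "create_subnet", "create_security_group", "launch_vm"]
edges = [
    ("create_vpc", "create_subnet"),
    ("create_vpc", "create_security_group"),
    ("create_subnet", "launch_vm"),
    ("create_security_group", "launch_vm"),
]

valid = linear_extensions(steps, edges)
print(len(valid))
```

Two orderings are valid here because the subnet and the security group do not depend on each other. A sequential trace records only one of them, which is the latent concurrency the abstract refers to. Enumeration does not scale, since counting linear extensions is #P-hard in general.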
Vision (10 papers)
Vision-Language-Action (VLA) Models
NeurVLA: Unleashing Failure-Handling Capability of Vision-Language-Action Models via Neural-Symbolic Reasoning Poster
Abstract
Vision-Language-Action models have recently shown promising progress in embodied robotic manipulation, yet their generalization to diverse open-ended embodied tasks is often hindered by execution failures. While prior work has explored failure handling, existing approaches still suffer from two fundamental limitations: coarse-grained failure correction and unreliable failure prevention. These limitations lead to brittle decision-making when VLA models are deployed in novel tasks and environments. To address them, we propose NeurVLA, a neural-symbolic framework that jointly addresses failure correction and prevention via neural-symbolic reasoning and further internalizes these failure-handling capabilities into VLA models. Experiments demonstrate that NeurVLA achieves strong performance and robust generalization across diverse tasks. Code is provided in the supplementary material.
Claim
Vision-Language-Action models have recently shown promising progress in embodied robotic manipulation, yet their generalization to diverse open-ended embodied tasks is often hindered by execution failures.
Temporal Difference Calibration in Sequential Tasks: Application to Vision-Language-Action Models Poster
Abstract
Recent advances in vision-language-action (VLA) models for robotics have highlighted the importance of reliable uncertainty quantification in sequential tasks. However, assessing and improving calibration in such settings remains mostly unexplored, especially when only partial trajectories are observed. In this work, we formulate sequential calibration for episodic tasks, where task-success confidence is produced along an episode, while success is determined at the end of it. We introduce a sequential extension of the Brier score and show that, for binary outcomes, its risk minimizer coincides with the VLA policy’s value function. This connection bridges uncertainty calibration and reinforcement learning, enabling the use of temporal-difference (TD) value estimation as a principled calibration mechanism over time. We empirically show that TD calibration improves performance relative to the state-of-the-art on simulated and real-robot data. Interestingly, we show that when calibrated using TD, the VLA's single-step action probabilities can yield competitive uncertainty estimates, in contrast to recent findings that employed different calibration techniques.
Claim
Recent advances in vision-language-action (VLA) models for robotics have highlighted the importance of reliable uncertainty quantification in sequential tasks.
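The link between calibration and value estimation can be sketched in a tabular setting. The code below is an assumption-laden illustration rather than the paper's method: the averaged squared-error form of the sequential Brier score, the state names, and the toy episodes are all hypothetical. It runs undiscounted TD(0) with a terminal target of 1 for success and 0 for failure, so each state's value estimates its probability of eventual success.

```python
def sequential_brier(confidences, success):
    """Mean squared gap between per-step success confidences and the final outcome."""
    y = 1.0 if success else 0.0
    return sum((c - y) ** 2 for c in confidences) / len(confidences)


def td_calibrate(episodes, lr=0.1, sweeps=200, init=0.5):
    """Tabular TD(0): bootstrap each state's confidence from the next state's estimate."""
    values = {}
    for _ in range(sweeps):
        for states, success in episodes:
            for t, state in enumerate(states):
                if t + 1 < len(states):
                    target = values.get(states[t + 1], init)  # bootstrapped target
                else:
                    target = 1.0 if success else 0.0  # the outcome is known only at the end
                current = values.get(state, init)
                values[state] = current + lr * (target - current)
    return values


# Hypothetical episodes: three of four grasps succeed; a dropped object always fails.
episodes = [
    (["start", "grasped"], True),
    (["start", "grasped"], True),
    (["start", "grasped"], True),
    (["start", "dropped"], False),
]
values = td_calibrate(episodes)
print({state: round(v, 2) for state, v in values.items()})
```

With a constant step size the estimate for `start` hovers near the empirical success rate of 0.75 rather than converging to it exactly. A decaying step size would tighten that.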
Vision-Language Model Robustness & Reliability
Robustifying Vision-Language Models via Test-Time Prompt Adaptation Poster
Abstract
Pre-trained Vision-Language Models (VLMs) such as CLIP achieve strong zero-shot generalization, but their performance degrades sharply under adversarial perturbations. Existing test-time adaptation methods typically rely on sample-level confidence heuristics, overlooking the intrinsic distributional structure of the data. This sample-centric approach limits robustness, as it fails to distinguish confident adversarial mispredictions from true semantic consistency. In this work, we observe that adversarial distortion is structurally brittle: while holistic representations are corrupted, semantic integrity is often preserved in the distribution of augmented views. Motivated by this insight, we propose RITA, a Robust test-tIme prompT Adaptation framework that shifts from sample-level estimates to distribution-level alignment. Specifically, RITA employs optimal transport to align the distribution of augmented visual features with textual prototypes, mitigating adversarial outliers and rectifying cross-modal semantic misalignment. Furthermore, we introduce a dynamic cache to progressively accumulate reliable cues from the test stream for online refinement. Extensive experiments demonstrate that RITA significantly improves adversarial robustness without compromising clean accuracy.
Claim
Pre-trained Vision-Language Models (VLMs) such as CLIP achieve strong zero-shot generalization, but their performance degrades sharply under adversarial perturbations.
Self-Calibrated Consistency can Fight Back for Adversarial Robustness in Vision-Language Models Poster
Abstract
Pre-trained vision-language models (VLMs) such as CLIP have demonstrated strong zero-shot capabilities across diverse domains, yet remain highly vulnerable to adversarial perturbations that disrupt image-text alignment and compromise reliability. Existing defenses typically rely on adversarial fine-tuning with labeled data, limiting their applicability in zero-shot settings. In this work, we identify two key weaknesses of current CLIP adversarial attacks—lack of semantic guidance and vulnerability to view variations—collectively termed semantic and viewpoint fragility. To address these challenges, we propose Self-Calibrated Consistency (SCC), an effective test-time defense. SCC consists of two complementary modules: Semantic consistency, which leverages soft pseudo-labels from counterattack warm-up and multi-view predictions to regularize cross-modal alignment and separate the target embedding from confusable negatives; and Spatial consistency, aligning perturbed visual predictions via augmented views to stabilize inference under adversarial perturbations. Together, these modules form a plug-and-play inference strategy. Extensive experiments on 22 benchmarks under diverse attack settings show that SCC consistently improves the zero-shot robustness of CLIP while maintaining accuracy, and can be seamlessly integrated with other VLMs for further gains. These findings highlight the great potential of establishing an adversarially robust paradigm from CLIP, with implications extending to broader VLMs such as BioMedCLIP.
Claim
Pre-trained vision-language models (VLMs) such as CLIP have demonstrated strong zero-shot capabilities across diverse domains, yet remain highly vulnerable to adversarial perturbations that disrupt image-text alignment and compromise reliability.
Learning to Decode Against Compositional Hallucination in Video Multimodal Large Language Models Poster
Abstract
Current research on video hallucination mitigation primarily focuses on isolated error types, leaving compositional hallucinations, which arise from incorrect reasoning over multiple interacting spatial and temporal factors, largely underexplored. We introduce OmniVCHall, a benchmark designed to systematically evaluate both isolated and compositional hallucinations in video multimodal large language models (VLLMs). OmniVCHall spans diverse video domains, introduces a novel camera-based hallucination type, and defines a fine-grained taxonomy, together with adversarial answer options (e.g., “All are correct” and “None of the above”) to prevent shortcut reasoning. Evaluations of 39 representative VLLMs reveal that even advanced models (e.g., Qwen3-VL and GPT-5) exhibit substantial performance degradation. We propose TriCD, a contrastive decoding framework with a triple-pathway calibration mechanism. An adaptive perturbation controller dynamically selects distracting operations to construct negative video variants, while a saliency-guided enhancement module adaptively reinforces grounded token-wise visual evidence. These components are optimized via reinforcement learning to encourage precise decision-making under compositional hallucination settings. Experimental results show that TriCD consistently improves performance across two representative backbones, achieving an average accuracy improvement of over 10%.
Claim
Current research on video hallucination mitigation primarily focuses on isolated error types, leaving compositional hallucinations, which arise from incorrect reasoning over multiple interacting spatial and temporal factors, largely underexplored.
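The abstract above describes contrastive decoding against negative video variants. As a rough illustration of the general technique (not TriCD's triple-pathway mechanism), the sketch below applies the commonly used contrastive decoding rule: amplify the logits from the original input, subtract the logits from a perturbed input, and mask tokens that were implausible under the original input. The function name and the `alpha` and `beta` defaults are assumptions.

```python
import numpy as np


def contrastive_logits(logits_clean, logits_negative, alpha: float = 1.0, beta: float = 0.1):
    """Generic contrastive decoding step over one vocabulary distribution.

    logits_clean:    next-token logits given the original video.
    logits_negative: next-token logits given a perturbed (negative) video.
    alpha:           contrast strength (assumed default).
    beta:            plausibility cutoff relative to the top clean token.
    """
    logits_clean = np.asarray(logits_clean, dtype=float)
    logits_negative = np.asarray(logits_negative, dtype=float)

    # Reward tokens supported by the real input, penalize tokens that the
    # model would emit even when the visual evidence is distorted.
    adjusted = (1.0 + alpha) * logits_clean - alpha * logits_negative

    # Plausibility constraint: drop tokens whose clean probability is far
    # below the most likely clean token, so the contrast cannot promote them.
    probs = np.exp(logits_clean - logits_clean.max())
    probs /= probs.sum()
    adjusted[probs < beta * probs.max()] = -np.inf
    return adjusted
```

A token that scores highly with and without the real video gains nothing from the contrast, while a token that depends on the actual visual content is promoted.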
Multimodal LLMs: Reasoning & Distillation¶
ACTIVE-o3: Empowering MLLMs with Active Perception via Pure Reinforcement Learning Poster
Abstract
Active vision, also known as active perception, refers to actively selecting where and how to look in order to gather task-relevant information. It is a critical component of efficient perception and decision-making in humans and advanced embodied agents. With the rise of Multimodal Large Language Models (MLLMs) as central planners in robotic systems, the lack of methods for equipping MLLMs with active perception has become a key gap. We first provide a systematic definition of MLLM-based active perception tasks and show that GPT-o3's zoom-in strategy can be viewed as a special case, though it suffers from low efficiency and inaccurate region selection. To address these issues, we propose Active-o3, a reinforcement learning framework built on GRPO that equips MLLMs with active perception capabilities. Leveraging a modular sensing-action design and a dual-form reward, Active-o3 autonomously learns efficient and stable region selection strategies without explicit supervision. We further establish a comprehensive benchmark covering both open-world tasks (small/dense-object grounding) and domain-specific scenarios (remote sensing, autonomous driving, interactive segmentation). Experimental results demonstrate that Active-o3 significantly enhances active perception capabilities compared to Qwen2.5-VL-CoT. Moreover, we show that our RL framework not only preserves the model’s general understanding ability but can also serve as a proxy task for leveraging perception data, further improving performance on benchmarks such as RealWorldQA. We hope that our work can provide a simple codebase and unified evaluation protocol to facilitate future research on active perception with MLLMs.
Claim
Active vision, also known as active perception, refers to actively selecting where and how to look in order to gather task-relevant information.
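Since the abstract above builds on GRPO, a short sketch of the group-relative advantage at the core of that algorithm may help. For each prompt, several responses are sampled and each reward is normalized against the group's mean and standard deviation. The function name is hypothetical, the use of the population standard deviation is an assumption, and the paper's dual-form reward is not modeled here.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of sampled responses.

    rewards: scalar rewards for responses sampled from the same prompt,
             e.g. how well each proposed region selection solved the task.
    eps:     small constant that avoids division by zero when all rewards
             in the group are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Responses that beat their group average receive a positive advantage and the rest a negative one, so no separate value network is needed to form a baseline.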
Beyond Next-Token Alignment: Distilling Multimodal Large Language Models via Token Interactions Poster
Abstract
Multimodal Large Language Models (MLLMs) demonstrate impressive cross-modal capabilities, yet their substantial size poses significant deployment challenges. Knowledge distillation (KD) is a promising solution for compressing these models, but existing methods primarily rely on static next-token alignment, neglecting the dynamic token interactions that embed essential capabilities for multimodal understanding and generation. To this end, we introduce Align-TI, a novel KD framework designed from the perspective of Token Interactions. Our approach is motivated by the insight that MLLMs rely on two primary interactions: vision-instruction token interactions to extract relevant visual information, and intra-response token interactions for coherent generation. Accordingly, Align-TI introduces two components: IVA enables the student model to imitate the teacher's instruction-relevant visual information extraction capability by aligning on salient visual regions. TPA captures the teacher's dynamic generative logic by aligning the sequential token-to-token transition probabilities. Extensive experiments demonstrate Align-TI's superiority. Notably, our approach achieves a 2.6% relative improvement over Vanilla KD, and our distilled Align-TI-2B even outperforms LLaVA-1.5-7B (a much larger MLLM) by 7.0%, establishing a new state-of-the-art distillation framework for training parameter-efficient MLLMs.
Claim
Multimodal Large Language Models (MLLMs) demonstrate impressive cross-modal capabilities, yet their substantial size poses significant deployment challenges.
Multimodal Benchmarks¶
VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain Knowledge Poster
Abstract
Current multimodal benchmarks often conflate reasoning with domain-specific knowledge, making it difficult to isolate and evaluate general reasoning abilities in non-expert settings. To address this, we introduce VisualPuzzles, a benchmark that targets visual reasoning while deliberately minimizing reliance on specialized knowledge. VisualPuzzles consists of diverse questions spanning five categories: algorithmic, analogical, deductive, inductive, and spatial reasoning. One major source of our questions is manually translated logical reasoning questions from the Chinese Civil Service Examination. Experiments show that VisualPuzzles requires significantly less intensive domain-specific knowledge and more complex reasoning compared to benchmarks like MMMU, enabling us to better evaluate genuine multimodal reasoning. Evaluations show that state-of-the-art multimodal large language models consistently lag behind human performance on VisualPuzzles, and that strong performance on knowledge-intensive benchmarks does not necessarily translate to success on reasoning-focused, knowledge-light tasks. Additionally, reasoning enhancements such as scaling up inference compute (with "thinking" modes) yield inconsistent gains across models and task types, and we observe no clear correlation between model size and performance. We also found that models exhibit different reasoning and answering patterns on VisualPuzzles compared to benchmarks with heavier emphasis on knowledge. VisualPuzzles offers a clearer lens through which to evaluate reasoning capabilities beyond factual recall and domain knowledge.
Claim
Current multimodal benchmarks often conflate reasoning with domain-specific knowledge, making it difficult to isolate and evaluate general reasoning abilities in non-expert settings.
ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models Poster
Abstract
Large Multimodal Models (LMMs) exhibit shortfalls when interpreting images and, by some measures, have poorer spatial cognition than young children or animals. Despite this, they attain high scores on many popular visual benchmarks, with headroom rapidly eroded by surging model progress. To address this, there is a pressing need for difficult benchmarks that remain relevant for longer. We take this idea to its limit by introducing ZeroBench, a lightweight visual reasoning benchmark curated using adversarial filtering to be “impossible” for frontier LMMs at release time, with initial SotA scores of 0% pass@1 and pass^5. We track progress on ZeroBench over the subsequent year, observing SotA reaching 6% pass^5 and 19% pass@5, indicating the potential longevity of our benchmark. Overall, we evaluate 46 LMMs on ZeroBench, compare performance to a human baseline, analyse strengths and weaknesses, and chart performance over a year of advancement in visual capabilities.
Claim
Large Multimodal Models (LMMs) exhibit shortfalls when interpreting images and, by some measures, have poorer spatial cognition than young children or animals.
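The abstract above reports two metrics over five samples per question. The sketch below shows how they are commonly computed, assuming that pass@k counts a question as solved if at least one of k sampled answers is correct, and that the stricter all-k metric (written with a caret) counts it only if all k are correct. These definitions and the function names are assumptions based on common usage, not taken from the paper.

```python
from typing import Sequence


def pass_at_k(results: Sequence[Sequence[bool]]) -> float:
    """Fraction of questions where at least one of the k samples is correct."""
    return sum(any(samples) for samples in results) / len(results)


def pass_all_k(results: Sequence[Sequence[bool]]) -> float:
    """Fraction of questions where every one of the k samples is correct."""
    return sum(all(samples) for samples in results) / len(results)


# Toy example: 4 questions, 5 sampled answers each (True = correct).
toy_results = [
    [True, True, True, True, True],        # solved reliably
    [False, True, False, False, False],    # solved once out of five
    [False, False, False, False, False],   # never solved
    [False, False, False, False, False],   # never solved
]
```

On the toy data, `pass_at_k(toy_results)` is 0.5 and `pass_all_k(toy_results)` is 0.25, which mirrors why the all-k score in the abstract is lower than pass@5: it rewards only consistent correctness.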
MMKU-Bench: A Multimodal Update Benchmark for Diverse Visual Knowledge Poster
Abstract
As real-world knowledge continues to evolve, the parametric knowledge acquired by multimodal models during pretraining becomes increasingly difficult to keep consistent with real-world knowledge. Existing research on multimodal knowledge updating focuses only on learning previously unknown knowledge, while overlooking the need to update knowledge that the model has already mastered but that later changes; moreover, evaluation is limited to the same modality, lacking a systematic analysis of cross-modal consistency. To address these issues, this paper proposes MMKU-Bench, a comprehensive evaluation benchmark for multimodal knowledge updating, which contains over 25k knowledge instances and more than 49k images, covering two scenarios, updated knowledge and unknown knowledge, thereby enabling comparative analysis of learning across different knowledge types. On this benchmark, we evaluate a variety of representative approaches, including supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and knowledge editing (KE). Experimental results show that SFT and RLHF are prone to catastrophic forgetting, while KE better preserves general capabilities but exhibits clear limitations in continual updating. Overall, MMKU-Bench provides a reliable and comprehensive evaluation benchmark for multimodal knowledge updating, advancing progress in this field.
Claim
As real-world knowledge continues to evolve, the parametric knowledge acquired by multimodal models during pretraining becomes increasingly difficult to keep consistent with real-world knowledge.