ML + Vision Top-6 Agent Survey - ICLR 2025 - Page 2 of 3

  • Venue: International Conference on Learning Representations
  • Year: 2025
  • Page: 2 / 3
  • Papers: 31-60 / 74
Noisy Test-Time Adaptation in Vision-Language Models Paper
  • Authors: Chentao Cao, Zhun Zhong, Zhanke Zhou, Tongliang Liu, Yang Liu, Kun Zhang, Bo Han
  • Year: 2025
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2502.14604
  • Citations: 9
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, vlms).
  • Code: https://github.com/tmlr-group/ZS-NTTA (stated in abstract)
  • Extraction: method/data pending

Abstract

Test-time adaptation (TTA) aims to address distribution shifts between source and target data by relying solely on target data during testing. In open-world scenarios, models often encounter noisy samples, i.e., samples outside the in-distribution (ID) label space. Leveraging the zero-shot capability of pre-trained vision-language models (VLMs), this paper introduces Zero-Shot Noisy TTA (ZS-NTTA), focusing on adapting the model to target data with noisy samples during test-time in a zero-shot manner. We find existing TTA methods underperform under ZS-NTTA, often lagging behind even the frozen model. We conduct comprehensive experiments to analyze this phenomenon, revealing that the negative impact of unfiltered noisy data outweighs the benefits of clean data during model updating. Also, adapting a classifier for ID classification and noise detection hampers both sub-tasks. Built on this, we propose a framework that decouples the classifier and detector, focusing on developing an individual detector while keeping the classifier frozen. Technically, we introduce the Adaptive Noise Detector (AdaND), which utilizes the frozen model's outputs as pseudo-labels to train a noise detector. To handle clean data streams, we further inject Gaussian noise during adaptation, preventing the detector from misclassifying clean samples as noisy. Beyond the ZS-NTTA, AdaND can also improve the zero-shot out-of-distribution (ZS-OOD) detection ability of VLMs. Experiments show that AdaND outperforms in both ZS-NTTA and ZS-OOD detection. On ImageNet, AdaND achieves a notable improvement of \(8.32\%\) in harmonic mean accuracy (\(\text{Acc}_\text{H}\)) for ZS-NTTA and \(9.40\%\) in FPR95 for ZS-OOD detection, compared to SOTA methods. Importantly, AdaND is computationally efficient and comparable to the model-frozen method. The code is publicly available at: https://github.com/tmlr-group/ZS-NTTA.
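The two metrics quoted in the abstract can be illustrated with a small sketch. It assumes that Acc_H is the harmonic mean of an ID classification accuracy and a noise detection accuracy, and that FPR95 is the false positive rate on OOD samples at the score threshold that keeps 95% of ID samples. The function names and the score convention (higher score means more in-distribution) are illustrative assumptions, not taken from the paper's code.

```python
import math


def harmonic_mean_accuracy(acc_id: float, acc_noise: float) -> float:
    """Harmonic mean of two accuracies in [0, 1]; returns 0.0 if either is 0."""
    if acc_id <= 0.0 or acc_noise <= 0.0:
        return 0.0
    return 2.0 * acc_id * acc_noise / (acc_id + acc_noise)


def fpr_at_95_tpr(id_scores, ood_scores) -> float:
    """False positive rate on OOD samples at the threshold that retains
    at least 95% of ID samples (assumed: higher score = more ID-like)."""
    ranked = sorted(id_scores, reverse=True)
    keep = math.ceil(0.95 * len(ranked))   # number of ID samples to retain
    threshold = ranked[keep - 1]           # lowest retained ID score
    false_positives = sum(1 for s in ood_scores if s >= threshold)
    return false_positives / len(ood_scores)
```

The harmonic mean penalizes imbalance: a method that classifies ID samples well but detects noise poorly (or the reverse) scores lower than one that is moderately good at both.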

Claim

Test-time adaptation (TTA) aims to address distribution shifts between source and target data by relying solely on target data during testing.

RAG-SR: Retrieval-Augmented Generation for Neural Symbolic Regression Paper
  • Authors: Hengzhe Zhang, Qi Chen, Bing Xue, Wolfgang Banzhaf, Mengjie Zhang
  • Year: 2025
  • Venue: International Conference on Learning Representations
  • DOI: Not stated.
  • Citations: 9
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: program synthesis (matched: symbolic regression).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.

Empowering LLM Agents with Zero-Shot Optimal Decision-Making through Q-learning Paper
  • Authors: Jiajun Chai, Sicheng Li, Yuqian Fu, Dongbin Zhao, Yuanheng Zhu
  • Year: 2025
  • Venue: International Conference on Learning Representations
  • DOI: Not stated.
  • Citations: 8
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: LLM agents (matched: llm agents).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.

Differentiable Integer Linear Programming Paper
  • Authors: Zijie Geng, Jie Wang, Xijun Li, Fangzhou Zhu, Jianye Hao, Bin Li, Feng Wu
  • Year: 2025
  • Venue: International Conference on Learning Representations
  • DOI: Not stated.
  • Citations: 8
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: program synthesis (matched: programming).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.

ZIP: An Efficient Zeroth-order Prompt Tuning for Black-box Vision-Language Models Paper
  • Authors: Seonghwan Park, J. Jeong, Yongjun Kim, Jaeho Lee, Namhoon Lee
  • Year: 2025
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2504.06838
  • Citations: 7
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Recent studies have introduced various approaches for prompt-tuning black-box vision-language models, referred to as black-box prompt-tuning (BBPT). While BBPT has demonstrated considerable potential, it is often found that many existing methods require an excessive number of queries (i.e., function evaluations), which poses a significant challenge in real-world scenarios where the number of allowed queries is limited. To tackle this issue, we propose Zeroth-order Intrinsic-dimensional Prompt-tuning (ZIP), a novel approach that enables efficient and robust prompt optimization in a purely black-box setting. The key idea of ZIP is to reduce the problem dimensionality and the variance of zeroth-order gradient estimates, such that the training is done fast with far fewer queries. We achieve this by re-parameterizing prompts in low-rank representations and designing intrinsic-dimensional clipping of estimated gradients. We evaluate ZIP on 13+ vision-language tasks in standard benchmarks and show that it achieves an average improvement of approximately 6% in few-shot accuracy and 48% in query efficiency compared to the best-performing alternative BBPT methods, establishing a new state of the art. Our ablation analysis further shows that the proposed clipping mechanism is robust and nearly optimal, without the need to manually select the clipping threshold, matching the result of expensive hyperparameter search.
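The ingredients named in the abstract (a low-rank prompt parameterization, zeroth-order gradient estimates from loss queries only, and clipping of the estimate) can be sketched generically as below. This is not the authors' implementation: the two-point Gaussian estimator, the flat layout of the factors, the toy loss, and the choice of the square root of the low-rank dimension as the clipping threshold are all assumptions made for illustration.

```python
import math
import random


def low_rank_prompt(z, m, n, r):
    """Rebuild an m x n prompt matrix P = U V^T from a flat vector z that
    stores U (m x r) followed by V (n x r). Layout is an assumption."""
    u = [z[i * r:(i + 1) * r] for i in range(m)]
    v = [z[m * r + j * r:m * r + (j + 1) * r] for j in range(n)]
    return [[sum(u[i][k] * v[j][k] for k in range(r)) for j in range(n)]
            for i in range(m)]


def zo_gradient(loss, z, mu=1e-2, n_dirs=8, rng=None):
    """Two-point zeroth-order gradient estimate; uses 2 * n_dirs loss queries
    and never touches model internals."""
    rng = rng if rng is not None else random
    grad = [0.0] * len(z)
    for _ in range(n_dirs):
        u = [rng.gauss(0.0, 1.0) for _ in z]
        plus = loss([zi + mu * ui for zi, ui in zip(z, u)])
        minus = loss([zi - mu * ui for zi, ui in zip(z, u)])
        scale = (plus - minus) / (2.0 * mu)
        for i, ui in enumerate(u):
            grad[i] += scale * ui / n_dirs
    return grad


def clip_by_norm(g, max_norm):
    """Rescale g so its Euclidean norm does not exceed max_norm."""
    norm = math.sqrt(sum(x * x for x in g))
    if norm <= max_norm or norm == 0.0:
        return list(g)
    return [x * max_norm / norm for x in g]


if __name__ == "__main__":
    m, n, r = 2, 2, 1
    target = [[1.0, 2.0], [2.0, 4.0]]  # toy stand-in for a good prompt

    def black_box_loss(z):
        p = low_rank_prompt(z, m, n, r)
        return sum((p[i][j] - target[i][j]) ** 2
                   for i in range(m) for j in range(n))

    rng = random.Random(0)
    z = [rng.uniform(-1.0, 1.0) for _ in range(r * (m + n))]
    max_norm = math.sqrt(len(z))  # assumed threshold tied to the dimension
    print("loss before:", black_box_loss(z))
    for _ in range(200):
        g = clip_by_norm(zo_gradient(black_box_loss, z, rng=rng), max_norm)
        z = [zi - 0.05 * gi for zi, gi in zip(z, g)]
    print("loss after:", black_box_loss(z))
```

Optimizing the r(m + n) factor entries instead of all mn prompt entries shrinks the search space, which is what makes each noisy estimate more useful per query in this kind of setup.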

Claim

Recent studies have introduced various approaches for prompt-tuning black-box vision-language models, referred to as black-box prompt-tuning (BBPT).

An Online Learning Theory of Trading-Volume Maximization Paper
  • Authors: Tommaso Cesari, Roberto Colomboni
  • Year: 2025
  • Venue: International Conference on Learning Representations
  • DOI: Not stated.
  • Citations: 7
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: alpha factor search (matched: trading).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.

Attribute-based Visual Reprogramming for Vision-Language Models Paper
  • Authors: C. Cai, Zesheng Ye, Lei Feng, Jianzhong Qi, Feng Liu
  • Year: 2025
  • Venue: International Conference on Learning Representations
  • DOI: Not stated.
  • Citations: 6
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models).
  • Code: https://github.com/tmlr-group/AttrVR (stated in abstract)
  • Extraction: method/data pending

Abstract

Visual reprogramming (VR) reuses pre-trained vision models for downstream image classification tasks by adding trainable noise patterns to inputs. When applied to vision-language models (e.g., CLIP), existing VR approaches follow the same pipeline used in vision models (e.g., ResNet, ViT), where ground-truth class labels are inserted into fixed text templates to guide the optimization of VR patterns. This label-based approach, however, overlooks the rich information and diverse attribute-guided textual representations that CLIP can exploit, which may lead to the misclassification of samples. In this paper, we propose Attribute-based Visual Reprogramming (AttrVR) for CLIP, utilizing descriptive attributes (DesAttrs) and distinctive attributes (DistAttrs), which respectively represent common and unique feature descriptions for different classes. Besides, as images of the same class may reflect different attributes after VR, AttrVR iteratively refines patterns using the \(k\)-nearest DesAttrs and DistAttrs for each image sample, enabling more dynamic and sample-specific optimization. Theoretically, AttrVR is shown to reduce intra-class variance and increase inter-class separation. Empirically, it achieves superior performance in 12 downstream tasks for both ViT-based and ResNet-based CLIP. The success of AttrVR facilitates more effective integration of VR from unimodal vision models into vision-language models. Our code is available at https://github.com/tmlr-group/AttrVR.
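The sample-specific attribute selection described in the abstract can be sketched as a k-nearest lookup in embedding space. The sketch below assumes plain lists standing in for CLIP embeddings of the reprogrammed image and of the attribute texts, cosine similarity as the distance, and a hypothetical `weight` parameter that mixes the DesAttr and DistAttr scores; none of these details are confirmed by the abstract.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_attribute_score(image_emb, attr_embs, k):
    """Mean similarity between the image and its k most similar attributes."""
    sims = sorted((cosine(image_emb, a) for a in attr_embs), reverse=True)
    top = sims[:k]
    return sum(top) / len(top)


def classify(image_emb, attrs_by_class, k=2, weight=0.5):
    """Pick the class whose nearest descriptive (des) and distinctive (dist)
    attributes best match the image; `weight` is a hypothetical mixing factor."""
    scores = {}
    for label, (des_attrs, dist_attrs) in attrs_by_class.items():
        des = knn_attribute_score(image_emb, des_attrs, k)
        dist = knn_attribute_score(image_emb, dist_attrs, k)
        scores[label] = weight * des + (1.0 - weight) * dist
    best = max(scores, key=scores.get)
    return best, scores
```

Because the top-k set is recomputed per image, two images of the same class can be matched against different attribute descriptions, which is the sample-specific behavior the abstract points to.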

Claim

Visual reprogramming (VR) reuses pre-trained vision models for downstream image classification tasks by adding trainable noise patterns to inputs.

Large (Vision) Language Models are Unsupervised In-Context Learners Paper
  • Authors: Artyom Gadetsky, Andrei Atanov, Yulun Jiang, Zhitong Gao, Ghazal Hosseini Mighan, Amir Zamir, Maria Brbić
  • Year: 2025
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2504.02349
  • Citations: 6
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, large vision language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Recent advances in large language and vision-language models have enabled zero-shot inference, allowing models to solve new tasks without task-specific training. Various adaptation techniques such as prompt engineering, In-Context Learning (ICL), and supervised fine-tuning can further enhance the model's performance on a downstream task, but they require substantial manual effort to construct effective prompts or labeled examples. In this work, we introduce a joint inference framework for fully unsupervised adaptation, eliminating the need for manual prompt engineering and labeled examples. Unlike zero-shot inference, which makes independent predictions, the joint inference makes predictions simultaneously for all inputs in a given task. Since direct joint inference involves computationally expensive optimization, we develop efficient approximation techniques, leading to two unsupervised adaptation methods: unsupervised fine-tuning and unsupervised ICL. We demonstrate the effectiveness of our methods across diverse tasks and models, including language-only Llama-3.1 on natural language processing tasks, reasoning-oriented Qwen2.5-Math on grade school math problems, vision-language OpenFlamingo on vision tasks, and the API-only access GPT-4o model on massive multi-discipline tasks. Our experiments demonstrate substantial improvements over the standard zero-shot approach, including 39% absolute improvement on the challenging GSM8K math reasoning dataset. Remarkably, despite being fully unsupervised, our framework often performs on par with supervised approaches that rely on ground truth labels.

Claim

Recent advances in large language and vision-language models have enabled zero-shot inference, allowing models to solve new tasks without task-specific training.

When Prompt Engineering Meets Software Engineering: CNL-P as Natural and Robust "APIs" for Human-AI Interaction Paper
  • Authors: Zhenchang Xing, Yang Liu, Z. Cheng, Qing Huang, Dehai Zhao, Daniel Sun, Chenhua Liu
  • Year: 2025
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2508.06942
  • Citations: 6
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: program synthesis (matched: programming, code generation, software engineering).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

With the growing capabilities of large language models (LLMs), they are increasingly applied in areas like intelligent customer service, code generation, and knowledge management. Natural language (NL) prompts act as the “APIs” for human-LLM interaction. To improve prompt quality, best practices for prompt engineering (PE) have been developed, including writing guidelines and templates. Building on this, we propose Controlled NL for Prompt (CNL-P), which not only incorporates PE best practices but also draws on key principles from software engineering (SE). CNL-P introduces precise grammar structures and strict semantic norms, further eliminating NL's ambiguity, allowing for a declarative but structured and accurate expression of user intent. This helps LLMs better interpret and execute the prompts, leading to more consistent and higher-quality outputs. We also introduce an NL2CNL-P conversion tool based on LLMs, enabling users to write prompts in NL, which are then transformed into CNL-P format, thus lowering the learning curve of CNL-P. In particular, we develop a linting tool that checks CNL-P prompts for syntactic and semantic accuracy, applying static analysis techniques to NL for the first time. Extensive experiments demonstrate that CNL-P enhances the quality of LLM responses through the novel and organic synergy of PE and SE. We believe that CNL-P can bridge the gap between emerging PE and traditional SE, laying the foundation for a new programming paradigm centered around NL.

Claim

With the growing capabilities of large language models (LLMs), they are increasingly applied in areas like intelligent customer service, code generation, and knowledge management.

Grounding Multimodal Large Language Model in GUI World Paper
  • Authors: Weixian Lei, Difei Gao, Mike Zheng Shou
  • Year: 2025
  • Venue: International Conference on Learning Representations
  • DOI: Not stated.
  • Citations: 6
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: multimodal large language model).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.

VLMaterial: Procedural Material Generation with Large Vision-Language Models Paper
  • Authors: Beichen Li, Rundi Wu, Armando Solar-Lezama, Changxi Zheng, Liang Shi, Bernd Bickel, Wojciech Matusik
  • Year: 2025
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2501.18623
  • Citations: 5
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, large vision language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Procedural materials, represented as functional node graphs, are ubiquitous in computer graphics for photorealistic material appearance design. They allow users to perform intuitive and precise editing to achieve desired visual appearances. However, creating a procedural material given an input image requires professional knowledge and significant effort. In this work, we leverage the ability to convert procedural materials into standard Python programs and fine-tune a large pre-trained vision-language model (VLM) to generate such programs from input images. To enable effective fine-tuning, we also contribute an open-source procedural material dataset and propose to perform program-level augmentation by prompting another pre-trained large language model (LLM). Through extensive evaluation, we show that our method outperforms previous methods on both synthetic and real-world examples.

Claim

Procedural materials, represented as functional node graphs, are ubiquitous in computer graphics for photorealistic material appearance design.

RetroInText: A Multimodal Large Language Model Enhanced Framework for Retrosynthetic Planning via In-Context Representation Learning Paper
  • Authors: Chenglong Kang, Xiaoyi Liu, Fei Guo
  • Year: 2025
  • Venue: International Conference on Learning Representations
  • DOI: Not stated.
  • Citations: 5
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: multimodal large language model).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.

HASARD: A Benchmark for Vision-Based Safe Reinforcement Learning in Embodied Agents Paper
  • Authors: Tristan Tomilin, Meng Fang, Mykola Pechenizkiy
  • Year: 2025
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2503.08241
  • Citations: 4
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: embodied agents (matched: embodied agents).
  • Code: https://sites.google.com/view/hasard-bench/ (project site, stated in abstract)
  • Extraction: method/data pending

Abstract

Advancing safe autonomous systems through reinforcement learning (RL) requires robust benchmarks to evaluate performance, analyze methods, and assess agent competencies. Humans primarily rely on embodied visual perception to safely navigate and interact with their surroundings, making it a valuable capability for RL agents. However, existing vision-based 3D benchmarks only consider simple navigation tasks. To address this shortcoming, we introduce HASARD, a suite of diverse and complex tasks to HArness SAfe RL with Doom, requiring strategic decision-making, comprehending spatial relationships, and predicting the short-term future. HASARD features three difficulty levels and two action spaces. An empirical evaluation of popular baseline methods demonstrates the benchmark's complexity, unique challenges, and reward-cost trade-offs. Visualizing agent navigation during training with top-down heatmaps provides insight into a method's learning process. Incrementally training across difficulty levels offers an implicit learning curriculum. HASARD is the first safe RL benchmark to exclusively target egocentric vision-based learning, offering a cost-effective and insightful way to explore the potential and boundaries of current and future safe RL methods. The environments and baseline implementations are open-sourced at https://sites.google.com/view/hasard-bench/.

Claim

Advancing safe autonomous systems through reinforcement learning (RL) requires robust benchmarks to evaluate performance, analyze methods, and assess agent competencies.

The KoLMogorov Test: Compression by Code Generation Paper
  • Authors: Ori Yoran, Kunhao Zheng, Fabian Gloeckle, Jonas Gehring, Gabriel Synnaeve, Taco Cohen
  • Year: 2025
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2503.13992
  • Citations: 4
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: program synthesis (matched: code generation).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Compression is at the heart of intelligence. A theoretically optimal way to compress any sequence of data is to find the shortest program that outputs that sequence and then halts. However, such 'Kolmogorov compression' is uncomputable, and code generating LLMs struggle to approximate this theoretical ideal, as it requires reasoning, planning and search capabilities beyond those of current models. In this work, we introduce the KoLMogorov-Test (KT), a compression-as-intelligence test for code generating LLMs. In KT a model is presented with a sequence of data at inference time, and asked to generate the shortest program that produces the sequence. We identify several benefits of KT for both evaluation and training: an essentially infinite number of problem instances of varying difficulty is readily available, strong baselines already exist, the evaluation metric (compression) cannot be gamed, and pretraining data contamination is highly unlikely. To evaluate current models, we use audio, text, and DNA data, as well as sequences produced by random synthetic programs. Current flagship models perform poorly - both GPT4-o and Llama-3.1-405B struggle on our natural and synthetic sequences. On our synthetic distribution, we are able to train code generation models with lower compression rates than previous approaches. Moreover, we show that gains on synthetic data generalize poorly to real data, suggesting that new innovations are necessary for additional gains on KT.
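The evaluation loop described in the abstract (generate a program, check that it reproduces the sequence, score it by length) can be sketched as follows. The scoring rule used here, program bytes divided by data bytes with lower being better, and the convention that a candidate assigns its result to a variable named `output` are assumptions for illustration; the benchmark's actual protocol may differ. The `zlib` baseline stands in for the "strong baselines" the abstract mentions.

```python
import zlib


def run_candidate(program: str):
    """Execute candidate source and return the bytes it assigns to `output`.
    Returns None on any error. The reduced builtins are NOT a security sandbox."""
    env = {"__builtins__": {"bytes": bytes, "range": range, "len": len}}
    try:
        exec(program, env)
    except Exception:
        return None
    out = env.get("output")
    return bytes(out) if isinstance(out, (bytes, bytearray)) else None


def compression_rate(program: str, data: bytes):
    """len(program) / len(data) if the program reproduces data exactly,
    otherwise None (an incorrect program earns no score)."""
    if run_candidate(program) != data:
        return None
    return len(program.encode("utf-8")) / len(data)


def zlib_rate(data: bytes) -> float:
    """Baseline: size of the zlib-compressed data relative to the original."""
    return len(zlib.compress(data, 9)) / len(data)
```

For example, with `data = bytes(range(10)) * 100`, the candidate `"output = bytes(range(10)) * 100"` reproduces the sequence and can be scored against `zlib_rate(data)`, while a candidate that raises or emits different bytes is rejected.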

Claim

Compression is at the heart of intelligence.

Discriminator-Guided Embodied Planning for LLM Agent Paper
  • Authors: Haofu Qian, Chenjia Bai, Jiatao Zhang, Fei Wu, Wei Song, Xuelong Li
  • Year: 2025
  • Venue: International Conference on Learning Representations
  • DOI: Not stated.
  • Citations: 4
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: LLM agents (matched: llm agent).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.

Execution-guided within-prompt search for programming-by-example Paper
  • Authors: Gust Verbruggen, Ashish Tiwari, Mukul Singh, Vu Le, Sumit Gulwani
  • Year: 2025
  • Venue: International Conference on Learning Representations
  • DOI: Not stated.
  • Citations: 4
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: program synthesis (matched: programming).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.

Prompt as Knowledge Bank: Boost Vision-language model via Structural Representation for zero-shot medical detection Paper
  • Authors: Yuguang Yang, Tongfei Chen, Haoyu Huang, Linlin Yang, Chunyu Xie, D. Leng, Xianbin Cao, Baochang Zhang
  • Year: 2025
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2502.16223
  • Citations: 3
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, vision language model).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Zero-shot medical detection can further improve detection performance without relying on annotated medical images even upon the fine-tuned model, showing great clinical value. Recent studies leverage grounded vision-language models (GLIP) to achieve this by using detailed disease descriptions as prompts for the target disease name during the inference phase. However, these methods typically treat prompts as equivalent context to the target name, making it difficult to assign specific disease knowledge based on visual information, leading to a coarse alignment between images and target descriptions. In this paper, we propose StructuralGLIP, which introduces an auxiliary branch to encode prompts into a latent knowledge bank layer-by-layer, enabling more context-aware and fine-grained alignment. Specifically, in each layer, we select highly similar features from both the image representation and the knowledge bank, forming structural representations that capture nuanced relationships between image patches and target descriptions. These features are then fused across modalities to further enhance detection performance. Extensive experiments demonstrate that StructuralGLIP achieves a +4.1% AP improvement over prior state-of-the-art methods across seven zero-shot medical detection benchmarks, and consistently improves fine-tuned models by +3.2% AP on endoscopy image datasets.

Claim

Zero-shot medical detection can further improve detection performance without relying on annotated medical images even upon the fine-tuned model, showing great clinical value.

VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning Paper
  • Authors: N. Yilmaz, Maitreya Patel, Yiran Luo, Tejas Gokhale, Chitta Baral, Suren Jayasuriya, Yezhou Yang
  • Year: 2025
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2503.00043
  • Citations: 3
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: mllms, multimodal large language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information. Despite their exceptional performance on visual understanding benchmarks, measuring their ability to reason abstractly across multiple images remains a significant challenge. To address this, we introduce VOILA, a large-scale, open-ended, dynamic benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning. VOILA employs an analogical mapping approach in the visual domain, requiring models to generate an image that completes an analogy between two given image pairs, reference and application, without relying on predefined choices. Our experiments demonstrate that the analogical reasoning tasks in VOILA present a challenge to MLLMs. Through multi-step analysis, we reveal that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning. Notably, we observe that performance improves when following a multi-step strategy of least-to-most prompting. Comprehensive evaluations on open-source models and GPT-4o show that on text-based answers, the best accuracy for challenging scenarios is 13% (LLaMa 3.2) and even for simpler tasks is only 29% (GPT-4o), while human performance is significantly higher at 70% across both difficulty levels.

Claim

Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information.

Towards Explaining the Power of Constant-depth Graph Neural Networks for Structured Linear Programming Paper
  • Authors: Qian Li, Minghui Ouyang, Tian Ding, Yuyi Wang, Qingjiang Shi, Ruoyu Sun
  • Year: 2025
  • Venue: International Conference on Learning Representations
  • DOI: Not stated.
  • Citations: 3
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: program synthesis (matched: programming).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.

Teaching Human Behavior Improves Content Understanding Abilities Of VLMs Paper
  • Authors: Somesh Singh, Harini S.I., Yaman Kumar Singla, Changyou Chen, R. Shah, V. Baths, Balaji Krishnamurthy
  • Year: 2025
  • Venue: International Conference on Learning Representations
  • DOI: Not stated.
  • Citations: 3
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlms).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.

KinFormer: Generalizable Dynamical Symbolic Regression for Catalytic Organic Reaction Kinetics Paper
  • Authors: Jindou Chen, Jidong Tian, Liang Wu, ChenXinWei, Xiaokang Yang, Yaohui Jin, Yanyan Xu
  • Year: 2025
  • Venue: International Conference on Learning Representations
  • DOI: Not stated.
  • Citations: 3
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: program synthesis (matched: symbolic regression).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.

GrabS: Generative Embodied Agent for 3D Object Segmentation without Scene Supervision Paper
  • Authors: Zihui Zhang, Yafei Yang, Hongtao Wen, Bo Yang
  • Year: 2025
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2504.11754
  • Citations: 2
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: embodied agents (matched: embodied agent).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

We study the hard problem of 3D object segmentation in complex point clouds without requiring human labels of 3D scenes for supervision. By relying on the similarity of pretrained 2D features or external signals such as motion to group 3D points as objects, existing unsupervised methods are usually limited to identifying simple objects like cars or their segmented objects are often inferior due to the lack of objectness in pretrained features. In this paper, we propose a new two-stage pipeline called GrabS. The core concept of our method is to learn generative and discriminative object-centric priors as a foundation from object datasets in the first stage, and then design an embodied agent to learn to discover multiple objects by querying against the pretrained generative priors in the second stage. We extensively evaluate our method on two real-world datasets and a newly created synthetic dataset, demonstrating remarkable segmentation performance, clearly surpassing all existing unsupervised methods.

Claim

We study the hard problem of 3D object segmentation in complex point clouds without requiring human labels of 3D scenes for supervision.

Should VLMs be Pre-trained with Image Data? Paper
  • Authors: Sedrick Scott Keh, Jean-Pierre Mercat, S. Gadre, K. Arora, Igor Vasiljevic, Benjamin Burchfiel, Shuran Song, Russ Tedrake, Thomas Kollar, L. Schmidt, Achal Dave
  • Year: 2025
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2503.07603
  • Citations: 1
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlms).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Pre-trained LLMs that are further trained with image data perform well on vision-language tasks. While adding images during a second training phase effectively unlocks this capability, it is unclear how much of a gain or loss this two-step pipeline gives over VLMs which integrate images earlier into the training process. To investigate this, we train models spanning various datasets, scales, image-text ratios, and amount of pre-training done before introducing vision tokens. We then fine-tune these models and evaluate their downstream performance on a suite of vision-language and text-only tasks. We find that pre-training with a mixture of image and text data allows models to perform better on vision-language tasks while maintaining strong performance on text-only evaluations. On an average of 6 diverse tasks, we find that for a 1B model, introducing visual tokens 80% of the way through pre-training results in a 2% average improvement over introducing visual tokens to a fully pre-trained model.

Claim

Pre-trained LLMs that are further trained with image data perform well on vision-language tasks.

Deep Distributed Optimization for Large-Scale Quadratic Programming Paper
  • Authors: A. Saravanos, Hunter Kuperman, Alex Oshin, Arshiya Taj Abdul, Vincent Pacelli, Evangelos A. Theodorou
  • Year: 2025
  • Venue: International Conference on Learning Representations
  • DOI: Not stated.
  • Citations: 1
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: program synthesis (matched: programming).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.

Unhackable Temporal Reward for Scalable Video MLLMs Paper
  • Authors: En Yu, Kangheng Lin, Liang Zhao, Yana Wei, Zining Zhu, Haoran Wei, Jian‐Yuan Sun, Zheng Ge, Xiangyu Zhang, Jingyu Wang, Wenbing Tao
  • Year: 2025
  • Venue: International Conference on Learning Representations
  • DOI: Not stated.
  • Citations: 1
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: mllms).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.

Semantic Temporal Abstraction via Vision-Language Model Guidance for Efficient Reinforcement Learning Paper
  • Authors: Tian-Shuo Liu, Xu-Hui Liu, Ruifeng Chen, Lixuan Jin, Pengyuan Wang, Zhilong Zhang, Yang Yu
  • Year: 2025
  • Venue: International Conference on Learning Representations
  • DOI: Not stated.
  • Citations: 1
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language model).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.

Exploring the Design Space of Visual Context Representation in Video MLLMs Paper
  • Authors: Yifan Du, Yuqi Huo, Kun Zhou, Zijia Zhao, Haoyu Lu, Han Huang, Xin Zhao, Bingning Wang, Weipeng Chen, Ji-Rong Wen
  • Year: 2025
  • Venue: International Conference on Learning Representations
  • DOI: Not stated.
  • Citations: 0
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: mllms).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.

MTU-Bench: A Multi-granularity Tool-Use Benchmark for Large Language Models Paper
  • Authors: Pei Wang, Yanan Wu, Noah Wang, Jiaheng Liu, Xiaoshuai Song, Z. Peng, Ken Deng, Chenchen Zhang, Jiakai Wang, Junran Peng, Ge Zhang, Hangyu Guo, et al.
  • Year: 2025
  • Venue: International Conference on Learning Representations
  • DOI: Not stated.
  • Citations: 0
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: LLM agents (matched: tool use).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.

ScImage: How good are multimodal large language models at scientific text-to-image generation? Paper
  • Authors: Leixin Zhang, Steffen Eger, Yinjie Cheng, Weihe Zhai, Jonas Belouadi, Fahimeh Moafian, Zhixue Zhao
  • Year: 2025
  • Venue: International Conference on Learning Representations
  • DOI: Not stated.
  • Citations: 0
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: multimodal large language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.