ML + Vision Top-6 Agent Survey - ICLR 2023 - Page 1 of 2

  • Venue: International Conference on Learning Representations
  • Year: 2023
  • Page: 1 / 2
  • Papers: 1-30 / 45
Building Cooperative Embodied Agents Modularly with Large Language Models Paper
  • Authors: Hongxin Zhang, Weihua Du, Jiaming Shan, Qinhong Zhou, Yilun Du, J. Tenenbaum, Tianmin Shu, Chuang Gan
  • Year: 2023
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2307.02485
  • Citations: 304
  • Relevance: 5 / 5
  • Why selected: Heuristic keyword/alias matches: LLM agents, embodied agents (matched: llm agents, language agent, embodied agents).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

In this work, we address challenging multi-agent cooperation problems with decentralized control, raw sensory observations, costly communication, and multi-objective tasks instantiated in various embodied environments. While previous research either presupposes a cost-free communication channel or relies on a centralized controller with shared observations, we harness the commonsense knowledge, reasoning ability, language comprehension, and text generation prowess of LLMs and seamlessly incorporate them into a cognitive-inspired modular framework that integrates with perception, memory, and execution. Thus building a Cooperative Embodied Language Agent CoELA, who can plan, communicate, and cooperate with others to accomplish long-horizon tasks efficiently. Our experiments on C-WAH and TDW-MAT demonstrate that CoELA driven by GPT-4 can surpass strong planning-based methods and exhibit emergent effective communication. Though current Open LMs like LLAMA-2 still underperform, we fine-tune a CoELA with data collected with our agents and show how they can achieve promising performance. We also conducted a user study for human-agent interaction and discovered that CoELA communicating in natural language can earn more trust and cooperate more effectively with humans. Our research underscores the potential of LLMs for future research in multi-agent cooperation. Videos can be found on the project website https://vis-www.cs.umass.edu/Co-LLM-Agents/.

Claim

In this work, we address challenging multi-agent cooperation problems with decentralized control, raw sensory observations, costly communication, and multi-objective tasks instantiated in various embodied environments.

Lemur: Harmonizing Natural Language and Code for Language Agents Paper
  • Authors: Yiheng Xu, Hongjin Su, Chen Xing, Boyu Mi, Qian Liu, Weijia Shi, Binyuan Hui, Fan Zhou, Yitao Liu, Tianbao Xie, Zhoujun Cheng, Siheng Zhao, et al.
  • Year: 2023
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2310.06830
  • Citations: 94
  • Relevance: 5 / 5
  • Why selected: Heuristic keyword/alias matches: LLM agents, program synthesis (matched: language agents, programming).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

We introduce Lemur and Lemur-Chat, openly accessible language models optimized for both natural language and coding capabilities to serve as the backbone of versatile language agents. The evolution from language chat models to functional language agents demands that models not only master human interaction, reasoning, and planning but also ensure grounding in the relevant environments. This calls for a harmonious blend of language and coding capabilities in the models. Lemur and Lemur-Chat are proposed to address this necessity, demonstrating balanced proficiencies in both domains, unlike existing open-source models that tend to specialize in either. Through meticulous pre-training using a code-intensive corpus and instruction fine-tuning on text and code data, our models achieve state-of-the-art averaged performance across diverse text and coding benchmarks among open-source models. Comprehensive experiments demonstrate Lemur's superiority over existing open-source models and its proficiency across various agent tasks involving human communication, tool usage, and interaction under fully- and partially- observable environments. The harmonization between natural and programming languages enables Lemur-Chat to significantly narrow the gap with proprietary models on agent abilities, providing key insights into developing advanced open-source agents adept at reasoning, planning, and operating seamlessly across environments. https://github.com/OpenLemur/Lemur

Claim

We introduce Lemur and Lemur-Chat, openly accessible language models optimized for both natural language and coding capabilities to serve as the backbone of versatile language agents.

Kosmos-2: Grounding Multimodal Large Language Models to the World Paper
  • Authors: Zhiliang Peng, Wenhui Wang, Li Dong, Y. Hao, Shaohan Huang, Shuming Ma, Furu Wei
  • Year: 2023
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2306.14824
  • Citations: 1195
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: multimodal large language model, mllm, mllms).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new capabilities of perceiving object descriptions (e.g., bounding boxes) and grounding text to the visual world. Specifically, we represent refer expressions as Markdown-style text spans linked to bounding boxes, where object descriptions are sequences of location tokens. Together with multimodal corpora, we construct large-scale data of grounded image-text pairs (called GrIT) to train the model. In addition to the existing capabilities of MLLMs (e.g., perceiving general modalities, following instructions, and performing in-context learning), Kosmos-2 integrates the grounding capability into downstream applications. We evaluate Kosmos-2 on a wide range of tasks, including (i) multimodal grounding, such as referring expression comprehension, and phrase grounding, (ii) multimodal referring, such as referring expression generation, (iii) perception-language tasks, and (iv) language understanding and generation. This work lays out the foundation for the development of Embodiment AI and sheds light on the big convergence of language, multimodal perception, action, and world modeling, which is a key step toward artificial general intelligence. Code and pretrained models are available at https://aka.ms/kosmos-2.

Claim

We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new capabilities of perceiving object descriptions (e.g., bounding boxes) and grounding text to the visual world.

Ferret: Refer and Ground Anything Anywhere at Any Granularity Paper
  • Authors: Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, Yinfei Yang
  • Year: 2023
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2310.07704
  • Citations: 535
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: multimodal large language model, mllm, mllms).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of understanding spatial referring of any shape or granularity within an image and accurately grounding open-vocabulary descriptions. To unify referring and grounding in the LLM paradigm, Ferret employs a novel and powerful hybrid region representation that integrates discrete coordinates and continuous features jointly to represent a region in the image. To extract the continuous features of versatile regions, we propose a spatial-aware visual sampler, adept at handling varying sparsity across different shapes. Consequently, Ferret can accept diverse region inputs, such as points, bounding boxes, and free-form shapes. To bolster the desired capability of Ferret, we curate GRIT, a comprehensive refer-and-ground instruction tuning dataset including 1.1M samples that contain rich hierarchical spatial knowledge, with 95K hard negative data to promote model robustness. The resulting model not only achieves superior performance in classical referring and grounding tasks, but also greatly outperforms existing MLLMs in region-based and localization-demanded multimodal chatting. Our evaluations also reveal a significantly improved capability of describing image details and a remarkable alleviation in object hallucination. Code and data will be available at https://github.com/apple/ml-ferret

Claim

We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of understanding spatial referring of any shape or granularity within an image and accurately grounding open-vocabulary descriptions.

A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis Paper
  • Authors: Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, D. Eck, Aleksandra Faust
  • Year: 2023
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2307.12856
  • Citations: 373
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: program synthesis (matched: program synthesis, code generation).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Pre-trained large language models (LLMs) have recently achieved better generalization and sample efficiency in autonomous web automation. However, the performance on real-world websites has still suffered from (1) open domainness, (2) limited context length, and (3) lack of inductive bias on HTML. We introduce WebAgent, an LLM-driven agent that learns from self-experience to complete tasks on real websites following natural language instructions. WebAgent plans ahead by decomposing instructions into canonical sub-instructions, summarizes long HTML documents into task-relevant snippets, and acts on websites via Python programs generated from those. We design WebAgent with Flan-U-PaLM, for grounded code generation, and HTML-T5, new pre-trained LLMs for long HTML documents using local and global attention mechanisms and a mixture of long-span denoising objectives, for planning and summarization. We empirically demonstrate that our modular recipe improves the success on real websites by over 50%, and that HTML-T5 is the best model to solve various HTML understanding tasks; achieving 18.7% higher success rate than the prior method on MiniWoB web automation benchmark, and SoTA performance on Mind2Web, an offline task planning evaluation.

Claim

Pre-trained large language models (LLMs) have recently achieved better generalization and sample efficiency in autonomous web automation.

Identifying the Risks of LM Agents with an LM-Emulated Sandbox Paper
  • Authors: Yangjun Ruan, Honghua Dong, Andrew Wang, Silviu Pitis, Yongchao Zhou, Jimmy Ba, Yann Dubois, Chris J. Maddison, Tatsunori Hashimoto
  • Year: 2023
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2309.15817
  • Citations: 344
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: LLM agents (matched: tool use, lm agents, lm agent).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Recent advances in Language Model (LM) agents and tool use, exemplified by applications like ChatGPT Plugins, enable a rich set of capabilities but also amplify potential risks - such as leaking private data or causing financial losses. Identifying these risks is labor-intensive, necessitating implementing the tools, setting up the environment for each test scenario manually, and finding risky cases. As tools and agents become more complex, the high cost of testing these agents will make it increasingly difficult to find high-stakes, long-tailed risks. To address these challenges, we introduce ToolEmu: a framework that uses an LM to emulate tool execution and enables the testing of LM agents against a diverse range of tools and scenarios, without manual instantiation. Alongside the emulator, we develop an LM-based automatic safety evaluator that examines agent failures and quantifies associated risks. We test both the tool emulator and evaluator through human evaluation and find that 68.8% of failures identified with ToolEmu would be valid real-world agent failures. Using our curated initial benchmark consisting of 36 high-stakes tools and 144 test cases, we provide a quantitative risk analysis of current LM agents and identify numerous failures with potentially severe outcomes. Notably, even the safest LM agent exhibits such failures 23.9% of the time according to our evaluator, underscoring the need to develop safer LM agents for real-world deployment.

Claim

Recent advances in Language Model (LM) agents and tool use, exemplified by applications like ChatGPT Plugins, enable a rich set of capabilities but also amplify potential risks - such as leaking private data or causing financial losses.

Analyzing and Mitigating Object Hallucination in Large Vision-Language Models Paper
  • Authors: Yiyang Zhou, Chenhang Cui, Jaehong Yoon, Linjun Zhang, Zhun Deng, Chelsea Finn, Mohit Bansal, Huaxiu Yao
  • Year: 2023
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2310.00754
  • Citations: 329
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, lvlm, large vision language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Large vision-language models (LVLMs) have shown remarkable abilities in understanding visual information with human languages. However, LVLMs still suffer from object hallucination, which is the problem of generating descriptions that include objects that do not actually exist in the images. This can negatively impact many vision-language tasks, such as visual summarization and reasoning. To address this issue, we propose a simple yet powerful algorithm, LVLM Hallucination Revisor (LURE), to post-hoc rectify object hallucination in LVLMs by reconstructing less hallucinatory descriptions. LURE is grounded in a rigorous statistical analysis of the key factors underlying object hallucination, including co-occurrence (the frequent appearance of certain objects alongside others in images), uncertainty (objects with higher uncertainty during LVLM decoding), and object position (hallucination often appears in the later part of the generated text). LURE can also be seamlessly integrated with any LVLMs. We evaluate LURE on six open-source LVLMs, achieving a 23% improvement in general object hallucination evaluation metrics over the previous best approach. In both GPT and human evaluations, LURE consistently ranks at the top. Our data and code are available at https://github.com/YiyangZhou/LURE.

Claim

Large vision-language models (LVLMs) have shown remarkable abilities in understanding visual information with human languages.

DreamLLM: Synergistic Multimodal Comprehension and Creation Paper
  • Authors: Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jian‐Yuan Sun, Hongyu Zhou, Haoran Wei, Xiangwen Kong, Xiangyu Zhang, et al.
  • Year: 2023
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2309.11499
  • Citations: 324
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: mllm, mllms, multimodal large language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

This paper presents DreamLLM, a learning framework that first achieves versatile Multimodal Large Language Models (MLLMs) empowered with frequently overlooked synergy between multimodal comprehension and creation. DreamLLM operates on two fundamental principles. The first focuses on the generative modeling of both language and image posteriors by direct sampling in the raw multimodal space. This approach circumvents the limitations and information loss inherent to external feature extractors like CLIP, and a more thorough multimodal understanding is obtained. Second, DreamLLM fosters the generation of raw, interleaved documents, modeling both text and image contents, along with unstructured layouts. This allows DreamLLM to learn all conditional, marginal, and joint multimodal distributions effectively. As a result, DreamLLM is the first MLLM capable of generating free-form interleaved content. Comprehensive experiments highlight DreamLLM's superior performance as a zero-shot multimodal generalist, reaping from the enhanced learning synergy. Project page: https://dreamllm.github.io.

Claim

This paper presents DreamLLM, a learning framework that first achieves versatile Multimodal Large Language Models (MLLMs) empowered with frequently overlooked synergy between multimodal comprehension and creation.

SOTOPIA: Interactive Evaluation for Social Intelligence in Language Agents Paper
  • Authors: Xuhui Zhou, Hao Zhu, Leena Mathur, Ruohong Zhang, Haofei Yu, Zhengyang Qi, Louis-philippe Morency, Yonatan Bisk, Daniel Fried, Graham Neubig, Maarten Sap
  • Year: 2023
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2310.11667
  • Citations: 323
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: LLM agents (matched: language agents).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Humans are social beings; we pursue social goals in our daily interactions, which is a crucial aspect of social intelligence. Yet, AI systems' abilities in this realm remain elusive. We present SOTOPIA, an open-ended environment to simulate complex social interactions between artificial agents and evaluate their social intelligence. In our environment, agents role-play and interact under a wide variety of scenarios; they coordinate, collaborate, exchange, and compete with each other to achieve complex social goals. We simulate the role-play interaction between LLM-based agents and humans within this task space and evaluate their performance with a holistic evaluation framework called SOTOPIA-Eval. With SOTOPIA, we find significant differences between these models in terms of their social intelligence, and we identify a subset of SOTOPIA scenarios, SOTOPIA-hard, that is generally challenging for all models. We find that on this subset, GPT-4 achieves a significantly lower goal completion rate than humans and struggles to exhibit social commonsense reasoning and strategic communication skills. These findings demonstrate SOTOPIA's promise as a general platform for research on evaluating and improving social intelligence in artificial agents.

Claim

Humans are social beings; we pursue social goals in our daily interactions, which is a crucial aspect of social intelligence.

Planning with Large Language Models for Code Generation Paper
  • Authors: Shun Zhang, Zhenfang Chen, Yikang Shen, Mingyu Ding, J. Tenenbaum, Chuang Gan
  • Year: 2023
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2303.05510
  • Citations: 246
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: program synthesis (matched: code generation).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Existing large language model-based code generation pipelines typically use beam search or sampling algorithms during the decoding process. Although the programs they generate achieve high token-matching-based scores, they often fail to compile or generate incorrect outputs. The main reason is that conventional Transformer decoding algorithms may not be the best choice for code generation. In this work, we propose a novel Transformer decoding algorithm, Planning-Guided Transformer Decoding (PG-TD), that uses a planning algorithm to do lookahead search and guide the Transformer to generate better programs. Specifically, instead of simply optimizing the likelihood of the generated sequences, the Transformer makes use of a planner to generate candidate programs and test them on public test cases. The Transformer can therefore make more informed decisions and generate tokens that will eventually lead to higher-quality programs. We also design a mechanism that shares information between the Transformer and the planner to make our algorithm computationally efficient. We empirically evaluate our framework with several large language models as backbones on public coding challenge benchmarks, showing that 1) it can generate programs that consistently achieve higher performance compared with competing baseline methods; 2) it enables controllable code generation, such as concise codes and highly-commented codes by optimizing modified objective.

Claim

Existing large language model-based code generation pipelines typically use beam search or sampling algorithms during the decoding process.

MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning Paper
  • Authors: Haozhe Zhao, Zefan Cai, Shuzheng Si, Xiaojian Ma, Kaikai An, Liang Chen, Zixuan Liu, Sheng Wang, Wenjuan Han, Baobao Chang
  • Year: 2023
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2309.07915
  • Citations: 207
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, vision language model).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Since the resurgence of deep learning, vision-language models (VLMs) enhanced by large language models (LLMs) have grown exponentially in popularity. However, while LLMs can utilize extensive background knowledge and task information with in-context learning, most VLMs still struggle with understanding complex multi-modal prompts with multiple images, making VLMs less effective in downstream vision-language tasks. In this paper, we address the limitation above by 1) introducing vision-language Model with Multi-Modal In-Context Learning(MMICL), a new approach to allow the VLM to deal with multi-modal inputs efficiently; 2) proposing a novel context scheme to augment the in-context learning ability of the VLM; 3) constructing the Multi-modal In-Context Learning (MIC) dataset, designed to enhance the VLM's ability to understand complex multi-modal prompts. Our experiments confirm that MMICL achieves new state-of-the-art zero-shot performance on a wide range of general vision-language tasks, especially for complex benchmarks, including MME and MMBench. Our analysis demonstrates that MMICL effectively tackles the challenge of complex multi-modal prompt understanding and emerges the impressive ICL ability. Furthermore, we observe that MMICL successfully alleviates language bias in VLMs, a common issue for VLMs that often leads to hallucination when faced with extensive textual context. Our code, dataset, dataset tool, and model are available at https://github.com/PKUnlp-icler/MIC

Claim

Since the resurgence of deep learning, vision-language models (VLMs) enhanced by large language models (LLMs) have grown exponentially in popularity.

Is Self-Repair a Silver Bullet for Code Generation? Paper
  • Authors: Theo X. Olausson, J. Inala, Chenglong Wang, Jianfeng Gao, Armando Solar-Lezama
  • Year: 2023
  • Venue: International Conference on Learning Representations
  • DOI: Not stated.
  • Citations: 205
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: program synthesis (matched: code generation).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Large language models have shown remarkable aptitude in code generation, but still struggle to perform complex tasks. Self-repair -- in which the model debugs and repairs its own code -- has recently become a popular way to boost performance in these settings. However, despite its increasing popularity, existing studies of self-repair have been limited in scope; in many settings, its efficacy thus remains poorly understood. In this paper, we analyze Code Llama, GPT-3.5 and GPT-4's ability to perform self-repair on problems taken from HumanEval and APPS. We find that when the cost of carrying out repair is taken into account, performance gains are often modest, vary a lot between subsets of the data, and are sometimes not present at all. We hypothesize that this is because self-repair is bottlenecked by the model's ability to provide feedback on its own code; using a stronger model to artificially boost the quality of the feedback, we observe substantially larger performance gains. Similarly, a small-scale study in which we provide GPT-4 with feedback from human participants suggests that even for the strongest models, self-repair still lags far behind what can be achieved with human-level debugging.

Claim

Large language models have shown remarkable aptitude in code generation, but still struggle to perform complex tasks.

Guiding Instruction-based Image Editing via Multimodal Large Language Models Paper
  • Authors: Tsu-Jui Fu, Wenze Hu, Xianzhi Du, William Yang Wang, Yinfei Yang, Zhe Gan
  • Year: 2023
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2309.17102
  • Citations: 178
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: mllm, mllms, multimodal large language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Instruction-based image editing improves the controllability and flexibility of image manipulation via natural commands without elaborate descriptions or regional masks. However, human instructions are sometimes too brief for current methods to capture and follow. Multimodal large language models (MLLMs) show promising capabilities in cross-modal understanding and visual-aware response generation via LMs. We investigate how MLLMs facilitate edit instructions and present MLLM-Guided Image Editing (MGIE). MGIE learns to derive expressive instructions and provides explicit guidance. The editing model jointly captures this visual imagination and performs manipulation through end-to-end training. We evaluate various aspects of Photoshop-style modification, global photo optimization, and local editing. Extensive experimental results demonstrate that expressive instructions are crucial to instruction-based image editing, and our MGIE can lead to a notable improvement in automatic metrics and human evaluation while maintaining competitive inference efficiency.

Claim

Instruction-based image editing improves the controllability and flexibility of image manipulation via natural commands without elaborate descriptions or regional masks.

Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning Paper
  • Authors: Juan Rocamonde, Victoriano Montesinos, Elvis Nava, Ethan Perez, David Lindner
  • Year: 2023
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2310.12921
  • Citations: 162
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, vlms).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Reinforcement learning (RL) requires either manually specifying a reward function, which is often infeasible, or learning a reward model from a large amount of human feedback, which is often very expensive. We study a more sample-efficient alternative: using pretrained vision-language models (VLMs) as zero-shot reward models (RMs) to specify tasks via natural language. We propose a natural and general approach to using VLMs as reward models, which we call VLM-RMs. We use VLM-RMs based on CLIP to train a MuJoCo humanoid to learn complex tasks without a manually specified reward function, such as kneeling, doing the splits, and sitting in a lotus position. For each of these tasks, we only provide a single sentence text prompt describing the desired task with minimal prompt engineering. We provide videos of the trained agents at: https://sites.google.com/view/vlm-rm. We can improve performance by providing a second"baseline"prompt and projecting out parts of the CLIP embedding space irrelevant to distinguish between goal and baseline. Further, we find a strong scaling effect for VLM-RMs: larger VLMs trained with more compute and data are better reward models. The failure modes of VLM-RMs we encountered are all related to known capability limitations of current VLMs, such as limited spatial reasoning ability or visually unrealistic environments that are far off-distribution for the VLM. We find that VLM-RMs are remarkably robust as long as the VLM is large enough. This suggests that future VLMs will become more and more useful reward models for a wide range of RL applications.

Claim

Reinforcement learning (RL) requires either manually specifying a reward function, which is often infeasible, or learning a reward model from a large amount of human feedback, which is often very expensive.

Multimodal Web Navigation with Instruction-Finetuned Foundation Models Paper
  • Authors: Hiroki Furuta, Ofir Nachum, Kuang-Huei Lee, Yutaka Matsuo, S. Gu, Izzeddin Gur
  • Year: 2023
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2305.11854
  • Citations: 160
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: multimodal agents, computer-use agents (matched: multimodal agent, web agents).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

The progress of autonomous web navigation has been hindered by the dependence on billions of exploratory interactions via online reinforcement learning, and domain-specific model designs that make it difficult to leverage generalization from rich out-of-domain data. In this work, we study data-driven offline training for web agents with vision-language foundation models. We propose an instruction-following multimodal agent, WebGUM, that observes both webpage screenshots and HTML pages and outputs web navigation actions, such as click and type. WebGUM is trained by jointly finetuning an instruction-finetuned language model and a vision encoder with temporal and local perception on a large corpus of demonstrations. We empirically demonstrate this recipe improves the agent's ability of grounded multimodal perception, HTML comprehension, and multi-step reasoning, outperforming prior works by a significant margin. On the MiniWoB, we improve over the previous best offline methods by more than 45.8%, even outperforming online-finetuned SoTA, humans, and GPT-4-based agent. On the WebShop benchmark, our 3-billion-parameter model achieves superior performance to the existing SoTA, PaLM-540B. Furthermore, WebGUM exhibits strong positive transfer to the real-world planning tasks on the Mind2Web. We also collect 347K high-quality demonstrations using our trained models, 38 times larger than prior work, and make them available to promote future research in this direction.

Claim

The progress of autonomous web navigation has been hindered by the dependence on billions of exploratory interactions via online reinforcement learning, and domain-specific model designs that make it difficult to leverage generalization from rich out-of-domain data.

Retroformer: Retrospective Large Language Agents with Policy Gradient Optimization Paper
  • Authors: Weiran Yao, Shelby Heinecke, Juan Carlos Niebles, Zhiwei Liu, Yihao Feng, Le Xue, Rithesh Murthy, Zeyuan Chen, Jianguo Zhang, Devansh Arpit, Ran Xu, P. Mùi, et al.
  • Year: 2023
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2308.02151
  • Citations: 126
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: LLM agents (matched: language agents, language agent).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Recent months have seen the emergence of a powerful new trend in which large language models (LLMs) are augmented to become autonomous language agents capable of performing objective oriented multi-step tasks on their own, rather than merely responding to queries from human users. Most existing language agents, however, are not optimized using environment-specific rewards. Although some agents enable iterative refinement through verbal feedback, they do not reason and plan in ways that are compatible with gradient-based learning from rewards. This paper introduces a principled framework for reinforcing large language agents by learning a retrospective model, which automatically tunes the language agent prompts from environment feedback through policy gradient. Specifically, our proposed agent architecture learns from rewards across multiple environments and tasks, for fine-tuning a pre-trained language model which refines the language agent prompt by summarizing the root cause of prior failed attempts and proposing action plans. Experimental results on various tasks demonstrate that the language agents improve over time and that our approach considerably outperforms baselines that do not properly leverage gradients from the environment. This demonstrates that using policy gradient optimization to improve language agents, for which we believe our work is one of the first, seems promising and can be applied to optimize other models in the agent architecture to enhance agent performances over time.

Claim

Recent months have seen the emergence of a powerful new trend in which large language models (LLMs) are augmented to become autonomous language agents capable of performing objective oriented multi-step tasks on their own, rather than merely responding to queries from human users.

Vision-by-Language for Training-Free Compositional Image Retrieval Paper
  • Authors: Shyamgopal Karthik, Karsten Roth, Massimiliano Mancini, Zeynep Akata
  • Year: 2023
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2310.09291
  • Citations: 126
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, vlms).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Given an image and a target modification (e.g an image of the Eiffel tower and the text"without people and at night-time"), Compositional Image Retrieval (CIR) aims to retrieve the relevant target image in a database. While supervised approaches rely on annotating triplets that is costly (i.e. query image, textual modification, and target image), recent research sidesteps this need by using large-scale vision-language models (VLMs), performing Zero-Shot CIR (ZS-CIR). However, state-of-the-art approaches in ZS-CIR still require training task-specific, customized models over large amounts of image-text pairs. In this work, we propose to tackle CIR in a training-free manner via our Compositional Image Retrieval through Vision-by-Language (CIReVL), a simple, yet human-understandable and scalable pipeline that effectively recombines large-scale VLMs with large language models (LLMs). By captioning the reference image using a pre-trained generative VLM and asking a LLM to recompose the caption based on the textual target modification for subsequent retrieval via e.g. CLIP, we achieve modular language reasoning. In four ZS-CIR benchmarks, we find competitive, in-part state-of-the-art performance - improving over supervised methods. Moreover, the modularity of CIReVL offers simple scalability without re-training, allowing us to both investigate scaling laws and bottlenecks for ZS-CIR while easily scaling up to in parts more than double of previously reported results. Finally, we show that CIReVL makes CIR human-understandable by composing image and text in a modular fashion in the language domain, thereby making it intervenable, allowing to post-hoc re-align failure cases. Code will be released upon acceptance.

Claim

Given an image and a target modification (e.g an image of the Eiffel tower and the text"without people and at night-time"), Compositional Image Retrieval (CIR) aims to retrieve the relevant target image in a database.

Controlling Vision-Language Models for Multi-Task Image Restoration Paper
  • Authors: Ziwei Luo, Fredrik K. Gustafsson, Zheng Zhao, Jens Sjolund, Thomas B. Schon
  • Year: 2023
  • Venue: International Conference on Learning Representations
  • DOI: Not stated.
  • Citations: 125
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, vision language model).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Vision-language models such as CLIP have shown great impact on diverse downstream tasks for zero-shot or label-free predictions. However, when it comes to low-level vision such as image restoration their performance deteriorates dramatically due to corrupted inputs. In this paper, we present a degradation-aware vision-language model (DA-CLIP) to better transfer pretrained vision-language models to low-level vision tasks as a multi-task framework for image restoration. More specifically, DA-CLIP trains an additional controller that adapts the fixed CLIP image encoder to predict high-quality feature embeddings. By integrating the embedding into an image restoration network via cross-attention, we are able to pilot the model to learn a high-fidelity image reconstruction. The controller itself will also output a degradation feature that matches the real corruptions of the input, yielding a natural classifier for different degradation types. In addition, we construct a mixed degradation dataset with synthetic captions for DA-CLIP training. Our approach advances state-of-the-art performance on both degradation-specific and unified image restoration tasks, showing a promising direction of prompting image restoration with large-scale pretrained vision-language models. Our code is available at https://github.com/Algolzw/daclip-uir.

Claim

Vision-language models such as CLIP have shown great impact on diverse downstream tasks for zero-shot or label-free predictions.

Kosmos-G: Generating Images in Context with Multimodal Large Language Models Paper
  • Authors: Xichen Pan, Li Dong, Shaohan Huang, Zhiliang Peng, Wenhu Chen, Furu Wei
  • Year: 2023
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2310.02992
  • Citations: 110
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: mllm, mllms, multimodal large language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Recent advancements in subject-driven image generation have made significant strides. However, current methods still fall short in diverse application scenarios, as they require test-time tuning and cannot accept interleaved multi-image and text input. These limitations keep them far from the ultimate goal of"image as a foreign language in image generation."This paper presents Kosmos-G, a model that leverages the advanced multimodal perception capabilities of Multimodal Large Language Models (MLLMs) to tackle the aforementioned challenge. Our approach aligns the output space of MLLM with CLIP using the textual modality as an anchor and performs compositional instruction tuning on curated data. Kosmos-G demonstrates an impressive capability of zero-shot subject-driven generation with interleaved multi-image and text input. Notably, the score distillation instruction tuning requires no modifications to the image decoder. This allows for a seamless substitution of CLIP and effortless integration with a myriad of U-Net techniques ranging from fine-grained controls to personalized image decoder variants. We posit Kosmos-G as an initial attempt towards the goal of"image as a foreign language in image generation."The code can be found at https://aka.ms/Kosmos-G

Claim

Recent advancements in subject-driven image generation have made significant strides.

Tag2Text: Guiding Vision-Language Model via Image Tagging Paper
  • Authors: Xinyu Huang, Youcai Zhang, Jinyu Ma, Weiwei Tian, Rui Feng, Yuejie Zhang, Yaqian Li, Yandong Guo, Lei Zhang
  • Year: 2023
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2303.05657
  • Citations: 109
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, vision language model).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

This paper presents Tag2Text, a vision language pre-training (VLP) framework, which introduces image tagging into vision-language models to guide the learning of visual-linguistic features. In contrast to prior works which utilize object tags either manually labeled or automatically detected with an off-the-shelf detector with limited performance, our approach explicitly learns an image tagger using tags parsed from image-paired text and thus provides a strong semantic guidance to vision-language models. In this way, Tag2Text can utilize large-scale annotation-free image tags in accordance with image-text pairs, and provides more diverse tag categories beyond objects. As a result, Tag2Text demonstrates the ability of a foundational image tagging model, with superior zero-shot performance even comparable to fully supervised models. Moreover, by leveraging the tagging guidance, Tag2Text effectively enhances the performance of vision-language models on both generation-based and alignment-based tasks. Across a wide range of downstream benchmarks, Tag2Text achieves state-of-the-art results with similar model sizes and data scales, demonstrating the efficacy of the proposed tagging guidance. Code, demo and pre-trained models are available at https://github.com/xinyu1205/recognize-anything.

Claim

This paper presents Tag2Text, a vision language pre-training (VLP) framework, which introduces image tagging into vision-language models to guide the learning of visual-linguistic features.

Consistency-guided Prompt Learning for Vision-Language Models Paper
  • Authors: Shuvendu Roy, Ali Etemad
  • Year: 2023
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2306.01195
  • Citations: 106
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

We propose Consistency-guided Prompt learning (CoPrompt), a new fine-tuning method for vision-language models. Our approach improves the generalization of large foundation models when fine-tuned on downstream tasks in a few-shot setting. The basic idea of CoPrompt is to enforce a consistency constraint in the prediction of the trainable and pre-trained models to prevent overfitting on the downstream task. Additionally, we introduce the following two components into our consistency constraint to further boost the performance: enforcing consistency on two perturbed inputs and combining two dominant paradigms of tuning, prompting and adapter. Enforcing consistency on perturbed input serves to further regularize the consistency constraint, thereby improving generalization. Moreover, the integration of adapters and prompts not only enhances performance on downstream tasks but also offers increased tuning flexibility in both input and output spaces. This facilitates more effective adaptation to downstream tasks in a few-shot learning setting. Experiments show that CoPrompt outperforms existing methods on a range of evaluation suites, including base-to-novel generalization, domain generalization, and cross-dataset evaluation. On generalization, CoPrompt improves the state-of-the-art on zero-shot tasks and the overall harmonic mean over 11 datasets. Detailed ablation studies show the effectiveness of each of the components in CoPrompt. We make our code available at https://github.com/ShuvenduRoy/CoPrompt.

Claim

We propose Consistency-guided Prompt learning (CoPrompt), a new fine-tuning method for vision-language models.

Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment Paper
  • Authors: Utkarsh Mall, Cheng Perng Phoo, Meilin Liu, Carl Vondrick, B. Hariharan, Kavita Bala
  • Year: 2023
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2312.06960
  • Citations: 96
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, vision language model).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

We introduce a method to train vision-language models for remote-sensing images without using any textual annotations. Our key insight is to use co-located internet imagery taken on the ground as an intermediary for connecting remote-sensing images and language. Specifically, we train an image encoder for remote sensing images to align with the image encoder of CLIP using a large amount of paired internet and satellite images. Our unsupervised approach enables the training of a first-of-its-kind large-scale vision language model (VLM) for remote sensing images at two different resolutions. We show that these VLMs enable zero-shot, open-vocabulary image classification, retrieval, segmentation and visual question answering for satellite images. On each of these tasks, our VLM trained without textual annotations outperforms existing VLMs trained with supervision, with gains of up to 20% for classification and 80% for segmentation.

Claim

We introduce a method to train vision-language models for remote-sensing images without using any textual annotations.

CodeChain: Towards Modular Code Generation Through Chain of Self-revisions with Representative Sub-modules Paper
  • Authors: Hung Le, Hailin Chen, Amrita Saha, Akash Gokul, Doyen Sahoo, Shafiq R. Joty
  • Year: 2023
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2310.08992
  • Citations: 77
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: program synthesis (matched: programming, code generation).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Large Language Models (LLMs) have already become quite proficient at solving simpler programming tasks like those in HumanEval or MBPP benchmarks. However, solving more complex and competitive programming tasks is still quite challenging for these models - possibly due to their tendency to generate solutions as monolithic code blocks instead of decomposing them into logical sub-tasks and sub-modules. On the other hand, experienced programmers instinctively write modularized code with abstraction for solving complex tasks, often reusing previously developed modules. To address this gap, we propose CodeChain, a novel framework for inference that elicits modularized code generation through a chain of self-revisions, each being guided by some representative sub-modules generated in previous iterations. Concretely, CodeChain first instructs the LLM to generate modularized codes through chain-of-thought prompting. Then it applies a chain of self-revisions by iterating the two steps: 1) extracting and clustering the generated sub-modules and selecting the cluster representatives as the more generic and re-usable implementations, and 2) augmenting the original chain-of-thought prompt with these selected module-implementations and instructing the LLM to re-generate new modularized solutions. We find that by naturally encouraging the LLM to reuse the previously developed and verified sub-modules, CodeChain can significantly boost both modularity as well as correctness of the generated solutions, achieving relative pass@1 improvements of 35% on APPS and 76% on CodeContests. It is shown to be effective on both OpenAI LLMs as well as open-sourced LLMs like WizardCoder. We also conduct comprehensive ablation studies with different methods of prompting, number of clusters, model sizes, program qualities, etc., to provide useful insights that underpin CodeChain's success.

Claim

Large Language Models (LLMs) have already become quite proficient at solving simpler programming tasks like those in HumanEval or MBPP benchmarks.

Learning Cut Selection for Mixed-Integer Linear Programming via Hierarchical Sequence Model Paper
  • Authors: Zhihai Wang, Xijun Li, Jie Wang, Yufei Kuang, M. Yuan, Jianguo Zeng, Yongdong Zhang, Feng Wu
  • Year: 2023
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2302.00244
  • Citations: 63
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: program synthesis (matched: programming).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Cutting planes (cuts) are important for solving mixed-integer linear programs (MILPs), which formulate a wide range of important real-world applications. Cut selection -- which aims to select a proper subset of the candidate cuts to improve the efficiency of solving MILPs -- heavily depends on (P1) which cuts should be preferred, and (P2) how many cuts should be selected. Although many modern MILP solvers tackle (P1)-(P2) by manually designed heuristics, machine learning offers a promising approach to learn more effective heuristics from MILPs collected from specific applications. However, many existing learning-based methods focus on learning which cuts should be preferred, neglecting the importance of learning the number of cuts that should be selected. Moreover, we observe from extensive empirical results that (P3) what order of selected cuts should be preferred has a significant impact on the efficiency of solving MILPs as well. To address this challenge, we propose a novel hierarchical sequence model (HEM) to learn cut selection policies via reinforcement learning. Specifically, HEM consists of a two-level model: (1) a higher-level model to learn the number of cuts that should be selected, (2) and a lower-level model -- that formulates the cut selection task as a sequence to sequence learning problem -- to learn policies selecting an ordered subset with the size determined by the higher-level model. To the best of our knowledge, HEM is the first method that can tackle (P1)-(P3) in cut selection simultaneously from a data-driven perspective. Experiments show that HEM significantly improves the efficiency of solving MILPs compared to human-designed and learning-based baselines on both synthetic and large-scale real-world MILPs, including MIPLIB 2017. Moreover, experiments demonstrate that HEM well generalizes to MILPs that are significantly larger than those seen during training.

Claim

Cutting planes (cuts) are important for solving mixed-integer linear programs (MILPs), which formulate a wide range of important real-world applications.

ODEFormer: Symbolic Regression of Dynamical Systems with Transformers Paper
  • Authors: Stéphane d’Ascoli, Soren Becker, Alexander Mathis, P. Schwaller, Niki Kilbertus
  • Year: 2023
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2310.05573
  • Citations: 61
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: program synthesis (matched: symbolic regression).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

We introduce ODEFormer, the first transformer able to infer multidimensional ordinary differential equation (ODE) systems in symbolic form from the observation of a single solution trajectory. We perform extensive evaluations on two datasets: (i) the existing"Strogatz"dataset featuring two-dimensional systems; (ii) ODEBench, a collection of one- to four-dimensional systems that we carefully curated from the literature to provide a more holistic benchmark. ODEFormer consistently outperforms existing methods while displaying substantially improved robustness to noisy and irregularly sampled observations, as well as faster inference. We release our code, model and benchmark dataset publicly.

Claim

We introduce ODEFormer, the first transformer able to infer multidimensional ordinary differential equation (ODE) systems in symbolic form from the observation of a single solution trajectory.

Steve-Eye: Equipping LLM-based Embodied Agents with Visual Perception in Open Worlds Paper
  • Authors: Sipeng Zheng, Jiazheng Liu, Yicheng Feng, Zongqing Lu
  • Year: 2023
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2310.13255
  • Citations: 57
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: embodied agents (matched: embodied agents).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Recent studies have presented compelling evidence that large language models (LLMs) can equip embodied agents with the self-driven capability to interact with the world, which marks an initial step toward versatile robotics. However, these efforts tend to overlook the visual richness of open worlds, rendering the entire interactive process akin to"a blindfolded text-based game."Consequently, LLM-based agents frequently encounter challenges in intuitively comprehending their surroundings and producing responses that are easy to understand. In this paper, we propose Steve-Eye, an end-to-end trained large multimodal model designed to address this limitation. Steve-Eye integrates the LLM with a visual encoder which enables it to process visual-text inputs and generate multimodal feedback. In addition, we use a semi-automatic strategy to collect an extensive dataset comprising 850K open-world instruction pairs, empowering our model to encompass three essential functions for an agent: multimodal perception, foundational knowledge base, and skill prediction and planning. Lastly, we develop three open-world evaluation benchmarks, then carry out extensive experiments from a wide range of perspectives to validate our model's capability to strategically act and plan. Codes and datasets will be released.

Claim

Recent studies have presented compelling evidence that large language models (LLMs) can equip embodied agents with the self-driven capability to interact with the world, which marks an initial step toward versatile robotics.

A GNN-Guided Predict-and-Search Framework for Mixed-Integer Linear Programming Paper
  • Authors: Qingyu Han, L. Yang, Qian Chen, Xiang Zhou, Dong Zhang, Akang Wang, Ruoyu Sun, Xiaodong Luo
  • Year: 2023
  • Venue: International Conference on Learning Representations
  • DOI: Not stated.
  • Citations: 54
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: program synthesis (matched: programming).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Mixed-integer linear programming (MILP) is widely employed for modeling combinatorial optimization problems. In practice, similar MILP instances with only coefficient variations are routinely solved, and machine learning (ML) algorithms are capable of capturing common patterns across these MILP instances. In this work, we combine ML with optimization and propose a novel predict-and-search framework for efficiently identifying high-quality feasible solutions. Specifically, we first utilize graph neural networks to predict the marginal probability of each variable, and then search for the best feasible solution within a properly defined ball around the predicted solution. We conduct extensive experiments on public datasets, and computational results demonstrate that our proposed framework achieves 51.1% and 9.9% performance improvements to MILP solvers SCIP and Gurobi on primal gaps, respectively.

Claim

Mixed-integer linear programming (MILP) is widely employed for modeling combinatorial optimization problems.

Deep Generative Symbolic Regression Paper
  • Authors: Samuel Holt, Zhaozhi Qian, M. Schaar
  • Year: 2023
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2401.00282
  • Citations: 51
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: program synthesis (matched: symbolic regression).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Symbolic regression (SR) aims to discover concise closed-form mathematical equations from data, a task fundamental to scientific discovery. However, the problem is highly challenging because closed-form equations lie in a complex combinatorial search space. Existing methods, ranging from heuristic search to reinforcement learning, fail to scale with the number of input variables. We make the observation that closed-form equations often have structural characteristics and invariances (e.g., the commutative law) that could be further exploited to build more effective symbolic regression solutions. Motivated by this observation, our key contribution is to leverage pre-trained deep generative models to capture the intrinsic regularities of equations, thereby providing a solid foundation for subsequent optimization steps. We show that our novel formalism unifies several prominent approaches of symbolic regression and offers a new perspective to justify and improve on the previous ad hoc designs, such as the usage of cross-entropy loss during pre-training. Specifically, we propose an instantiation of our framework, Deep Generative Symbolic Regression (DGSR). In our experiments, we show that DGSR achieves a higher recovery rate of true equations in the setting of a larger number of input variables, and it is more computationally efficient at inference time than state-of-the-art RL symbolic regression solutions.

Claim

Symbolic regression (SR) aims to discover concise closed-form mathematical equations from data, a task fundamental to scientific discovery.

Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models Paper
  • Authors: Shuai Zhao, Xiaohan Wang, Linchao Zhu, Yezhou Yang
  • Year: 2023
  • Venue: International Conference on Learning Representations
  • DOI: 10.48550/arXiv.2305.18010
  • Citations: 49
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, vlms).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

One fascinating aspect of pre-trained vision-language models (VLMs) learning under language supervision is their impressive zero-shot generalization capability. However, this ability is hindered by distribution shifts between the training and testing data. Previous test time adaptation (TTA) methods for VLMs in zero-shot classification rely on minimizing the entropy of model outputs, tending to be stuck in incorrect model predictions. In this work, we propose TTA with feedback to rectify the model output and prevent the model from becoming blindly confident. Specifically, a CLIP model is adopted as the reward model during TTA and provides feedback for the VLM. Given a single test sample, the VLM is forced to maximize the CLIP reward between the input and sampled results from the VLM output distribution. The proposed reinforcement learning with CLIP feedback (RLCF) framework is highly flexible and universal. Beyond the classification task, with task-specific sampling strategies and a proper reward baseline choice, RLCF can be easily extended to not only discrimination tasks like retrieval but also generalization tasks like image captioning, improving the zero-shot generalization capacity of VLMs. According to the characteristics of these VL tasks, we build different fully TTA pipelines with RLCF to improve the zero-shot generalization ability of various VLMs. Extensive experiments along with promising empirical results demonstrate the effectiveness of RLCF. The code is available at https://github.com/mzhaoshuai/RLCF.

Claim

One fascinating aspect of pre-trained vision-language models (VLMs) learning under language supervision is their impressive zero-shot generalization capability.

Transformer-based model for symbolic regression via joint supervised learning Paper
  • Authors: Wenqiang Li, Weijun Li, Linjun Sun, Min Wu, Lina Yu, Jingyi Liu, Yanjie Li, Song Tian
  • Year: 2023
  • Venue: International Conference on Learning Representations
  • DOI: Not stated.
  • Citations: 40
  • Relevance: 3 / 5
  • Why selected: Heuristic keyword/alias matches: program synthesis (matched: symbolic regression).
  • Code: Not found.
  • Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.