ML + Vision Top-6 Agent Survey - NeurIPS 2023 - Page 1 of 2¶

Venue: Neural Information Processing Systems
Year: 2023
Page: 1 / 2
Papers: 1-30 / 40

Papers

Reflexion: language agents with verbal reinforcement learning Paper

Authors: Noah Shinn, Federico Cassano, Beck Labash, A. Gopinath, Karthik Narasimhan, Shunyu Yao
Year: 2023
Venue: Neural Information Processing Systems
DOI: 10.52202/075280-0377
Citations: 3719
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: LLM agents (matched: language agents).
Code: Not found.
Extraction: method/data pending

Abstract

Large language models (LLMs) have been increasingly used to interact with external environments (e.g., games, compilers, APIs) as goal-driven agents. However, it remains challenging for these language agents to quickly and efficiently learn from trial-and-error as traditional reinforcement learning methods require extensive training samples and expensive model fine-tuning. We propose Reflexion, a novel framework to reinforce language agents not by updating weights, but instead through linguistic feedback. Concretely, Reflexion agents verbally reflect on task feedback signals, then maintain their own reflective text in an episodic memory buffer to induce better decision-making in subsequent trials. Reflexion is flexible enough to incorporate various types (scalar values or free-form language) and sources (external or internally simulated) of feedback signals, and obtains significant improvements over a baseline agent across diverse tasks (sequential decision-making, coding, language reasoning). For example, Reflexion achieves a 91% pass@1 accuracy on the HumanEval coding benchmark, surpassing the previous state-of-the-art GPT-4 that achieves 80%. We also conduct ablation and analysis studies using different feedback signals, feedback incorporation methods, and agent types, and provide insights into how they affect performance.

Claim

Large language models (LLMs) have been increasingly used to interact with external environments (e.g., games, compilers, APIs) as goal-driven agents.

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning Paper

Authors: Wenliang Dai, Junnan Li, Dongxu Li, A. Tiong, Junqi Zhao, Weisheng Wang, Boyang Albert Li, Pascale Fung, Steven C. H. Hoi
Year: 2023
Venue: Neural Information Processing Systems
DOI: 10.48550/arXiv.2305.06500
Citations: 3492
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models).
Code: Not found.
Extraction: method/data pending

Abstract

Large-scale pre-training and instruction tuning have been successful at creating general-purpose language models with broad competence. However, building general-purpose vision-language models is challenging due to the rich input distributions and task diversity resulting from the additional visual input. Although vision-language pretraining has been widely studied, vision-language instruction tuning remains under-explored. In this paper, we conduct a systematic and comprehensive study on vision-language instruction tuning based on the pretrained BLIP-2 models. We gather 26 publicly available datasets, covering a wide variety of tasks and capabilities, and transform them into instruction tuning format. Additionally, we introduce an instruction-aware Query Transformer, which extracts informative features tailored to the given instruction. Trained on 13 held-in datasets, InstructBLIP attains state-of-the-art zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and larger Flamingo models. Our models also lead to state-of-the-art performance when finetuned on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA questions with image contexts). Furthermore, we qualitatively demonstrate the advantages of InstructBLIP over concurrent multimodal models. All InstructBLIP models are open-sourced at https://github.com/salesforce/LAVIS/tree/main/projects/instructblip.

Claim

Large-scale pre-training and instruction tuning have been successful at creating general-purpose language models with broad competence.

Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation Paper

Authors: Jiawei Liu, Chun Xia, Yuyao Wang, Lingming Zhang
Year: 2023
Venue: Neural Information Processing Systems
DOI: 10.52202/075280-0943
Citations: 1786
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: program synthesis (matched: programming, program synthesis, code generation).
Code: Not found.
Extraction: method/data pending

Abstract

Program synthesis has been long studied with recent approaches focused on directly using the power of Large Language Models (LLMs) to generate code. Programming benchmarks, with curated synthesis problems and test-cases, are used to measure the performance of various LLMs on code synthesis. However, these test-cases can be limited in both quantity and quality for fully assessing the functional correctness of the generated code. Such limitation in the existing benchmarks begs the following question: In the era of LLMs, is the code generated really correct? To answer this, we propose EvalPlus -- a code synthesis evaluation framework to rigorously benchmark the functional correctness of LLM-synthesized code. EvalPlus augments a given evaluation dataset with large amounts of test-cases newly produced by an automatic test input generator, powered by both LLM- and mutation-based strategies. While EvalPlus is general, we extend the test-cases of the popular HumanEval benchmark by 80x to build HumanEval+. Our extensive evaluation across 26 popular LLMs (e.g., GPT-4 and ChatGPT) demonstrates that HumanEval+ is able to catch significant amounts of previously undetected wrong code synthesized by LLMs, reducing the pass@k by up-to 19.3-28.9%. We also surprisingly found that test insufficiency can lead to mis-ranking. For example, both WizardCoder-CodeLlama and Phind-CodeLlama now outperform ChatGPT on HumanEval+, while none of them could on HumanEval. Our work not only indicates that prior popular code synthesis evaluation results do not accurately reflect the true performance of LLMs for code synthesis, but also opens up a new direction to improve such programming benchmarks through automated testing. We have open-sourced our tools, enhanced datasets as well as all LLM-generated code at https://github.com/evalplus/evalplus to facilitate and accelerate future LLM-for-code research.

Claim

Program synthesis has been long studied with recent approaches focused on directly using the power of Large Language Models (LLMs) to generate code.

Language Is Not All You Need: Aligning Perception with Language Models Paper

Authors: Shaohan Huang, Li Dong, Wenhui Wang, Y. Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, O. Mohammed, Qiang Liu, Kriti Aggarwal, Zewen Chi, et al.
Year: 2023
Venue: Neural Information Processing Systems
DOI: 10.48550/arXiv.2302.14045
Citations: 760
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: multimodal large language model, mllm, mllms).
Code: Not found.
Extraction: method/data pending

Abstract

A big convergence of language, multimodal perception, action, and world modeling is a key step toward artificial general intelligence. In this work, we introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot). Specifically, we train Kosmos-1 from scratch on web-scale multimodal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data. We evaluate various settings, including zero-shot, few-shot, and multimodal chain-of-thought prompting, on a wide range of tasks without any gradient updates or finetuning. Experimental results show that Kosmos-1 achieves impressive performance on (i) language understanding, generation, and even OCR-free NLP (directly fed with document images), (ii) perception-language tasks, including multimodal dialogue, image captioning, visual question answering, and (iii) vision tasks, such as image recognition with descriptions (specifying classification via text instructions). We also show that MLLMs can benefit from cross-modal transfer, i.e., transfer knowledge from language to multimodal, and from multimodal to language. In addition, we introduce a dataset of Raven IQ test, which diagnoses the nonverbal reasoning capability of MLLMs.

Claim

A big convergence of language, multimodal perception, action, and world modeling is a key step toward artificial general intelligence.

On Evaluating Adversarial Robustness of Large Vision-Language Models Paper

Authors: Yunqing Zhao, Tianyu Pang, Chao Du, Xiao Yang, Chongxuan Li, Ngai-Man Cheung, Min Lin
Year: 2023
Venue: Neural Information Processing Systems
DOI: 10.48550/arXiv.2305.16934
Citations: 342
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, large vision language models, vlms).
Code: Not found.
Extraction: method/data pending

Abstract

Large vision-language models (VLMs) such as GPT-4 have achieved unprecedented performance in response generation, especially with visual inputs, enabling more creative and adaptable interaction than large language models such as ChatGPT. Nonetheless, multimodal generation exacerbates safety concerns, since adversaries may successfully evade the entire system by subtly manipulating the most vulnerable modality (e.g., vision). To this end, we propose evaluating the robustness of open-source large VLMs in the most realistic and high-risk setting, where adversaries have only black-box system access and seek to deceive the model into returning the targeted responses. In particular, we first craft targeted adversarial examples against pretrained models such as CLIP and BLIP, and then transfer these adversarial examples to other VLMs such as MiniGPT-4, LLaVA, UniDiffuser, BLIP-2, and Img2Prompt. In addition, we observe that black-box queries on these VLMs can further improve the effectiveness of targeted evasion, resulting in a surprisingly high success rate for generating targeted responses. Our findings provide a quantitative understanding regarding the adversarial vulnerability of large VLMs and call for a more thorough examination of their potential security flaws before deployment in practice. Code is at https://github.com/yunqing-me/AttackVLM.

Claim

Large vision-language models (VLMs) such as GPT-4 have achieved unprecedented performance in response generation, especially with visual inputs, enabling more creative and adaptable interaction than large language models such as ChatGPT.

GraphAdapter: Tuning Vision-Language Models With Dual Knowledge Graph Paper

Authors: Xin Li, Dongze Lian, Zhihe Lu, Jiawang Bai, Zhibo Chen, Xinchao Wang
Year: 2023
Venue: Neural Information Processing Systems
DOI: 10.48550/arXiv.2309.13625
Citations: 118
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, vlms).
Code: Not found.
Extraction: method/data pending

Abstract

Adapter-style efficient transfer learning (ETL) has shown excellent performance in the tuning of vision-language models (VLMs) under the low-data regime, where only a few additional parameters are introduced to excavate the task-specific knowledge based on the general and powerful representation of VLMs. However, most adapter-style works face two limitations: (i) modeling task-specific knowledge with a single modality only; and (ii) overlooking the exploitation of the inter-class relationships in downstream tasks, thereby leading to sub-optimal solutions. To mitigate that, we propose an effective adapter-style tuning strategy, dubbed GraphAdapter, which performs the textual adapter by explicitly modeling the dual-modality structure knowledge (i.e., the correlation of different semantics/classes in textual and visual modalities) with a dual knowledge graph. In particular, the dual knowledge graph is established with two sub-graphs, i.e., a textual knowledge sub-graph, and a visual knowledge sub-graph, where the nodes and edges represent the semantics/classes and their correlations in two modalities, respectively. This enables the textual feature of each prompt to leverage the task-specific structure knowledge from both textual and visual modalities, yielding a more effective classifier for downstream tasks. Extensive experimental results on 11 benchmark datasets reveal that our GraphAdapter significantly outperforms previous adapter-based methods. The code will be released at https://github.com/lixinustc/GraphAdapter

Claim

Adapter-style efficient transfer learning (ETL) has shown excellent performance in the tuning of vision-language models (VLMs) under the low-data regime, where only a few additional parameters are introduced to excavate the task-specific knowledge based on the general and powerful representation of VLMs.

Transformer-based Planning for Symbolic Regression Paper

Authors: P. Shojaee, Kazem Meidani, A. Farimani, Chandan K. Reddy
Year: 2023
Venue: Neural Information Processing Systems
DOI: 10.48550/arXiv.2303.06833
Citations: 93
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: program synthesis (matched: programming, symbolic regression).
Code: Not found.
Extraction: method/data pending

Abstract

Symbolic regression (SR) is a challenging task in machine learning that involves finding a mathematical expression for a function based on its values. Recent advancements in SR have demonstrated the effectiveness of pre-trained transformer-based models in generating equations as sequences, leveraging large-scale pre-training on synthetic datasets and offering notable advantages in terms of inference time over classical Genetic Programming (GP) methods. However, these models primarily rely on supervised pre-training goals borrowed from text generation and overlook equation discovery objectives like accuracy and complexity. To address this, we propose TPSR, a Transformer-based Planning strategy for Symbolic Regression that incorporates Monte Carlo Tree Search into the transformer decoding process. Unlike conventional decoding strategies, TPSR enables the integration of non-differentiable feedback, such as fitting accuracy and complexity, as external sources of knowledge into the transformer-based equation generation process. Extensive experiments on various datasets show that our approach outperforms state-of-the-art methods, enhancing the model's fitting-complexity trade-off, extrapolation abilities, and robustness to noise.

Claim

Symbolic regression (SR) is a challenging task in machine learning that involves finding a mathematical expression for a function based on its values.

Grounded Decoding: Guiding Text Generation with Grounded Models for Embodied Agents Paper

Authors: Wenlong Huang, Fei Xia, Dhruv Shah, Danny Driess, Andy Zeng, Yao Lu, Pete Florence, Igor Mordatch, Sergey Levine, Karol Hausman, Brian Ichter
Year: 2023
Venue: Neural Information Processing Systems
DOI: 10.52202/075280-2606
Citations: 92
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: embodied agents (matched: embodied agents).
Code: Not found.
Extraction: method/data pending

Abstract

Recent progress in large language models (LLMs) has demonstrated the ability to learn and leverage Internet-scale knowledge through pre-training with autoregressive models. Unfortunately, applying such models to settings with embodied agents, such as robots, is challenging due to their lack of experience with the physical world, inability to parse non-language observations, and ignorance of rewards or safety constraints that robots may require. On the other hand, language-conditioned robotic policies that learn from interaction data can provide the necessary grounding that allows the agent to be correctly situated in the real world, but such policies are limited by the lack of high-level semantic understanding due to the limited breadth of the interaction data available for training them. Thus, if we want to make use of the semantic knowledge in a language model while still situating it in an embodied setting, we must construct an action sequence that is both likely according to the language model and also realizable according to grounded models of the environment. We frame this as a problem similar to probabilistic filtering: decode a sequence that both has high probability under the language model and high probability under a set of grounded model objectives. We demonstrate how such grounded models can be obtained across three simulation and real-world domains, and that the proposed decoding strategy is able to solve complex, long-horizon embodiment tasks in a robotic setting by leveraging the knowledge of both models. The project's website can be found at grounded-decoding.github.io.

Claim

Recent progress in large language models (LLMs) has demonstrated the ability to learn and leverage Internet-scale knowledge through pre-training with autoregressive models.

Stable and low-precision training for large-scale vision-language models Paper

Authors: Mitchell Wortsman, Tim Dettmers, Luke Zettlemoyer, Ari S. Morcos, Ali Farhadi, Ludwig Schmidt
Year: 2023
Venue: Neural Information Processing Systems
DOI: 10.48550/arXiv.2304.13013
Citations: 88
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models).
Code: Not found.
Extraction: method/data pending

Abstract

We introduce new methods for 1) accelerating and 2) stabilizing training for large language-vision models. 1) For acceleration, we introduce SwitchBack, a linear layer for int8 quantized training which provides a speed-up of 13-25% while matching the performance of bfloat16 training within 0.1 percentage points for the 1B parameter CLIP ViT-Huge -- the largest int8 training to date. Our main focus is int8 as GPU support for float8 is rare, though we also analyze float8 training through simulation. While SwitchBack proves effective for float8, we show that standard techniques are also successful if the network is trained and initialized so that large feature magnitudes are discouraged, which we accomplish via layer-scale initialized with zeros. 2) For stability, we analyze loss spikes and find they consistently occur 1-8 iterations after the squared gradients become under-estimated by their AdamW second moment estimator. As a result, we recommend an AdamW-Adafactor hybrid which avoids loss spikes when training a CLIP ViT-Huge model and outperforms gradient clipping at the scales we test.

Claim

We introduce new methods for 1) accelerating and 2) stabilizing training for large language-vision models.

SwapPrompt: Test-Time Prompt Adaptation for Vision-Language Models Paper

Authors: Xiaosong Ma, Jie Zhang, Song Guo, Wenchao Xu
Year: 2023
Venue: Neural Information Processing Systems
DOI: 10.52202/075280-2847
Citations: 66
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models).
Code: Not found.
Extraction: method/data pending

Abstract

Test-time adaptation (TTA) is a special and practical setting in unsupervised domain adaptation, which allows a pre-trained model in a source domain to adapt to unlabeled test data in another target domain. To avoid the computation-intensive backbone fine-tuning process, the zero-shot generalization potentials of the emerging pre-trained vision-language models (e.g., CLIP, CoOp) are leveraged to only tune the run-time prompt for unseen test domains. However, existing solutions have yet to fully exploit the representation capabilities of pre-trained models as they only focus on the entropy-based optimization and the performance is far below the supervised prompt adaptation methods, e.g., CoOp. In this paper, we propose SwapPrompt, a novel framework that can effectively leverage the self-supervised contrastive learning to facilitate the test-time prompt adaptation. SwapPrompt employs a dual prompts paradigm, i.e., an online prompt and a target prompt that averaged from the online prompt to retain historical information. In addition, SwapPrompt applies a swapped prediction mechanism, which takes advantage of the representation capabilities of pre-trained models to enhance the online prompt via contrastive learning. Specifically, we use the online prompt together with an augmented view of the input image to predict the class assignment generated by the target prompt together with an alternative augmented view of the same image. The proposed SwapPrompt can be easily deployed on vision-language models without additional requirement, and experimental results show that it achieves state-of-the-art test-time adaptation performance on ImageNet and nine other datasets. It is also shown that SwapPrompt can even achieve comparable performance with supervised prompt adaptation methods

Claim

Test-time adaptation (TTA) is a special and practical setting in unsupervised domain adaptation, which allows a pre-trained model in a source domain to adapt to unlabeled test data in another target domain.

Visual Programming for Step-by-Step Text-to-Image Generation and Evaluation Paper

Authors: Jaemin Cho, Abhaysinh Zala, Mohit Bansal
Year: 2023
Venue: Neural Information Processing Systems
DOI: Not stated.
Citations: 60
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: program synthesis (matched: programming).
Code: Not found.
Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.

Meta-Adapter: An Online Few-shot Learner for Vision-Language Model Paper

Authors: Cheng Cheng, Lin Song, Ruoyi Xue, Hang Wang, Hongbin Sun, Yixiao Ge, Ying Shan
Year: 2023
Venue: Neural Information Processing Systems
DOI: 10.48550/arXiv.2311.03774
Citations: 57
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language model).
Code: Not found.
Extraction: method/data pending

Abstract

The contrastive vision-language pre-training, known as CLIP, demonstrates remarkable potential in perceiving open-world visual concepts, enabling effective zero-shot image recognition. Nevertheless, few-shot learning methods based on CLIP typically require offline fine-tuning of the parameters on few-shot samples, resulting in longer inference time and the risk of over-fitting in certain domains. To tackle these challenges, we propose the Meta-Adapter, a lightweight residual-style adapter, to refine the CLIP features guided by the few-shot samples in an online manner. With a few training samples, our method can enable effective few-shot learning capabilities and generalize to unseen data or tasks without additional fine-tuning, achieving competitive performance and high efficiency. Without bells and whistles, our approach outperforms the state-of-the-art online few-shot learning method by an average of 3.6% on eight image classification datasets with higher inference speed. Furthermore, our model is simple and flexible, serving as a plug-and-play module directly applicable to downstream tasks. Without further fine-tuning, Meta-Adapter obtains notable performance improvements in open-vocabulary object detection and segmentation tasks.

Claim

The contrastive vision-language pre-training, known as CLIP, demonstrates remarkable potential in perceiving open-world visual concepts, enabling effective zero-shot image recognition.

EDGI: Equivariant Diffusion for Planning with Embodied Agents Paper

Authors: J. Brehmer, Joey Bose, P. D. Haan, Taco Cohen
Year: 2023
Venue: Neural Information Processing Systems
DOI: 10.48550/arXiv.2303.12410
Citations: 49
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: embodied agents (matched: embodied agents).
Code: Not found.
Extraction: method/data pending

Abstract

Embodied agents operate in a structured world, often solving tasks with spatial, temporal, and permutation symmetries. Most algorithms for planning and model-based reinforcement learning (MBRL) do not take this rich geometric structure into account, leading to sample inefficiency and poor generalization. We introduce the Equivariant Diffuser for Generating Interactions (EDGI), an algorithm for MBRL and planning that is equivariant with respect to the product of the spatial symmetry group SE(3), the discrete-time translation group Z, and the object permutation group Sn. EDGI follows the Diffuser framework (Janner et al., 2022) in treating both learning a world model and planning in it as a conditional generative modeling problem, training a diffusion model on an offline trajectory dataset. We introduce a new SE(3)xZxSn-equivariant diffusion model that supports multiple representations. We integrate this model in a planning loop, where conditioning and classifier guidance let us softly break the symmetry for specific tasks as needed. On object manipulation and navigation tasks, EDGI is substantially more sample efficient and generalizes better across the symmetry group than non-equivariant models.

Claim

Embodied agents operate in a structured world, often solving tasks with spatial, temporal, and permutation symmetries.

Towards Calibrated Robust Fine-Tuning of Vision-Language Models Paper

Authors: Changdae Oh, Mijoo Kim, Hyesu Lim, Dongyoon Han, Junhyeok Park, Euiseog Jeong, Zhi-Qi Cheng, Kyungwoo Song
Year: 2023
Venue: Neural Information Processing Systems
DOI: 10.48550/arXiv.2311.01723
Citations: 45
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models).
Code: Not found.
Extraction: method/data pending

Abstract

Improving out-of-distribution (OOD) generalization during in-distribution (ID) adaptation is a primary goal of robust fine-tuning of zero-shot models beyond naive fine-tuning. However, despite decent OOD generalization performance from recent robust fine-tuning methods, confidence calibration for reliable model output has not been fully addressed. This work proposes a robust fine-tuning method that improves both OOD accuracy and confidence calibration simultaneously in vision language models. Firstly, we show that both OOD classification and OOD calibration errors have a shared upper bound consisting of two terms of ID data: 1) ID calibration error and 2) the smallest singular value of the ID input covariance matrix. Based on this insight, we design a novel framework that conducts fine-tuning with a constrained multimodal contrastive loss enforcing a larger smallest singular value, which is further guided by the self-distillation of a moving-averaged model to achieve calibrated prediction as well. Starting from empirical evidence supporting our theoretical statements, we provide extensive experimental results on ImageNet distribution shift benchmarks that demonstrate the effectiveness of our theorem and its practical implementation.

Claim

Improving out-of-distribution (OOD) generalization during in-distribution (ID) adaptation is a primary goal of robust fine-tuning of zero-shot models beyond naive fine-tuning.

Text Promptable Surgical Instrument Segmentation with Vision-Language Models Paper

Authors: Zijian Zhou, Oluwatosin O. Alabi, Meng Wei, Tom Kamiel Magda Vercauteren, Miaojing Shi
Year: 2023
Venue: Neural Information Processing Systems
DOI: 10.48550/arXiv.2306.09244
Citations: 44
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models).
Code: Not found.
Extraction: method/data pending

Abstract

In this paper, we propose a novel text promptable surgical instrument segmentation approach to overcome challenges associated with diversity and differentiation of surgical instruments in minimally invasive surgeries. We redefine the task as text promptable, thereby enabling a more nuanced comprehension of surgical instruments and adaptability to new instrument types. Inspired by recent advancements in vision-language models, we leverage pretrained image and text encoders as our model backbone and design a text promptable mask decoder consisting of attention- and convolution-based prompting schemes for surgical instrument segmentation prediction. Our model leverages multiple text prompts for each surgical instrument through a new mixture of prompts mechanism, resulting in enhanced segmentation performance. Additionally, we introduce a hard instrument area reinforcement module to improve image feature comprehension and segmentation precision. Extensive experiments on several surgical instrument segmentation datasets demonstrate our model's superior performance and promising generalization capability. To our knowledge, this is the first implementation of a promptable approach to surgical instrument segmentation, offering significant potential for practical application in the field of robotic-assisted surgery.

Claim

In this paper, we propose a novel text promptable surgical instrument segmentation approach to overcome challenges associated with diversity and differentiation of surgical instruments in minimally invasive surgeries.

Benchmarking Robustness of Adaptation Methods on Pre-trained Vision-Language Models Paper

Authors: Shuo Chen, Jindong Gu, Zhen Han, Yunpu Ma, Philip H. S. Torr, Volker Tresp
Year: 2023
Venue: Neural Information Processing Systems
DOI: 10.48550/arXiv.2306.02080
Citations: 42
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models).
Code: Not found.
Extraction: method/data pending

Abstract

Various adaptation methods, such as LoRA, prompts, and adapters, have been proposed to enhance the performance of pre-trained vision-language models in specific domains. The robustness of these adaptation methods against distribution shifts have not been studied. In this study, we assess the robustness of 11 widely-used adaptation methods across 4 vision-language datasets under multimodal corruptions. Concretely, we introduce 7 benchmark datasets, including 96 visual and 87 textual corruptions, to investigate the robustness of different adaptation methods, the impact of available adaptation examples, and the influence of trainable parameter size during adaptation. Our analysis reveals that: 1) Adaptation methods are more sensitive to text corruptions than visual corruptions. 2) Full fine-tuning does not consistently provide the highest robustness; instead, adapters can achieve better robustness with comparable clean performance. 3) Contrary to expectations, our findings indicate that increasing the number of adaptation data and parameters does not guarantee enhanced robustness; instead it results in even lower robustness. We hope this study could benefit future research in the development of robust multimodal adaptation methods. The benchmark, code, and dataset used in this study can be accessed at https://adarobustness.github.io .

Claim

Various adaptation methods, such as LoRA, prompts, and adapters, have been proposed to enhance the performance of pre-trained vision-language models in specific domains.

Vocabulary-free Image Classification Paper

Authors: Alessandro Conti, Enrico Fini, Massimiliano Mancini, Paolo Rota, Yiming Wang, E. Ricci
Year: 2023
Venue: Neural Information Processing Systems
DOI: 10.48550/arXiv.2306.00917
Citations: 41
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, large vision language models, vision language model).
Code: Not found.
Extraction: method/data pending

Abstract

Recent advances in large vision-language models have revolutionized the image classification paradigm. Despite showing impressive zero-shot capabilities, a pre-defined set of categories, a.k.a. the vocabulary, is assumed at test time for composing the textual prompts. However, such assumption can be impractical when the semantic context is unknown and evolving. We thus formalize a novel task, termed as Vocabulary-free Image Classification (VIC), where we aim to assign to an input image a class that resides in an unconstrained language-induced semantic space, without the prerequisite of a known vocabulary. VIC is a challenging task as the semantic space is extremely large, containing millions of concepts, with hard-to-discriminate fine-grained categories. In this work, we first empirically verify that representing this semantic space by means of an external vision-language database is the most effective way to obtain semantically relevant content for classifying the image. We then propose Category Search from External Databases (CaSED), a method that exploits a pre-trained vision-language model and an external vision-language database to address VIC in a training-free manner. CaSED first extracts a set of candidate categories from captions retrieved from the database based on their semantic similarity to the image, and then assigns to the image the best matching candidate category according to the same vision-language model. Experiments on benchmark datasets validate that CaSED outperforms other complex vision-language frameworks, while being efficient with much fewer parameters, paving the way for future research in this direction.

Claim

Recent advances in large vision-language models have revolutionized the image classification paradigm.

Robust Contrastive Language-Image Pretraining against Data Poisoning and Backdoor Attacks Paper

Authors: Wenhan Yang, Jingdong Gao, Baharan Mirzasoleiman
Year: 2023
Venue: Neural Information Processing Systems
DOI: 10.52202/075280-0469
Citations: 39
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models, alpha factor search (matched: vision language models, trading).
Code: Not found.
Extraction: method/data pending

Abstract

Contrastive vision-language representation learning has achieved state-of-the-art performance for zero-shot classification, by learning from millions of image-caption pairs crawled from the internet. However, the massive data that powers large multimodal models such as CLIP, makes them extremely vulnerable to various types of targeted data poisoning and backdoor attacks. Despite this vulnerability, robust contrastive vision-language pre-training against such attacks has remained unaddressed. In this work, we propose ROCLIP, the first effective method for robust pre-training multimodal vision-language models against targeted data poisoning and backdoor attacks. ROCLIP effectively breaks the association between poisoned image-caption pairs by considering a relatively large and varying pool of random captions, and matching every image with the text that is most similar to it in the pool instead of its own caption, every few epochs.It also leverages image and text augmentations to further strengthen the defense and improve the performance of the model. Our extensive experiments show that ROCLIP renders state-of-the-art targeted data poisoning and backdoor attacks ineffective during pre-training CLIP models. In particular, ROCLIP decreases the success rate for targeted data poisoning attacks from 93.75% to 12.5% and that of backdoor attacks down to 0%, while improving the model's linear probe performance by 10% and maintains a similar zero shot performance compared to CLIP. By increasing the frequency of matching, ROCLIP is able to defend strong attacks, which add up to 1% poisoned examples to the data, and successfully maintain a low attack success rate of 12.5%, while trading off the performance on some tasks.

Claim

Contrastive vision-language representation learning has achieved state-of-the-art performance for zero-shot classification, by learning from millions of image-caption pairs crawled from the internet.

Revisiting Few-Shot Object Detection with Vision-Language Models Paper

Authors: Anish Madan, Neehar Peri, Shu Kong, Deva Ramanan
Year: 2023
Venue: Neural Information Processing Systems
DOI: 10.48550/arXiv.2312.14494
Citations: 39
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, vlms).
Code: Not found.
Extraction: method/data pending

Abstract

The era of vision-language models (VLMs) trained on web-scale datasets challenges conventional formulations of"open-world"perception. In this work, we revisit the task of few-shot object detection (FSOD) in the context of recent foundational VLMs. First, we point out that zero-shot predictions from VLMs such as GroundingDINO significantly outperform state-of-the-art few-shot detectors (48 vs. 33 AP) on COCO. Despite their strong zero-shot performance, such foundation models may still be sub-optimal. For example, trucks on the web may be defined differently from trucks for a target application such as autonomous vehicle perception. We argue that the task of few-shot recognition can be reformulated as aligning foundation models to target concepts using a few examples. Interestingly, such examples can be multi-modal, using both text and visual cues, mimicking instructions that are often given to human annotators when defining a target concept of interest. Concretely, we propose Foundational FSOD, a new benchmark protocol that evaluates detectors pre-trained on any external data and fine-tuned on multi-modal (text and visual) K-shot examples per target class. We repurpose nuImages for Foundational FSOD, benchmark several popular open-source VLMs, and provide an empirical analysis of state-of-the-art methods. Lastly, we discuss our recent CVPR 2024 Foundational FSOD competition and share insights from the community. Notably, the winning team significantly outperforms our baseline by 23.3 mAP! Our code and dataset splits are available at https://github.com/anishmadan23/foundational_fsod

Claim

The era of vision-language models (VLMs) trained on web-scale datasets challenges conventional formulations of"open-world"perception.

Voila-A: Aligning Vision-Language Models with User's Gaze Attention Paper

Authors: Kun Yan, Lei Ji, Zeyu Wang, Yuntao Wang, Nan Duan, Shuai Ma
Year: 2023
Venue: Neural Information Processing Systems
DOI: 10.48550/arXiv.2401.09454
Citations: 37
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, vlms).
Code: Not found.
Extraction: method/data pending

Abstract

In recent years, the integration of vision and language understanding has led to significant advancements in artificial intelligence, particularly through Vision-Language Models (VLMs). However, existing VLMs face challenges in handling real-world applications with complex scenes and multiple objects, as well as aligning their focus with the diverse attention patterns of human users. In this paper, we introduce gaze information, feasibly collected by AR or VR devices, as a proxy for human attention to guide VLMs and propose a novel approach, Voila-A, for gaze alignment to enhance the interpretability and effectiveness of these models in real-world applications. First, we collect hundreds of minutes of gaze data to demonstrate that we can mimic human gaze modalities using localized narratives. We then design an automatic data annotation pipeline utilizing GPT-4 to generate the VOILA-COCO dataset. Additionally, we innovate the Voila Perceiver modules to integrate gaze information into VLMs while preserving their pretrained knowledge. We evaluate Voila-A using a hold-out validation set and a newly collected VOILA-GAZE Testset, which features real-life scenarios captured with a gaze-tracking device. Our experimental results demonstrate that Voila-A significantly outperforms several baseline models. By aligning model attention with human gaze patterns, Voila-A paves the way for more intuitive, user-centric VLMs and fosters engaging human-AI interaction across a wide range of applications.

Claim

In recent years, the integration of vision and language understanding has led to significant advancements in artificial intelligence, particularly through Vision-Language Models (VLMs).

Alexa Arena: A User-Centric Interactive Platform for Embodied AI Paper

Authors: Qiaozi Gao, Govind Thattai, Xiaofeng Gao, Suhaila Shakiah, Shreyas Pansare, Vasu Sharma, G. Sukhatme, Hangjie Shi, Bo Yang, Desheng Zheng, Lucy Hu, Karthika Arumugam, et al.
Year: 2023
Venue: Neural Information Processing Systems
DOI: 10.48550/arXiv.2303.01586
Citations: 33
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: embodied agents (matched: embodied agents, embodied ai).
Code: Not found.
Extraction: method/data pending

Abstract

We introduce Alexa Arena, a user-centric simulation platform for Embodied AI (EAI) research. Alexa Arena provides a variety of multi-room layouts and interactable objects, for the creation of human-robot interaction (HRI) missions. With user-friendly graphics and control mechanisms, Alexa Arena supports the development of gamified robotic tasks readily accessible to general human users, thus opening a new venue for high-efficiency HRI data collection and EAI system evaluation. Along with the platform, we introduce a dialog-enabled instruction-following benchmark and provide baseline results for it. We make Alexa Arena publicly available to facilitate research in building generalizable and assistive embodied agents.

Claim

We introduce Alexa Arena, a user-centric simulation platform for Embodied AI (EAI) research.

Learning Domain-Aware Detection Head with Prompt Tuning Paper

Authors: Haochen Li, Rui Zhang, Hantao Yao, Xinkai Song, Yifan Hao, Yongwei Zhao, Ling Li, Yunji Chen
Year: 2023
Venue: Neural Information Processing Systems
DOI: 10.48550/arXiv.2306.05718
Citations: 32
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, vlms).
Code: Not found.
Extraction: method/data pending

Abstract

Domain adaptive object detection (DAOD) aims to generalize detectors trained on an annotated source domain to an unlabelled target domain. However, existing methods focus on reducing the domain bias of the detection backbone by inferring a discriminative visual encoder, while ignoring the domain bias in the detection head. Inspired by the high generalization of vision-language models (VLMs), applying a VLM as the robust detection backbone following a domain-aware detection head is a reasonable way to learn the discriminative detector for each domain, rather than reducing the domain bias in traditional methods. To achieve the above issue, we thus propose a novel DAOD framework named Domain-Aware detection head with Prompt tuning (DA-Pro), which applies the learnable domain-adaptive prompt to generate the dynamic detection head for each domain. Formally, the domain-adaptive prompt consists of the domain-invariant tokens, domain-specific tokens, and the domain-related textual description along with the class label. Furthermore, two constraints between the source and target domains are applied to ensure that the domain-adaptive prompt can capture the domains-shared and domain-specific knowledge. A prompt ensemble strategy is also proposed to reduce the effect of prompt disturbance. Comprehensive experiments over multiple cross-domain adaptation tasks demonstrate that using the domain-adaptive prompt can produce an effectively domain-related detection head for boosting domain-adaptive object detection. Our code is available at https://github.com/Therock90421/DA-Pro.

Claim

Domain adaptive object detection (DAOD) aims to generalize detectors trained on an annotated source domain to an unlabelled target domain.

TradeMaster: A Holistic Quantitative Trading Platform Empowered by Reinforcement Learning Paper

Authors: Shuo Sun, Molei Qin, Wentao Zhang, Haochong Xia, Chuqiao Zong, Jie Ying, Yonggang Xie, Lingxuan Zhao, Xinrun Wang, Bo An
Year: 2023
Venue: Neural Information Processing Systems
DOI: 10.52202/075280-2576
Citations: 29
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: alpha factor search (matched: trading).
Code: Not found.
Extraction: method/data pending

Abstract

The financial markets, which involve over $90 trillion market capitals, attract the attention of innumerable profit-seeking investors globally. Recent explosion of reinforcement learning in financial trading (RLFT) research has shown stellar performance on many quantitative trading tasks. However, it is still challenging to deploy reinforcement learning (RL) methods into real-world financial markets due to the highly composite nature of this domain, which entails design choices and interactions between components that collect financial data, conduct feature engineering, build market environments, make investment decisions, evaluate model behaviors and provides user interfaces. Despite the availability of abundant financial data and advanced RL techniques, a remarkable gap still exists between the potential and realized utilization of RL in financial trading. In particular, orchestrating an RLFT project lifecycle poses challenges in engineering (i.e., hard to build), benchmarking (i.e., hard to compare) and usability (i.e., hard to optimize, maintain and use). To overcome these challenges, we introduce TradeMaster, a holistic open-source RLFT platform that serves as a i) software toolkit, ii) empirical benchmark, and iii) user interface. Our ultimate goal is to provide infrastructures for transparent and reproducible RLFT research and facilitate their real-world deployment with industry impact. TradeMaster will be updated continuously and welcomes contributions from both RL and finance communities.

Claim

The financial markets, which involve over $90 trillion market capitals, attract the attention of innumerable profit-seeking investors globally.

LOVM: Language-Only Vision Model Selection Paper

Authors: O. Zohar, Shih-Cheng Huang, Kuan Wang, Serena Yeung
Year: 2023
Venue: Neural Information Processing Systems
DOI: 10.48550/arXiv.2306.08893
Citations: 20
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, vlms).
Code: Not found.
Extraction: method/data pending

Abstract

Pre-trained multi-modal vision-language models (VLMs) are becoming increasingly popular due to their exceptional performance on downstream vision applications, particularly in the few- and zero-shot settings. However, selecting the best-performing VLM for some downstream applications is non-trivial, as it is dataset and task-dependent. Meanwhile, the exhaustive evaluation of all available VLMs on a novel application is not only time and computationally demanding but also necessitates the collection of a labeled dataset for evaluation. As the number of open-source VLM variants increases, there is a need for an efficient model selection strategy that does not require access to a curated evaluation dataset. This paper proposes a novel task and benchmark for efficiently evaluating VLMs' zero-shot performance on downstream applications without access to the downstream task dataset. Specifically, we introduce a new task LOVM: Language-Only Vision Model Selection, where methods are expected to perform both model selection and performance prediction based solely on a text description of the desired downstream application. We then introduced an extensive LOVM benchmark consisting of ground-truth evaluations of 35 pre-trained VLMs and 23 datasets, where methods are expected to rank the pre-trained VLMs and predict their zero-shot performance.

Claim

Pre-trained multi-modal vision-language models (VLMs) are becoming increasingly popular due to their exceptional performance on downstream vision applications, particularly in the few- and zero-shot settings.

On Dynamic Programming Decompositions of Static Risk Measures in Markov Decision Processes Paper

Authors: J. Hau, E. Delage, M. Ghavamzadeh, Marek Petrik
Year: 2023
Venue: Neural Information Processing Systems
DOI: 10.52202/075280-2254
Citations: 18
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: program synthesis (matched: programming).
Code: Not found.
Extraction: method/data pending

Abstract

Optimizing static risk-averse objectives in Markov decision processes is difficult because they do not admit standard dynamic programming equations common in Reinforcement Learning (RL) algorithms. Dynamic programming decompositions that augment the state space with discrete risk levels have recently gained popularity in the RL community. Prior work has shown that these decompositions are optimal when the risk level is discretized sufficiently. However, we show that these popular decompositions for Conditional-Value-at-Risk (CVaR) and Entropic-Value-at-Risk (EVaR) are inherently suboptimal regardless of the discretization level. In particular, we show that a saddle point property assumed to hold in prior literature may be violated. However, a decomposition does hold for Value-at-Risk and our proof demonstrates how this risk measure differs from CVaR and EVaR. Our findings are significant because risk-averse algorithms are used in high-stake environments, making their correctness much more critical.

Claim

Optimizing static risk-averse objectives in Markov decision processes is difficult because they do not admit standard dynamic programming equations common in Reinforcement Learning (RL) algorithms.

ANPL: Towards Natural Programming with Interactive Decomposition Paper

Authors: Di Huang, Ziyuan Nan, Xingui Hu, Pengwei Jin, Shaohui Peng, Yuanbo Wen, Rui Zhang, Zidong Du, Qi Guo, Yewen Pu, Yunji Chen
Year: 2023
Venue: Neural Information Processing Systems
DOI: 10.52202/075280-3040
Citations: 18
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: program synthesis (matched: programming, program synthesis).
Code: Not found.
Extraction: method/data pending

Abstract

Though LLMs are capable of generating plausible programs, it's challenging to interact with the LLMs further to revise the program, especially if the user's specific requirements are different from the initial proposal. In this paper, we introduce ANPL, an interactive programming system that ensures users can always refine the generated code towards their specific programmatic intents via structured decompositions. Borrowing the paradigm of sketching from program synthesis, an ANPL program consists of a set of input-outputs that it must satisfy, a “sketch” -- control/data flow expressed in precise code (e.g. Python), and “holes” -- sub-modules to be implemented by the LLM specified with natural language. The user revises an ANPL program by either modifying the sketch, changing the language used to describe the holes, or providing additional input-outputs to a particular hole, turning it into a sub-ANPL program that can be solved recursively. This workflow allows the users to offload programming burdens to the LLM as much as possible while retaining the ability to pinpoint and resolve bugs locally, without exposing the rest of the program to the LLM. We deploy ANPL on the Abstraction and Reasoning Corpus (ARC), a set of unique tasks that are challenging for state-of-the-art AI systems, showing it outperforms baseline programming systems that (a) without the ability to decompose tasks interactively and (b) without the guarantee that the modules can be correctly composed together. Additional evaluations on APPS, HumanEval, and real-world programming tasks have validated that the ANPL framework is applicable to multiple programming domains. We release the ANPL solutions to the ARC tasks as a dataset, providing insights into how humans decompose novel tasks programmatically. See our code at https://iprc-dip.github.io/ANPL/.

Claim

Though LLMs are capable of generating plausible programs, it's challenging to interact with the LLMs further to revise the program, especially if the user's specific requirements are different from the initial proposal.

Exact Bayesian Inference on Discrete Models via Probability Generating Functions: A Probabilistic Programming Approach Paper

Authors: Fabian Zaiser, A. Murawski, Luke Ong
Year: 2023
Venue: Neural Information Processing Systems
DOI: 10.48550/arXiv.2305.17058
Citations: 15
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: program synthesis (matched: programming).
Code: Not found.
Extraction: method/data pending

Abstract

We present an exact Bayesian inference method for discrete statistical models, which can find exact solutions to many discrete inference problems, even with infinite support and continuous priors. To express such models, we introduce a probabilistic programming language that supports discrete and continuous sampling, discrete observations, affine functions, (stochastic) branching, and conditioning on events. Our key tool is probability generating functions: they provide a compact closed-form representation of distributions that are definable by programs, thus enabling the exact computation of posterior probabilities, expectation, variance, and higher moments. Our inference method is provably correct, fully automated and uses automatic differentiation (specifically, Taylor polynomials), but does not require computer algebra. Our experiments show that its performance on a range of real-world examples is competitive with approximate Monte Carlo methods, while avoiding approximation errors.

Claim

We present an exact Bayesian inference method for discrete statistical models, which can find exact solutions to many discrete inference problems, even with infinite support and continuous priors.

Parts of Speech-Grounded Subspaces in Vision-Language Models Paper

Authors: James Oldfield, Christos Tzelepis, Yannis Panagakis, M. Nicolaou, I. Patras
Year: 2023
Venue: Neural Information Processing Systems
DOI: 10.48550/arXiv.2305.14053
Citations: 12
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models).
Code: Not found.
Extraction: method/data pending

Abstract

Latent image representations arising from vision-language models have proved immensely useful for a variety of downstream tasks. However, their utility is limited by their entanglement with respect to different visual attributes. For instance, recent work has shown that CLIP image representations are often biased toward specific visual properties (such as objects or actions) in an unpredictable manner. In this paper, we propose to separate representations of the different visual modalities in CLIP's joint vision-language space by leveraging the association between parts of speech and specific visual modes of variation (e.g. nouns relate to objects, adjectives describe appearance). This is achieved by formulating an appropriate component analysis model that learns subspaces capturing variability corresponding to a specific part of speech, while jointly minimising variability to the rest. Such a subspace yields disentangled representations of the different visual properties of an image or text in closed form while respecting the underlying geometry of the manifold on which the representations lie. What's more, we show the proposed model additionally facilitates learning subspaces corresponding to specific visual appearances (e.g. artists' painting styles), which enables the selective removal of entire visual themes from CLIP-based text-to-image synthesis. We validate the model both qualitatively, by visualising the subspace projections with a text-to-image model and by preventing the imitation of artists' styles, and quantitatively, through class invariance metrics and improvements to baseline zero-shot classification.

Claim

Latent image representations arising from vision-language models have proved immensely useful for a variety of downstream tasks.

Data-Dependent Bounds for Online Portfolio Selection Without Lipschitzness and Smoothness Paper

Authors: C. Tsai, Ying-Ting Lin, Yen-Huan Li
Year: 2023
Venue: Neural Information Processing Systems
DOI: 10.48550/arXiv.2305.13946
Citations: 12
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: alpha factor search (matched: portfolio).
Code: Not found.
Extraction: method/data pending

Abstract

This work introduces the first small-loss and gradual-variation regret bounds for online portfolio selection, marking the first instances of data-dependent bounds for online convex optimization with non-Lipschitz, non-smooth losses. The algorithms we propose exhibit sublinear regret rates in the worst cases and achieve logarithmic regrets when the data is"easy,"with per-iteration time almost linear in the number of investment alternatives. The regret bounds are derived using novel smoothness characterizations of the logarithmic loss, a local norm-based analysis of following the regularized leader (FTRL) with self-concordant regularizers, which are not necessarily barriers, and an implicit variant of optimistic FTRL with the log-barrier.

Claim

This work introduces the first small-loss and gradual-variation regret bounds for online portfolio selection, marking the first instances of data-dependent bounds for online convex optimization with non-Lipschitz, non-smooth losses.