ML + Vision Top-6 Agent Survey - NeurIPS 2024 - Page 5 of 6

Venue: Neural Information Processing Systems
Year: 2024
Page: 5 / 6
Papers: 121-150 / 167

Papers

Enhancing vision-language models for medical imaging: bridging the 3D gap with innovative slice selection Paper

Authors: Yuli Wang, Peng Jian, Yuwei Dai, Craig K. Jones, Haris I. Sair, Jinglai Shen, Nicolas Loizou, Jing Wu, Wen-Chi Hsu, Maliha R. Imami, Zhicheng Jiao, Paul Zhang, et al.
Year: 2024
Venue: Neural Information Processing Systems
DOI: 10.52202/079017-3171
Citations: 20
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, large vision language models, vlms).
Code: Not found.
Extraction: method/data pending

Abstract

Recent approaches to vision-language tasks are built on the remarkable capabilities of large vision-language models (VLMs). These models excel in zero-shot and few-shot learning, enabling them to learn new tasks without parameter updates. However, their primary challenge lies in their design, which primarily accommodates 2D input, thus limiting their effectiveness for medical images, particularly radiological images like MRI and CT, which are typically 3D. To bridge the gap between state-of-the-art 2D VLMs and 3D medical image data, we developed an innovative, one-pass, unsupervised representative slice selection method called Vote-MI, which selects representative 2D slices from 3D medical imaging. To evaluate the effectiveness of Vote-MI when implemented with VLMs, we introduce BrainMD, a robust, multimodal dataset comprising 2,453 annotated 3D MRI brain scans with corresponding textual radiology reports and electronic health records. Based on BrainMD, we further develop two benchmarks, BrainMD-select (including the most representative 2D slice of a 3D image) and BrainBench (including various vision-language downstream tasks). Extensive experiments on the BrainMD dataset and its two corresponding benchmarks demonstrate that our representative selection method significantly improves performance in zero-shot and few-shot learning tasks. On average, Vote-MI achieves a 14.6% and 16.6% absolute gain for zero-shot and few-shot learning, respectively, compared to randomly selecting examples. Our studies represent a significant step toward integrating AI in medical imaging to enhance patient care and facilitate medical research. We hope this work will serve as a foundation for data selection as vision-language models are increasingly applied to new tasks. Code and data examples are available

Claim

Recent approaches to vision-language tasks are built on the remarkable capabilities of large vision-language models (VLMs).

Boosting Text-to-Video Generative Model with MLLMs Feedback Paper

Authors: Xun Wu, Shaohan Huang, Guolong Wang, Jing Xiong, Furu Wei
Year: 2024
Venue: Neural Information Processing Systems
DOI: 10.52202/079017-4426
Citations: 20
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: mllms, multimodal large language models).
Code: Not found.
Extraction: method/data pending

Abstract

Recent advancements in text-to-video generative models, such as Sora [3], have showcased impressive capabilities. These models have attracted significant interest for their potential applications. However, they often rely on extensive datasets of variable quality, which can result in generated videos that lack aesthetic appeal and do not accurately reflect the input text prompts. A promising approach to mitigate these issues is to leverage Reinforcement Learning from Human Feedback (RLHF), which aims to align the outputs of text-to-video models with human preferences. However, the considerable costs associated with manual annotation have led to a scarcity of comprehensive preference datasets. In response to this challenge, our study begins by investigating the efficacy of Multimodal Large Language Models (MLLMs) generated annotations in capturing video preferences, discovering a high degree of concordance with human judgments. Building upon this finding, we utilize MLLMs to perform fine-grained video preference annotations across two dimensions, resulting in the creation of VideoPrefer, which includes 135,000 preference annotations. Utilizing this dataset, we introduce VideoRM, the first general-purpose reward model tailored for video preference in the text-to-video domain. Our comprehensive experiments confirm the effectiveness of both VideoPrefer and VideoRM, representing a significant step forward in the field.

Claim

Recent advancements in text-to-video generative models, such as Sora [3], have showcased impressive capabilities.

VLMimic: Vision Language Models are Visual Imitation Learner for Fine-grained Actions Paper

Authors: Guanyan Chen, Meiling Wang, Te Cui, Yao Mu, Haoyang Lu, Tianxing Zhou, Zicai Peng, Mengxiao Hu, Haizhou Li, Yuan Li, Yi Yang, Yufeng Yue
Year: 2024
Venue: Neural Information Processing Systems
DOI: 10.48550/arXiv.2410.20927
Citations: 19
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, vlms).
Code: Not found.
Extraction: method/data pending

Abstract

Visual imitation learning (VIL) provides an efficient and intuitive strategy for robotic systems to acquire novel skills. Recent advancements in Vision Language Models (VLMs) have demonstrated remarkable performance in vision and language reasoning capabilities for VIL tasks. Despite the progress, current VIL methods naively employ VLMs to learn high-level plans from human videos, relying on pre-defined motion primitives for executing physical interactions, which remains a major bottleneck. In this work, we present VLMimic, a novel paradigm that harnesses VLMs to directly learn even fine-grained action levels, only given a limited number of human videos. Specifically, VLMimic first grounds object-centric movements from human videos, and learns skills using hierarchical constraint representations, facilitating the derivation of skills with fine-grained action levels from limited human videos. These skills are refined and updated through an iterative comparison strategy, enabling efficient adaptation to unseen environments. Our extensive experiments exhibit that our VLMimic, using only 5 human videos, yields significant improvements of over 27% and 21% in RLBench and real-world manipulation tasks, and surpasses baselines by over 37% in long-horizon tasks.

Claim

Visual imitation learning (VIL) provides an efficient and intuitive strategy for robotic systems to acquire novel skills.

Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning Paper

Authors: Hang Zhou, Yehui Tang, Haochen Qin, Yujie Yang, Renren Jin, Deyi Xiong, Kai Han, Yunhe Wang
Year: 2024
Venue: Neural Information Processing Systems
DOI: 10.48550/arXiv.2411.14497
Citations: 19
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: LLM agents (matched: llm agents).
Code: Not found.
Extraction: method/data pending

Abstract

The efficacy of large language models (LLMs) on downstream tasks usually hinges on instruction tuning, which relies critically on the quality of training data. Unfortunately, collecting high-quality and diverse data is both expensive and time-consuming. To mitigate this issue, we propose a novel Star-Agents framework, which automates the enhancement of data quality across datasets through multi-agent collaboration and assessment. The framework adopts a three-pronged strategy. It initially generates diverse instruction data with multiple LLM agents through a bespoke sampling method. Subsequently, the generated data undergo a rigorous evaluation using a dual-model method that assesses both difficulty and quality. Finally, the above process evolves in a dynamic refinement phase, where more effective LLMs are prioritized, enhancing the overall data quality. Our empirical studies, including instruction tuning experiments with models such as Pythia and LLaMA, demonstrate the effectiveness of the proposed framework. Optimized datasets have achieved substantial improvements, with an average increase of 12% and notable gains in specific metrics, such as a 40% improvement in Fermi, as evidenced by benchmarks like MT-bench, Vicuna bench, and WizardLM testset.

Claim

The efficacy of large language models (LLMs) on downstream tasks usually hinges on instruction tuning, which relies critically on the quality of training data.

Conjugated Semantic Pool Improves OOD Detection with Pre-trained Vision-Language Models Paper

Authors: Mengyuan Chen, Junyu Gao, Changsheng Xu
Year: 2024
Venue: Neural Information Processing Systems
DOI: 10.48550/arXiv.2410.08611
Citations: 18
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, vision language model).
Code: https://github.com/MengyuanChen21/NeurIPS2024-CSP
Extraction: method/data pending

Abstract

A straightforward pipeline for zero-shot out-of-distribution (OOD) detection involves selecting potential OOD labels from an extensive semantic pool and then leveraging a pre-trained vision-language model to perform classification on both in-distribution (ID) and OOD labels. In this paper, we theorize that enhancing performance requires expanding the semantic pool, while increasing the expected probability of selected OOD labels being activated by OOD samples, and ensuring low mutual dependence among the activations of these OOD labels. A natural expansion manner is to adopt a larger lexicon; however, the inevitable introduction of numerous synonyms and uncommon words fails to meet the above requirements, indicating that viable expansion manners move beyond merely selecting words from a lexicon. Since OOD detection aims to correctly classify input images into ID/OOD class groups, we can "make up" OOD label candidates which are not standard class names but beneficial for the process. Observing that the original semantic pool is comprised of unmodified specific class names, we correspondingly construct a conjugated semantic pool (CSP) consisting of modified superclass names, each serving as a cluster center for samples sharing similar properties across different categories. Consistent with our established theory, expanding OOD label candidates with the CSP satisfies the requirements and outperforms existing works by 7.89% in FPR95. Codes are available in https://github.com/MengyuanChen21/NeurIPS2024-CSP.
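The pipeline described in the first sentence of this abstract, and the FPR95 metric it reports, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the temperature value, and the choice of score (softmax mass assigned to the ID labels) are assumptions made for illustration, and the embeddings are plain lists standing in for the outputs of a pre-trained vision-language model.

```python
import math


def id_confidence(image_emb, id_text_embs, ood_text_embs, temperature=0.01):
    """Softmax mass on the ID labels when classifying over ID + OOD labels.

    Hypothetical helper: a high value suggests an in-distribution image.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    img = normalize(image_emb)
    labels = [normalize(t) for t in id_text_embs + ood_text_embs]
    # Cosine similarity between the image and every label, scaled by temperature.
    logits = [sum(a * b for a, b in zip(img, t)) / temperature for t in labels]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    return sum(weights[:len(id_text_embs)]) / sum(weights)


def fpr_at_95_tpr(id_scores, ood_scores):
    """Fraction of OOD samples accepted when at least 95% of ID samples are accepted."""
    ranked = sorted(id_scores)
    # Threshold such that at least 95% of ID scores lie at or above it.
    threshold = ranked[len(ranked) * 5 // 100]
    return sum(s >= threshold for s in ood_scores) / len(ood_scores)
```

Under this sketch, adding more OOD label embeddings only changes the denominator of the softmax, which is one way to see why the choice of OOD label candidates matters for the final score.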

Claim

A straightforward pipeline for zero-shot out-of-distribution (OOD) detection involves selecting potential OOD labels from an extensive semantic pool and then leveraging a pre-trained vision-language model to perform classification on both in-distribution (ID) and OOD labels.

ChatTracker: Enhancing Visual Tracking Performance via Chatting with Multimodal Large Language Model Paper

Authors: Yiming Sun, Fan Yu, Shaoxiang Chen, Yu Zhang, Junwei Huang, Chenhui Li, Yang Li, Changbo Wang
Year: 2024
Venue: Neural Information Processing Systems
DOI: 10.48550/arXiv.2411.01756
Citations: 17
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: multimodal large language model, mllm).
Code: Not found.
Extraction: method/data pending

Abstract

Visual object tracking aims to locate a targeted object in a video sequence based on an initial bounding box. Recently, Vision-Language (VL) trackers have proposed to utilize additional natural language descriptions to enhance versatility in various applications. However, VL trackers are still inferior to State-of-The-Art (SoTA) visual trackers in terms of tracking performance. We found that this inferiority primarily results from their heavy reliance on manual textual annotations, which include the frequent provision of ambiguous language descriptions. In this paper, we propose ChatTracker to leverage the wealth of world knowledge in the Multimodal Large Language Model (MLLM) to generate high-quality language descriptions and enhance tracking performance. To this end, we propose a novel reflection-based prompt optimization module to iteratively refine the ambiguous and inaccurate descriptions of the target with tracking feedback. To further utilize semantic information produced by MLLM, a simple yet effective VL tracking framework is proposed and can be easily integrated as a plug-and-play module to boost the performance of both VL and visual trackers. Experimental results show that our proposed ChatTracker achieves a performance comparable to existing methods.

Claim

Visual object tracking aims to locate a targeted object in a video sequence based on an initial bounding box.

GraphVis: Boosting LLMs with Visual Knowledge Graph Integration Paper

Authors: Yihe Deng, Chenchen Ye, Zijie Huang, Mingyu Derek Ma, Y. Kou, Wei Wang
Year: 2024
Venue: Neural Information Processing Systems
DOI: 10.52202/079017-2155
Citations: 16
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, large vision language models, lvlms).
Code: Not found.
Extraction: method/data pending

Abstract

The rapid evolution of large language models (LLMs) has expanded their capabilities across various data modalities, extending from well-established image data to increasingly popular graph data. Given the limitation of LLMs in hallucinations and inaccuracies in recalling factual knowledge, Knowledge Graph (KG) has emerged as a crucial data modality to support more accurate reasoning by LLMs. However, integrating structured knowledge from KGs into LLMs remains challenging, as most KG-enhanced LLM methods directly convert the KG into linearized text triples, which is not as expressive as the original structured data. To address this, we introduce GraphVis, which conserves the intricate graph structure through the visual modality to enhance the comprehension of KGs with the aid of Large Vision Language Models (LVLMs). Our approach incorporates a unique curriculum fine-tuning scheme which first instructs LVLMs to recognize basic graphical features from the images, and subsequently incorporates reasoning on QA tasks with the visual graphs. This cross-modal methodology not only markedly enhances performance on standard textual QA but also shows improved zero-shot VQA performance by utilizing synthetic graph images to augment the data for VQA tasks. We present comprehensive evaluations across commonsense reasoning QA benchmarks, where GraphVis provides an average improvement of 11.1% over its base model and outperforms existing KG-enhanced LLM approaches. Across VQA benchmarks such as ScienceQA that share similar scientific diagram images, GraphVis provides a notable gain of 4.32%. Code is made available on GitHub.

Claim

The rapid evolution of large language models (LLMs) has expanded their capabilities across various data modalities, extending from well-established image data to increasingly popular graph data.

Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control Paper

Authors: Gunshi Gupta, Karmesh Yadav, Y. Gal, Dhruv Batra, Z. Kira, Cong Lu, Tim G. J. Rudner
Year: 2024
Venue: Neural Information Processing Systems
DOI: 10.48550/arXiv.2405.05852
Citations: 15
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models, embodied agents (matched: vision language models, embodied agents, embodied ai).
Code: Not found.
Extraction: method/data pending

Abstract

Embodied AI agents require a fine-grained understanding of the physical world mediated through visual and language inputs. Such capabilities are difficult to learn solely from task-specific data. This has led to the emergence of pre-trained vision-language models as a tool for transferring representations learned from internet-scale data to downstream tasks and new domains. However, commonly used contrastively trained representations such as in CLIP have been shown to fail at enabling embodied agents to gain a sufficiently fine-grained scene understanding -- a capability vital for control. To address this shortcoming, we consider representations from pre-trained text-to-image diffusion models, which are explicitly optimized to generate images from text prompts and as such, contain text-conditioned representations that reflect highly fine-grained visuo-spatial information. Using pre-trained text-to-image diffusion models, we construct Stable Control Representations which allow learning downstream control policies that generalize to complex, open-ended environments. We show that policies learned using Stable Control Representations are competitive with state-of-the-art representation learning approaches across a broad range of simulated control settings, encompassing challenging manipulation and navigation tasks. Most notably, we show that Stable Control Representations enable learning policies that exhibit state-of-the-art performance on OVMM, a difficult open-vocabulary navigation benchmark.

Claim

Embodied AI agents require a fine-grained understanding of the physical world mediated through visual and language inputs.

No Filter: Cultural and Socioeconomic Diversity in Contrastive Vision-Language Models Paper

Authors: Angéline Pouget, Lucas Beyer, Emanuele Bugliarello, Xiao Wang, A. Steiner, Xiao-Qi Zhai, Ibrahim M. Alabdulmohsin
Year: 2024
Venue: Neural Information Processing Systems
DOI: 10.48550/arXiv.2405.13777
Citations: 15
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, vlms).
Code: Not found.
Extraction: method/data pending

Abstract

We study cultural and socioeconomic diversity in contrastive vision-language models (VLMs). Using a broad range of benchmark datasets and evaluation metrics, we bring to attention several important findings. First, the common filtering of training data to English image-text pairs disadvantages communities of lower socioeconomic status and negatively impacts cultural understanding. Notably, this performance gap is not captured by - and even at odds with - the currently popular evaluation metrics derived from the Western-centric ImageNet and COCO datasets. Second, pretraining with global, unfiltered data before fine-tuning on English content can improve cultural understanding without sacrificing performance on said popular benchmarks. Third, we introduce the task of geo-localization as a novel evaluation metric to assess cultural diversity in VLMs. Our work underscores the value of using diverse data to create more inclusive multimodal systems and lays the groundwork for developing VLMs that better represent global perspectives.

Claim

We study cultural and socioeconomic diversity in contrastive vision-language models (VLMs).

Synthetic Programming Elicitation for Text-to-Code in Very Low-Resource Programming and Formal Languages Paper

Authors: Federico Mora, Justin Wong, Haley Lepe, Sahil Bhatia, Karim Elmaaroufi, George Varghese, Joseph Gonzalez, Elizabeth Polgreen, S. Seshia
Year: 2024
Venue: Neural Information Processing Systems
DOI: 10.52202/079017-3339
Citations: 15
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: program synthesis (matched: programming).
Code: Not found.
Extraction: method/data pending

Abstract

Recent advances in large language models (LLMs) for code applications have demonstrated remarkable zero-shot fluency and instruction following on challenging code related tasks ranging from test case generation to self-repair. Unsurprisingly, however, models struggle to compose syntactically valid programs in programming languages unrepresented in pre-training, referred to as very low-resource Programming Languages (VLPLs). VLPLs appear in crucial settings, including domain-specific languages for internal tools, tool-chains for legacy languages, and formal verification frameworks. Inspired by a technique called natural programming elicitation, we propose designing an intermediate language that LLMs "naturally" know how to use and which can be automatically compiled to a target VLPL. When LLMs generate code that lies outside of this intermediate language, we use compiler techniques to repair the code into programs in the intermediate language. Overall, we introduce synthetic programming elicitation and compilation (SPEAC), an approach that enables LLMs to generate syntactically valid code even for VLPLs. We empirically evaluate the performance of SPEAC in a case study for the UCLID5 formal verification language and find that, compared to existing retrieval and fine-tuning baselines, SPEAC produces syntactically correct programs more frequently and without sacrificing semantic correctness.

Claim

Recent advances in large language models (LLMs) for code applications have demonstrated remarkable zero-shot fluency and instruction following on challenging code related tasks ranging from test case generation to self-repair.

Hints-In-Browser: Benchmarking Language Models for Programming Feedback Generation Paper

Authors: Nachiket Kotalwar, Alkis Gotovos, A. Singla
Year: 2024
Venue: Neural Information Processing Systems
DOI: 10.48550/arXiv.2406.05053
Citations: 15
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: program synthesis (matched: programming).
Code: Not found.
Extraction: method/data pending

Abstract

Generative AI and large language models hold great promise in enhancing programming education by generating individualized feedback and hints for learners. Recent works have primarily focused on improving the quality of generated feedback to achieve human tutors' quality. While quality is an important performance criterion, it is not the only criterion to optimize for real-world educational deployments. In this paper, we benchmark language models for programming feedback generation across several performance criteria, including quality, cost, time, and data privacy. The key idea is to leverage recent advances in the new paradigm of in-browser inference that allow running these models directly in the browser, thereby providing direct benefits across cost and data privacy. To boost the feedback quality of small models compatible with in-browser inference engines, we develop a fine-tuning pipeline based on GPT-4 generated synthetic data. We showcase the efficacy of fine-tuned Llama3-8B and Phi3-3.8B 4-bit quantized models using WebLLM's in-browser inference engine on three different Python programming datasets. We will release the full implementation along with a web app and datasets to facilitate further research on in-browser language models.
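To give a sense of why 4-bit quantization matters for in-browser inference, the following is a rough back-of-the-envelope sketch of weight storage alone. The parameter counts are assumptions read off the model names (8 billion and 3.8 billion), and the estimate ignores quantization metadata, activations, and the KV cache, so real memory use will be higher.

```python
def weight_gigabytes(n_params, bits_per_weight):
    """Approximate weight storage in decimal gigabytes (weights only)."""
    return n_params * bits_per_weight / 8 / 1e9


# Nominal parameter counts taken from the model names; assumptions, not measured values.
nominal_params = {"Llama3-8B": 8e9, "Phi3-3.8B": 3.8e9}

for name, n_params in nominal_params.items():
    fp16 = weight_gigabytes(n_params, 16)
    int4 = weight_gigabytes(n_params, 4)
    print(f"{name}: ~{fp16:.1f} GB at 16-bit, ~{int4:.1f} GB at 4-bit")
```

At these nominal sizes, moving from 16-bit to 4-bit weights cuts the weight footprint by a factor of four, which is the kind of reduction that makes client-side deployment plausible.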

Claim

Generative AI and large language models hold great promise in enhancing programming education by generating individualized feedback and hints for learners.

VLM4Bio: A Benchmark Dataset to Evaluate Pretrained Vision-Language Models for Trait Discovery from Biological Images Paper

Authors: M. Maruf, Arka Daw, Kazi Sajeed Mehrab, Harish Babu Manogaran, Abhilash Neog, Medha Sawhney, Mridul Khurana, J. Balhoff, Yasin Bakiş, B. Altıntaş, Matthew J. Thompson, Elizabeth G. Campolongo, et al.
Year: 2024
Venue: Neural Information Processing Systems
DOI: 10.48550/arXiv.2408.16176
Citations: 15
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, large vision language models, vlms).
Code: https://github.com/sammarfy/VLM4Bio
Extraction: method/data pending

Abstract

Images are increasingly becoming the currency for documenting biodiversity on the planet, providing novel opportunities for accelerating scientific discoveries in the field of organismal biology, especially with the advent of large vision-language models (VLMs). We ask if pre-trained VLMs can aid scientists in answering a range of biologically relevant questions without any additional fine-tuning. In this paper, we evaluate the effectiveness of 12 state-of-the-art (SOTA) VLMs in the field of organismal biology using a novel dataset, VLM4Bio, consisting of 469K question-answer pairs involving 30K images from three groups of organisms: fishes, birds, and butterflies, covering five biologically relevant tasks. We also explore the effects of applying prompting techniques and tests for reasoning hallucination on the performance of VLMs, shedding new light on the capabilities of current SOTA VLMs in answering biologically relevant questions using images. The code and datasets for running all the analyses reported in this paper can be found at https://github.com/sammarfy/VLM4Bio.

Claim

Images are increasingly becoming the currency for documenting biodiversity on the planet, providing novel opportunities for accelerating scientific discoveries in the field of organismal biology, especially with the advent of large vision-language models (VLMs).

Rethinking Misalignment in Vision-Language Model Adaptation from a Causal Perspective Paper

Authors: Yanan Zhang, Jiangmeng Li, Lixiang Liu, Wenwen Qiang
Year: 2024
Venue: Neural Information Processing Systems
DOI: 10.48550/arXiv.2410.12816
Citations: 14
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, vision language model).
Code: Not found.
Extraction: method/data pending

Abstract

Foundational Vision-Language models such as CLIP have exhibited impressive generalization in downstream tasks. However, CLIP suffers from a two-level misalignment issue, i.e., task misalignment and data misalignment, when adapting to specific tasks. Soft prompt tuning has mitigated the task misalignment, yet the data misalignment remains a challenge. To analyze the impacts of the data misalignment, we revisit the pre-training and adaptation processes of CLIP and develop a structural causal model. We discover that while we expect to capture task-relevant information for downstream tasks accurately, the task-irrelevant knowledge impacts the prediction results and hampers the modeling of the true relationships between the images and the predicted classes. As task-irrelevant knowledge is unobservable, we leverage the front-door adjustment and propose Causality-Guided Semantic Decoupling and Classification (CDC) to mitigate the interference of task-irrelevant knowledge. Specifically, we decouple semantics contained in the data of downstream tasks and perform classification based on each semantic. Furthermore, we employ the Dempster-Shafer evidence theory to evaluate the uncertainty of each prediction generated by diverse semantics. Experiments conducted in multiple different settings have consistently demonstrated the effectiveness of CDC.

Claim

Foundational Vision-Language models such as CLIP have exhibited impressive generalization in downstream tasks.

Right this way: Can VLMs Guide Us to See More to Answer Questions? Paper

Authors: Li Liu, Diji Yang, Sijia Zhong, Kalyana Suma Sree Tholeti, Lei Ding, Yi Zhang, Leilani Gilpin
Year: 2024
Venue: Neural Information Processing Systems
DOI: 10.48550/arXiv.2411.00394
Citations: 14
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, vlms).
Code: Not found.
Extraction: method/data pending

Abstract

In question-answering scenarios, humans can assess whether the available information is sufficient and seek additional information if necessary, rather than providing a forced answer. In contrast, Vision Language Models (VLMs) typically generate direct, one-shot responses without evaluating the sufficiency of the information. To investigate this gap, we identify a critical and challenging task in the Visual Question Answering (VQA) scenario: can VLMs indicate how to adjust an image when the visual information is insufficient to answer a question? This capability is especially valuable for assisting visually impaired individuals who often need guidance to capture images correctly. To evaluate this capability of current VLMs, we introduce a human-labeled dataset as a benchmark for this task. Additionally, we present an automated framework that generates synthetic training data by simulating “where to know” scenarios. Our empirical results show significant performance improvements in mainstream VLMs when fine-tuned with this synthetic data. This study demonstrates the potential to narrow the gap between information assessment and acquisition in VLMs, bringing their performance closer to humans.

Claim

In question-answering scenarios, humans can assess whether the available information is sufficient and seek additional information if necessary, rather than providing a forced answer.

Graph-based Unsupervised Disentangled Representation Learning via Multimodal Large Language Models Paper

Authors: Baao Xie, Qiuyu Chen, Yunnan Wang, Zequn Zhang, Xin Jin, Wenjun Zeng
Year: 2024
Venue: Neural Information Processing Systems
DOI: 10.48550/arXiv.2407.18999
Citations: 13
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: multimodal large language model, mllm, mllms).
Code: Not found.
Extraction: method/data pending

Abstract

Disentangled representation learning (DRL) aims to identify and decompose underlying factors behind observations, thus facilitating data perception and generation. However, current DRL approaches often rely on the unrealistic assumption that semantic factors are statistically independent. In reality, these factors may exhibit correlations, which off-the-shelf solutions have yet to properly address. To tackle this challenge, we introduce a bidirectional weighted graph-based framework to learn factorized attributes and their interrelations within complex data. Specifically, we propose a β-VAE based module to extract factors as the initial nodes of the graph, and leverage the multimodal large language model (MLLM) to discover and rank latent correlations, thereby updating the weighted edges. By integrating these complementary modules, our model successfully achieves fine-grained, practical and unsupervised disentanglement. Experiments demonstrate our method's superior performance in disentanglement and reconstruction. Furthermore, the model inherits enhanced interpretability and generalizability from MLLMs.
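For readers unfamiliar with the β-VAE mentioned in this abstract, the standard β-VAE objective is shown below for reference. The abstract does not state the exact loss used by the paper's module, so this is the textbook form and may differ from the authors' variant; the symbols (encoder q_φ, decoder p_θ, prior p(z)) follow common convention and are not taken from the paper.

```latex
\mathcal{L}_{\beta}(\theta, \phi; x)
  = \mathbb{E}_{q_{\phi}(z \mid x)}\!\left[ \log p_{\theta}(x \mid z) \right]
  - \beta \, D_{\mathrm{KL}}\!\left( q_{\phi}(z \mid x) \,\|\, p(z) \right)
```

With β = 1 this reduces to the usual VAE evidence lower bound; larger β weights the KL term more heavily, which is commonly used to encourage more factorized latent codes at some cost to reconstruction quality.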

Claim

Disentangled representation learning (DRL) aims to identify and decompose underlying factors behind observations, thus facilitating data perception and generation.

IPO: Interpretable Prompt Optimization for Vision-Language Models Paper

Authors: Yingjun Du, Wenfang Sun, Cees G. M. Snoek
Year: 2024
Venue: Neural Information Processing Systems
DOI: 10.48550/arXiv.2410.15397
Citations: 10
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models).
Code: Not found.
Extraction: method/data pending

Abstract

Pre-trained vision-language models like CLIP have remarkably adapted to various downstream tasks. Nonetheless, their performance heavily depends on the specificity of the input text prompts, which requires skillful prompt template engineering. Instead, current approaches to prompt optimization learn the prompts through gradient descent, where the prompts are treated as adjustable parameters. However, these methods tend to lead to overfitting of the base classes seen during training and produce prompts that are no longer understandable by humans. This paper introduces a simple but interpretable prompt optimizer (IPO), that utilizes large language models (LLMs) to generate textual prompts dynamically. We introduce a Prompt Optimization Prompt that not only guides LLMs in creating effective prompts but also stores past prompts with their performance metrics, providing rich in-context information. Additionally, we incorporate a large multimodal model (LMM) to condition on visual content by generating image descriptions, which enhance the interaction between textual and visual modalities. This allows for the creation of dataset-specific prompts that improve generalization performance, while maintaining human comprehension. Extensive testing across 11 datasets reveals that IPO not only improves the accuracy of existing gradient-descent-based prompt learning methods but also considerably enhances the interpretability of the generated prompts. By leveraging the strengths of LLMs, our approach ensures that the prompts remain human-understandable, thereby facilitating better transparency and oversight for vision-language models.

Claim

Pre-trained vision-language models like CLIP have remarkably adapted to various downstream tasks.
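
The loop described in the IPO abstract (keep past prompts with their scores, show them to an LLM, ask for a better prompt) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `propose` and `score` are hypothetical stand-ins for an LLM call and a validation-accuracy measurement, and the meta-prompt wording is invented for the example. The image-description step with a large multimodal model is omitted.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, float]]  # (prompt text, score) pairs


def build_meta_prompt(history: History, top_k: int = 3) -> str:
    """Format the best past prompts and their scores as in-context text."""
    best = sorted(history, key=lambda item: item[1], reverse=True)[:top_k]
    lines = [f"prompt: {p!r} | accuracy: {s:.2f}" for p, s in best]
    return (
        "Past prompts and scores:\n"
        + "\n".join(lines)
        + "\nPropose a better prompt."
    )


def optimize_prompt(
    propose: Callable[[str], str],  # hypothetical: wraps an LLM call
    score: Callable[[str], float],  # hypothetical: validation accuracy
    seed_prompt: str,
    steps: int = 5,
) -> Tuple[str, float]:
    """Return the best (prompt, score) pair seen over `steps` proposals."""
    history: History = [(seed_prompt, score(seed_prompt))]
    for _ in range(steps):
        # The LLM only ever sees plain text, so every candidate stays readable.
        candidate = propose(build_meta_prompt(history))
        history.append((candidate, score(candidate)))
    return max(history, key=lambda item: item[1])
```

Because every candidate is ordinary text produced by the proposer, the winning prompt can be read and audited directly, which is the interpretability property the abstract emphasizes.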

ClevrSkills: Compositional Language and Visual Reasoning in Robotics Paper

Authors: S. Haresh, D. Dijkman, Apratim Bhattacharyya, Roland Memisevic
Year: 2024
Venue: Neural Information Processing Systems
DOI: 10.48550/arXiv.2411.09052
Citations: 10
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, large vision language models).
Code: Not found.
Extraction: method/data pending

Abstract

Robotics tasks are highly compositional by nature. For example, to perform a high-level task like cleaning the table a robot must employ low-level capabilities of moving the effectors to the objects on the table, pick them up and then move them off the table one-by-one, while re-evaluating the consequently dynamic scenario in the process. Given that large vision language models (VLMs) have shown progress on many tasks that require high level, human-like reasoning, we ask the question: if the models are taught the requisite low-level capabilities, can they compose them in novel ways to achieve interesting high-level tasks like cleaning the table without having to be explicitly taught so? To this end, we present ClevrSkills - a benchmark suite for compositional reasoning in robotics. ClevrSkills is an environment suite developed on top of the ManiSkill2 simulator and an accompanying dataset. The dataset contains trajectories generated on a range of robotics tasks with language and visual annotations as well as multi-modal prompts as task specification. The suite includes a curriculum of tasks with three levels of compositional understanding, starting with simple tasks requiring basic motor skills. We benchmark multiple different VLM baselines on ClevrSkills and show that even after being pre-trained on large numbers of tasks, these models fail on compositional reasoning in robotics tasks.

Claim

Robotics tasks are highly compositional by nature.

DACO: Towards Application-Driven and Comprehensive Data Analysis via Code Generation Paper

Authors: Xueqing Wu, Ruichen Zheng, Jingzhen Sha, Te-Lin Wu, Hanyu Zhou, Mohan Tang, Kai-Wei Chang, Nanyun Peng, Haoran Huang
Year: 2024
Venue: Neural Information Processing Systems
DOI: 10.48550/arXiv.2403.02528
Citations: 9
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: program synthesis (matched: code generation).
Code: https://github.com/shirley-wu/daco
Extraction: method/data pending

Abstract

Data analysis is a crucial analytical process to generate in-depth studies and conclusive insights to comprehensively answer a given user query for tabular data. In this work, we aim to propose new resources and benchmarks to inspire future research on this crucial yet challenging and under-explored task. However, collecting data analysis annotations curated by experts can be prohibitively expensive. We propose to automatically generate high-quality answer annotations leveraging the code-generation capabilities of LLMs with a multi-turn prompting technique. We construct the DACO dataset, containing (1) 440 databases (of tabular data) collected from real-world scenarios, (2) 2k query-answer pairs that can serve as weak supervision for model training, and (3) a concentrated but high-quality test set with human refined annotations that serves as our main evaluation benchmark. We train a 6B supervised fine-tuning (SFT) model on DACO dataset, and find that the SFT model learns reasonable data analysis capabilities. To further align the models with human preference, we use reinforcement learning to encourage generating analysis perceived by human as helpful, and design a set of dense rewards to propagate the sparse human preference reward to intermediate code generation steps. Our DACO-RL algorithm is evaluated by human annotators to produce more helpful answers than SFT model in 57.72% cases, validating the effectiveness of our proposed algorithm. Data and code are released at https://github.com/shirley-wu/daco

Claim

Data analysis is a crucial analytical process to generate in-depth studies and conclusive insights to comprehensively answer a given user query for tabular data.

LLaNA: Large Language and NeRF Assistant Paper

Authors: Andrea Amaduzzi, Pierluigi Zama Ramirez, Giuseppe Lisanti, Samuele Salti, Luigi Di Stefano
Year: 2024
Venue: Neural Information Processing Systems
DOI: 10.48550/arXiv.2406.11840
Citations: 9
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: mllm, mllms, multimodal large language models).
Code: Not found.
Extraction: method/data pending

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated an excellent understanding of images and 3D data. However, both modalities have shortcomings in holistically capturing the appearance and geometry of objects. Meanwhile, Neural Radiance Fields (NeRFs), which encode information within the weights of a simple Multi-Layer Perceptron (MLP), have emerged as an increasingly widespread modality that simultaneously encodes the geometry and photorealistic appearance of objects. This paper investigates the feasibility and effectiveness of ingesting NeRF into MLLM. We create LLaNA, the first general-purpose NeRF-language assistant capable of performing new tasks such as NeRF captioning and Q&A. Notably, our method directly processes the weights of the NeRF's MLP to extract information about the represented objects without the need to render images or materialize 3D data structures. Moreover, we build a dataset of NeRFs with text annotations for various NeRF-language tasks with no human intervention. Based on this dataset, we develop a benchmark to evaluate the NeRF understanding capability of our method. Results show that processing NeRF weights performs favourably against extracting 2D or 3D representations from NeRFs.

Claim

Multimodal Large Language Models (MLLMs) have demonstrated an excellent understanding of images and 3D data.

On the Power of Small-size Graph Neural Networks for Linear Programming Paper

Authors: Qian Li, Tian Ding, Linxin Yang, Minghui Ouyang, Qingjiang Shi, Ruoyu Sun
Year: 2024
Venue: Neural Information Processing Systems
DOI: 10.52202/079017-1222
Citations: 9
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: program synthesis (matched: programming).
Code: Not found.
Extraction: method/data pending

Abstract

Graph neural networks (GNNs) have recently emerged as powerful tools for addressing complex optimization problems. It has been theoretically demonstrated that GNNs can universally approximate the solution mapping functions of linear programming (LP) problems. However, these theoretical results typically require GNNs to have large parameter sizes. Conversely, empirical experiments have shown that relatively small GNNs can solve LPs effectively, revealing a significant discrepancy between theoretical predictions and practical observations. In this work, we aim to bridge this gap by providing a theoretical foundation for the effectiveness of smaller GNNs. We prove that polylogarithmic-depth, constant-width GNNs are sufficient to solve packing and covering LPs, two widely used classes of LPs. Our proof leverages the capability of GNNs to simulate a variant of the gradient descent algorithm on a carefully selected potential function. Additionally, we introduce a new GNN architecture, termed GD-Net. Experimental results demonstrate that GD-Net significantly outperforms conventional GNN structures while using fewer parameters.

Claim

Graph neural networks (GNNs) have recently emerged as powerful tools for addressing complex optimization problems.
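
For reference, the two LP classes named in the abstract are usually written in the following standard form. The symbols $A$, $b$, $c$, $x$, $y$ are notation chosen here, not taken from the paper, and the paper may normalize the problems differently.

```latex
% Packing LP (all entries of A, b, c are nonnegative)
\max_{x \ge 0} \; c^{\top} x
\quad \text{subject to} \quad A x \le b

% Covering LP (the LP dual of the packing problem above)
\min_{y \ge 0} \; b^{\top} y
\quad \text{subject to} \quad A^{\top} y \ge c
```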

OpenDlign: Open-World Point Cloud Understanding with Depth-Aligned Images Paper

Authors: Ye Mao, Junpeng Jing, K. Mikolajczyk
Year: 2024
Venue: Neural Information Processing Systems
DOI: 10.52202/079017-3208
Citations: 8
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vision language models, vlms).
Code: Not found.
Extraction: method/data pending

Abstract

Recent open-world 3D representation learning methods using Vision-Language Models (VLMs) to align 3D point cloud with image-text information have shown superior 3D zero-shot performance. However, CAD-rendered images for this alignment often lack realism and texture variation, compromising alignment robustness. Moreover, the volume discrepancy between 3D and 2D pretraining datasets highlights the need for effective strategies to transfer the representational abilities of VLMs to 3D learning. In this paper, we present OpenDlign, a novel open-world 3D model using depth-aligned images generated from a diffusion model for robust multimodal alignment. These images exhibit greater texture diversity than CAD renderings due to the stochastic nature of the diffusion model. By refining the depth map projection pipeline and designing depth-specific prompts, OpenDlign leverages rich knowledge in pre-trained VLM for 3D representation learning with streamlined fine-tuning. Our experiments show that OpenDlign achieves high zero-shot and few-shot performance on diverse 3D tasks, despite only fine-tuning 6 million parameters on a limited ShapeNet dataset. In zero-shot classification, OpenDlign surpasses previous models by 8.0% on ModelNet40 and 16.4% on OmniObject3D. Additionally, using depth-aligned images for multimodal alignment consistently enhances the performance of other state-of-the-art models.

Claim

Recent open-world 3D representation learning methods using Vision-Language Models (VLMs) to align 3D point cloud with image-text information have shown superior 3D zero-shot performance.

Learning Cut Generating Functions for Integer Programming Paper

Authors: Hongyu Cheng, Amitabh Basu
Year: 2024
Venue: Neural Information Processing Systems
DOI: 10.48550/arXiv.2405.13992
Citations: 8
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: program synthesis (matched: programming).
Code: Not found.
Extraction: method/data pending

Abstract

The branch-and-cut algorithm is the method of choice to solve large scale integer programming problems in practice. A key ingredient of branch-and-cut is the use of cutting planes which are derived constraints that reduce the search space for an optimal solution. Selecting effective cutting planes to produce small branch-and-cut trees is a critical challenge in the branch-and-cut algorithm. Recent advances have employed a data-driven approach to select optimal cutting planes from a parameterized family, aimed at reducing the branch-and-bound tree size (in expectation) for a given distribution of integer programming instances. We extend this idea to the selection of the best cut generating function (CGF), which is a tool in the integer programming literature for generating a wide variety of cutting planes that generalize the well-known Gomory Mixed-Integer (GMI) cutting planes. We provide rigorous sample complexity bounds for the selection of an effective CGF from certain parameterized families that provably performs well for any specified distribution on the problem instances. Our empirical results show that the selected CGF can outperform the GMI cuts for certain distributions. Additionally, we explore the sample complexity of using neural networks for instance-dependent CGF selection.

Claim

The branch-and-cut algorithm is the method of choice to solve large scale integer programming problems in practice.
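
The Gomory Mixed-Integer (GMI) cut that the abstract uses as its baseline has a closed form, sketched below from the textbook definition. The function names and the input layout (one simplex-tableau row split into integer and continuous nonbasic coefficients) are choices made for this example and do not come from the paper.

```python
import math
from typing import List, Sequence


def frac(value: float) -> float:
    """Fractional part in [0, 1)."""
    return value - math.floor(value)


def gmi_cut(
    int_coeffs: Sequence[float],
    cont_coeffs: Sequence[float],
    rhs: float,
) -> List[float]:
    """Coefficients of the GMI cut  sum_j coef_j * var_j >= 1.

    The input is one tableau row
        x_B + sum_j int_coeffs[j] * x_j + sum_k cont_coeffs[k] * y_k = rhs
    with x_B and x_j integer, y_k continuous, and all nonbasic
    variables nonnegative. Integer coefficients come first in the output.
    """
    f0 = frac(rhs)
    if f0 == 0.0:
        raise ValueError("rhs must be fractional to derive a cut")
    coefs: List[float] = []
    for a in int_coeffs:
        fj = frac(a)
        # Integer variables: use the fractional part of the coefficient.
        coefs.append(fj / f0 if fj <= f0 else (1.0 - fj) / (1.0 - f0))
    for a in cont_coeffs:
        # Continuous variables: the sign of the coefficient picks the slope.
        coefs.append(a / f0 if a >= 0 else -a / (1.0 - f0))
    return coefs
```

For example, `gmi_cut([0.25, 0.75], [1.0, -2.0], 2.5)` returns `[0.5, 0.5, 2.0, 4.0]`. A cut generating function generalizes the two piecewise-linear maps used in the loops above to a parameterized family, which is the object the paper proposes to select from data.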

Slot-VLM: Object-Event Slots for Video-Language Modeling Paper

Authors: Jiaqi Xu, Cuiling Lan, Wenxuan Xie, Xuejin Chen, Yan Lu
Year: 2024
Venue: Neural Information Processing Systems
DOI: 10.52202/079017-0020
Citations: 8
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlm, vlms).
Code: Not found.
Extraction: method/data pending

Abstract

Video-Language Models (VLMs), powered by the advancements in Large Language Models (LLMs), are charting new frontiers in video understanding. A pivotal challenge is the development of an effective method to encapsulate video content into a set of representative tokens to align with LLMs. In this work, we introduce Slot-VLM, a new framework designed to generate semantically decomposed video tokens, in terms of object-wise and event-wise visual representations, to facilitate LLM inference. Particularly, we design an Object-Event Slots module, i.e., OE-Slots, that adaptively aggregates the dense video tokens from the vision encoder to a set of representative slots. In order to take into account both the spatial object details and the varied temporal dynamics, we build OE-Slots with two branches: the Object-Slots branch and the Event-Slots branch. The Object-Slots branch focuses on extracting object-centric slots from features of high spatial resolution but low frame sample rate, emphasizing detailed object information. The Event-Slots branch is engineered to learn event-centric slots from high temporal sample rate but low spatial resolution features. These complementary slots are combined to form the vision context, serving as the input to the LLM for effective video reasoning. Our experimental results demonstrate the effectiveness of our Slot-VLM, which achieves the state-of-the-art performance on video question-answering.

Claim

Video-Language Models (VLMs), powered by the advancements in Large Language Models (LLMs), are charting new frontiers in video understanding.

INDICT: Code Generation with Internal Dialogues of Critiques for Both Security and Helpfulness Paper

Authors: Hung Le, Yingbo Zhou, Caiming Xiong, Silvio Savarese, Doyen Sahoo
Year: 2024
Venue: Neural Information Processing Systems
DOI: 10.48550/arXiv.2407.02518
Citations: 7
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: program synthesis (matched: programming, code generation).
Code: Not found.
Extraction: method/data pending

Abstract

Large language models (LLMs) for code are typically trained to align with natural language instructions to closely follow their intentions and requirements. However, in many practical scenarios, it becomes increasingly challenging for these models to navigate the intricate boundary between helpfulness and safety, especially against highly complex yet potentially malicious instructions. In this work, we introduce INDICT: a new framework that empowers LLMs with Internal Dialogues of Critiques for both safety and helpfulness guidance. The internal dialogue is a dual cooperative system between a safety-driven critic and a helpfulness-driven critic. Each critic provides analysis against the given task and corresponding generated response, equipped with external knowledge queried through relevant code snippets and tools like web search and code interpreter. We engage the dual critic system in both code generation stage as well as code execution stage, providing preemptive and post-hoc guidance respectively to LLMs. We evaluated INDICT on 8 diverse tasks across 8 programming languages from 5 benchmarks, using LLMs from 7B to 70B parameters. We observed that our approach can provide an advanced level of critiques of both safety and helpfulness analysis, significantly improving the quality of output codes (\(+10\%\) absolute improvements in all models).

Claim

Large language models (LLMs) for code are typically trained to align with natural language instructions to closely follow their intentions and requirements.
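
The two-stage control flow in the INDICT abstract (critics consulted once at generation time and once after execution) might be sketched as below. Every callable here (`generate`, `execute`, the two critics, `revise`) is a hypothetical placeholder for an LLM or tool call. The sketch shows only the ordering of the stages, not the paper's prompts or its tool-augmented critic dialogue.

```python
from typing import Callable, List

Critic = Callable[[str, str], str]  # (task, code) -> critique text
Reviser = Callable[[str, str, List[str]], str]  # (task, code, critiques) -> code


def critic_round(
    task: str,
    code: str,
    safety_critic: Critic,
    helpfulness_critic: Critic,
    revise: Reviser,
) -> str:
    """Collect one critique from each critic, then revise the code once."""
    critiques = [safety_critic(task, code), helpfulness_critic(task, code)]
    return revise(task, code, critiques)


def generate_with_critics(
    task: str,
    generate: Callable[[str], str],  # hypothetical: code LLM
    execute: Callable[[str], str],   # hypothetical: sandboxed interpreter
    safety_critic: Critic,
    helpfulness_critic: Critic,
    revise: Reviser,
) -> str:
    code = generate(task)
    # Preemptive guidance: critique the draft before anything is run.
    code = critic_round(task, code, safety_critic, helpfulness_critic, revise)
    # Post-hoc guidance: critics also see what happened at execution time.
    observation = execute(code)
    task_with_result = f"{task}\nExecution result: {observation}"
    return critic_round(
        task_with_result, code, safety_critic, helpfulness_critic, revise
    )
```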

Unleashing Region Understanding in Intermediate Layers for MLLM-based Referring Expression Generation Paper

Authors: Yaoyuan Liang, Zhuojun Cai, Jian Xu, Guanbo Huang, Yiran Wang, Xiao Liang, Jiahao Liu, Ziran Li, Jingang Wang, Shao-Lun Huang
Year: 2024
Venue: Neural Information Processing Systems
DOI: 10.52202/079017-3833
Citations: 7
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: mllm).
Code: https://github.com/Glupayy/unleash-eliminate (announced in abstract)
Extraction: method/data pending

Abstract

The Multi-modal Large Language Model (MLLM) based Referring Expression Generation (REG) task has gained increasing popularity, which aims to generate an unambiguous text description that applies to exactly one object or region in the image by leveraging foundation models. We empirically found that there exists a potential trade-off between the detailedness and the correctness of the descriptions for the referring objects. On the one hand, generating sentences with more details is usually required in order to provide more precise object descriptions. On the other hand, complicated sentences could easily increase the probability of hallucinations. To address this issue, we propose a training-free framework, named as “unleash-then-eliminate”, which first elicits the latent information in the intermediate layers, and then adopts a cycle-consistency-based decoding method to alleviate the production of hallucinations. Furthermore, to reduce the computational load of cycle-consistency-based decoding, we devise a Probing-based Importance Estimation method to statistically estimate the importance weights of intermediate layers within a subset. These importance weights are then incorporated into the decoding process over the entire dataset, intervening in the next token prediction from intermediate layers. Extensive experiments conducted on the RefCOCOg and PHD benchmarks show that our proposed framework could outperform existing methods on both semantic and hallucination-related metrics. Code will be made available in https://github.com/Glupayy/unleash-eliminate.

Claim

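The Multi-modal Large Language Model (MLLM) based Referring Expression Generation (REG) task has gained increasing popularity, which aims to generate an unambiguous text description that applies to exactly one object or region in the image by leveraging foundation models.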

T2Vs Meet VLMs: A Scalable Multimodal Dataset for Visual Harmfulness Recognition Paper

Authors: Chen Yeh, You-Ming Chang, Wei-Chen Chiu, Ning Yu
Year: 2024
Venue: Neural Information Processing Systems
DOI: 10.48550/arXiv.2409.19734
Citations: 6
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vlms).
Code: https://github.com/nctu-eva-lab/VHD11K
Extraction: method/data pending

Abstract

To address the risks of encountering inappropriate or harmful content, researchers managed to incorporate several harmful contents datasets with machine learning methods to detect harmful concepts. However, existing harmful datasets are curated by the presence of a narrow range of harmful objects, and only cover real harmful content sources. This hinders the generalizability of methods based on such datasets, potentially leading to misjudgments. Therefore, we propose a comprehensive harmful dataset, Visual Harmful Dataset 11K (VHD11K), consisting of 10,000 images and 1,000 videos, crawled from the Internet and generated by 4 generative models, across a total of 10 harmful categories covering a full spectrum of harmful concepts with nontrivial definition. We also propose a novel annotation framework by formulating the annotation process as a multi-agent Visual Question Answering (VQA) task, having 3 different VLMs "debate" about whether the given image/video is harmful, and incorporating the in-context learning strategy in the debating process. Therefore, we can ensure that the VLMs consider the context of the given image/video and both sides of the arguments thoroughly before making decisions, further reducing the likelihood of misjudgments in edge cases. Evaluation and experimental results demonstrate that (1) the great alignment between the annotation from our novel annotation framework and those from human, ensuring the reliability of VHD11K; (2) our full-spectrum harmful dataset successfully identifies the inability of existing harmful content detection methods to detect extensive harmful contents and improves the performance of existing harmfulness recognition methods; (3) VHD11K outperforms the baseline dataset, SMID, as evidenced by the superior improvement in harmfulness recognition methods. The complete dataset and code can be found at https://github.com/nctu-eva-lab/VHD11K.

Claim

To address the risks of encountering inappropriate or harmful content, researchers managed to incorporate several harmful contents datasets with machine learning methods to detect harmful concepts.

ACES: Generating a Diversity of Challenging Programming Puzzles with Autotelic Generative Models Paper

Authors: Julien Pourcel, Cédric Colas, Gaia Molinaro, P. Oudeyer, Laetitia Teodorescu
Year: 2024
Venue: Neural Information Processing Systems
DOI: Not stated.
Citations: 6
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: program synthesis (matched: programming).
Code: Not found.
Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.

Automated Efficient Estimation using Monte Carlo Efficient Influence Functions Paper

Authors: Raj Agrawal, Sam Witty, A. Zane, Eli Bingham
Year: 2024
Venue: Neural Information Processing Systems
DOI: 10.48550/arXiv.2403.00158
Citations: 5
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: program synthesis, alpha factor search (matched: programming, portfolio).
Code: Not found.
Extraction: method/data pending

Abstract

Many practical problems involve estimating low dimensional statistical quantities with high-dimensional models and datasets. Several approaches address these estimation tasks based on the theory of influence functions, such as debiased/double ML or targeted minimum loss estimation. This paper introduces Monte Carlo Efficient Influence Functions (MC-EIF), a fully automated technique for approximating efficient influence functions that integrates seamlessly with existing differentiable probabilistic programming systems. MC-EIF automates efficient statistical estimation for a broad class of models and target functionals that would previously require rigorous custom analysis. We prove that MC-EIF is consistent, and that estimators using MC-EIF achieve optimal \(\sqrt{N}\) convergence rates. We show empirically that estimators using MC-EIF are at parity with estimators using analytic EIFs. Finally, we demonstrate a novel capstone example using MC-EIF for optimal portfolio selection.

Claim

Many practical problems involve estimating low dimensional statistical quantities with high-dimensional models and datasets.
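
As background for the MC-EIF abstract, influence-function-based estimators such as the one-step correction commonly take the form below. This is the standard textbook expression, given here only for orientation: the symbols $\psi$, $\hat{P}$, $\varphi$, and $x_i$ are notation chosen for this note, and the paper's own estimators may differ in detail.

```latex
% Plug-in estimate plus the empirical mean of the efficient influence
% function \varphi, evaluated at the fitted model \hat{P}
\hat{\psi}_{\text{one-step}}
  = \psi(\hat{P}) + \frac{1}{N} \sum_{i=1}^{N} \varphi(x_i; \hat{P})
```

In this picture, MC-EIF's contribution, as the abstract describes it, is to replace an analytically derived $\varphi$ with an automated Monte Carlo approximation.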

Benchmarking Generative Models on Computational Thinking Tests in Elementary Visual Programming Paper

Authors: Victor-Alexandru Pădurean, A. Singla
Year: 2024
Venue: Neural Information Processing Systems
DOI: 10.48550/arXiv.2406.09891
Citations: 5
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: program synthesis (matched: programming).
Code: Not found.
Extraction: method/data pending

Abstract

Generative models have demonstrated human-level proficiency in various benchmarks across domains like programming, natural sciences, and general knowledge. Despite these promising results on competitive benchmarks, they still struggle with seemingly simple problem-solving tasks typically carried out by elementary-level students. How do state-of-the-art models perform on standardized programming-related tests designed to assess computational thinking and problem-solving skills at schools? In this paper, we curate a novel benchmark involving computational thinking tests grounded in elementary visual programming domains. Our initial results show that state-of-the-art models like GPT-4o and Llama3 barely match the performance of an average school student. To further boost the performance of these models, we fine-tune them using a novel synthetic data generation methodology. The key idea is to develop a comprehensive dataset using symbolic methods that capture different skill levels, ranging from recognition of visual elements to multi-choice quizzes to synthesis-style tasks. We showcase how various aspects of symbolic information in synthetic data help improve fine-tuned models' performance. We will release the full implementation and datasets to facilitate further research on enhancing computational thinking in generative models.

Claim

Generative models have demonstrated human-level proficiency in various benchmarks across domains like programming, natural sciences, and general knowledge.