ML + Vision Top-6 Agent Survey - NeurIPS 2024 - Page 6 of 6

Venue: Neural Information Processing Systems
Year: 2024
Page: 6 / 6
Papers: 151-167 / 167

Papers

GLinSAT: The General Linear Satisfiability Neural Network Layer By Accelerated Gradient Descent Paper

Authors: Hongtai Zeng, Chao Yang, Yanzhen Zhou, Cheng Yang, Qinglai Guo
Year: 2024
Venue: Neural Information Processing Systems
DOI: 10.48550/arXiv.2409.17500
Citations: 5
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: program synthesis, alpha factor search (matched: programming, portfolio).
Code: https://github.com/HunterTracer/GLinSAT
Extraction: method/data pending

Abstract

Ensuring that the outputs of neural networks satisfy specific constraints is crucial for applying neural networks to real-life decision-making problems. In this paper, we consider making a batch of neural network outputs satisfy bounded and general linear constraints. We first reformulate the neural network output projection problem as an entropy-regularized linear programming problem. We show that such a problem can be equivalently transformed into an unconstrained convex optimization problem with Lipschitz continuous gradient according to the duality theorem. Then, based on an accelerated gradient descent algorithm with numerical performance enhancement, we present our architecture, GLinSAT, to solve the problem. To the best of our knowledge, this is the first general linear satisfiability layer in which all the operations are differentiable and matrix-factorization-free. Despite the fact that we can explicitly perform backpropagation based on automatic differentiation mechanism, we also provide an alternative approach in GLinSAT to calculate the derivatives based on implicit differentiation of the optimality condition. Experimental results on constrained traveling salesman problems, partial graph matching with outliers, predictive portfolio allocation and power system unit commitment demonstrate the advantages of GLinSAT over existing satisfiability layers. Our implementation is available at https://github.com/HunterTracer/GLinSAT.

Claim

Ensuring that the outputs of neural networks satisfy specific constraints is crucial for applying neural networks to real-life decision-making problems.
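
Illustrative sketch (not the GLinSAT implementation): the abstract's recipe of an entropy-regularized projection, an unconstrained smooth dual, and gradient descent can be seen in miniature for a single cardinality constraint. The code assumes the toy problem of maximizing c.x + theta*H(x) subject to sum(x) = k and 0 <= x <= 1, where H is the binary entropy. The names project_cardinality and theta, and the use of plain rather than accelerated gradient descent, are assumptions made for this example.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def project_cardinality(scores, k, theta=0.5, iters=500):
    """Map scores to x in (0, 1)^n with sum(x) = k (toy example).

    For a multiplier nu on the constraint sum(x) = k, the inner
    maximizer is x_i = sigmoid((c_i - nu) / theta). The dual
        g(nu) = theta * sum_i softplus((c_i - nu) / theta) + nu * k
    is convex with derivative k - sum_i x_i, and that derivative is
    Lipschitz with constant n / (4 * theta).
    """
    n = len(scores)
    step = 4.0 * theta / n  # 1 / L for the Lipschitz constant above
    nu = 0.0
    for _ in range(iters):
        x = [sigmoid((c - nu) / theta) for c in scores]
        grad = k - sum(x)   # derivative of the dual at nu
        nu -= step * grad   # plain gradient descent on the dual
    return [sigmoid((c - nu) / theta) for c in scores]


x = project_cardinality([2.0, 1.0, 0.5, -1.0], k=2)
print([round(v, 3) for v in x], round(sum(x), 6))
```

Every operation here is elementwise and differentiable, and no matrix factorization is needed, which is the property the abstract emphasizes for the general case.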

Everyday Object Meets Vision-and-Language Navigation Agent via Backdoor Paper

Authors: Keji He, Kehan Chen, Jiawang Bai, Yan Huang, Qi Wu, Shu-Tao Xia, Liang Wang
Year: 2024
Venue: Neural Information Processing Systems
DOI: 10.52202/079017-1572
Citations: 5
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: embodied agents (matched: navigation agent).
Code: https://github.com/Chenkehan21/VLN-ATT
Extraction: method/data pending

Abstract

Vision-and-Language Navigation (VLN) requires an agent to dynamically explore environments following natural language. The VLN agent, closely integrated into daily lives, poses a substantial threat to privacy and property if it behaves maliciously. However, this serious issue has long been overlooked. In this paper, we pioneer the exploration of an object-aware backdoored VLN, achieved by implanting object-aware backdoors during the training phase. Tailored to the unique VLN nature of cross-modality and continuous decision-making, we propose a novel backdoored VLN paradigm: IPR Backdoor. This enables the agent to behave abnormally once it encounters the object triggers during language-guided navigation in unseen environments, thereby executing an attack on the target scene. Experiments demonstrate the effectiveness of our method in both physical and digital spaces across different VLN agents, as well as its robustness to various visual and textual variations. Additionally, our method preserves navigation performance in normal scenarios with remarkable stealthiness. The code is available at https://github.com/Chenkehan21/VLN-ATT.

Claim

Vision-and-Language Navigation (VLN) requires an agent to dynamically explore environments following natural language.

Visual Anchors Are Strong Information Aggregators For Multimodal Large Language Model Paper

Authors: Haogeng Liu, Quanzeng You, Xiaotian Han, Yongfei Liu, Huaibo Huang, Ran He, Hongxia Yang
Year: 2024
Venue: Neural Information Processing Systems
DOI: 10.48550/arXiv.2405.17815
Citations: 4
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: multimodal large language model, mllms, multimodal large language models).
Code: https://github.com/liuhaogeng/Anchor-Former
Extraction: method/data pending

Abstract

In the realm of Multimodal Large Language Models (MLLMs), the vision-language connector plays a crucial role in linking pre-trained vision encoders with Large Language Models (LLMs). Despite its importance, the vision-language connector has been relatively less explored. In this study, we aim to propose a strong vision-language connector that enables MLLMs to achieve high accuracy while maintaining low computation cost. We first reveal the existence of the visual anchors in Vision Transformer and propose a cost-effective search algorithm to extract them. Building on these findings, we introduce the Anchor Former (AcFormer), a novel vision-language connector designed to leverage the rich prior knowledge obtained from these visual anchors during pretraining, guiding the aggregation of information. Through extensive experimentation, we demonstrate that the proposed method reduces computational costs by nearly two-thirds compared with the baseline, while simultaneously outperforming baseline methods. This highlights the effectiveness and efficiency of AcFormer. Codes are available at https://github.com/liuhaogeng/Anchor-Former.

Claim

In the realm of Multimodal Large Language Models (MLLMs), the vision-language connector plays a crucial role in linking pre-trained vision encoders with Large Language Models (LLMs).

Targeted Sequential Indirect Experiment Design Paper

Authors: Elisabeth Ailer, Niclas Dern, Jason S. Hartford, Niki Kilbertus
Year: 2024
Venue: Neural Information Processing Systems
DOI: 10.48550/arXiv.2405.19985
Citations: 4
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: autonomous research agents (matched: experiment design).
Code: Not found.
Extraction: method/data pending

Abstract

Scientific hypotheses typically concern specific aspects of complex, imperfectly understood or entirely unknown mechanisms, such as the effect of gene expression levels on phenotypes or how microbial communities influence environmental health. Such queries are inherently causal (rather than purely associational), but in many settings, experiments cannot be conducted directly on the target variables of interest and are instead indirect: they perturb the target variable, but do not remove potential confounding factors. If, additionally, the resulting experimental measurements are multi-dimensional and the studied mechanisms nonlinear, the query of interest is generally not identified. We develop an adaptive strategy to design indirect experiments that optimally inform a targeted query about the ground truth mechanism in terms of sequentially narrowing the gap between an upper and lower bound on the query. While the general formulation consists of a bi-level optimization procedure, we derive an efficiently estimable analytical kernel-based estimator of the bounds for the causal effect, a query of key interest, and demonstrate the efficacy of our approach in confounded, multivariate, nonlinear synthetic settings.

Claim

Scientific hypotheses typically concern specific aspects of complex, imperfectly understood or entirely unknown mechanisms, such as the effect of gene expression levels on phenotypes or how microbial communities influence environmental health.

Toward a Stable, Fair, and Comprehensive Evaluation of Object Hallucination in Large Vision-Language Models Paper

Authors: Hongliang Wei, Xingtao Wang, Xianqi Zhang, Xiaopeng Fan, Debin Zhao
Year: 2024
Venue: Neural Information Processing Systems
DOI: 10.52202/079017-3538
Citations: 4
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, large vision language models, lvlms).
Code: Not found.
Extraction: method/data pending

Abstract

Given different instructions, large vision-language models (LVLMs) exhibit different degrees of object hallucinations, posing a significant challenge to the evaluation of object hallucinations. Overcoming this challenge, existing object hallucination evaluation methods average the results obtained from a set of instructions. However, these methods fail to provide consistent evaluation across instruction sets that generate image descriptions of significantly different lengths. In this paper, we present the first systematic investigation into the effect of instructions on object hallucinations in LVLMs, with a specific focus on the role played by image description lengths. A valuable finding is that instructions indirectly affect hallucinations through the length of image descriptions. The longer the image description, the higher the object hallucination degree. Accordingly, we fit an informative length-hallucination curve, upon which a fine-grained evaluation framework named LeHaCE is introduced for evaluating object hallucinations at any given image description length. LeHaCE evaluates the object hallucination degree at a uniform image description length to mitigate the effect of description lengths, promoting stability and fairness. Moreover, LeHaCE incorporates the curve slope as an innovative hallucination evaluation metric, reflecting the extent to which the object hallucination degree is affected by the image description length, achieving a more comprehensive evaluation. Experimental results demonstrate that LeHaCE provides a more stable, fair, and comprehensive evaluation of object hallucinations in LVLMs compared to existing methods.

Claim

Given different instructions, large vision-language models (LVLMs) exhibit different degrees of object hallucinations, posing a significant challenge to the evaluation of object hallucinations.
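
Illustrative sketch (not the LeHaCE code): the abstract describes fitting a length-hallucination curve, reading it off at a uniform description length, and reporting its slope. The example assumes the curve is a straight line fitted by ordinary least squares, which the abstract does not state. The function names and the four (length, rate) pairs are hypothetical values chosen only to exercise the code, not results from the paper.

```python
def fit_length_hallucination_line(lengths, rates):
    """Ordinary least-squares fit: rate ~ slope * length + intercept."""
    n = len(lengths)
    mean_x = sum(lengths) / n
    mean_y = sum(rates) / n
    sxx = sum((x - mean_x) ** 2 for x in lengths)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lengths, rates))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def hallucination_at(length, slope, intercept):
    """Hallucination degree predicted at a chosen description length."""
    return slope * length + intercept


# Hypothetical measurements: one (average length, hallucination rate)
# pair per instruction set.
lengths = [20.0, 40.0, 60.0, 80.0]
rates = [0.10, 0.14, 0.18, 0.22]

slope, intercept = fit_length_hallucination_line(lengths, rates)
print(slope, hallucination_at(50.0, slope, intercept))
```

Comparing two models at the same length (here 50 words) removes the length effect from the comparison, while the slope reports how quickly hallucination grows as descriptions get longer.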

Homology Consistency Constrained Efficient Tuning for Vision-Language Models Paper

Authors: Huatian Zhang, Lei Zhang, Yongdong Zhang, Zhendong Mao
Year: 2024
Venue: Neural Information Processing Systems
DOI: 10.52202/079017-2953
Citations: 4
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models, vlms).
Code: Not found.
Extraction: method/data pending

Abstract

Efficient transfer learning has shown remarkable performance in tuning large-scale vision-language models (VLMs) toward downstream tasks with limited data resources. The key challenge of efficient transfer lies in adjusting image-text alignment to be task-specific while preserving pre-trained general knowledge. However, existing methods adjust image-text alignment merely on a set of observed samples, e.g., the data set and an external knowledge base, which cannot guarantee that the correspondence of general concepts between image and text latent manifolds is kept undisrupted, and this leads to weak generalization of the adjusted alignment. In this work, we propose a Homology Consistency (HC) constraint for efficient transfer on VLMs, which explicitly constrains the correspondence of image and text latent manifolds through structural equivalence based on persistent homology in downstream tuning. Specifically, we build a simplicial complex on top of the data to mimic the topology of latent manifolds, then track the persistence of the homology classes of topological features across multiple scales, and guide the directions of persistence tracks in image and text manifolds to coincide with each other, with an additional deviating perturbation. For practical application, we tailor the implementation of our proposed HC constraint for two main paradigms of adapter tuning. Extensive experiments on few-shot learning over 11 datasets and domain generalization demonstrate the effectiveness and robustness of our method.

Claim

Efficient transfer learning has shown remarkable performance in tuning large-scale vision-language models (VLMs) toward downstream tasks with limited data resources.

Déjà Vu Memorization in Vision-Language Models Paper

Authors: Bargav Jayaraman, Chuan Guo, Kamalika Chaudhuri
Year: 2024
Venue: Neural Information Processing Systems
DOI: 10.48550/arXiv.2402.02103
Citations: 4
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models).
Code: Not found.
Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.

WhodunitBench: Evaluating Large Multimodal Agents via Murder Mystery Games Paper

Authors: Junlin Xie, Ruifei Zhang, Zhihong Chen, Xiang Wan, Guanbin Li
Year: 2024
Venue: Neural Information Processing Systems
DOI: 10.52202/079017-2751
Citations: 4
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: multimodal agents (matched: multimodal agents).
Code: Not found.
Extraction: method/data pending

Abstract

Recently, large language models (LLMs) have achieved superior performance, empowering the development of large multimodal agents (LMAs). An LMA expected to perform practical tasks must possess a range of capabilities, including multimodal perception, interaction, reasoning, and decision-making skills. However, existing benchmarks are limited in assessing compositional skills and actions

Claim

Recently, large language models (LLMs) have achieved superior performance, empowering the development of large multimodal agents (LMAs).

MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model Paper

Authors: Chaoya Jiang, Hongrui Jia, Haiyang Xu, Wei Ye, Mengfan Dong, Mingshi Yan, Ji Zhang, Fei Huang, Shikun Zhang
Year: 2024
Venue: Neural Information Processing Systems
DOI: 10.48550/arXiv.2408.12321
Citations: 3
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: multimodal large language model, mllms, multimodal large language models).
Code: Not found.
Extraction: method/data pending

Abstract

This paper presents MaVEn, an innovative Multi-granularity Visual Encoding framework designed to enhance the capabilities of Multimodal Large Language Models (MLLMs) in multi-image reasoning. Current MLLMs primarily focus on single-image visual understanding, limiting their ability to interpret and integrate information across multiple images. MaVEn addresses this limitation by combining discrete visual symbol sequences, which abstract coarse-grained semantic concepts, with traditional continuous representation sequences that model fine-grained features. This dual approach bridges the semantic gap between visual and textual data, thereby improving the model's ability to process and interpret information from multiple images effectively. Additionally, we design a dynamic reduction mechanism for long-sequence continuous features to enhance multi-image processing efficiency. Experimental results demonstrate that MaVEn significantly enhances MLLMs' understanding in complex multi-image scenarios, while also improving performance in single-image contexts.

Claim

This paper presents MaVEn, an innovative Multi-granularity Visual Encoding framework designed to enhance the capabilities of Multimodal Large Language Models (MLLMs) in multi-image reasoning.

UMFC: Unsupervised Multi-Domain Feature Calibration for Vision-Language Models Paper

Authors: Jiachen Liang, Ruibing Hou, Minyang Hu, Hong Chang, Shiguang Shan, Xilin Chen
Year: 2024
Venue: Neural Information Processing Systems
DOI: 10.48550/arXiv.2411.06921
Citations: 3
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: vision-language models (matched: vision language models).
Code: https://github.com/GIT-LJc/UMFC
Extraction: method/data pending

Abstract

Pre-trained vision-language models (e.g., CLIP) have shown powerful zero-shot transfer capabilities. But they still struggle with domain shifts and typically require labeled data to adapt to downstream tasks, which could be costly. In this work, we aim to leverage unlabeled data that naturally spans multiple domains to enhance the transferability of vision-language models. Under this unsupervised multi-domain setting, we have identified inherent model bias within CLIP, notably in its visual and text encoders. Specifically, we observe that CLIP's visual encoder tends to prioritize encoding domain over discriminative category information, while its text encoder exhibits a preference for domain-relevant classes. To mitigate this model bias, we propose a training-free and label-free feature calibration method, Unsupervised Multi-domain Feature Calibration (UMFC). UMFC estimates image-level biases from domain-specific features and text-level biases from the direction of domain transition. These biases are subsequently subtracted from original image and text features separately, to render them domain-invariant. We evaluate our method on multiple settings including transductive learning and test-time adaptation. Extensive experiments show that our method outperforms CLIP and performs on par with state-of-the-art methods that need additional annotations or optimization. Our code is available at https://github.com/GIT-LJc/UMFC.

Claim

Pre-trained vision-language models (e.g., CLIP) have shown powerful zero-shot transfer capabilities.
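
Illustrative sketch (not the UMFC code): the abstract says image-level biases are estimated from domain-specific features and then subtracted. One simple way to picture this, assumed here, is to take each domain's mean feature as its bias and subtract it from every feature in that domain. The function name calibrate_by_domain, the known domain labels (which in an unlabeled setting would have to come from something like clustering), and the toy 2-D features are all hypothetical.

```python
def calibrate_by_domain(features, domains):
    """Subtract each domain's mean feature from the features in that domain."""
    sums, counts = {}, {}
    for feat, dom in zip(features, domains):
        if dom not in sums:
            sums[dom] = [0.0] * len(feat)
            counts[dom] = 0
        sums[dom] = [s + v for s, v in zip(sums[dom], feat)]
        counts[dom] += 1

    # Assumed bias estimate: the per-domain mean feature vector.
    bias = {dom: [s / counts[dom] for s in sums[dom]] for dom in sums}

    return [
        [v - b for v, b in zip(feat, bias[dom])]
        for feat, dom in zip(features, domains)
    ]


# Toy features: the two domains differ by a large constant offset.
features = [[1.0, 5.0], [3.0, 5.0], [10.0, 0.0], [12.0, 0.0]]
domains = ["photo", "photo", "sketch", "sketch"]

print(calibrate_by_domain(features, domains))
```

After calibration the two domains share the same mean, so the offset that separated "photo" from "sketch" no longer dominates the features, and no labels or training were needed.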

Training Binary Neural Networks via Gaussian Variational Inference and Low-Rank Semidefinite Programming Paper

Authors: L. Orecchia, Jiawei Hu, Xue He, W. Mark, Xulei Yang, Min Wu, Xue Geng
Year: 2024
Venue: Neural Information Processing Systems
DOI: 10.52202/079017-2042
Citations: 3
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: program synthesis (matched: programming).
Code: Not found.
Extraction: method/data pending

Abstract

Improving the training of Binarized Neural Networks (BNNs) is a longstanding challenge whose outcome can significantly affect our ability to deploy deep learning ubiquitously. Current methods heavily rely on latent weights and the heuristic straight-through estimator (STE), which enable the application of SGD-based optimizers to the combinatorial training problem, but remain theoretically poorly understood. In this paper, we propose an optimization framework for BNN training based on Gaussian variational inference. Our approach yields a non-convex linear programming formulation that theoretically motivates the use of latent weights, STE and weight clipping. More importantly, it allows us to go beyond latent weights to formulate and solve low-rank semidefinite programming (SDP) relaxations that explicitly model and learn pairwise correlations between weights during training, resulting in improved accuracy. Our empirical evaluation on CIFAR-10, CIFAR-100, Tiny-ImageNet and ImageNet datasets shows our method consistently outperforms all state-of-the-art algorithms for training BNNs.

Claim

Improving the training of Binarized Neural Networks (BNNs) is a longstanding challenge whose outcome can significantly affect our ability to deploy deep learning ubiquitously.
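
Illustrative sketch of the baseline the abstract refers to (latent weights, the straight-through estimator, and weight clipping), not of the paper's variational or SDP method. It assumes a single linear unit with squared-error loss. The function names and the toy numbers are hypothetical.

```python
def sign(v):
    """Binarize to +1 / -1 (zero maps to +1 by convention here)."""
    return 1.0 if v >= 0 else -1.0


def ste_step(latent, x, target, lr=0.1):
    """One SGD step on real-valued latent weights behind sign()."""
    binary = [sign(w) for w in latent]            # forward pass uses +-1 weights
    pred = sum(b * xi for b, xi in zip(binary, x))
    err = pred - target                           # d(0.5 * err^2) / d(pred)
    grad_binary = [err * xi for xi in x]          # gradient w.r.t. binary weights

    new_latent = []
    for w, g in zip(latent, grad_binary):
        # Straight-through estimator: treat d sign(w) / dw as 1 inside
        # [-1, 1] and as 0 outside, since the true derivative is zero
        # almost everywhere and would block learning.
        g_latent = g if abs(w) <= 1.0 else 0.0
        w_new = w - lr * g_latent
        new_latent.append(max(-1.0, min(1.0, w_new)))  # weight clipping
    return new_latent


latent = [0.05, -0.5]
updated = ste_step(latent, x=[1.0, 1.0], target=-2.0)
print(updated, [sign(w) for w in updated])
```

In this toy case the first latent weight crosses zero after one step, so its binarized value flips from +1 to -1 and the prediction moves from 0 to the target of -2.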

Multidimensional Fractional Programming for Normalized Cuts Paper

Authors: Yannan Chen, Beichen Huang, Licheng Zhao, Kaiming Shen
Year: 2024
Venue: Neural Information Processing Systems
DOI: 10.52202/079017-2843
Citations: 3
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: program synthesis (matched: programming).
Code: Not found.
Extraction: method/data pending

Abstract

The Normalized cut (NCut) problem is a fundamental and yet notoriously difficult one in the unsupervised clustering field. Because the NCut problem is fractionally structured, the fractional programming (FP) based approach has worked its way into a new frontier. However, the conventional FP techniques are insufficient: the classic Dinkelbach’s transform can only deal with a single ratio and hence is limited to the two-class clustering, while the state-of-the-art quadratic transform accounts for multiple ratios but fails to convert the NCut problem to a tractable form. This work advocates a novel extension of the quadratic transform to the multidimensional ratio case, thereby recasting the fractional 0-1 NCut problem into a bipartite matching problem—which can be readily solved in an iterative manner. Furthermore, we explore the connection between the proposed multidimensional FP method and the minorization-maximization theory to verify the convergence.

Claim

The Normalized cut (NCut) problem is a fundamental and yet notoriously difficult one in the unsupervised clustering field.
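
Illustrative sketch of the classic single-ratio technique the abstract contrasts against (Dinkelbach's transform), not of the paper's multidimensional method. It assumes a finite candidate set and a strictly positive denominator, and maximizes one ratio N(x)/D(x) by repeatedly solving the parametric problem max N(x) - lam * D(x). The function name and the toy candidates are hypothetical.

```python
def dinkelbach(candidates, numer, denom, tol=1e-12, max_iter=100):
    """Maximize numer(x) / denom(x) over candidates, with denom(x) > 0."""
    x = candidates[0]
    lam = numer(x) / denom(x)
    for _ in range(max_iter):
        # Parametric subproblem: the ratio is replaced by a difference.
        x = max(candidates, key=lambda c: numer(c) - lam * denom(c))
        gap = numer(x) - lam * denom(x)
        if gap <= tol:
            break                      # lam is the optimal ratio
        lam = numer(x) / denom(x)      # Dinkelbach update
    return x, lam


# Toy candidates given as (numerator, denominator) pairs.
pairs = [(1.0, 2.0), (3.0, 4.0), (5.0, 9.0), (2.0, 2.0)]
best, ratio = dinkelbach(pairs, numer=lambda p: p[0], denom=lambda p: p[1])
print(best, ratio)
```

Because a single scalar lam tracks a single ratio, this scheme does not directly extend to a sum of several ratios, which is the limitation the abstract points to for clustering with more than two classes.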

Fast Proxy Experiment Design for Causal Effect Identification Paper

Authors: Sepehr Elahi, S. Akbari, Jalal Etesami, Negar Kiyavash, Patrick Thiran
Year: 2024
Venue: Neural Information Processing Systems
DOI: 10.48550/arXiv.2407.05330
Citations: 1
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: autonomous research agents (matched: experiment design).
Code: Not found.
Extraction: method/data pending

Abstract

Identifying causal effects is a key problem of interest across many disciplines. The two long-standing approaches to estimate causal effects are observational and experimental (randomized) studies. Observational studies can suffer from unmeasured confounding, which may render the causal effects unidentifiable. On the other hand, direct experiments on the target variable may be too costly or even infeasible to conduct. A middle ground between these two approaches is to estimate the causal effect of interest through proxy experiments, which are conducted on variables with a lower cost to intervene on compared to the main target. Akbari et al. [2022] studied this setting and demonstrated that the problem of designing the optimal (minimum-cost) experiment for causal effect identification is NP-complete and provided a naive algorithm that may require solving exponentially many NP-hard problems as a sub-routine in the worst case. In this work, we provide a few reformulations of the problem that allow for designing significantly more efficient algorithms to solve it as witnessed by our extensive simulations. Additionally, we study the closely-related problem of designing experiments that enable us to identify a given effect through valid adjustment sets.

Claim

Identifying causal effects is a key problem of interest across many disciplines.

Face2QR: A Unified Framework for Aesthetic, Face-Preserving, and Scannable QR Code Generation Paper

Authors: Xuehao Cui, Guangyang Wu, Zhenghao Gan, Guangtao Zhai, Xiaohong Liu
Year: 2024
Venue: Neural Information Processing Systems
DOI: 10.48550/arXiv.2411.19246
Citations: 1
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: program synthesis (matched: code generation).
Code: https://github.com/cavosamir/Face2QR
Extraction: method/data pending

Abstract

Existing methods to generate aesthetic QR codes, such as image and style transfer techniques, tend to compromise either the visual appeal or the scannability of QR codes when they incorporate human face identity. Addressing these imperfections, we present Face2QR, a novel pipeline specifically designed for generating personalized QR codes that harmoniously blend aesthetics, face identity, and scannability. Our pipeline introduces three innovative components. First, the ID-refined QR integration (IDQR) seamlessly intertwines the background styling with face ID, utilizing a unified Stable Diffusion (SD)-based framework with control networks. Second, the ID-aware QR ReShuffle (IDRS) effectively rectifies the conflicts between face IDs and QR patterns, rearranging QR modules to maintain the integrity of facial features without compromising scannability. Lastly, the ID-preserved Scannability Enhancement (IDSE) markedly boosts scanning robustness through latent code optimization, striking a delicate balance between face ID, aesthetic quality and QR functionality. In comprehensive experiments, Face2QR demonstrates remarkable performance, outperforming existing approaches, particularly in preserving facial recognition features within custom QR code designs. Codes are available at https://github.com/cavosamir/Face2QR.

Claim

Existing methods to generate aesthetic QR codes, such as image and style transfer techniques, tend to compromise either the visual appeal or the scannability of QR codes when they incorporate human face identity.

Learning Generalized Linear Programming Value Functions Paper

Authors: Tu Anh-Nguyen, Joey Huchette, Christian Tjandraatmadja
Year: 2024
Venue: Neural Information Processing Systems
DOI: 10.52202/079017-4305
Citations: 1
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: program synthesis (matched: programming).
Code: Not found.
Extraction: method/data pending

Abstract

We develop a theoretically-grounded learning method for the Generalized Linear Programming Value Function (GVF), which models the optimal value of a linear programming (LP) problem as its objective and constraint bounds vary. This function plays a fundamental role in algorithmic techniques for large-scale optimization, particularly in decomposition for two-stage mixed-integer linear programs (MILPs). This paper establishes a structural characterization of the GVF that enables it to be modeled as a particular neural network architecture, which we then use to learn the GVF in a way that benefits from three notable properties. First, our method produces a true under-approximation of the value function with respect to the constraint bounds. Second, the model is input-convex in the constraint bounds, which not only matches the structure of the GVF but also enables the trained model to be efficiently optimized over using LP. Finally, our learning method is unsupervised, meaning that training data generation does not require computing LP optimal values, which can be prohibitively expensive at large scales. We numerically show that our method can approximate the GVF well, even when compared to supervised methods that collect training data by solving an LP for each data point. Furthermore, as an application of our framework, we develop a fast heuristic method for large-scale two-stage MILPs with continuous second-stage variables, via a compact reformulation that can be solved faster than the full model linear relaxation at large scales and orders of magnitude faster than the original model.

Claim

We develop a theoretically-grounded learning method for the Generalized Linear Programming Value Function (GVF), which models the optimal value of a linear programming (LP) problem as its objective and constraint bounds vary.
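
Illustrative sketch (not the paper's architecture): two properties named in the abstract, under-approximation and convexity in the constraint bounds, follow from LP duality and can be checked on a toy problem. The example assumes a fixed objective and the hypothetical LP "minimize x1 + x2 subject to x1 >= b1, x2 >= b2, x1 + x2 >= b3, x >= 0". Its dual feasible set is y1 + y3 <= 1, y2 + y3 <= 1, y >= 0, whose five vertices are hard-coded below.

```python
def value_lower_bound(b, dual_points):
    """max over dual-feasible y of b.y: a convex, piecewise-linear function of b."""
    return max(sum(bi * yi for bi, yi in zip(b, y)) for y in dual_points)


# All vertices of the toy dual polytope: using them all gives the exact value.
ALL_VERTICES = [
    (0.0, 0.0, 0.0),
    (1.0, 0.0, 0.0),
    (0.0, 1.0, 0.0),
    (1.0, 1.0, 0.0),
    (0.0, 0.0, 1.0),
]

# A model that knows only some dual-feasible points can never overshoot.
PARTIAL = [(1.0, 1.0, 0.0)]

b = (1.0, 2.0, 4.0)
print(value_lower_bound(b, ALL_VERTICES), value_lower_bound(b, PARTIAL))
```

Any collection of dual-feasible points yields a maximum of linear functions of b, which is convex in b and never exceeds the true optimal value; this is one simple way to see why a model with that structure can be a guaranteed under-approximation.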

Trading off Consistency and Dimensionality of Convex Surrogates for Multiclass Classification Paper

Authors: Enrique B. Nueve, Dhamma Kimpara, Bo Waggoner, Jessie Finocchiaro
Year: 2024
Venue: Neural Information Processing Systems
DOI: Not stated.
Citations: 1
Relevance: 3 / 5
Why selected: Heuristic keyword/alias matches: alpha factor search (matched: trading).
Code: Not found.
Extraction: method/data pending

Abstract

Not stated in metadata.

Claim

Not stated in abstract.