ML + Vision Top-6 Agent Survey - ICLR 2024 - Page 5 of 5¶
Overview | Previous: ICLR 2024 p4 | Page 5 / 5 | Next: ICLR 2025 p1
- Venue: International Conference on Learning Representations
- Year: 2024
- Page: 5 / 5
- Papers: 121-124 / 124
Papers
LLM-wrapper: Black-Box Semantic-Aware Adaptation of Vision-Language Models for Referring Expression Comprehension Paper
Abstract
Vision Language Models (VLMs) have demonstrated remarkable capabilities in various open-vocabulary tasks, yet their zero-shot performance lags behind task-specific fine-tuned models, particularly in complex tasks like Referring Expression Comprehension (REC). Fine-tuning usually requires 'white-box' access to the model's architecture and weights, which is not always feasible due to proprietary or privacy concerns. In this work, we propose LLM-wrapper, a method for 'black-box' adaptation of VLMs for the REC task using Large Language Models (LLMs). LLM-wrapper capitalizes on the reasoning abilities of LLMs, improved with a light fine-tuning, to select the most relevant bounding box matching the referring expression, from candidates generated by a zero-shot black-box VLM. Our approach offers several advantages: it enables the adaptation of closed-source models without needing access to their internal workings, it is versatile as it works with any VLM, it transfers to new VLMs and datasets, and it allows for the adaptation of an ensemble of VLMs. We evaluate LLM-wrapper on multiple datasets using different VLMs and LLMs, demonstrating significant performance improvements and highlighting the versatility of our method. While LLM-wrapper is not meant to directly compete with standard white-box fine-tuning, it offers a practical and effective alternative for black-box VLM adaptation. Code and checkpoints are available at https://github.com/valeoai/LLM_wrapper .
Claim
Vision Language Models (VLMs) have demonstrated remarkable capabilities in various open-vocabulary tasks, yet their zero-shot performance lags behind task-specific fine-tuned models, particularly in complex tasks like Referring Expression Comprehension (REC).
OCCAM: Towards Cost-Efficient and Accuracy-Aware Image Classification Inference Paper
Abstract
Classification tasks play a fundamental role in various applications, spanning domains such as healthcare, natural language processing and computer vision. With the growing popularity and capacity of machine learning models, people can easily access trained classifiers as a service online or offline. However, model use comes with a cost and classifiers of higher capacity (such as large foundation models) usually incur higher inference costs. To harness the respective strengths of different classifiers, we propose a principled approach, OCCAM, to compute the best classifier assignment strategy over classification queries (termed as the optimal model portfolio) so that the aggregated accuracy is maximized, under user-specified cost budgets. Our approach uses an unbiased and low-variance accuracy estimator and effectively computes the optimal solution by solving an integer linear programming problem. On a variety of real-world datasets, OCCAM achieves 40% cost reduction with little to no accuracy drop.
Claim
Classification tasks play a fundamental role in various applications, spanning domains such as healthcare, natural language processing and computer vision.
L2MAC: Large Language Model Automatic Computer for Extensive Code Generation Paper
Abstract
Not stated in metadata.
Claim
Not stated in abstract.
Large Language Models as Automated Aligners for benchmarking Vision-Language Models Paper
Abstract
Not stated in metadata.
Claim
Not stated in abstract.
Overview | Previous: ICLR 2024 p4 | Page 5 / 5 | Next: ICLR 2025 p1