ML + Vision Top-6 Agent Survey - CVPR 2025 - Page 5 of 5
Overview | Previous: CVPR 2025 p4 | Page 5 / 5 | Next: ICCV 2023
- Venue: Computer Vision and Pattern Recognition
- Year: 2025
- Page: 5 / 5
- Papers: 121-124 / 124
Papers
GroundingFace: Fine-grained Face Understanding via Pixel Grounding Multimodal Large Language Model
Abstract
Multimodal Large Language Models (MLLMs) have shown remarkable performance in image understanding, generation, and editing, with recent advancements achieving pixel-level grounding with reasoning. However, these models, built for common objects, struggle with fine-grained face understanding. In this work, we introduce the FacePlayGround-240K dataset, the first large-scale, pixel-grounded face caption and question-answer (QA) dataset, which includes 240K images, 47 mask categories, 5.4M mask annotations, and 7.3M grounded regions, meticulously curated for alignment pretraining and instruction-tuning. We present the GroundingFace framework, specifically designed to enhance fine-grained face understanding. This framework significantly augments the capabilities of existing grounding models in face part segmentation and face attribute comprehension while preserving general scene understanding. Comprehensive experiments validate that our approach surpasses current state-of-the-art models in pixel-grounded face captioning/QA and various downstream tasks, including face captioning, referring segmentation, and zero-shot face attribute recognition.
Claim
Multimodal Large Language Models (MLLMs) have shown remarkable performance in image understanding, generation, and editing, with recent advancements achieving pixel-level grounding with reasoning.
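To give a sense of annotation density, the dataset figures quoted in the abstract can be turned into rough per-image averages. This is only back-of-the-envelope arithmetic on the rounded numbers stated above (240K images, 5.4M mask annotations, 7.3M grounded regions); the variable names are illustrative, and the true averages depend on the exact counts, which the abstract does not give.

```python
# Rounded figures as stated in the abstract; exact counts are not given there.
num_images = 240_000
num_mask_annotations = 5_400_000
num_grounded_regions = 7_300_000

# Approximate per-image averages derived from the rounded figures.
masks_per_image = num_mask_annotations / num_images
regions_per_image = num_grounded_regions / num_images

print(f"about {masks_per_image:.1f} mask annotations per image")
print(f"about {regions_per_image:.1f} grounded regions per image")
```

Under these rounded figures, each image carries roughly 22 mask annotations and roughly 30 grounded regions on average.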
VLMs-Guided Representation Distillation for Efficient Vision-Based Reinforcement Learning
Abstract
Vision-based Reinforcement Learning (VRL) attempts to establish associations between visual inputs and optimal actions through interactions with the environment. Given the high-dimensional and complex nature of visual data, it becomes essential to learn a policy based on high-quality state representation. To this end, existing VRL methods primarily rely on interaction-collected data, combined with self-supervised auxiliary tasks. However, two key challenges remain: limited data samples and a lack of task-relevant semantic constraints. To tackle these challenges, we propose DGC, a method that Distills Guidance from Visual Language Models (VLMs) alongside self-supervised learning into a Compact VRL agent. Notably, we leverage the state representation capabilities of VLMs, rather than their decision-making abilities. Within DGC, a novel prompting-reasoning pipeline is designed to convert historical observations and actions into usable supervision signals, enabling semantic understanding within the compact visual encoder. By leveraging these distilled semantic representations, the VRL agent achieves significant improvements in sample efficiency. Extensive experiments on the CARLA benchmark demonstrate our state-of-the-art performance.
Claim
Vision-based Reinforcement Learning (VRL) attempts to establish associations between visual inputs and optimal actions through interactions with the environment.
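The general idea of distilling a teacher's representation into a compact encoder alongside a self-supervised objective can be sketched as a combined loss. This is a minimal illustration, not DGC's actual objective: the abstract does not specify the loss, so the cosine-distance term, the weighted sum, and the names `cosine_distance`, `combined_representation_loss`, and `distill_weight` are all assumptions made here for illustration.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (norm_a * norm_b)


def combined_representation_loss(student_emb, teacher_emb, ssl_loss,
                                 distill_weight=1.0):
    """Hypothetical objective: self-supervised loss plus a weighted
    distillation term pulling the compact (student) embedding toward
    the teacher (VLM) embedding."""
    return ssl_loss + distill_weight * cosine_distance(student_emb, teacher_emb)


# Aligned embeddings add no distillation penalty; orthogonal ones add the
# full weight.
aligned = combined_representation_loss([1.0, 0.0], [1.0, 0.0], ssl_loss=0.5)
orthogonal = combined_representation_loss([1.0, 0.0], [0.0, 1.0],
                                          ssl_loss=0.5, distill_weight=2.0)
```

In a sketch like this, only the compact encoder would be needed at decision time, which is consistent with the abstract's point that the VLM supplies representations rather than actions.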
Once-Tuning-Multiple-Variants: Tuning Once and Expanded as Multiple Vision-Language Model Variants
Abstract
Vision-language models (VLMs) are among the most important models for multi-modal tasks. Real industrial applications often face the challenge of adapting VLMs to different scenarios, such as varying hardware platforms or performance requirements. Traditional methods involve training or fine-tuning multiple unique VLMs, or using model compression techniques to create multiple compact models. These approaches are complex and resource-intensive. This paper introduces a novel paradigm called Once-Tuning-Multiple-Variants (OTMV). OTMV requires only a single tuning process to inject dynamic weight expansion capacity into the original VLM structure. This tuned VLM can then be expanded into multiple variants tailored for different scenarios at inference. The tuning mechanism of OTMV is inspired by the mathematical series expansion theorem, which helps to reduce the parameter size and memory requirements while maintaining the accuracy of the VLM. Experimental results show that OTMV-tuned models achieve comparable accuracy to baseline VLMs across various vision-language tasks. The experiments also demonstrate the dynamic expansion capability of OTMV-tuned VLMs, outperforming traditional model compression and adaptation techniques in terms of accuracy and efficiency.
Claim
Vision-language models (VLMs) are among the most important models for multi-modal tasks.
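The abstract says only that the tuning mechanism is "inspired by" series expansion and does not describe the construction. As a generic illustration of how a truncated series can yield several variants from one stored set of terms, the sketch below sums the first few terms of a hypothetical weight decomposition. The function `expand_weights`, the term values, and the idea that a variant corresponds to a number of summed terms are assumptions for illustration, not OTMV's actual mechanism.

```python
def expand_weights(terms, num_terms):
    """Sum the first `num_terms` series terms elementwise.

    `terms` is a list of equal-length weight vectors, ordered from the
    leading term to progressively smaller corrections (an assumption).
    """
    if not 1 <= num_terms <= len(terms):
        raise ValueError("num_terms must be between 1 and len(terms)")
    size = len(terms[0])
    if any(len(term) != size for term in terms):
        raise ValueError("all terms must have the same length")
    return [sum(term[i] for term in terms[:num_terms]) for i in range(size)]


# Hypothetical series terms for a two-element weight vector.
series_terms = [[1.0, 2.0], [0.5, -0.5], [0.25, 0.25]]

# One stored set of terms, several variants chosen at inference time.
variants = {k: expand_weights(series_terms, k) for k in (1, 2, 3)}
```

Each entry in `variants` is built from the same stored terms, so selecting a variant needs no retraining in this toy setting; how OTMV trades accuracy against size across variants is described in the paper, not here.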
ImagineFSL: Self-Supervised Pretraining Matters on Imagined Base Set for VLM-based Few-shot Learning
Abstract
Adapting CLIP models for few-shot recognition has recently attracted significant attention. Despite considerable progress, these adaptations remain hindered by the pervasive challenge of data scarcity. Text-to-image models, capable of generating abundant photorealistic labeled images, offer a promising solution. However, existing approaches simply treat synthetic images as complements to real images, rather than as standalone knowledge repositories stemming from distinct foundation models. To overcome this limitation, we frame synthetic images as an imagined base set (iBase), i.e., an independent, large-scale synthetic dataset encompassing diverse concepts. Building on this perspective, we introduce ImagineFSL, a novel CLIP adaptation methodology that pretrains on iBase and then fine-tunes for downstream few-shot tasks. We find that, compared to no pretraining, both supervised and self-supervised pretraining are beneficial, with the latter providing better performance. Based on this finding, we propose an improved self-supervised method tailored for few-shot scenarios, enhancing the transferability of representations from synthetic to real image domains. Additionally, we present a systematic and scalable pipeline that employs chain-of-thought and in-context learning techniques, harnessing foundation models to automatically generate diverse, realistic images. Validated across eleven datasets, our methods consistently outperform state-of-the-art approaches by substantial margins.
Claim
Adapting CLIP models for few-shot recognition has recently attracted significant attention.