Related papers: Data-efficient Large Vision Models through Sequent…

Sequential Modeling Enables Scalable Learning for Large Vision Models

We introduce a novel sequential modeling approach which enables learning a Large Vision Model (LVM) without making use of any linguistic data. To do this, we define a common format, "visual sentences", in which we can represent raw images…

Computer Vision and Pattern Recognition · Computer Science 2023-12-04 Yutong Bai, Xinyang Geng, Karttikeya Mangalam, Amir Bar, Alan Yuille, Trevor Darrell, Jitendra Malik, Alexei A Efros

A Survey on Vision Autoregressive Model

Autoregressive models have demonstrated great performance in natural language processing (NLP) with impressive scalability, adaptability and generalizability. Inspired by their notable success in NLP field, autoregressive models have been…

Computer Vision and Pattern Recognition · Computer Science 2024-11-19 Kai Jiang, Jiaxing Huang

Autoregressive Models in Vision: A Survey

Autoregressive modeling has been a huge success in the field of natural language processing (NLP). Recently, autoregressive models have emerged as a significant area of focus in computer vision, where they excel in producing high-quality…

Computer Vision and Pattern Recognition · Computer Science 2025-06-03 Jing Xiong, Gongye Liu, Lun Huang, Chengyue Wu, Taiqiang Wu, Yao Mu, Yuan Yao, Hui Shen, Zhongwei Wan, Jinfa Huang, Chaofan Tao, Shen Yan, Huaxiu Yao, Lingpeng Kong, Hongxia Yang, Mi Zhang, Guillermo Sapiro, Jiebo Luo, Ping Luo, Ngai Wong

Towards Unifying Understanding and Generation in the Era of Vision Foundation Models: A Survey from the Autoregression Perspective

Autoregression in large language models (LLMs) has shown impressive scalability by unifying all language tasks into the next token prediction paradigm. Recently, there is a growing interest in extending this success to vision foundation…

Computer Vision and Pattern Recognition · Computer Science 2024-10-31 Shenghao Xie, Wenqiang Zu, Mingyang Zhao, Duo Su, Shilong Liu, Ruohua Shi, Guoqi Li, Shanghang Zhang, Lei Ma

Visual Self-Refinement for Autoregressive Models

Autoregressive models excel in sequential modeling and have proven to be effective for vision-language data. However, the spatial nature of visual signals conflicts with the sequential dependencies of next-token prediction, leading to…

Computer Vision and Pattern Recognition · Computer Science 2025-10-02 Jiamian Wang, Ziqi Zhou, Chaithanya Kumar Mummadi, Sohail Dianat, Majid Rabbani, Raghuveer Rao, Chen Qiu, Zhiqiang Tao

Chameleon: A Data-Efficient Generalist for Dense Visual Prediction in the Wild

Large language models have evolved data-efficient generalists, benefiting from the universal language interface and large-scale pre-training. However, constructing a data-efficient generalist for dense visual prediction presents a distinct…

Computer Vision and Pattern Recognition · Computer Science 2024-12-20 Donggyun Kim, Seongwoong Cho, Semin Kim, Chong Luo, Seunghoon Hong

AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition

Vision-Language Models (VLMs) have achieved remarkable success in visual question answering tasks, but their reliance on large numbers of visual tokens introduces significant computational overhead. While existing efficient VLM approaches…

Computer Vision and Pattern Recognition · Computer Science 2026-03-24 Zichuan Lin, Yicheng Liu, Yang Yang, Lvfang Tao, Deheng Ye

Kelix Technical Report

Autoregressive large language models (LLMs) scale well by expressing diverse tasks as sequences of discrete natural-language tokens and training with next-token prediction, which unifies comprehension and generation under self-supervision.…

Computer Vision and Pattern Recognition · Computer Science 2026-02-13 Boyang Ding, Chenglong Chu, Dunju Zang, Han Li, Jiangxia Cao, Kun Gai, Muhao Wei, Ruiming Tang, Shiyao Wang, Siyang Mao, Xinchen Luo, Yahui Liu, Zhixin Ling, Zhuoran Yang, Ziming Li, Chengru Song, Guorui Zhou, Guowang Zhang, Hao Peng, Hao Wang, Jiaxin Deng, Jin Ouyang, Jinghao Zhang, Lejian Ren, Qianqian Wang, Qigen Hu, Tao Wang, Xingmei Wang, Yiping Yang, Zixing Zhang, Ziqi Wang

Recurrent Models of Visual Attention

Applying convolutional neural networks to large images is computationally expensive because the amount of computation scales linearly with the number of image pixels. We present a novel recurrent neural network model that is capable of…

Machine Learning · Computer Science 2014-06-25 Volodymyr Mnih, Nicolas Heess, Alex Graves, Koray Kavukcuoglu

Visual Autoregressive Models Beat Diffusion Models on Inference Time Scaling

While inference-time scaling through search has revolutionized Large Language Models, translating these gains to image generation has proven difficult. Recent attempts to apply search strategies to continuous diffusion models show limited…

Computer Vision and Pattern Recognition · Computer Science 2025-10-28 Erik Riise, Mehmet Onurcan Kaya, Dim P. Papadopoulos

On Efficient Language and Vision Assistants for Visually-Situated Natural Language Understanding: What Matters in Reading and Reasoning

Recent advancements in language and vision assistants have showcased impressive capabilities but suffer from a lack of transparency, limiting broader research and reproducibility. While open-source models handle general image tasks…

Computer Vision and Pattern Recognition · Computer Science 2024-10-08 Geewook Kim, Minjoon Seo

Parallelized Autoregressive Visual Generation

Autoregressive models have emerged as a powerful approach for visual generation but suffer from slow inference speed due to their sequential token-by-token prediction process. In this paper, we propose a simple yet effective approach for…

Computer Vision and Pattern Recognition · Computer Science 2025-04-04 Yuqing Wang, Shuhuai Ren, Zhijie Lin, Yujin Han, Haoyuan Guo, Zhenheng Yang, Difan Zou, Jiashi Feng, Xihui Liu

DAViD: Data-efficient and Accurate Vision Models from Synthetic Data

The state of the art in human-centric computer vision achieves high accuracy and robustness across a diverse range of tasks. The most effective models in this domain have billions of parameters, thus requiring extremely large datasets,…

Computer Vision and Pattern Recognition · Computer Science 2025-07-22 Fatemeh Saleh, Sadegh Aliakbarian, Charlie Hewitt, Lohit Petikam, Xiao-Xian, Antonio Criminisi, Thomas J. Cashman, Tadas Baltrušaitis

Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better

Typical large vision-language models (LVLMs) apply autoregressive supervision solely to textual sequences, without fully incorporating the visual modality into the learning process. This results in three key limitations: (1) an inability to…

Computer Vision and Pattern Recognition · Computer Science 2026-01-06 Dianyi Wang, Wei Song, Yikun Wang, Siyuan Wang, Kaicheng Yu, Zhongyu Wei, Jiaqi Wang

Efficient Inference for Large Vision-Language Models: Bottlenecks, Techniques, and Prospects

Large Vision-Language Models (LVLMs) enable sophisticated reasoning over images and videos, yet their inference is hindered by a systemic efficiency barrier known as visual token dominance. This overhead is driven by a multi-regime…

Computation and Language · Computer Science 2026-04-15 Jun Zhang, Yicheng Ji, Feiyang Ren, Yihang Li, Bowen Zeng, Zonghao Chen, Ke Chen, Lidan Shou, Gang Chen, Huan Li

Towards Efficient Large Vision-Language Models: A Comprehensive Survey on Inference Strategies

Although Large Vision Language Models (LVLMs) have demonstrated impressive multimodal reasoning capabilities, their scalability and deployment are constrained by massive computational requirements. In particular, the massive amount of…

Machine Learning · Computer Science 2026-04-14 Surendra Pathak, Bo Han

An Efficient General-Purpose Modular Vision Model via Multi-Task Heterogeneous Training

We present a model that can perform multiple vision tasks and can be adapted to other downstream tasks efficiently. Despite considerable progress in multi-task learning, most efforts focus on learning from multi-label data: a single image…

Computer Vision and Pattern Recognition · Computer Science 2023-06-30 Zitian Chen, Mingyu Ding, Yikang Shen, Wei Zhan, Masayoshi Tomizuka, Erik Learned-Miller, Chuang Gan

Rethinking Model Efficiency: Multi-Agent Inference with Large Models

Most vision-language models (VLMs) apply a large language model (LLM) as the decoder, where the response tokens are generated sequentially through autoregression. Therefore, the number of output tokens can be the bottleneck of the…

Computer Vision and Pattern Recognition · Computer Science 2026-04-07 Sixun Dong, Juhua Hu, Steven Li, Wei Wen, Qi Qian

The Perceptual Bandwidth Bottleneck in Vision-Language Models: Active Visual Reasoning via Sequential Experimental Design

Visual perception in modern Vision-Language Models (VLMs) is constrained by a perceptual bandwidth bottleneck: a broad field of view preserves global context but sacrifices the fine-grained details required for complex reasoning. We argue…

Computer Vision and Pattern Recognition · Computer Science 2026-05-12 Anjie Liu, Ziqin Gong, Yan Song, Yuxiang Chen, Xiaolong Liu, Hengtong Lu, Kaike Zhang, Chen Wei, Jun Wang

Latent Implicit Visual Reasoning

While Large Multimodal Models (LMMs) have made significant progress, they remain largely text-centric, relying on language as their core reasoning modality. As a result, they are limited in their ability to handle reasoning tasks that are…

Computer Vision and Pattern Recognition · Computer Science 2025-12-25 Kelvin Li, Chuyi Shang, Leonid Karlinsky, Rogerio Feris, Trevor Darrell, Roei Herzig