Related papers: Data-efficient Large Vision Models through Sequent…

200 papers

We introduce a novel sequential modeling approach which enables learning a Large Vision Model (LVM) without making use of any linguistic data. To do this, we define a common format, "visual sentences", in which we can represent raw images…

Computer Vision and Pattern Recognition · Computer Science 2023-12-04 Yutong Bai, Xinyang Geng, Karttikeya Mangalam, Amir Bar, Alan Yuille, Trevor Darrell, Jitendra Malik, Alexei A Efros

Autoregressive models have demonstrated great performance in natural language processing (NLP) with impressive scalability, adaptability and generalizability. Inspired by their notable success in the NLP field, autoregressive models have been…

Computer Vision and Pattern Recognition · Computer Science 2024-11-19 Kai Jiang, Jiaxing Huang

Autoregressive modeling has been a huge success in the field of natural language processing (NLP). Recently, autoregressive models have emerged as a significant area of focus in computer vision, where they excel in producing high-quality…

Autoregression in large language models (LLMs) has shown impressive scalability by unifying all language tasks into the next token prediction paradigm. Recently, there is a growing interest in extending this success to vision foundation…

Computer Vision and Pattern Recognition · Computer Science 2024-10-31 Shenghao Xie, Wenqiang Zu, Mingyang Zhao, Duo Su, Shilong Liu, Ruohua Shi, Guoqi Li, Shanghang Zhang, Lei Ma

Autoregressive models excel in sequential modeling and have proven to be effective for vision-language data. However, the spatial nature of visual signals conflicts with the sequential dependencies of next-token prediction, leading to…

Computer Vision and Pattern Recognition · Computer Science 2025-10-02 Jiamian Wang, Ziqi Zhou, Chaithanya Kumar Mummadi, Sohail Dianat, Majid Rabbani, Raghuveer Rao, Chen Qiu, Zhiqiang Tao

Large language models have evolved into data-efficient generalists, benefiting from the universal language interface and large-scale pre-training. However, constructing a data-efficient generalist for dense visual prediction presents a distinct…

Computer Vision and Pattern Recognition · Computer Science 2024-12-20 Donggyun Kim, Seongwoong Cho, Semin Kim, Chong Luo, Seunghoon Hong

Vision-Language Models (VLMs) have achieved remarkable success in visual question answering tasks, but their reliance on large numbers of visual tokens introduces significant computational overhead. While existing efficient VLM approaches…

Computer Vision and Pattern Recognition · Computer Science 2026-03-24 Zichuan Lin, Yicheng Liu, Yang Yang, Lvfang Tao, Deheng Ye

Autoregressive large language models (LLMs) scale well by expressing diverse tasks as sequences of discrete natural-language tokens and training with next-token prediction, which unifies comprehension and generation under self-supervision.…

Applying convolutional neural networks to large images is computationally expensive because the amount of computation scales linearly with the number of image pixels. We present a novel recurrent neural network model that is capable of…

Machine Learning · Computer Science 2014-06-25 Volodymyr Mnih, Nicolas Heess, Alex Graves, Koray Kavukcuoglu

While inference-time scaling through search has revolutionized Large Language Models, translating these gains to image generation has proven difficult. Recent attempts to apply search strategies to continuous diffusion models show limited…

Computer Vision and Pattern Recognition · Computer Science 2025-10-28 Erik Riise, Mehmet Onurcan Kaya, Dim P. Papadopoulos

Recent advancements in language and vision assistants have showcased impressive capabilities but suffer from a lack of transparency, limiting broader research and reproducibility. While open-source models handle general image tasks…

Computer Vision and Pattern Recognition · Computer Science 2024-10-08 Geewook Kim, Minjoon Seo

Autoregressive models have emerged as a powerful approach for visual generation but suffer from slow inference speed due to their sequential token-by-token prediction process. In this paper, we propose a simple yet effective approach for…

Computer Vision and Pattern Recognition · Computer Science 2025-04-04 Yuqing Wang, Shuhuai Ren, Zhijie Lin, Yujin Han, Haoyuan Guo, Zhenheng Yang, Difan Zou, Jiashi Feng, Xihui Liu

The state of the art in human-centric computer vision achieves high accuracy and robustness across a diverse range of tasks. The most effective models in this domain have billions of parameters, thus requiring extremely large datasets,…

Computer Vision and Pattern Recognition · Computer Science 2025-07-22 Fatemeh Saleh, Sadegh Aliakbarian, Charlie Hewitt, Lohit Petikam, Xiao-Xian, Antonio Criminisi, Thomas J. Cashman, Tadas Baltrušaitis

Typical large vision-language models (LVLMs) apply autoregressive supervision solely to textual sequences, without fully incorporating the visual modality into the learning process. This results in three key limitations: (1) an inability to…

Computer Vision and Pattern Recognition · Computer Science 2026-01-06 Dianyi Wang, Wei Song, Yikun Wang, Siyuan Wang, Kaicheng Yu, Zhongyu Wei, Jiaqi Wang

Large Vision-Language Models (LVLMs) enable sophisticated reasoning over images and videos, yet their inference is hindered by a systemic efficiency barrier known as visual token dominance. This overhead is driven by a multi-regime…

Computation and Language · Computer Science 2026-04-15 Jun Zhang, Yicheng Ji, Feiyang Ren, Yihang Li, Bowen Zeng, Zonghao Chen, Ke Chen, Lidan Shou, Gang Chen, Huan Li

Although Large Vision Language Models (LVLMs) have demonstrated impressive multimodal reasoning capabilities, their scalability and deployment are constrained by massive computational requirements. In particular, the massive amount of…

Machine Learning · Computer Science 2026-04-14 Surendra Pathak, Bo Han

We present a model that can perform multiple vision tasks and can be adapted to other downstream tasks efficiently. Despite considerable progress in multi-task learning, most efforts focus on learning from multi-label data: a single image…

Computer Vision and Pattern Recognition · Computer Science 2023-06-30 Zitian Chen, Mingyu Ding, Yikang Shen, Wei Zhan, Masayoshi Tomizuka, Erik Learned-Miller, Chuang Gan

Most vision-language models (VLMs) apply a large language model (LLM) as the decoder, where the response tokens are generated sequentially through autoregression. Therefore, the number of output tokens can be the bottleneck of the…

Computer Vision and Pattern Recognition · Computer Science 2026-04-07 Sixun Dong, Juhua Hu, Steven Li, Wei Wen, Qi Qian

Visual perception in modern Vision-Language Models (VLMs) is constrained by a perceptual bandwidth bottleneck: a broad field of view preserves global context but sacrifices the fine-grained details required for complex reasoning. We argue…

Computer Vision and Pattern Recognition · Computer Science 2026-05-12 Anjie Liu, Ziqin Gong, Yan Song, Yuxiang Chen, Xiaolong Liu, Hengtong Lu, Kaike Zhang, Chen Wei, Jun Wang

While Large Multimodal Models (LMMs) have made significant progress, they remain largely text-centric, relying on language as their core reasoning modality. As a result, they are limited in their ability to handle reasoning tasks that are…

Computer Vision and Pattern Recognition · Computer Science 2025-12-25 Kelvin Li, Chuyi Shang, Leonid Karlinsky, Rogerio Feris, Trevor Darrell, Roei Herzig