Related papers: Kelix Technical Report

Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization

Recently, the remarkable advance of the Large Language Model (LLM) has inspired researchers to transfer its extraordinary reasoning capability to both vision and language data. However, the prevailing approaches primarily regard the visual…

Computer Vision and Pattern Recognition · Computer Science 2024-03-25 Yang Jin, Kun Xu, Kun Xu, Liwei Chen, Chao Liao, Jianchao Tan, Quzhe Huang, Bin Chen, Chenyi Lei, An Liu, Chengru Song, Xiaoqiang Lei, Di Zhang, Wenwu Ou, Kun Gai, Yadong Mu

Multi-modal Auto-regressive Modeling via Visual Words

Large Language Models (LLMs), benefiting from the auto-regressive modelling approach performed on massive unannotated text corpora, demonstrate powerful perceptual and reasoning capabilities. However, as for extending auto-regressive…

Computer Vision and Pattern Recognition · Computer Science 2024-09-24 Tianshuo Peng, Zuchao Li, Lefei Zhang, Hai Zhao, Ping Wang, Bo Du

Continuous Autoregressive Language Models

The efficiency of large language models (LLMs) is fundamentally limited by their sequential, token-by-token generation process. We argue that overcoming this bottleneck requires a new design axis for LLM scaling: increasing the semantic…

Computation and Language · Computer Science 2025-11-03 Chenze Shao, Darren Li, Fandong Meng, Jie Zhou

Liquid: Language Models are Scalable and Unified Multi-modal Generators

We present Liquid, an auto-regressive generation paradigm that seamlessly integrates visual comprehension and generation by tokenizing images into discrete codes and learning these code embeddings alongside text tokens within a shared…

Computer Vision and Pattern Recognition · Computer Science 2025-04-14 Junfeng Wu, Yi Jiang, Chuofan Ma, Yuliang Liu, Hengshuang Zhao, Zehuan Yuan, Song Bai, Xiang Bai

Growing Visual Generative Capacity for Pre-Trained MLLMs

Multimodal large language models (MLLMs) extend the success of language models to visual understanding, and recent efforts have sought to build unified MLLMs that support both understanding and generation. However, constructing such models…

Computer Vision and Pattern Recognition · Computer Science 2025-10-03 Hanyu Wang, Jiaming Han, Ziyan Yang, Qi Zhao, Shanchuan Lin, Xiangyu Yue, Abhinav Shrivastava, Zhenheng Yang, Hao Chen

Towards Unifying Understanding and Generation in the Era of Vision Foundation Models: A Survey from the Autoregression Perspective

Autoregression in large language models (LLMs) has shown impressive scalability by unifying all language tasks into the next token prediction paradigm. Recently, there is a growing interest in extending this success to vision foundation…

Computer Vision and Pattern Recognition · Computer Science 2024-10-31 Shenghao Xie, Wenqiang Zu, Mingyang Zhao, Duo Su, Shilong Liu, Ruohua Shi, Guoqi Li, Shanghang Zhang, Lei Ma

A Survey on Vision Autoregressive Model

Autoregressive models have demonstrated great performance in natural language processing (NLP) with impressive scalability, adaptability and generalizability. Inspired by their notable success in the NLP field, autoregressive models have been…

Computer Vision and Pattern Recognition · Computer Science 2024-11-19 Kai Jiang, Jiaxing Huang

DEX-AR: A Dynamic Explainability Method for Autoregressive Vision-Language Models

As Vision-Language Models (VLMs) become increasingly sophisticated and widely used, it becomes more and more crucial to understand their decision-making process. Traditional explainability methods, designed for classification tasks,…

Computer Vision and Pattern Recognition · Computer Science 2026-03-09 Walid Bousselham, Angie Boggust, Hendrik Strobelt, Hilde Kuehne

Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation

Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in aligning visual inputs with natural language outputs. Yet, the extent to which generated tokens depend on visual modalities remains poorly understood,…

Computer Vision and Pattern Recognition · Computer Science 2026-03-19 Ruoyu Chen, Xiaoqing Guo, Kangwei Liu, Siyuan Liang, Shiming Liu, Qunli Zhang, Laiyuan Wang, Hua Zhang, Xiaochun Cao

Discrete Diffusion in Large Language and Multimodal Models: A Survey

In this work, we provide a systematic survey of Discrete Diffusion Language Models (dLLMs) and Discrete Diffusion Multimodal Language Models (dMLLMs). Unlike autoregressive (AR) models, dLLMs and dMLLMs adopt a multi-token, parallel…

Machine Learning · Computer Science 2025-09-22 Runpeng Yu, Qi Li, Xinchao Wang

Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens

Recent endeavors in Multimodal Large Language Models (MLLMs) aim to unify visual comprehension and generation by combining LLM and diffusion models, the state-of-the-art in each task, respectively. Existing approaches rely on spatial visual…

Computer Vision and Pattern Recognition · Computer Science 2025-04-22 Kaihang Pan, Wang Lin, Zhongqi Yue, Tenglong Ao, Liyu Jia, Wei Zhao, Juncheng Li, Siliang Tang, Hanwang Zhang

TokenFLEX: Unified VLM Training for Flexible Visual Tokens Inference

Conventional Vision-Language Models (VLMs) typically utilize a fixed number of vision tokens, regardless of task complexity. This one-size-fits-all strategy introduces notable inefficiencies: using excessive tokens leads to unnecessary…

Computer Vision and Pattern Recognition · Computer Science 2025-04-07 Junshan Hu, Jialiang Mao, Zhikang Liu, Zhongpu Xia, Peng Jia, Xianpeng Lang

Do Vision Language Models Need to Process Image Tokens?

Vision Language Models (VLMs) have achieved remarkable success by integrating visual encoders with large language models (LLMs). While VLMs process dense image tokens across deep transformer stacks (incurring substantial computational…

Computer Vision and Pattern Recognition · Computer Science 2026-04-13 Sambit Ghosh, R. Venkatesh Babu, Chirag Agarwal

Vision-Enhanced Large Language Models for High-Resolution Image Synthesis and Multimodal Data Interpretation

This research introduces a transformative framework for integrating Vision-Enhanced Large Language Models (LLMs) with advanced transformer-based architectures to tackle challenges in high-resolution image synthesis and multimodal data…

Computer Vision and Pattern Recognition · Computer Science 2026-01-06 Karthikeya KV

Data-efficient Large Vision Models through Sequential Autoregression

Training general-purpose vision models on purely sequential visual data, eschewing linguistic inputs, has heralded a new frontier in visual understanding. These models are intended to not only comprehend but also seamlessly transit to…

Computer Vision and Pattern Recognition · Computer Science 2024-06-07 Jianyuan Guo, Zhiwei Hao, Chengcheng Wang, Yehui Tang, Han Wu, Han Hu, Kai Han, Chang Xu

Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM

The application of Large Vision-Language Models (LVLMs) for analyzing images and videos is an exciting and rapidly evolving field. In recent years, we've seen significant growth in high-quality image-text datasets for fine-tuning image…

Computer Vision and Pattern Recognition · Computer Science 2024-12-13 Han Wang, Yuxiang Nie, Yongjie Ye, Deng GuanYu, Yanjie Wang, Shuai Li, Haiyang Yu, Jinghui Lu, Can Huang

Unified Autoregressive Visual Generation and Understanding with Continuous Tokens

We present UniFluid, a unified autoregressive framework for joint visual generation and understanding leveraging continuous visual tokens. Our unified autoregressive architecture processes multimodal image and text inputs, generating…

Computer Vision and Pattern Recognition · Computer Science 2025-03-18 Lijie Fan, Luming Tang, Siyang Qin, Tianhong Li, Xuan Yang, Siyuan Qiao, Andreas Steiner, Chen Sun, Yuanzhen Li, Tao Zhu, Michael Rubinstein, Michalis Raptis, Deqing Sun, Radu Soricut

An Introduction to Vision-Language Modeling

Following the recent popularity of Large Language Models (LLMs), several attempts have been made to extend them to the visual domain. From having a visual assistant that could guide us through unfamiliar environments to generative models…

Machine Learning · Computer Science 2024-05-28 Florian Bordes, Richard Yuanzhe Pang, Anurag Ajay, Alexander C. Li, Adrien Bardes, Suzanne Petryk, Oscar Mañas, Zhiqiu Lin, Anas Mahmoud, Bargav Jayaraman, Mark Ibrahim, Melissa Hall, Yunyang Xiong, Jonathan Lebensold, Candace Ross, Srihari Jayakumar, Chuan Guo, Diane Bouchacourt, Haider Al-Tahan, Karthik Padthe, Vasu Sharma, Hu Xu, Xiaoqing Ellen Tan, Megan Richards, Samuel Lavoie, Pietro Astolfi, Reyhane Askari Hemmat, Jun Chen, Kushal Tirumala, Rim Assouel, Mazda Moayeri, Arjang Talattof, Kamalika Chaudhuri, Zechun Liu, Xilun Chen, Quentin Garrido, Karen Ullrich, Aishwarya Agrawal, Kate Saenko, Asli Celikyilmaz, Vikas Chandra

Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations

This paper presents a multimodal framework that attempts to unify visual understanding and generation within a shared discrete semantic representation. At its core is the Text-Aligned Tokenizer (TA-Tok), which converts images into discrete…

Computer Vision and Pattern Recognition · Computer Science 2025-06-24 Jiaming Han, Hao Chen, Yang Zhao, Hanyu Wang, Qi Zhao, Ziyan Yang, Hao He, Xiangyu Yue, Lu Jiang

LLaMo: Scaling Pretrained Language Models for Unified Motion Understanding and Generation with Continuous Autoregressive Tokens

Recent progress in large models has led to significant advances in unified multimodal generation and understanding. However, the development of models that unify motion-language generation and understanding remains largely underexplored.…

Computer Vision and Pattern Recognition · Computer Science 2026-04-20 Zekun Li, Sizhe An, Chengcheng Tang, Chuan Guo, Ivan Shugurov, Linguang Zhang, Amy Zhao, Srinath Sridhar, Lingling Tao, Abhay Mittal