Related papers: Kelix Technical Report

200 papers

Recently, the remarkable advance of the Large Language Model (LLM) has inspired researchers to transfer its extraordinary reasoning capability to both vision and language data. However, the prevailing approaches primarily regard the visual…

Computer Vision and Pattern Recognition · Computer Science 2024-03-25 Yang Jin , Kun Xu , Kun Xu , Liwei Chen , Chao Liao , Jianchao Tan , Quzhe Huang , Bin Chen , Chenyi Lei , An Liu , Chengru Song , Xiaoqiang Lei , Di Zhang , Wenwu Ou , Kun Gai , Yadong Mu

Large Language Models (LLMs), benefiting from the auto-regressive modelling approach performed on massive unannotated text corpora, demonstrate powerful perceptual and reasoning capabilities. However, as for extending auto-regressive…

Computer Vision and Pattern Recognition · Computer Science 2024-09-24 Tianshuo Peng , Zuchao Li , Lefei Zhang , Hai Zhao , Ping Wang , Bo Du

The efficiency of large language models (LLMs) is fundamentally limited by their sequential, token-by-token generation process. We argue that overcoming this bottleneck requires a new design axis for LLM scaling: increasing the semantic…

Computation and Language · Computer Science 2025-11-03 Chenze Shao , Darren Li , Fandong Meng , Jie Zhou

We present Liquid, an auto-regressive generation paradigm that seamlessly integrates visual comprehension and generation by tokenizing images into discrete codes and learning these code embeddings alongside text tokens within a shared…

Computer Vision and Pattern Recognition · Computer Science 2025-04-14 Junfeng Wu , Yi Jiang , Chuofan Ma , Yuliang Liu , Hengshuang Zhao , Zehuan Yuan , Song Bai , Xiang Bai

Multimodal large language models (MLLMs) extend the success of language models to visual understanding, and recent efforts have sought to build unified MLLMs that support both understanding and generation. However, constructing such models…

Computer Vision and Pattern Recognition · Computer Science 2025-10-03 Hanyu Wang , Jiaming Han , Ziyan Yang , Qi Zhao , Shanchuan Lin , Xiangyu Yue , Abhinav Shrivastava , Zhenheng Yang , Hao Chen

Autoregression in large language models (LLMs) has shown impressive scalability by unifying all language tasks into the next token prediction paradigm. Recently, there is a growing interest in extending this success to vision foundation…

Computer Vision and Pattern Recognition · Computer Science 2024-10-31 Shenghao Xie , Wenqiang Zu , Mingyang Zhao , Duo Su , Shilong Liu , Ruohua Shi , Guoqi Li , Shanghang Zhang , Lei Ma

Autoregressive models have demonstrated great performance in natural language processing (NLP) with impressive scalability, adaptability and generalizability. Inspired by their notable success in the NLP field, autoregressive models have been…

Computer Vision and Pattern Recognition · Computer Science 2024-11-19 Kai Jiang , Jiaxing Huang

As Vision-Language Models (VLMs) become increasingly sophisticated and widely used, it becomes more and more crucial to understand their decision-making process. Traditional explainability methods, designed for classification tasks,…

Computer Vision and Pattern Recognition · Computer Science 2026-03-09 Walid Bousselham , Angie Boggust , Hendrik Strobelt , Hilde Kuehne

Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in aligning visual inputs with natural language outputs. Yet, the extent to which generated tokens depend on visual modalities remains poorly understood,…

Computer Vision and Pattern Recognition · Computer Science 2026-03-19 Ruoyu Chen , Xiaoqing Guo , Kangwei Liu , Siyuan Liang , Shiming Liu , Qunli Zhang , Laiyuan Wang , Hua Zhang , Xiaochun Cao

In this work, we provide a systematic survey of Discrete Diffusion Language Models (dLLMs) and Discrete Diffusion Multimodal Language Models (dMLLMs). Unlike autoregressive (AR) models, dLLMs and dMLLMs adopt a multi-token, parallel…

Machine Learning · Computer Science 2025-09-22 Runpeng Yu , Qi Li , Xinchao Wang

Recent endeavors in Multimodal Large Language Models (MLLMs) aim to unify visual comprehension and generation by combining LLM and diffusion models, the state-of-the-art in each task, respectively. Existing approaches rely on spatial visual…

Computer Vision and Pattern Recognition · Computer Science 2025-04-22 Kaihang Pan , Wang Lin , Zhongqi Yue , Tenglong Ao , Liyu Jia , Wei Zhao , Juncheng Li , Siliang Tang , Hanwang Zhang

Conventional Vision-Language Models (VLMs) typically utilize a fixed number of vision tokens, regardless of task complexity. This one-size-fits-all strategy introduces notable inefficiencies: using excessive tokens leads to unnecessary…

Computer Vision and Pattern Recognition · Computer Science 2025-04-07 Junshan Hu , Jialiang Mao , Zhikang Liu , Zhongpu Xia , Peng Jia , Xianpeng Lang

Vision Language Models (VLMs) have achieved remarkable success by integrating visual encoders with large language models (LLMs). While VLMs process dense image tokens across deep transformer stacks (incurring substantial computational…

Computer Vision and Pattern Recognition · Computer Science 2026-04-13 Sambit Ghosh , R. Venkatesh Babu , Chirag Agarwal

This research introduces a transformative framework for integrating Vision-Enhanced Large Language Models (LLMs) with advanced transformer-based architectures to tackle challenges in high-resolution image synthesis and multimodal data…

Computer Vision and Pattern Recognition · Computer Science 2026-01-06 Karthikeya KV

Training general-purpose vision models on purely sequential visual data, eschewing linguistic inputs, has heralded a new frontier in visual understanding. These models are intended to not only comprehend but also seamlessly transit to…

Computer Vision and Pattern Recognition · Computer Science 2024-06-07 Jianyuan Guo , Zhiwei Hao , Chengcheng Wang , Yehui Tang , Han Wu , Han Hu , Kai Han , Chang Xu

The application of Large Vision-Language Models (LVLMs) for analyzing images and videos is an exciting and rapidly evolving field. In recent years, we've seen significant growth in high-quality image-text datasets for fine-tuning image…

Computer Vision and Pattern Recognition · Computer Science 2024-12-13 Han Wang , Yuxiang Nie , Yongjie Ye , Deng GuanYu , Yanjie Wang , Shuai Li , Haiyang Yu , Jinghui Lu , Can Huang

We present UniFluid, a unified autoregressive framework for joint visual generation and understanding leveraging continuous visual tokens. Our unified autoregressive architecture processes multimodal image and text inputs, generating…

Computer Vision and Pattern Recognition · Computer Science 2025-03-18 Lijie Fan , Luming Tang , Siyang Qin , Tianhong Li , Xuan Yang , Siyuan Qiao , Andreas Steiner , Chen Sun , Yuanzhen Li , Tao Zhu , Michael Rubinstein , Michalis Raptis , Deqing Sun , Radu Soricut

Following the recent popularity of Large Language Models (LLMs), several attempts have been made to extend them to the visual domain. From having a visual assistant that could guide us through unfamiliar environments to generative models…

This paper presents a multimodal framework that attempts to unify visual understanding and generation within a shared discrete semantic representation. At its core is the Text-Aligned Tokenizer (TA-Tok), which converts images into discrete…

Computer Vision and Pattern Recognition · Computer Science 2025-06-24 Jiaming Han , Hao Chen , Yang Zhao , Hanyu Wang , Qi Zhao , Ziyan Yang , Hao He , Xiangyu Yue , Lu Jiang

Recent progress in large models has led to significant advances in unified multimodal generation and understanding. However, the development of models that unify motion-language generation and understanding remains largely underexplored.…

Computer Vision and Pattern Recognition · Computer Science 2026-04-20 Zekun Li , Sizhe An , Chengcheng Tang , Chuan Guo , Ivan Shugurov , Linguang Zhang , Amy Zhao , Srinath Sridhar , Lingling Tao , Abhay Mittal