Related papers: High Efficiency Image Compression for Large Visual…

FCoT-VL: Advancing Text-oriented Large Vision-Language Models with Efficient Visual Token Compression

The rapid success of Vision Large Language Models (VLLMs) often depends on high-resolution images with abundant visual tokens, which hinders training and deployment efficiency. Current training-free visual token compression methods…

Computer Vision and Pattern Recognition · Computer Science 2025-02-27 Jianjian Li, Junquan Fan, Feng Tang, Gang Huang, Shitao Zhu, Songlin Liu, Nian Xie, Wulong Liu, Yong Liao

Bridging Compressed Image Latents and Multimodal Large Language Models

This paper presents the first-ever study of adapting compressed image latents to suit the needs of downstream vision tasks that adopt Multimodal Large Language Models (MLLMs). MLLMs have extended the success of large language models to…

Computer Vision and Pattern Recognition · Computer Science 2025-02-18 Chia-Hao Kao, Cheng Chien, Yu-Jen Tseng, Yi-Hsin Chen, Alessandro Gnutti, Shao-Yuan Lo, Wen-Hsiao Peng, Riccardo Leonardi

Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM

The application of Large Vision-Language Models (LVLMs) for analyzing images and videos is an exciting and rapidly evolving field. In recent years, we've seen significant growth in high-quality image-text datasets for fine-tuning image…

Computer Vision and Pattern Recognition · Computer Science 2024-12-13 Han Wang, Yuxiang Nie, Yongjie Ye, Deng GuanYu, Yanjie Wang, Shuai Li, Haiyang Yu, Jinghui Lu, Can Huang

Benchmarking and Enhancing VLM for Compressed Image Understanding

With the rapid development of Vision-Language Models (VLMs) and the growing demand for their applications, efficient compression of the image inputs has become increasingly important. Existing VLMs predominantly digest and understand…

Computer Vision and Pattern Recognition · Computer Science 2026-05-25 Zifu Zhang, Tongda Xu, Siqi Li, Shengxi Li, Yue Zhang, Mai Xu, Yan Wang

Prompt-Guided Prefiltering for VLM Image Compression

The rapid progress of large Vision-Language Models (VLMs) has enabled a wide range of applications, such as image understanding and Visual Question Answering (VQA). Query images are often uploaded to the cloud, where VLMs are typically…

Image and Video Processing · Electrical Eng. & Systems 2026-04-02 Bardia Azizian, Ivan V. Bajic

Video Coding for Machine: Compact Visual Representation Compression for Intelligent Collaborative Analytics

Video Coding for Machines (VCM) aims to bridge the somewhat separate research tracks of video/image compression and feature compression, and attempts to optimize compactness and efficiency jointly from a unified perspective of…

Computer Vision and Pattern Recognition · Computer Science 2021-10-19 Wenhan Yang, Haofeng Huang, Yueyu Hu, Ling-Yu Duan, Jiaying Liu

VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs

Multimodal Large Language Models (MLLMs) encounter significant computational and memory bottlenecks from the massive number of visual tokens generated by high-resolution images or multi-image inputs. Previous token compression techniques…

Computer Vision and Pattern Recognition · Computer Science 2025-10-21 Jiaying Zhu, Yurui Zhu, Xin Lu, Wenrui Yan, Dong Li, Kunlin Liu, Xueyang Fu, Zheng-Jun Zha

Vision-Enhanced Large Language Models for High-Resolution Image Synthesis and Multimodal Data Interpretation

This research introduces a transformative framework for integrating Vision-Enhanced Large Language Models (LLMs) with advanced transformer-based architectures to tackle challenges in high-resolution image synthesis and multimodal data…

Computer Vision and Pattern Recognition · Computer Science 2026-01-06 Karthikeya KV

Adaptive-VoCo: Complexity-Aware Visual Token Compression for Vision-Language Models

In recent years, large-scale vision-language models (VLMs) have demonstrated remarkable performance on multimodal understanding and reasoning tasks. However, handling high-dimensional visual features often incurs substantial computational…

Computer Vision and Pattern Recognition · Computer Science 2025-12-23 Xiaoyang Guo, Keze Wang

Towards Efficient Large Vision-Language Models: A Comprehensive Survey on Inference Strategies

Although Large Vision Language Models (LVLMs) have demonstrated impressive multimodal reasoning capabilities, their scalability and deployment are constrained by massive computational requirements. In particular, the massive amount of…

Machine Learning · Computer Science 2026-04-14 Surendra Pathak, Bo Han

LaCo: Efficient Layer-wise Compression of Visual Tokens for Multimodal Large Language Models

Existing visual token compression methods for Multimodal Large Language Models (MLLMs) predominantly operate as post-encoder modules, limiting their potential for efficiency gains. To address this limitation, we propose LaCo (Layer-wise…

Computer Vision and Pattern Recognition · Computer Science 2025-07-04 Juntao Liu, Liqiang Niu, Wenchao Chen, Jie Zhou, Fandong Meng

A Preprocessing Framework for Video Machine Vision under Compression

There has been a growing trend in compressing and transmitting videos from terminals for machine vision tasks. Nevertheless, most video coding optimization methods focus on minimizing distortion according to human perceptual metrics,…

Multimedia · Computer Science 2025-12-18 Fei Zhao, Mengxi Guo, Shijie Zhao, Junlin Li, Li Zhang, Xiaodong Xie

CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding

Large Language Models (LLMs) have achieved remarkable success in source code understanding, yet as software systems grow in scale, computational efficiency has become a critical bottleneck. Currently, these models rely on a text-based…

Computation and Language · Computer Science 2026-04-29 Yuling Shi, Chaoxiang Xie, Zhensu Sun, Yeheng Chen, Chenxu Zhang, Longfei Yun, Chengcheng Wan, Hongyu Zhang, David Lo, Xiaodong Gu

Variation-aware Vision Token Dropping for Faster Large Vision-Language Models

Large vision-language models (LVLMs) have demonstrated remarkable capabilities in multimodal understanding tasks. However, the increasing demand for high-resolution image and long-video understanding results in substantial token counts,…

Computer Vision and Pattern Recognition · Computer Science 2026-02-26 Junjie Chen, Xuyang Liu, Zichen Wen, Yiyu Wang, Siteng Huang, Honggang Chen

Variable Rate Video Compression using a Hybrid Recurrent Convolutional Learning Framework

In recent years, neural network-based image compression techniques have been able to outperform traditional codecs and have opened the gates for the development of learning-based video codecs. However, to take advantage of the high temporal…

Image and Video Processing · Electrical Eng. & Systems 2020-08-25 Aishwarya Jadhav

LensVLM: Selective Context Expansion for Compressed Visual Representation of Text

Vision Language Models (VLMs) offer the exciting possibility of processing text as rendered images, bypassing the need for tokenizing the text into long token sequences. Since VLM image encoders map fixed-size images to a fixed number of…

Computer Vision and Pattern Recognition · Computer Science 2026-05-11 Roy Xie, Dan Friedman, Donghan Yu, Bowen Pan, Christopher Fifty, Jang-Hyun Kim, Xianzhi Du, Zhe Gan, Vivek Rathod, Bhuwan Dhingra

One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding

Long video understanding is inherently challenging for vision-language models (VLMs) because of the extensive number of frames. With each video frame typically expanding into tens or hundreds of tokens, the limited context length of large…

Computer Vision and Pattern Recognition · Computer Science 2026-04-17 Zheyu Zhang, Ziqi Pang, Shixing Chen, Xiang Hao, Vimal Bhat, Yu-Xiong Wang

Variable Rate Image Compression with Recurrent Neural Networks

A large fraction of Internet traffic is now driven by requests from mobile devices with relatively small screens and often stringent bandwidth requirements. Due to these factors, it has become the norm for modern graphics-heavy websites to…

Computer Vision and Pattern Recognition · Computer Science 2016-03-03 George Toderici, Sean M. O'Malley, Sung Jin Hwang, Damien Vincent, David Minnen, Shumeet Baluja, Michele Covell, Rahul Sukthankar

NVILA: Efficient Frontier Visual Language Models

Visual language models (VLMs) have made significant advances in accuracy in recent years. However, their efficiency has received much less attention. This paper introduces NVILA, a family of open VLMs designed to jointly optimize efficiency…

Computer Vision and Pattern Recognition · Computer Science 2026-04-28 Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, Xiuyu Li, Yunhao Fang, Yukang Chen, Cheng-Yu Hsieh, De-An Huang, An-Chieh Cheng, Vishwesh Nath, Jinyi Hu, Sifei Liu, Ranjay Krishna, Daguang Xu, Xiaolong Wang, Pavlo Molchanov, Jan Kautz, Hongxu Yin, Song Han, Yao Lu

Large Language Model for Lossless Image Compression with Visual Prompts

Recent advancements in deep learning have driven significant progress in lossless image compression. With the emergence of Large Language Models (LLMs), preliminary attempts have been made to leverage the extensive prior knowledge embedded…

Image and Video Processing · Electrical Eng. & Systems 2025-02-25 Junhao Du, Chuqin Zhou, Ning Cao, Gang Chen, Yunuo Chen, Zhengxue Cheng, Li Song, Guo Lu, Wenjun Zhang