Related papers: Tokenizing Semantic Segmentation with Run Length E…

Text4Seg: Reimagining Image Segmentation as Text Generation

Multimodal Large Language Models (MLLMs) have shown exceptional capabilities in vision-language tasks; however, effectively integrating image segmentation into these models remains a significant challenge. In this paper, we introduce…

Computer Vision and Pattern Recognition · Computer Science · 2025-02-18 · Mengcheng Lan, Chaofeng Chen, Yue Zhou, Jiaxing Xu, Yiping Ke, Xinjiang Wang, Litong Feng, Wayne Zhang

Text4Seg++: Advancing Image Segmentation via Generative Language Modeling

Multimodal Large Language Models (MLLMs) have shown exceptional capabilities in vision-language tasks. However, effectively integrating image segmentation into these models remains a significant challenge. In this work, we propose a novel…

Computer Vision and Pattern Recognition · Computer Science · 2025-09-09 · Mengcheng Lan, Chaofeng Chen, Jiaxing Xu, Zongrui Li, Yiping Ke, Xudong Jiang, Yingchen Yu, Yunqing Zhao, Song Bai

SemToken: Semantic-Aware Tokenization for Efficient Long-Context Language Modeling

Tokenization plays a critical role in language modeling, yet existing approaches such as Byte-Pair Encoding (BPE) or WordPiece operate purely on frequency statistics, ignoring the underlying semantic structure of text. This leads to…

Computation and Language · Computer Science · 2025-08-22 · Dong Liu, Yanxuan Yu

Rethinking MLLM Itself as a Segmenter with a Single Segmentation Token

Recent segmentation methods leveraging Multi-modal Large Language Models (MLLMs) have shown reliable object-level segmentation and enhanced spatial perception. However, almost all previous methods predominantly rely on specialist mask…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-20 · Anqi Zhang, Xiaokang Ji, Guangyu Gao, Jianbo Jiao, Chi Harold Liu, Yunchao Wei

Measure Twice, Cut Once: A Semantic-Oriented Approach to Video Temporal Localization with Video LLMs

Temporally localizing user-queried events through natural language is a crucial capability for video models. Recent methods predominantly adapt video LLMs to generate event boundary timestamps for temporal localization tasks, which struggle…

Computer Vision and Pattern Recognition · Computer Science · 2026-02-17 · Zongshang Pang, Mayu Otani, Yuta Nakashima

One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos

We introduce VideoLISA, a video-based multimodal large language model designed to tackle the problem of language-instructed reasoning segmentation in videos. Leveraging the reasoning capabilities and world knowledge of large language…

Computer Vision and Pattern Recognition · Computer Science · 2024-10-01 · Zechen Bai, Tong He, Haiyang Mei, Pichao Wang, Ziteng Gao, Joya Chen, Lei Liu, Zheng Zhang, Mike Zheng Shou

SMITE: Segment Me In TimE

Segmenting an object in a video presents significant challenges. Each pixel must be accurately labelled, and these labels must remain consistent across frames. The difficulty increases when the segmentation is with arbitrary granularity,…

Computer Vision and Pattern Recognition · Computer Science · 2025-02-20 · Amirhossein Alimohammadi, Sauradip Nag, Saeid Asgari Taghanaki, Andrea Tagliasacchi, Ghassan Hamarneh, Ali Mahdavi Amiri

SAGE: Segment-Aware Gloss-Free Encoding for Token-Efficient Sign Language Translation

Gloss-free Sign Language Translation (SLT) has advanced rapidly, achieving strong performances without relying on gloss annotations. However, these gains have often come with increased model complexity and high computational demands,…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-29 · JianHe Low, Ozge Mercanoglu Sincan, Richard Bowden

Neural Token Segmentation for High Token-Internal Complexity

Tokenizing raw texts into word units is an essential pre-processing step for critical tasks in the NLP pipeline such as tagging, parsing, named entity recognition, and more. For most languages, this tokenization step is straightforward.…

Computation and Language · Computer Science · 2022-03-22 · Idan Brusilovsky, Reut Tsarfaty

RepCodec: A Speech Representation Codec for Speech Tokenization

With recent rapid growth of large language models (LLMs), discrete speech tokenization has played an important role for injecting speech into LLMs. However, this discretization gives rise to a loss of information, consequently impairing…

Audio and Speech Processing · Electrical Eng. & Systems · 2024-07-23 · Zhichao Huang, Chutong Meng, Tom Ko

Masked Motion Encoding for Self-Supervised Video Representation Learning

How to learn discriminative video representation from unlabeled videos is challenging but crucial for video analysis. The latest attempts seek to learn a representation model by predicting the appearance contents in the masked regions.…

Computer Vision and Pattern Recognition · Computer Science · 2023-03-24 · Xinyu Sun, Peihao Chen, Liangwei Chen, Changhao Li, Thomas H. Li, Mingkui Tan, Chuang Gan

Semantic Source Code Segmentation using Small and Large Language Models

Source code segmentation, dividing code into functionally coherent segments, is crucial for knowledge retrieval and maintenance in software development. While enabling efficient navigation and comprehension of large codebases, manual and…

Software Engineering · Computer Science · 2025-07-15 · Abdelhalim Dahou, Ansgar Scherp, Sebastian Kurten, Brigitte Mathiak, Madhu Chauhan

MTLE: A Multitask Learning Encoder of Visual Feature Representations for Video and Movie Description

Learning visual feature representations for video analysis is a daunting task that requires a large amount of training samples and a proper generalization framework. Many of the current state of the art methods for video captioning and…

Machine Learning · Computer Science · 2018-09-20 · Oliver Nina, Washington Garcia, Scott Clouse, Alper Yilmaz

Semantic Segmentation In-the-Wild Without Seeing Any Segmentation Examples

Semantic segmentation is a key computer vision task that has been actively researched for decades. In recent years, supervised methods have reached unprecedented accuracy; however, they require many pixel-level annotations for every new…

Computer Vision and Pattern Recognition · Computer Science · 2021-12-07 · Nir Zabari, Yedid Hoshen

RTSeg: Real-time Semantic Segmentation Comparative Study

Semantic segmentation benefits robotics-related applications, especially autonomous driving. Most research on semantic segmentation focuses only on increasing the accuracy of segmentation models, with little attention to computationally…

Computer Vision and Pattern Recognition · Computer Science · 2020-05-19 · Mennatullah Siam, Mostafa Gamal, Moemen Abdel-Razek, Senthil Yogamani, Martin Jagersand

Synchronizing Vision and Language: Bidirectional Token-Masking AutoEncoder for Referring Image Segmentation

Referring Image Segmentation (RIS) aims to segment target objects expressed in natural language within a scene at the pixel level. Various recent RIS models have achieved state-of-the-art performance by generating contextual tokens to model…

Computer Vision and Pattern Recognition · Computer Science · 2023-12-01 · Minhyeok Lee, Dogyoon Lee, Jungho Lee, Suhwan Cho, Heeseung Choi, Ig-Jae Kim, Sangyoun Lee

Propagating Semantic Labels in Video Data

Semantic Segmentation combines two sub-tasks: the identification of pixel-level image masks and the application of semantic labels to those masks. Recently, so-called Foundation Models have been introduced; general models trained on very…

Computer Vision and Pattern Recognition · Computer Science · 2023-10-03 · David Balaban, Justin Medich, Pranay Gosar, Justin Hart

Selective Run-Length Encoding

Run-Length Encoding (RLE) is one of the most fundamental tools in data compression. However, its compression power drops significantly if the sequence lacks consecutive elements. In extreme cases, the output of the encoder may…

Data Structures and Algorithms · Computer Science · 2023-12-29 · Xutan Peng, Yi Zhang, Dejia Peng, Jiafa Zhu

VQToken: Neural Discrete Token Representation Learning for Extreme Token Reduction in Video Large Language Models

Token-based video representation has emerged as a promising approach for enabling large language models (LLMs) to interpret video content. However, existing token reduction techniques, such as pruning and merging, often disrupt essential…

Computer Vision and Pattern Recognition · Computer Science · 2025-09-30 · Haichao Zhang, Yun Fu

Generalizable Entity Grounding via Assistance of Large Language Model

In this work, we propose a novel approach to densely ground visual entities from a long caption. We leverage a large multimodal model (LMM) to extract semantic nouns, a class-agnostic segmentation model to generate entity-level…

Computer Vision and Pattern Recognition · Computer Science · 2024-02-07 · Lu Qi, Yi-Wen Chen, Lehan Yang, Tiancheng Shen, Xiangtai Li, Weidong Guo, Yu Xu, Ming-Hsuan Yang