Related papers: Obj2Seq: Formatting Objects as Sequences with Clas…

Pix2seq: A Language Modeling Framework for Object Detection

We present Pix2Seq, a simple and generic framework for object detection. Unlike existing approaches that explicitly integrate prior knowledge about the task, we cast object detection as a language modeling task conditioned on the observed…

Computer Vision and Pattern Recognition · Computer Science · 2022-03-29 · Ting Chen, Saurabh Saxena, Lala Li, David J. Fleet, Geoffrey Hinton

Point2Seq: Detecting 3D Objects as Sequences

We present a simple and effective framework, named Point2Seq, for 3D object detection from point clouds. In contrast to previous methods that normally predict attributes of 3D objects all at once, we expressively model the…

Computer Vision and Pattern Recognition · Computer Science · 2022-03-28 · Yujing Xue, Jiageng Mao, Minzhe Niu, Hang Xu, Michael Bi Mi, Wei Zhang, Xiaogang Wang, Xinchao Wang

Form2Seq: A Framework for Higher-Order Form Structure Extraction

Document structure extraction has been a widely researched area for decades, with recent works performing it as a semantic segmentation task over document images using fully convolutional networks. Such methods are limited by image resolution…

Machine Learning · Computer Science · 2021-07-12 · Milan Aggarwal, Hiresh Gupta, Mausoom Sarkar, Balaji Krishnamurthy

InstructSeq: Unifying Vision Tasks with Instruction-conditioned Multi-modal Sequence Generation

Empowering models to dynamically accomplish tasks specified through natural language instructions represents a promising path toward more capable and general artificial intelligence. In this work, we introduce InstructSeq, an…

Computer Vision and Pattern Recognition · Computer Science · 2023-12-01 · Rongyao Fang, Shilin Yan, Zhaoyang Huang, Jingqiu Zhou, Hao Tian, Jifeng Dai, Hongsheng Li

OBJ2TEXT: Generating Visually Descriptive Language from Object Layouts

Generating captions for images is a task that has recently received considerable attention. In this work we focus on caption generation for abstract scenes, or object layouts where the only information provided is a set of objects and their…

Computer Vision and Pattern Recognition · Computer Science · 2017-07-25 · Xuwang Yin, Vicente Ordonez

Ord2Seq: Regarding Ordinal Regression as Label Sequence Prediction

Ordinal regression refers to classifying object instances into ordinal categories. It has been widely studied in many scenarios, such as medical disease grading, movie rating, etc. Known methods focused only on learning inter-class ordinal…

Artificial Intelligence · Computer Science · 2023-07-24 · Jinhong Wang, Yi Cheng, Jintai Chen, Tingting Chen, Danny Chen, Jian Wu

Temporal2Seq: A Unified Framework for Temporal Video Understanding Tasks

With the development of video understanding, there is a proliferation of tasks for clip-level temporal video analysis, including temporal action detection (TAD), temporal action segmentation (TAS), and generic event boundary detection…

Computer Vision and Pattern Recognition · Computer Science · 2024-09-30 · Min Yang, Zichen Zhang, Limin Wang

Improving Token-based Object Detection with Video

This paper improves upon the Pix2Seq object detector by extending it for videos. In the process, it introduces a new way to perform end-to-end video object detection that improves upon existing video detectors in two key ways. First, by…

Computer Vision and Pattern Recognition · Computer Science · 2025-08-21 · Abhineet Singh, Nilanjan Ray

Unified Sequence-to-Sequence Learning for Single- and Multi-Modal Visual Object Tracking

In this paper, we introduce a new sequence-to-sequence learning framework for RGB-based and multi-modal object tracking. First, we present SeqTrack for RGB-based tracking. It casts visual tracking as a sequence generation task, forecasting…

Computer Vision and Pattern Recognition · Computer Science · 2024-03-28 · Xin Chen, Ben Kang, Jiawen Zhu, Dong Wang, Houwen Peng, Huchuan Lu

A Unified Sequence Interface for Vision Tasks

While language tasks are naturally expressed in a single, unified, modeling framework, i.e., generating sequences of tokens, this has not been the case in computer vision. As a result, there is a proliferation of distinct architectures and…

Computer Vision and Pattern Recognition · Computer Science · 2022-10-18 · Ting Chen, Saurabh Saxena, Lala Li, Tsung-Yi Lin, David J. Fleet, Geoffrey Hinton

Object-Centric Framework for Video Moment Retrieval

Most existing video moment retrieval methods rely on temporal sequences of frame- or clip-level features that primarily encode global visual and semantic information. However, such representations often fail to capture fine-grained object…

Computer Vision and Pattern Recognition · Computer Science · 2025-12-23 · Zongyao Li, Yongkang Wong, Satoshi Yamazaki, Jianquan Liu, Mohan Kankanhalli

Lane2Seq: Towards Unified Lane Detection via Sequence Generation

In this paper, we present a novel sequence generation-based framework for lane detection, called Lane2Seq. It unifies various lane detection formats by casting lane detection as a sequence generation task. This is different from previous…

Computer Vision and Pattern Recognition · Computer Science · 2024-02-28 · Kunyang Zhou

Revisiting Sequence-to-Sequence Video Object Segmentation with Multi-Task Loss and Skip-Memory

Video Object Segmentation (VOS) is an active research area of the visual domain. One of its fundamental sub-tasks is semi-supervised / one-shot learning: given only the segmentation mask for the first frame, the task is to provide…

Computer Vision and Pattern Recognition · Computer Science · 2020-04-28 · Fatemeh Azimi, Benjamin Bischke, Sebastian Palacio, Federico Raue, Joern Hees, Andreas Dengel

Multi-Task Consistency for Active Learning

Learning-based solutions for vision tasks require a large amount of labeled training data to ensure their performance and reliability. In single-task vision-based settings, inconsistency-based active learning has proven to be effective in…

Computer Vision and Pattern Recognition · Computer Science · 2023-06-22 · Aral Hekimoglu, Philipp Friedrich, Walter Zimmer, Michael Schmidt, Alvaro Marcos-Ramiro, Alois C. Knoll

2nd Place Report of MOSEv2 Challenge 2025: Concept Guided Video Object Segmentation via SeC

Semi-supervised Video Object Segmentation aims to segment a specified target throughout a video sequence, initialized by a first-frame mask. Previous methods rely heavily on appearance-based pattern matching and thus exhibit limited…

Computer Vision and Pattern Recognition · Computer Science · 2025-09-30 · Zhixiong Zhang, Shuangrui Ding, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jiaqi Wang

Object-QA: Towards High Reliable Object Quality Assessment

In object recognition applications, object images usually appear with different quality levels. Practically, it is very important to indicate object image qualities for better application performance, e.g. filtering out low-quality object…

Computer Vision and Pattern Recognition · Computer Science · 2020-05-28 · Jing Lu, Baorui Zou, Zhanzhan Cheng, Shiliang Pu, Shuigeng Zhou, Yi Niu, Fei Wu

Self-supervised Object-Centric Learning for Videos

Unsupervised multi-object segmentation has shown impressive results on images by utilizing powerful semantics learned from self-supervised pretraining. An additional modality such as depth or motion is often used to facilitate the…

Computer Vision and Pattern Recognition · Computer Science · 2023-10-12 · Görkay Aydemir, Weidi Xie, Fatma Güney

Object-Centric Multi-Task Learning for Human Instances

Human is one of the most essential classes in visual recognition tasks such as detection, segmentation, and pose estimation. Although much effort has been put into individual tasks, multi-task learning for these three tasks has been rarely…

Computer Vision and Pattern Recognition · Computer Science · 2023-03-14 · Hyeongseok Son, Sangil Jung, Solae Lee, Seongeun Kim, Seung-In Park, ByungIn Yoo

Universal Instance Perception as Object Discovery and Retrieval

All instance perception tasks aim at finding certain objects specified by some queries such as category names, language expressions, and target annotations, but this complete field has been split into multiple independent subtasks. In this…

Computer Vision and Pattern Recognition · Computer Science · 2023-08-21 · Bin Yan, Yi Jiang, Jiannan Wu, Dong Wang, Ping Luo, Zehuan Yuan, Huchuan Lu

Object Pursuit: Building a Space of Objects via Discriminative Weight Generation

We propose a framework to continuously learn object-centric representations for visual learning and understanding. Existing object-centric representations either rely on supervisions that individualize objects in the scene, or perform…

Computer Vision and Pattern Recognition · Computer Science · 2022-04-05 · Chuanyu Pan, Yanchao Yang, Kaichun Mo, Yueqi Duan, Leonidas Guibas