Related papers: GRES: Generalized Referring Expression Segmentatio…
Referring Expression Segmentation (RES) and Comprehension (REC) respectively segment and detect the object described by an expression, while Referring Expression Generation (REG) generates an expression for the selected object. Existing…
Referring Expression Segmentation (RES) is a widely explored multi-modal task, which endeavors to segment the pre-existing object within a single image with a given linguistic expression. However, in broader real-world scenarios, it is not…
Referring expression segmentation (RES) aims at segmenting the entities' masks that match the descriptive language expression. While traditional RES methods primarily address object-level grounding, real-world scenarios demand a more…
Referring expression segmentation (RES) aims at segmenting the foreground masks of the entities that match the descriptive natural language expression. Previous datasets and methods for classic RES task heavily rely on the prior assumption…
The objective of Classic Referring Expression Comprehension (REC) is to produce a bounding box corresponding to the object mentioned in a given textual description. Commonly, existing datasets and techniques in classic REC are tailored for…
Referring Expression Segmentation (RES) is an emerging task in computer vision, which segments the target instances in images based on text descriptions. However, its development is plagued by the expensive segmentation labels. To address…
Generalized Referring Expression Segmentation (GRES) extends the scope of classic RES to refer to multiple objects in one expression or identify the empty targets absent in the image. GRES poses challenges in modeling the complex spatial…
Generalized Referring expressions can describe one object, several related objects, or none at all. Existing generalized referring segmentation (GRES) models treat all cases alike, predicting a single binary mask and ignoring how linguistic…
Referring Expression Segmentation (RES), which is aimed at localizing and segmenting the target according to the given language expression, has drawn increasing attention. Existing methods jointly consider the localization and segmentation…
Referring expression segmentation aims to segment an object described by a language expression from an image. Despite the recent progress on this task, existing models tackling this task may not be able to fully capture semantics and visual…
Referring Expression Segmentation (RES) has attracted rising attention, aiming to identify and segment objects based on natural language expressions. While substantial progress has been made in RES, the emergence of Generalized Referring…
Referring image segmentation (RIS) aims to segment objects in an image conditioning on free-from text descriptions. Despite the overwhelming progress, it still remains challenging for current approaches to perform well on cases with various…
Recent image segmentation models have advanced to segment images into high-quality masks for visual entities, and yet they cannot provide comprehensive semantic understanding for complex queries based on both language and vision. This…
3D Referring Expression Segmentation (3D-RES) is dedicated to segmenting a specific instance within a 3D space based on a natural language description. However, current approaches are limited to segmenting a single target, restricting the…
The task of video object segmentation with referring expressions (language-guided VOS) is to, given a linguistic phrase and a video, generate binary masks for the object to which the phrase refers. Our work argues that existing benchmarks…
Referring expression segmentation (RES), a task that involves localizing specific instance-level objects based on free-form linguistic descriptions, has emerged as a crucial frontier in human-AI interaction. It demands an intricate…
Referring Expression Segmentation (RES) is a core vision-language segmentation task that enables pixel-level understanding of targets via free-form linguistic expressions, supporting critical applications such as human-robot interaction and…
Visual grounding tasks, such as referring image segmentation (RIS) and referring expression comprehension (REC), aim to localize a target object based on a given textual description. The target object in an image can be described in…
The newly proposed Generalized Referring Expression Segmentation (GRES) amplifies the formulation of classic RES by involving complex multiple/non-target scenarios. Recent approaches address GRES by directly extending the well-adopted RES…
Referring expression comprehension (REC) aims to localize a target object in an image described by a referring expression phrased in natural language. Different from the object detection task that queried object labels have been…