Ruize Han
Object detection is an important task in computer vision, which aims to detect the objects of interest. through the given category list or query images. In this work, we propose a new problem of language-visual-complementary open-set object…
Open-vocabulary object detection (OVD) has made significant progress, enabling detectors to generalize from seen to unseen categories. However, real-world category spaces continually evolve, and existing OVD models still struggle with newly…
Existing studies typically investigate domain shift and category shift as independent problems, however, in real-world scenarios, the two types of shifts often occur simultaneously and interact, leading to significant degradation in…
Real-world weather, illumination, and imaging variations often induce severe domain shifts, degrading single-source detectors in unseen environments. Existing single-domain generalized object detection (SDGOD) methods mainly rely on data…
This paper presents the NTIRE 2026 Remote Sensing Infrared Image Super-Resolution (x4) Challenge, one of the associated challenges of NTIRE 2026. The challenge aims to recover high-resolution (HR) infrared images from low-resolution (LR)…
This paper presents the NTIRE 2026 image super-resolution ($\times$4) challenge, one of the associated competitions of the NTIRE 2026 Workshop at CVPR 2026. The challenge aims to reconstruct high-resolution (HR) images from low-resolution…
Reasoning video object segmentation predicts pixel-level masks in videos from natural-language queries that may involve implicit and temporally grounded references. However, existing methods are developed and evaluated in an offline regime,…
Object-level spatial-temporal understanding is essential for video question answering, yet existing multimodal large language models (MLLMs) encode frames holistically and lack explicit mechanisms for fine-grained object grounding. Recent…
Multi-Object Tracking (MOT) has traditionally focused on a few specific categories, restricting its applicability to real-world scenarios involving diverse objects. Open-Vocabulary Multi-Object Tracking (OVMOT) addresses this by enabling…
Despite the remarkable progress in open-vocabulary object detection (OVD), a significant gap remains between the training and testing phases. During training, the RPN and RoI heads often misclassify unlabeled novel-category objects as…
Gait recognition is an important AI task, which has been progressed rapidly with the development of deep learning. However, existing learning based gait recognition methods mainly focus on the single domain, especially the constrained…
Vehicles, as one of the most common and significant objects in the real world, the researches on which using computer vision technologies have made remarkable progress, such as vehicle detection, vehicle re-identification, etc. To search an…
Person re-identification (Re-ID) is an important task and has significant applications for public security and information forensics, which has progressed rapidly with the development of deep learning. In this work, we investigate a novel…
Open-vocabulary object perception has become an important topic in artificial intelligence, which aims to identify objects with novel classes that have not been seen during training. Under this setting, open-vocabulary object detection…
Open-vocabulary multi-object tracking (OVMOT) represents a critical new challenge involving the detection and tracking of diverse object categories in videos, encompassing both seen categories (base classes) and unseen categories (novel…
We study a novel yet practical problem of open-corpus multi-object tracking (OCMOT), which extends the MOT into localizing, associating, and recognizing generic-category objects of both seen (base) and unseen (novel) classes, but without…
A consistent spatial-temporal coordination across multiple agents is fundamental for collaborative perception, which seeks to improve perception abilities through information exchange among agents. To achieve this spatial-temporal…
We tackle a new problem of multi-view camera and subject registration in the bird's eye view (BEV) without pre-given camera calibration. This is a very challenging problem since its only input is several RGB images from different…
Multi-view multi-human association and tracking (MvMHAT), is a new but important problem for multi-person scene video surveillance, aiming to track a group of people over time in each view, as well as to identify the same person across…
Gait recognition, a long-distance biometric technology, has aroused intense interest recently. Currently, the two dominant gait recognition works are appearance-based and model-based, which extract features from silhouettes and skeletons,…