Related papers: Representation Alignment Contrastive Regularizatio…
Time-series representation learning can extract representations from data with temporal dynamics and sparse labels. When labeled data are sparse but unlabeled data are abundant, contrastive learning, i.e., a framework to learn a latent…
The effectiveness of neural network models largely relies on learning meaningful latent patterns from data, where self-supervised learning of informative representations can enhance model performance and generalisability. However,…
We introduce a weakly supervised method for representation learning based on aligning temporal sequences (e.g., videos) of the same process (e.g., human action). The main idea is to use the global temporal ordering of latent correspondences…
Discriminative representation is crucial for the association step in multi-object tracking. Recent work mainly utilizes features in single or neighboring frames for constructing metric loss and empowering networks to extract representation…
As an important and challenging problem in computer vision and graphics, keypoint-based object tracking is typically formulated in a spatio-temporal statistical learning framework. However, most existing keypoint trackers are incapable of…
Current state-of-the-art object-centric models use slots and attention-based routing for binding. However, this class of models has several conceptual limitations: the number of slots is hardwired; all slots have equal capacity; training…
An accurate detection and tracking of devices such as guiding catheters in live X-ray image acquisitions is an essential prerequisite for endovascular cardiac interventions. This information is leveraged for procedural guidance, e.g.,…
The recent trend in multiple object tracking (MOT) is heading towards leveraging deep learning to boost the tracking performance. In this paper, we propose a novel solution named TransSTAM, which leverages Transformer to effectively model…
Spatial-temporal data modeling aims to mine the underlying spatial relationships and temporal dependencies of objects in a system. However, most existing methods focus on the modeling of spatial-temporal data in a single mode, lacking the…
Recent progress in multiple object tracking (MOT) has shown that a robust similarity score is key to the success of trackers. A good similarity score is expected to reflect multiple cues, e.g. appearance, location, and topology, over a long…
There has been an increasing focus in learning interpretable feature representations, particularly in applications such as medical image analysis that require explainability, whilst relying less on annotated data (since annotations can be…
Online tracking of multiple objects in videos requires strong capacity of modeling and matching object appearances. Previous methods for learning appearance embedding mostly rely on instance-level matching without considering the temporal…
Most of existing correlation filter-based tracking approaches only estimate simple axis-aligned bounding boxes, and very few of them is capable of recovering the underlying similarity transformation. To tackle this challenging problem, in…
In representation learning (RL), how to make the learned representations easy to interpret and less overfitted to training data are two important but challenging issues. To address these problems, we study a new type of regulariza- tion…
Sparse representation has been widely studied in visual tracking, which has shown promising tracking performance. Despite a lot of progress, the visual tracking problem is still a challenging task due to appearance variations over time. In…
Contrastive representation learning has been outstandingly successful in practice. In this work, we identify two key properties related to the contrastive loss: (1) alignment (closeness) of features from positive pairs, and (2) uniformity…
Spatial-temporal forecasting is crucial and widely applicable in various domains such as traffic, energy, and climate. Benefiting from the abundance of unlabeled spatial-temporal data, self-supervised methods are increasingly adapted to…
Capturing contextual dependencies has proven useful to improve the representational power of deep neural networks. Recent approaches that focus on modeling global context, such as self-attention and non-local operation, achieve this goal by…
Robust feature representation plays significant role in visual tracking. However, it remains a challenging issue, since many factors may affect the experimental performance. The existing method which combine different features by setting…
We present a self-supervised approach for learning video representations using temporal video alignment as a pretext task, while exploiting both frame-level and video-level information. We leverage a novel combination of temporal alignment…