Related papers: Learning from One Continuous Video Stream
Despite rapid advancements in lifelong learning (LLL) research, a large body of research mainly focuses on improving the performance in the existing \textit{static} continual learning (CL) setups. These methods lack the ability to succeed…
The practicality of a video surveillance system is adversely limited by the amount of queries that can be placed on human resources and their vigilance in response. To transcend this limitation, a major effort under way is to include…
Imagine a smart camera trap selectively clicking pictures to understand animal movement patterns within a particular habitat. These "snapshots", or pieces of data captured from a data stream at adaptively chosen times, provide a glimpse of…
We address the challenge of representation learning from a continuous stream of video as input, in a self-supervised manner. This differs from the standard approaches to video learning where videos are chopped and shuffled during training…
Recognising human activities from streaming videos poses unique challenges to learning algorithms: predictive models need to be scalable, incrementally trainable, and must remain bounded in size even when the data stream is arbitrarily…
This work demonstrates that training autoregressive video diffusion models from a single video stream$\unicode{x2013}$resembling the experience of embodied agents$\unicode{x2013}$is not only possible, but can also be as effective as…
Devising intelligent agents able to live in an environment and learn by observing the surroundings is a longstanding goal of Artificial Intelligence. From a bare Machine Learning perspective, challenges arise when the agent is prevented…
Video frame interpolation is a challenging problem because there are different scenarios for each video depending on the variety of foreground and background motion, frame rate, and occlusion. It is therefore difficult for a single network…
This paper studies deep network architectures to address the problem of video classification. A multi-stream framework is proposed to fully utilize the rich multimodal information in videos. Specifically, we first train three Convolutional…
Imagining multiple consecutive frames given one single snapshot is challenging, since it is difficult to simultaneously predict diverse motions from a single image and faithfully generate novel frames without visual distortions. In this…
One of the most well-established applications of machine learning is in deciding what content to show website visitors. When observation data comes from high-velocity, user-generated data streams, machine learning methods perform a…
In this work, we address the challenging video scene parsing problem by developing effective representation learning methods given limited parsing annotations. In particular, we contribute two novel methods that constitute a unified parsing…
We propose a method for learning from streaming visual data using a compact, constant size representation of all the data that was seen until a given moment. Specifically, we construct a 'coreset' representation of streaming data using a…
Understanding continuous video streams plays a fundamental role in real-time applications including embodied AI and autonomous driving. Unlike offline video understanding, streaming video understanding requires the ability to process video…
The standard way of training video models entails sampling at each iteration a single clip from a video and optimizing the clip prediction with respect to the video-level label. We argue that a single clip may not have enough temporal…
Attaining prototypical features to represent class distributions is well established in representation learning. However, learning prototypes online from streaming data proves a challenging endeavor as they rapidly become outdated, caused…
We propose a new method for learning videos by aggregating multiple models by sequentially extracting video clips from untrimmed video. The proposed method reduces the correlation between clips by feeding clips to multiple models in turn…
Learning the dynamics of robots from data can help achieve more accurate tracking controllers, or aid their navigation algorithms. However, when the actual dynamics of the robots change due to external conditions, on-line adaptation of…
Training an effective video-and-language model intuitively requires multiple frames as model inputs. However, it is unclear whether using multiple frames is beneficial to downstream tasks, and if yes, whether the performance gain is worth…
Diffusion models have made significant strides in image generation, mastering tasks such as unconditional image synthesis, text-image translation, and image-to-image conversions. However, their capability falls short in the realm of video…