Related papers: Action sequencing using visual permutations

Transformers for One-Shot Visual Imitation

Humans are able to seamlessly visually imitate others, by inferring their intentions and using past experience to achieve the same end goal. In other words, we can parse complex semantic knowledge from raw video and efficiently translate…

Machine Learning · Computer Science 2020-11-12 Sudeep Dasari, Abhinav Gupta

Instruction-Following Agents with Multimodal Transformer

Humans are excellent at understanding language and vision to accomplish a wide range of tasks. In contrast, creating general instruction-following embodied agents remains a difficult challenge. Prior work that uses pure language-only models…

Computer Vision and Pattern Recognition · Computer Science 2023-03-28 Hao Liu, Lisa Lee, Kimin Lee, Pieter Abbeel

Invariant recognition drives neural representations of action sequences

Recognizing the actions of others from visual stimuli is a crucial aspect of human visual perception that allows individuals to respond to social cues. Humans are able to identify similar behaviors and discriminate between distinct actions…

Neurons and Cognition · Quantitative Biology 2018-02-07 Andrea Tacchetti, Leyla Isik, Tomaso Poggio

Visual Robot Task Planning

Prospection, the act of predicting the consequences of many possible futures, is intrinsic to human planning and action, and may even be at the root of consciousness. Surprisingly, this idea has been explored comparatively little in…

Robotics · Computer Science 2018-04-03 Chris Paxton, Yotam Barnoy, Kapil Katyal, Raman Arora, Gregory D. Hager

Learning to Sequence Robot Behaviors for Visual Navigation

Recent literature in the robotics community has focused on learning robot behaviors that abstract out lower-level details of robot control. To fully leverage the efficacy of such behaviors, it is necessary to select and sequence them to…

Robotics · Computer Science 2018-03-28 Hadi Salman, Puneet Singhal, Tanmay Shankar, Peng Yin, Ali Salman, William Paivine, Guillaume Sartoretti, Matthew Travers, Howie Choset

Learning Latent Action World Models In The Wild

Agents capable of reasoning and planning in the real world require the ability of predicting the consequences of their actions. While world models possess this capability, they most often require action labels, that can be complex to obtain…

Artificial Intelligence · Computer Science 2026-01-21 Quentin Garrido, Tushar Nagarajan, Basile Terver, Nicolas Ballas, Yann LeCun, Michael Rabbat

Manipulate by Seeing: Creating Manipulation Controllers from Pre-Trained Representations

The field of visual representation learning has seen explosive growth in the past years, but its benefits in robotics have been surprisingly limited so far. Prior work uses generic visual representations as a basis to learn (task-specific)…

Robotics · Computer Science 2023-08-16 Jianren Wang, Sudeep Dasari, Mohan Kumar Srirama, Shubham Tulsiani, Abhinav Gupta

Learning State-Tracking from Code Using Linear RNNs

Over the last years, state-tracking tasks, particularly permutation composition, have become a testbed to understand the limits of sequence model architectures like Transformers and RNNs (linear and non-linear). However, these are often…

Machine Learning · Computer Science 2026-04-24 Julien Siems, Riccardo Grazzi, Kirill Kalinin, Hitesh Ballani, Babak Rahmani

Learning Sensorimotor Primitives of Sequential Manipulation Tasks from Visual Demonstrations

This work aims to learn how to perform complex robot manipulation tasks that are composed of several, consecutively executed low-level sub-tasks, given as input a few visual demonstrations of the tasks performed by a person. The sub-tasks…

Robotics · Computer Science 2022-03-09 Junchi Liang, Bowen Wen, Kostas Bekris, Abdeslam Boularias

Attention over learned object embeddings enables complex visual reasoning

Neural networks have achieved success in a wide array of perceptual tasks but often fail at tasks involving both perception and higher-level reasoning. On these more challenging tasks, bespoke approaches (such as modular symbolic…

Computer Vision and Pattern Recognition · Computer Science 2021-10-27 David Ding, Felix Hill, Adam Santoro, Malcolm Reynolds, Matt Botvinick

Deep Neural Networks for Visual Reasoning

Visual perception and language understanding are fundamental components of human intelligence, enabling them to understand and reason about objects and their interactions. It is crucial for machines to have this capacity to reason using…

Computer Vision and Pattern Recognition · Computer Science 2022-09-27 Thao Minh Le

Deep Action- and Context-Aware Sequence Learning for Activity Recognition and Anticipation

Action recognition and anticipation are key to the success of many computer vision applications. Existing methods can roughly be grouped into those that extract global, context-aware representations of the entire image or sequence, and…

Computer Vision and Pattern Recognition · Computer Science 2016-11-21 Mohammad Sadegh Aliakbarian, Fatemehsadat Saleh, Basura Fernando, Mathieu Salzmann, Lars Petersson, Lars Andersson

Task Formulation Matters When Learning Continually: A Case Study in Visual Question Answering

Continual learning aims to train a model incrementally on a sequence of tasks without forgetting previous knowledge. Although continual learning has been widely studied in computer vision, its application to Vision+Language tasks is not…

Machine Learning · Computer Science 2024-01-23 Mavina Nikandrou, Lu Yu, Alessandro Suglia, Ioannis Konstas, Verena Rieser

Multimodal Pretrained Models for Verifiable Sequential Decision-Making: Planning, Grounding, and Perception

Recently developed pretrained models can encode rich world knowledge expressed in multiple modalities, such as text and images. However, the outputs of these models cannot be integrated into algorithms to solve sequential decision-making…

Artificial Intelligence · Computer Science 2024-06-19 Yunhao Yang, Cyrus Neary, Ufuk Topcu

Understanding Multimodal Procedural Knowledge by Sequencing Multimodal Instructional Manuals

The ability to sequence unordered events is an essential skill to comprehend and reason about real world task procedures, which often requires thorough understanding of temporal common sense and multimodal information, as these procedures…

Computation and Language · Computer Science 2024-02-22 Te-Lin Wu, Alex Spangher, Pegah Alipoormolabashi, Marjorie Freedman, Ralph Weischedel, Nanyun Peng

Deep Visual Foresight for Planning Robot Motion

A key challenge in scaling up robot learning to many skills and environments is removing the need for human supervision, so that robots can collect their own data and improve their own performance without being limited by the cost of…

Machine Learning · Computer Science 2017-03-14 Chelsea Finn, Sergey Levine

Action Recognition based on Cross-Situational Action-object Statistics

Machine learning models of visual action recognition are typically trained and tested on data from specific situations where actions are associated with certain objects. It is an open question how action-object associations in the training…

Computer Vision and Pattern Recognition · Computer Science 2022-08-16 Satoshi Tsutsui, Xizi Wang, Guangyuan Weng, Yayun Zhang, David Crandall, Chen Yu

The Sensory Neuron as a Transformer: Permutation-Invariant Neural Networks for Reinforcement Learning

In complex systems, we often observe complex global behavior emerge from a collection of agents interacting with each other in their environment, with each individual agent acting only on locally available information, without knowing the…

Neural and Evolutionary Computing · Computer Science 2021-09-30 Yujin Tang, David Ha

STEPS: A Benchmark for Order Reasoning in Sequential Tasks

Various human activities can be abstracted into a sequence of actions in natural text, i.e. cooking, repairing, manufacturing, etc. Such action sequences heavily depend on the executing order, while disorder in action sequences leads to…

Computation and Language · Computer Science 2023-06-08 Weizhi Wang, Hong Wang, Xifeng Yan

Transformers in Action Recognition: A Review on Temporal Modeling

In vision-based action recognition, spatio-temporal features from different modalities are used for recognizing activities. Temporal modeling is a long challenge of action recognition. However, there are limited methods such as pre-computed…

Computer Vision and Pattern Recognition · Computer Science 2023-02-06 Elham Shabaninia, Hossein Nezamabadi-pour, Fatemeh Shafizadegan