Related papers: Finding Fallen Objects Via Asynchronous Audio-Visu…

Object Permanence Through Audio-Visual Representations

As robots perform manipulation tasks and interact with objects, it is probable that they accidentally drop objects (e.g., due to an inadequate grasp of an unfamiliar object) that subsequently bounce out of their visual fields. To enable…

Robotics · Computer Science 2021-10-05 Fanjun Bu, Chien-Ming Huang

Disentangled Acoustic Fields For Multimodal Physical Scene Understanding

We study the problem of multimodal physical scene understanding, where an embodied agent needs to find fallen objects by inferring object properties, direction, and distance of an impact sound source. Previous works adopt feed-forward…

Robotics · Computer Science 2024-07-17 Jie Yin, Andrew Luo, Yilun Du, Anoop Cherian, Tim K. Marks, Jonathan Le Roux, Chuang Gan

SoundSpaces: Audio-Visual Navigation in 3D Environments

Moving around in the world is naturally a multisensory experience, but today's embodied agents are deaf, restricted to solely their visual perception of the environment. We introduce audio-visual navigation for complex, acoustically and…

Computer Vision and Pattern Recognition · Computer Science 2020-08-25 Changan Chen, Unnat Jain, Carl Schissler, Sebastia Vicenc Amengual Gari, Ziad Al-Halah, Vamsi Krishna Ithapu, Philip Robinson, Kristen Grauman

Look, Listen, and Act: Towards Audio-Visual Embodied Navigation

A crucial ability of mobile intelligent agents is to integrate the evidence from multiple sensory inputs in an environment and to make a sequence of actions to reach their goals. In this paper, we attempt to approach the problem of…

Computer Vision and Pattern Recognition · Computer Science 2020-03-10 Chuang Gan, Yiwei Zhang, Jiajun Wu, Boqing Gong, Joshua B. Tenenbaum

Class-aware Sounding Objects Localization via Audiovisual Correspondence

Audiovisual scenes are pervasive in our daily life. It is commonplace for humans to discriminatively localize different sounding objects but quite challenging for machines to achieve class-aware sounding objects localization without…

Computer Vision and Pattern Recognition · Computer Science 2021-12-23 Di Hu, Yake Wei, Rui Qian, Weiyao Lin, Ruihua Song, Ji-Rong Wen

Visual Room Rearrangement

There has been significant recent progress in the field of Embodied AI, with researchers developing models and algorithms enabling embodied agents to navigate and interact within completely unseen environments. In this paper, we propose a…

Computer Vision and Pattern Recognition · Computer Science 2021-03-31 Luca Weihs, Matt Deitke, Aniruddha Kembhavi, Roozbeh Mottaghi

3EED: Ground Everything Everywhere in 3D

Visual grounding in 3D is the key for embodied agents to localize language-referred objects in open-world environments. However, existing benchmarks are limited to indoor focus, single-platform constraints, and small scale. We introduce…

Computer Vision and Pattern Recognition · Computer Science 2025-12-02 Rong Li, Yuhao Dong, Tianshuai Hu, Ao Liang, Youquan Liu, Dongyue Lu, Liang Pan, Lingdong Kong, Junwei Liang, Ziwei Liu

RealImpact: A Dataset of Impact Sound Fields for Real Objects

Objects make unique sounds under different perturbations, environment conditions, and poses relative to the listener. While prior works have modeled impact sounds and sound propagation in simulation, we lack a standard dataset of impact…

Sound · Computer Science 2023-06-19 Samuel Clarke, Ruohan Gao, Mason Wang, Mark Rau, Julia Xu, Jui-Hsien Wang, Doug L. James, Jiajun Wu

AcousticFusion: Fusing Sound Source Localization to Visual SLAM in Dynamic Environments

Dynamic objects in the environment, such as people and other agents, lead to challenges for existing simultaneous localization and mapping (SLAM) approaches. To deal with dynamic environments, computer vision researchers usually apply some…

Robotics · Computer Science 2021-08-04 Tianwei Zhang, Huayan Zhang, Xiaofei Li, Junfeng Chen, Tin Lun Lam, Sethu Vijayakumar

Help the Blind See: Assistance for the Visually Impaired through Augmented Acoustic Simulation

An estimated 253 million people have visual impairments. These visual impairments affect everyday lives, and limit their understanding of the outside world. This can pose a risk to health from falling or collisions. We propose a solution to…

Human-Computer Interaction · Computer Science 2023-03-30 Alexander Mehta, Ritik Jalisatgi

Deep Part Induction from Articulated Object Pairs

Object functionality is often expressed through part articulation, as when the two rigid parts of a scissor pivot against each other to perform the cutting function. Such articulations are often similar across objects within the same…

Computer Vision and Pattern Recognition · Computer Science 2018-09-21 Li Yi, Haibin Huang, Difan Liu, Evangelos Kalogerakis, Hao Su, Leonidas Guibas

Visual Acoustic Fields

Objects produce different sounds when hit, and humans can intuitively infer how an object might sound based on its appearance and material properties. Inspired by this intuition, we propose Visual Acoustic Fields, a framework that bridges…

Computer Vision and Pattern Recognition · Computer Science 2025-04-02 Yuelei Li, Hyunjin Kim, Fangneng Zhan, Ri-Zhao Qiu, Mazeyu Ji, Xiaojun Shan, Xueyan Zou, Paul Liang, Hanspeter Pfister, Xiaolong Wang

A Deep Reinforcement Learning Approach for Audio-based Navigation and Audio Source Localization in Multi-speaker Environments

In this work we apply deep reinforcement learning to the problems of navigating a three-dimensional environment and inferring the locations of human speaker audio sources within, in the case where the only available information is the raw…

Sound · Computer Science 2021-11-30 Petros Giannakopoulos, Aggelos Pikrakis, Yannis Cotronis

A Dataset for Developing and Benchmarking Active Vision

We present a new public dataset with a focus on simulating robotic vision tasks in everyday indoor environments using real imagery. The dataset includes 20,000+ RGB-D images and 50,000+ 2D bounding boxes of object instances densely captured…

Computer Vision and Pattern Recognition · Computer Science 2017-03-07 Phil Ammirato, Patrick Poirson, Eunbyung Park, Jana Kosecka, Alexander C. Berg

Counting Stacked Objects

Visual object counting is a fundamental computer vision task underpinning numerous real-world applications, from cell counting in biomedicine to traffic and wildlife monitoring. However, existing methods struggle to handle the challenge of…

Computer Vision and Pattern Recognition · Computer Science 2025-07-31 Corentin Dumery, Noa Etté, Aoxiang Fan, Ren Li, Jingyi Xu, Hieu Le, Pascal Fua

Learning Object Arrangements in 3D Scenes using Human Context

We consider the problem of learning object arrangements in a 3D scene. The key idea here is to learn how objects relate to human poses based on their affordances, ease of use and reachability. In contrast to modeling object-object…

Machine Learning · Computer Science 2012-07-03 Yun Jiang, Marcus Lim, Ashutosh Saxena

HomeEmergency -- Using Audio to Find and Respond to Emergencies in the Home

In the United States alone, accidental home deaths exceed 128,000 per year. Our work aims to enable home robots that respond to emergency scenarios in the home, preventing injuries and deaths. We introduce a new dataset of household…

Robotics · Computer Science 2025-04-29 James F. Mullen, Dhruva Kumar, Xuewei Qi, Rajasimman Madhivanan, Arnie Sen, Dinesh Manocha, Richard Kim

Choosing Smartly: Adaptive Multimodal Fusion for Object Detection in Changing Environments

Object detection is an essential task for autonomous robots operating in dynamic and changing environments. A robot should be able to detect objects in the presence of sensor noise that can be induced by changing lighting conditions for…

Robotics · Computer Science 2019-11-20 Oier Mees, Andreas Eitel, Wolfram Burgard

Dynamic Objects Segmentation for Visual Localization in Urban Environments

Visual localization and mapping is a crucial capability to address many challenges in mobile robotics. It constitutes a robust, accurate and cost-effective approach for local and global pose estimation within prior maps. Yet, in highly…

Computer Vision and Pattern Recognition · Computer Science 2018-07-11 Guoxiang Zhou, Berta Bescos, Marcin Dymczyk, Mark Pfeiffer, José Neira, Roland Siegwart

Swoosh! Rattle! Thump! -- Actions that Sound

Truly intelligent agents need to capture the interplay of all their senses to build a rich physical understanding of their world. In robotics, we have seen tremendous progress in using visual and tactile perception; however, we have often…

Robotics · Computer Science 2020-07-06 Dhiraj Gandhi, Abhinav Gupta, Lerrel Pinto