Related papers: Audio Outperforms Text for Visual Decoding

An Eye for an Ear: Zero-shot Audio Description Leveraging an Image Captioner using Audiovisual Distribution Alignment

Multimodal large language models have fueled progress in image captioning. These models, fine-tuned on vast image datasets, exhibit a deep understanding of semantic concepts. In this work, we show that this ability can be re-purposed for…

Audio and Speech Processing · Electrical Eng. & Systems 2024-10-10 Hugo Malard, Michel Olvera, Stéphane Lathuiliere, Slim Essid

Zero-Shot Audio Classification using Image Embeddings

Supervised learning methods can solve the given problem in the presence of a large set of labeled data. However, the acquisition of a dataset covering all the target classes typically requires manual labeling which is expensive and…

Sound · Computer Science 2022-06-13 Duygu Dogan, Huang Xie, Toni Heittola, Tuomas Virtanen

Brain-aligning of semantic vectors improves neural decoding of visual stimuli

The development of algorithms to accurately decode neural information has long been a research focus in the field of neuroscience. Brain decoding typically involves training machine learning models to map neural data onto a preestablished…

Neurons and Cognition · Quantitative Biology 2025-12-03 Shirin Vafaei, Ryohei Fukuma, Takufumi Yanagisawa, Huixiang Yang, Satoru Oshino, Naoki Tani, Hui Ming Khoo, Hidenori Sugano, Yasushi Iimura, Hiroharu Suzuki, Madoka Nakajima, Kentaro Tamura, Haruhiko Kishima

Probing Multimodal Fusion in the Brain: The Dominance of Audiovisual Streams in Naturalistic Encoding

Predicting brain activity in response to naturalistic, multimodal stimuli is a key challenge in computational neuroscience. While encoding models are becoming more powerful, their ability to generalize to truly novel contexts remains a…

Computer Vision and Pattern Recognition · Computer Science 2025-07-28 Hamid Abdollahi, Amir Hossein Mansouri Majoumerd, Amir Hossein Bagheri Baboukani, Amir Abolfazl Suratgar, Mohammad Bagher Menhaj

Visual Neural Decoding via Improved Visual-EEG Semantic Consistency

Visual neural decoding aims to extract and interpret original visual experiences directly from human brain activity. Recent studies have demonstrated the feasibility of decoding visual semantic categories from electroencephalography (EEG)…

Computer Vision and Pattern Recognition · Computer Science 2026-04-02 Hongzhou Chen, Lianghua He, Yihang Liu, Longzhen Yang, Shaohua Shang, MengChu Zhou

Brain2Text Decoding Model Reveals the Neural Mechanisms of Visual Semantic Processing

Decoding sensory experiences from neural activity to reconstruct human-perceived visual stimuli and semantic content remains a challenge in neuroscience and artificial intelligence. Despite notable progress in current brain decoding models,…

Neurons and Cognition · Quantitative Biology 2025-10-13 Feihan Feng, Jingxin Nie

CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal Processing

There has been a long-standing quest for a unified audio-visual-text model to enable various multimodal understanding tasks, which mimics the listening, seeing and reading process of human beings. Humans tend to represent knowledge using…

Audio and Speech Processing · Electrical Eng. & Systems 2024-02-22 Xianghu Yue, Xiaohai Tian, Lu Lu, Malu Zhang, Zhizheng Wu, Haizhou Li

Semantic Brain Decoding: from fMRI to conceptually similar image reconstruction of visual stimuli

Brain decoding is a field of computational neuroscience that uses measurable brain activity to infer mental states or internal representations of perceptual inputs. Therefore, we propose a novel approach to brain decoding that also relies…

Computer Vision and Pattern Recognition · Computer Science 2023-03-23 Matteo Ferrante, Tommaso Boccato, Nicola Toschi

MindAlign: Bridging EEG, Vision, and Language for Zero-Shot Visual Decoding

Visual decoding from brain signals is a key challenge at the intersection of computer vision and neuroscience, requiring methods that bridge neural representations and computational models of vision. We introduce a tri-modal contrastive…

Machine Learning · Computer Science 2026-05-26 Zexuan Chen, Sichao Liu, Runhao Lu, Huichao Qi, Alexandra Woolgar, Xi Vincent Wang, Lihui Wang

Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds

Humans can robustly recognize and localize objects by integrating visual and auditory cues. While machines are able to do the same now with images, less work has been done with sounds. This work develops an approach for dense semantic…

Computer Vision and Pattern Recognition · Computer Science 2020-03-10 Arun Balajee Vasudevan, Dengxin Dai, Luc Van Gool

Efficient Multi-Modal Embeddings from Structured Data

Multi-modal word semantics aims to enhance embeddings with perceptual input, assuming that human meaning representation is grounded in sensory experience. Most research focuses on evaluation involving direct visual input, however, visual…

Computation and Language · Computer Science 2021-10-07 Anita L. Verő, Ann Copestake

Decoding Visual Experience and Mapping Semantics through Whole-Brain Analysis Using fMRI Foundation Models

Neural decoding, the process of understanding how brain activity corresponds to different stimuli, has been a primary objective in cognitive sciences. Over the past three decades, advances in functional Magnetic Resonance Imaging (fMRI) and…

Computer Vision and Pattern Recognition · Computer Science 2026-01-28 Yanchen Wang, Adam Turnbull, Tiange Xiang, Yunlong Xu, Sa Zhou, Adnan Masoud, Shekoofeh Azizi, Feng Vankee Lin, Ehsan Adeli

Alternative Semantic Representations for Zero-Shot Human Action Recognition

A proper semantic representation for encoding side information is key to the success of zero-shot learning. In this paper, we explore two alternative semantic representations especially for zero-shot human action recognition: textual…

Computer Vision and Pattern Recognition · Computer Science 2017-06-29 Qian Wang, Ke Chen

Exploring the Role of Audio in Video Captioning

Recent focus in video captioning has been on designing architectures that can consume both video and text modalities, and using large-scale video datasets with text transcripts for pre-training, such as HowTo100M. Though these approaches…

Computer Vision and Pattern Recognition · Computer Science 2023-06-23 Yuhan Shen, Linjie Yang, Longyin Wen, Haichao Yu, Ehsan Elhamifar, Heng Wang

Visual-Semantic Decomposition and Partial Alignment for Document-based Zero-Shot Learning

Recent work shows that documents from encyclopedias serve as helpful auxiliary information for zero-shot learning. Existing methods align the entire semantics of a document with corresponding images to transfer knowledge. However, they…

Computer Vision and Pattern Recognition · Computer Science 2024-07-24 Xiangyan Qu, Jing Yu, Keke Gai, Jiamin Zhuang, Yuanmin Tang, Gang Xiong, Gaopeng Gou, Qi Wu

A Multimodal Visual Encoding Model Aided by Introducing Verbal Semantic Information

Biological research has revealed that the verbal semantic information in the brain cortex, as an additional source, participates in nonverbal semantic tasks, such as visual encoding. However, previous visual encoding models did not…

Computer Vision and Pattern Recognition · Computer Science 2023-08-30 Shuxiao Ma, Linyuan Wang, Bin Yan

Watch, Listen and Tell: Multi-modal Weakly Supervised Dense Event Captioning

Multi-modal learning, particularly among imaging and linguistic modalities, has made amazing strides in many high-level fundamental visual understanding problems, ranging from language grounding to dense event captioning. However, much of…

Computer Vision and Pattern Recognition · Computer Science 2019-10-28 Tanzila Rahman, Bicheng Xu, Leonid Sigal

Decoding Visual Neural Representations by Multimodal with Dynamic Balancing

In this work, we propose an innovative framework that integrates EEG, image, and text data, aiming to decode visual neural representations from low signal-to-noise ratio EEG signals. Specifically, we introduce text modality to enhance the…

Computer Vision and Pattern Recognition · Computer Science 2025-09-04 Kaili Sun, Xingyu Miao, Bing Zhai, Haoran Duan, Yang Long

SEED: Towards More Accurate Semantic Evaluation for Visual Brain Decoding

We present SEED (Semantic Evaluation for Visual Brain Decoding), a novel metric for evaluating the semantic decoding performance of visual brain decoding models. It integrates three complementary metrics, each capturing a different aspect…

Computer Vision and Pattern Recognition · Computer Science 2026-02-25 Juhyeon Park, Peter Yongho Kim, Jiook Cha, Shinjae Yoo, Taesup Moon

Aligning brain functions boosts the decoding of visual semantics in novel subjects

Deep learning is leading to major advances in the realm of brain decoding from functional Magnetic Resonance Imaging (fMRI). However, the large inter-subject variability in brain characteristics has limited most studies to train models on…

Machine Learning · Computer Science 2023-12-12 Alexis Thual, Yohann Benchetrit, Felix Geilert, Jérémy Rapin, Iurii Makarov, Hubert Banville, Jean-Rémi King