Related papers: AssembleNet++: Assembling Modality Representations…

AssembleNet: Searching for Multi-Stream Neural Connectivity in Video Architectures

Learning to represent videos is a very challenging task both algorithmically and computationally. Standard video CNN architectures have been designed by directly extending architectures devised for image understanding to include the time…

Computer Vision and Pattern Recognition · Computer Science 2020-05-28 Michael S. Ryoo , AJ Piergiovanni , Mingxing Tan , Anelia Angelova

ParNet: Position-aware Aggregated Relation Network for Image-Text matching

Exploring fine-grained relationships between entities (e.g. objects in an image or words in a sentence) contributes greatly to understanding multimedia content precisely. Previous attention mechanisms employed in image-text matching either take…

Computer Vision and Pattern Recognition · Computer Science 2019-06-18 Yaxian Xia , Lun Huang , Wenmin Wang , Xiaoyong Wei , Wenmin Wang

EAANet: Efficient Attention Augmented Convolutional Networks

Humans can effectively find salient regions in complex scenes. Self-attention mechanisms were introduced into Computer Vision (CV) to achieve this. Attention Augmented Convolutional Network (AANet) is a mixture of convolution and…

Computer Vision and Pattern Recognition · Computer Science 2022-06-07 Runqing Zhang , Tianshu Zhu

Looking around you: external information enhances representations for event sequences

Representation learning produces models in different domains, such as store purchases, client transactions, and general people's behavior. However, such models for event sequences usually process each sequence in isolation, ignoring context…

Machine Learning · Computer Science 2026-05-29 Petr Sokerin , Maria Kovaleva , Ekaterina Boyarina , Pavel Tikhomirov , Denis Vorobiyov , Alexey Zaytsev

Attention Distillation for Learning Video Representations

We address the challenging problem of learning motion representations using deep models for video recognition. To this end, we make use of attention modules that learn to highlight regions in the video and aggregate features for…

Computer Vision and Pattern Recognition · Computer Science 2020-08-18 Miao Liu , Xin Chen , Yun Zhang , Yin Li , James M. Rehg

Cross-Identity Motion Transfer for Arbitrary Objects through Pose-Attentive Video Reassembling

We propose attention-based networks for transferring motions between arbitrary objects. Given one or more source images and a driving video, our networks animate the subject in the source images according to the motion in the driving video. In…

Computer Vision and Pattern Recognition · Computer Science 2020-07-20 Subin Jeon , Seonghyeon Nam , Seoung Wug Oh , Seon Joo Kim

MAttNet: Modular Attention Network for Referring Expression Comprehension

In this paper, we address referring expression comprehension: localizing an image region described by a natural language expression. While most recent work treats expressions as a single unit, we propose to decompose them into three modular…

Computer Vision and Pattern Recognition · Computer Science 2018-03-28 Licheng Yu , Zhe Lin , Xiaohui Shen , Jimei Yang , Xin Lu , Mohit Bansal , Tamara L. Berg

MONet: Unsupervised Scene Decomposition and Representation

The ability to decompose scenes in terms of abstract building blocks is crucial for general intelligence. Where those basic building blocks share meaningful properties, interactions and other regularities across scenes, such decompositions…

Computer Vision and Pattern Recognition · Computer Science 2019-02-01 Christopher P. Burgess , Loic Matthey , Nicholas Watters , Rishabh Kabra , Irina Higgins , Matt Botvinick , Alexander Lerchner

DCANet: Learning Connected Attentions for Convolutional Neural Networks

While the self-attention mechanism has shown promising results for many vision tasks, it only considers the current features at a time. We show that this approach cannot take full advantage of the attention mechanism. In this paper, we present…

Computer Vision and Pattern Recognition · Computer Science 2020-07-13 Xu Ma , Jingda Guo , Sihai Tang , Zhinan Qiao , Qi Chen , Qing Yang , Song Fu

FAN: Focused Attention Networks

Attention networks show promise for both vision and language tasks, by emphasizing relationships between constituent elements through weighting functions. Such elements could be regions in an image output by a region proposal network, or…

Machine Learning · Computer Science 2019-10-07 Chu Wang , Babak Samari , Vladimir Kim , Siddhartha Chaudhuri , Kaleem Siddiqi

Attentional Pooling for Action Recognition

We introduce a simple yet surprisingly powerful model to incorporate attention in action recognition and human object interaction tasks. Our proposed attention module can be trained with or without extra supervision, and gives a sizable…

Computer Vision and Pattern Recognition · Computer Science 2018-01-03 Rohit Girdhar , Deva Ramanan

Scene-Adaptive Attention Network for Crowd Counting

In recent years, significant progress has been made in crowd counting research. However, because of the challenging scale variations and complex scenes in crowds, neither traditional convolution networks nor recent Transformer…

Computer Vision and Pattern Recognition · Computer Science 2022-01-03 Xing Wei , Yuanrui Kang , Jihao Yang , Yunfeng Qiu , Dahu Shi , Wenming Tan , Yihong Gong

Attention-based Part Assembly for 3D Volumetric Shape Modeling

Modeling a 3D volumetric shape as an assembly of decomposed shape parts is much more challenging, but semantically more valuable than direct reconstruction from a full shape representation. The neural network needs to implicitly learn part…

Computer Vision and Pattern Recognition · Computer Science 2023-04-24 Chengzhi Wu , Junwei Zheng , Julius Pfrommer , Jürgen Beyerer

Attention-Based Multimodal Fusion for Video Description

Currently successful methods for video description are based on encoder-decoder sentence generation using recurrent neural networks (RNNs). Recent work has shown the advantage of integrating temporal and/or spatial attention mechanisms…

Computer Vision and Pattern Recognition · Computer Science 2017-03-13 Chiori Hori , Takaaki Hori , Teng-Yok Lee , Kazuhiro Sumi , John R. Hershey , Tim K. Marks

Feature Aggregation Network for Video Face Recognition

This paper aims to learn a compact representation of a video for the video face recognition task. We make the following contributions: first, we propose a meta attention-based aggregation scheme which adaptively and at a fine-grained level weighs the…

Computer Vision and Pattern Recognition · Computer Science 2019-09-13 Zhaoxiang Liu , Huan Hu , Jinqiang Bai , Shaohua Li , Shiguo Lian

CLUENet: Cluster Attention Makes Neural Networks Have Eyes

Despite the success of convolution- and attention-based models in vision tasks, their rigid receptive fields and complex architectures limit their ability to model irregular spatial patterns and hinder interpretability, therefore posing…

Computer Vision and Pattern Recognition · Computer Science 2025-12-22 Xiangshuai Song , Jun-Jie Huang , Tianrui Liu , Ke Liang , Chang Tang

A Theoretical Study of (Hyper) Self-Attention through the Lens of Interactions: Representation, Training, Generalization

Self-attention has emerged as a core component of modern neural architectures, yet its theoretical underpinnings remain elusive. In this paper, we study self-attention through the lens of interacting entities, ranging from agents in…

Machine Learning · Computer Science 2025-06-09 Muhammed Ustaomeroglu , Guannan Qu

Improving Multimodal Accuracy Through Modality Pre-training and Attention

Training a multimodal network is challenging, and it requires complex architectures to achieve reasonable performance. We show that one reason for this phenomenon is the difference between the convergence rates of various modalities. We…

Artificial Intelligence · Computer Science 2020-11-13 Aya Abdelsalam Ismail , Mahmudul Hasan , Faisal Ishtiaq

AttendAffectNet: Self-Attention based Networks for Predicting Affective Responses from Movies

In this work, we propose different variants of the self-attention based network for emotion prediction from movies, which we call AttendAffectNet. We take both audio and video into account and incorporate the relation among multiple…

Sound · Computer Science 2021-10-19 Ha Thi Phuong Thao , Balamurali B. T. , Dorien Herremans , Gemma Roig

Multi-level Attention Fusion Network for Audio-visual Event Recognition

Event classification is inherently sequential and multimodal. Therefore, deep neural models need to dynamically focus on the most relevant time window and/or modality of a video. In this study, we propose the Multi-level Attention Fusion…

Computer Vision and Pattern Recognition · Computer Science 2021-06-15 Mathilde Brousmiche , Jean Rouat , Stéphane Dupont