Related papers: Temporal Deformable Convolutional Encoder-Decoder …

End-to-End Video Captioning

Building correspondences across different modalities, such as video and language, has recently become critical in many visual recognition applications, such as video captioning. Inspired by machine translation, recent models tackle this…

Computer Vision and Pattern Recognition · Computer Science 2019-11-11 Silvio Olivastri , Gurkirt Singh , Fabio Cuzzolin

Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning

Recently, deep learning approach, especially deep Convolutional Neural Networks (ConvNets), have achieved overwhelming accuracy with fast processing speed for image classification. Incorporating temporal structure with deep ConvNets for…

Computer Vision and Pattern Recognition · Computer Science 2015-11-12 Pingbo Pan , Zhongwen Xu , Yi Yang , Fei Wu , Yueting Zhuang

Analysis of Convolutional Decoder for Image Caption Generation

Recently Convolutional Neural Networks have been proposed for Sequence Modelling tasks such as Image Caption Generation. However, unlike Recurrent Neural Networks, the performance of Convolutional Neural Networks as Decoders for Image…

Computer Vision and Pattern Recognition · Computer Science 2021-03-09 Sulabh Katiyar , Samir Kumar Borgohain

CNN+CNN: Convolutional Decoders for Image Captioning

Image captioning is a challenging task that combines the field of computer vision and natural language processing. A variety of approaches have been proposed to achieve the goal of automatically describing an image, and recurrent neural…

Computer Vision and Pattern Recognition · Computer Science 2018-05-24 Qingzhong Wang , Antoni B. Chan

Temporal 3D ConvNets: New Architecture and Transfer Learning for Video Classification

The work in this paper is driven by the question how to exploit the temporal cues available in videos for their accurate classification, and for human action recognition in particular? Thus far, the vision community has focused on…

Computer Vision and Pattern Recognition · Computer Science 2017-11-23 Ali Diba , Mohsen Fayyaz , Vivek Sharma , Amir Hossein Karami , Mohammad Mahdi Arzani , Rahman Yousefzadeh , Luc Van Gool

Hierarchical Memory Decoding for Video Captioning

Recent advances of video captioning often employ a recurrent neural network (RNN) as the decoder. However, RNN is prone to diluting long-term information. Recent works have demonstrated memory network (MemNet) has the advantage of storing…

Computer Vision and Pattern Recognition · Computer Science 2020-02-28 Aming Wu , Yahong Han

Stacked Convolutional Deep Encoding Network for Video-Text Retrieval

Existing dominant approaches for cross-modal video-text retrieval task are to learn a joint embedding space to measure the cross-modal similarity. However, these methods rarely explore long-range dependency inside video frames or textual…

Multimedia · Computer Science 2020-04-13 Rui Zhao , Kecheng Zheng , Zheng-jun Zha

Reconstruct and Represent Video Contents for Captioning via Reinforcement Learning

In this paper, the problem of describing visual contents of a video sequence with natural language is addressed. Unlike previous video captioning work mainly exploiting the cues of video contents to make a language description, we propose a…

Computer Vision and Pattern Recognition · Computer Science 2019-06-05 Wei Zhang , Bairui Wang , Lin Ma , Wei Liu

ConvTransformer: A Convolutional Transformer Network for Video Frame Synthesis

Deep Convolutional Neural Networks (CNNs) are powerful models that have achieved excellent performance on difficult computer vision tasks. Although CNNs perform well whenever large labeled training samples are available, they work badly on…

Computer Vision and Pattern Recognition · Computer Science 2021-06-03 Zhouyong Liu , Shun Luo , Wubin Li , Jingben Lu , Yufan Wu , Shilei Sun , Chunguo Li , Luxi Yang

Reconstruction Network for Video Captioning

In this paper, the problem of describing visual contents of a video sequence with natural language is addressed. Unlike previous video captioning work mainly exploiting the cues of video contents to make a language description, we propose a…

Computer Vision and Pattern Recognition · Computer Science 2018-04-02 Bairui Wang , Lin Ma , Wei Zhang , Wei Liu

A Real-time Action Representation with Temporal Encoding and Deep Compression

Deep neural networks have achieved remarkable success for video-based action recognition. However, most of existing approaches cannot be deployed in practice due to the high computational cost. To address this challenge, we propose a new…

Computer Vision and Pattern Recognition · Computer Science 2020-06-18 Kun Liu , Wu Liu , Huadong Ma , Mingkui Tan , Chuang Gan

A Study of All-Convolutional Encoders for Connectionist Temporal Classification

Connectionist temporal classification (CTC) is a popular sequence prediction approach for automatic speech recognition that is typically used with models based on recurrent neural networks (RNNs). We explore whether deep convolutional…

Computation and Language · Computer Science 2018-02-16 Kalpesh Krishna , Liang Lu , Kevin Gimpel , Karen Livescu

Video Description using Bidirectional Recurrent Neural Networks

Although traditionally used in the machine translation field, the encoder-decoder framework has been recently applied for the generation of video and image descriptions. The combination of Convolutional and Recurrent Neural Networks in…

Computer Vision and Pattern Recognition · Computer Science 2016-12-13 Álvaro Peris , Marc Bolaños , Petia Radeva , Francisco Casacuberta

Empirical Autopsy of Deep Video Captioning Frameworks

Contemporary deep learning based video captioning follows encoder-decoder framework. In encoder, visual features are extracted with 2D/3D Convolutional Neural Networks (CNNs) and a transformed version of those features is passed to the…

Computer Vision and Pattern Recognition · Computer Science 2019-11-22 Nayyer Aafaq , Naveed Akhtar , Wei Liu , Ajmal Mian

Unsupervised Learning of Word-Sequence Representations from Scratch via Convolutional Tensor Decomposition

Unsupervised text embeddings extraction is crucial for text understanding in machine learning. Word2Vec and its variants have received substantial success in mapping words with similar syntactic or semantic meaning to vectors close to each…

Computation and Language · Computer Science 2018-05-30 Furong Huang , Animashree Anandkumar

Temporally Constrained Neural Networks (TCNN): A framework for semi-supervised video semantic segmentation

A major obstacle to building models for effective semantic segmentation, and particularly video semantic segmentation, is a lack of large and well annotated datasets. This bottleneck is particularly prohibitive in highly specialized and…

Computer Vision and Pattern Recognition · Computer Science 2021-12-28 Deepak Alapatt , Pietro Mascagni , Armine Vardazaryan , Alain Garcia , Nariaki Okamoto , Didier Mutter , Jacques Marescaux , Guido Costamagna , Bernard Dallemagne , Nicolas Padoy

Coarse to Fine Multi-Resolution Temporal Convolutional Network

Temporal convolutional networks (TCNs) are a commonly used architecture for temporal video segmentation. TCNs however, tend to suffer from over-segmentation errors and require additional refinement modules to ensure smoothness and temporal…

Computer Vision and Pattern Recognition · Computer Science 2021-05-25 Dipika Singhania , Rahul Rahaman , Angela Yao

Good Representation, Better Explanation: Role of Convolutional Neural Networks in Transformer-Based Remote Sensing Image Captioning

Remote Sensing Image Captioning (RSIC) is the process of generating meaningful descriptions from remote sensing images. Recently, it has gained significant attention, with encoder-decoder models serving as the backbone for generating…

Computer Vision and Pattern Recognition · Computer Science 2025-07-04 Swadhin Das , Saarthak Gupta , Kamal Kumar , Raksha Sharma

End-to-End Dense Video Captioning with Parallel Decoding

Dense video captioning aims to generate multiple associated captions with their temporal locations from the video. Previous methods follow a sophisticated "localize-then-describe" scheme, which heavily relies on numerous hand-crafted…

Computer Vision and Pattern Recognition · Computer Science 2021-11-18 Teng Wang , Ruimao Zhang , Zhichao Lu , Feng Zheng , Ran Cheng , Ping Luo

End-to-End Dense Video Captioning with Masked Transformer

Dense video captioning aims to generate text descriptions for all events in an untrimmed video. This involves both detecting and describing events. Therefore, all previous methods on dense video captioning tackle this problem by building…

Computer Vision and Pattern Recognition · Computer Science 2018-04-04 Luowei Zhou , Yingbo Zhou , Jason J. Corso , Richard Socher , Caiming Xiong