English
Related papers

Related papers: Temporal Deformable Convolutional Encoder-Decoder …

200 papers

Building correspondences across different modalities, such as video and language, has recently become critical in many visual recognition applications, such as video captioning. Inspired by machine translation, recent models tackle this…

Computer Vision and Pattern Recognition · Computer Science 2019-11-11 Silvio Olivastri , Gurkirt Singh , Fabio Cuzzolin

Recently, deep learning approach, especially deep Convolutional Neural Networks (ConvNets), have achieved overwhelming accuracy with fast processing speed for image classification. Incorporating temporal structure with deep ConvNets for…

Computer Vision and Pattern Recognition · Computer Science 2015-11-12 Pingbo Pan , Zhongwen Xu , Yi Yang , Fei Wu , Yueting Zhuang

Recently Convolutional Neural Networks have been proposed for Sequence Modelling tasks such as Image Caption Generation. However, unlike Recurrent Neural Networks, the performance of Convolutional Neural Networks as Decoders for Image…

Computer Vision and Pattern Recognition · Computer Science 2021-03-09 Sulabh Katiyar , Samir Kumar Borgohain

Image captioning is a challenging task that combines the field of computer vision and natural language processing. A variety of approaches have been proposed to achieve the goal of automatically describing an image, and recurrent neural…

Computer Vision and Pattern Recognition · Computer Science 2018-05-24 Qingzhong Wang , Antoni B. Chan

The work in this paper is driven by the question how to exploit the temporal cues available in videos for their accurate classification, and for human action recognition in particular? Thus far, the vision community has focused on…

Computer Vision and Pattern Recognition · Computer Science 2017-11-23 Ali Diba , Mohsen Fayyaz , Vivek Sharma , Amir Hossein Karami , Mohammad Mahdi Arzani , Rahman Yousefzadeh , Luc Van Gool

Recent advances of video captioning often employ a recurrent neural network (RNN) as the decoder. However, RNN is prone to diluting long-term information. Recent works have demonstrated memory network (MemNet) has the advantage of storing…

Computer Vision and Pattern Recognition · Computer Science 2020-02-28 Aming Wu , Yahong Han

Existing dominant approaches for cross-modal video-text retrieval task are to learn a joint embedding space to measure the cross-modal similarity. However, these methods rarely explore long-range dependency inside video frames or textual…

Multimedia · Computer Science 2020-04-13 Rui Zhao , Kecheng Zheng , Zheng-jun Zha

In this paper, the problem of describing visual contents of a video sequence with natural language is addressed. Unlike previous video captioning work mainly exploiting the cues of video contents to make a language description, we propose a…

Computer Vision and Pattern Recognition · Computer Science 2019-06-05 Wei Zhang , Bairui Wang , Lin Ma , Wei Liu

Deep Convolutional Neural Networks (CNNs) are powerful models that have achieved excellent performance on difficult computer vision tasks. Although CNNs perform well whenever large labeled training samples are available, they work badly on…

Computer Vision and Pattern Recognition · Computer Science 2021-06-03 Zhouyong Liu , Shun Luo , Wubin Li , Jingben Lu , Yufan Wu , Shilei Sun , Chunguo Li , Luxi Yang

In this paper, the problem of describing visual contents of a video sequence with natural language is addressed. Unlike previous video captioning work mainly exploiting the cues of video contents to make a language description, we propose a…

Computer Vision and Pattern Recognition · Computer Science 2018-04-02 Bairui Wang , Lin Ma , Wei Zhang , Wei Liu

Deep neural networks have achieved remarkable success for video-based action recognition. However, most of existing approaches cannot be deployed in practice due to the high computational cost. To address this challenge, we propose a new…

Computer Vision and Pattern Recognition · Computer Science 2020-06-18 Kun Liu , Wu Liu , Huadong Ma , Mingkui Tan , Chuang Gan

Connectionist temporal classification (CTC) is a popular sequence prediction approach for automatic speech recognition that is typically used with models based on recurrent neural networks (RNNs). We explore whether deep convolutional…

Computation and Language · Computer Science 2018-02-16 Kalpesh Krishna , Liang Lu , Kevin Gimpel , Karen Livescu

Although traditionally used in the machine translation field, the encoder-decoder framework has been recently applied for the generation of video and image descriptions. The combination of Convolutional and Recurrent Neural Networks in…

Computer Vision and Pattern Recognition · Computer Science 2016-12-13 Álvaro Peris , Marc Bolaños , Petia Radeva , Francisco Casacuberta

Contemporary deep learning based video captioning follows encoder-decoder framework. In encoder, visual features are extracted with 2D/3D Convolutional Neural Networks (CNNs) and a transformed version of those features is passed to the…

Computer Vision and Pattern Recognition · Computer Science 2019-11-22 Nayyer Aafaq , Naveed Akhtar , Wei Liu , Ajmal Mian

Unsupervised text embeddings extraction is crucial for text understanding in machine learning. Word2Vec and its variants have received substantial success in mapping words with similar syntactic or semantic meaning to vectors close to each…

Computation and Language · Computer Science 2018-05-30 Furong Huang , Animashree Anandkumar

A major obstacle to building models for effective semantic segmentation, and particularly video semantic segmentation, is a lack of large and well annotated datasets. This bottleneck is particularly prohibitive in highly specialized and…

Temporal convolutional networks (TCNs) are a commonly used architecture for temporal video segmentation. TCNs however, tend to suffer from over-segmentation errors and require additional refinement modules to ensure smoothness and temporal…

Computer Vision and Pattern Recognition · Computer Science 2021-05-25 Dipika Singhania , Rahul Rahaman , Angela Yao

Remote Sensing Image Captioning (RSIC) is the process of generating meaningful descriptions from remote sensing images. Recently, it has gained significant attention, with encoder-decoder models serving as the backbone for generating…

Computer Vision and Pattern Recognition · Computer Science 2025-07-04 Swadhin Das , Saarthak Gupta , Kamal Kumar , Raksha Sharma

Dense video captioning aims to generate multiple associated captions with their temporal locations from the video. Previous methods follow a sophisticated "localize-then-describe" scheme, which heavily relies on numerous hand-crafted…

Computer Vision and Pattern Recognition · Computer Science 2021-11-18 Teng Wang , Ruimao Zhang , Zhichao Lu , Feng Zheng , Ran Cheng , Ping Luo

Dense video captioning aims to generate text descriptions for all events in an untrimmed video. This involves both detecting and describing events. Therefore, all previous methods on dense video captioning tackle this problem by building…

Computer Vision and Pattern Recognition · Computer Science 2018-04-04 Luowei Zhou , Yingbo Zhou , Jason J. Corso , Richard Socher , Caiming Xiong
‹ Prev 1 2 3 10 Next ›