Related papers: Vision Encoder-Decoder Models for AI Coaching

200 papers

Recently, much advance has been made in image captioning, and an encoder-decoder framework has achieved outstanding performance for this task. In this paper, we propose an extension of the encoder-decoder framework by adding a component…

Computer Vision and Pattern Recognition · Computer Science 2018-04-04 Wenhao Jiang, Lin Ma, Xinpeng Chen, Hanwang Zhang, Wei Liu

Attention-based recurrent neural encoder-decoder models present an elegant solution to the automatic speech recognition problem. This approach folds the acoustic model, pronunciation model, and language model into a single network and…

Audio and Speech Processing · Electrical Eng. & Systems 2018-11-08 Shubham Toshniwal, Anjuli Kannan, Chung-Cheng Chiu, Yonghui Wu, Tara N Sainath, Karen Livescu

Automated audio captioning aims to use natural language to describe the content of audio data. This paper presents an audio captioning system with an encoder-decoder architecture, where the decoder predicts words based on audio features…

Audio and Speech Processing · Electrical Eng. & Systems 2021-08-06 Xinhao Mei, Qiushi Huang, Xubo Liu, Gengyun Chen, Jingqian Wu, Yusong Wu, Jinzheng Zhao, Shengchen Li, Tom Ko, H Lilian Tang, Xi Shao, Mark D. Plumbley, Wenwu Wang

Currently, vision encoder models like Vision Transformers (ViTs) typically excel at image recognition tasks but cannot simultaneously support text recognition like human visual recognition. To address this limitation, we propose UNIT, a…

Computer Vision and Pattern Recognition · Computer Science 2024-09-09 Yi Zhu, Yanpeng Zhou, Chunwei Wang, Yang Cao, Jianhua Han, Lu Hou, Hang Xu

In this paper, we present an end-to-end trainable unified multiscale encoder-decoder transformer that is focused on dense prediction tasks in video. The presented Multiscale Encoder-Decoder Video Transformer (MED-VT) uses multiscale…

Computer Vision and Pattern Recognition · Computer Science 2024-09-18 Rezaul Karim, He Zhao, Richard P. Wildes, Mennatullah Siam

Humans are excellent at understanding language and vision to accomplish a wide range of tasks. In contrast, creating general instruction-following embodied agents remains a difficult challenge. Prior work that uses pure language-only models…

Computer Vision and Pattern Recognition · Computer Science 2023-03-28 Hao Liu, Lisa Lee, Kimin Lee, Pieter Abbeel

In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering. While generative models provide a consistent network architecture between…

Computer Vision and Pattern Recognition · Computer Science 2022-12-19 Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang

This paper presents a new voice conversion model capable of transforming both speaking and singing voices. It addresses key challenges in current systems, such as conveying emotions, managing pronunciation and accent changes, and…

Sound · Computer Science 2024-12-12 Sowmya Cheripally

In this paper, we discuss the potential of applying unsupervised anomaly detection in constructing AI-based interactive systems that deal with highly contextual situations, i.e., human-human communication, in collaboration with domain…

Human-Computer Interaction · Computer Science 2022-06-23 Riku Arakawa, Hiromu Yakura

Unsupervised image-to-image translation is a central task in computer vision. Current translation frameworks will abandon the discriminator once the training process is completed. This paper contends a novel role of the discriminator by…

Computer Vision and Pattern Recognition · Computer Science 2020-03-31 Runfa Chen, Wenbing Huang, Binghui Huang, Fuchun Sun, Bin Fang

Building correspondences across different modalities, such as video and language, has recently become critical in many visual recognition applications, such as video captioning. Inspired by machine translation, recent models tackle this…

Computer Vision and Pattern Recognition · Computer Science 2019-11-11 Silvio Olivastri, Gurkirt Singh, Fabio Cuzzolin

Extracting context from visual representations is of utmost importance in the advancement of Computer Science. Representation of such a format in Natural Language has a huge variety of applications such as helping the visually impaired etc.…

Computer Vision and Pattern Recognition · Computer Science 2020-02-25 Madhavan Seshadri, Malavika Srikanth, Mikhail Belov

In the task of machine translation, context information is one of the important factors. However, a model that considers context information has not been proposed. The paper proposes a new model which can integrate context information and make…

Computation and Language · Computer Science 2019-04-02 Tetsuto Takano, Satoshi Yamane

In contrast to conventional visual question answering, video-grounded dialog necessitates a profound understanding of both dialog history and video content for accurate response generation. Despite commendable progress made by existing…

Computer Vision and Pattern Recognition · Computer Science 2025-03-13 Haoyu Zhang, Meng Liu, Yisen Feng, Yaowei Wang, Weili Guan, Liqiang Nie

Vision Transformers (ViTs) have become prominent models for solving various vision tasks. However, the interpretability of ViTs has not kept pace with their promising performance. While there has been a surge of interest in developing…

Computer Vision and Pattern Recognition · Computer Science 2025-05-02 Yao Qiang, Chengyin Li, Prashant Khanduri, Dongxiao Zhu

We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks. We assemble tokens from various stages of the vision transformer into…

Computer Vision and Pattern Recognition · Computer Science 2021-03-26 René Ranftl, Alexey Bochkovskiy, Vladlen Koltun

Perceiving vehicles in a driver's blind spot is vital for safe driving. The detection of potentially dangerous vehicles in these blind spots can benefit from vehicular network semantic communication technology. However, efficient semantic…

Artificial Intelligence · Computer Science 2023-11-27 Hao Feng, Yi Yang, Zhu Han

Modern day conversational agents are trained to emulate the manner in which humans communicate. To emotionally bond with the user, these virtual agents need to be aware of the affective state of the user. Transformers are the recent state…

Sound · Computer Science 2022-04-26 Raman Goel, Seba Susan, Sachin Vashisht, Armaan Dhanda

Generative artificial intelligence (GenAI) is transforming education, redefining the role of trainers and coaches in learning environments. In our study, we explore how AI integrates into the design process of learning materials, assessing…

Computers and Society · Computer Science 2026-03-31 Alexander Komar, Marc-André Heidelmann, Kristina Schaaff

Non-parallel many-to-many voice conversion remains an interesting but challenging speech processing task. Recently, AutoVC, a conditional autoencoder based method, achieved excellent conversion results by disentangling the speaker identity…

Sound · Computer Science 2022-08-09 Huaizhen Tang, Xulong Zhang, Jianzong Wang, Ning Cheng, Zhen Zeng, Edward Xiao, Jing Xiao