Related papers: Vision Encoder-Decoder Models for AI Coaching

Learning to Guide Decoding for Image Captioning

Recently, much advance has been made in image captioning, and an encoder-decoder framework has achieved outstanding performance for this task. In this paper, we propose an extension of the encoder-decoder framework by adding a component…

Computer Vision and Pattern Recognition · Computer Science 2018-04-04 Wenhao Jiang, Lin Ma, Xinpeng Chen, Hanwang Zhang, Wei Liu

A Comparison of Techniques for Language Model Integration in Encoder-Decoder Speech Recognition

Attention-based recurrent neural encoder-decoder models present an elegant solution to the automatic speech recognition problem. This approach folds the acoustic model, pronunciation model, and language model into a single network and…

Audio and Speech Processing · Electrical Eng. & Systems 2018-11-08 Shubham Toshniwal, Anjuli Kannan, Chung-Cheng Chiu, Yonghui Wu, Tara N Sainath, Karen Livescu

An Encoder-Decoder Based Audio Captioning System With Transfer and Reinforcement Learning

Automated audio captioning aims to use natural language to describe the content of audio data. This paper presents an audio captioning system with an encoder-decoder architecture, where the decoder predicts words based on audio features…

Audio and Speech Processing · Electrical Eng. & Systems 2021-08-06 Xinhao Mei, Qiushi Huang, Xubo Liu, Gengyun Chen, Jingqian Wu, Yusong Wu, Jinzheng Zhao, Shengchen Li, Tom Ko, H Lilian Tang, Xi Shao, Mark D. Plumbley, Wenwu Wang

UNIT: Unifying Image and Text Recognition in One Vision Encoder

Currently, vision encoder models like Vision Transformers (ViTs) typically excel at image recognition tasks but cannot simultaneously support text recognition like human visual recognition. To address this limitation, we propose UNIT, a…

Computer Vision and Pattern Recognition · Computer Science 2024-09-09 Yi Zhu, Yanpeng Zhou, Chunwei Wang, Yang Cao, Jianhua Han, Lu Hou, Hang Xu

MED-VT++: Unifying Multimodal Learning with a Multiscale Encoder-Decoder Video Transformer

In this paper, we present an end-to-end trainable unified multiscale encoder-decoder transformer that is focused on dense prediction tasks in video. The presented Multiscale Encoder-Decoder Video Transformer (MED-VT) uses multiscale…

Computer Vision and Pattern Recognition · Computer Science 2024-09-18 Rezaul Karim, He Zhao, Richard P. Wildes, Mennatullah Siam

Instruction-Following Agents with Multimodal Transformer

Humans are excellent at understanding language and vision to accomplish a wide range of tasks. In contrast, creating general instruction-following embodied agents remains a difficult challenge. Prior work that uses pure language-only models…

Computer Vision and Pattern Recognition · Computer Science 2023-03-28 Hao Liu, Lisa Lee, Kimin Lee, Pieter Abbeel

GIT: A Generative Image-to-text Transformer for Vision and Language

In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering. While generative models provide a consistent network architecture between…

Computer Vision and Pattern Recognition · Computer Science 2022-12-19 Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang

A Unified Model For Voice and Accent Conversion In Speech and Singing using Self-Supervised Learning and Feature Extraction

This paper presents a new voice conversion model capable of transforming both speaking and singing voices. It addresses key challenges in current systems, such as conveying emotions, managing pronunciation and accent changes, and…

Sound · Computer Science 2024-12-12 Sowmya Cheripally

Human-AI communication for human-human communication: Applying interpretable unsupervised anomaly detection to executive coaching

In this paper, we discuss the potential of applying unsupervised anomaly detection in constructing AI-based interactive systems that deal with highly contextual situations, i.e., human-human communication, in collaboration with domain…

Human-Computer Interaction · Computer Science 2022-06-23 Riku Arakawa, Hiromu Yakura

Reusing Discriminators for Encoding: Towards Unsupervised Image-to-Image Translation

Unsupervised image-to-image translation is a central task in computer vision. Current translation frameworks will abandon the discriminator once the training process is completed. This paper contends a novel role of the discriminator by…

Computer Vision and Pattern Recognition · Computer Science 2020-03-31 Runfa Chen, Wenbing Huang, Binghui Huang, Fuchun Sun, Bin Fang

End-to-End Video Captioning

Building correspondences across different modalities, such as video and language, has recently become critical in many visual recognition applications, such as video captioning. Inspired by machine translation, recent models tackle this…

Computer Vision and Pattern Recognition · Computer Science 2019-11-11 Silvio Olivastri, Gurkirt Singh, Fabio Cuzzolin

Image to Language Understanding: Captioning approach

Extracting context from visual representations is of utmost importance in the advancement of Computer Science. Representation of such a format in Natural Language has a huge variety of applications such as helping the visually impaired etc.…

Computer Vision and Pattern Recognition · Computer Science 2020-02-25 Madhavan Seshadri, Malavika Srikanth, Mikhail Belov

Machine translation considering context information using Encoder-Decoder model

In the task of machine translation, context information is one of the important factors. However, a model that considers context information has not been proposed. This paper proposes a new model which can integrate context information and make…

Computation and Language · Computer Science 2019-04-02 Tetsuto Takano, Satoshi Yamane

Uncovering Hidden Connections: Iterative Search and Reasoning for Video-grounded Dialog

In contrast to conventional visual question answering, video-grounded dialog necessitates a profound understanding of both dialog history and video content for accurate response generation. Despite commendable progress made by existing…

Computer Vision and Pattern Recognition · Computer Science 2025-03-13 Haoyu Zhang, Meng Liu, Yisen Feng, Yaowei Wang, Weili Guan, Liqiang Nie

Interpretability-Aware Vision Transformer

Vision Transformers (ViTs) have become prominent models for solving various vision tasks. However, the interpretability of ViTs has not kept pace with their promising performance. While there has been a surge of interest in developing…

Computer Vision and Pattern Recognition · Computer Science 2025-05-02 Yao Qiang, Chengyin Li, Prashant Khanduri, Dongxiao Zhu

Vision Transformers for Dense Prediction

We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks. We assemble tokens from various stages of the vision transformer into…

Computer Vision and Pattern Recognition · Computer Science 2021-03-26 René Ranftl, Alexey Bochkovskiy, Vladlen Koltun

Scalable AI Generative Content for Vehicular Network Semantic Communication

Perceiving vehicles in a driver's blind spot is vital for safe driving. The detection of potentially dangerous vehicles in these blind spots can benefit from vehicular network semantic communication technology. However, efficient semantic…

Artificial Intelligence · Computer Science 2023-11-27 Hao Feng, Yi Yang, Zhu Han

Emotion-Aware Transformer Encoder for Empathetic Dialogue Generation

Modern day conversational agents are trained to emulate the manner in which humans communicate. To emotionally bond with the user, these virtual agents need to be aware of the affective state of the user. Transformers are the recent state…

Sound · Computer Science 2022-04-26 Raman Goel, Seba Susan, Sachin Vashisht, Armaan Dhanda

Generative AI in Training and Coaching: Redefining the Design Process of Learning Materials

Generative artificial intelligence (GenAI) is transforming education, redefining the role of trainers and coaches in learning environments. In our study, we explore how AI integrates into the design process of learning materials, assessing…

Computers and Society · Computer Science 2026-03-31 Alexander Komar, Marc-André Heidelmann, Kristina Schaaff

TGAVC: Improving Autoencoder Voice Conversion with Text-Guided and Adversarial Training

Non-parallel many-to-many voice conversion remains an interesting but challenging speech processing task. Recently, AutoVC, a conditional autoencoder based method, achieved excellent conversion results by disentangling the speaker identity…

Sound · Computer Science 2022-08-09 Huaizhen Tang, Xulong Zhang, Jianzong Wang, Ning Cheng, Zhen Zeng, Edward Xiao, Jing Xiao