Related papers: Multi-Source Spatial Knowledge Understanding for I…

Multi-modal and Multi-scale Spatial Environment Understanding for Immersive Visual Text-to-Speech

Visual Text-to-Speech (VTTS) aims to take the environmental image as the prompt to synthesize the reverberant speech for the spoken content. The challenge of this task lies in understanding the spatial environment from the image. Many…

Computer Vision and Pattern Recognition · Computer Science 2025-01-16 Rui Liu, Shuwei He, Yifan Hu, Haizhou Li

I2TTS: Image-indicated Immersive Text-to-speech Synthesis with Spatial Perception

Controlling the style and characteristics of speech synthesis is crucial for adapting the output to specific contexts and user requirements. Previous Text-to-speech (TTS) works have focused primarily on the technical aspects of producing…

Sound · Computer Science 2025-09-04 Jiawei Zhang, Tian-Hao Zhang, Jun Wang, Jiaran Gao, Xinyuan Qian, Xu-Cheng Yin

Environment Aware Text-to-Speech Synthesis

This study aims at designing an environment-aware text-to-speech (TTS) system that can generate speech to suit specific acoustic environments. It is also motivated by the desire to leverage massive data of speech audio from heterogeneous…

Audio and Speech Processing · Electrical Eng. & Systems 2022-08-09 Daxin Tan, Guangyan Zhang, Tan Lee

SSR: Enhancing Depth Perception in Vision-Language Models via Rationale-Guided Spatial Reasoning

Despite impressive advancements in Visual-Language Models (VLMs) for multi-modal tasks, their reliance on RGB inputs limits precise spatial understanding. Existing methods for integrating spatial cues, such as point clouds or depth, either…

Computer Vision and Pattern Recognition · Computer Science 2025-10-27 Yang Liu, Ming Ma, Xiaomin Yu, Pengxiang Ding, Han Zhao, Mingyang Sun, Siteng Huang, Donglin Wang

M2-CTTS: End-to-End Multi-scale Multi-modal Conversational Text-to-Speech Synthesis

Conversational text-to-speech (TTS) aims to synthesize speech with proper prosody of reply based on the historical conversation. However, it is still a challenge to comprehensively model the conversation, and a majority of conversational…

Sound · Computer Science 2023-05-04 Jinlong Xue, Yayue Deng, Fengping Wang, Ya Li, Yingming Gao, Jianhua Tao, Jianqing Sun, Jiaen Liang

Enhancing Spatial Reasoning through Visual and Textual Thinking

The spatial reasoning task aims to reason about the spatial relationships in 2D and 3D space, which is a fundamental capability for Visual Question Answering (VQA) and robotics. Although vision language models (VLMs) have developed rapidly…

Computer Vision and Pattern Recognition · Computer Science 2025-07-29 Xun Liang, Xin Guo, Zhongming Jin, Weihang Pan, Penghui Shang, Deng Cai, Binbin Lin, Jieping Ye

A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech

Recent Text-to-Speech (TTS) systems trained on reading or acted corpora have achieved near human-level naturalness. The diversity of human speech, however, often goes beyond the coverage of these corpora. We believe the ability to handle…

Audio and Speech Processing · Electrical Eng. & Systems 2023-02-09 Li-Wei Chen, Shinji Watanabe, Alexander Rudnicky

ViT-TTS: Visual Text-to-Speech with Scalable Diffusion Transformer

Text-to-speech (TTS) has undergone remarkable improvements in performance, particularly with the advent of Denoising Diffusion Probabilistic Models (DDPMs). However, the perceived quality of audio depends not solely on its content, pitch,…

Audio and Speech Processing · Electrical Eng. & Systems 2024-04-23 Huadai Liu, Rongjie Huang, Xuan Lin, Wenqiang Xu, Maozong Zheng, Hong Chen, Jinzheng He, Zhou Zhao

Synergistic Dual Spatial-aware Generation of Image-to-Text and Text-to-Image

In the visual spatial understanding (VSU) area, spatial image-to-text (SI2T) and spatial text-to-image (ST2I) are two fundamental tasks that appear in dual form. Existing methods for standalone SI2T or ST2I perform imperfectly in spatial…

Computer Vision and Pattern Recognition · Computer Science 2025-09-03 Yu Zhao, Hao Fei, Xiangtai Li, Libo Qin, Jiayi Ji, Hongyuan Zhu, Meishan Zhang, Min Zhang, Jianguo Wei

VisualSpeech: Enhancing Prosody Modeling in TTS Using Video

Text-to-Speech (TTS) synthesis faces the inherent challenge of producing multiple speech outputs with varying prosody given a single text input. While previous research has addressed this by predicting prosodic information from both text…

Computation and Language · Computer Science 2025-08-19 Shumin Que, Anton Ragni

VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion

Though significant progress has been made for speaker-dependent Video-to-Speech (VTS) synthesis, little attention is devoted to multi-speaker VTS that can map silent video to speech, while allowing flexible control of speaker identity, all…

Audio and Speech Processing · Electrical Eng. & Systems 2022-02-21 Disong Wang, Shan Yang, Dan Su, Xunying Liu, Dong Yu, Helen Meng

vTTS: visual-text to speech

This paper proposes visual-text to speech (vTTS), a method for synthesizing speech from visual text (i.e., text as an image). Conventional TTS converts phonemes or characters into discrete symbols and synthesizes a speech waveform from…

Sound · Computer Science 2022-03-29 Yoshifumi Nakano, Takaaki Saeki, Shinnosuke Takamichi, Katsuhito Sudoh, Hiroshi Saruwatari

UmbraTTS: Adapting Text-to-Speech to Environmental Contexts with Flow Matching

Recent advances in Text-to-Speech (TTS) have enabled highly natural speech synthesis, yet integrating speech with complex background environments remains challenging. We introduce UmbraTTS, a flow-matching based TTS model that jointly…

Sound · Computer Science 2025-07-14 Neta Glazer, Aviv Navon, Yael Segal, Aviv Shamsian, Hilit Segev, Asaf Buchnick, Menachem Pirchi, Gil Hetz, Joseph Keshet

Incremental Disentanglement for Environment-Aware Zero-Shot Text-to-Speech Synthesis

This paper proposes an Incremental Disentanglement-based Environment-Aware zero-shot text-to-speech (TTS) method, dubbed IDEA-TTS, that can synthesize speech for unseen speakers while preserving the acoustic characteristics of a given…

Audio and Speech Processing · Electrical Eng. & Systems 2024-12-24 Ye-Xin Lu, Hui-Peng Du, Zheng-Yan Sheng, Yang Ai, Zhen-Hua Ling

Visual-Aware Text-to-Speech

Dynamically synthesizing talking speech that actively responds to a listening head is critical during the face-to-face interaction. For example, the speaker could take advantage of the listener's facial expression to adjust the tones,…

Audio and Speech Processing · Electrical Eng. & Systems 2023-06-22 Mohan Zhou, Yalong Bai, Wei Zhang, Ting Yao, Tiejun Zhao, Tao Mei

More than Words: In-the-Wild Visually-Driven Prosody for Text-to-Speech

In this paper we present VDTTS, a Visually-Driven Text-to-Speech model. Motivated by dubbing, VDTTS takes advantage of video frames as an additional input alongside text, and generates speech that matches the video signal. We demonstrate…

Computer Vision and Pattern Recognition · Computer Science 2022-03-25 Michael Hassid, Michelle Tadmor Ramanovich, Brendan Shillingford, Miaosen Wang, Ye Jia, Tal Remez

Imaginary Voice: Face-styled Diffusion Model for Text-to-Speech

The goal of this work is zero-shot text-to-speech synthesis, with speaking styles and voices learnt from facial characteristics. Inspired by the natural fact that people can imagine the voice of someone when they look at his or her face, we…

Machine Learning · Computer Science 2023-02-28 Jiyoung Lee, Joon Son Chung, Soo-Whan Chung

DSE-TTS: Dual Speaker Embedding for Cross-Lingual Text-to-Speech

Although high-fidelity speech can be obtained for intralingual speech synthesis, cross-lingual text-to-speech (CTTS) is still far from satisfactory as it is difficult to accurately retain the speaker timbres (i.e. speaker similarity) and…

Sound · Computer Science 2023-06-27 Sen Liu, Yiwei Guo, Chenpeng Du, Xie Chen, Kai Yu

Spatial-ViLT: Enhancing Visual Spatial Reasoning through Multi-Task Learning

Vision-language models (VLMs) have advanced multimodal reasoning but still face challenges in spatial reasoning for 3D scenes and complex object configurations. To address this, we introduce SpatialViLT, an enhanced VLM that integrates…

Computer Vision and Pattern Recognition · Computer Science 2025-10-07 Chashi Mahiul Islam, Oteo Mamo, Samuel Jacob Chacko, Xiuwen Liu, Weikuan Yu

MahaTTS: A Unified Framework for Multilingual Text-to-Speech Synthesis

Current Text-to-Speech models pose a multilingual challenge, where most of the models traditionally focus on English and European languages, thereby hurting the potential to provide access to information to many more people. To address this…

Audio and Speech Processing · Electrical Eng. & Systems 2025-08-21 Jaskaran Singh, Amartya Roy Chowdhury, Raghav Prabhakar, Varshul C. W