Yichen Han
Prompt engineering is crucial for fully leveraging large language models (LLMs), yet most existing optimization methods follow a single trajectory, resulting in limited adaptability, gradient conflicts, and high computational overhead. We…
Existing Large Language Model (LLM) based autoregressive (AR) text-to-speech (TTS) systems, while achieving state-of-the-art quality, still face critical challenges. The foundation of this LLM-based paradigm is the discretization of the…
High-fidelity neural audio codecs in Text-to-speech (TTS) aim to compress speech signals into discrete representations for faithful reconstruction. However, prior approaches faced challenges in effectively disentangling acoustic and…
Text-to-speech (TTS) synthesis has seen renewed progress under the discrete modeling paradigm. Existing autoregressive approaches often rely on single-codebook representations, which suffer from significant information loss. Even with…
While standalone Voice Assistants (VAs) are promising to support older adults' daily routine and wellbeing management, onboarding and setting up these devices can be challenging. Although some older adults choose to seek assistance from…
Speech emotion recognition (SER) systems aim to recognize human emotional state during human-computer interaction. Most existing SER systems are trained based on utterance-level labels. However, not all frames in an audio have affective…
Conversational speech synthesis (CSS) incorporates historical dialogue as supplementary information with the aim of generating speech that has dialogue-appropriate prosody. While previous methods have already delved into enhancing context…
While voice user interfaces offer increased accessibility due to hands-free and eyes-free interactions, older adults often have challenges such as constructing structured requests and perceiving how such devices operate. Voice-first user…
Audio driven talking head synthesis is a challenging task that attracts increasing attention in recent years. Although existing methods based on 2D landmarks or 3D face models can synthesize accurate lip synchronization and rhythmic head…
Population aging is an increasingly important consideration for health care in the 21th century, and continuing to have access and interact with digital health information is a key challenge for aging populations. Voice-based Intelligent…
In recent years, neural network based methods for multi-speaker text-to-speech synthesis (TTS) have made significant progress. However, the current speaker encoder models used in these methods still cannot capture enough speaker…