Related papers: VANPY: Voice Analysis Framework

NANSY++: Unified Voice Synthesis with Neural Analysis and Synthesis

Various applications of voice synthesis have been developed independently despite the fact that they generate "voice" as output in common. In addition, most of the voice synthesis models still require a large number of audio data paired…

Sound · Computer Science 2022-11-18 Hyeong-Seok Choi, Jinhyeok Yang, Juheon Lee, Hyeongju Kim

Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits

We introduce Vox-Profile, a comprehensive benchmark to characterize rich speaker and speech traits using speech foundation models. Unlike existing works that focus on a single dimension of speaker traits, Vox-Profile provides holistic and…

Sound · Computer Science 2025-05-21 Tiantian Feng, Jihwan Lee, Anfeng Xu, Yoonjeong Lee, Thanathai Lertpetchpun, Xuan Shi, Helin Wang, Thomas Thebaud, Laureano Moro-Velazquez, Dani Byrd, Najim Dehak, Shrikanth Narayanan

DeepFense: A Unified, Modular, and Extensible Framework for Robust Deepfake Audio Detection

Speech deepfake detection is a well-established research field with different models, datasets, and training strategies. However, the lack of standardized implementations and evaluation protocols limits reproducibility, benchmarking, and…

Sound · Computer Science 2026-04-10 Yassine El Kheir, Arnab Das, Yixuan Xiao, Xin Wang, Feidi Kallel, Enes Erdem Erdogan, Ngoc Thang Vu, Tim Polzehl, Sebastian Moeller

Profiling the Voice: Speaker-Specific Phoneme Fingerprinting for Speech Deepfake Detection

The rapid advancement of generative AI has made audio deepfakes increasingly indistinguishable from authentic human vocals, posing significant threats to persons-of-interest (POI) such as public figures. Current detection systems primarily…

Sound · Computer Science 2026-05-19 Jun Xue, Tong Zhang, Zhuolin Yi, Yihuan Huang, Yi Chai, Yiyang Zhang, Yanzhen Ren

SpeechPy - A Library for Speech Processing and Recognition

SpeechPy is an open source Python package that contains speech preprocessing techniques, speech features, and important post-processing operations. It provides the most frequently used speech features, including MFCCs and filterbank energies…

Sound · Computer Science 2018-07-25 Amirsina Torfi

VoicePAT: An Efficient Open-source Evaluation Toolkit for Voice Privacy Research

Speaker anonymization is the task of modifying a speech recording such that the original speaker cannot be identified anymore. Since the first Voice Privacy Challenge in 2020, along with the release of a framework, the popularity of this…

Sound · Computer Science 2023-12-25 Sarina Meyer, Xiaoxiao Miao, Ngoc Thang Vu

Shennong: a Python toolbox for audio speech features extraction

We introduce Shennong, a Python toolbox and command-line utility for speech features extraction. It implements a wide range of well-established state-of-the-art algorithms, including spectro-temporal filters such as Mel-Frequency Cepstral…

Computation and Language · Computer Science 2023-02-09 Mathieu Bernard, Maxime Poli, Julien Karadayi, Emmanuel Dupoux

An Ensemble Framework of Voice-Based Emotion Recognition System for Films and TV Programs

Employing a voice-based emotion recognition function in artificial intelligence (AI) products will improve the user experience. Most existing research focuses only on speech collected under controlled conditions. The…

Audio and Speech Processing · Electrical Eng. & Systems 2018-03-06 Fei Tao, Gang Liu, Qingen Zhao

Neural Analysis and Synthesis: Reconstructing Speech from Self-Supervised Representations

We present a neural analysis and synthesis (NANSY) framework that can manipulate voice, pitch, and speed of an arbitrary speech signal. Most of the previous works have focused on using information bottleneck to disentangle analysis features…

Sound · Computer Science 2021-10-29 Hyeong-Seok Choi, Juheon Lee, Wansoo Kim, Jie Hwan Lee, Hoon Heo, Kyogu Lee

AffectSpeech: A Large-Scale Emotional Speech Dataset with Fine-Grained Textual Descriptions for Speech Emotion Captioning and Synthesis

Emotion is essential in spoken communication, yet most existing frameworks in speech emotion modeling rely on predefined categories or low-dimensional continuous attributes, which offer limited expressive capacity. Recent advances in speech…

Audio and Speech Processing · Electrical Eng. & Systems 2026-04-07 Tianhua Qi, Wenming Zheng, Björn W. Schuller, Zhaojie Luo, Haizhou Li

QuaPy: A Python-Based Framework for Quantification

QuaPy is an open-source framework for performing quantification (a.k.a. supervised prevalence estimation), written in Python. Quantification is the task of training quantifiers via supervised learning, where a quantifier is a predictor that…

Machine Learning · Computer Science 2021-06-22 Alejandro Moreo, Andrea Esuli, Fabrizio Sebastiani

DeepTalk: Vocal Style Encoding for Speaker Recognition and Speech Synthesis

Automatic speaker recognition algorithms typically characterize speech audio using short-term spectral features that encode the physiological and anatomical aspects of speech production. Such algorithms do not fully capitalize on…

Sound · Computer Science 2021-02-16 Anurag Chowdhury, Arun Ross, Prabu David

TypeEvalPy: A Micro-benchmarking Framework for Python Type Inference Tools

In light of the growing interest in type inference research for Python, both researchers and practitioners require a standardized process to assess the performance of various type inference techniques. This paper introduces TypeEvalPy, a…

Software Engineering · Computer Science 2024-01-03 Ashwin Prasad Shivarpatna Venkatesh, Samkutty Sabu, Jiawei Wang, Amir M. Mir, Li Li, Eric Bodden

A Scalable Pipeline for Enabling Non-Verbal Speech Generation and Understanding

Non-verbal Vocalizations (NVs), such as laughter and sighs, are vital for conveying emotion and intention in human speech, yet most existing speech systems neglect them, which severely compromises communicative richness and emotional…

Sound · Computer Science 2026-01-14 Runchuan Ye, Yixuan Zhou, Renjie Yu, Zijian Lin, Kehan Li, Xiang Li, Xin Liu, Guoyang Zeng, Zhiyong Wu

Learning Disentangled Speech Representations

Disentangled representation learning in speech processing has lagged behind other domains, largely due to the lack of datasets with annotated generative factors for robust evaluation. To address this, we propose SynSpeech, a novel…

Audio and Speech Processing · Electrical Eng. & Systems 2025-01-14 Yusuf Brima, Ulf Krumnack, Simone Pika, Gunther Heidemann

VoxEffects: A Speech-Oriented Audio Effects Dataset and Benchmark

Speech audio in the wild is often processed by post-production effects, but existing speech datasets rarely provide precise annotations of effects and parameters, limiting systematic study. We introduce VoxEffects, a speech audio effects…

Audio and Speech Processing · Electrical Eng. & Systems 2026-04-15 Zhe Zhang, Yigitcan Özer, Junichi Yamagishi

Deep Audio Analyzer: a Framework to Industrialize the Research on Audio Forensics

Deep Audio Analyzer is an open source speech framework that aims to simplify the research and the development process of neural speech processing pipelines, allowing users to conceive, compare and share results in a fast and reproducible…

Sound · Computer Science 2023-10-31 Valerio Francesco Puglisi, Oliver Giudice, Sebastiano Battiato

DigiVoice: Voice Biomarker Featurization and Analysis Pipeline

In recent years, data-driven models have enabled significant advances in medicine. Simultaneously, voice has shown potential for analysis in precision medicine as a biomarker for screening illnesses. There has been a growing trend to pursue…

Audio and Speech Processing · Electrical Eng. & Systems 2019-06-20 Larry Zhang, Xiaotong Chen, Abbad Vakil, Ali Byott, Reza Hosseini Ghomi

RMVPE: A Robust Model for Vocal Pitch Estimation in Polyphonic Music

Vocal pitch is an important high-level feature in music audio processing. However, extracting vocal pitch in polyphonic music is more challenging due to the presence of accompaniment. To eliminate the influence of the accompaniment, most…

Sound · Computer Science 2024-01-09 Haojie Wei, Xueke Cao, Tangpeng Dan, Yueguo Chen

VAANI: Capturing the language landscape for an inclusive digital India

Voice based technologies have the potential to bridge digital accessibility gaps; however, existing datasets fail to capture the linguistic and regional diversity of Indic languages. We present Project VAANI, a large scale multimodal…

Audio and Speech Processing · Electrical Eng. & Systems 2026-05-28 Sujith Pulikodan, Abhayjeet Singh, Agneedh Basu, Nihar Desai, Pavan Kumar J, Pranav D Bhat, Raghu Dharmaraju, Ritika Gupta, Sathvik Udupa, Saurabh Kumar, Sumit Sharma, Visruth Sanka, Dinesh Tewari, Harsh Dhand, Amrita Kamat, Sukhwinder Singh, Shikhar Vashishth, Partha Talukdar, Raj Acharya, Prasanta Kumar Ghosh