Computer Science

Audio Jailbreaks in Large Audio-Language Models: Taxonomy, Attack-Defense Analysis, and Cost-Aware Evaluation

Large Audio Language Models (LALMs) expand jailbreak risks from token-level prompting to the full speech perception-to-reasoning pipeline, where unsafe behavior can be induced through semantics, acoustic style, signal artifacts, or internal…

Sound · Computer Science 2026-05-29 Bo-Han Feng , Yu-Hsuan Li Liang , Chien-Feng Liu , You-Hsuan Chang , Yun-Nung Chen

HoliTok:A Coutinuous Holistic Tokenization with Robust Dual Capabilities of Speech Generation and Understanding

Unified speech foundation models require a holistic tokenization space that is both learnable by language models and decodable into high-quality waveforms. Existing speech tokenizers, however, often fail to satisfy these requirements…

Sound · Computer Science 2026-05-29 Bohan Li , Shi Lian , Hankun Wang , Yiwei Guo , Yu Xi , Zhihan Li , Da Zheng , Colin Zhang , Kai Yu

COMET: Concept Space Dissection of the Modality Gap in Audio-Text Multimodal Contrastive Embeddings

Contrastive Language-Audio Pretraining (CLAP) models are widely used for audio understanding and support modality-agnostic condition swapping in many zero-shot applications. However, their performance is heavily affected by the modality gap…

Sound · Computer Science 2026-05-29 Yonggang Zhu , Liting Gao , Aidong Men , Wenwu Wang

The Complexity of Verifying Feedforward Neural Networks in Quantised Settings

We investigate the computational complexity of neural network verification in quantised settings. We distinguish three classes of Feedforward Neural Networks (FNNs): rational FNNs with exact rational weights, quantised FNNs whose weights…

Computational Complexity · Computer Science 2026-05-29 Eric Alsmann , Martin Lange , Marco Sälzer

Audio Deepfake Detection with Half-Truth Localisation Using Cross-Attentive Feature Fusion

Audio deepfake detection is well-studied as a binary problem, but partially manipulated speech, where a short synthesised segment is spliced into an otherwise genuine utterance, poses a harder and more realistic threat. Detecting such…

Sound · Computer Science 2026-05-29 S. Sutharya , Remya K. Sasi

ChildVox: A Speech, Audio, and Large Audio-Language Model Benchmark in Understanding and Characterizing Sound across Childhood

We present ChildVox, a novel benchmark for characterizing the diverse acoustic signals through which children communicate. Specifically, ChildVox follows the full developmental trajectory from birth through school age, covering…

Sound · Computer Science 2026-05-29 Tiantian Feng , Anfeng Xu , Xuan Shi , Aditya Kommineni , Shakhrul Iman Siam , Megan Micheletti , Zhonghao Shi , Helen Tager-Flusberg , Mi Zhang , Lynn K. Perry , Catherine Lord , Daniel Messinger , Shrikanth Narayanan

EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents

Voice agents, artificial intelligence systems that conduct spoken conversations to complete tasks, are increasingly deployed across enterprise applications. However, no existing benchmark jointly addresses two core evaluation challenges:…

Sound · Computer Science 2026-05-29 Tara Bogavelli , Gabrielle Gauthier Melançon , Katrina Stankiewicz , Oluwanifemi Bamgbose , Fanny Riols , Hoang H. Nguyen , Raghav Mehndiratta , Lindsay Devon Brin , Joseph Marinier , Hari Subramani , Anil Madamala , Sridhar Krishna Nemala , Srinivas Sunkara

MedMosaic: A Challenging Large Scale Benchmark of Diverse Medical Audio

Medical audio data is difficult to collect due to privacy regulations and high annotation costs arising from domain expertise. Thus, existing benchmarks tend to underrepresent complex medical audio scenarios. To address this challenge, we…

Sound · Computer Science 2026-05-29 Harshit Rajgarhia , Shuubham Ojha , Asif Shaik , Akhil Pothanapalli , Rachuri Lokesh , Abhishek Mukherji , Prasanna Desikan

BEAT: Tokenizing and Generating Symbolic Music by Uniform Temporal Steps

Tokenizing music to fit the general framework of language models is a compelling challenge, especially considering the diverse symbolic structures in which music can be represented (e.g., sequences, grids, and graphs). To date, most…

Sound · Computer Science 2026-05-29 Lekai Qian , Haoyu Gu , Jingwei Zhao , Ziyu Wang

EvA: An Evidence-First Audio Understanding Paradigm for LALMs

Large Audio Language Models (LALMs) still struggle in complex acoustic scenes because they often fail to preserve task-relevant acoustic evidence before reasoning begins. We identify this error pattern as the evidence bottleneck:…

Sound · Computer Science 2026-05-29 Xinyuan Xie , Shunian Chen , Zhiheng Liu , Yuhao Zhang , Zhiqiang Lv , Liyin Liang , Benyou Wang

AG-REPA: Causal Layer Selection for Representation Alignment in Audio Flow Matching

REPresentation Alignment (REPA) improves the training of generative flow models by aligning intermediate hidden states with pretrained teacher features, but its effectiveness in token-conditioned audio Flow Matching critically depends on…

Sound · Computer Science 2026-05-29 Pengfei Zhang , Tianxin Xie , Minghao Yang , Li Liu

OmniCustom: Sync Audio-Video Customization Via Joint Audio-Video Generation Model

Existing mainstream video customization methods focus on generating identity-consistent videos based on given reference images and textual prompts. Benefiting from the rapid advancement of joint audio-video generation, this paper proposes a…

Sound · Computer Science 2026-05-29 Maomao Li , Zhen Li , Kaipeng Zhang , Guosheng Yin , Zhifeng Li , Dong Xu

Beyond Transcripts: A Renewed Perspective on Audio Chaptering

Audio chaptering, the task of segmenting long-form audio into coherent sections, is increasingly important for navigating podcasts, lectures, and videos. Despite its relevance, research remains limited and text-based, leaving key questions…

Sound · Computer Science 2026-05-29 Fabian Retkowski , Maike Züfle , Thai Binh Nguyen , Jan Niehues , Alexander Waibel

Evaluating and Rewarding LALMs for Expressive Role-Play TTS via Mean Continuation Log-Probability

Recent advances in Large Audio Language Models (LALMs) have extended Text-to-Speech (TTS) to interactive role-play scenarios, which demand high expressiveness and strict adherence to role-play instructions. However, existing models struggle…

Sound · Computer Science 2026-05-29 Yong Ren , Jingbei Li , Haiyang Sun , Yujie Chen , Cheng Yi , Yechang Huang , Hao Gu , Ye Bai , Xuerui Yang

SegTune: Structured and Fine-Grained Control for Song Generation

Recent advancements in song generation have shown promising results in generating songs from lyrics and/or global text prompts. However, most existing systems lack the ability to model the temporally varying attributes of songs, limiting…

Sound · Computer Science 2026-05-29 Pengfei Cai , Joanna Wang , Haorui Zheng , Xu Li , Zihao Ji , Teng Ma , Zhongliang Liu , Chen Zhang , Pengfei Wan

An Extensive Analysis of the Singing Voice Conversion Challenge 2025 Evaluation Results

We present a thorough analysis of the findings of the latest iteration of the Singing Voice Conversion Challenge, a scientific event aiming to compare and understand different voice conversion systems in a controlled environment. Compared…

Sound · Computer Science 2026-05-29 Lester Phillip Violeta , Xueyao Zhang , Jiatong Shi , Yusuke Yasuda , Wen-Chin Huang , Zhizheng Wu , Tomoki Toda

Weakly Supervised Detection and Temporal Localization of Whale Calls in Long-Duration Bioacoustic Data

Passive acoustic monitoring (PAM) systems generate continuous recordings spanning months, yet automated bioacoustic analysis of whale calls requires two separate annotation efforts: binary presence labels for classification and precise…

Sound · Computer Science 2026-05-29 Ragib Amin Nihal , Benjamin Yen , Runwu Shi , Takeshi Ashizawa , Kazuhiro Nakadai

Simulating Polynomial-Time Nondeterministic Turing Machines via Nondeterministic Turing Machines

We prove in this paper that there is a language $L_s$ accepted by some nondeterministic Turing machine that runs within time $O(n^k)$ for any positive integer $k\in\mathbb{N}_1$ but not by any ${\rm co}\mathcal{NP}$ machines. Then we…

Computational Complexity · Computer Science 2026-05-29 Tianrong Lin

Cross-modal characterization of infant cry: validation of a chest-surface accelerometer in extracting acoustic vocal function measures

Background: Infant cry acoustics provide a promising window into early neurodevelopment and may serve as scalable biomarkers for neurodevelopmental disorders. However, conventional microphone-based recordings are highly susceptible to…

Sound · Computer Science 2026-05-28 Winko W. An , Saketh Sundar , Lisa Yankowitz , Daryush D. Mehta , Carol L. Wilkinson

DEMON: Diffusion Engine for Musical Orchestrated Noise

We present DEMON, a real-time diffusion engine that makes the denoising process playable as a live musical instrument: a control surface both broad (many parameters shaped per-frame across the output) and responsive (each control taking…

Sound · Computer Science 2026-05-28 Ryan Fosdick