Computer Science

On abelian periodicity of purely morphic words

The decidability of periodicity of infinite words generated by morphisms is a classical result in combinatorics on words, established in the 1980s by Harju, Linna and Pansiot. In this paper, we are interested in this question in the abelian setting. Two words are…

Discrete Mathematics · Computer Science 2026-05-29 Arina Filimonova , Svetlana Puzynina

Unveiling the Visual Counting Bottleneck in Vision-Language Models

While Large Vision-Language Models (VLMs) excel at interpolation, they suffer catastrophic failures in systematic generalization, most notably in visual counting. In this work, we investigate this extrapolation bottleneck by deconstructing…

Multimedia · Computer Science 2026-05-29 Xingzhou Pang , Yifan Hou , Junling Wang , Mrinmaya Sachan

Audio Jailbreaks in Large Audio-Language Models: Taxonomy, Attack-Defense Analysis, and Cost-Aware Evaluation

Large Audio Language Models (LALMs) expand jailbreak risks from token-level prompting to the full speech perception-to-reasoning pipeline, where unsafe behavior can be induced through semantics, acoustic style, signal artifacts, or internal…

Sound · Computer Science 2026-05-29 Bo-Han Feng , Yu-Hsuan Li Liang , Chien-Feng Liu , You-Hsuan Chang , Yun-Nung Chen

HoliTok: A Continuous Holistic Tokenization with Robust Dual Capabilities of Speech Generation and Understanding

Unified speech foundation models require a holistic tokenization space that is both learnable by language models and decodable into high-quality waveforms. Existing speech tokenizers, however, often fail to satisfy these requirements…

Sound · Computer Science 2026-05-29 Bohan Li , Shi Lian , Hankun Wang , Yiwei Guo , Yu Xi , Zhihan Li , Da Zheng , Colin Zhang , Kai Yu

COMET: Concept Space Dissection of the Modality Gap in Audio-Text Multimodal Contrastive Embeddings

Contrastive Language-Audio Pretraining (CLAP) models are widely used for audio understanding and support modality-agnostic condition swapping in many zero-shot applications. However, their performance is heavily affected by the modality gap…

Sound · Computer Science 2026-05-29 Yonggang Zhu , Liting Gao , Aidong Men , Wenwu Wang

State-Anchored Complete-View Distillation for Robust Conversational Multimodal Emotion Recognition

Conversational multimodal emotion recognition (MER) requires reliable prediction when language, acoustic, or visual observations are missing or unreliable. Many missing-modality methods reconstruct absent inputs, yet such recovery can be…

Multimedia · Computer Science 2026-05-29 Zhaoyan Pan , Xiangdong Li , Wenke Wu , Mengting Ma , Ye Lou , Ji Zhou , Jiatong Pan , Wei Zhang

Dichotomy study of the Steiner tree problem in split-like graphs

Given a connected graph $G$ and a terminal set $R \subseteq V(G)$, the minimum Steiner tree problem (ST) asks for a tree that spans all of $R$ with at most $r$ vertices from $V(G)\backslash R$, for some integer $r\geq 0$. A \emph{split…

Discrete Mathematics · Computer Science 2026-05-29 Jyothish S , Sadagopan Narasimhan

Audio Deepfake Detection with Half-Truth Localisation Using Cross-Attentive Feature Fusion

Audio deepfake detection is well-studied as a binary problem, but partially manipulated speech, where a short synthesised segment is spliced into an otherwise genuine utterance, poses a harder and more realistic threat. Detecting such…

Sound · Computer Science 2026-05-29 S. Sutharya , Remya K. Sasi

ChildVox: A Speech, Audio, and Large Audio-Language Model Benchmark in Understanding and Characterizing Sound across Childhood

We present ChildVox, a novel benchmark for characterizing the diverse acoustic signals through which children communicate. Specifically, ChildVox follows the full developmental trajectory from birth through school age, covering…

Sound · Computer Science 2026-05-29 Tiantian Feng , Anfeng Xu , Xuan Shi , Aditya Kommineni , Shakhrul Iman Siam , Megan Micheletti , Zhonghao Shi , Helen Tager-Flusberg , Mi Zhang , Lynn K. Perry , Catherine Lord , Daniel Messinger , Shrikanth Narayanan

EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents

Voice agents, artificial intelligence systems that conduct spoken conversations to complete tasks, are increasingly deployed across enterprise applications. However, no existing benchmark jointly addresses two core evaluation challenges:…

Sound · Computer Science 2026-05-29 Tara Bogavelli , Gabrielle Gauthier Melançon , Katrina Stankiewicz , Oluwanifemi Bamgbose , Fanny Riols , Hoang H. Nguyen , Raghav Mehndiratta , Lindsay Devon Brin , Joseph Marinier , Hari Subramani , Anil Madamala , Sridhar Krishna Nemala , Srinivas Sunkara

MedMosaic: A Challenging Large Scale Benchmark of Diverse Medical Audio

Medical audio data is difficult to collect due to privacy regulations and high annotation costs arising from domain expertise. Thus, existing benchmarks tend to underrepresent complex medical audio scenarios. To address this challenge, we…

Sound · Computer Science 2026-05-29 Harshit Rajgarhia , Shuubham Ojha , Asif Shaik , Akhil Pothanapalli , Rachuri Lokesh , Abhishek Mukherji , Prasanna Desikan

BEAT: Tokenizing and Generating Symbolic Music by Uniform Temporal Steps

Tokenizing music to fit the general framework of language models is a compelling challenge, especially considering the diverse symbolic structures in which music can be represented (e.g., sequences, grids, and graphs). To date, most…

Sound · Computer Science 2026-05-29 Lekai Qian , Haoyu Gu , Jingwei Zhao , Ziyu Wang

EvA: An Evidence-First Audio Understanding Paradigm for LALMs

Large Audio Language Models (LALMs) still struggle in complex acoustic scenes because they often fail to preserve task-relevant acoustic evidence before reasoning begins. We identify this error pattern as the evidence bottleneck:…

Sound · Computer Science 2026-05-29 Xinyuan Xie , Shunian Chen , Zhiheng Liu , Yuhao Zhang , Zhiqiang Lv , Liyin Liang , Benyou Wang

AG-REPA: Causal Layer Selection for Representation Alignment in Audio Flow Matching

REPresentation Alignment (REPA) improves the training of generative flow models by aligning intermediate hidden states with pretrained teacher features, but its effectiveness in token-conditioned audio Flow Matching critically depends on…

Sound · Computer Science 2026-05-29 Pengfei Zhang , Tianxin Xie , Minghao Yang , Li Liu

OmniCustom: Sync Audio-Video Customization Via Joint Audio-Video Generation Model

Existing mainstream video customization methods focus on generating identity-consistent videos based on given reference images and textual prompts. Benefiting from the rapid advancement of joint audio-video generation, this paper proposes a…

Sound · Computer Science 2026-05-29 Maomao Li , Zhen Li , Kaipeng Zhang , Guosheng Yin , Zhifeng Li , Dong Xu

Beyond Transcripts: A Renewed Perspective on Audio Chaptering

Audio chaptering, the task of segmenting long-form audio into coherent sections, is increasingly important for navigating podcasts, lectures, and videos. Despite its relevance, research remains limited and text-based, leaving key questions…

Sound · Computer Science 2026-05-29 Fabian Retkowski , Maike Züfle , Thai Binh Nguyen , Jan Niehues , Alexander Waibel

Evaluating and Rewarding LALMs for Expressive Role-Play TTS via Mean Continuation Log-Probability

Recent advances in Large Audio Language Models (LALMs) have extended Text-to-Speech (TTS) to interactive role-play scenarios, which demand high expressiveness and strict adherence to role-play instructions. However, existing models struggle…

Sound · Computer Science 2026-05-29 Yong Ren , Jingbei Li , Haiyang Sun , Yujie Chen , Cheng Yi , Yechang Huang , Hao Gu , Ye Bai , Xuerui Yang

SegTune: Structured and Fine-Grained Control for Song Generation

Recent advancements in song generation have shown promising results in generating songs from lyrics and/or global text prompts. However, most existing systems lack the ability to model the temporally varying attributes of songs, limiting…

Sound · Computer Science 2026-05-29 Pengfei Cai , Joanna Wang , Haorui Zheng , Xu Li , Zihao Ji , Teng Ma , Zhongliang Liu , Chen Zhang , Pengfei Wan

AV-EMO-Reasoning: Benchmarking Emotional Reasoning Capabilities in Omni-modal LLMs with Audio-visual Cues

Emotions conveyed through voice and face shape engagement and context in human-AI interaction. Despite rapid progress in omni-modal large language models, the holistic evaluation of emotional reasoning with audiovisual cues remains limited.…

Multimedia · Computer Science 2026-05-29 Dingkun Zhou , Krish Patel , Ajay Kankipati , Akshaj Gupta , Zeyi Austin Li , Mohul Shukla , Vibhor Narang , Sara Kofman , Zongli Ye , Grace Wang , Xiaoyu Shi , Tingle Li , Guan-Ting Lin , Kan Jen Cheng , Huang-Cheng Chou , Jiachen Lian , Gopala Anumanchipalli

An Extensive Analysis of the Singing Voice Conversion Challenge 2025 Evaluation Results

We present a thorough analysis of the findings of the latest iteration of the Singing Voice Conversion Challenge, a scientific event aiming to compare and understand different voice conversion systems in a controlled environment. Compared…

Sound · Computer Science 2026-05-29 Lester Phillip Violeta , Xueyao Zhang , Jiatong Shi , Yusuke Yasuda , Wen-Chin Huang , Zhizheng Wu , Tomoki Toda