Computer Science

Before the Shutter: Aesthetic and Actionable Portrait Photography Planning in 3D Scenes

Portrait photography is largely decided before the shutter opens: the subject's pose, the camera configuration, and the lighting devices must be coordinated within the surrounding 3D scene. In contrast, most existing computational methods…

Graphics · Computer Science 2026-05-29 Ruixiang Jiang, Chang Wen Chen

Audio Jailbreaks in Large Audio-Language Models: Taxonomy, Attack-Defense Analysis, and Cost-Aware Evaluation

Large Audio Language Models (LALMs) expand jailbreak risks from token-level prompting to the full speech perception-to-reasoning pipeline, where unsafe behavior can be induced through semantics, acoustic style, signal artifacts, or internal…

Sound · Computer Science 2026-05-29 Bo-Han Feng, Yu-Hsuan Li Liang, Chien-Feng Liu, You-Hsuan Chang, Yun-Nung Chen

HoliTok: A Continuous Holistic Tokenization with Robust Dual Capabilities of Speech Generation and Understanding

Unified speech foundation models require a holistic tokenization space that is both learnable by language models and decodable into high-quality waveforms. Existing speech tokenizers, however, often fail to satisfy these requirements…

Sound · Computer Science 2026-05-29 Bohan Li, Shi Lian, Hankun Wang, Yiwei Guo, Yu Xi, Zhihan Li, Da Zheng, Colin Zhang, Kai Yu

COMET: Concept Space Dissection of the Modality Gap in Audio-Text Multimodal Contrastive Embeddings

Contrastive Language-Audio Pretraining (CLAP) models are widely used for audio understanding and support modality-agnostic condition swapping in many zero-shot applications. However, their performance is heavily affected by the modality gap…

Sound · Computer Science 2026-05-29 Yonggang Zhu, Liting Gao, Aidong Men, Wenwu Wang

Audio Deepfake Detection with Half-Truth Localisation Using Cross-Attentive Feature Fusion

Audio deepfake detection is well-studied as a binary problem, but partially manipulated speech, where a short synthesised segment is spliced into an otherwise genuine utterance, poses a harder and more realistic threat. Detecting such…

Sound · Computer Science 2026-05-29 S. Sutharya, Remya K. Sasi

FreeForm: Reduced-Order Deformable Simulation from Particle-Based Skinning Eigenmodes

We present a novel formulation for mesh-free, reduced-order simulation of deformable hyperelastic objects. Existing work in reduced-order elastodynamic simulation represents the input geometry by either meshes, which can be difficult to…

Graphics · Computer Science 2026-05-29 Donglai Xiang, Vismay Modi, Rishit Dagli, Ty Trusty, Gilles Daviet, Anka He Chen, Nicholas Sharp, David I. W. Levin

ChildVox: A Speech, Audio, and Large Audio-Language Model Benchmark in Understanding and Characterizing Sound across Childhood

We present ChildVox, a novel benchmark for characterizing the diverse acoustic signals through which children communicate. Specifically, ChildVox follows the full developmental trajectory from birth through school age, covering…

Sound · Computer Science 2026-05-29 Tiantian Feng, Anfeng Xu, Xuan Shi, Aditya Kommineni, Shakhrul Iman Siam, Megan Micheletti, Zhonghao Shi, Helen Tager-Flusberg, Mi Zhang, Lynn K. Perry, Catherine Lord, Daniel Messinger, Shrikanth Narayanan

F-RNG: Feed-Forward Relightable Neural Gaussians

Capturing relightable 3D assets from real-world objects is a widely researched problem. Several per-scene optimization-based methods, based on 3D Gaussian splatting (3DGS), support relighting; however, they usually require dense input…

Graphics · Computer Science 2026-05-29 Guangming Fu, Jiahui Fan, Jian Yang, Miloš Hašan, Beibei Wang

HyperBones: Realtime Bone-driven Neural Garment Simulation with Hypernetwork Conditioning

Recent advances in garment simulation have brought high-quality results closer to real-time performance. Physics-based simulators can produce accurate motion, but remain too computationally expensive for interactive applications. In…

Graphics · Computer Science 2026-05-29 Astitva Srivastava, Hsiao-Yu Chen, Ryan Goldade, Philipp Herholz, Zhongshi Jiang, Gene Wei-Chin Lin, Lingchen Yang, Nikolaos Sarafianos, Tuur Stuyck, Doug Roble, Avinash Sharma, Egor Larionov

EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents

Voice agents, artificial intelligence systems that conduct spoken conversations to complete tasks, are increasingly deployed across enterprise applications. However, no existing benchmark jointly addresses two core evaluation challenges:…

Sound · Computer Science 2026-05-29 Tara Bogavelli, Gabrielle Gauthier Melançon, Katrina Stankiewicz, Oluwanifemi Bamgbose, Fanny Riols, Hoang H. Nguyen, Raghav Mehndiratta, Lindsay Devon Brin, Joseph Marinier, Hari Subramani, Anil Madamala, Sridhar Krishna Nemala, Srinivas Sunkara

MedMosaic: A Challenging Large Scale Benchmark of Diverse Medical Audio

Medical audio data is difficult to collect due to privacy regulations and high annotation costs arising from domain expertise. Thus, existing benchmarks tend to underrepresent complex medical audio scenarios. To address this challenge, we…

Sound · Computer Science 2026-05-29 Harshit Rajgarhia, Shuubham Ojha, Asif Shaik, Akhil Pothanapalli, Rachuri Lokesh, Abhishek Mukherji, Prasanna Desikan

BEAT: Tokenizing and Generating Symbolic Music by Uniform Temporal Steps

Tokenizing music to fit the general framework of language models is a compelling challenge, especially considering the diverse symbolic structures in which music can be represented (e.g., sequences, grids, and graphs). To date, most…

Sound · Computer Science 2026-05-29 Lekai Qian, Haoyu Gu, Jingwei Zhao, Ziyu Wang

EvA: An Evidence-First Audio Understanding Paradigm for LALMs

Large Audio Language Models (LALMs) still struggle in complex acoustic scenes because they often fail to preserve task-relevant acoustic evidence before reasoning begins. We identify this error pattern as the evidence bottleneck:…

Sound · Computer Science 2026-05-29 Xinyuan Xie, Shunian Chen, Zhiheng Liu, Yuhao Zhang, Zhiqiang Lv, Liyin Liang, Benyou Wang

AG-REPA: Causal Layer Selection for Representation Alignment in Audio Flow Matching

REPresentation Alignment (REPA) improves the training of generative flow models by aligning intermediate hidden states with pretrained teacher features, but its effectiveness in token-conditioned audio Flow Matching critically depends on…

Sound · Computer Science 2026-05-29 Pengfei Zhang, Tianxin Xie, Minghao Yang, Li Liu

OmniCustom: Sync Audio-Video Customization Via Joint Audio-Video Generation Model

Existing mainstream video customization methods focus on generating identity-consistent videos based on given reference images and textual prompts. Benefiting from the rapid advancement of joint audio-video generation, this paper proposes a…

Sound · Computer Science 2026-05-29 Maomao Li, Zhen Li, Kaipeng Zhang, Guosheng Yin, Zhifeng Li, Dong Xu

Beyond Transcripts: A Renewed Perspective on Audio Chaptering

Audio chaptering, the task of segmenting long-form audio into coherent sections, is increasingly important for navigating podcasts, lectures, and videos. Despite its relevance, research remains limited and text-based, leaving key questions…

Sound · Computer Science 2026-05-29 Fabian Retkowski, Maike Züfle, Thai Binh Nguyen, Jan Niehues, Alexander Waibel

Evaluating and Rewarding LALMs for Expressive Role-Play TTS via Mean Continuation Log-Probability

Recent advances in Large Audio Language Models (LALMs) have extended Text-to-Speech (TTS) to interactive role-play scenarios, which demand high expressiveness and strict adherence to role-play instructions. However, existing models struggle…

Sound · Computer Science 2026-05-29 Yong Ren, Jingbei Li, Haiyang Sun, Yujie Chen, Cheng Yi, Yechang Huang, Hao Gu, Ye Bai, Xuerui Yang

Robust and Efficient Penetration-Free Elastodynamics without Barriers

We introduce a barrier-free optimization framework for non-penetration elastodynamic simulation that matches the robustness of Incremental Potential Contact (IPC) while overcoming its two primary efficiency bottlenecks: (1) reliance on…

Graphics · Computer Science 2026-05-29 Juntian Zheng, Zhaofeng Luo, Minchen Li

SegTune: Structured and Fine-Grained Control for Song Generation

Recent advancements in song generation have shown promising results in generating songs from lyrics and/or global text prompts. However, most existing systems lack the ability to model the temporally varying attributes of songs, limiting…

Sound · Computer Science 2026-05-29 Pengfei Cai, Joanna Wang, Haorui Zheng, Xu Li, Zihao Ji, Teng Ma, Zhongliang Liu, Chen Zhang, Pengfei Wan

An Extensive Analysis of the Singing Voice Conversion Challenge 2025 Evaluation Results

We present a thorough analysis of the findings of the latest iteration of the Singing Voice Conversion Challenge, a scientific event aiming to compare and understand different voice conversion systems in a controlled environment. Compared…

Sound · Computer Science 2026-05-29 Lester Phillip Violeta, Xueyao Zhang, Jiatong Shi, Yusuke Yasuda, Wen-Chin Huang, Zhizheng Wu, Tomoki Toda