Multimedia - Scifaro

Look, Listen and Segment: Towards Weakly Supervised Audio-visual Semantic Segmentation

Audio-Visual Semantic Segmentation (AVSS) aligns audio and video at the pixel level but requires costly per-frame annotations. We introduce Weakly Supervised Audio-Visual Semantic Segmentation (WSAVSS), which uses only video-level labels to…

Multimedia · Computer Science 2026-03-24 Chengzhi Li, Heyan Huang, Ping Jian, Yanghao Zhou

AcoustEmo: Open-Vocabulary Emotion Reasoning via Utterance-Aware Acoustic Q-Former

Multimodal Large Language Models (MLLMs) excel in Open-Vocabulary (OV) emotion recognition but often neglect fine-grained acoustic modeling. Existing methods typically use global audio encoders, failing to capture subtle, local temporal…

Multimedia · Computer Science 2026-03-24 Liyun Zhang, Xuanmeng Sha, Shuqiong Wu, Fengkai Liu

Leum-VL Technical Report

A short video succeeds not simply because of what it shows, but because of how it schedules attention -- yet current multimodal models lack the structural grammar to parse or produce this organization. Existing models can describe scenes,…

Multimedia · Computer Science 2026-03-24 Yuxuan He, Chaiming Huang, Yifan Wu, Hongjun Wang, Chenkui Shen, Jifan Zhang, Long Li

FIGURA: A Modular Prompt Engineering Method for Artistic Figure Photography in Safety-Filtered Text-to-Image Models

Safety filters in commercial text-to-image (T2I) models systematically block legitimate artistic content involving the human figure, treating classical nude photography with the same restrictiveness as explicit material. While prior…

Multimedia · Computer Science 2026-03-24 Luca Cazzaniga

Cross-Space Synergy: A Unified Framework for Multimodal Emotion Recognition in Conversation

Multimodal Emotion Recognition in Conversation (MERC) aims to predict speakers' emotions by integrating textual, acoustic, and visual cues. Existing approaches either struggle to capture complex cross-modal interactions or experience…

Multimedia · Computer Science 2026-03-24 Xiaosen Lyu, Jiayu Xiong, Yuren Chen, Wanlong Wang, Xiaoqing Dai, Jing Wang

Modeling the Impacts of Swipe Delay on User Quality of Experience in Short Video Streaming

Short video streaming platforms have gained immense popularity in recent years, transforming the way users consume video content. A critical aspect of user interaction with these platforms is the swipe gesture, which allows users to…

Multimedia · Computer Science 2026-03-20 Duc V. Nguyen, Huyen T. T. Tran

Rethink Web Service Resilience in Space: A Radiation-Aware and Sustainable Transmission Solution

Low Earth Orbit (LEO) satellite networks such as Starlink and Project Kuiper are increasingly integrated with cloud infrastructures, forming an important internet backbone for global web services. By extending connectivity to remote…

Multimedia · Computer Science 2026-03-20 Long Chen, Hao Fang, Yi Ching Chou, Haoyuan Zhao, Xiaoyi Fan, Zhe Chen, Hengzhi Wang, Jiangchuan Liu

EgoAdapt: Enhancing Robustness in Egocentric Interactive Speaker Detection Under Missing Modalities

The TTM (Talking to Me) task is a pivotal component in understanding human social interactions, aiming to determine who is engaged in conversation with the camera-wearer. Traditional models often face challenges in real-world scenarios due to…

Multimedia · Computer Science 2026-03-20 Xinyuan Qian, Xinjia Zhu, Alessio Brutti, Dong Liang

DuoTeach: Dual Role Self-Teaching for Coarse-to-Fine Decision Coordination in Vision-Language Models

Coarse-to-fine path decision-making requires predicting a valid taxonomy path in which earlier decisions constrain later ones. However, existing benchmarks score each level independently, obscuring cross-level validity and consistency. To…

Multimedia · Computer Science 2026-03-20 Wei Yang, Yiran Zhu, Zilin Li, Xunjia Zhang, Jun Xia, Hongtao Wang

MSM-BD: Multimodal Social Media Bot Detection Using Heterogeneous Information

Although social bots can be engineered for constructive applications, their potential for misuse in manipulative schemes and malware distribution cannot be overlooked. This dichotomy underscores the critical need to detect social bots on…

Multimedia · Computer Science 2026-03-20 Tingxuan Wu, Zhaorui Ma, Yanjun Cui, Ziyi Zhou, Eric Wang

Beyond Forced Modality Balance: Intrinsic Information Budgets for Multimodal Learning

Multimodal models often converge to a dominant-modality solution, in which a stronger, faster-converging modality overshadows weaker ones. This modality imbalance causes suboptimal performance. Existing methods attempt to balance different…

Multimedia · Computer Science 2026-03-19 Zechang Xiong, Da Li, Kexin Tang, Pengyuan Li, Wenkang Kong, Yulan Hu

Amanous: Distribution-Switching for Superhuman Piano Density on Disklavier

The automated piano enables note densities, polyphony, and register changes far beyond human physical limits, yet the three dominant traditions for composing such textures--Nancarrow's tempo canons, Xenakis's stochastic distributions, and…

Multimedia · Computer Science 2026-03-19 Joonhyung Bae

Hyperbolic Multimodal Generative Representation Learning for Generalized Zero-Shot Multimodal Information Extraction

Multimodal information extraction (MIE) comprises a set of essential tasks that extract structured information from Web texts and their accompanying images to facilitate the construction of Web-based semantic knowledge. To…

Multimedia · Computer Science 2026-03-18 Baohang Zhou, Kehui Song, Rize Jin, Yu Zhao, Xuhui Sui, Xinying Qian, Xingyue Guo, Ying Zhang

Visual Set Program Synthesizer

A user pointing their phone at a supermarket shelf and asking "Which soda has the least sugar?" poses a difficult challenge for current visual AI assistants. Such queries require not only object recognition, but explicit set-based reasoning…

Multimedia · Computer Science 2026-03-18 Zehua Cheng, Wei Dai, Wenhu Zhang, Thomas Lukasiewicz, Jiahao Sun

DASH: Dynamic Audio-Driven Semantic Chunking for Efficient Omnimodal Token Compression

Omnimodal large language models (OmniLLMs) jointly process audio and visual streams, but the resulting long multimodal token sequences make inference prohibitively expensive. Existing compression methods typically rely on fixed window…

Multimedia · Computer Science 2026-03-18 Bingzhou Li, Tao Huang

Coherent Audio-Visual Editing via Conditional Audio Generation Following Video Edits

We introduce a novel pipeline for joint audio-visual editing that enhances the coherence between edited video and its accompanying audio. Our approach first applies state-of-the-art video editing techniques to produce the target video, then…

Multimedia · Computer Science 2026-03-18 Masato Ishii, Akio Hayakawa, Takashi Shibuya, Yuki Mitsufuji

Multimodal Cyber-physical Interaction in XR: Hybrid Doctoral Thesis Defense

Academic events, such as a doctoral thesis defense, are typically limited to either physical co-location or flat video conferencing, resulting in rigid participation formats and fragmented presence. We present a multimodal framework that…

Multimedia · Computer Science 2026-03-17 Ahmad Alhilal, Kit Yung Lam, Lik-Hang Lee, Xuetong Wang, Sijia Li, Matti Siekkinen, Tristan Braud, Pan Hui

Anchoring Emotions in Text: Robust Multimodal Fusion for Mimicry Intensity Estimation

Estimating Emotional Mimicry Intensity (EMI) in naturalistic environments is a critical yet challenging task in affective computing. The primary difficulty lies in effectively modeling the complex, nonlinear temporal dynamics across highly…

Multimedia · Computer Science 2026-03-17 Lingsi Zhu, Yuefeng Zou, Yunxiang Zhang, Naixiang Zheng, Guoyuan Wang, Jun Yu, Jiaen Liang, Wei Huang, Shengping Liu, Ximin Zheng

Design-MLLM: A Reinforcement Alignment Framework for Verifiable and Aesthetic Interior Design

Interior design is a requirements-to-visual-plan generation process that must simultaneously satisfy verifiable spatial feasibility and comparative aesthetic preferences. While recent multimodal large language models (MLLMs) offer a unified…

Multimedia · Computer Science 2026-03-17 Yuxuan Yang, Xiaotong Mao, Jingyao Wang, Fuchun Sun

OmniForcing: Unleashing Real-time Joint Audio-Visual Generation

Recent joint audio-visual diffusion models achieve remarkable generation quality but suffer from high latency due to their bidirectional attention dependencies, hindering real-time applications. We propose OmniForcing, the first framework…

Multimedia · Computer Science 2026-03-16 Yaofeng Su, Yuming Li, Zeyue Xue, Jie Huang, Siming Fu, Haoran Li, Ying Li, Zezhong Qian, Haoyang Huang, Nan Duan