Multimedia - Scifaro

Feedback-Driven Rate Control for Learned Video Compression

End-to-end learned video compression has achieved strong rate-distortion performance, but rate control remains underexplored, especially in target-bitrate-driven and budget-constrained scenarios. Existing methods mainly rely on explicit…

Multimedia · Computer Science 2026-04-23 Zhiheng Xu, Xuerui Ma, Chunhua Peng, Hao Zhang

Smiling Regulates Emotion During Traumatic Recollection

We study when, where, and why 978 Holocaust survivors smile in video testimonies. We create an automatic smile detection model from facial features with an F1 of 85% and annotate detected smiles under two established taxonomies of smiling.…

Multimedia · Computer Science 2026-04-22 Marcus Ma, Emily Zhou, Leonard Ludwig, Julia Hörath, Christina Winkler, Kleanthis Avramidis, Tiantian Feng, Gabor Toth, Alina Bothe, Shrikanth Narayanan

Multimodal Digital Sensing of Early-Life Laying Hens: A Pilot Study Integrating Thermal, Acoustic, Optical-Flow and Environmental Data

Early-life development strongly influences long-term welfare in laying hens, yet monitoring remains limited by subjective assessment and single-modality tools. This pilot study evaluated the feasibility of a multimodal sensing framework…

Multimedia · Computer Science 2026-04-21 Yashan Dhaliwal, Daniel Essien, Suresh Neethirajan

2D or 3D: Who Governs Salience in VLA Models? -- Tri-Stage Token Pruning Framework with Modality Salience Awareness

Vision-Language-Action (VLA) models have emerged as the mainstream of embodied intelligence. Recent VLA models have expanded their input modalities from 2D-only to 2D+3D paradigms, forming multi-visual-modal VLA (MVLA) models. Despite…

Multimedia · Computer Science 2026-04-21 Zihao Zheng, Sicheng Tian, Zhihao Mao, Lingyue Zhang, Chenyue Li, Ziyun Zhang, Hong Gao, Yuchen Huang, Yutong Xu, Guojie Luo, Xiang Chen

Emotion Collider: Dual Hyperbolic Mirror Manifolds for Sentiment Recovery via Anti Emotion Reflection

Emotional expression underpins natural communication and effective human-computer interaction. We present Emotion Collider (EC-Net), a hyperbolic hypergraph framework for multimodal emotion and sentiment modeling. EC-Net represents modality…

Multimedia · Computer Science 2026-04-21 Rong Fu, Ziming Wang, Shuo Yin, Haiyun Wei, Kun Liu, Xianda Li, Zeli Su, Simon Fong

Mechanisms of Multimodal Synchronization: Insights from Decoder-Based Video-Text-to-Speech Synthesis

Unified decoder-only transformers have shown promise for multimodal generation, yet the mechanisms by which they synchronize modalities with heterogeneous sampling rates remain underexplored. We investigate these mechanisms through…

Multimedia · Computer Science 2026-04-21 Akshita Gupta, Tatiana Likhomanenko, Karren Dai Yang, Richard He Bai, Zakaria Aldeneh, Navdeep Jaitly

MOMENTA: Mixture-of-Experts Over Multimodal Embeddings with Neural Temporal Aggregation for Misinformation Detection

The widespread dissemination of multimodal content on social media has made misinformation detection increasingly challenging, as misleading narratives often arise not only from textual or visual content alone, but also from semantic…

Multimedia · Computer Science 2026-04-20 Yeganeh Abdollahinejad, Ahmad Mousavi, Naeemul Hassan, Kai Shu, Nathalie Japkowicz, Shahriar Khosravi, Amir Karami

MCSC-Bench: Multimodal Context-to-Script Creation for Realistic Video Production

Real-world video creation often involves a complex reasoning workflow of selecting relevant shots from noisy materials, planning missing shots for narrative completeness, and organizing them into coherent storylines. However, existing…

Multimedia · Computer Science 2026-04-20 Huanran Hu, Zihui Ren, Dingyi Yang, Liangyu Chen, Qixiang Gao, Tiezheng Ge, Qin Jin

Concept Drift Guided LayerNorm Tuning for Efficient Multimodal Metaphor Identification

Metaphorical imagination, the ability to connect seemingly unrelated concepts, is fundamental to human cognition and communication. While understanding linguistic metaphors has advanced significantly, grasping multimodal metaphors, such as…

Multimedia · Computer Science 2026-04-20 Wenhao Qian, Zhenzhen Hu, Zijie Song, Jia Li

ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling

Recent advances in video-to-audio (V2A) generation enable high-quality audio synthesis from visual content, yet achieving robust and fine-grained controllability remains challenging. Existing methods suffer from weak textual controllability…

Multimedia · Computer Science 2026-04-17 Jianxuan Yang, Xinyue Guo, Zhi Cheng, Kai Wang, Lipan Zhang, Jinjie Hu, Qiang Ji, Yihua Cao, Yihao Meng, Zhaoyue Cui, Mengmei Liu, Meng Meng, Jian Luan

Geo2Sound: A Scalable Geo-Aligned Framework for Soundscape Generation from Satellite Imagery

Recent image-to-audio models have shown impressive performance on object-centric visual scenes. However, their application to satellite imagery remains limited by the complex, wide-area semantic ambiguity of top-down views. While satellite…

Multimedia · Computer Science 2026-04-17 Kunlin Wu, Yanning Wang, Haofeng Tan, Boyi Chen, Teng Fei, Xianping Ma, Yang Yue, Zan Zhou, Xiaofeng Liu

Neuro-Oracle: A Trajectory-Aware Agentic RAG Framework for Interpretable Epilepsy Surgical Prognosis

Predicting post-surgical seizure outcomes in pharmacoresistant epilepsy is a clinical challenge. Conventional deep-learning approaches operate on static, single-timepoint pre-operative scans, omitting longitudinal morphological changes. We…

Multimedia · Computer Science 2026-04-17 Aizierjiang Aiersilan, Mohamad Koubeissi

Dual-Stream Decoupled Learning for Temporal Consistency and Speaker Interaction in AVSD

Audio-Visual Speaker Detection (AVSD) hinges on modeling both individual temporal continuity and inter-personal social context. Existing coupled architectures struggle to reconcile these tasks in shared representation spaces due to…

Multimedia · Computer Science 2026-04-17 Junhao Xiao, Shun Feng, Zhiyu Wu, Jinghan Yu, Haibiao Yao, Zhiyuan Ma, Jianjun Li, Youjun Bao, Yi Chen

Fact-Checking with Contextual Narratives: Leveraging Retrieval-Augmented LLMs for Social Media Analysis

We propose CRAVE (Cluster-based Retrieval Augmented Verification with Explanation), a novel framework that integrates retrieval-augmented Large Language Models (LLMs) with clustering techniques to address fact-checking challenges on social…

Multimedia · Computer Science 2026-04-17 Arka Ujjal Dey, Muhammad Junaid Awan, Georgia Channing, Christian Schroeder de Witt, John Collomosse

AVID: A Benchmark for Omni-Modal Audio-Visual Inconsistency Understanding via Agent-Driven Construction

We present AVID, the first large-scale benchmark for audio-visual inconsistency understanding in videos. While omni-modal large language models excel at temporally aligned tasks such as captioning and question answering, they struggle to…

Multimedia · Computer Science 2026-04-16 Zixuan Chen, Depeng Wang, Hao Lin, Li Luo, Ke Xu, Ya Guo, Huijia Zhu, Tanfeng Sun, Xinghao Jiang

AudioX: A Unified Framework for Anything-to-Audio Generation

Audio and music generation based on flexible multimodal control signals is a widely applicable topic, with the following key challenges: 1) a unified multimodal modeling framework, and 2) large-scale, high-quality training data. As such, we…

Multimedia · Computer Science 2026-04-16 Zeyue Tian, Zhaoyang Liu, Yizhu Jin, Ruibin Yuan, Liumeng Xue, Xu Tan, Qifeng Chen, Wei Xue, Yike Guo

Through Their Eyes: Fixation-aligned Tuning for Personalized User Emulation

Large language model (LLM) agents are increasingly deployed as scalable user simulators for recommender system evaluation. Yet existing simulators perceive recommendations through text or structured metadata rather than the visual…

Multimedia · Computer Science 2026-04-13 Lingfeng Huang, Huizhong Guo, Tianjun Wei, Yingpeng Du, Zhu Sun

Generalizing Video DeepFake Detection by Self-generated Audio-Visual Pseudo-Fakes

Detecting video deepfakes has become increasingly urgent in recent years. Given the audio-visual information in videos, existing methods typically expose deepfakes by modeling cross-modal correspondence using specifically designed…

Multimedia · Computer Science 2026-04-13 Zihe Wei, Yuezun Li

QoS-QoE Translation with Large Language Model

QoS-QoE translation is a fundamental problem in multimedia systems because it characterizes how measurable system and network conditions affect user-perceived experience. Although many prior studies have examined this relationship, their…

Multimedia · Computer Science 2026-04-13 Yingjie Yu, Mingyuan Wu, Ahmadreza Eslaminia, Lingzhi Zhao, Kaizhuo Yan, Klara Nahrstedt

LungCURE: Benchmarking Multimodal Real-World Clinical Reasoning for Precision Lung Cancer Diagnosis and Treatment

Lung cancer clinical decision support demands precise reasoning across complex, multi-stage oncological workflows. Existing multimodal large language models (MLLMs) fail to handle guideline-constrained staging and treatment reasoning. We…

Multimedia · Computer Science 2026-04-10 Fangyu Hao, Jiayu Yang, Yifan Zhu, Zijun Yu, Qicen Wu, Wang Yunlong, Jiawei Li, Yulin Liu, Xu Zeng, Guanting Chen, Shihao Li, Zhonghong Ou, Meina Song, Mengyang Sun, Haoran Luo, Yu Shi, Yingyi Wang