Multimedia - Scifaro

Period-conscious Time-series Reconstruction under Local Differential Privacy

Periodic patterns are fundamental cues in multimedia signals and systems, including repetitive motion in video (e.g., gait cycles), rhythmic and pitch-related structure in audio, and recurring textures in image sequences. When such…

Multimedia · Computer Science 2026-05-05 Yaxuan Wang, Tianxin Li, Enji Liang, Yue Fu, Yanran Wang

Contextual Wireless Video Semantic Communication in MIMO-OFDM Systems

This paper proposes a MIMO-OFDM-based context video semantic transmission framework, namely M-CVST, for robust video communication over multi-path multiple-input multiple-output (MIMO) channels. It introduces a context-subcarrier…

Multimedia · Computer Science 2026-05-05 Bingyan Xie, Cong Zhou, Yuxuan Shi, Biqian Feng, Yongpeng Wu, Wenjun Zhang

Multimodal Confidence Modeling in Audio-Visual Quality Assessment

Audio-visual quality assessment (AVQA) is essential for streaming, teleconferencing, and immersive media. In realistic streaming scenarios, distortions are often asymmetric, where one modality may be severely degraded while the other…

Multimedia · Computer Science 2026-05-05 Mayesha Maliha R. Mithila, Mylene C. Q. Farias

PRISM: Exposing and Resolving Spurious Isolation in Federated Multimodal Continual Learning

While current federated multimodal continual learning over mixture-of-experts low-rank adaptation (MoE-LoRA) is built on the unverified assumption that routing isolates task-specific knowledge into disjoint experts, we argue that routing…

Multimedia · Computer Science 2026-05-05 Beining Wu, Zihao Ding, Jun Huang

BRITE: A Benchmark for Reliable and Interpretable T2V Evaluation on Implausible Scenarios

The rapid advancement of photorealistic Text-to-Video (T2V) generation creates an urgent need for up-to-date evaluation methods. Existing benchmarks largely overlook implausible scenarios and do not measure audio-visual alignment. We…

Multimedia · Computer Science 2026-05-05 Advait Tilak, Jiwon Choi, Nazifa Mouli, Wei Le

MarkIt: Training-Free Visual Markers for Precise Video Temporal Grounding

Video temporal grounding (VTG) aims to localize the start and end timestamps of the event described by a given query within an untrimmed video. Despite the strong open-world video understanding and recognition ability of video language…

Multimedia · Computer Science 2026-05-05 Pengcheng Fang, Yuxia Chen, Xiaohao Cai

CustomDancer: Customized Dance Recommendation by Text-Dance Retrieval

Dance serves as both a cultural cornerstone and a medium for personal expression, yet the rapid growth of online dance content has made personalized discovery increasingly difficult. Text-based dance retrieval offers a natural interface for…

Multimedia · Computer Science 2026-05-04 Yawen Qin, Ke Qiu, Qin Zhang

RoboKA: KAN Informed Multimodal Learning for RoboCall Surveillance System

Broad exploration of robocall surveillance research is hindered by limited access to public datasets, owing to privacy concerns. In this work, we first curate Robo-SAr, a synthetic robocall dataset designed for robocall surveillance…

Multimedia · Computer Science 2026-05-04 Nitin Choudhury, Nikhil Kumar, Aditya Kumar Sinha, Abhijeet Anand, Hossein Salemi, Orchid Chetia Phukan, Hemant Purohit, Arun Balaji Buduru

MTAVG-Bench: A Diagnostic Benchmark for Multi-Talker Dialogue-Centric Audio-Video Generation

Recent advances in text-to-audio-video (T2AV) generation have enabled models to synthesize audio-visual videos with multi-participant dialogues. However, existing evaluation benchmarks remain largely designed for human-recorded videos or…

Multimedia · Computer Science 2026-05-01 Yang-Hao Zhou, Haitian Li, Rexar Lin, Heyan Huang, Jinxing Zhou, Changsen Yuan, Tian Lan, Ziqin Zhou, Yudong Li, Jiajun Xu, Jingyun Liao, Yi-Ming Cheng, Xuefeng Chen, Xian-Ling Mao, Yousheng Feng

OpenLifelogQA: An Open-Ended Multi-Modal Lifelog Question-Answering Dataset

We introduce OpenLifelogQA, a large-scale open-ended lifelog QA dataset constructed from 18 months of multimodal lifelog data. Lifelogging is the passive collection and analysis of personal daily activities using wearable devices, producing…

Multimedia · Computer Science 2026-04-30 Quang-Linh Tran, Hoang-Bao Le, Tuong-Nghiem Diep, Binh Nguyen, Gareth J. F. Jones, Cathal Gurrin

Beyond Isolated Utterances: Cue-Guided Interaction for Context-Dependent Conversational Multimodal Understanding

Conversational multimodal understanding aims to infer the meaning or label of the current utterance from its preceding dialogue context together with textual, acoustic, and visual signals. Existing methods mainly strengthen contextual…

Multimedia · Computer Science 2026-04-29 Zhaoyan Pan, Hengyang Zhou, Xiangdong Li, Yuning Wang, Ye Lou, Jiatong Pan, Ji Zhou, Wei Zhang

Mitigating Shared-Private Branch Imbalance via Dual-Branch Rebalancing for Multimodal Sentiment Analysis

Multimodal Sentiment Analysis (MSA) requires integrating language, acoustic, and visual signals without sacrificing modality-specific sentiment evidence. Existing methods mainly improve either shared-private decomposition or cross-modal…

Multimedia · Computer Science 2026-04-29 Chunlei Meng, Jiabin Luo, Pengbin Feng, Zhenglin Yan, Chengyin Hu, Zhongxue Gan, Chun Ouyang

Gesture2Music: A Low-Latency Real-Time Framework for Continuous Gesture-Driven Music Generation

Gesture-driven music generation is an emerging human-computer interaction paradigm for touch-free and expressive musical interaction. However, many existing approaches treat the task as isolated gesture classification or map gestures to…

Multimedia · Computer Science 2026-04-29 Rathinaraja Jeyaraj, Barathi Subramanian, Kapilya Gangadharan, Anand Paul

CineAGI: Character-Consistent Movie Creation through LLM-Orchestrated Multi-Modal Generation and Cross-Scene Integration

Automated movie creation requires coordinating multiple characters, modalities, and narrative elements across extended sequences -- a challenge that existing end-to-end approaches struggle to address effectively. We present…

Multimedia · Computer Science 2026-04-28 Tianyidan Xie, Zhentao Huang, Mingjie Wang, Xin Huang, Jun Zhou, Minglun Gong, Zili Yi

Looking Into the Past: Eye Movements Characterize Elements of Autobiographical Recall in Interviews with Holocaust Survivors

Eye movement and memory retrieval are deeply and bidirectionally intertwined; however, existing literature is generally confined to controlled lab settings. We investigate the relationship between eye gaze and memory recall in free-form…

Multimedia · Computer Science 2026-04-27 Emily Zhou, Marcus Ma, Kleanthis Avramidis, Gabor Mihaly Toth, Shrikanth Narayanan

High-Fidelity 3D Gaussian Human Reconstruction via Region-Aware Initialization and Geometric Priors

Real-time, high-fidelity 3D human reconstruction from RGB images is essential for interactive applications such as virtual reality and gaming, yet remains challenging due to the complex non-rigid deformations of dynamic human bodies.…

Multimedia · Computer Science 2026-04-24 Yang Liu, Zhiyong Zhang

Sema: Semantic Transport for Real-Time Multimodal Agents

Real-time multimodal agents transport raw audio and screenshots using networking stacks designed for human receivers, which optimize for perceptual fidelity and smooth playout. Yet agent models act as event-driven processors with no…

Multimedia · Computer Science 2026-04-24 Jiaying Meng, Bojie Li

AttentionBender: Manipulating Cross-Attention in Video Diffusion Transformers as a Creative Probe

We present AttentionBender, a tool that manipulates cross-attention in Video Diffusion Transformers to help artists probe the internal mechanics of black-box video generation. While generative outputs are increasingly realistic, prompt-only…

Multimedia · Computer Science 2026-04-24 Adam Cole, Mick Grierson

Seeing Further and Wider: Joint Spatio-Temporal Enlargement for Micro-Video Popularity Prediction

Micro-video popularity prediction (MVPP) aims to forecast the future popularity of videos on online media, which is essential for applications such as content recommendation and traffic allocation. In real-world scenarios, it is critical…

Multimedia · Computer Science 2026-04-24 Dali Wang, Yunyao Zhang, Junqing Yu, Yi-Ping Phoebe Chen, Chen Xu, Zikai Song

Realistic Virtual Flood Experience System Using 360° Videos and 3D City Models Constructed from Building Footprints

Virtual flood experience systems, which enable users to vividly experience flooding, are attracting increasing attention as effective tools for communicating flood risks. However, existing systems typically rely on virtual cities that do…

Multimedia · Computer Science 2026-04-23 Tatsuro Banno, Koki Kawada, Mizuki Takenawa, Masatoshi Denda, Kiyoharu Aizawa