Multimedia - Scifaro

Stage-Adaptive Reliability Modeling for Continuous Valence-Arousal Estimation

Continuous valence-arousal estimation in real-world environments is challenging due to inconsistent modality reliability and interaction-dependent variability in audio-visual signals. Existing approaches primarily focus on modeling temporal…

Multimedia · Computer Science 2026-03-13 Yubeen Lee, Sangeun Lee, Junyeop Cha, Eunil Park

Multimodal Self-Attention Network with Temporal Alignment for Audio-Visual Emotion Recognition

Audio-visual emotion recognition (AVER) methods typically fuse utterance-level features, and even frame-level attention models seldom address the frame-rate mismatch across modalities. In this paper, we propose a Transformer-based framework…

Multimedia · Computer Science 2026-03-13 Inyong Koo, Yeeun Seong, Minseok Son, Jaehyuk Jang, Changick Kim

AMB-DSGDN: Adaptive Modality-Balanced Dynamic Semantic Graph Differential Network for Multimodal Emotion Recognition

Multimodal dialogue emotion recognition captures emotional cues by fusing text, visual, and audio modalities. However, existing approaches still suffer from notable limitations in modeling emotional dependencies and learning multimodal…

Multimedia · Computer Science 2026-03-12 Yunsheng Wang, Yuntao Shou, Yilong Tan, Wei Ai, Tao Meng, Keqin Li

Chain-of-Thought Compression Should Not Be Blind: V-Skip for Efficient Multimodal Reasoning via Dual-Path Anchoring

While Chain-of-Thought (CoT) reasoning significantly enhances the performance of Multimodal Large Language Models (MLLMs), its autoregressive nature incurs prohibitive latency constraints. Current efforts to mitigate this via token…

Multimedia · Computer Science 2026-03-12 Dongxu Zhang, Yiding Sun, Cheng Tan, Wenbiao Yan, Ning Yang, Jihua Zhu, Haijun Zhang

MORE-R1: Guiding LVLM for Multimodal Object-Entity Relation Extraction via Stepwise Reasoning with Reinforcement Learning

Multimodal Object-Entity Relation Extraction (MORE) is a challenging task in information extraction research. It aims to identify relations between visual objects and textual entities, requiring complex multimodal understanding and…

Multimedia · Computer Science 2026-03-11 Xiang Yuan, Xu Chu, Xinrong Chen, Haochen Li, Zonghong Dai, Hongcheng Fan, Xiaoyue Yuan, Weiping Li, Tong Mo

Latency Effects on Multi-Dimensional QoE in Networked VR Whiteboards

Networked virtual reality (NVR) whiteboards are increasingly important for enabling geographically dispersed users to engage in real-time idea sharing, collaborative design, and discussion. However, latency caused by network limitations,…

Multimedia · Computer Science 2026-03-11 Jiarun Song, Yongkang Hou, Fuzheng Yang

TPIFM: A Task-Aware Model for Evaluating Perceptual Interaction Fluency in Remote AR Collaboration

Remote Collaborative Augmented Reality (RCAR) enables geographically distributed users to collaborate by integrating virtual and physical environments. However, because RCAR relies on real-time transmission, it is susceptible to delay and…

Multimedia · Computer Science 2026-03-11 Jiarun Song, Ninghao Wan, Fuzheng Yang, Weisi Lin

Audio-Visual World Models: Towards Multisensory Imagination in Sight and Sound

World models simulate environmental dynamics to enable agents to plan and reason about future states. While existing approaches have primarily focused on visual observations, real-world perception inherently involves multiple sensory…

Multimedia · Computer Science 2026-03-11 Jiahua Wang, Leqi Zheng, Jialong Wu, Yaoxin Mao

AI Blob! LLM-Driven Recontextualization of Italian Television Archives

This paper introduces AI Blob!, an experimental system designed to explore the potential of semantic cataloging and Large Language Models (LLMs) for the retrieval and recontextualization of archival television footage. Drawing…

Multimedia · Computer Science 2026-03-11 Roberto Balestri

Scalable On-the-fly Transcoding for Adaptive Streaming of Dynamic Point Clouds

On-the-fly transcoding of dynamic point cloud sequences reduces storage requirements and virtually increases the number of available representations for on demand streaming scenarios. On-the-fly transcoding introduces, however, additional…

Multimedia · Computer Science 2026-03-10 Michael Rudolph, Matthias De Fré, Finn Schnier, Tim Wauters, Amr Rizk

Q-BAR: Blogger Anomaly Recognition via Quantum-enhanced Manifold Learning

In recommendation-driven online media, creators increasingly suffer from semantic mutation, where malicious secondary edits preserve visual fidelity while altering the intended meaning. Detecting these mutations requires modeling a…

Multimedia · Computer Science 2026-03-10 Maida Wang, Panyun Jiang

Taming Modality Entanglement in Continual Audio-Visual Segmentation

Recently, significant progress has been made in multi-modal continual learning, aiming to learn new tasks sequentially in multi-modal settings while preserving performance on previously learned ones. However, existing methods mainly focus…

Multimedia · Computer Science 2026-03-10 Yuyang Hong, Qi Yang, Tao Zhang, Zili Wang, Zhaojin Fu, Kun Ding, Bin Fan, Shiming Xiang

Omni-C: Compressing Heterogeneous Modalities into a Single Dense Encoder

Recent multimodal systems often rely on separate expert modality encoders which cause linearly scaling complexity and computational overhead with added modalities. While unified Omni-models address this via Mixture-of-Expert (MoE)…

Multimedia · Computer Science 2026-03-09 Kin Wai Lau, Yasar Abbas Ur Rehman, Lai-Man Po, Pedro Porto Buarque de Gusmão

SarcasmMiner: A Dual-Track Post-Training Framework for Robust Audio-Visual Sarcasm Reasoning

Multimodal sarcasm detection requires resolving pragmatic incongruity across textual, acoustic, and visual cues through cross-modal reasoning. To enable robust sarcasm reasoning with foundation models, we propose SarcasmMiner, a…

Multimedia · Computer Science 2026-03-06 Zhu Li, Yongjian Chen, Huiyuan Lai, Xiyuan Gao, Shekhar Nayak, Matt Coler

Evolutionary Multimodal Reasoning via Hierarchical Semantic Representation for Intent Recognition

Multimodal intent recognition aims to infer human intents by jointly modeling various modalities, playing a pivotal role in real-world dialogue systems. However, current methods struggle to model hierarchical semantics underlying complex…

Multimedia · Computer Science 2026-03-05 Qianrui Zhou, Hua Xu, Yunjin Gu, Yifan Wang, Songze Li, Hanlei Zhang

Agentic Mixed-Source Multi-Modal Misinformation Detection with Adaptive Test-Time Scaling

Vision-language models (VLMs) have been proven effective for detecting multi-modal misinformation on social platforms, especially in zero-shot settings with unavailable or delayed annotations. However, a single VLM's capacity falls short in…

Multimedia · Computer Science 2026-03-04 Wei Jiang, Tong Chen, Wei Yuan, Quoc Viet Hung Nguyen, Hongzhi Yin

Synthetic Perception: Can Generated Images Unlock Latent Visual Prior for Text-Centric Reasoning?

A significant "modality gap" exists between the abundance of text-only data and the increasing power of multimodal models. This work systematically investigates whether images generated on-the-fly by Text-to-Image (T2I) models can serve as…

Multimedia · Computer Science 2026-03-04 Yuesheng Huang, Peng Zhang, Xiaoxin Wu, Riliang Liu, Jiaqi Liang

Voices, Faces, and Feelings: Multi-modal Emotion-Cognition Captioning for Mental Health Understanding

Emotional and cognitive factors are essential for understanding mental health disorders. However, existing methods often treat multi-modal data as classification tasks, limiting interpretability especially for emotion and cognition.…

Multimedia · Computer Science 2026-03-03 Zhiyuan Zhou, Yanrong Guo, Shijie Hao

CueNet: Robust Audio-Visual Speaker Extraction through Cross-Modal Cue Mining and Interaction

Audio-visual speaker extraction has attracted increasing attention, as it removes the need for pre-registered speech and leverages the visual modality as a complement to audio. Although existing methods have achieved impressive performance,…

Multimedia · Computer Science 2026-03-03 Jiadong Wang, Ke Zhang, Xinyuan Qian, Ruijie Tao, Haizhou Li, Björn Schuller

Nagare Media Engine: A System for Cloud- and Edge-Native Network-based Multimedia Workflows

Before media playback is possible, live and video-on-demand content alike usually undergoes various operations described as tasks within a multimedia workflow. Where previously ingest, transcode, packaging and delivery tasks might have run…

Multimedia · Computer Science 2026-03-03 Matthias Neugebauer