Multimedia - Scifaro

Contestable Multi-Agent Debate with Arena-based Argumentative Computation for Multimedia Verification

Multimedia verification requires not only accurate conclusions but also transparent and contestable reasoning. We propose a contestable multi-agent framework that integrates multimodal large language models, external verification tools, and…

Multimedia · Computer Science 2026-05-15 Truong Thanh Hung Nguyen, Vo Thanh Khang Nguyen, Hoang-Loc Cao, Phuc Ho, Van Pham, Hung Cao

Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

Omni-modal language models are intended to jointly understand audio, visual inputs, and language, but benchmark gains can be inflated when visual evidence alone is enough to answer a query. We study whether current omni-modal benchmarks…

Multimedia · Computer Science 2026-05-15 Che Liu, Lichao Ma, Xiangyu Tony Zhang, Yuxin Zhang, Haoyang Zhang, Xuerui Yang, Fei Tian

JointAVBench: A Benchmark for Joint Audio-Visual Reasoning Evaluation

Understanding videos inherently requires reasoning over both visual and auditory information. To properly evaluate Omni-Large Language Models (Omni-LLMs), which are capable of processing multi-modal information including vision and audio,…

Multimedia · Computer Science 2026-05-15 Jianghan Chao, Jianzhang Gao, Wenhui Tan, Yuchong Sun, Ruihua Song, Liyun Ru

Content-Adaptive Rate-Quality Curve Prediction Model in Media Processing System

In streaming media services, video transcoding is a common practice to alleviate bandwidth demands. Unfortunately, traditional methods employing a uniform rate factor (RF) across all videos often result in significant inefficiencies.…

Multimedia · Computer Science 2026-05-15 Shibo Yin, Zhiyu Zhang, Peirong Ning, Qiubo Chen, Jing Chen, Quan Zhou, Li Song

UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning

Unified multimodal models (UMMs) aim to integrate understanding and generation within a single architecture. However, it remains underexplored how to effectively coordinate these two capabilities for more effective and efficient reasoning.…

Multimedia · Computer Science 2026-05-13 Hayes Bai, Yinyi Luo, Wenwen Wang, Qingsong Wen, Jindong Wang

MMTB: Evaluating Terminal Agents on Multimedia-File Tasks

Terminals provide a powerful interface for AI agents by exposing diverse tools for automating complex workflows, yet existing terminal-agent benchmarks largely focus on tasks grounded in text, code, and structured files. However, many…

Multimedia · Computer Science 2026-05-13 Chiyeong Heo, Jaechang Kim, Junhyuk Kwon, Hoyoung Kim, Dongmin Park, Jonghyun Lee, Jungseul Ok

RW-Post: Auditable Evidence-Grounded Multimodal Fact-Checking in the Wild

Multimodal misinformation increasingly leverages visual persuasion, where repurposed or manipulated images strengthen misleading text. We introduce RW-Post, a post-aligned text-image benchmark for real-world multimodal…

Multimedia · Computer Science 2026-05-13 Danni Xu, Shaojing Fan, Harry Cheng, Mohan Kankanhalli

Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination

Large Vision-Language Models (LVLMs) have achieved remarkable progress in multimodal tasks, yet their reliability is persistently undermined by hallucinations: generated text that contradicts the visual input. Recent studies often attribute…

Multimedia · Computer Science 2026-05-12 Yangneng Chen, Junlin Li, Weijun Yao, Xilai Ma, Guodong Du, Wenya Wang, Jing Li

FLARE: Full-Modality Long-Video Audiovisual Retrieval Benchmark with User-Simulated Queries

As video becomes increasingly central to information dissemination and multimodal large language models (MLLMs) continue to advance, evaluating video retrieval has become increasingly important. In realistic search scenarios, this requires…

Multimedia · Computer Science 2026-05-12 Qijie You, Hao Liang, Mingrui Chen, Bohan Zeng, Meiyi Qiang, Zhenhao Wong, Wentao Zhang

Mitigating Multimodal Inconsistency via Cognitive Dual-Pathway Reasoning for Intent Recognition

Multimodal Intent Recognition (MIR) aims to understand complex user intentions by leveraging text, video, and audio signals. However, existing approaches face two key challenges: (1) overlooking intricate cross-modal interactions for…

Multimedia · Computer Science 2026-05-12 Yifan Wang, Peiwu Wang, Yunxian Chi, Zhinan Gou, Kai Gao

Accelerating Multi-Condition T2I Generation via Adaptive Condition Offloading and Pruning

Text-to-image (T2I) generation using multiple conditions enables fine-grained user control on the generated image. Yet, incorporating multi-condition inputs incurs substantial computation and communication overhead, due to additional…

Multimedia · Computer Science 2026-05-12 Yuxin Kong, Peng Yang, Chongbin Yi, Fan Wu, Feng Lyu

From Natural Alignment to Conditional Controllability in Multimodal Dialogue

The recent advancement of Artificial Intelligence Generated Content (AIGC) has led to significant strides in modeling human interaction, particularly in the context of multimodal dialogue. While current methods impressively generate…

Multimedia · Computer Science 2026-05-12 Zeyu Jin, Songtao Zhou, Haoyu Wang, Minghao Tian, Kaifeng Yun, Zhuo Chen, Xiaoyu Qin, Jia Jia

Anisotropic Modality Align

Training multimodal large language models has long been limited by the scarcity of high-quality paired multimodal data. Recent studies show that the shared representation space of pretrained multimodal contrastive models can serve as a…

Multimedia · Computer Science 2026-05-11 Xiaomin Yu, Yijiang Li, Yuhui Zhang, Hanzhen Zhao, Yue Yang, Hao Tang, Yue Song, Xiaobin Hu, Chengwei Qin, Shuicheng Yan, Hui Xiong

Modality-Aware Contrastive and Uncertainty-Regularized Emotion Recognition

Multimodal Emotion Recognition (MER) has attracted growing attention with the rapid advancement of human-computer interaction. However, different modalities exhibit substantial discrepancies in semantics, quality, and availability, leading…

Multimedia · Computer Science 2026-05-08 Yan Zhuang, Minhao Liu, Yanru Zhang, Jiawen Deng, Fuji Ren

To Fuse or to Drop? Dual-Path Learning for Resolving Modality Conflicts in Multimodal Emotion Recognition

Multimodal emotion recognition (MER) benefits from combining text, audio, and vision, yet standard fusion often fails when modalities conflict. Crucially, conflicts differ in resolvability: benign conflicts stem from missing, weak, or…

Multimedia · Computer Science 2026-05-07 Yangchen Yu, Qian Chen, Jia Li, Zhenzhen Hu, Jinpeng Hu, Lizi Liao, Erik Cambria, Richang Hong

RenCon 2025: Revival of the Expressive Performance Rendering Competition

This paper presents a comprehensive documentation of RenCon 2025, the revival of the expressive performance rendering competition which took place at ISMIR 2025 in Daejeon, Korea. The competition attracted 9 entries from international…

Multimedia · Computer Science 2026-05-07 Huan Zhang, Taegyun Kwon, Anders Friberg, Junyan Jiang, Hayeon Bang, Hyeyoon Cho, Gus Xia, Akira Maezawa, Simon Dixon, Dasaem Jeong

OceanPile: A Large-Scale Multimodal Ocean Corpus for Foundation Models

The vast and underexplored ocean plays a critical role in regulating global climate and supporting marine biodiversity, yet artificial intelligence has so far delivered limited impact in this domain due to a fundamental data bottleneck.…

Multimedia · Computer Science 2026-05-07 Yida Xue, Ningyu Zhang, Tingwei Wu, Zhe Ma, Daxiong Ji, Zhao Wang, Guozhou Zheng, Huajun Chen

Subjective and Objective Quality-of-Experience Evaluation Study for Live Video Streaming

In recent years, live video streaming has gained widespread popularity across various social media platforms. Quality of experience (QoE), which reflects end-users' satisfaction and overall experience, plays a critical role for media…

Multimedia · Computer Science 2026-05-07 Zehao Zhu, Wei Sun, Jun Jia, Wei Wu, Sibin Deng, Kai Li, Ying Chen, Xiongkuo Min, Jia Wang, Guangtao Zhai

Stage Light is Sequence²: Multi-Light Control via Imitation Learning

Music-inspired Automatic Stage Lighting Control (ASLC) has gained increasing attention in recent years due to the substantial time and financial costs associated with hiring and training professional lighting engineers. However, existing…

Multimedia · Computer Science 2026-05-06 Zijian Zhao, Dian Jin, Zijing Zhou, Xiaoyu Zhang

The Streaming Reservoir Convergence Theorem: A Prospect-Theoretic Framework for Multi-Provider Adaptive Streaming

We present the Streaming Reservoir Convergence Theorem (SRCT), a novel mathematical framework for multi-provider adaptive bitrate streaming that addresses three fundamental structural weaknesses in current systems: linear provider probing,…

Multimedia · Computer Science 2026-05-05 Justice Owusu Agyemang, Jerry John Kponyo, Kwame Opuni-Boachie Obour Agyekum, Obed Kwasi Somuah, Sarafina Serwaa Boakye, Elliot Amponsah, Godfred Manu Addo Boakye