多媒体
Multimedia verification requires not only accurate conclusions but also transparent and contestable reasoning. We propose a contestable multi-agent framework that integrates multimodal large language models, external verification tools, and…
Omni-modal language models are intended to jointly understand audio, visual inputs, and language, but benchmark gains can be inflated when visual evidence alone is enough to answer a query. We study whether current omni-modal benchmarks…
Understanding videos inherently requires reasoning over both visual and auditory information. To properly evaluate Omni-Large Language Models (Omni-LLMs), which are capable of processing multi-modal information including vision and audio,…
In streaming media services, video transcoding is a common practice to alleviate bandwidth demands. Unfortunately, traditional methods employing a uniform rate factor (RF) across all videos often result in significant inefficiencies.…
Unified multimodal models (UMMs) aim to integrate understanding and generation within a single architecture. However, it remains underexplored how to effectively coordinate these two capabilities for more effective and efficient reasoning.…
Terminals provide a powerful interface for AI agents by exposing diverse tools for automating complex workflows, yet existing terminal-agent benchmarks largely focus on tasks grounded in text, code, and structured files. However, many…
Multimodal misinformation increasingly leverages visual persuasion, where repurposed or manipulated images strengthen misleading text. We introduce \textbf{RW-Post}, a post-aligned \textbf{text--image benchmark} for real-world multimodal…
Large Vision-Language Models (LVLMs) have achieved remarkable progress in multimodal tasks, yet their reliability is persistently undermined by hallucinations-generating text that contradicts visual input. Recent studies often attribute…
As video becomes increasingly central to information dissemination and multimodal large language models (MLLMs) continue to advance, evaluating video retrieval has become increasingly important. In realistic search scenarios, this requires…
Multimodal Intent Recognition (MIR) aims to understand complex user intentions by leveraging text, video, and audio signals. However, existing approaches face two key challenges: (1) overlooking intricate cross-modal interactions for…
Text-to-image (T2I) generation using multiple conditions enables fine-grained user control on the generated image. Yet, incorporating multi-condition inputs incurs substantial computation and communication overhead, due to additional…
The recent advancement of Artificial Intelligence Generated Content (AIGC) has led to significant strides in modeling human interaction, particularly in the context of multimodal dialogue. While current methods impressively generate…
Training multimodal large language models has long been limited by the scarcity of high-quality paired multimodal data. Recent studies show that the shared representation space of pretrained multimodal contrastive models can serve as a…
Multimodal Emotion Recognition (MER) has attracted growing attention with the rapid advancement of human-computer interaction. However, different modalities exhibit substantial discrepancies in semantics, quality, and availability, leading…
Multimodal emotion recognition (MER) benefits from combining text, audio, and vision, yet standard fusion often fails when modalities conflict. Crucially, conflicts differ in resolvability: benign conflicts stem from missing, weak, or…
This paper presents a comprehensive documentation of RenCon 2025, the revival of the expressive performance rendering competition which took place at ISMIR 2025 in Daejeon, Korea. The competition attracted 9 entries from international…
The vast and underexplored ocean plays a critical role in regulating global climate and supporting marine biodiversity, yet artificial intelligence has so far delivered limited impact in this domain due to a fundamental data bottleneck.…
In recent years, live video streaming has gained widespread popularity across various social media platforms. Quality of experience (QoE), which reflects end-users' satisfaction and overall experience, plays a critical role for media…
Music-inspired Automatic Stage Lighting Control (ASLC) has gained increasing attention in recent years due to the substantial time and financial costs associated with hiring and training professional lighting engineers. However, existing…
We present the Streaming Reservoir Convergence Theorem (SRCT), a novel mathematical framework for multi-provider adaptive bitrate streaming that addresses three fundamental structural weaknesses in current systems: linear provider probing,…