多媒体
Continuous valence-arousal estimation in real-world environments is challenging due to inconsistent modality reliability and interaction-dependent variability in audio-visual signals. Existing approaches primarily focus on modeling temporal…
Audio-visual emotion recognition (AVER) methods typically fuse utterance-level features, and even frame-level attention models seldom address the frame-rate mismatch across modalities. In this paper, we propose a Transformer-based framework…
Multimodal dialogue emotion recognition captures emotional cues by fusing text, visual, and audio modalities. However, existing approaches still suffer from notable limitations in modeling emotional dependencies and learning multimodal…
While Chain-of-Thought (CoT) reasoning significantly enhances the performance of Multimodal Large Language Models (MLLMs), its autoregressive nature incurs prohibitive latency constraints. Current efforts to mitigate this via token…
Multimodal Object-Entity Relation Extraction (MORE) is a challenging task in information extraction research. It aims to identify relations between visual objects and textual entities, requiring complex multimodal understanding and…
Networked virtual reality (NVR) whiteboards are increasingly important for enabling geographically dispersed users to engage in real-time idea sharing, collaborative design, and discussion. However, latency caused by network limitations,…
Remote Collaborative Augmented Reality (RCAR) enables geographically distributed users to collaborate by integrating virtual and physical environments. However, because RCAR relies on real-time transmission, it is susceptible to delay and…
World models simulate environmental dynamics to enable agents to plan and reason about future states. While existing approaches have primarily focused on visual observations, real-world perception inherently involves multiple sensory…
This paper introduces AI Blob!, an experimental system designed to explore the potential of semantic cataloging and Large Language Models (LLMs) for the retrieval and recontextualization of archival television footage. Drawing…
On-the-fly transcoding of dynamic point cloud sequences reduces storage requirements and virtually increases the number of available representations for on demand streaming scenarios. On-the-fly transcoding introduces, however, additional…
In recommendation-driven online media, creators increasingly suffer from semantic mutation, where malicious secondary edits preserve visual fidelity while altering the intended meaning. Detecting these mutations requires modeling a…
Recently, significant progress has been made in multi-modal continual learning, aiming to learn new tasks sequentially in multi-modal settings while preserving performance on previously learned ones. However, existing methods mainly focus…
Recent multimodal systems often rely on separate expert modality encoders which cause linearly scaling complexity and computational overhead with added modalities. While unified Omni-models address this via Mixture-of-Expert (MoE)…
Multimodal sarcasm detection requires resolving pragmatic incongruity across textual, acoustic, and visual cues through cross-modal reasoning. To enable robust sarcasm reasoning with foundation models, we propose SarcasmMiner, a…
Multimodal intent recognition aims to infer human intents by jointly modeling various modalities, playing a pivotal role in real-world dialogue systems. However, current methods struggle to model hierarchical semantics underlying complex…
Vision-language models (VLMs) have been proven effective for detecting multi-modal misinformation on social platforms, especially in zero-shot settings with unavailable or delayed annotations. However, a single VLM's capacity falls short in…
A significant ``modality gap" exists between the abundance of text-only data and the increasing power of multimodal models. This work systematically investigates whether images generated on-the-fly by Text-to-Image (T2I) models can serve as…
Emotional and cognitive factors are essential for understanding mental health disorders. However, existing methods often treat multi-modal data as classification tasks, limiting interpretability especially for emotion and cognition.…
Audio-visual speaker extraction has attracted increasing attention, as it removes the need for pre-registered speech and leverages the visual modality as a complement to audio. Although existing methods have achieved impressive performance,…
Before media playback is possible, live and video-on-demand content alike usually undergoes various operations described as tasks within a multimedia workflow. Where previously ingest, transcode, packaging and delivery tasks might have run…