Multimedia - Scifaro

Orthogonal Disentanglement with Projected Feature Alignment for Multimodal Emotion Recognition in Conversation

Multimodal Emotion Recognition in Conversation (MERC) significantly enhances emotion recognition performance by integrating complementary emotional cues from text, audio, and visual modalities. While existing methods commonly utilize…

Multimedia · Computer Science · 2026-02-12 · Xinyi Che, Wenbo Wang, Jian Guan, Qijun Zhao

TAROT: Towards Optimization-Driven Adaptive FEC Parameter Tuning for Video Streaming

Forward Error Correction (FEC) remains essential for protecting video streaming against packet loss, yet most real deployments still rely on static, coarse-grained configurations that cannot react to rapid shifts in loss rate, goodput, or…

Multimedia · Computer Science · 2026-02-11 · Jashanjot Singh Sidhu, Aman Sahu, Abdelhak Bentaleb

Lightweight Call Signaling and Peer-to-Peer Control of WebRTC Video Conferencing

We present the software architecture and implementation of our web-based multiparty video conference application. It does not use a media server. For call signaling, it either piggybacks on existing push notifications via a lightweight…

Multimedia · Computer Science · 2026-02-10 · Kundan Singh

T2VTree: User-Centered Visual Analytics for Agent-Assisted Thought-to-Video Authoring

Generative models have substantially expanded video generation capabilities, yet practical thought-to-video creation remains a multi-stage, multi-modal, and decision-intensive process. However, existing tools either hide intermediate…

Multimedia · Computer Science · 2026-02-10 · Zhuoyun Zheng, Yu Dong, Gaorong Liang, Guan Li, Guihua Shan, Shiyu Cheng, Dong Tian, Jianlong Zhou, Jie Liang

Stickers on Facebook: Multifunctionality and face-enhancing politeness in everyday social interaction

Stickers are multimodal resources widely used in everyday digital conversations. Despite their popularity, most studies have focused on emojis and emoticons. Therefore, this study analyzes, from a sociopragmatic perspective, the use of…

Multimedia · Computer Science · 2026-02-10 · Laura M. Porrino-Moscoso

Federated Prompt-Tuning with Heterogeneous and Incomplete Multimodal Client Data

This paper introduces a generalized federated prompt-tuning framework for practical scenarios where local datasets are multi-modal and exhibit different distributional patterns of missing features at the input level. The proposed framework…

Multimedia · Computer Science · 2026-02-10 · Thu Hang Phung, Duong M. Nguyen, Thanh Trung Huynh, Quoc Viet Hung Nguyen, Trong Nghia Hoang, Phi Le Nguyen

Analyzing Diffusion and Autoregressive Vision Language Models in Multimodal Embedding Space

Embedding models are a fundamental component of modern AI systems such as semantic search and retrieval-augmented generation. Recent advances in large foundation models have substantially accelerated the development of embedding models,…

Multimedia · Computer Science · 2026-02-09 · Zihang Wang, Siyue Zhang, Yilun Zhao, Jingyi Yang, Tingyu Song, Anh Tuan Luu, Chen Zhao

XEmoGPT: An Explainable Multimodal Emotion Recognition Framework with Cue-Level Perception and Reasoning

Explainable Multimodal Emotion Recognition plays a crucial role in applications such as human-computer interaction and social media analytics. However, current approaches struggle with cue-level perception and reasoning due to two main…

Multimedia · Computer Science · 2026-02-06 · Hanwen Zhang, Yao Liu, Peiyuan Jiang, Lang Junjie, Xie Jun, Yihui He, Yajiao Deng, Siyu Du, Qiao Liu

Integrating Fine-Grained Audio-Visual Evidence for Robust Multimodal Emotion Reasoning

Multimodal emotion analysis is shifting from static classification to generative reasoning. Beyond simple label prediction, robust affective reasoning must synthesize fine-grained signals such as facial micro-expressions and prosodic which…

Multimedia · Computer Science · 2026-02-05 · Zhixian Zhao, Wenjie Tian, Lei Xie

Trailer Reimagined: An Innovative, Llm-DRiven, Expressive Automated Movie Summary framework (TRAILDREAMS)

This paper introduces TRAILDREAMS, a framework that uses a large language model (LLM) to automate the production of movie trailers. The purpose of the LLM is to select key visual sequences and impactful dialogues, and to help TRAILDREAMS to…

Multimedia · Computer Science · 2026-02-04 · Roberto Balestri, Pasquale Cascarano, Mirko Degli Esposti, Guglielmo Pescatore

Mixture of Disentangled Experts with Missing Modalities for Robust Multimodal Sentiment Analysis

Multimodal Sentiment Analysis (MSA) integrates multiple modalities to infer human sentiment, but real-world noise often leads to missing or corrupted data. However, existing feature-disentangled methods struggle to handle the internal…

Multimedia · Computer Science · 2026-02-03 · Xiang Li, Xiaoming Zhang, Dezhuang Miao, Xianfu Cheng, Dawei Li, Honggui Han, Zhoujun Li

Seeing, Hearing, and Knowing Together: Multimodal Strategies in Deepfake Videos Detection

As deepfake videos become increasingly difficult for people to recognise, understanding the strategies humans use is key to designing effective media literacy interventions. We conducted a study with 195 participants between the ages of 21…

Multimedia · Computer Science · 2026-02-03 · Chen Chen, Dion Hoe-Lian Goh

Cross-Modal Binary Attention: An Energy-Efficient Fusion Framework for Audio-Visual Learning

Effective multimodal fusion requires mechanisms that can capture complex cross-modal dependencies while remaining computationally scalable for real-world deployment. Existing audio-visual fusion approaches face a fundamental trade-off:…

Multimedia · Computer Science · 2026-02-03 · Mohamed Saleh, Zahra Ahmadi

Divide and Conquer: Multimodal Video Deepfake Detection via Cross-Modal Fusion and Localization

This paper presents a system for detecting fake audio-visual content (i.e., video deepfake), developed for Track 2 of the DDL Challenge. The proposed system employs a two-stage framework, comprising unimodal detection and multimodal score…

Multimedia · Computer Science · 2026-02-03 · Qingcao Li, Miao He, Liang Yi, Qing Wen, Yitao Zhang, Hongshuo Jin, Peng Cheng, Zhongjie Ba, Li Lu, Kui Ren

An Automatic Deep Learning Approach for Trailer Generation through Large Language Models

Trailers are short promotional videos designed to provide audiences with a glimpse of a movie. The process of creating a trailer typically involves selecting key scenes, dialogues and action sequences from the main content and editing them…

Multimedia · Computer Science · 2026-02-02 · Roberto Balestri, Pasquale Cascarano, Mirko Degli Esposti, Guglielmo Pescatore

PPVF: An Efficient Privacy-Preserving Online Video Fetching Framework with Correlated Differential Privacy

Online video streaming has evolved into an integral component of the contemporary Internet landscape. Yet, the disclosure of user requests presents formidable privacy challenges. As users stream their preferred online videos, their requests…

Multimedia · Computer Science · 2026-02-02 · Xianzhi Zhang, Yipeng Zhou, Di Wu, Quan Z. Sheng, Miao Hu, Linchang Xiao

MIDI-LLaMA: An Instruction-Following Multimodal LLM for Symbolic Music Understanding

Recent advances in multimodal large language models (MLLM) for audio music have demonstrated strong capabilities in music understanding, yet symbolic music, a fundamental representation of musical structure, remains unexplored. In this…

Multimedia · Computer Science · 2026-01-30 · Meng Yang, Jon McCormack, Maria Teresa Llano, Wanchao Su, Chao Lei

Rethinking Fusion: Disentangled Learning of Shared and Modality-Specific Information for Stance Detection

Multi-modal stance detection (MSD) aims to determine an author's stance toward a given target using both textual and visual content. While recent methods leverage multi-modal fusion and prompt-based learning, most fail to distinguish…

Multimedia · Computer Science · 2026-01-30 · Zhiyu Xie, Fuqiang Niu, Genan Dai, Qianlong Wang, Li Dong, Bowen Zhang, Hu Huang

HADUA: Hierarchical Attention and Dynamic Uniform Alignment for Robust Cross-Subject Emotion Recognition

Robust cross-subject emotion recognition from multimodal physiological signals remains a challenging problem, primarily due to modality heterogeneity and inter-subject distribution shift. To tackle these challenges, we propose a novel…

Multimedia · Computer Science · 2026-01-30 · Jiahao Tang, Youjun Li, Yangxuan Zheng, Xiangting Fan, Siyuan Lu, Nuo Zhang, Zi-Gang Huang

Block Erasure-Aware Semantic Multimedia Compression via JSCC Autoencoder

We present an AI-based framework for semantic transmission of multimedia data over band-limited, time-varying channels. The method targets scenarios where large content is split into multiple packets, with an unknown number potentially…

Multimedia · Computer Science · 2026-01-29 · Homa Esfahanizadeh, Nargis Fayaz, Jinfeng Du, Harish Viswanathan