Multimedia - Scifaro

Learning Shared Sentiment Prototypes for Adaptive Multimodal Sentiment Analysis

Multimodal sentiment analysis (MSA) aims to predict human sentiment from textual, acoustic, and visual information in videos. Recent studies improve multimodal fusion by modeling modality interaction and assigning different modality…

Multimedia · Computer Science · 2026-04-08 · Chen Su, Yuanhe Tian, Yan Song

DAT: Dual-Aware Adaptive Transmission for Efficient Multimodal LLM Inference in Edge-Cloud Systems

Multimodal large language models (MLLMs) have shown strong capability in semantic understanding and visual reasoning, yet their use on continuous video streams in bandwidth-constrained edge-cloud systems incurs prohibitive computation and…

Multimedia · Computer Science · 2026-04-08 · Qi Guo, Zheming Yang, Yunqing Hu, Chang Zhao, Wen Ji

LLM2Manim: Pedagogy-Aware AI Generation of STEM Animations

High-quality STEM animations can be useful for learning, but they are still not common in daily teaching, mostly because they take time and special skills to make. In this paper, we present a semi-automated, human-in-the-loop (HITL)…

Multimedia · Computer Science · 2026-04-08 · Aastha Joshi, Hongyi Ke, Meet Gajjar, Aaron Christian, Qi Wang, Jun Chen

Hierarchical Semantic Correlation-Aware Masked Autoencoder for Unsupervised Audio-Visual Representation Learning

Learning aligned multimodal embeddings from weakly paired, label-free corpora is challenging: pipelines often provide only pre-extracted features, clips contain multiple events, and spurious co-occurrences are common. We propose HSC-MAE (Hierarchical…

Multimedia · Computer Science · 2026-04-07 · Donghuo Zeng, Hao Niu, Masato Taya

Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction

Document parsing (DP) transforms unstructured or semi-structured documents into structured, machine-readable representations, enabling downstream applications such as knowledge base construction and retrieval-augmented generation (RAG).…

Multimedia · Computer Science · 2026-04-07 · Qintong Zhang, Bin Wang, Victor Shea-Jay Huang, Junyuan Zhang, Zhengren Wang, Hao Liang, Conghui He, Wentao Zhang

Differential Mental Disorder Detection with Psychology-Inspired Multimodal Stimuli

Differential diagnosis of mental disorders remains a fundamental challenge in real-world clinical practice, where multiple conditions often exhibit overlapping symptoms. However, most existing public datasets are developed under…

Multimedia · Computer Science · 2026-04-06 · Zhiyuan Zhou, Jingjing Wu, Zhibo Lei, Junyu Guo, Zhongcheng Yu, Yuqi Chu, Xiaowei Zhang, Qiqi Zhao, Qi Wang, Shijie Hao, Yanrong Guo, Richang Hong

Semantic Compensation via Adversarial Removal for Robust Zero-Shot ECG Diagnosis

Recent ECG-language pretraining methods enable zero-shot diagnosis by aligning cardiac signals with clinical text, but they do not explicitly model robustness to partial observation and are typically studied under fully observed ECG…

Multimedia · Computer Science · 2026-04-03 · Hongjun Liu, Rujun Han, Leyu Zhou, Chao Yao

A Video Steganography for H.265/HEVC Based on Multiple CU Size and Block Structure Distortion

Video steganography based on block structure, which embeds secret information by modifying the Coding Unit (CU) block structure of I-frames, is currently a research hotspot. However, existing algorithms still suffer from the limitation of…

Multimedia · Computer Science · 2026-04-03 · Xiang Zhang, Wen Jiang, Fei Peng, Wenbin Huang, Ziqiang Li, Zhangjie Fu

Catalogue Grounded Multimodal Attribution for Museum Video under Resource and Regulatory Constraints

Audiovisual (AV) archives in museums and galleries are growing rapidly, but much of this material remains effectively locked away because it lacks consistent, searchable metadata. Existing archiving methods require extensive manual…

Multimedia · Computer Science · 2026-04-03 · Minsak Nanang, Adrian Hilton, Armin Mustafa

HippoMM: Hippocampal-inspired Multimodal Memory for Long Audiovisual Event Understanding

Comprehending extended audiovisual experiences remains challenging for computational systems, particularly temporal integration and cross-modal associations fundamental to human episodic memory. We introduce HippoMM, a computational…

Multimedia · Computer Science · 2026-04-03 · Yueqian Lin, Jingyang Zhang, Qinsi Wang, Hancheng Ye, Yuzhe Fu, Yudong Liu, Hai "Helen" Li, Yiran Chen

Towards Automatic Soccer Commentary Generation with Knowledge-Enhanced Visual Reasoning

Soccer commentary plays a crucial role in enhancing the soccer game viewing experience for audiences. Previous studies in automatic soccer commentary generation typically adopt an end-to-end method to generate anonymous live text…

Multimedia · Computer Science · 2026-04-02 · Zeyu Jin, Xiaoyu Qin, Songtao Zhou, Kaifeng Yun, Jia Jia

Editing on the Generative Manifold: A Theoretical and Empirical Study of General Diffusion-Based Image Editing Trade-offs

Diffusion-based editing has rapidly evolved from curated inpainting tools into general-purpose editors spanning text-guided instruction following, mask-localized edits, drag-based geometric manipulation, exemplar transfer, and training-free…

Multimedia · Computer Science · 2026-04-01 · Yi Hu, Leying Yi, Emily Davis, Finn Carter

Subjective Quality Assessment of Dynamic 3D Meshes in Virtual Reality Environment

A dynamic 3D mesh is a key component in Virtual Reality applications. However, this type of content demands significant processing resources for real-time rendering. To reduce processing requirements while preserving the user experience,…

Multimedia · Computer Science · 2026-04-01 · Duc V. Nguyen, Nguyen Thi Quynh Ly, Truong Thu Huong

Is One-Shot In-Context Learning Helpful for Data Selection in Task-Specific Fine-Tuning of Multimodal LLMs?

Injecting world knowledge into pretrained multimodal large language models (MLLMs) is essential for domain-specific applications. Task-specific fine-tuning achieves this by tailoring MLLMs to high-quality in-domain data but encounters…

Multimedia · Computer Science · 2026-03-31 · Xiao An, Jiaxing Sun, Ting Hu, Wei He

MAR3: Multi-Agent Recognition, Reasoning, and Reflection for Reference Audio-Visual Segmentation

Reference Audio-Visual Segmentation (Ref-AVS) aims to segment objects in audible videos based on multimodal cues in reference expressions. Previous methods overlook the explicit recognition of expression difficulty and dominant modality in…

Multimedia · Computer Science · 2026-03-31 · Yuan Zhao, Zhenqi Jia, Yongqiang Zhang

ComVi: Context-Aware Optimized Comment Display in Video Playback

On general video-sharing platforms like YouTube, comments are displayed independently of video playback. As viewers often read comments while watching a video, they may encounter ones referring to moments unrelated to the current scene,…

Multimedia · Computer Science · 2026-03-30 · Minsun Kim, Dawon Lee, Junyong Noh

Cinematic Audio Source Separation Using Visual Cues

Cinematic Audio Source Separation (CASS) aims to decompose mixed film audio into speech, music, and sound effects, enabling applications like dubbing and remastering. Existing CASS approaches are audio-only, overlooking the inherent…

Multimedia · Computer Science · 2026-03-30 · Kang Zhang, Suyeon Lee, Arda Senocak, Joon Son Chung

Ges-QA: A Multidimensional Quality Assessment Dataset for Audio-to-3D Gesture Generation

The Audio-to-3D-Gesture (A2G) task has enormous potential for applications in areas such as virtual reality and computer graphics. However, current evaluation metrics, such as Fréchet Gesture Distance or Beat Constancy, fail to reflect…

Multimedia · Computer Science · 2026-03-27 · Zhilin Gao, Yunhao Li, Sijing Wu, Yuqin Cao, Huiyu Duan, Guangtao Zhai

A User-Friendly Framework for Generating Model-Preferred Prompts in Text-to-Image Synthesis

Well-designed prompts have demonstrated the potential to guide text-to-image models in generating amazing images. Although existing prompt engineering methods can provide high-level guidance, it is challenging for novice users to achieve…

Multimedia · Computer Science · 2026-03-27 · Nailei Hei, Qianyu Guo, Zihao Wang, Yan Wang, Haofen Wang, Wenqiang Zhang

Short-Form Video Viewing Behavior Analysis and Multi-Step Viewing Time Prediction

Short-form videos have become one of the most popular user-generated content formats. Popular short-video platforms use a simple streaming approach that preloads one or more videos in the recommendation list in advance. However,…

Multimedia · Computer Science · 2026-03-25 · Vu Thi Hai Yen, Duc V. Nguyen, Cao Anh Minh Huy, Truong Thu Huong