Lin Song — Scifaro

JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

We present JoyAI-Image, a unified multimodal foundation model for visual understanding, text-to-image generation, and instruction-guided image editing. JoyAI-Image couples a spatially enhanced Multimodal Large Language Model (MLLM) with a…

Graphics · Computer Science 2026-05-21 Lin Song, Wenbo Li, Guoqing Ma, Wei Tang, Bo Wang, Yuan Zhang, Yijun Yang, Yicheng Xiao, Jianhui Liu, Yanbing Zhang, Guohui Zhang, Wenhu Zhang, Hang Xu, Nan Jiang, Xin Han, Haoze Sun, Maoquan Zhang, Haoyang Huang, Nan Duan

OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation

Recent advances in joint audio-video generation have been remarkable, yet real-world applications demand strong per-modality fidelity, cross-modal alignment, and fine-grained synchronization. Reinforcement Learning (RL) offers a promising…

Computer Vision and Pattern Recognition · Computer Science 2026-05-13 Guohui Zhang, XiaoXiao Ma, Jie Huang, Hang Xu, Hu Yu, Siming Fu, Yuming Li, Zeyue Xue, Lin Song, Haoyang Huang, Nan Duan, Feng Zhao

Thinking with Novel Views: A Systematic Analysis of Generative-Augmented Spatial Intelligence

Current Large Multimodal Models (LMMs) struggle with spatial reasoning tasks requiring viewpoint-dependent understanding, largely because they are confined to a single, static observation. We propose Thinking with Novel Views (TwNV), a…

Computer Vision and Pattern Recognition · Computer Science 2026-05-12 Yanbing Zhang, Bo Wang, Jianhui Liu, Nan Jiang, Jiaxiu Jiang, Haoze Sun, Yijun Yang, Shenghe Zheng, Lin Song, Haoyang Huang, Nan Duan, Wenbo Li

SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing

Image spatial editing performs geometry-driven transformations, allowing precise control over object layout and camera viewpoints. Current models are insufficient for fine-grained spatial manipulations, motivating a dedicated assessment…

Computer Vision and Pattern Recognition · Computer Science 2026-04-09 Yicheng Xiao, Wenhu Zhang, Lin Song, Yukang Chen, Wenbo Li, Nan Jiang, Tianhe Ren, Haokun Lin, Wei Huang, Haoyang Huang, Xiu Li, Nan Duan, Xiaojuan Qi

From Prediction to Perfection: Introducing Refinement to Autoregressive Image Generation

Autoregressive (AR) image generators offer a language-model-friendly approach to image generation by predicting discrete image tokens in a causal sequence. However, unlike diffusion models, AR models lack a mechanism to refine previous…

Computer Vision and Pattern Recognition · Computer Science 2026-01-29 Cheng Cheng, Lin Song, Di An, Yicheng Xiao, Xuchong Zhang, Hongbin Sun, Ying Shan

How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven Perspective

Visual Spatial Reasoning (VSR) is a core human cognitive ability and a critical requirement for advancing embodied intelligence and autonomous systems. Despite recent progress in Vision-Language Models (VLMs), achieving human-level VSR…

Artificial Intelligence · Computer Science 2025-11-12 Songsong Yu, Yuxin Chen, Hao Ju, Lianjie Jia, Fuxi Zhang, Shaofei Huang, Yuhan Wu, Rundi Cui, Binghao Ran, Zaibin Zhang, Zhedong Zheng, Zhipeng Zhang, Yifan Wang, Lin Song, Lijun Wang, Yanwei Li, Ying Shan, Huchuan Lu

ATM-GAD: Adaptive Temporal Motif Graph Anomaly Detection for Financial Transaction Networks

Financial fraud detection is essential to safeguard billions of dollars, yet the intertwined entities and fast-changing transaction behaviors in modern financial systems routinely defeat conventional machine learning models. Recent…

Machine Learning · Computer Science 2025-08-29 Zeyue Zhang, Lin Song, Erkang Bao, Xiaoling Lv, Xinyue Wang

LoRA-Gen: Specializing Large Language Model via Online LoRA Generation

Recent advances have highlighted the benefits of scaling language models to enhance performance across a wide range of NLP tasks. However, these approaches still face limitations in effectiveness and efficiency when applied to…

Computation and Language · Computer Science 2025-06-16 Yicheng Xiao, Lin Song, Rui Yang, Cheng Cheng, Yixiao Ge, Xiu Li, Ying Shan

MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO

Recent text-to-image systems face limitations in handling multimodal inputs and complex reasoning tasks. We introduce MindOmni, a unified multimodal large language model that addresses these challenges by incorporating reasoning generation…

Artificial Intelligence · Computer Science 2025-06-12 Yicheng Xiao, Lin Song, Yukang Chen, Yingmin Luo, Yuxin Chen, Yukang Gan, Wei Huang, Xiu Li, Xiaojuan Qi, Ying Shan

HaploOmni: Unified Single Transformer for Multimodal Video Understanding and Generation

With the advancement of language models, unified multimodal understanding and generation have made significant strides, with model architectures evolving from separated components to unified single-model frameworks. This paper explores an…

Computer Vision and Pattern Recognition · Computer Science 2025-06-04 Yicheng Xiao, Lin Song, Rui Yang, Cheng Cheng, Zunnan Xu, Zhaoyang Zhang, Yixiao Ge, Xiu Li, Ying Shan

DetailFusion: A Dual-branch Framework with Detail Enhancement for Composed Image Retrieval

Composed Image Retrieval (CIR) aims to retrieve target images from a gallery based on a reference image and modification text as a combined query. Recent approaches focus on balancing global information from two modalities and encode the…

Computer Vision and Pattern Recognition · Computer Science 2025-05-26 Yuxin Yang, Yinan Zhou, Yuxin Chen, Ziqi Zhang, Zongyang Ma, Chunfeng Yuan, Bing Li, Lin Song, Jun Gao, Peng Li, Weiming Hu

HaploVL: A Single-Transformer Baseline for Multi-Modal Understanding

Recent advancements in large language models (LLMs) have significantly propelled the development of large multi-modal models (LMMs), highlighting the potential for general and intelligent assistants. However, most LMMs model visual and…

Computation and Language · Computer Science 2025-03-20 Rui Yang, Lin Song, Yicheng Xiao, Runhui Huang, Yixiao Ge, Ying Shan, Hengshuang Zhao

SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

The rapid evolution of multimodal foundation models has demonstrated significant progress in vision-language understanding and generation, e.g., our previous work SEED-LLaMA. However, there remains a gap between its capability and the…

Computer Vision and Pattern Recognition · Computer Science 2025-03-04 Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, Ying Shan

YOLO-UniOW: Efficient Universal Open-World Object Detection

Traditional object detection models are constrained by the limitations of closed-set datasets, detecting only categories encountered during training. While multimodal models have extended category recognition by aligning text and image…

Computer Vision and Pattern Recognition · Computer Science 2024-12-31 Lihao Liu, Juexiao Feng, Hui Chen, Ao Wang, Lin Song, Jungong Han, Guiguang Ding

DiffTune: Auto-Tuning through Auto-Differentiation

The performance of robots in high-level tasks depends on the quality of their lower-level controller, which requires fine-tuning. However, the intrinsically nonlinear dynamics and controllers make tuning a challenging task when it is done…

Robotics · Computer Science 2024-07-12 Sheng Cheng, Minkyung Kim, Lin Song, Chengyu Yang, Yiquan Jin, Shenlong Wang, Naira Hovakimyan

GrootVL: Tree Topology is All You Need in State Space Model

The state space models, employing recursively propagated features, demonstrate strong representation capabilities comparable to Transformer models and superior efficiency. However, constrained by the inherent geometric constraints of…

Machine Learning · Computer Science 2024-06-05 Yicheng Xiao, Lin Song, Shaoli Huang, Jiangshan Wang, Siyu Song, Yixiao Ge, Xiu Li, Ying Shan

UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition

Large-kernel convolutional neural networks (ConvNets) have recently received extensive research attention, but two unresolved and critical issues demand further investigation. 1) The architectures of existing large-kernel ConvNets largely…

Computer Vision and Pattern Recognition · Computer Science 2024-03-19 Xiaohan Ding, Yiyuan Zhang, Yixiao Ge, Sijie Zhao, Lin Song, Xiangyu Yue, Ying Shan

InstructDET: Diversifying Referring Object Detection with Generalized Instructions

We propose InstructDET, a data-centric method for referring object detection (ROD) that localizes target objects based on user instructions. While deriving from referring expressions (REC), the instructions we leverage are greatly…

Artificial Intelligence · Computer Science 2024-03-12 Ronghao Dang, Jiangyan Feng, Haodong Zhang, Chongjian Ge, Lin Song, Lijun Gong, Chengju Liu, Qijun Chen, Feng Zhu, Rui Zhao, Yibing Song

YOLO-World: Real-Time Open-Vocabulary Object Detection

The You Only Look Once (YOLO) series of detectors have established themselves as efficient and practical tools. However, their reliance on predefined and trained object categories limits their applicability in open scenarios. Addressing…

Computer Vision and Pattern Recognition · Computer Science 2024-02-23 Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, Ying Shan

Verification of $L_1$ Adaptive Control using Verse Library: A Case Study of Quadrotors

$L_1$ adaptive control ($L_1$AC) is a control design technique that can handle a broad class of system uncertainties and provide transient performance guarantees. In this work-in-progress abstract, we discuss how existing formal…

Systems and Control · Electrical Eng. & Systems 2024-02-15 Lin Song, Yangge Li, Sheng Cheng, Pan Zhao, Sayan Mitra, Naira Hovakimyan