Linjiang Huang — Scifaro

GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning

Visual generation models have made remarkable progress in creating realistic images from text prompts, yet struggle with complex prompts that specify multiple objects with precise spatial relationships and attributes. Effective handling of…

Computer Vision and Pattern Recognition · Computer Science 2026-04-14 Chengqi Duan , Rongyao Fang , Yuqing Wang , Kun Wang , Linjiang Huang , Xingyu Zeng , Hongsheng Li , Xihui Liu

VCBench: A Streaming Counting Benchmark for Spatial-Temporal State Maintenance in Long Videos

Video understanding requires models to continuously track and update world state during playback. While existing benchmarks have advanced video understanding evaluation across multiple dimensions, the observation of how models maintain…

Computer Vision and Pattern Recognition · Computer Science 2026-03-26 Pengyiang Liu , Zhongyue Shi , Hongye Hao , Qi Fu , Xueting Bi , Siwei Zhang , Xiaoyang Hu , Zitian Wang , Linjiang Huang , Si Liu

EditThinker: Unlocking Iterative Reasoning for Any Image Editor

Instruction-based image editing has emerged as a prominent research area and, benefiting from image generation foundation models, has achieved high aesthetic quality, making instruction-following capability the primary challenge.…

Computer Vision and Pattern Recognition · Computer Science 2025-12-08 Hongyu Li , Manyuan Zhang , Dian Zheng , Ziyu Guo , Yimeng Jia , Kaituo Feng , Hao Yu , Yexin Liu , Yan Feng , Peng Pei , Xunliang Cai , Linjiang Huang , Hongsheng Li , Si Liu

Highly Efficient Test-Time Scaling for T2I Diffusion Models with Text Embedding Perturbation

Test-time scaling (TTS) aims to achieve better results by increasing random sampling and evaluating samples based on rules and metrics. However, in text-to-image (T2I) diffusion models, most related works focus on search strategies and…

Computer Vision and Pattern Recognition · Computer Science 2025-12-04 Hang Xu , Linjiang Huang , Feng Zhao

FR-TTS: Test-Time Scaling for NTP-based Image Generation with Effective Filling-based Reward Signal

Test-time scaling (TTS) has become a prevalent technique in image generation, significantly boosting output quality by expanding the number of parallel samples and filtering them using pre-trained reward models. However, applying this…

Computer Vision and Pattern Recognition · Computer Science 2025-12-02 Hang Xu , Linjiang Huang , Feng Zhao

AnyExperts: On-Demand Expert Allocation for Multimodal Language Models with Mixture of Expert

Multimodal Mixture-of-Experts (MoE) models offer a promising path toward scalable and efficient large vision-language systems. However, existing approaches rely on rigid routing strategies (typically activating a fixed number of experts per…

Machine Learning · Computer Science 2025-11-25 Yuting Gao , Wang Lan , Hengyuan Zhao , Linjiang Huang , Si Liu , Qingpei Guo

InfoScale: Unleashing Training-free Variable-scaled Image Generation via Effective Utilization of Information

Diffusion models (DMs) have become dominant in visual generation but suffer a performance drop when tested on resolutions that differ from the training scale, whether lower or higher. In fact, the key challenge in generating variable-scale…

Computer Vision and Pattern Recognition · Computer Science 2025-11-25 Guohui Zhang , Jiangtong Tan , Linjiang Huang , Zhonghang Yuan , Mingde Yao , Jie Huang , Feng Zhao

MathCanvas: Intrinsic Visual Chain-of-Thought for Multimodal Mathematical Reasoning

While Large Language Models (LLMs) have excelled in textual reasoning, they struggle with mathematical domains like geometry that intrinsically rely on visual aids. Existing approaches to Visual Chain-of-Thought (VCoT) are often limited by…

Computer Vision and Pattern Recognition · Computer Science 2025-10-17 Weikang Shi , Aldrich Yu , Rongyao Fang , Houxing Ren , Ke Wang , Aojun Zhou , Changyao Tian , Xinyu Fu , Yuxuan Hu , Zimu Lu , Linjiang Huang , Si Liu , Rui Liu , Hongsheng Li

Group Critical-token Policy Optimization for Autoregressive Image Generation

Recent studies have extended Reinforcement Learning with Verifiable Rewards (RLVR) to autoregressive (AR) visual generation and achieved promising progress. However, existing methods typically apply uniform optimization across all image…

Computer Vision and Pattern Recognition · Computer Science 2025-09-29 Guohui Zhang , Hu Yu , Xiaoxiao Ma , JingHao Zhang , Yaning Pan , Mingde Yao , Jie Xiao , Linjiang Huang , Feng Zhao

FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark

The advancement of open-source text-to-image (T2I) models has been hindered by the absence of large-scale, reasoning-focused datasets and comprehensive evaluation benchmarks, resulting in a performance gap compared to leading closed-source…

Computer Vision and Pattern Recognition · Computer Science 2025-09-12 Rongyao Fang , Aldrich Yu , Chengqi Duan , Linjiang Huang , Shuai Bai , Yuxuan Cai , Kun Wang , Si Liu , Xihui Liu , Hongsheng Li

VaccineRAG: Boosting Multimodal Large Language Models' Immunity to Harmful RAG Samples

Retrieval Augmented Generation enhances the response accuracy of Large Language Models (LLMs) by integrating retrieval and generation modules with external knowledge, demonstrating particular strength in real-time queries and Visual…

Computation and Language · Computer Science 2025-09-08 Qixin Sun , Ziqin Wang , Hengyuan Zhao , Yilin Li , Kaiyou Song , Linjiang Huang , Xiaolin Hu , Qingpei Guo , Si Liu

AeroDuo: Aerial Duo for UAV-based Vision and Language Navigation

Aerial Vision-and-Language Navigation (VLN) is an emerging task that enables Unmanned Aerial Vehicles (UAVs) to navigate outdoor environments using natural language instructions and visual cues. However, due to the extended trajectories and…

Computer Vision and Pattern Recognition · Computer Science 2025-08-22 Ruipu Wu , Yige Zhang , Jinyu Chen , Linjiang Huang , Shifeng Zhang , Xu Zhou , Liang Wang , Si Liu

SkeNa: Learning to Navigate Unseen Environments Based on Abstract Hand-Drawn Maps

A typical human strategy for giving navigation guidance is to sketch route maps based on the environmental layout. Inspired by this, we introduce Sketch map-based visual Navigation (SkeNa), an embodied navigation task in which an agent must…

Robotics · Computer Science 2025-08-06 Haojun Xu , Jiaqi Xiang , Wu Wei , Jinyu Chen , Linqing Zhong , Linjiang Huang , Hongyu Yang , Si Liu

"Hi AirStar, Guide Me to the Badminton Court."

Unmanned Aerial Vehicles, operating in environments with relatively few obstacles, offer high maneuverability and full three-dimensional mobility. This allows them to rapidly approach objects and perform a wide range of tasks often…

Robotics · Computer Science 2025-07-08 Ziqin Wang , Jinyu Chen , Xiangyi Zheng , Qinan Liao , Linjiang Huang , Si Liu

FreeDNA: Endowing Domain Adaptation of Diffusion-Based Dense Prediction with Training-Free Domain Noise Alignment

Domain Adaptation (DA) for dense prediction tasks is an important topic, which enhances a dense prediction model's performance when tested on an unseen domain. Recently, with the development of Diffusion-based Dense Prediction (DDP)…

Computer Vision and Pattern Recognition · Computer Science 2025-07-01 Hang Xu , Jie Huang , Linjiang Huang , Dong Li , Yidi Liu , Feng Zhao

SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving

The integration of Vision-Language Models (VLMs) into autonomous driving systems has shown promise in addressing key challenges such as learning complexity, interpretability, and common-sense reasoning. However, existing approaches often…

Computer Vision and Pattern Recognition · Computer Science 2025-05-23 Xuesong Chen , Linjiang Huang , Tao Ma , Rongyao Fang , Shaoshuai Shi , Hongsheng Li

GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing

Current image generation and editing methods primarily process textual prompts as direct inputs without reasoning about visual composition and explicit operations. We present Generation Chain-of-Thought (GoT), a novel paradigm that enables…

Computer Vision and Pattern Recognition · Computer Science 2025-03-14 Rongyao Fang , Chengqi Duan , Kun Wang , Linjiang Huang , Hao Li , Shilin Yan , Hao Tian , Xingyu Zeng , Rui Zhao , Jifeng Dai , Xihui Liu , Hongsheng Li

FlexDrive: Toward Trajectory Flexibility in Driving Scene Reconstruction and Rendering

Driving scene reconstruction and rendering have advanced significantly using 3D Gaussian Splatting. However, most prior research has focused on the rendering quality along a pre-recorded vehicle path and struggles to generalize to…

Computer Vision and Pattern Recognition · Computer Science 2025-03-04 Jingqiu Zhou , Lue Fan , Linjiang Huang , Xiaoyu Shi , Si Liu , Zhaoxiang Zhang , Hongsheng Li

GaussianPainter: Painting Point Cloud into 3D Gaussians with Normal Guidance

In this paper, we present GaussianPainter, the first method to paint a point cloud into 3D Gaussians given a reference image. GaussianPainter introduces an innovative feed-forward approach to overcome the limitations of time-consuming…

Computer Vision and Pattern Recognition · Computer Science 2024-12-24 Jingqiu Zhou , Lue Fan , Xuesong Chen , Linjiang Huang , Si Liu , Hongsheng Li

FreeEdit: Mask-free Reference-based Image Editing with Multi-modal Instruction

Introducing user-specified visual concepts in image editing is highly practical as these concepts convey the user's intent more precisely than text-based descriptions. We propose FreeEdit, a novel approach for achieving such reference-based…

Computer Vision and Pattern Recognition · Computer Science 2024-09-27 Runze He , Kai Ma , Linjiang Huang , Shaofei Huang , Jialin Gao , Xiaoming Wei , Jiao Dai , Jizhong Han , Si Liu