Han Lin
Recent approaches for video generation with camera control often create anchor videos (i.e., rendered videos that approximate desired camera motions) to guide diffusion models as a structured prior, by rendering from estimated point clouds…
Generating realistic human motion is a central yet unsolved challenge in video generation. While reinforcement learning (RL)-based post-training has driven recent gains in general video quality, extending it to human motion remains…
Oblique decision trees combine the transparency of trees with the power of multivariate decision boundaries, but learning high-quality oblique splits is NP-hard, and practical methods still rely on slow search or theory-free heuristics. We…
The extent of envelope stripping in the progenitor stars is directly reflected in the diversity of spectral features observed in stripped-envelope supernovae (SESNe). Through extensive spectral observation and analysis, we aim to clarify…
While climate-induced population migration has received rising attention, the role played by human climate endeavors remains underexplored. Here, we combine machine learning with attribution mapping to analyze the impacts of 4,713…
We introduce JoyAI-LLM Flash, an efficient Mixture-of-Experts (MoE) language model designed to redefine the trade-off between strong performance and token efficiency in the sub-50B parameter regime. JoyAI-LLM Flash is pretrained on a…
Pixel-space diffusion has recently re-emerged as a strong alternative to latent diffusion, enabling high-quality generation without pretrained autoencoders. However, standard pixel-space diffusion models receive relatively weak semantic…
Multi-agent debate (MAD) systems leverage collective intelligence to enhance reasoning capabilities, yet existing approaches struggle to simultaneously optimize accuracy, consensus formation, and computational efficiency. Static topology…
Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90\% accuracy on popular benchmarks like MMLU,…
Maintaining spatial world consistency over long horizons remains a central challenge for camera-controllable video generation. Existing memory-based approaches often condition generation on globally reconstructed 3D scenes by rendering…
This paper presents "Remember Me, Not Save Me," an AR & AI system enabling virtual citizens to develop personality through collective dialogue. Core innovations include: Dynamic Collective Memory (DCM) model with narrative tension…
In low-mass core-collapse supernova (CCSN) progenitors, nuclear burning beyond oxygen can become explosive under degenerate conditions, triggering eruptive mass loss before the final explosion. We investigate such pre-SN eruptions using…
Emotion recognition from electroencephalography (EEG) signals remains challenging due to high inter-subject variability, limited labeled data, and the lack of interpretable reasoning in existing approaches. While recent multimodal large…
Recent advances in meta-optics have enabled diverse functionalities in compact optical devices; however, conventional forward design approaches become inadequate as device complexity and scale grow. Inverse design offers a powerful…
Multimodal learning has rapidly advanced visual understanding, largely via multimodal large language models (MLLMs) that use powerful LLMs as cognitive cores. In visual generation, however, these powerful core models are typically reduced…
We construct a new family of quantum chaotic models by combining multiple copies of integrable commuting SYK models. As each copy of the commuting SYK model does not commute with others, this construction breaks the integrability of each…
Recent video generation approaches increasingly rely on planning intermediate control signals such as object trajectories to improve temporal coherence and motion fidelity. However, these methods mostly employ single-shot plans that are…
Despite recent progress in 3D-LLMs, they remain limited in accurately grounding language to visual and spatial elements in 3D environments. This limitation stems in part from training data that focuses on language reasoning rather than…
Storytelling video generation (SVG) aims to produce coherent and visually rich multi-scene videos that follow a structured narrative. Existing methods primarily employ LLM for high-level planning to decompose a story into scene-level…
Video temporal grounding (VTG) aims to locate precise segments in videos based on language queries, which is a fundamental challenge in video understanding. While recent Multimodal Large Language Models (MLLMs) have shown promise in…