
Modular Embedding Recomposition for Incremental Learning

Subjects: Artificial Intelligence; Computer Vision and Pattern Recognition. Version v2, 2025-10-15.

Abstract

The advent of pre-trained Vision-Language Models (VLMs) has significantly transformed Continual Learning (CL), mainly due to their zero-shot classification abilities. This proficiency makes VLMs well suited to real-world applications, enabling robust performance on unseen classes without any adaptation. However, fine-tuning remains essential when downstream tasks deviate significantly from the pre-training domain. Prior CL approaches primarily focus on preserving the zero-shot capabilities of VLMs during incremental fine-tuning on a downstream task. We take a step further and devise an approach that turns this preservation into an enhancement of the zero-shot capabilities of VLMs. Our approach, MoDular Embedding Recomposition (MoDER), introduces a modular framework that trains multiple textual experts, each specialized in a single seen class, and stores them in a foundational hub. At inference time, for each unseen class, we query the hub and compose the retrieved experts to synthesize a refined prototype that improves classification. We demonstrate the effectiveness of our method on two popular zero-shot incremental protocols, Class-IL and MTIL, comprising a total of 14 datasets. The codebase is available at https://github.com/aimagelab/mammoth.
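The retrieve-and-compose idea from the abstract can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: each textual expert is stored as an additive residual on its seen class's zero-shot text embedding, retrieval ranks seen classes by cosine similarity between class-name embeddings, and composition is a similarity-weighted average of the top-k residuals. `ExpertHub`, `add_expert`, `compose` and `classify` are hypothetical names, not the MoDER implementation in the mammoth codebase, and the embeddings are toy vectors rather than real VLM outputs.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm (assumes v is non-zero)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


class ExpertHub:
    """Hypothetical hub holding one 'expert' per seen class.

    An expert is modeled here as the residual between a fine-tuned class
    prototype and the class's zero-shot text embedding (an assumption)."""

    def __init__(self):
        self.keys = {}       # seen class -> zero-shot text embedding (retrieval key)
        self.residuals = {}  # seen class -> learned residual (the 'expert')

    def add_expert(self, name, text_emb, tuned_emb):
        key = normalize(text_emb)
        self.keys[name] = key
        self.residuals[name] = [t - z for t, z in zip(normalize(tuned_emb), key)]

    def compose(self, text_emb, k=2):
        """Refined prototype for an unseen class: its zero-shot embedding plus a
        similarity-weighted average of the residuals of its k nearest seen classes."""
        query = normalize(text_emb)
        scored = sorted(
            ((cosine(query, key), name) for name, key in self.keys.items()),
            reverse=True,
        )[:k]
        weights = [max(score, 0.0) for score, _ in scored]  # ignore dissimilar experts
        total = sum(weights)
        if total == 0.0:
            return query  # nothing relevant in the hub: keep the zero-shot prototype
        delta = [0.0] * len(query)
        for w, (_, name) in zip(weights, scored):
            for i, r in enumerate(self.residuals[name]):
                delta[i] += (w / total) * r
        return normalize([q + d for q, d in zip(query, delta)])


def classify(image_emb, prototypes):
    """Zero-shot style prediction: the class whose prototype is most similar."""
    return max(prototypes, key=lambda c: cosine(image_emb, prototypes[c]))


hub = ExpertHub()
hub.add_expert("cat", [1.0, 0.0, 0.0], [1.0, 1.0, 0.0])
hub.add_expert("car", [0.0, 0.0, 1.0], [0.0, -1.0, 1.0])

prototypes = {
    "lynx": hub.compose([0.9, 0.1, 0.0], k=1),   # unseen, closest to "cat"
    "truck": hub.compose([0.1, 0.0, 0.9], k=1),  # unseen, closest to "car"
}
print(classify([0.6, 0.8, 0.0], prototypes))
```

In this toy setup the unseen class "lynx" inherits the shift that the "cat" expert learned, so an image embedding aligned with that shift is assigned to "lynx". Falling back to the plain zero-shot embedding when no seen class is similar is one simple way to avoid degrading classes the hub knows nothing about.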


Cite

@article{arxiv.2508.16463,
  title  = {Modular Embedding Recomposition for Incremental Learning},
  author = {Aniello Panariello and Emanuele Frascaroli and Pietro Buzzega and Lorenzo Bonicelli and Angelo Porrello and Simone Calderara},
  journal= {arXiv preprint arXiv:2508.16463},
  year   = {2025}
}

Comments

Accepted to the 36th British Machine Vision Conference (BMVC 2025), Sheffield, UK
