Revisiting Multi-Task Visual Representation Learning

Computer Vision and Pattern Recognition 2026-01-21 v1

Abstract

Current visual representation learning remains bifurcated: vision-language models (e.g., CLIP) excel at global semantic alignment but lack spatial precision, while self-supervised methods (e.g., MAE, DINO) capture intricate local structures yet struggle with high-level semantic context. We argue that these paradigms are fundamentally complementary and can be integrated into a principled multi-task framework, further enhanced by dense spatial supervision. We introduce MTV, a multi-task visual pretraining framework that jointly optimizes a shared backbone across vision-language contrastive, self-supervised, and dense spatial objectives. To mitigate the need for manual annotations, we leverage high-capacity "expert" models -- such as Depth Anything V2 and OWLv2 -- to synthesize dense, structured pseudo-labels at scale. Beyond the framework, we provide a systematic investigation into the mechanics of multi-task visual learning, analyzing: (i) the marginal gain of each objective, (ii) task synergies versus interference, and (iii) scaling behavior across varying data and model scales. Our results demonstrate that MTV achieves "best-of-both-worlds" performance, significantly enhancing fine-grained spatial reasoning without compromising global semantic understanding. Our findings suggest that multi-task learning, fueled by high-quality pseudo-supervision, is a scalable path toward more general visual encoders.
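As a rough illustration of what jointly optimizing a shared backbone across several objectives can look like, the sketch below combines three per-objective losses into one training scalar using a weighted sum. This is an assumption made for illustration only: the abstract does not say how MTV balances its objectives, and the names `LossWeights` and `combine_losses`, as well as the default weights, are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class LossWeights:
    # Hypothetical weights; the abstract does not state how objectives are balanced.
    contrastive: float = 1.0      # vision-language contrastive objective
    self_supervised: float = 1.0  # self-supervised objective
    dense: float = 1.0            # dense spatial objective on expert pseudo-labels


def combine_losses(losses: Dict[str, float], weights: LossWeights) -> float:
    """Return the weighted sum of per-objective losses.

    Each value in `losses` is assumed to be computed from the same shared
    backbone's features by a separate task head.
    """
    return (
        weights.contrastive * losses["contrastive"]
        + weights.self_supervised * losses["self_supervised"]
        + weights.dense * losses["dense"]
    )


if __name__ == "__main__":
    # Placeholder loss values, not results from the paper.
    step_losses = {"contrastive": 2.0, "self_supervised": 1.0, "dense": 0.5}
    total = combine_losses(step_losses, LossWeights(1.0, 0.5, 2.0))
    print(total)
```

In a real training loop the single combined scalar would be backpropagated through the shared backbone, so every objective contributes gradients to the same parameters. Varying the weights, or setting one to zero, is one simple way to probe the marginal gain of each objective.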

Cite

@article{arxiv.2601.13886,
  title   = {Revisiting Multi-Task Visual Representation Learning},
  author  = {Shangzhe Di and Zhonghua Zhai and Weidi Xie},
  journal = {arXiv preprint arXiv:2601.13886},
  year    = {2026}
}

Comments

Code: https://github.com/Becomebright/MTV
