Mathilde Caron — Scifaro

Temporal Chain of Thought: Long-Video Understanding by Thinking in Frames

Despite recent advances in Vision-Language Models (VLMs), long-video understanding remains a challenging problem. Although state-of-the-art long-context VLMs can process around 1000 input frames, they still struggle to effectively leverage…

Machine Learning · Computer Science 2025-07-04 Anurag Arnab , Ahmet Iscen , Mathilde Caron , Alireza Fathi , Cordelia Schmid

Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach

Web-scale visual entity recognition, the task of associating images with their corresponding entities within vast knowledge bases like Wikipedia, presents significant challenges due to the lack of clean, large-scale training data. In this…

Computer Vision and Pattern Recognition · Computer Science 2024-11-01 Mathilde Caron , Alireza Fathi , Cordelia Schmid , Ahmet Iscen

Self-Masking Networks for Unsupervised Adaptation

With the advent of billion-parameter foundation models, efficient fine-tuning has become increasingly important for the adaptation of models to downstream tasks. However, especially in computer vision, it can be hard to achieve good…

Computer Vision and Pattern Recognition · Computer Science 2024-09-13 Alfonso Taboada Warmerdam , Mathilde Caron , Yuki M. Asano

A Generative Approach for Wikipedia-Scale Visual Entity Recognition

In this paper, we address web-scale visual entity recognition, specifically the task of mapping a given query image to one of the 6 million existing entities in Wikipedia. One way of approaching a problem of such scale is using dual-encoder…

Computer Vision and Pattern Recognition · Computer Science 2024-03-22 Mathilde Caron , Ahmet Iscen , Alireza Fathi , Cordelia Schmid

Self-Supervised Learning for Endoscopic Video Analysis

Self-supervised learning (SSL) has led to important breakthroughs in computer vision by allowing learning from large amounts of unlabeled data. As such, it might have a pivotal role to play in biomedicine where annotating data requires a…

Computer Vision and Pattern Recognition · Computer Science 2024-03-14 Roy Hirsch , Mathilde Caron , Regev Cohen , Amir Livne , Ron Shapiro , Tomer Golany , Roman Goldenberg , Daniel Freedman , Ehud Rivlin

Retrieval-Enhanced Contrastive Vision-Text Models

Contrastive image-text models such as CLIP form the building blocks of many state-of-the-art systems. While they excel at recognizing common generic concepts, they still struggle on fine-grained entities which are rare, or even absent from…

Computer Vision and Pattern Recognition · Computer Science 2024-02-22 Ahmet Iscen , Mathilde Caron , Alireza Fathi , Cordelia Schmid

Guided Diffusion from Self-Supervised Diffusion Features

Guidance serves as a key concept in diffusion models, yet its effectiveness is often limited by the need for extra data annotation or classifier pretraining. That is why guidance was harnessed from self-supervised learning backbones, like…

Computer Vision and Pattern Recognition · Computer Science 2023-12-15 Vincent Tao Hu , Yunlu Chen , Mathilde Caron , Yuki M. Asano , Cees G. M. Snoek , Björn Ommer

Weakly-Supervised Surgical Phase Recognition

A key element of computer-assisted surgery systems is phase recognition of surgical videos. Existing phase recognition algorithms require frame-wise annotation of a large number of videos, which is time-consuming and costly. In this work we…

Computer Vision and Pattern Recognition · Computer Science 2023-10-27 Roy Hirsch , Regev Cohen , Mathilde Caron , Tomer Golany , Daniel Freedman , Ehud Rivlin

Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution

The ubiquitous and demonstrably suboptimal choice of resizing images to a fixed resolution before processing them with computer vision models has not yet been successfully challenged. However, models such as the Vision Transformer (ViT)…

Computer Vision and Pattern Recognition · Computer Science 2023-07-13 Mostafa Dehghani , Basil Mustafa , Josip Djolonga , Jonathan Heek , Matthias Minderer , Mathilde Caron , Andreas Steiner , Joan Puigcerver , Robert Geirhos , Ibrahim Alabdulmohsin , Avital Oliver , Piotr Padlewski , Alexey Gritsenko , Mario Lučić , Neil Houlsby

Verbs in Action: Improving verb understanding in video-language models

Understanding verbs is crucial to modelling how people and objects interact with each other and the environment through space and time. Recently, state-of-the-art video-language models based on CLIP have been shown to have limited verb…

Computer Vision and Pattern Recognition · Computer Science 2023-04-14 Liliane Momeni , Mathilde Caron , Arsha Nagrani , Andrew Zisserman , Cordelia Schmid

FlexiViT: One Model for All Patch Sizes

Vision Transformers convert images to sequences by slicing them into patches. The size of these patches controls a speed/accuracy tradeoff, with smaller patches leading to higher accuracy at greater computational cost, but changing the…

Computer Vision and Pattern Recognition · Computer Science 2023-03-27 Lucas Beyer , Pavel Izmailov , Alexander Kolesnikov , Mathilde Caron , Simon Kornblith , Xiaohua Zhai , Matthias Minderer , Michael Tschannen , Ibrahim Alabdulmohsin , Filip Pavetić

Location-Aware Self-Supervised Transformers for Semantic Segmentation

Pixel-level labels are particularly expensive to acquire. Hence, pretraining is a critical step to improve models on a task like semantic segmentation. However, prominent algorithms for pretraining neural networks use image-level…

Computer Vision and Pattern Recognition · Computer Science 2023-03-17 Mathilde Caron , Neil Houlsby , Cordelia Schmid

Scaling Vision Transformers to 22 Billion Parameters

The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture…

Computer Vision and Pattern Recognition · Computer Science 2023-02-13 Mostafa Dehghani , Josip Djolonga , Basil Mustafa , Piotr Padlewski , Jonathan Heek , Justin Gilmer , Andreas Steiner , Mathilde Caron , Robert Geirhos , Ibrahim Alabdulmohsin , Rodolphe Jenatton , Lucas Beyer , Michael Tschannen , Anurag Arnab , Xiao Wang , Carlos Riquelme , Matthias Minderer , Joan Puigcerver , Utku Evci , Manoj Kumar , Sjoerd van Steenkiste , Gamaleldin F. Elsayed , Aravindh Mahendran , Fisher Yu , Avital Oliver , Fantine Huot , Jasmijn Bastings , Mark Patrick Collier , Alexey Gritsenko , Vighnesh Birodkar , Cristina Vasconcelos , Yi Tay , Thomas Mensink , Alexander Kolesnikov , Filip Pavetić , Dustin Tran , Thomas Kipf , Mario Lučić , Xiaohua Zhai , Daniel Keysers , Jeremiah Harmsen , Neil Houlsby

A Memory Transformer Network for Incremental Learning

We study class-incremental learning, a training setup in which new classes of data are observed over time for the model to learn from. Despite the straightforward problem formulation, the naive application of classification models to…

Computer Vision and Pattern Recognition · Computer Science 2022-10-11 Ahmet Iscen , Thomas Bird , Mathilde Caron , Alireza Fathi , Cordelia Schmid

Unsupervised Dense Information Retrieval with Contrastive Learning

Recently, information retrieval has seen the emergence of dense retrievers, using neural networks, as an alternative to classical sparse methods based on term-frequency. These models have obtained state-of-the-art results on datasets and…

Information Retrieval · Computer Science 2022-08-30 Gautier Izacard , Mathilde Caron , Lucas Hosseini , Sebastian Riedel , Piotr Bojanowski , Armand Joulin , Edouard Grave

Masked Siamese Networks for Label-Efficient Learning

We propose Masked Siamese Networks (MSN), a self-supervised learning framework for learning image representations. Our approach matches the representation of an image view containing randomly masked patches to the representation of the…

Machine Learning · Computer Science 2022-04-15 Mahmoud Assran , Mathilde Caron , Ishan Misra , Piotr Bojanowski , Florian Bordes , Pascal Vincent , Armand Joulin , Michael Rabbat , Nicolas Ballas

Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision

Discriminative self-supervised learning allows training models on any random group of internet images, and possibly recovering salient information that helps differentiate between the images. Applied to ImageNet, this leads to object centric…

Computer Vision and Pattern Recognition · Computer Science 2022-02-23 Priya Goyal , Quentin Duval , Isaac Seessel , Mathilde Caron , Ishan Misra , Levent Sagun , Armand Joulin , Piotr Bojanowski

Semi-Supervised Learning of Visual Features by Non-Parametrically Predicting View Assignments with Support Samples

This paper proposes a novel method of learning by predicting view assignments with support samples (PAWS). The method trains a model to minimize a consistency loss, which ensures that different views of the same unlabeled instance are…

Computer Vision and Pattern Recognition · Computer Science 2021-08-03 Mahmoud Assran , Mathilde Caron , Ishan Misra , Piotr Bojanowski , Armand Joulin , Nicolas Ballas , Michael Rabbat

XCiT: Cross-Covariance Image Transformers

Following their success in natural language processing, transformers have recently shown much promise for computer vision. The self-attention operation underlying transformers yields global interactions between all tokens, i.e. words or…

Computer Vision and Pattern Recognition · Computer Science 2021-06-21 Alaaeldin El-Nouby , Hugo Touvron , Mathilde Caron , Piotr Bojanowski , Matthijs Douze , Armand Joulin , Ivan Laptev , Natalia Neverova , Gabriel Synnaeve , Jakob Verbeek , Hervé Jégou

ResMLP: Feedforward networks for image classification with data-efficient training

We present ResMLP, an architecture built entirely upon multi-layer perceptrons for image classification. It is a simple residual network that alternates (i) a linear layer in which image patches interact, independently and identically…

Computer Vision and Pattern Recognition · Computer Science 2021-06-11 Hugo Touvron , Piotr Bojanowski , Mathilde Caron , Matthieu Cord , Alaaeldin El-Nouby , Edouard Grave , Gautier Izacard , Armand Joulin , Gabriel Synnaeve , Jakob Verbeek , Hervé Jégou