Dylan Sam — Scifaro

When Should We Introduce Safety Interventions During Pretraining?

Prior work has shown that safety interventions applied during pretraining, such as removing and rephrasing harmful content, can substantially improve the robustness of the resulting models. In this paper, we study the fundamental question…

Machine Learning · Computer Science 2026-02-11 Dylan Sam, Sachin Goyal, Pratyush Maini, Alexander Robey, J. Zico Kolter

Predicting the Performance of Black-box LLMs through Follow-up Queries

Reliably predicting the behavior of language models -- such as whether their outputs are correct or have been adversarially manipulated -- is a fundamentally challenging task. This is often made even more difficult as frontier language…

Machine Learning · Computer Science 2025-12-02 Dylan Sam, Marc Finzi, J. Zico Kolter

Measuring similarity between training examples is critical for curating high-quality and diverse pretraining datasets for language models. However, similarity is typically computed with a generic off-the-shelf embedding model that has been…

Machine Learning · Computer Science 2025-10-22 Dylan Sam, Ayan Chakrabarti, Afshin Rostamizadeh, Srikumar Ramalingam, Gui Citovsky, Sanjiv Kumar

Safety Pretraining: Toward the Next Generation of Safe AI

As large language models (LLMs) are increasingly deployed in high-stakes settings, the risk of generating harmful or toxic content remains a central challenge. Post-hoc alignment methods are brittle: once unsafe patterns are learned during…

Machine Learning · Computer Science 2025-09-16 Pratyush Maini, Sachin Goyal, Dylan Sam, Alex Robey, Yash Savani, Yiding Jiang, Andy Zou, Matt Fredrikson, Zachary C. Lipton, J. Zico Kolter

Evaluating Language Model Reasoning about Confidential Information

As language models are increasingly deployed as autonomous agents in high-stakes settings, ensuring that they reliably follow user-defined rules has become a critical safety concern. To this end, we study whether language models exhibit…

Machine Learning · Computer Science 2025-08-28 Dylan Sam, Alexander Robey, Andy Zou, Matt Fredrikson, J. Zico Kolter

Finetuning CLIP to Reason about Pairwise Differences

Vision-language models (VLMs) such as CLIP are trained via contrastive learning between text and image pairs, resulting in aligned image and text embeddings that are useful for many downstream tasks. A notable drawback of CLIP, however, is…

Machine Learning · Computer Science 2025-07-08 Dylan Sam, Devin Willmott, Joao D. Semedo, J. Zico Kolter

Auditing Fairness under Unobserved Confounding

Many definitions of fairness or inequity involve unobservable causal quantities that cannot be directly estimated without strong assumptions. For instance, it is particularly difficult to estimate notions of fairness that rely on…

Machine Learning · Computer Science 2024-12-10 Yewon Byun, Dylan Sam, Michael Oberst, Zachary C. Lipton, Bryan Wilder

Computing Low-Entropy Couplings for Large-Support Distributions

Minimum-entropy coupling (MEC) -- the process of finding a joint distribution with minimum entropy for given marginals -- has applications in areas such as causality and steganography. However, existing algorithms are either computationally…

Information Theory · Computer Science 2024-05-31 Samuel Sokota, Dylan Sam, Christian Schroeder de Witt, Spencer Compton, Jakob Foerster, J. Zico Kolter

Bayesian Neural Networks with Domain Knowledge Priors

Bayesian neural networks (BNNs) have recently gained popularity due to their ability to quantify model uncertainty. However, specifying a prior for BNNs that captures relevant domain knowledge is often extremely challenging. In this work,…

Machine Learning · Computer Science 2024-02-22 Dylan Sam, Rattana Pukdee, Daniel P. Jeong, Yewon Byun, J. Zico Kolter

Learning with Explanation Constraints

As larger deep learning models are hard to interpret, there has been a recent focus on generating explanations of these black-box models. In contrast, we may have a priori explanations of how models should behave. In this paper, we formalize…

Machine Learning · Computer Science 2023-12-27 Rattana Pukdee, Dylan Sam, J. Zico Kolter, Maria-Florina Balcan, Pradeep Ravikumar

Understanding prompt engineering may not require rethinking generalization

Zero-shot learning in prompted vision-language models, the practice of crafting prompts to build classifiers without an explicit training process, has achieved impressive performance in many settings. This success presents a seemingly…

Machine Learning · Computer Science 2023-10-09 Victor Akinwande, Yiding Jiang, Dylan Sam, J. Zico Kolter

Losses over Labels: Weakly Supervised Learning via Direct Loss Construction

Owing to the prohibitive costs of generating large amounts of labeled data, programmatic weak supervision is a growing paradigm within machine learning. In this setting, users design heuristics that provide noisy labels for subsets of the…

Machine Learning · Computer Science 2023-10-06 Dylan Sam, J. Zico Kolter

Label Propagation with Weak Supervision

Semi-supervised learning and weakly supervised learning are important paradigms that aim to reduce the growing demand for labeled data in current machine learning applications. In this paper, we introduce a novel analysis of the classical…

Machine Learning · Computer Science 2023-04-11 Rattana Pukdee, Dylan Sam, Maria-Florina Balcan, Pradeep Ravikumar

Improving self-supervised representation learning via sequential adversarial masking

Recent methods in self-supervised learning have demonstrated that masking-based pretext tasks extend beyond NLP, serving as useful pretraining objectives in computer vision. However, existing approaches apply random or ad hoc masking…

Computer Vision and Pattern Recognition · Computer Science 2022-12-19 Dylan Sam, Min Bai, Tristan McKinney, Li Erran Li