Related papers: Alignment-Aware Model Adaptation via Feedback-Guid…

Structured Gradient Guidance for Few-Shot Adaptation in Large Language Models

This paper presents a gradient-informed fine-tuning method for large language models under few-shot conditions. The goal is to enhance task adaptability and training stability when data is limited. The method builds on a base loss function…

Computation and Language · Computer Science 2025-06-03 Hongye Zheng, Yichen Wang, Ray Pan, Guiran Liu, Binrong Zhu, Hanlu Zhang

RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment

Generative foundation models are susceptible to implicit biases that can arise from extensive unsupervised training data. Such biases can produce suboptimal samples, skewed outcomes, and unfairness, with potentially serious consequences.…

Machine Learning · Computer Science 2023-12-04 Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, Tong Zhang

Implicit Regularization in Feedback Alignment Learning Mechanisms for Neural Networks

Feedback Alignment (FA) methods are biologically inspired local learning rules for training neural networks with reduced communication between layers. While FA has potential applications in distributed and privacy-aware ML, limitations in…

Machine Learning · Computer Science 2024-06-05 Zachary Robertson, Oluwasanmi Koyejo

The Geometry of Alignment Collapse: When Fine-Tuning Breaks Safety

Fine-tuning aligned language models on benign tasks unpredictably degrades safety guardrails, even when training data contains no harmful content and developers have no adversarial intent. We show that the prevailing explanation, that…

Machine Learning · Computer Science 2026-02-18 Max Springer, Chung Peng Lee, Blossom Metevier, Jane Castleman, Bohdan Turbal, Hayoung Jung, Zeyu Shen, Aleksandra Korolova

RefusalGuard: Geometry-Preserving Fine-Tuning for Safety in LLMs

Fine-tuning safety-aligned language models for downstream tasks often leads to substantial degradation of refusal behavior, making models vulnerable to adversarial misuse. While prior work has shown that safety-relevant features are encoded…

Machine Learning · Computer Science 2026-05-05 Sadia Asif, Mohammad Mohammadi Amiri

Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning

Instruction-following language models are trained to be helpful and safe, yet their safety behavior can deteriorate under benign fine-tuning and worsen under adversarial updates. Existing defenses often offer limited protection or force a…

Computation and Language · Computer Science 2026-05-12 Jyotin Goel, Souvik Maji, Pratik Mazumder

Targeting Misalignment: A Conflict-Aware Framework for Reward-Model-based LLM Alignment

Reward-model-based fine-tuning is a central paradigm in aligning Large Language Models with human preferences. However, such approaches critically rely on the assumption that proxy reward models accurately reflect intended supervision, a…

Computation and Language · Computer Science 2026-01-21 Zixuan Liu, Siavash H. Khajavi, Guangkai Jiang, Xinru Liu

Probing the Robustness of Large Language Models Safety to Latent Perturbations

Safety alignment is a key requirement for building reliable Artificial General Intelligence. Despite significant advances in safety alignment, we observe that minor latent shifts can still trigger unsafe responses in aligned models. We…

Machine Learning · Computer Science 2025-06-23 Tianle Gu, Kexin Huang, Zongqi Wang, Yixu Wang, Jie Li, Yuanqi Yao, Yang Yao, Yujiu Yang, Yan Teng, Yingchun Wang

Alignment-Aware Decoding

Alignment of large language models remains a central challenge in natural language processing. Preference optimization has emerged as a popular and effective method for improving alignment, typically through training-time or prompt-based…

Machine Learning · Computer Science 2025-10-01 Frédéric Berdoz, Luca A. Lanzendörfer, René Caky, Roger Wattenhofer

Random Feedback Alignment Algorithms to train Neural Networks: Why do they Align?

Feedback alignment algorithms are an alternative to backpropagation to train neural networks, whereby some of the partial derivatives that are required to compute the gradient are replaced by random terms. This essentially transforms the…

Machine Learning · Computer Science 2023-06-06 Dominique Chu, Florian Bacho

Task-Specific Adaptation with Restricted Model Access

The emergence of foundational models has greatly improved performance across various downstream tasks, with fine-tuning often yielding even better results. However, existing fine-tuning approaches typically require access to model weights…

Computer Vision and Pattern Recognition · Computer Science 2025-02-04 Matan Levy, Rami Ben-Ari, Dvir Samuel, Nir Darshan, Dani Lischinski

Curvature-Aware Safety Restoration In LLMs Fine-Tuning

Fine-tuning Large Language Models (LLMs) for downstream tasks often compromises safety alignment, even when using parameter-efficient methods like LoRA. In this work, we uncover a notable property: fine-tuned models preserve the geometric…

Machine Learning · Computer Science 2025-11-25 Thong Bach, Thanh Nguyen-Tang, Dung Nguyen, Thao Minh Le, Truyen Tran

GAC: Stabilizing Asynchronous RL Training for LLMs via Gradient Alignment Control

Asynchronous execution is essential for scaling reinforcement learning (RL) to modern large model workloads, including large language models and AI agents, but it can fundamentally alter RL optimization behavior. While prior work on…

Machine Learning · Computer Science 2026-03-03 Haofeng Xu, Junwei Su, Yukun Tian, Lansong Diao, Zhengping Qian, Chuan Wu

Gradient Agreement as an Optimization Objective for Meta-Learning

This paper presents a novel optimization method for maximizing generalization over tasks in meta-learning. The goal of meta-learning is to learn a model for an agent adapting rapidly when presented with previously unseen tasks. Tasks are…

Machine Learning · Computer Science 2018-10-19 Amir Erfan Eshratifar, David Eigen, Massoud Pedram

Revisiting the Robust Generalization of Adversarial Prompt Tuning

Understanding the vulnerability of large-scale pre-trained vision-language models like CLIP against adversarial attacks is key to ensuring zero-shot generalization capacity on various downstream tasks. State-of-the-art defense mechanisms…

Computer Vision and Pattern Recognition · Computer Science 2024-05-21 Fan Yang, Mingxuan Xia, Sangzhou Xia, Chicheng Ma, Hui Hui

Adaptive Dynamic Dehazing via Instruction-Driven and Task-Feedback Closed-Loop Optimization for Diverse Downstream Task Adaptation

In real-world vision systems, haze removal is required not only to enhance image visibility but also to meet the specific needs of diverse downstream tasks. To address this challenge, we propose a novel adaptive dynamic dehazing framework that…

Computer Vision and Pattern Recognition · Computer Science 2026-03-09 Yafei Zhang, Shuaitian Song, Huafeng Li, Shujuan Wang, Yu Liu

Constraint-Aware Flow Matching: Decision Aligned End-to-End Training for Constrained Sampling

Deep generative models provide state-of-the-art performance across a wide array of applications, with recent studies showing increasing applicability for science and engineering. Despite a growing corpus of literature focused on the…

Machine Learning · Computer Science 2026-05-14 Jacob K. Christopher, James E. Warner, Ferdinando Fioretto

Adapt & Align: Continual Learning with Generative Models Latent Space Alignment

In this work, we introduce Adapt & Align, a method for continual learning of neural networks by aligning latent representations in generative models. Neural Networks suffer from abrupt loss in performance when retrained with additional…

Machine Learning · Computer Science 2023-12-22 Kamil Deja, Bartosz Cywiński, Jan Rybarczyk, Tomasz Trzciński

PACE: Marrying generalization in PArameter-efficient fine-tuning with Consistency rEgularization

Parameter-Efficient Fine-Tuning (PEFT) effectively adapts pre-trained transformers to downstream tasks. However, the optimization of tasks performance often comes at the cost of generalizability in fine-tuned models. To address this issue,…

Machine Learning · Computer Science 2026-03-09 Yao Ni, Shan Zhang, Piotr Koniusz

GIFT: Guided Fine-Tuning and Transfer for Enhancing Instruction-Tuned Language Models

A promising paradigm for adapting instruction-tuned language models is to learn task-specific updates on a pretrained base model and subsequently merge them into the instruction-tuned model. However, existing approaches typically treat the…

Computation and Language · Computer Science 2026-05-05 Zhiwen Ruan, Yichao Du, Jianjie Zheng, Longyue Wang, Yun Chen, Peng Li, Jinsong Su, Yang Liu, Guanhua Chen