Detecting Language Model Attacks with Perplexity

Gabriel Alon; Michael Kamfonas


Subjects: Computation and Language; Artificial Intelligence; Cryptography and Security; Machine Learning
Version: v3, 2023-11-08


Abstract

A novel attack on Large Language Models (LLMs) has emerged that exploits adversarial suffixes to deceive models into generating dangerous responses. Such jailbreaks can trick LLMs into giving a malicious user detailed instructions for creating explosives, orchestrating a bank heist, or producing offensive content. By evaluating the perplexity of queries with adversarial suffixes using an open-source LLM (GPT-2), we found that they have exceedingly high perplexity values. After exploring a broad range of regular (non-adversarial) prompt varieties, we concluded that false positives are a significant challenge for plain perplexity filtering. A LightGBM classifier trained on perplexity and token length resolved the false positives and correctly detected most adversarial attacks in the test set.
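To make the quantities in the abstract concrete, here is a minimal sketch of how perplexity and token length can be derived from per-token log-probabilities. It assumes the standard definition of perplexity as the exponential of the negative mean token log-probability (natural log). The log-probability values, the function names, and the threshold are hypothetical illustrations, not figures or code from the paper; in the paper's setup the log-probabilities would come from GPT-2, and the final detector is a trained classifier rather than the fixed threshold shown here.

```python
import math
from typing import Sequence, Tuple


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp(-(1/N) * sum of per-token natural-log probabilities)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def prompt_features(token_logprobs: Sequence[float]) -> Tuple[float, int]:
    """The two features named in the abstract: (perplexity, token length)."""
    return perplexity(token_logprobs), len(token_logprobs)


def plain_perplexity_filter(token_logprobs: Sequence[float], threshold: float) -> bool:
    """Flag a prompt when its perplexity exceeds a fixed threshold.

    This is the 'plain perplexity filtering' baseline; the threshold is a
    hypothetical value that would have to be tuned on real data.
    """
    return perplexity(token_logprobs) > threshold


# Made-up log-probabilities, for illustration only.
natural_prompt = [-2.0, -1.5, -3.0, -2.5]
# The same prompt with an unlikely-looking suffix appended (very low log-probs).
suffixed_prompt = natural_prompt + [-9.0, -11.0, -10.0, -12.0]

HYPOTHETICAL_THRESHOLD = 100.0

for name, logprobs in [("natural", natural_prompt), ("suffixed", suffixed_prompt)]:
    ppl, length = prompt_features(logprobs)
    flagged = plain_perplexity_filter(logprobs, HYPOTHETICAL_THRESHOLD)
    print(f"{name}: perplexity={ppl:.1f}, tokens={length}, flagged={flagged}")
```

A fixed threshold like this is the kind of filter the abstract reports as prone to false positives on regular prompts, which is the motivation for feeding both features to a classifier instead.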
