Related papers: Do Language Models Plagiarize?

PlagBench: Exploring the Duality of Large Language Models in Plagiarism Generation and Detection

Recent studies have raised concerns about the potential threats large language models (LLMs) pose to academic integrity and copyright protection. Yet, their investigation is predominantly focused on literal copies of original texts. Also,…

Computation and Language · Computer Science 2025-02-18 Jooyoung Lee, Toshini Agrawal, Adaku Uchendu, Thai Le, Jinghui Chen, Dongwon Lee

Quantifying Memorization Across Neural Language Models

Large language models (LMs) have been shown to memorize parts of their training data, and when prompted appropriately, they will emit the memorized training data verbatim. This is undesirable because memorization violates privacy (exposing…

Machine Learning · Computer Science 2023-03-07 Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, Chiyuan Zhang

The Next Chapter: A Study of Large Language Models in Storytelling

To enhance the quality of generated stories, recent story generation models have been investigating the utilization of higher-level attributes like plots or commonsense knowledge. The application of prompt-based learning with large language…

Computation and Language · Computer Science 2023-07-25 Zhuohan Xie, Trevor Cohn, Jey Han Lau

Undesirable Memorization in Large Language Models: A Survey

While recent research increasingly showcases the remarkable capabilities of Large Language Models (LLMs), it is equally crucial to examine their associated risks. Among these, privacy and security vulnerabilities are particularly…

Computation and Language · Computer Science 2026-01-21 Ali Satvaty, Suzan Verberne, Fatih Turkmen

LLMs Plagiarize: Ensuring Responsible Sourcing of Large Language Model Training Data Through Knowledge Graph Comparison

In light of recent legal allegations brought by publishers, newspapers, and other creators of copyrighted corpora against large language model developers who use their copyrighted materials for training or fine-tuning purposes, we propose a…

Computation and Language · Computer Science 2024-08-05 Devam Mondal, Carlo Lipizzi

SoK: Memorization in General-Purpose Large Language Models

Large Language Models (LLMs) are advancing at a remarkable pace, with myriad applications under development. Unlike most earlier machine learning models, they are no longer built for one specific application but are designed to excel in a…

Computation and Language · Computer Science 2023-10-31 Valentin Hartmann, Anshuman Suri, Vincent Bindschaedler, David Evans, Shruti Tople, Robert West

Paraphrasing with Large Language Models

Recently, large language models such as GPT-2 have shown themselves to be extremely adept at text generation and have also been able to achieve high-quality results in many downstream NLP tasks such as text classification, sentiment…

Computation and Language · Computer Science 2019-11-22 Sam Witteveen, Martin Andrews

Sources of Hallucination by Large Language Models on Inference Tasks

Large Language Models (LLMs) are claimed to be capable of Natural Language Inference (NLI), necessary for applied tasks like question answering and summarization. We present a series of behavioral studies on several LLM families (LLaMA,…

Computation and Language · Computer Science 2023-10-24 Nick McKenna, Tianyi Li, Liang Cheng, Mohammad Javad Hosseini, Mark Johnson, Mark Steedman

Paraphrase Identification with Deep Learning: A Review of Datasets and Methods

The rapid progress of Natural Language Processing (NLP) technologies has led to the widespread availability and effectiveness of text generation tools such as ChatGPT and Claude. While highly useful, these technologies also pose significant…

Computation and Language · Computer Science 2024-10-10 Chao Zhou, Cheng Qiu, Lizhen Liang, Daniel E. Acuna

Exploring Memorization in Fine-tuned Language Models

Large language models (LLMs) have shown great capabilities in various tasks but also exhibited memorization of training data, raising tremendous privacy and copyright concerns. While prior works have studied memorization during…

Artificial Intelligence · Computer Science 2024-02-26 Shenglai Zeng, Yaxin Li, Jie Ren, Yiding Liu, Han Xu, Pengfei He, Yue Xing, Shuaiqiang Wang, Jiliang Tang, Dawei Yin

Do Language Models Know When They're Hallucinating References?

State-of-the-art language models (LMs) are notoriously susceptible to generating hallucinated information. Such inaccurate outputs not only undermine the reliability of these models but also limit their use and raise serious concerns about…

Computation and Language · Computer Science 2024-03-21 Ayush Agrawal, Mirac Suzgun, Lester Mackey, Adam Tauman Kalai

Large Language Models Reflect Human Citation Patterns with a Heightened Citation Bias

Citation practices are crucial in shaping the structure of scientific knowledge, yet they are often influenced by contemporary norms and biases. The emergence of Large Language Models (LLMs) introduces a new dynamic to these practices.…

Digital Libraries · Computer Science 2024-08-27 Andres Algaba, Carmen Mazijn, Vincent Holst, Floriano Tori, Sylvia Wenmackers, Vincent Ginis

FOCUS: Forging Originality through Contrastive Use in Self-Plagiarism for Language Models

Pre-trained Language Models (PLMs) have shown impressive results in various Natural Language Generation (NLG) tasks, such as powering chatbots and generating stories. However, an ethical concern arises due to their potential to produce…

Computation and Language · Computer Science 2024-06-04 Kaixin Lan, Tao Fang, Derek F. Wong, Yabo Xu, Lidia S. Chao, Cecilia G. Zhao

How Large Language Models are Transforming Machine-Paraphrased Plagiarism

The recent success of large language models for text generation poses a severe threat to academic integrity, as plagiarists can generate realistic paraphrases indistinguishable from original work. However, the role of large autoregressive…

Computation and Language · Computer Science 2024-02-09 Jan Philip Wahle, Terry Ruas, Frederic Kirstein, Bela Gipp

Copyright Violations and Large Language Models

Language models may memorize more than just facts, including entire chunks of texts seen during training. Fair use exemptions to copyright laws typically allow for limited use of copyrighted material without permission from the copyright…

Computation and Language · Computer Science 2023-10-24 Antonia Karamolegkou, Jiaang Li, Li Zhou, Anders Søgaard

Quantifying Memorization and Detecting Training Data of Pre-trained Language Models using Japanese Newspaper

Dominant pre-trained language models (PLMs) have demonstrated the potential risk of memorizing and outputting the training data. While this concern has been discussed mainly in English, it is also practically important to focus on…

Computation and Language · Computer Science 2024-08-16 Shotaro Ishihara, Hiromu Takahashi

The Landscape of Memorization in LLMs: Mechanisms, Measurement, and Mitigation

Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks, yet they also exhibit memorization of their training data. This phenomenon raises critical questions about model behavior, privacy risks,…

Machine Learning · Computer Science 2025-12-15 Alexander Xiong, Xuandong Zhao, Aneesh Pappu, Dawn Song

LLMs and Memorization: On Quality and Specificity of Copyright Compliance

Memorization in large language models (LLMs) is a growing concern. LLMs have been shown to easily reproduce parts of their training data, including copyrighted work. This is an important problem to solve, as it may violate existing…

Computation and Language · Computer Science 2024-11-19 Felix B Mueller, Rebekka Görge, Anna K Bernzen, Janna C Pirk, Maximilian Poretschkin

Language Modeling and Understanding Through Paraphrase Generation and Detection

Language enables humans to share knowledge, reason about the world, and pass on strategies for survival and innovation across generations. At the heart of this process is not just the ability to communicate but also the remarkable…

Computation and Language · Computer Science 2026-02-25 Jan Philip Wahle

The (ab)use of Open Source Code to Train Large Language Models

In recent years, Large Language Models (LLMs) have gained significant popularity due to their ability to generate human-like text and their potential applications in various fields, such as Software Engineering. LLMs for Code are commonly…

Software Engineering · Computer Science 2023-03-01 Ali Al-Kaswan, Maliheh Izadi