Industrial Technology Correspondent
The debate around generative AI is no longer just about whether a system provides correct answers. The tougher question is whether a system's reference to an external text serves the same function as a human citation, or whether it merely sounds convincing without carrying the cultural and legal weight of citation. This is precisely where technology, copyright law, and user expectations intersect.
The U.S. Copyright Office has explicitly framed this debate, across several parts of its artificial intelligence report, in terms of established principles.[1][7][9] Part 2 states that existing copyright rules are flexible enough to cover generative AI; at the same time, it clarifies that AI outputs are protectable only if a human sufficiently determines the expressive elements.[7][9] This matters for the citation question because the Office thereby draws a line: not every machine-generated textual similarity is an independent creative work.
The dispute between The New York Times and OpenAI sharpens this line further.[2][5][8][10] According to the publicly known allegations, the case concerns not only training on journalistic texts but also claims that system outputs sometimes reproduce passages from articles almost verbatim and could thus serve as substitutes for the original.[2][5][8][10] OpenAI, in contrast, invokes fair use and asserts that the models are not meant to be direct substitutes for newspaper content.[2][5][8] Legally, this leaves a core question open: is a model that in places stays very close to the original still a search and generation system, or is it already a distribution channel for others’ content?
For technical classification, Retrieval-Augmented Generation (RAG) is a useful contrasting concept.[3][11][12] The method combines a language model with external search and aims to ground answers in a verifiable set of sources.[3][11][12] Descriptions of such systems emphasize that they can provide sources users can check, which builds trust.[11][12] But this is still not the same as a citation in the human sense. A RAG system can show evidence without "understanding" why a citation is marked, delimited, and contextualized in scientific or journalistic practice.
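The retrieval step described above can be illustrated with a minimal sketch. It assumes a toy in-memory corpus and a simple word-overlap ranking; the names `Passage`, `retrieve`, and `answer_with_sources` are hypothetical and are not taken from any of the cited systems. A production RAG pipeline would typically use a search index or embeddings and pass the evidence to a language model, which this sketch deliberately leaves out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    source_id: str  # hypothetical identifier, e.g. a document title or URL
    text: str


def tokenize(text: str) -> set[str]:
    """Lowercase, split on whitespace, and strip simple punctuation."""
    return {word.strip(".,;:!?\"'()").lower() for word in text.split()} - {""}


def retrieve(query: str, corpus: list[Passage], k: int = 2) -> list[Passage]:
    """Rank passages by how many words they share with the query."""
    query_words = tokenize(query)
    scored = [(len(query_words & tokenize(p.text)), p) for p in corpus]
    scored = [(score, p) for score, p in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored[:k]]


def answer_with_sources(query: str, corpus: list[Passage]) -> dict:
    """Return the retrieved evidence and the source ids a user could check.

    Assumption: a real system would hand the evidence to a language model
    at this point; the sketch returns only the evidence itself.
    """
    hits = retrieve(query, corpus)
    return {
        "evidence": [p.text for p in hits],
        "sources": [p.source_id for p in hits],
    }


corpus = [
    Passage("doc-a", "Fair use is a defense under copyright law."),
    Passage("doc-b", "Retrieval systems rank passages for a query."),
    Passage("doc-c", "Bread needs flour, water, and yeast."),
]
result = answer_with_sources("What is fair use in copyright law?", corpus)
print(result["sources"])
```

The sketch returns source identifiers next to the evidence, which is the checkable part. Nothing in it marks, delimits, or contextualizes a quotation, which is exactly the gap between a source reference and a citation.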
That is why the confusion between source reference and citation is so persistent. Humans cite to make origin, authority, and boundaries visible; the practice involves responsibility. A model, by contrast, combines patterns from training, retrieval, and generation.[1][11][12] It can output signals of origin without any intent to cite.[1][11][12] This difference may sound merely semantic, but it matters in industrial practice: product teams are currently building interfaces meant to generate trust, and they often run into the expectation that a list of sources is already a substitute for editorial diligence.
On the other hand, there are authors and publishers who consider this assumption dangerous.[4][6] The statements cited here argue that unlicensed use of creative works in training puts creators' livelihoods under pressure and cannot be dismissed as mere technical processing.[4][6] This is the economic core of the debate: training a model on others’ texts does not just produce mathematical parameters but also shifts bargaining power over licensing, compensation, and visibility. This is particularly sensitive for news and specialized content because their economic basis depends on proper attribution.
Nevertheless, it remains unclear where exactly the boundary lies between permissible reconstruction and impermissible appropriation. Current sources show mainly two things: first, that courts and authorities do not want to treat generative AI as an exception; second, that the evidentiary question is technically challenging.[1][7][9][10] A single incident with an almost verbatim excerpt says little about the entire system.[2][10] To judge more reliably, more precise data would be needed on how often such outputs occur, under which prompt conditions they appear, and whether they can be deliberately reproduced.
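How such frequency data might be gathered can be sketched in a few lines. The example assumes that a "near-verbatim" output is one sharing a contiguous run of at least eight words with the original; that threshold, the sample texts, and the function names `longest_shared_run` and `near_verbatim_rate` are illustrative assumptions, not a method taken from the cited studies or court filings.

```python
from difflib import SequenceMatcher


def longest_shared_run(output: str, original: str) -> int:
    """Length in words of the longest contiguous run shared by both texts."""
    a = output.lower().split()
    b = original.lower().split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return match.size


def near_verbatim_rate(outputs: list[str], original: str, min_words: int = 8) -> float:
    """Share of outputs containing a copied run of at least min_words words.

    Assumption: min_words=8 is an arbitrary illustrative threshold.
    """
    if not outputs:
        return 0.0
    flagged = sum(longest_shared_run(o, original) >= min_words for o in outputs)
    return flagged / len(outputs)


# Invented sample texts, used only to exercise the functions.
original = "the quick brown fox jumps over the lazy dog near the quiet river bank"
outputs = [
    "As reported, the quick brown fox jumps over the lazy dog near the quiet river.",
    "A fox was seen by the river.",
]
rate = near_verbatim_rate(outputs, original)
print(rate)
```

Running the same measurement across many prompts, and recording the prompt conditions alongside each result, is what would turn a single striking excerpt into evidence about the system as a whole.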
Precisely for that reason, the question of "citing" in AI is also a question of product architecture. If a system only generates statements without cleanly separating origins, the source reference is often more decoration than proof. Conversely, if it works based on search, makes evidence visible, and discloses the boundary between training and external sources, it at least approaches the function users expect from a citation.[3][11][12] The challenge is rarely the model alone; it is the integration of retrieval, display, licensing, and liability in a system that should appear simple to end users.
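One way to picture the separation of origins is a data structure that labels each statement before it is displayed. The following is a minimal sketch under stated assumptions: the `Statement` type, the origin labels "retrieved" and "model", and the bracketed display format are all hypothetical and do not describe any particular product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Statement:
    text: str
    origin: str                      # "retrieved" or "model" (hypothetical labels)
    source_id: Optional[str] = None  # set only for retrieved statements


def render(statements: list[Statement]) -> str:
    """Render an answer that marks which sentences have checkable evidence."""
    lines = []
    for s in statements:
        if s.origin == "retrieved" and s.source_id:
            lines.append(f"{s.text} [source: {s.source_id}]")
        else:
            # No external document backs this sentence, so say so explicitly.
            lines.append(f"{s.text} [no source: model-generated]")
    return "\n".join(lines)


answer = [
    Statement("The report has several parts.", "retrieved", "doc-a"),
    Statement("This suggests a cautious approach.", "model"),
]
rendered = render(answer)
print(rendered)
```

The design choice is that a missing source is shown rather than hidden, so a list of references cannot silently cover sentences that no retrieved document supports.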
From a European perspective, this is more than a U.S. legal dispute with industry ties. Once AI systems are incorporated into editorial operations, knowledge databases, legal applications, or industrial documentation chains, the way sources are handled determines both trust and risk.[3][6][7][9] A misattributed reference in those settings is not just a stylistic problem; it can affect processes, verification chains, and liability. The question to ask is therefore not about the glamorous term "citation" but about the more reliable practice behind it: who provides the source, who verifies it, and what happens if the system stays too close to the original? These are the questions that will sustain the debate over AI and copyright longer than any quick answer on the screen.
References
Small numbered tags in the article body point to the sources below.
- [1] Copyright and Artificial Intelligence, Part 2 Copyrightability Report (PDF)
- [2] OpenAI Claps Back at NYT Lawsuit
- [3] Incorporating Legal Structure in Retrieval-Augmented Generation: A Case Study on Copyright Fair Use
- [4] Copyright and Artificial Intelligence, Part 3: Generative AI Training ... (PDF)
- [5] The New York Times, OpenAI, and the Copyright Implications of AI ... (PDF)
- [6] May 3, 2024 Via E-Mail Suzanne Wilson General Counsel ...
- [7] Copyright Office Releases Part 2 of Artificial Intelligence Report
- [8] Stolen Stories or Fair Use? The New York Times v. OpenAI and the Limits of Machine Learning (Columbia Undergraduate Law Review)
- [9] Copyright and Artificial Intelligence | U.S. Copyright Office
- [10] Exploring Memorization and Copyright Violation in Frontier LLMs: A Study of the New York Times v. OpenAI 2023 Lawsuit
- [11] What Is Retrieval-Augmented Generation aka RAG (NVIDIA Blog)
- [12] Aman's AI Journal • Primers • Retrieval Augmented Generation