Are AI Training Datasets the Next Frontier of Intellectual Property?

Intellectual Property | Tuğba Güleş - 11.03.2026

Introduction

Generative artificial intelligence (AI) systems are trained on vast datasets containing text, images, and other digital materials, often derived from copyrighted works. This practice has sparked growing legal disputes between creators and AI developers. While creators claim their works are used without authorization, technology companies argue that AI training merely extracts statistical patterns rather than reproducing expressive content. Current debates focus on whether such uses fall within doctrines like fair use in the United States or text-and-data mining (TDM) exceptions in the European Union. However, the issue may be deeper: AI training datasets function as informational infrastructure, raising the question of whether they represent a new category of intellectual assets beyond traditional copyright frameworks.

Courts Begin to Confront AI Training

Although the jurisprudence remains in its early stages, courts are beginning to address these questions. In Thomson Reuters v Ross Intelligence, a U.S. federal court rejected the defendant’s fair-use defense where copyrighted legal headnotes were used to train an AI tool for a competing legal research platform. The court found that the use was commercial and not transformative, and that it harmed the plaintiff’s actual and potential markets, including a potential market for licensing its content as AI training data.

Generative AI developers have also faced lawsuits from artists and media organizations. In Getty Images v Stability AI, Getty alleges that Stability AI copied millions of its copyrighted images, including images bearing Getty watermarks, to train its image-generation model.

Similarly, Andersen v Stability AI involves claims by visual artists who argue that generative AI models trained on their works can generate images that imitate their distinctive artistic styles. The case raises novel questions about whether training datasets enable indirect appropriation of artistic expression.

Together, these disputes illustrate the growing difficulty of applying traditional copyright doctrine to machine learning technologies.

Diverging Regulatory Approaches

While courts grapple with these issues, policymakers across jurisdictions are developing different regulatory responses:

  • The European Union

The European Union has adopted the most comprehensive regulatory framework to date. The EU Artificial Intelligence Act requires providers of general-purpose AI models to publish a sufficiently detailed summary of the content used for training and to put in place a policy for complying with EU copyright law. The AI Act operates alongside the TDM exceptions introduced by the Directive on Copyright in the Digital Single Market (Directive (EU) 2019/790), which allow automated analysis of lawfully accessible works. Outside the exception for scientific research, rights holders may opt out by expressly reserving their rights. This framework seeks to balance innovation with creator protection through transparency and opt-out mechanisms.

  • Türkiye

Türkiye has not yet adopted specific legislation addressing the use of copyrighted works in AI training. The existing framework under the Law on Intellectual and Artistic Works (FSEK No. 5846) governs reproduction and use of protected works but does not expressly regulate text-and-data mining or machine learning activities. As a result, the legality of AI training datasets would likely be assessed through general copyright principles, including reproduction rights and limitations to authors’ rights. With AI development accelerating globally and the EU advancing regulatory standards through the AI Act, Türkiye may soon face pressure to clarify the legal status of AI training practices.

  • The United States

Unlike the EU, the United States has largely taken a litigation-driven approach, leaving courts to determine whether AI training qualifies as fair use. This reflects a broader policy divergence: the EU tends to regulate emerging technologies through ex ante rules, whereas the U.S. often relies on judicial interpretation to adapt existing doctrines.

  • Japan

Japan represents a third regulatory model. Article 30-4 of its Copyright Act is one of the world’s most permissive data-analysis exceptions: it allows copyrighted works to be used for machine learning for both commercial and non-commercial purposes, provided the use does not unreasonably prejudice the interests of the copyright holder. This policy has helped position Japan as an attractive environment for AI development.

These differing approaches highlight the absence of a harmonized international framework governing AI training datasets.

Why Training Data May Require a New IP Framework

The controversy surrounding AI training data suggests that existing copyright doctrines may be insufficient. Unlike traditional works, training datasets derive value from aggregating and computationally analyzing large-scale data, with their informational content becoming embedded in a model’s parameters. While this resembles the logic behind the EU’s sui generis database right, AI datasets operate at a much larger and more dynamic scale. Recognizing them as a new legal category could foster licensing markets and encourage investment in data curation. However, granting proprietary rights over datasets risks privatizing the internet’s informational commons and reinforcing the power of dominant technology companies. The debate therefore raises a broader policy question: how should intellectual property law balance innovation, competition, and access to knowledge in the age of artificial intelligence?

Conclusion

The legal conflicts surrounding AI training datasets reveal a structural tension within intellectual property law. Copyright doctrine was designed to regulate the reproduction of creative works, yet generative AI systems transform those works into statistical knowledge used to generate new content. Courts currently attempt to address these disputes through doctrines such as fair use and text-and-data mining exceptions. However, the deeper challenge lies in the nature of training data itself. AI datasets function less like traditional works and more like infrastructural resources for knowledge production. If that is the case, the central policy question is no longer whether AI training infringes copyright. Instead, it is whether the datasets powering AI should be recognized as the next frontier of intellectual property law.


Resources

1. 17 U.S.C. §107 (Fair Use Doctrine).

2. Thomson Reuters Enterprise Centre GmbH v Ross Intelligence Inc., No. 1:20-cv-613 (D. Del. Feb. 11, 2025).

3. Getty Images (US) Inc. v Stability AI Ltd., No. 1:23-cv-00135 (D. Del. 2023).

4. Andersen v Stability AI Ltd., No. 3:23-cv-00201 (N.D. Cal. 2023).

5. European Parliament. “EU AI Act: First Regulation on Artificial Intelligence.” https://www.europarl.europa.eu/topics/en/article/20230601STO93804/eu-ai-act-first-regulation-on-artificial-intelligence