Context & Relevance
- Access to diverse and voluminous training data (books, articles, web content) is central to improving Large Language Models (LLMs).
- This includes both public domain and copyrighted works—raising significant legal and ethical issues when used without permission.
Relevance: GS 3 (IPR, Technology)
Central Legal Issue
- Key Question: Does using copyrighted material for LLM training—without authorisation—constitute copyright infringement?
- In the U.S., this hinges on whether the use qualifies as “fair use” under Section 107 of the Copyright Act.
Fair Use Doctrine – Four Factors
Courts evaluate fair use claims based on:
- Purpose & Character: Is the use transformative (e.g., generating new knowledge vs reproducing existing works)?
- Nature of Work: Use of factual works is more likely to qualify as fair use than use of fictional/creative works.
- Amount & Substantiality: How much of the original work was used, and how significant was the portion taken?
- Market Effect: Does the use harm the original’s market or potential licensing revenue?
Case 1: Anthropic PBC (Claude LLM)
- Used copyrighted books—some legally purchased, some from questionable sources—to train its GenAI.
- Court ruling:
  - Training with legally purchased books = fair use (transformative use).
  - Copying from illegal sources = not fair use; the court refused to grant blanket protection.
- Key takeaway: the court distinguished between the transformative nature of the use and the legality of how the data was acquired.
Case 2: Meta (LLaMA LLM)
- Sued by 13 authors for using illegally sourced books for training.
- Court ruling:
  - Training = fair use (highly transformative).
  - Plaintiffs failed to prove market harm with empirical data.
  - Unlike in the Anthropic case, the court did not penalise unauthorised downloading as a separate infringement.
- Judge acknowledged the “market dilution” concern but said proof of harm was lacking.
Comparison: Anthropic vs Meta
| Factor | Anthropic | Meta |
| --- | --- | --- |
| Transformative use | Recognised | Recognised |
| Market harm | Downplayed | Downplayed, but future risks noted |
| Illegal sourcing | Treated as a separate infringement | Not distinctly addressed |
| Judgement focus | Data sourcing and use | Final use only |
Precedent Case: Thomson Reuters v. Ross Intelligence
- The court held there was no fair use because the AI simply retrieved legal opinions rather than transforming them.
- The tool also competed directly with the plaintiff’s product, thereby harming its market.
Emerging Legal Standards
- Courts seem to support transformative use in GenAI training—tilting toward fair use.
- But evidence of market harm will be crucial in future cases.
- Use of illegally sourced data may be treated as a separate violation—creating liability even if training is transformative.
Challenges for Plaintiffs
- Hard to prove “market substitution” or “licensing market harm.”
- LLM outputs are often not reproductions, but generated content—making infringement indirect and difficult to establish.
Implications Going Forward
- Unsettled legal landscape: Outcomes will vary case-by-case, based on data sourcing, model purpose, and market effects.
- Need for clearer copyright licensing frameworks and/or legislative clarity.
- Future rulings may hinge on empirical evidence, including studies of AI’s impact on creative economies.