Ethical AI Development

The Core Problem

LLMs are trained on vast amounts of data scraped from the internet, often without explicit consent from content creators. This raises fundamental questions:

Who owns the output of a model trained on stolen data?
How do we compensate creators whose work trains these models?
Can we build AI systems that respect intellectual property?

Building on Stolen Knowledge

We are building on stolen knowledge. LLMs themselves are inherently “thieves” of ideas. The foundation is questionable.

This isn’t just a philosophical concern—it’s a practical one:

Legal risk: Ongoing lawsuits against OpenAI, Stability AI, others
Moral hazard: Normalizing theft of creative work
Quality degradation: Training on AI-generated content creates feedback loops
Creator exploitation: Value extraction without compensation

Possible Solutions

1. Data Contributor Platforms

Create a platform where contributors make money off their data being sold in packages.

Think of this:

Tokenizing contribution value to every dividend
Users validate other contributors (like blockchain consensus mechanisms)
Proof-of-custody for data and data rights
Contributors opt-in and get compensated

2. Ethically Sourced LLMs

Build models using only:

Public domain content
Explicitly licensed training data
Open-source datasets with clear licenses
Content from creators who opted in

Yes, it might perform worse. But there are zero legal issues downstream.

3. Local-First AI

Combine with [[local-first-principles]]:

Models run on user devices (quantized, smaller models)
User data never leaves their machine for training
Fine-tuning happens locally with user’s own data
Federated learning without centralized data collection

Open Questions

How do we verify training data provenance?
What does “fair use” mean for training data?
Can we create sustainable economics for data contributors?
Should we regulate pre-training data like we regulate other intellectual property?

Connection to Thoughtful App Co

If Thoughtful App Co uses AI features, they must:

Be transparent about what models we use
Prefer ethically sourced or local models
Never use user data for training without explicit opt-in
Compensate data contributors when possible
Offer non-AI alternatives for all features

This is a Work in Progress

This note is a nebula—a cloud of potential ideas that need structure. As I learn more about:

Data rights legislation (EU AI Act, etc.)
Emerging ethical AI frameworks
Practical implementations of consent-based training
Economics of data marketplaces

…this note will evolve toward a more coherent framework.

Resources to Explore

Research EU AI Act provisions on training data
Investigate consent.ai and similar platforms
Study federated learning implementations
Connect with creators affected by AI training
Explore differential privacy techniques