Files
NOVA/docs/DATA_LICENSES.md
Dani a7f091aa45 Initial commit: NOVA - Neuro-Optimizing Versatile Agent
Complete transformer LLM built from scratch with:

Core Features:
- Full transformer architecture (RoPE, RMSNorm, SwiGLU, KV-cache)
- SentencePiece tokenizer (BPE/Unigram)
- Training pipeline (AMP, gradient checkpointing, DDP)
- Persona system with personality matrix (NO AI disclosure by default)
- Genetic evolution (NOVA-EVO) for hyperparameter optimization
- Legal-only data pipeline with license tracking
- Chat interface (CLI + REST API)
- Conversation memory (SQLite)

Model Sizes:
- 125M, 350M, 1.3B, 3B parameters
- Local-first, runs on CPU or GPU
- Python 3.10.6+, PyTorch 2.0+

Personas:
- girlfriend_gentle (high warmth, high empathy)
- girlfriend_playful (high humor, high playfulness)
- girlfriend_supportive (balanced, default)

Documentation:
- Complete README with quickstart
- Model card with ethical considerations
- Privacy documentation (local-first, zero telemetry)
- Data licenses and attribution
- Contributing guide

Infrastructure:
- GitHub Actions CI/CD
- Comprehensive test suite
- Quickstart script
- CLI tool

License: Apache 2.0

🤖 Generated with Claude Code
https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-12 20:56:37 -04:00

6.7 KiB

Data Licenses and Attribution

NOVA is committed to using only legally licensed datasets for training. This document tracks all approved data sources and their licenses.


License Philosophy

What We Use

Public Domain: No restrictions CC0: Public domain dedication CC-BY: Attribution required MIT/Apache/BSD: Permissive open source

What We DON'T Use

All Rights Reserved: Copyrighted without permission CC-BY-NC: Non-commercial restrictions CC-BY-ND: No derivatives restrictions Unknown/Unlicensed: No verified license Scraped Web Data: Without license verification


Approved Dataset Sources

1. Wikipedia (English)

License: CC-BY-SA 3.0 URL: https://dumps.wikimedia.org/ Size: ~20 GB (compressed) Language: English Description: English Wikipedia articles

Attribution:

Wikipedia contributors. English Wikipedia. Wikimedia Foundation. Licensed under CC-BY-SA 3.0.

Usage: Text data for general knowledge


2. Project Gutenberg

License: Public Domain URL: https://www.gutenberg.org/ Size: ~15 GB Language: Primarily English Description: Public domain books (pre-1928 in US)

Attribution:

Project Gutenberg. Public domain literary works.

Usage: Literary text, historical documents


3. OpenWebText

License: CC0 1.0 (Public Domain Dedication) URL: https://huggingface.co/datasets/Skylion007/openwebtext Size: ~38 GB Language: English Description: Open reproduction of WebText (Reddit links)

Attribution:

OpenWebText dataset by Aaron Gokaslan and Vanya Cohen. CC0 1.0 Universal.

Usage: Web-scraped text (Reddit-filtered)


4. C4 (Colossal Clean Crawled Corpus)

License: ODC-BY (Open Data Commons Attribution) URL: https://huggingface.co/datasets/c4 Size: ~300 GB (en subset) Language: English Description: Cleaned Common Crawl data

Attribution:

C4 dataset from Google's T5 paper. ODC-BY license.

Usage: Large-scale web text


5. The Pile - ArXiv Subset

License: Various (mostly permissive for ArXiv subset) URL: https://pile.eleuther.ai/ Size: ~60 GB (ArXiv subset) Language: English Description: ArXiv papers (scientific articles)

Attribution:

The Pile by EleutherAI. ArXiv papers subset.

Usage: Scientific and technical text

Note: Only use subsets with verified permissive licenses


License Tracking System

Ledger File

All downloaded datasets tracked in:

data/processed/license_ledger.json

Format:

{
  "sources": [
    {
      "name": "wikipedia-en",
      "license": "cc-by-sa-3.0",
      "url": "https://dumps.wikimedia.org/enwiki/",
      "download_date": "2025-01-15",
      "size_gb": 20.5,
      "attribution": "Wikipedia contributors..."
    }
  ]
}

Verification

Before training, verify licenses:

python -m nova_data.pipeline verify_licenses

This checks that all data sources have approved licenses.


Attribution Requirements

CC-BY Datasets

Required:

  • Attribute the original creator
  • Include license name
  • Link to license
  • Indicate if changes were made

Our Attribution:

All NOVA models trained on CC-BY data include:

This model was trained on data including:

  • Wikipedia (CC-BY-SA 3.0)
  • [Other CC-BY sources]

Full attributions in DATA_LICENSES.md

Public Domain

Required: None (but we attribute anyway for transparency)


Custom Datasets

User-Provided Data

If training NOVA on your own data:

Your Responsibility:

  • Ensure you have rights to use the data
  • Verify any license requirements
  • Add custom sources to ledger

Example:

# configs/data/custom.yaml
sources:
  - name: my-custom-dataset
    license: mit  # or your license
    path: /path/to/data
    description: My custom training data

Commercial Use Considerations

NOVA Code

License: Apache 2.0 Commercial Use: Allowed

Training Data

Depends on dataset:

Dataset Commercial Use
Wikipedia Allowed (with attribution)
Project Gutenberg Allowed (public domain)
OpenWebText Allowed (CC0)
C4 Allowed (ODC-BY, with attribution)
The Pile (ArXiv) ⚠️ Verify per-subset

Recommendation: Review each dataset's license for commercial projects.


Excluded Sources

Why We Don't Use Certain Data

Common Crawl (raw):

  • Contains copyrighted material
  • License status unclear for many pages
  • We use filtered versions (C4) instead

Social Media (Twitter, etc.):

  • Terms of Service restrictions
  • Privacy concerns
  • Unclear licensing

Books3/LibGen:

  • Contains copyrighted books
  • Legal issues
  • Not permissively licensed

YouTube Subtitles:

  • Copyright unclear
  • TOS restrictions

Compliance Checklist

Before training NOVA:

  • All data sources listed in license_ledger.json
  • Each source has verified license
  • Licenses are permissive (CC-BY, MIT, Apache, public domain, etc.)
  • Attribution prepared for CC-BY sources
  • No excluded sources used

Future Datasets

Planned Additions

We're evaluating these sources:

  • BookCorpus: Open domain books (pending license review)
  • Stack Exchange: CC-BY-SA (with attribution)
  • OpenSubtitles: Public domain/permissive subset
  • Code datasets: GitHub permissive licenses (MIT, Apache, BSD)

Criteria:

  • Clear, permissive license
  • High quality
  • Legally distributable

Dataset Removal Requests

If you believe we've incorrectly listed a dataset:

  1. Open an issue: github.com/yourusername/nova/issues
  2. Include:
    • Dataset name
    • License concern
    • Supporting documentation
  3. We'll review and respond within 7 days

This project aims for legal compliance, but:

  • We're not lawyers
  • License interpretation may vary by jurisdiction
  • Users are responsible for their own compliance
  • Consult legal counsel for commercial use

NOVA project provides this information for transparency, but makes no warranties about legal compliance.


References

License Texts

Resources


Last Updated: 2025 Document Version: 1.0 Review Frequency: Quarterly