Data Licenses and Attribution
NOVA is committed to using only legally licensed datasets for training. This document tracks all approved data sources and their licenses.
License Philosophy
What We Use
- ✅ Public Domain: No restrictions
- ✅ CC0: Public domain dedication
- ✅ CC-BY, CC-BY-SA, ODC-BY: Attribution required (CC-BY-SA also requires share-alike)
- ✅ MIT/Apache/BSD: Permissive open source
What We DON'T Use
- ❌ All Rights Reserved: Copyrighted without permission
- ❌ CC-BY-NC: Non-commercial restrictions
- ❌ CC-BY-ND: No derivatives restrictions
- ❌ Unknown/Unlicensed: No verified license
- ❌ Scraped Web Data: Without license verification
Approved Dataset Sources
1. Wikipedia (English)
- License: CC-BY-SA 3.0
- URL: https://dumps.wikimedia.org/
- Size: ~20 GB (compressed)
- Language: English
- Description: English Wikipedia articles
Attribution:
Wikipedia contributors. English Wikipedia. Wikimedia Foundation. Licensed under CC-BY-SA 3.0.
Usage: Text data for general knowledge
2. Project Gutenberg
- License: Public Domain
- URL: https://www.gutenberg.org/
- Size: ~15 GB
- Language: Primarily English
- Description: Public domain books (pre-1928 in US)
Attribution:
Project Gutenberg. Public domain literary works.
Usage: Literary text, historical documents
3. OpenWebText
- License: CC0 1.0 (Public Domain Dedication)
- URL: https://huggingface.co/datasets/Skylion007/openwebtext
- Size: ~38 GB
- Language: English
- Description: Open reproduction of WebText (Reddit links)
Attribution:
OpenWebText dataset by Aaron Gokaslan and Vanya Cohen. CC0 1.0 Universal.
Usage: Web-scraped text (Reddit-filtered)
4. C4 (Colossal Clean Crawled Corpus)
- License: ODC-BY (Open Data Commons Attribution)
- URL: https://huggingface.co/datasets/c4
- Size: ~300 GB (en subset)
- Language: English
- Description: Cleaned Common Crawl data
Attribution:
C4 dataset from Google's T5 paper. ODC-BY license.
Usage: Large-scale web text
5. The Pile - ArXiv Subset
- License: Various (mostly permissive for ArXiv subset)
- URL: https://pile.eleuther.ai/
- Size: ~60 GB (ArXiv subset)
- Language: English
- Description: ArXiv papers (scientific articles)
Attribution:
The Pile by EleutherAI. ArXiv papers subset.
Usage: Scientific and technical text
Note: Use only the subsets whose licenses have been verified as permissive.
License Tracking System
Ledger File
All downloaded datasets are tracked in:
data/processed/license_ledger.json
Format:
{
  "sources": [
    {
      "name": "wikipedia-en",
      "license": "cc-by-sa-3.0",
      "url": "https://dumps.wikimedia.org/enwiki/",
      "download_date": "2025-01-15",
      "size_gb": 20.5,
      "attribution": "Wikipedia contributors..."
    }
  ]
}
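As a rough illustration of how the ledger format above could be read and sanity-checked, here is a minimal sketch. The field names come from the example entry; the `LedgerSource` class and `parse_ledger` function are hypothetical and are not part of the documented `nova_data` API.

```python
import json
from dataclasses import dataclass

# Field names mirror the example ledger entry; everything else is illustrative.
REQUIRED_FIELDS = ("name", "license", "url", "download_date", "size_gb", "attribution")


@dataclass(frozen=True)
class LedgerSource:
    name: str
    license: str
    url: str
    download_date: str  # ISO date string, e.g. "2025-01-15"
    size_gb: float
    attribution: str


def parse_ledger(text: str) -> list[LedgerSource]:
    """Parse ledger JSON text and fail loudly on entries with missing fields."""
    data = json.loads(text)
    sources = []
    for index, entry in enumerate(data.get("sources", [])):
        missing = [field for field in REQUIRED_FIELDS if field not in entry]
        if missing:
            raise ValueError(f"ledger entry {index} is missing fields: {missing}")
        sources.append(LedgerSource(**{field: entry[field] for field in REQUIRED_FIELDS}))
    return sources
```

Rejecting incomplete entries at load time keeps a source with no recorded license from slipping into a training run unnoticed.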
Verification
Before training, verify licenses:
python -m nova_data.pipeline verify_licenses
This checks that all data sources have approved licenses.
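The internals of `verify_licenses` are not shown in this document. Conceptually, a check like this can be an allowlist lookup over the ledger entries, as in the sketch below. The function name and the license identifiers other than `cc-by-sa-3.0` are assumptions; the real pipeline may spell them differently.

```python
# Assumed allowlist, based on "What We Use" and the approved datasets' licenses.
APPROVED_LICENSES = {
    "public-domain",
    "cc0-1.0",
    "cc-by-4.0",
    "cc-by-sa-3.0",
    "odc-by",
    "mit",
    "apache-2.0",
    "bsd-3-clause",
}


def find_unapproved(sources: list[dict]) -> list[str]:
    """Return the names of sources whose license is missing or not on the allowlist."""
    problems = []
    for source in sources:
        license_id = str(source.get("license", "")).strip().lower()
        if license_id not in APPROVED_LICENSES:
            problems.append(source.get("name", "<unnamed>"))
    return problems
```

An empty return value means every listed source passed. Any name in the list should block training until its license is resolved.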
Attribution Requirements
CC-BY Datasets
Required:
- Attribute the original creator
- Include license name
- Link to license
- Indicate if changes were made
Our Attribution:
All NOVA models trained on CC-BY data include:
This model was trained on data including:
- Wikipedia (CC-BY-SA 3.0)
- [Other CC-BY sources]
Full attributions in DATA_LICENSES.md
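A notice of this shape could be generated from ledger entries instead of being maintained by hand. The sketch below assumes ledger-style dictionaries with `name` and `license` keys; `build_attribution_notice` is a hypothetical helper, not an existing NOVA function.

```python
def build_attribution_notice(sources: list[dict]) -> str:
    """Render an attribution notice from ledger-style entries, sorted by name."""
    lines = ["This model was trained on data including:"]
    for source in sorted(sources, key=lambda s: s["name"]):
        # Upper-case the ledger identifier, e.g. "odc-by" becomes "ODC-BY".
        lines.append(f"- {source['name']} ({source['license'].upper()})")
    lines.append("Full attributions in DATA_LICENSES.md")
    return "\n".join(lines)
```

Sorting by name keeps the output stable between runs, so changes to the notice show up cleanly in version control.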
Public Domain
Required: None (but we attribute anyway for transparency)
Custom Datasets
User-Provided Data
If training NOVA on your own data:
Your Responsibility:
- Ensure you have rights to use the data
- Verify any license requirements
- Add custom sources to ledger
Example:
# configs/data/custom.yaml
sources:
  - name: my-custom-dataset
    license: mit  # or your license
    path: /path/to/data
    description: My custom training data
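How a custom source ends up in the ledger is not specified here. One possible approach, once the YAML has been parsed into a dictionary, is sketched below. The `register_custom_source` helper and its set of required fields are assumptions made for illustration.

```python
# Assumed minimum for a custom entry; "description" is treated as optional.
REQUIRED_CUSTOM_FIELDS = ("name", "license", "path")


def register_custom_source(ledger: dict, custom: dict) -> dict:
    """Return a new ledger with `custom` appended; reject incomplete or duplicate entries."""
    missing = [field for field in REQUIRED_CUSTOM_FIELDS if not custom.get(field)]
    if missing:
        raise ValueError(f"custom source is missing fields: {missing}")
    existing_names = {source["name"] for source in ledger.get("sources", [])}
    if custom["name"] in existing_names:
        raise ValueError(f"source {custom['name']!r} is already in the ledger")
    # Build a new dict so the caller's ledger is left unmodified.
    return {**ledger, "sources": [*ledger.get("sources", []), dict(custom)]}
```

Requiring an explicit `license` value makes the person adding the data state its license, in line with the responsibilities listed above.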
Commercial Use Considerations
NOVA Code
- License: Apache 2.0
- Commercial Use: ✅ Allowed
Training Data
Depends on dataset:
Dataset | Commercial Use |
---|---|
Wikipedia | ✅ Allowed (with attribution and share-alike) |
Project Gutenberg | ✅ Allowed (public domain) |
OpenWebText | ✅ Allowed (CC0) |
C4 | ✅ Allowed (ODC-BY, with attribution) |
The Pile (ArXiv) | ⚠️ Verify per-subset |
Recommendation: Review each dataset's license for commercial projects.
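The table can also be expressed as a lookup that defaults to "verify" for anything it does not recognize. This is a sketch only: the license identifiers are assumed spellings, and `commercial_use` is a hypothetical helper.

```python
from enum import Enum


class CommercialUse(Enum):
    ALLOWED = "allowed"
    ALLOWED_WITH_ATTRIBUTION = "allowed with attribution"
    VERIFY = "verify per subset"


# Mirrors the table above; keys are assumed ledger identifiers.
COMMERCIAL_USE_BY_LICENSE = {
    "public-domain": CommercialUse.ALLOWED,
    "cc0-1.0": CommercialUse.ALLOWED,
    "cc-by-sa-3.0": CommercialUse.ALLOWED_WITH_ATTRIBUTION,
    "odc-by": CommercialUse.ALLOWED_WITH_ATTRIBUTION,
}


def commercial_use(license_id: str) -> CommercialUse:
    """Look up a license; unknown or mixed licenses fall back to VERIFY."""
    key = license_id.strip().lower()
    return COMMERCIAL_USE_BY_LICENSE.get(key, CommercialUse.VERIFY)
```

Falling back to `VERIFY` rather than `ALLOWED` means a mistyped or new license identifier triggers a manual review instead of a silent approval.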
Excluded Sources
Why We Don't Use Certain Data
Common Crawl (raw):
- Contains copyrighted material
- License status unclear for many pages
- We use filtered versions (C4) instead
Social Media (Twitter, etc.):
- Terms of Service restrictions
- Privacy concerns
- Unclear licensing
Books3/LibGen:
- Contains copyrighted books
- Legal issues
- Not permissively licensed
YouTube Subtitles:
- Copyright unclear
- TOS restrictions
Compliance Checklist
Before training NOVA:
- All data sources listed in license_ledger.json
- Each source has verified license
- Licenses are permissive (CC-BY, MIT, Apache, public domain, etc.)
- Attribution prepared for sources that require it (CC-BY, CC-BY-SA, ODC-BY)
- No excluded sources used
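Parts of this checklist can be automated. The sketch below evaluates each item over ledger-style entries and reports pass or fail per item. All names are hypothetical, the excluded keywords paraphrase the "Excluded Sources" section, and a non-empty `license` field stands in for "verified", which a script cannot confirm on its own.

```python
# Hypothetical keyword list derived from the "Excluded Sources" section.
EXCLUDED_KEYWORDS = ("common-crawl-raw", "twitter", "books3", "libgen", "youtube")
# Assumed identifiers for licenses that require attribution.
ATTRIBUTION_LICENSES = {"cc-by-4.0", "cc-by-sa-3.0", "odc-by"}


def run_checklist(sources: list[dict], approved: set[str]) -> dict[str, bool]:
    """Evaluate each checklist item; True means the item passes."""
    licenses = [str(s.get("license", "")).lower() for s in sources]
    names = [str(s.get("name", "")).lower() for s in sources]
    return {
        "sources_listed": len(sources) > 0,
        # Presence only: a human still has to confirm the license is correct.
        "licenses_present": all(licenses),
        "licenses_permissive": all(lic in approved for lic in licenses),
        "attribution_prepared": all(
            bool(s.get("attribution"))
            for s, lic in zip(sources, licenses)
            if lic in ATTRIBUTION_LICENSES
        ),
        "no_excluded_sources": not any(
            keyword in name for name in names for keyword in EXCLUDED_KEYWORDS
        ),
    }
```

Returning one boolean per item, rather than a single pass or fail, makes it easy to report exactly which checklist line needs attention.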
Future Datasets
Planned Additions
We're evaluating these sources:
- BookCorpus: Open domain books (pending license review)
- Stack Exchange: CC-BY-SA (with attribution)
- OpenSubtitles: Public domain/permissive subset
- Code datasets: GitHub permissive licenses (MIT, Apache, BSD)
Criteria:
- Clear, permissive license
- High quality
- Legally distributable
Dataset Removal Requests
If you believe we've incorrectly listed a dataset:
- Open an issue: github.com/yourusername/nova/issues
- Include:
  - Dataset name
  - License concern
  - Supporting documentation
- We'll review and respond within 7 days
Legal Disclaimer
This project aims for legal compliance, but:
- We're not lawyers
- License interpretation may vary by jurisdiction
- Users are responsible for their own compliance
- Consult legal counsel for commercial use
The NOVA project provides this information for transparency but makes no warranties about legal compliance.
References
License Texts
- CC-BY 4.0: https://creativecommons.org/licenses/by/4.0/
- CC-BY-SA 3.0: https://creativecommons.org/licenses/by-sa/3.0/
- CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
- Apache 2.0: https://www.apache.org/licenses/LICENSE-2.0
- MIT: https://opensource.org/licenses/MIT
- ODC-BY: https://opendatacommons.org/licenses/by/
Resources
- Creative Commons: https://creativecommons.org/
- Open Data Commons: https://opendatacommons.org/
- OSI Licenses: https://opensource.org/licenses
- Last Updated: 2025
- Document Version: 1.0
- Review Frequency: Quarterly