Initial commit: NOVA - Neuro-Optimizing Versatile Agent
Complete transformer LLM built from scratch with:

Core Features:
- Full transformer architecture (RoPE, RMSNorm, SwiGLU, KV-cache)
- SentencePiece tokenizer (BPE/Unigram)
- Training pipeline (AMP, gradient checkpointing, DDP)
- Persona system with personality matrix (NO AI disclosure by default)
- Genetic evolution (NOVA-EVO) for hyperparameter optimization
- Legal-only data pipeline with license tracking
- Chat interface (CLI + REST API)
- Conversation memory (SQLite)

Model Sizes:
- 125M, 350M, 1.3B, 3B parameters
- Local-first, runs on CPU or GPU
- Python 3.10.6+, PyTorch 2.0+

Personas:
- girlfriend_gentle (high warmth, high empathy)
- girlfriend_playful (high humor, high playfulness)
- girlfriend_supportive (balanced, default)

Documentation:
- Complete README with quickstart
- Model card with ethical considerations
- Privacy documentation (local-first, zero telemetry)
- Data licenses and attribution
- Contributing guide

Infrastructure:
- GitHub Actions CI/CD
- Comprehensive test suite
- Quickstart script
- CLI tool

License: Apache 2.0

🤖 Generated with Claude Code https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>
315
docs/DATA_LICENSES.md
Normal file
@@ -0,0 +1,315 @@
# Data Licenses and Attribution

NOVA is committed to using **only legally licensed datasets** for training. This document tracks all approved data sources and their licenses.

---

## License Philosophy

### What We Use

✅ **Public Domain:** No restrictions
✅ **CC0:** Public domain dedication
✅ **CC-BY / CC-BY-SA / ODC-BY:** Attribution required (CC-BY-SA also requires share-alike)
✅ **MIT/Apache/BSD:** Permissive open source

### What We DON'T Use

❌ **All Rights Reserved:** Copyrighted without permission
❌ **CC-BY-NC:** Non-commercial restrictions
❌ **CC-BY-ND:** No-derivatives restrictions
❌ **Unknown/Unlicensed:** No verified license
❌ **Scraped Web Data:** Without license verification

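The two lists above amount to an allowlist plus a few explicit rejection reasons. A minimal Python sketch of that rule follows. The license identifiers and the `classify_license` helper are illustrative assumptions, not part of NOVA's actual pipeline; the set mirrors the license families named in this document. Anything that is not on the allowlist is rejected, which also covers unknown and unlicensed data.

```python
from typing import Optional, Tuple

# Hypothetical identifiers for the license families this document approves.
APPROVED_LICENSES = {
    "public-domain", "cc0-1.0", "cc-by-4.0", "cc-by-sa-3.0",
    "odc-by-1.0", "mit", "apache-2.0", "bsd-3-clause",
}


def classify_license(license_id: Optional[str]) -> Tuple[bool, str]:
    """Return (allowed, reason) for a license identifier."""
    if license_id is None or not license_id.strip():
        return False, "unknown or unlicensed"
    normalized = license_id.strip().lower()
    # Reject the restricted Creative Commons variants with a specific reason.
    if "-nc" in normalized:
        return False, "non-commercial restriction"
    if "-nd" in normalized:
        return False, "no-derivatives restriction"
    if normalized in APPROVED_LICENSES:
        return True, "approved"
    # Default-deny: a license we have not reviewed is treated as excluded.
    return False, "not on the approved list"


print(classify_license("CC-BY-SA-3.0"))
print(classify_license("cc-by-nc-4.0"))
```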
---

## Approved Dataset Sources

### 1. Wikipedia (English)

**License:** CC-BY-SA 3.0 (Wikipedia's text license changed to CC-BY-SA 4.0 in June 2023; confirm which version applies to the dump you download)
**URL:** https://dumps.wikimedia.org/
**Size:** ~20 GB (compressed)
**Language:** English
**Description:** English Wikipedia articles

**Attribution:**
> Wikipedia contributors. English Wikipedia. Wikimedia Foundation. Licensed under CC-BY-SA 3.0.

**Usage:** Text data for general knowledge

---

### 2. Project Gutenberg

**License:** Public Domain
**URL:** https://www.gutenberg.org/
**Size:** ~15 GB
**Language:** Primarily English
**Description:** Books in the US public domain (the cutoff advances each year; works published before 1930 as of 2025)

**Attribution:**
> Project Gutenberg. Public domain literary works.

**Usage:** Literary text, historical documents

---

### 3. OpenWebText

**License:** CC0 1.0 (Public Domain Dedication)
**URL:** https://huggingface.co/datasets/Skylion007/openwebtext
**Size:** ~38 GB
**Language:** English
**Description:** Open reproduction of OpenAI's WebText corpus (pages linked from Reddit)

**Attribution:**
> OpenWebText dataset by Aaron Gokaslan and Vanya Cohen. CC0 1.0 Universal.

**Usage:** Web text filtered by Reddit links

---

### 4. C4 (Colossal Clean Crawled Corpus)

**License:** ODC-BY (Open Data Commons Attribution)
**URL:** https://huggingface.co/datasets/c4
**Size:** ~300 GB (en subset)
**Language:** English
**Description:** Cleaned Common Crawl data

**Attribution:**
> C4 dataset from Google's T5 paper. ODC-BY license.

**Usage:** Large-scale web text

---

### 5. The Pile - ArXiv Subset

**License:** Various (mostly permissive for ArXiv subset)
**URL:** https://pile.eleuther.ai/
**Size:** ~60 GB (ArXiv subset)
**Language:** English
**Description:** ArXiv papers (scientific articles)

**Attribution:**
> The Pile by EleutherAI. ArXiv papers subset.

**Usage:** Scientific and technical text

**Note:** Use only the subsets whose licenses have been verified as permissive.

---

## License Tracking System

### Ledger File

All downloaded datasets are tracked in:
```
data/processed/license_ledger.json
```

**Format:**
```json
{
  "sources": [
    {
      "name": "wikipedia-en",
      "license": "cc-by-sa-3.0",
      "url": "https://dumps.wikimedia.org/enwiki/",
      "download_date": "2025-01-15",
      "size_gb": 20.5,
      "attribution": "Wikipedia contributors..."
    }
  ]
}
```

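As an illustration of how an entry in this format could be recorded, here is a minimal Python sketch. The `add_source` helper and the `REQUIRED_FIELDS` tuple are hypothetical names chosen for this example; only the field names come from the format shown above, and NOVA's own pipeline may write the ledger differently.

```python
import json
from pathlib import Path

# Field names taken from the ledger format above.
REQUIRED_FIELDS = ("name", "license", "url", "download_date", "size_gb", "attribution")


def add_source(ledger_path: Path, entry: dict) -> dict:
    """Add or update one source in the ledger file and return the ledger."""
    missing = [field for field in REQUIRED_FIELDS if field not in entry]
    if missing:
        raise ValueError(f"ledger entry is missing fields: {missing}")

    if ledger_path.exists():
        ledger = json.loads(ledger_path.read_text(encoding="utf-8"))
    else:
        ledger = {"sources": []}

    # Replace an existing entry with the same name so re-downloads do not duplicate it.
    ledger["sources"] = [s for s in ledger["sources"] if s.get("name") != entry["name"]]
    ledger["sources"].append(entry)

    ledger_path.parent.mkdir(parents=True, exist_ok=True)
    ledger_path.write_text(json.dumps(ledger, indent=2), encoding="utf-8")
    return ledger
```

Calling `add_source(Path("data/processed/license_ledger.json"), entry)` with a dictionary shaped like the example above would create the file if needed and keep one entry per source name.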
### Verification

Before training, verify licenses:

```bash
python -m nova_data.pipeline verify_licenses
```

This checks that all data sources have approved licenses.

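The internals of `verify_licenses` are not shown in this document. As a rough sketch of what such a check might do, the Python below walks a ledger dictionary and reports every source whose license is missing or not approved. The `find_license_problems` function and the identifier set are assumptions made for illustration, not the project's implementation.

```python
from typing import List

# Hypothetical set of approved identifiers, mirroring the licenses named in this document.
APPROVED_LICENSES = {
    "public-domain", "cc0-1.0", "cc-by-4.0", "cc-by-sa-3.0",
    "odc-by-1.0", "mit", "apache-2.0", "bsd-3-clause",
}


def find_license_problems(ledger: dict) -> List[str]:
    """Return one message per ledger source that fails the license check."""
    problems = []
    for source in ledger.get("sources", []):
        name = source.get("name", "<unnamed>")
        license_id = str(source.get("license", "")).strip().lower()
        if not license_id:
            problems.append(f"{name}: no license recorded")
        elif license_id not in APPROVED_LICENSES:
            problems.append(f"{name}: license '{license_id}' is not approved")
    return problems


example_ledger = {
    "sources": [
        {"name": "wikipedia-en", "license": "cc-by-sa-3.0"},
        {"name": "mystery-crawl"},
    ]
}
# An empty list means every source passed; here the second source is reported.
print(find_license_problems(example_ledger))
```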
---

## Attribution Requirements

### CC-BY Datasets

**Required:**
- Attribute the original creator
- Include license name
- Link to license
- Indicate if changes were made

**Our Attribution:**

All NOVA models trained on CC-BY data include:

> This model was trained on data including:
> - Wikipedia (CC-BY-SA 3.0)
> - [Other CC-BY sources]
>
> Full attributions in DATA_LICENSES.md

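A notice like the one above could be generated from the list of sources rather than maintained by hand. The following is a minimal Python sketch under that assumption; `build_attribution_notice` is a hypothetical helper, and it expects each source to carry a display `name` and a `license` string.

```python
from typing import Dict, List


def build_attribution_notice(sources: List[Dict[str, str]]) -> str:
    """Render the Markdown blockquote notice for the given sources."""
    lines = ["> This model was trained on data including:"]
    for source in sources:
        lines.append(f"> - {source['name']} ({source['license']})")
    lines.append(">")
    lines.append("> Full attributions in DATA_LICENSES.md")
    return "\n".join(lines)


notice = build_attribution_notice(
    [
        {"name": "Wikipedia", "license": "CC-BY-SA 3.0"},
        {"name": "C4", "license": "ODC-BY"},
    ]
)
print(notice)
```

Listing every source, including public domain ones, matches the stated practice of attributing for transparency even where it is not required.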
### Public Domain

**Required:** None (but we attribute anyway for transparency)

---

## Custom Datasets

### User-Provided Data

If you train NOVA on your own data:

**Your Responsibility:**
- Ensure you have rights to use the data
- Verify any license requirements
- Add custom sources to the ledger

**Example:**
```yaml
# configs/data/custom.yaml
sources:
  - name: my-custom-dataset
    license: mit  # or your license
    path: /path/to/data
    description: My custom training data
```

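To show how a custom source could end up in the ledger, here is a minimal Python sketch that turns one entry like the YAML above into a ledger-style dictionary. The `CustomSource` dataclass, the `to_ledger_entry` function, and the choice to record the date under `download_date` are assumptions for illustration; the document does not specify how NOVA maps custom configs to ledger entries.

```python
from dataclasses import asdict, dataclass
from datetime import date


@dataclass
class CustomSource:
    """One entry from a custom data config (fields follow the YAML example)."""
    name: str
    license: str
    path: str
    description: str = ""


def to_ledger_entry(source: CustomSource, added_on: date) -> dict:
    """Build a ledger-style dictionary for a user-provided dataset."""
    if not source.license.strip():
        # Unlicensed data is excluded, so refuse to record it.
        raise ValueError(f"{source.name}: a license is required")
    entry = asdict(source)
    entry["license"] = source.license.strip().lower()
    entry["download_date"] = added_on.isoformat()
    return entry


custom = CustomSource(
    name="my-custom-dataset",
    license="MIT",
    path="/path/to/data",
    description="My custom training data",
)
print(to_ledger_entry(custom, date(2025, 1, 15)))
```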
---

## Commercial Use Considerations

### NOVA Code

**License:** Apache 2.0
**Commercial Use:** ✅ Allowed

### Training Data

Commercial use depends on the dataset:

| Dataset | Commercial Use |
|---------|---------------|
| Wikipedia | ✅ Allowed (with attribution and share-alike) |
| Project Gutenberg | ✅ Allowed (public domain) |
| OpenWebText | ✅ Allowed (CC0) |
| C4 | ✅ Allowed (ODC-BY, with attribution) |
| The Pile (ArXiv) | ⚠️ Verify per-subset |

**Recommendation:** Review each dataset's license for commercial projects.

---

## Excluded Sources

### Why We Don't Use Certain Data

**Common Crawl (raw):**
- Contains copyrighted material
- License status unclear for many pages
- We use filtered versions (C4) instead

**Social Media (Twitter, etc.):**
- Terms of Service restrictions
- Privacy concerns
- Unclear licensing

**Books3/LibGen:**
- Contains copyrighted books
- Legal issues
- Not permissively licensed

**YouTube Subtitles:**
- Copyright unclear
- TOS restrictions

---

## Compliance Checklist

Before training NOVA:

- [ ] All data sources listed in `license_ledger.json`
- [ ] Each source has verified license
- [ ] Licenses are permissive (CC-BY, MIT, Apache, public domain, etc.)
- [ ] Attribution prepared for CC-BY sources
- [ ] No excluded sources used

---

## Future Datasets

### Planned Additions

We're evaluating these sources:

- **BookCorpus:** Open domain books (pending license review)
- **Stack Exchange:** CC-BY-SA (with attribution)
- **OpenSubtitles:** Public domain/permissive subset
- **Code datasets:** GitHub permissive licenses (MIT, Apache, BSD)

**Criteria:**
- Clear, permissive license
- High quality
- Legally distributable

---

## Dataset Removal Requests

If you believe we've incorrectly listed a dataset:

1. Open an issue: [github.com/yourusername/nova/issues](https://github.com/yourusername/nova/issues)
2. Include:
   - Dataset name
   - License concern
   - Supporting documentation
3. We'll review and respond within 7 days

---

## Legal Disclaimer

**This project aims for legal compliance, but:**

- We're not lawyers
- License interpretation may vary by jurisdiction
- Users are responsible for their own compliance
- Consult legal counsel for commercial use

**NOVA project provides this information for transparency, but makes no warranties about legal compliance.**

---

## References

### License Texts

- **CC-BY 4.0:** https://creativecommons.org/licenses/by/4.0/
- **CC-BY-SA 3.0:** https://creativecommons.org/licenses/by-sa/3.0/
- **CC0 1.0:** https://creativecommons.org/publicdomain/zero/1.0/
- **Apache 2.0:** https://www.apache.org/licenses/LICENSE-2.0
- **MIT:** https://opensource.org/licenses/MIT
- **ODC-BY:** https://opendatacommons.org/licenses/by/

### Resources

- Creative Commons: https://creativecommons.org/
- Open Data Commons: https://opendatacommons.org/
- OSI Licenses: https://opensource.org/licenses

---

**Last Updated:** 2025
**Document Version:** 1.0
**Review Frequency:** Quarterly