Initial commit: NOVA - Neuro-Optimizing Versatile Agent

Complete transformer LLM built from scratch with:

Core Features:
- Full transformer architecture (RoPE, RMSNorm, SwiGLU, KV-cache)
- SentencePiece tokenizer (BPE/Unigram)
- Training pipeline (AMP, gradient checkpointing, DDP)
- Persona system with personality matrix (NO AI disclosure by default)
- Genetic evolution (NOVA-EVO) for hyperparameter optimization
- Legal-only data pipeline with license tracking
- Chat interface (CLI + REST API)
- Conversation memory (SQLite)

Model Sizes:
- 125M, 350M, 1.3B, 3B parameters
- Local-first, runs on CPU or GPU
- Python 3.10.6+, PyTorch 2.0+

Personas:
- girlfriend_gentle (high warmth, high empathy)
- girlfriend_playful (high humor, high playfulness)
- girlfriend_supportive (balanced, default)

Documentation:
- Complete README with quickstart
- Model card with ethical considerations
- Privacy documentation (local-first, zero telemetry)
- Data licenses and attribution
- Contributing guide

Infrastructure:
- GitHub Actions CI/CD
- Comprehensive test suite
- Quickstart script
- CLI tool

License: Apache 2.0

🤖 Generated with Claude Code
https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-12 20:56:37 -04:00
commit a7f091aa45
50 changed files with 6437 additions and 0 deletions

docs/DATA_LICENSES.md Normal file

@@ -0,0 +1,315 @@
# Data Licenses and Attribution
NOVA is committed to using **only legally licensed datasets** for training. This document tracks all approved data sources and their licenses.
---
## License Philosophy
### What We Use
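**Public Domain:** No restrictions
**CC0:** Public domain dedication
**CC-BY:** Attribution required
**CC-BY-SA:** Attribution required, share-alike terms apply
**ODC-BY:** Attribution required (database license)
**MIT/Apache/BSD:** Permissive open source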
### What We DON'T Use
**All Rights Reserved:** Copyrighted without permission
**CC-BY-NC:** Non-commercial restrictions
**CC-BY-ND:** No derivatives restrictions
**Unknown/Unlicensed:** No verified license
**Scraped Web Data:** Without license verification
---
## Approved Dataset Sources
### 1. Wikipedia (English)
**License:** CC-BY-SA 3.0 (Wikipedia moved to CC-BY-SA 4.0 in June 2023; check the license stated for the dump you download)
**URL:** https://dumps.wikimedia.org/
**Size:** ~20 GB (compressed)
**Language:** English
**Description:** English Wikipedia articles
**Attribution:**
> Wikipedia contributors. English Wikipedia. Wikimedia Foundation. Licensed under CC-BY-SA 3.0.
**Usage:** Text data for general knowledge
---
### 2. Project Gutenberg
**License:** Public Domain
**URL:** https://www.gutenberg.org/
**Size:** ~15 GB
**Language:** Primarily English
**Description:** Public domain books (published before 1930 in the US, as of 2025; the cutoff advances each year)
**Attribution:**
> Project Gutenberg. Public domain literary works.
**Usage:** Literary text, historical documents
---
### 3. OpenWebText
**License:** CC0 1.0 (Public Domain Dedication)
**URL:** https://huggingface.co/datasets/Skylion007/openwebtext
**Size:** ~38 GB
**Language:** English
**Description:** Open reproduction of OpenAI's WebText corpus (web pages linked from Reddit)
**Attribution:**
> OpenWebText dataset by Aaron Gokaslan and Vanya Cohen. CC0 1.0 Universal.
**Usage:** Web-scraped text (Reddit-filtered)
---
### 4. C4 (Colossal Clean Crawled Corpus)
**License:** ODC-BY (Open Data Commons Attribution)
**URL:** https://huggingface.co/datasets/c4
**Size:** ~300 GB (en subset)
**Language:** English
**Description:** Cleaned Common Crawl data
**Attribution:**
> C4 dataset from Google's T5 paper. ODC-BY license.
**Usage:** Large-scale web text
---
### 5. The Pile - ArXiv Subset
**License:** Varies per paper (each arXiv submission carries an author-selected license, so permissive status must be checked paper by paper)
**URL:** https://pile.eleuther.ai/
**Size:** ~60 GB (ArXiv subset)
**Language:** English
**Description:** ArXiv papers (scientific articles)
**Attribution:**
> The Pile by EleutherAI. ArXiv papers subset.
**Usage:** Scientific and technical text
**Note:** Only use subsets with verified permissive licenses
---
## License Tracking System
### Ledger File
All downloaded datasets are tracked in:
```
data/processed/license_ledger.json
```
**Format:**
```json
{
  "sources": [
    {
      "name": "wikipedia-en",
      "license": "cc-by-sa-3.0",
      "url": "https://dumps.wikimedia.org/enwiki/",
      "download_date": "2025-01-15",
      "size_gb": 20.5,
      "attribution": "Wikipedia contributors..."
    }
  ]
}
```
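As a rough illustration of how a record in this format could be written, here is a minimal sketch. The `LedgerEntry` class and `append_entry` function are hypothetical names chosen for this example; they are not taken from the NOVA codebase, and the real pipeline may record entries differently.

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path


@dataclass
class LedgerEntry:
    """One ledger record, mirroring the fields in the format above (hypothetical class)."""
    name: str
    license: str
    url: str
    download_date: str  # ISO date string, e.g. "2025-01-15"
    size_gb: float
    attribution: str


def append_entry(ledger_path: Path, entry: LedgerEntry) -> dict:
    """Add an entry to the ledger file, creating the file if it does not exist."""
    if ledger_path.exists():
        ledger = json.loads(ledger_path.read_text(encoding="utf-8"))
    else:
        ledger = {"sources": []}
    # Drop any record with the same name so a re-download does not create a duplicate.
    ledger["sources"] = [s for s in ledger["sources"] if s["name"] != entry.name]
    ledger["sources"].append(asdict(entry))
    ledger_path.parent.mkdir(parents=True, exist_ok=True)
    ledger_path.write_text(json.dumps(ledger, indent=2), encoding="utf-8")
    return ledger
```

Replacing by name rather than appending blindly is a design choice made for this sketch: it keeps one record per source, at the cost of losing the history of earlier downloads.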
### Verification
Before training, verify licenses:
```bash
python -m nova_data.pipeline verify_licenses
```
This checks that all data sources have approved licenses.
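The sketch below shows one way such a check could work. It is not the actual `nova_data.pipeline` implementation; the `APPROVED_LICENSES` set and the function names are assumptions for illustration, with identifiers modeled on the `"cc-by-sa-3.0"` style used in the ledger example.

```python
import json
from pathlib import Path

# Assumed allowlist, based on the licenses of the approved sources in this document.
APPROVED_LICENSES = {
    "public-domain", "cc0-1.0", "cc-by-4.0", "cc-by-sa-3.0",
    "odc-by", "mit", "apache-2.0", "bsd-3-clause",
}


def find_unapproved(ledger: dict) -> list[str]:
    """Return the names of sources whose license is missing or not on the allowlist."""
    problems = []
    for source in ledger.get("sources", []):
        license_id = str(source.get("license", "")).lower()
        if license_id not in APPROVED_LICENSES:
            problems.append(source.get("name", "<unnamed>"))
    return problems


def verify_ledger_file(path: Path) -> list[str]:
    """Load a ledger JSON file and return the names of unapproved sources."""
    return find_unapproved(json.loads(path.read_text(encoding="utf-8")))
```

An empty result means every source passed. A training script could refuse to start when the list is non-empty.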
---
## Attribution Requirements
### CC-BY Datasets
**Required:**
- Attribute the original creator
- Include license name
- Link to license
- Indicate if changes were made
**Our Attribution:**
All NOVA models trained on CC-BY data include:
> This model was trained on data including:
> - Wikipedia (CC-BY-SA 3.0)
> - [Other CC-BY sources]
>
> Full attributions in DATA_LICENSES.md
### Public Domain
**Required:** None (but we attribute anyway for transparency)
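The attribution notice could be generated from ledger entries rather than maintained by hand. The following is a minimal sketch under that assumption; `build_attribution_notice` is a hypothetical helper, not part of NOVA, and it lists every source it is given, including public domain ones.

```python
def build_attribution_notice(sources: list[dict]) -> str:
    """Render a Markdown blockquote listing each source and its license."""
    lines = ["> This model was trained on data including:"]
    # Sort by name so the notice is stable across runs.
    for source in sorted(sources, key=lambda s: s["name"]):
        lines.append(f"> - {source['name']} ({source['license'].upper()})")
    lines.append(">")
    lines.append("> Full attributions in DATA_LICENSES.md")
    return "\n".join(lines)
```

Each item in `sources` is expected to carry at least the `name` and `license` keys used in the ledger format.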
---
## Custom Datasets
### User-Provided Data
If training NOVA on your own data:
**Your Responsibility:**
- Ensure you have rights to use the data
- Verify any license requirements
- Add custom sources to ledger
**Example:**
```yaml
# configs/data/custom.yaml
sources:
  - name: my-custom-dataset
    license: mit  # or your license
    path: /path/to/data
    description: My custom training data
```
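Before adding a custom source to the ledger, it may help to check that the config entry is complete. The sketch below assumes the YAML has already been parsed into a dict (for example with PyYAML's `yaml.safe_load`). The required-field list and `validate_custom_sources` are assumptions for illustration, not NOVA's actual validation logic.

```python
# Fields this sketch treats as mandatory; "description" is treated as optional.
REQUIRED_FIELDS = ("name", "license", "path")


def validate_custom_sources(config: dict) -> list[str]:
    """Return one error message per missing or empty required field."""
    errors = []
    for index, source in enumerate(config.get("sources", [])):
        for field in REQUIRED_FIELDS:
            value = source.get(field)
            if value is None or not str(value).strip():
                errors.append(f"sources[{index}]: missing '{field}'")
    return errors


config = {
    "sources": [
        {
            "name": "my-custom-dataset",
            "license": "mit",
            "path": "/path/to/data",
            "description": "My custom training data",
        }
    ]
}
print(validate_custom_sources(config))  # prints []
```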
---
## Commercial Use Considerations
### NOVA Code
**License:** Apache 2.0
**Commercial Use:** ✅ Allowed
### Training Data
Commercial use depends on the dataset:
| Dataset | Commercial Use |
|---------|---------------|
| Wikipedia | ✅ Allowed (with attribution) |
| Project Gutenberg | ✅ Allowed (public domain) |
| OpenWebText | ✅ Allowed (CC0) |
| C4 | ✅ Allowed (ODC-BY, with attribution) |
| The Pile (ArXiv) | ⚠️ Verify per-subset |
**Recommendation:** Review each dataset's license for commercial projects.
---
## Excluded Sources
### Why We Don't Use Certain Data
**Common Crawl (raw):**
- Contains copyrighted material
- License status unclear for many pages
- We use filtered versions (C4) instead
**Social Media (Twitter, etc.):**
- Terms of Service restrictions
- Privacy concerns
- Unclear licensing
**Books3/LibGen:**
- Contains copyrighted books
- Legal issues
- Not permissively licensed
**YouTube Subtitles:**
- Copyright unclear
- TOS restrictions
---
## Compliance Checklist
Before training NOVA:
- [ ] All data sources listed in `license_ledger.json`
- [ ] Each source has verified license
- [ ] Licenses are permissive (CC-BY, MIT, Apache, public domain, etc.)
- [ ] Attribution prepared for CC-BY, CC-BY-SA, and ODC-BY sources
- [ ] No excluded sources used
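Parts of this checklist could be automated against the ledger. The sketch below is one possible approach, not NOVA's own tooling: the keyword list, the license prefixes, and `run_checklist` are all assumptions. Matching excluded sources by name substring is a crude heuristic that would still need human review.

```python
# License-id prefixes that this sketch treats as requiring attribution.
ATTRIBUTION_PREFIXES = ("cc-by", "odc-by")
# Name fragments for the excluded sources described in this document (assumed spellings).
EXCLUDED_KEYWORDS = ("books3", "libgen", "twitter", "youtube", "commoncrawl-raw")


def needs_attribution(source: dict) -> bool:
    """True if the source's license id starts with an attribution-requiring prefix."""
    return str(source.get("license", "")).lower().startswith(ATTRIBUTION_PREFIXES)


def run_checklist(ledger: dict) -> dict[str, bool]:
    """Evaluate the automatable checklist items; True means the item passes."""
    sources = ledger.get("sources", [])
    return {
        "sources_listed": bool(sources),
        "licenses_recorded": all(bool(s.get("license")) for s in sources),
        "attribution_prepared": all(
            bool(s.get("attribution")) for s in sources if needs_attribution(s)
        ),
        "no_excluded_sources": not any(
            keyword in str(s.get("name", "")).lower()
            for s in sources
            for keyword in EXCLUDED_KEYWORDS
        ),
    }
```

Whether a recorded license is actually permissive is left out here on purpose, since that judgment depends on the project's approved-license list.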
---
## Future Datasets
### Planned Additions
We're evaluating these sources:
- **BookCorpus:** Self-published books collected from Smashwords (pending license review)
- **Stack Exchange:** CC-BY-SA (with attribution)
- **OpenSubtitles:** Public domain/permissive subset
- **Code datasets:** GitHub permissive licenses (MIT, Apache, BSD)
**Criteria:**
- Clear, permissive license
- High quality
- Legally distributable
---
## Dataset Removal Requests
If you believe we've incorrectly listed a dataset:
1. Open an issue: [github.com/yourusername/nova/issues](https://github.com/yourusername/nova/issues)
2. Include:
- Dataset name
- License concern
- Supporting documentation
3. We'll review and respond within 7 days
---
## Legal Disclaimer
**This project aims for legal compliance, but:**
- We're not lawyers
- License interpretation may vary by jurisdiction
- Users are responsible for their own compliance
- Consult legal counsel for commercial use
**NOVA project provides this information for transparency, but makes no warranties about legal compliance.**
---
## References
### License Texts
- **CC-BY 4.0:** https://creativecommons.org/licenses/by/4.0/
- **CC-BY-SA 3.0:** https://creativecommons.org/licenses/by-sa/3.0/
- **CC0 1.0:** https://creativecommons.org/publicdomain/zero/1.0/
- **Apache 2.0:** https://www.apache.org/licenses/LICENSE-2.0
- **MIT:** https://opensource.org/licenses/MIT
- **ODC-BY:** https://opendatacommons.org/licenses/by/
### Resources
- Creative Commons: https://creativecommons.org/
- Open Data Commons: https://opendatacommons.org/
- OSI Licenses: https://opensource.org/licenses
---
**Last Updated:** 2025
**Document Version:** 1.0
**Review Frequency:** Quarterly