Initial commit: NOVA - Neuro-Optimizing Versatile Agent

Complete transformer LLM built from scratch with:

Core Features:
- Full transformer architecture (RoPE, RMSNorm, SwiGLU, KV-cache)
- SentencePiece tokenizer (BPE/Unigram)
- Training pipeline (AMP, gradient checkpointing, DDP)
- Persona system with personality matrix (NO AI disclosure by default)
- Genetic evolution (NOVA-EVO) for hyperparameter optimization
- Legal-only data pipeline with license tracking
- Chat interface (CLI + REST API)
- Conversation memory (SQLite)

Model Sizes:
- 125M, 350M, 1.3B, 3B parameters
- Local-first, runs on CPU or GPU
- Python 3.10.6+, PyTorch 2.0+

Personas:
- girlfriend_gentle (high warmth, high empathy)
- girlfriend_playful (high humor, high playfulness)
- girlfriend_supportive (balanced, default)

Documentation:
- Complete README with quickstart
- Model card with ethical considerations
- Privacy documentation (local-first, zero telemetry)
- Data licenses and attribution
- Contributing guide

Infrastructure:
- GitHub Actions CI/CD
- Comprehensive test suite
- Quickstart script
- CLI tool

License: Apache 2.0

🤖 Generated with Claude Code https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-12 20:56:37 -04:00
commit a7f091aa45
50 changed files with 6437 additions and 0 deletions
--- a/docs/DATA_LICENSES.md
+++ b/docs/DATA_LICENSES.md
@@ -0,0 +1,315 @@
+# Data Licenses and Attribution
+
+NOVA is committed to using **only legally licensed datasets** for training. This document tracks all approved data sources and their licenses.
+
+---
+
+## License Philosophy
+
+### What We Use
+
+✅ **Public Domain:** No restrictions
+✅ **CC0:** Public domain dedication
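+✅ **CC-BY:** Attribution required
+✅ **CC-BY-SA:** Attribution and share-alike required (used for Wikipedia)
+✅ **ODC-BY:** Attribution required (used for C4)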
+✅ **MIT/Apache/BSD:** Permissive open source
+
+### What We DON'T Use
+
+❌ **All Rights Reserved:** Copyrighted without permission
+❌ **CC-BY-NC:** Non-commercial restrictions
+❌ **CC-BY-ND:** No derivatives restrictions
+❌ **Unknown/Unlicensed:** No verified license
+❌ **Scraped Web Data:** Without license verification
+
+---
+
+## Approved Dataset Sources
+
+### 1. Wikipedia (English)
+
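+**License:** CC-BY-SA 3.0 (Wikipedia text moved to CC-BY-SA 4.0 in June 2023; check the license stated for the dump you download)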
+**URL:** https://dumps.wikimedia.org/
+**Size:** ~20 GB (compressed)
+**Language:** English
+**Description:** English Wikipedia articles
+
+**Attribution:**
+> Wikipedia contributors. English Wikipedia. Wikimedia Foundation. Licensed under CC-BY-SA 3.0.
+
+**Usage:** Text data for general knowledge
+
+---
+
+### 2. Project Gutenberg
+
+**License:** Public Domain
+**URL:** https://www.gutenberg.org/
+**Size:** ~15 GB
+**Language:** Primarily English
+**Description:** Public domain books (published before 1930 in the US, as of 2025)
+
+**Attribution:**
+> Project Gutenberg. Public domain literary works.
+
+**Usage:** Literary text, historical documents
+
+---
+
+### 3. OpenWebText
+
+**License:** CC0 1.0 (Public Domain Dedication) for the dataset as published; the underlying web pages keep their original copyright status
+**URL:** https://huggingface.co/datasets/Skylion007/openwebtext
+**Size:** ~38 GB
+**Language:** English
+**Description:** Open reproduction of OpenAI's WebText corpus, built from web pages linked in Reddit submissions
+
+**Attribution:**
+> OpenWebText dataset by Aaron Gokaslan and Vanya Cohen. CC0 1.0 Universal.
+
+**Usage:** Web-scraped text (Reddit-filtered)
+
+---
+
+### 4. C4 (Colossal Clean Crawled Corpus)
+
+**License:** ODC-BY (Open Data Commons Attribution)
+**URL:** https://huggingface.co/datasets/allenai/c4
+**Size:** ~300 GB (en subset)
+**Language:** English
+**Description:** Cleaned Common Crawl data
+
+**Attribution:**
+> C4 dataset from Google's T5 paper. ODC-BY license.
+
+**Usage:** Large-scale web text
+
+---
+
+### 5. The Pile - ArXiv Subset
+
+**License:** Various (chosen per paper by its authors; not uniformly permissive)
+**URL:** https://pile.eleuther.ai/
+**Size:** ~60 GB (ArXiv subset)
+**Language:** English
+**Description:** ArXiv papers (scientific articles)
+
+**Attribution:**
+> The Pile by EleutherAI. ArXiv papers subset.
+
+**Usage:** Scientific and technical text
+
+**Note:** Only use subsets with verified permissive licenses
+
+---
+
+## License Tracking System
+
+### Ledger File
+
+All downloaded datasets are tracked in:
+```
+data/processed/license_ledger.json
+```
+
+**Format:**
+```json
+{
+  "sources": [
+    {
+      "name": "wikipedia-en",
+      "license": "cc-by-sa-3.0",
+      "url": "https://dumps.wikimedia.org/enwiki/",
+      "download_date": "2025-01-15",
+      "size_gb": 20.5,
+      "attribution": "Wikipedia contributors..."
+    }
+  ]
+}
+```
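As a rough sketch of how an entry in this format might be recorded, the snippet below adds a source to an in-memory ledger and writes it to disk. The names `add_source`, `save_ledger`, and `REQUIRED_FIELDS` are hypothetical and are not taken from the NOVA codebase; only the field names come from the format above.

```python
import json
from pathlib import Path

# Field names copied from the ledger format above; treating all of them as
# mandatory is an assumption made for this sketch.
REQUIRED_FIELDS = ("name", "license", "url", "download_date", "size_gb", "attribution")


def add_source(ledger: dict, entry: dict) -> dict:
    """Return a new ledger with `entry` appended; the input is not modified."""
    missing = [field for field in REQUIRED_FIELDS if field not in entry]
    if missing:
        raise ValueError(f"ledger entry is missing fields: {missing}")

    sources = list(ledger.get("sources", []))
    # Refuse duplicates so each dataset has exactly one license record.
    if any(existing["name"] == entry["name"] for existing in sources):
        raise ValueError(f"source already recorded: {entry['name']}")

    sources.append(dict(entry))
    return {**ledger, "sources": sources}


def save_ledger(ledger: dict, path: Path) -> None:
    """Write the ledger as indented JSON, creating parent directories."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(ledger, indent=2) + "\n", encoding="utf-8")
```

A caller would typically load the existing file with `json.loads`, pass it through `add_source`, and then call `save_ledger(ledger, Path("data/processed/license_ledger.json"))`.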
+
+### Verification
+
+Before training, verify licenses:
+
+```bash
+python -m nova_data.pipeline verify_licenses
+```
+
+This checks that all data sources have approved licenses.
+
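The kind of check described here can be sketched in a few lines. This is not the actual `nova_data.pipeline` implementation, which is not shown in this document; `APPROVED_LICENSES`, `find_unapproved`, and `verify_ledger_file` are hypothetical names, and the license identifiers are assumed spellings modeled on the `cc-by-sa-3.0` value in the ledger example. Failing closed on a missing `license` field is a deliberate choice in this sketch, since an unlicensed source is treated the same as a rejected one.

```python
import json
from pathlib import Path

# Assumed allow-list built from the licenses this document approves or uses.
# The exact identifier spellings are illustrative.
APPROVED_LICENSES = {
    "public-domain",
    "cc0-1.0",
    "cc-by-4.0",
    "cc-by-sa-3.0",
    "odc-by",
    "mit",
    "apache-2.0",
    "bsd-3-clause",
}


def find_unapproved(ledger: dict) -> list[str]:
    """Return names of sources whose license is missing or not approved."""
    problems = []
    for source in ledger.get("sources", []):
        license_id = str(source.get("license", "")).strip().lower()
        if license_id not in APPROVED_LICENSES:
            problems.append(source.get("name", "<unnamed>"))
    return problems


def verify_ledger_file(path: Path) -> bool:
    """Load a ledger file, report offending sources, return True if clean."""
    ledger = json.loads(path.read_text(encoding="utf-8"))
    problems = find_unapproved(ledger)
    for name in problems:
        print(f"unapproved or missing license: {name}")
    return not problems
```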
+---
+
+## Attribution Requirements
+
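+### CC-BY and CC-BY-SA Datasets
+
+**Required:**
+- Attribute the original creator
+- Include license name
+- Link to license
+- Indicate if changes were made
+- For CC-BY-SA sources such as Wikipedia, release adapted material under the same or a compatible license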
+
+**Our Attribution:**
+
+All NOVA models trained on CC-BY data include:
+
+> This model was trained on data including:
+> - Wikipedia (CC-BY-SA 3.0)
+> - [Other CC-BY sources]
+>
+> Full attributions in DATA_LICENSES.md
+
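A notice like the one above could be generated from the ledger rather than maintained by hand. The following is a minimal sketch under that assumption; `build_attribution_notice` is a hypothetical helper, and it prints the ledger's raw `name` and `license` values instead of the display names used in the example notice.

```python
def build_attribution_notice(ledger: dict) -> str:
    """Build a plain-text attribution notice from ledger entries."""
    lines = ["This model was trained on data including:"]
    # Sort by name so the notice is stable across runs.
    for source in sorted(ledger.get("sources", []), key=lambda s: s["name"]):
        lines.append(f"- {source['name']} ({source['license']})")
    lines.append("")
    lines.append("Full attributions in DATA_LICENSES.md")
    return "\n".join(lines)
```

Generating the text from the ledger keeps the notice and the recorded licenses from drifting apart when a source is added or removed.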
+### Public Domain
+
+**Required:** None (but we attribute anyway for transparency)
+
+---
+
+## Custom Datasets
+
+### User-Provided Data
+
+If training NOVA on your own data:
+
+**Your Responsibility:**
+- Ensure you have rights to use the data
+- Verify any license requirements
+- Add custom sources to the ledger
+
+**Example:**
+```yaml
+# configs/data/custom.yaml
+sources:
+  - name: my-custom-dataset
+    license: mit  # or your license
+    path: /path/to/data
+    description: My custom training data
+```
+
+---
+
+## Commercial Use Considerations
+
+### NOVA Code
+
+**License:** Apache 2.0
+**Commercial Use:** ✅ Allowed
+
+### Training Data
+
+Commercial-use terms depend on the dataset:
+
+| Dataset | Commercial Use |
+|---------|---------------|
+| Wikipedia | ✅ Allowed (with attribution and share-alike) |
+| Project Gutenberg | ✅ Allowed (public domain) |
+| OpenWebText | ✅ Allowed (CC0) |
+| C4 | ✅ Allowed (ODC-BY, with attribution) |
+| The Pile (ArXiv) | ⚠️ Verify per-subset |
+
+**Recommendation:** Review each dataset's license for commercial projects.
+
+---
+
+## Excluded Sources
+
+### Why We Don't Use Certain Data
+
+**Common Crawl (raw):**
+- Contains copyrighted material
+- License status unclear for many pages
+- We use filtered versions (C4) instead
+
+**Social Media (Twitter, etc.):**
+- Terms of Service restrictions
+- Privacy concerns
+- Unclear licensing
+
+**Books3/LibGen:**
+- Contains copyrighted books
+- Legal issues
+- Not permissively licensed
+
+**YouTube Subtitles:**
+- Copyright unclear
+- TOS restrictions
+
+---
+
+## Compliance Checklist
+
+Before training NOVA:
+
+- [ ] All data sources listed in `license_ledger.json`
+- [ ] Each source has verified license
+- [ ] Licenses are permissive (CC-BY, MIT, Apache, public domain, etc.)
+- [ ] Attribution prepared for CC-BY, CC-BY-SA, and ODC-BY sources
+- [ ] No excluded sources used
+
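The "No excluded sources used" item can be partly automated with a name and URL scan of the ledger. The sketch below is a heuristic only: `EXCLUDED_KEYWORDS` and `find_excluded` are hypothetical, the keyword list is an assumption derived from the Excluded Sources section, and a substring match can neither prove a source is clean nor replace manual review.

```python
# Assumed keywords derived from the "Excluded Sources" section of this document.
EXCLUDED_KEYWORDS = ("commoncrawl", "common-crawl", "twitter", "books3", "libgen", "youtube")


def find_excluded(ledger: dict) -> list[str]:
    """Return names of ledger sources whose name or URL matches an excluded keyword."""
    flagged = []
    for source in ledger.get("sources", []):
        haystack = f"{source.get('name', '')} {source.get('url', '')}".lower()
        if any(keyword in haystack for keyword in EXCLUDED_KEYWORDS):
            flagged.append(source.get("name", "<unnamed>"))
    return flagged
```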
+---
+
+## Future Datasets
+
+### Planned Additions
+
+We're evaluating these sources:
+
+- **BookCorpus:** Self-published books (pending license review)
+- **Stack Exchange:** CC-BY-SA (with attribution)
+- **OpenSubtitles:** Public domain/permissive subset
+- **Code datasets:** GitHub permissive licenses (MIT, Apache, BSD)
+
+**Criteria:**
+- Clear, permissive license
+- High quality
+- Legally distributable
+
+---
+
+## Dataset Removal Requests
+
+If you believe we've incorrectly listed a dataset:
+
+1. Open an issue: [github.com/yourusername/nova/issues](https://github.com/yourusername/nova/issues)
+2. Include:
+   - Dataset name
+   - License concern
+   - Supporting documentation
+3. We'll review and respond within 7 days
+
+---
+
+## Legal Disclaimer
+
+**This project aims for legal compliance, but:**
+
+- We're not lawyers
+- License interpretation may vary by jurisdiction
+- Users are responsible for their own compliance
+- Consult legal counsel for commercial use
+
+**The NOVA project provides this information for transparency but makes no warranties about legal compliance.**
+
+---
+
+## References
+
+### License Texts
+
+- **CC-BY 4.0:** https://creativecommons.org/licenses/by/4.0/
+- **CC-BY-SA 3.0:** https://creativecommons.org/licenses/by-sa/3.0/
+- **CC0 1.0:** https://creativecommons.org/publicdomain/zero/1.0/
+- **Apache 2.0:** https://www.apache.org/licenses/LICENSE-2.0
+- **MIT:** https://opensource.org/licenses/MIT
+- **ODC-BY:** https://opendatacommons.org/licenses/by/
+
+### Resources
+
+- Creative Commons: https://creativecommons.org/
+- Open Data Commons: https://opendatacommons.org/
+- OSI Licenses: https://opensource.org/licenses
+
+---
+
+**Last Updated:** 2025
+**Document Version:** 1.0
+**Review Frequency:** Quarterly