Files

Dani a7f091aa45 Initial commit: NOVA - Neuro-Optimizing Versatile Agent

Complete transformer LLM built from scratch with:

Core Features:
- Full transformer architecture (RoPE, RMSNorm, SwiGLU, KV-cache)
- SentencePiece tokenizer (BPE/Unigram)
- Training pipeline (AMP, gradient checkpointing, DDP)
- Persona system with personality matrix (NO AI disclosure by default)
- Genetic evolution (NOVA-EVO) for hyperparameter optimization
- Legal-only data pipeline with license tracking
- Chat interface (CLI + REST API)
- Conversation memory (SQLite)

Model Sizes:
- 125M, 350M, 1.3B, 3B parameters
- Local-first, runs on CPU or GPU
- Python 3.10.6+, PyTorch 2.0+

Personas:
- girlfriend_gentle (high warmth, high empathy)
- girlfriend_playful (high humor, high playfulness)
- girlfriend_supportive (balanced, default)

Documentation:
- Complete README with quickstart
- Model card with ethical considerations
- Privacy documentation (local-first, zero telemetry)
- Data licenses and attribution
- Contributing guide

Infrastructure:
- GitHub Actions CI/CD
- Comprehensive test suite
- Quickstart script
- CLI tool

License: Apache 2.0

🤖 Generated with Claude Code
https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>

2025-10-12 20:56:37 -04:00

6.7 KiB

Raw Permalink Blame History

Data Licenses and Attribution

NOVA is committed to using only legally licensed datasets for training. This document tracks all approved data sources and their licenses.

License Philosophy

What We Use

✅ Public Domain: No restrictions ✅ CC0: Public domain dedication ✅ CC-BY: Attribution required ✅ MIT/Apache/BSD: Permissive open source

What We DON'T Use

❌ All Rights Reserved: Copyrighted without permission ❌ CC-BY-NC: Non-commercial restrictions ❌ CC-BY-ND: No derivatives restrictions ❌ Unknown/Unlicensed: No verified license ❌ Scraped Web Data: Without license verification

Approved Dataset Sources

1. Wikipedia (English)

License: CC-BY-SA 3.0 URL: https://dumps.wikimedia.org/ Size: ~20 GB (compressed) Language: English Description: English Wikipedia articles

Attribution:

Wikipedia contributors. English Wikipedia. Wikimedia Foundation. Licensed under CC-BY-SA 3.0.

Usage: Text data for general knowledge

2. Project Gutenberg

License: Public Domain URL: https://www.gutenberg.org/ Size: ~15 GB Language: Primarily English Description: Public domain books (pre-1928 in US)

Attribution:

Project Gutenberg. Public domain literary works.

Usage: Literary text, historical documents

3. OpenWebText

License: CC0 1.0 (Public Domain Dedication) URL: https://huggingface.co/datasets/Skylion007/openwebtext Size: ~38 GB Language: English Description: Open reproduction of WebText (Reddit links)

Attribution:

OpenWebText dataset by Aaron Gokaslan and Vanya Cohen. CC0 1.0 Universal.

Usage: Web-scraped text (Reddit-filtered)

4. C4 (Colossal Clean Crawled Corpus)

License: ODC-BY (Open Data Commons Attribution) URL: https://huggingface.co/datasets/c4 Size: ~300 GB (en subset) Language: English Description: Cleaned Common Crawl data

Attribution:

C4 dataset from Google's T5 paper. ODC-BY license.

Usage: Large-scale web text

5. The Pile - ArXiv Subset

License: Various (mostly permissive for ArXiv subset) URL: https://pile.eleuther.ai/ Size: ~60 GB (ArXiv subset) Language: English Description: ArXiv papers (scientific articles)

Attribution:

The Pile by EleutherAI. ArXiv papers subset.

Usage: Scientific and technical text

Note: Only use subsets with verified permissive licenses

License Tracking System

Ledger File

All downloaded datasets tracked in:

data/processed/license_ledger.json

Format:

{
  "sources": [
    {
      "name": "wikipedia-en",
      "license": "cc-by-sa-3.0",
      "url": "https://dumps.wikimedia.org/enwiki/",
      "download_date": "2025-01-15",
      "size_gb": 20.5,
      "attribution": "Wikipedia contributors..."
    }
  ]
}

Verification

Before training, verify licenses:

python -m nova_data.pipeline verify_licenses

This checks that all data sources have approved licenses.

Attribution Requirements

CC-BY Datasets

Required:

Attribute the original creator
Include license name
Link to license
Indicate if changes were made

Our Attribution:

All NOVA models trained on CC-BY data include:

This model was trained on data including:

Wikipedia (CC-BY-SA 3.0)

[Other CC-BY sources]

Full attributions in DATA_LICENSES.md

Public Domain

Required: None (but we attribute anyway for transparency)

Custom Datasets

User-Provided Data

If training NOVA on your own data:

Your Responsibility:

Ensure you have rights to use the data
Verify any license requirements
Add custom sources to ledger

Example:

# configs/data/custom.yaml
sources:
  - name: my-custom-dataset
    license: mit  # or your license
    path: /path/to/data
    description: My custom training data

Commercial Use Considerations

NOVA Code

License: Apache 2.0 Commercial Use: ✅ Allowed

Training Data

Depends on dataset:

Dataset	Commercial Use
Wikipedia	✅ Allowed (with attribution)
Project Gutenberg	✅ Allowed (public domain)
OpenWebText	✅ Allowed (CC0)
C4	✅ Allowed (ODC-BY, with attribution)
The Pile (ArXiv)	⚠️ Verify per-subset

Recommendation: Review each dataset's license for commercial projects.

Excluded Sources

Why We Don't Use Certain Data

Common Crawl (raw):

Contains copyrighted material
License status unclear for many pages
We use filtered versions (C4) instead

Social Media (Twitter, etc.):

Terms of Service restrictions
Privacy concerns
Unclear licensing

Books3/LibGen:

Contains copyrighted books
Legal issues
Not permissively licensed

YouTube Subtitles:

Copyright unclear
TOS restrictions

Compliance Checklist

Before training NOVA:

All data sources listed in license_ledger.json
Each source has verified license
Licenses are permissive (CC-BY, MIT, Apache, public domain, etc.)
Attribution prepared for CC-BY sources
No excluded sources used

Future Datasets

Planned Additions

We're evaluating these sources:

BookCorpus: Open domain books (pending license review)
Stack Exchange: CC-BY-SA (with attribution)
OpenSubtitles: Public domain/permissive subset
Code datasets: GitHub permissive licenses (MIT, Apache, BSD)

Criteria:

Clear, permissive license
High quality
Legally distributable

Dataset Removal Requests

If you believe we've incorrectly listed a dataset:

Open an issue: github.com/yourusername/nova/issues
Include:
- Dataset name
- License concern
- Supporting documentation
We'll review and respond within 7 days

Legal Disclaimer

This project aims for legal compliance, but:

We're not lawyers
License interpretation may vary by jurisdiction
Users are responsible for their own compliance
Consult legal counsel for commercial use

NOVA project provides this information for transparency, but makes no warranties about legal compliance.

References

License Texts

CC-BY 4.0: https://creativecommons.org/licenses/by/4.0/
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
Apache 2.0: https://www.apache.org/licenses/LICENSE-2.0
MIT: https://opensource.org/licenses/MIT
ODC-BY: https://opendatacommons.org/licenses/by/

Resources

Creative Commons: https://creativecommons.org/
Open Data Commons: https://opendatacommons.org/
OSI Licenses: https://opensource.org/licenses

Last Updated: 2025 Document Version: 1.0 Review Frequency: Quarterly

6.7 KiB Raw Permalink Blame History

Data Licenses and Attribution

License Philosophy

What We Use

What We DON'T Use

Approved Dataset Sources

1. Wikipedia (English)

2. Project Gutenberg

3. OpenWebText

4. C4 (Colossal Clean Crawled Corpus)

5. The Pile - ArXiv Subset

License Tracking System

Ledger File

Verification

Attribution Requirements

CC-BY Datasets

Public Domain

Custom Datasets

User-Provided Data

Commercial Use Considerations

NOVA Code

Training Data

Excluded Sources

Why We Don't Use Certain Data

Compliance Checklist

Future Datasets

Planned Additions

Dataset Removal Requests

Legal Disclaimer

References

License Texts

Resources

6.7 KiB

Raw Permalink Blame History