Initial commit: NOVA - Neuro-Optimizing Versatile Agent

Complete transformer LLM built from scratch with:

Core Features:
- Full transformer architecture (RoPE, RMSNorm, SwiGLU, KV-cache)
- SentencePiece tokenizer (BPE/Unigram)
- Training pipeline (AMP, gradient checkpointing, DDP)
- Persona system with personality matrix (NO AI disclosure by default)
- Genetic evolution (NOVA-EVO) for hyperparameter optimization
- Legal-only data pipeline with license tracking
- Chat interface (CLI + REST API)
- Conversation memory (SQLite)

Model Sizes:
- 125M, 350M, 1.3B, 3B parameters
- Local-first, runs on CPU or GPU
- Python 3.10.6+, PyTorch 2.0+

Personas:
- girlfriend_gentle (high warmth, high empathy)
- girlfriend_playful (high humor, high playfulness)
- girlfriend_supportive (balanced, default)

Documentation:
- Complete README with quickstart
- Model card with ethical considerations
- Privacy documentation (local-first, zero telemetry)
- Data licenses and attribution
- Contributing guide

Infrastructure:
- GitHub Actions CI/CD
- Comprehensive test suite
- Quickstart script
- CLI tool

License: Apache 2.0

🤖 Generated with Claude Code https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-12 20:56:37 -04:00
commit a7f091aa45
50 changed files with 6437 additions and 0 deletions
--- a/docs/CONTRIBUTING.md
+++ b/docs/CONTRIBUTING.md
@@ -0,0 +1,227 @@
+# Contributing to NOVA
+
+Thank you for your interest in contributing to NOVA! This document provides guidelines for contributing.
+
+---
+
+## How to Contribute
+
+### Reporting Issues
+
+**Bug Reports:**
+1. Check existing issues first
+2. Use the bug report template
+3. Include:
+   - Python version
+   - OS and hardware
+   - Steps to reproduce
+   - Expected vs actual behavior
+   - Error messages/logs
+
+**Feature Requests:**
+1. Check if already proposed
+2. Explain the use case
+3. Describe the desired behavior
+
+### Code Contributions
+
+**Setup Development Environment:**
+
+```bash
+# Fork and clone
+git clone https://github.com/yourusername/nova.git
+cd nova
+
+# Create venv
+python -m venv venv
+source venv/bin/activate  # Windows: venv\Scripts\activate
+
+# Install dev dependencies
+pip install -r requirements.txt
+pip install -e ".[dev]"  # quoted so shells such as zsh do not expand the brackets
+```
+
+**Before Submitting:**
+
+1. **Run Tests:**
+   ```bash
+   pytest tests/ -v
+   ```
+
+2. **Lint Code:**
+   ```bash
+   ruff check .
+   black --check .
+   ```
+
+3. **Format Code:**
+   ```bash
+   black nova_core/ nova_tokenizer/ nova_train/ nova_evo/ nova_chat/
+   ```
+
+4. **Type Check (optional but recommended):**
+   ```bash
+   mypy nova_core/ --ignore-missing-imports
+   ```
+
+### Pull Request Process
+
+1. **Branch Naming:**
+   - `feature/description` for new features
+   - `fix/description` for bug fixes
+   - `docs/description` for documentation
+
+2. **Commit Messages:**
+   - Clear, descriptive messages
+   - Reference issues: `Fix #123: Description`
+
+3. **PR Description:**
+   - What changed
+   - Why the change
+   - Testing performed
+   - Screenshots (if UI changes)
+
+4. **Review Process:**
+   - CI must pass
+   - At least one approval required
+   - Address review feedback
+
+---
+
+## Development Guidelines
+
+### Code Style
+
+**Python:**
+- Follow PEP 8
+- Use Black formatter (line length 100)
+- Type hints encouraged
+- Docstrings for public APIs
+
+**Example:**
+```python
+def example_function(param: str, optional: int = 0) -> bool:
+    """
+    Brief description.
+
+    Args:
+        param: Description
+        optional: Description (default: 0)
+
+    Returns:
+        Description
+    """
+    return True
+```
+
+### Testing
+
+**Write Tests For:**
+- New features
+- Bug fixes
+- Public APIs
+
+**Test Locations:**
+- `tests/test_core.py` - Core transformer
+- `tests/test_tokenizer.py` - Tokenizer
+- `tests/test_persona.py` - Persona system
+- `tests/test_<module>.py` - Other modules
+
+**Run Tests:**
+```bash
+# All tests
+pytest
+
+# Specific file
+pytest tests/test_core.py
+
+# With coverage
+pytest --cov=nova_core
+```
+
+### Documentation
+
+**Update Docs For:**
+- API changes
+- New features
+- Configuration options
+
+**Documentation Files:**
+- `README.md` - Main documentation
+- `docs/MODEL_CARD.md` - Model information
+- `docs/PRIVACY_LOCAL.md` - Privacy details
+- `docs/DATA_LICENSES.md` - Data licensing
+
+---
+
+## Contribution Areas
+
+### High Priority
+
+- **Pre-trained Models:** Training and releasing checkpoints
+- **Export Tools:** GGUF converter, quantization improvements
+- **Evaluation Suite:** Comprehensive benchmarks
+- **Dataset Downloaders:** Legal dataset acquisition scripts
+
+### Medium Priority
+
+- **LoRA Support:** Fine-tuning with adapters
+- **Multi-language:** Support for non-English
+- **Performance:** Optimization improvements
+- **Tests:** Increase coverage
+
+### Documentation
+
+- **Tutorials:** Step-by-step guides
+- **Examples:** Real-world use cases
+- **API Docs:** Complete API documentation
+- **Architecture:** Deep-dive technical docs
+
+---
+
+## License
+
+By contributing, you agree that your contributions will be licensed under Apache License 2.0.
+
+---
+
+## Code of Conduct
+
+### Our Pledge
+
+- Be respectful and inclusive
+- Welcome newcomers
+- Focus on constructive feedback
+- Assume good intentions
+
+### Unacceptable Behavior
+
+- Harassment or discrimination
+- Trolling or insulting comments
+- Publishing others' private information
+- Other unprofessional conduct
+
+### Enforcement
+
+Violations can be reported to project maintainers. All complaints will be reviewed and investigated.
+
+---
+
+## Questions?
+
+- **Discussions:** GitHub Discussions
+- **Issues:** GitHub Issues
+- **General:** Open an issue with the "question" label
+
+---
+
+## Recognition
+
+Contributors will be:
+- Listed in CONTRIBUTORS.md
+- Mentioned in release notes
+- Credited for significant features
+
+---
+
+Thank you for contributing to NOVA! 🌟
--- a/docs/DATA_LICENSES.md
+++ b/docs/DATA_LICENSES.md
@@ -0,0 +1,315 @@
+# Data Licenses and Attribution
+
+NOVA is committed to using **only legally licensed datasets** for training. This document tracks all approved data sources and their licenses.
+
+---
+
+## License Philosophy
+
+### What We Use
+
+✅ **Public Domain:** No restrictions
+✅ **CC0:** Public domain dedication
+✅ **CC-BY:** Attribution required
+✅ **MIT/Apache/BSD:** Permissive open source
+
+### What We DON'T Use
+
+❌ **All Rights Reserved:** Copyrighted without permission
+❌ **CC-BY-NC:** Non-commercial restrictions
+❌ **CC-BY-ND:** No derivatives restrictions
+❌ **Unknown/Unlicensed:** No verified license
+❌ **Scraped Web Data:** Without license verification
+
+---
+
+## Approved Dataset Sources
+
+### 1. Wikipedia (English)
+
+**License:** CC-BY-SA 3.0
+**URL:** https://dumps.wikimedia.org/
+**Size:** ~20 GB (compressed)
+**Language:** English
+**Description:** English Wikipedia articles
+
+**Attribution:**
+> Wikipedia contributors. English Wikipedia. Wikimedia Foundation. Licensed under CC-BY-SA 3.0.
+
+**Usage:** Text data for general knowledge
+
+---
+
+### 2. Project Gutenberg
+
+**License:** Public Domain
+**URL:** https://www.gutenberg.org/
+**Size:** ~15 GB
+**Language:** Primarily English
+**Description:** Public domain books (in the US, works published before 1930 as of 2025)
+
+**Attribution:**
+> Project Gutenberg. Public domain literary works.
+
+**Usage:** Literary text, historical documents
+
+---
+
+### 3. OpenWebText
+
+**License:** CC0 1.0 (Public Domain Dedication)
+**URL:** https://huggingface.co/datasets/Skylion007/openwebtext
+**Size:** ~38 GB
+**Language:** English
+**Description:** Open reproduction of OpenAI's WebText corpus (web pages linked from Reddit)
+
+**Attribution:**
+> OpenWebText dataset by Aaron Gokaslan and Vanya Cohen. CC0 1.0 Universal.
+
+**Usage:** Web-scraped text (Reddit-filtered)
+
+---
+
+### 4. C4 (Colossal Clean Crawled Corpus)
+
+**License:** ODC-BY (Open Data Commons Attribution)
+**URL:** https://huggingface.co/datasets/allenai/c4
+**Size:** ~300 GB (en subset)
+**Language:** English
+**Description:** Cleaned Common Crawl data
+
+**Attribution:**
+> C4 dataset from Google's T5 paper. ODC-BY license.
+
+**Usage:** Large-scale web text
+
+---
+
+### 5. The Pile - ArXiv Subset
+
+**License:** Various (mostly permissive for ArXiv subset)
+**URL:** https://pile.eleuther.ai/
+**Size:** ~60 GB (ArXiv subset)
+**Language:** English
+**Description:** ArXiv papers (scientific articles)
+
+**Attribution:**
+> The Pile by EleutherAI. ArXiv papers subset.
+
+**Usage:** Scientific and technical text
+
+**Note:** Only use subsets with verified permissive licenses
+
+---
+
+## License Tracking System
+
+### Ledger File
+
+All downloaded datasets are tracked in:
+```
+data/processed/license_ledger.json
+```
+
+**Format:**
+```json
+{
+  "sources": [
+    {
+      "name": "wikipedia-en",
+      "license": "cc-by-sa-3.0",
+      "url": "https://dumps.wikimedia.org/enwiki/",
+      "download_date": "2025-01-15",
+      "size_gb": 20.5,
+      "attribution": "Wikipedia contributors..."
+    }
+  ]
+}
+```
+
+### Verification
+
+Before training, verify licenses:
+
+```bash
+python -m nova_data.pipeline verify_licenses
+```
+
+This checks that all data sources have approved licenses.
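+
+The internals of this command are not shown here. As a minimal sketch of the kind of check it might perform, assuming the ledger format above, the snippet below flags every source whose license is not on an allow-list. The function name `find_unapproved_sources` and the identifiers in `APPROVED_LICENSES` are illustrative assumptions, not NOVA's actual API.
+
+```python
+import json
+
+# Assumed allow-list, loosely mirroring the approved licenses in this document.
+APPROVED_LICENSES = {
+    "public-domain", "cc0-1.0", "cc-by-4.0", "cc-by-sa-3.0",
+    "odc-by", "mit", "apache-2.0", "bsd-3-clause",
+}
+
+
+def find_unapproved_sources(ledger: dict) -> list[str]:
+    """Return names of ledger entries whose license is missing or not approved."""
+    unapproved = []
+    for source in ledger.get("sources", []):
+        license_id = str(source.get("license", "")).lower()
+        if license_id not in APPROVED_LICENSES:
+            unapproved.append(source.get("name", "<unnamed>"))
+    return unapproved
+
+
+# Hypothetical ledger with one approved and one unapproved entry.
+ledger = json.loads("""
+{
+  "sources": [
+    {"name": "wikipedia-en", "license": "cc-by-sa-3.0"},
+    {"name": "mystery-crawl", "license": "unknown"}
+  ]
+}
+""")
+print(find_unapproved_sources(ledger))  # ['mystery-crawl']
+```
+
+An empty result would mean every listed source passed. A real check would also need to confirm that every dataset on disk appears in the ledger at all.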
+
+---
+
+## Attribution Requirements
+
+### CC-BY Datasets
+
+**Required:**
+- Attribute the original creator
+- Include license name
+- Link to license
+- Indicate if changes were made
+
+**Our Attribution:**
+
+All NOVA models trained on CC-BY data include:
+
+> This model was trained on data including:
+> - Wikipedia (CC-BY-SA 3.0)
+> - [Other CC-BY sources]
+>
+> Full attributions in DATA_LICENSES.md
+
+### Public Domain
+
+**Required:** None (but we attribute anyway for transparency)
+
+---
+
+## Custom Datasets
+
+### User-Provided Data
+
+If training NOVA on your own data:
+
+**Your Responsibility:**
+- Ensure you have rights to use the data
+- Verify any license requirements
+- Add custom sources to ledger
+
+**Example:**
+```yaml
+# configs/data/custom.yaml
+sources:
+  - name: my-custom-dataset
+    license: mit  # or your license
+    path: /path/to/data
+    description: My custom training data
+```
+
+---
+
+## Commercial Use Considerations
+
+### NOVA Code
+
+**License:** Apache 2.0
+**Commercial Use:** ✅ Allowed
+
+### Training Data
+
+Depends on dataset:
+
+| Dataset | Commercial Use |
+|---------|---------------|
+| Wikipedia | ✅ Allowed (with attribution; share-alike terms apply) |
+| Project Gutenberg | ✅ Allowed (public domain) |
+| OpenWebText | ✅ Allowed (CC0) |
+| C4 | ✅ Allowed (ODC-BY, with attribution) |
+| The Pile (ArXiv) | ⚠️ Verify per-subset |
+
+**Recommendation:** Review each dataset's license for commercial projects.
+
+---
+
+## Excluded Sources
+
+### Why We Don't Use Certain Data
+
+**Common Crawl (raw):**
+- Contains copyrighted material
+- License status unclear for many pages
+- We use filtered versions (C4) instead
+
+**Social Media (Twitter, etc.):**
+- Terms of Service restrictions
+- Privacy concerns
+- Unclear licensing
+
+**Books3/LibGen:**
+- Contains copyrighted books
+- Legal issues
+- Not permissively licensed
+
+**YouTube Subtitles:**
+- Copyright unclear
+- TOS restrictions
+
+---
+
+## Compliance Checklist
+
+Before training NOVA:
+
+- [ ] All data sources listed in `license_ledger.json`
+- [ ] Each source has verified license
+- [ ] Licenses are permissive (CC-BY, MIT, Apache, public domain, etc.)
+- [ ] Attribution prepared for CC-BY sources
+- [ ] No excluded sources used
+
+---
+
+## Future Datasets
+
+### Planned Additions
+
+We're evaluating these sources:
+
+- **BookCorpus:** Open domain books (pending license review)
+- **Stack Exchange:** CC-BY-SA (with attribution)
+- **OpenSubtitles:** Public domain/permissive subset
+- **Code datasets:** GitHub permissive licenses (MIT, Apache, BSD)
+
+**Criteria:**
+- Clear, permissive license
+- High quality
+- Legally distributable
+
+---
+
+## Dataset Removal Requests
+
+If you believe we've incorrectly listed a dataset:
+
+1. Open an issue: [github.com/yourusername/nova/issues](https://github.com/yourusername/nova/issues)
+2. Include:
+   - Dataset name
+   - License concern
+   - Supporting documentation
+3. We'll review and respond within 7 days
+
+---
+
+## Legal Disclaimer
+
+**This project aims for legal compliance, but:**
+
+- We're not lawyers
+- License interpretation may vary by jurisdiction
+- Users are responsible for their own compliance
+- Consult legal counsel for commercial use
+
+**NOVA project provides this information for transparency, but makes no warranties about legal compliance.**
+
+---
+
+## References
+
+### License Texts
+
+- **CC-BY 4.0:** https://creativecommons.org/licenses/by/4.0/
+- **CC0 1.0:** https://creativecommons.org/publicdomain/zero/1.0/
+- **Apache 2.0:** https://www.apache.org/licenses/LICENSE-2.0
+- **MIT:** https://opensource.org/licenses/MIT
+- **ODC-BY:** https://opendatacommons.org/licenses/by/
+
+### Resources
+
+- Creative Commons: https://creativecommons.org/
+- Open Data Commons: https://opendatacommons.org/
+- OSI Licenses: https://opensource.org/licenses
+
+---
+
+**Last Updated:** 2025
+**Document Version:** 1.0
+**Review Frequency:** Quarterly
--- a/docs/MODEL_CARD.md
+++ b/docs/MODEL_CARD.md
@@ -0,0 +1,232 @@
+# NOVA Model Card
+
+## Model Details
+
+**Name:** NOVA (Neuro-Optimizing Versatile Agent)
+**Version:** 0.1.0
+**Date:** 2025
+**License:** Apache 2.0
+**Type:** Decoder-only transformer language model
+
+### Model Sizes
+
+NOVA comes in four sizes:
+
+| Size | Parameters | Layers | Hidden Size | Attention Heads | Context Length |
+|------|-----------|--------|-------------|-----------------|----------------|
+| 125M | 125M      | 12     | 768         | 12              | 2048           |
+| 350M | 350M      | 24     | 1024        | 16              | 2048           |
+| 1.3B | 1.3B      | 24     | 2048        | 32 (8 KV)       | 2048           |
+| 3B   | 3B        | 32     | 2560        | 32 (8 KV)       | 4096           |
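+
+The "32 (8 KV)" entries indicate grouped-query attention: 32 query heads share 8 key/value heads, which shrinks the KV-cache. The sketch below estimates the cache size for the 1.3B row. It assumes a head dimension of `hidden / heads`, 2-byte (fp16) values, and a single sequence; none of these are stated in the table, and `kv_cache_bytes` is a hypothetical helper.
+
+```python
+def kv_cache_bytes(layers: int, hidden: int, heads: int, kv_heads: int,
+                   context: int, bytes_per_value: int = 2) -> int:
+    """Estimate KV-cache size in bytes for one sequence at full context."""
+    head_dim = hidden // heads  # assumed: hidden size split evenly across query heads
+    # Factor of 2 covers keys and values, stored per layer and per KV head.
+    return 2 * layers * kv_heads * head_dim * context * bytes_per_value
+
+
+# 1.3B row: 24 layers, hidden 2048, 32 heads, context 2048.
+gqa = kv_cache_bytes(24, 2048, 32, kv_heads=8, context=2048)
+mha = kv_cache_bytes(24, 2048, 32, kv_heads=32, context=2048)
+print(gqa // 2**20, mha // 2**20)  # 96 384 (MiB)
+```
+
+Under these assumptions, 8 KV heads cut the cache to a quarter of what 32 KV heads would need.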
+
+### Architecture
+
+- **Positional Encoding:** RoPE (Rotary Position Embedding)
+- **Normalization:** RMSNorm (default) or LayerNorm
+- **Activation:** SwiGLU (default), GeGLU, or GELU
+- **Attention:** Multi-head with optional grouped-query attention (GQA)
+- **Features:** KV-cache, gradient checkpointing, Flash Attention support
+
+## Intended Use
+
+### Primary Use Cases
+
+- **Personal companion AI:** Conversational agent with customizable personas
+- **Local inference:** Privacy-focused applications on consumer hardware
+- **Research:** Transformer architecture experimentation
+- **Education:** Learning about modern LLM implementation
+
+### Out of Scope
+
+- **Production deployment without safety measures:** Additional content filtering recommended
+- **High-stakes decisions:** Not suitable for medical, legal, or financial advice
+- **Scalable services:** Designed for local/personal use, not cloud deployment
+
+## Training Data
+
+NOVA uses **only legally licensed datasets**:
+
+### Approved Sources
+
+- **Public Domain:** Project Gutenberg books
+- **CC0 and attribution licenses:** OpenWebText (CC0), Wikipedia (CC-BY-SA), C4 corpus (ODC-BY)
+- **Open Licensed:** The Pile (ArXiv), OSI-approved code datasets
+
+### License Tracking
+
+All training data sources are logged in `license_ledger.json` with:
+- Source name and URL
+- License type
+- Download date
+- Data provenance
+
+### Exclusions
+
+- No scraped data without verified licenses
+- No copyrighted material used without a permissive license
+- No personally identifiable information (PII)
+- No user data without explicit consent
+
+## Training Procedure
+
+### Hyperparameters
+
+Default training configuration (125M):
+
+```yaml
+batch_size: 8
+gradient_accumulation: 4
+learning_rate: 3e-4
+weight_decay: 0.1
+warmup_steps: 1000
+max_steps: 100000
+optimizer: AdamW
+lr_schedule: cosine with warmup
+```
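+
+To make `lr_schedule: cosine with warmup` concrete, here is a minimal sketch of one common form: linear warmup to `learning_rate`, then cosine decay. The linear warmup shape, the decay floor `min_lr = 0.0`, and the function name `lr_at_step` are assumptions; NOVA's training code may differ.
+
+```python
+import math
+
+
+def lr_at_step(step: int, base_lr: float = 3e-4, warmup_steps: int = 1000,
+               max_steps: int = 100_000, min_lr: float = 0.0) -> float:
+    """Learning rate at a 0-indexed optimizer step."""
+    if step < warmup_steps:
+        # Linear warmup: reaches base_lr on the last warmup step.
+        return base_lr * (step + 1) / warmup_steps
+    # Cosine decay from base_lr down to min_lr over the remaining steps.
+    progress = min(1.0, (step - warmup_steps) / max(1, max_steps - warmup_steps))
+    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
+    return min_lr + (base_lr - min_lr) * cosine
+
+
+for step in (0, 999, 50_500, 100_000):
+    print(step, lr_at_step(step))
+```
+
+With `batch_size: 8` and `gradient_accumulation: 4`, each optimizer step sees 8 × 4 = 32 sequences per device.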
+
+### Hardware
+
+- **Minimum:** CPU (4+ cores), 8GB RAM
+- **Recommended:** NVIDIA GPU (8GB+ VRAM), 16GB+ RAM
+- **Optimal:** NVIDIA GPU (24GB+ VRAM), 32GB+ RAM
+
+### Optimizations
+
+- **Mixed Precision:** AMP (Automatic Mixed Precision) on GPU
+- **Gradient Checkpointing:** Reduces memory usage
+- **Distributed Training:** DDP (DistributedDataParallel) support
+
+## Evaluation
+
+### Metrics
+
+- **Perplexity:** Language modeling quality
+- **Latency:** Inference speed (tokens/second)
+- **Memory:** Peak RAM/VRAM usage
+- **Persona Adherence:** Style consistency with selected persona
+
+### Benchmarks
+
+(To be added as pre-trained models become available)
+
+## Persona System
+
+### Design Philosophy
+
+NOVA includes a **personality matrix** system for controllable conversational style:
+
+- **No AI Disclosure by Default:** `always_disclose: false`
+- **Private Use Context:** Designed for personal, local deployment
+- **Customizable:** Users can create custom personas
+
+### Personality Traits
+
+Eight traits (0.0-1.0) that modulate generation:
+
+1. Warmth
+2. Humor
+3. Empathy
+4. Decisiveness
+5. Creativity
+6. Intimacy
+7. Playfulness
+8. Formality
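+
+How the traits modulate generation is not described here, but the value range can be illustrated. The sketch below models the eight traits as a validated dataclass. The class name `PersonalityMatrix`, the 0.5 defaults, and the 0.9 values for a "gentle" persona are illustrative assumptions, not NOVA's actual implementation.
+
+```python
+from dataclasses import dataclass, fields
+
+
+@dataclass(frozen=True)
+class PersonalityMatrix:
+    """Eight persona traits, each constrained to the range 0.0-1.0."""
+    warmth: float = 0.5
+    humor: float = 0.5
+    empathy: float = 0.5
+    decisiveness: float = 0.5
+    creativity: float = 0.5
+    intimacy: float = 0.5
+    playfulness: float = 0.5
+    formality: float = 0.5
+
+    def __post_init__(self) -> None:
+        # Reject any trait outside the documented 0.0-1.0 range.
+        for field in fields(self):
+            value = getattr(self, field.name)
+            if not 0.0 <= value <= 1.0:
+                raise ValueError(f"{field.name} must be in [0.0, 1.0], got {value}")
+
+
+# Hypothetical "high warmth, high empathy" persona.
+gentle = PersonalityMatrix(warmth=0.9, empathy=0.9)
+print(gentle)
+```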
+
+### Default Personas
+
+- **girlfriend_gentle:** High warmth, high empathy
+- **girlfriend_playful:** High humor, high playfulness
+- **girlfriend_supportive:** Balanced traits (default)
+
+## Ethical Considerations
+
+### Privacy
+
+- **Local-First:** All processing on-device
+- **No Telemetry:** Zero data collection
+- **User Control:** Complete control over data and models
+
+### Bias and Fairness
+
+- **Training Data Bias:** Inherits biases from source datasets
+- **Mitigation:** Use diverse, openly licensed sources
+- **Ongoing Work:** Bias evaluation and mitigation strategies
+
+### Content Safety
+
+- **Basic Filters:** Profanity and unsafe content detection
+- **Limitations:** Not a complete safety solution
+- **Recommendation:** Additional filtering for public-facing use
+
+### AI Disclosure
+
+- **Configurable:** `always_disclose` setting in persona config
+- **Default:** False (for private, personal use)
+- **Recommendation:** Enable for any public or shared deployment
+
+## Limitations
+
+### Technical
+
+- **Small Context:** 2048-4096 tokens (not suitable for long documents)
+- **Model Size:** Smaller models generally produce lower-quality output than larger LLMs
+- **Hallucination:** May generate factually incorrect information
+
+### Use Case
+
+- **Not a knowledge base:** May not have up-to-date information
+- **Not a specialist:** General-purpose, not domain-specific
+- **Not production-ready (as-is):** Requires additional safety/filtering
+
+## Evolutionary Algorithm (NOVA-EVO)
+
+### Purpose
+
+Optional genetic algorithm for automatic configuration optimization:
+
+- **Hyperparameter Search:** Learning rate, batch size, warmup
+- **Architecture Search:** Activation, normalization, positional encoding
+- **Multi-Objective:** Optimizes loss, latency, memory simultaneously
+
+### Fitness Metrics
+
+- **Loss/Perplexity:** (50% weight)
+- **Latency:** (20% weight)
+- **Memory:** (20% weight)
+- **Quality:** (10% weight)
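+
+One plausible reading of these weights is a weighted sum over per-objective scores. The sketch below assumes each score is already normalized to 0.0-1.0 with higher meaning better (so low loss, latency, and memory map to high scores). That normalization and the `fitness` function name are assumptions; NOVA-EVO's actual aggregation may differ.
+
+```python
+# Weights taken from the list above.
+FITNESS_WEIGHTS = {"loss": 0.5, "latency": 0.2, "memory": 0.2, "quality": 0.1}
+
+
+def fitness(scores: dict[str, float]) -> float:
+    """Weighted sum of normalized scores (assumed 0.0-1.0, higher is better)."""
+    missing = FITNESS_WEIGHTS.keys() - scores.keys()
+    if missing:
+        raise KeyError(f"missing scores: {sorted(missing)}")
+    return sum(FITNESS_WEIGHTS[name] * scores[name] for name in FITNESS_WEIGHTS)
+
+
+# Hypothetical candidate: strong loss score, middling speed and memory.
+candidate = {"loss": 0.8, "latency": 0.5, "memory": 0.5, "quality": 1.0}
+print(round(fitness(candidate), 3))  # 0.7
+```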
+
+### Compute Budget
+
+- **Small:** 20 individuals, 10 generations (~6-12 hours)
+- **Medium:** 40 individuals, 20 generations (~24-48 hours)
+- **Large:** 100 individuals, 50 generations (~1-2 weeks)
+
+## Contact
+
+For questions, issues, or contributions:
+
+- **GitHub:** [github.com/yourusername/nova](https://github.com/yourusername/nova)
+- **Issues:** [github.com/yourusername/nova/issues](https://github.com/yourusername/nova/issues)
+
+## Citation
+
+```bibtex
+@software{nova2025,
+  title={NOVA: Neuro-Optimizing Versatile Agent},
+  author={NOVA Project Contributors},
+  year={2025},
+  url={https://github.com/yourusername/nova},
+  license={Apache-2.0}
+}
+```
+
+## Acknowledgments
+
+- Transformer architecture inspired by GPT, LLaMA, and modern LLM research
+- RoPE, RMSNorm, and SwiGLU from recent papers (Su et al.; Zhang and Sennrich; Shazeer)
+- Open source community for datasets and tools
+
+---
+
+**Last Updated:** 2025
+**Model Card Version:** 1.0
--- a/docs/PRIVACY_LOCAL.md
+++ b/docs/PRIVACY_LOCAL.md
@@ -0,0 +1,330 @@
+# Privacy and Local Use
+
+## NOVA Privacy Statement
+
+NOVA is designed as a **local-first, privacy-focused** language model. This document explains how NOVA handles your data.
+
+---
+
+## Core Principles
+
+### 1. Local-First
+
+**Everything runs on your device.**
+
+- Model inference happens locally
+- Training data stays on your machine
+- No cloud dependencies
+- No internet required after initial setup (except for optional dataset and pre-trained weight downloads)
+
+### 2. Zero Telemetry
+
+**NOVA collects zero data.**
+
+- No usage tracking
+- No error reporting
+- No analytics
+- No phone-home functionality
+
+### 3. Complete User Control
+
+**You own everything.**
+
+- Your conversations
+- Your trained models
+- Your custom personas
+- Your data
+
+---
+
+## Data Storage
+
+### Where Your Data Lives
+
+```
+nova/                            # Project root (wherever you cloned NOVA)
+├── memory.db                    # Your conversation history (SQLite)
+├── checkpoints/                 # Your trained models
+├── data/                        # Your training data
+└── configs/persona/             # Your custom personas
+```
+
+**All on your device. Never uploaded.**
+
+### Conversation Memory
+
+- **Location:** `memory.db` (SQLite database)
+- **Contents:** Your chat history
+- **Encryption:** Not encrypted by default (it's local)
+- **Deletion:** Delete `memory.db` file to erase all history
+- **Recommendation:** Encrypt your drive if sharing the device
+
+### Model Checkpoints
+
+- **Location:** `checkpoints/` directory
+- **Contents:** Model weights and training state
+- **Sharing:** Safe to share when trained only on public data; weights can memorize training text, so treat checkpoints trained on private data as private
+
+---
+
+## Network Activity
+
+### When NOVA Uses the Internet
+
+NOVA **only** uses the internet for:
+
+1. **Dataset Downloads:** Downloading legal training datasets (opt-in)
+2. **Optional:** Downloading pre-trained weights (if available)
+
+### When NOVA Does NOT Use Internet
+
+- **Chat inference:** 100% offline
+- **Model training:** 100% offline
+- **Persona customization:** 100% offline
+- **Evolution (NOVA-EVO):** 100% offline
+
+### Firewall Safety
+
+NOVA is safe to run behind a firewall with no internet access (after initial setup).
+
+---
+
+## AI Disclosure Setting
+
+### `always_disclose` Flag
+
+NOVA personas have an `always_disclose` setting:
+
+```yaml
+always_disclose: false  # Default
+```
+
+**What this means:**
+
+- `false` (default): NOVA does NOT disclose being AI
+  - Designed for **private, personal use**
+  - Appropriate for local companion scenarios
+
+- `true`: NOVA includes AI disclosure text
+  - Recommended for **shared or public use**
+  - Adds transparency about AI nature
+
+### When to Enable Disclosure
+
+✅ **Enable `always_disclose: true` if:**
+- Sharing NOVA with others
+- Deploying publicly (e.g., website, app)
+- Any scenario where users might not know it's AI
+
+❌ **Keep `always_disclose: false` if:**
+- Personal, private use on your own device
+- You're fully aware it's a language model
+- Testing/development
+
+**Default:** False (personal use assumption)
+
+---
+
+## Persona System Privacy
+
+### Personality Matrix
+
+The personality matrix (warmth, humor, empathy, etc.) is:
+
+- **Stored:** In persona YAML files
+- **Processed:** Locally during generation
+- **Shared:** Never (unless you share the files)
+
+### Custom Personas
+
+Your custom persona configurations:
+
+- **Location:** `configs/persona/` directory
+- **Format:** YAML (human-readable text)
+- **Privacy:** Stored locally, never transmitted
+
+---
+
+## Training Data Privacy
+
+### Legal Data Only
+
+NOVA enforces **legal-only datasets**:
+
+- Public domain sources
+- Openly licensed datasets (CC0, CC-BY, MIT, Apache)
+- License tracking in `license_ledger.json`
+
+**No private data scraping.**
+
+### Your Own Data
+
+If you train NOVA on your own data:
+
+- **Stays local:** Never leaves your device
+- **Your responsibility:** Ensure you have rights to use it
+- **Recommendation:** Don't train on sensitive/private data you don't want in the model
+
+---
+
+## Security Considerations
+
+### Running NOVA Safely
+
+✅ **Do:**
+- Run on a trusted device
+- Keep your OS and Python dependencies updated
+- Use filesystem encryption if device is shared
+- Review code before running (it's open source!)
+
+⚠️ **Don't:**
+- Expose the REST API to the internet without authentication
+- Train on sensitive data you can't afford to leak
+- Share `memory.db` if it contains private conversations
+
+### REST API Security
+
+If using the REST API (`nova chat serve`):
+
+- **Default:** Binds to `0.0.0.0:8000` (all interfaces)
+- **Recommendation:** Use `--host 127.0.0.1` for local-only
+- **Authentication:** Not included (add if exposing externally)
+- **HTTPS:** Not included (add if exposing externally)
+
+**For personal use:** Keep localhost-only.
+**For shared use:** Add authentication, HTTPS, rate limiting.
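+
+Because the default bind address listens on all interfaces, a launcher script could check the host before starting the server. The sketch below uses only Python's standard `ipaddress` module; the helper name `is_local_only` is hypothetical and not part of NOVA.
+
+```python
+import ipaddress
+
+
+def is_local_only(host: str) -> bool:
+    """Return True if binding to `host` keeps the server reachable only from this machine."""
+    if host == "localhost":
+        return True
+    try:
+        return ipaddress.ip_address(host).is_loopback
+    except ValueError:
+        # Not an IP literal (some other hostname): treat as not local-only.
+        return False
+
+
+for host in ("127.0.0.1", "::1", "0.0.0.0", "192.168.1.10"):
+    print(host, is_local_only(host))
+```
+
+Only `127.0.0.1` and `::1` report `True` here; `0.0.0.0` means "all interfaces", so it is flagged as exposed.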
+
+---
+
+## Data Deletion
+
+### Clear All Conversations
+
+```bash
+# Delete conversation database
+rm memory.db
+
+# Or programmatically (run through Python, since this is a shell block)
+python -c "from nova_chat import ConversationMemory; ConversationMemory().clear_all()"
+```
+
+### Remove Models
+
+```bash
+# Delete checkpoints
+rm -rf checkpoints/
+```
+
+### Complete Reset
+
+```bash
+# Remove all data
+rm -rf data/ checkpoints/ memory.db
+```
+
+---
+
+## Third-Party Dependencies
+
+NOVA uses standard open-source libraries:
+
+- **PyTorch:** ML framework
+- **SentencePiece:** Tokenization
+- **FastAPI/Uvicorn:** REST API (optional)
+- **SQLite:** Conversation storage
+
+**All are open source and widely audited.**
+
+### Dependency Privacy
+
+- PyTorch: No telemetry (when installed normally)
+- SentencePiece: No telemetry
+- FastAPI: No telemetry
+- SQLite: Local database, no telemetry
+
+---
+
+## Comparison to Cloud LLMs
+
+| Feature | NOVA | Cloud LLMs |
+|---------|------|------------|
+| **Data Location** | Your device | Company servers |
+| **Privacy** | Complete | Varies by provider |
+| **Telemetry** | None | Usually collected |
+| **Internet Required** | No (after setup) | Yes |
+| **Cost** | One-time (hardware) | Per-token/monthly |
+| **Customization** | Full control | Limited |
+| **Data Retention** | Your choice | Company policy |
+
+---
+
+## Transparency
+
+### Open Source
+
+NOVA is **fully open source** under Apache 2.0:
+
+- **Source code:** Fully auditable
+- **No hidden functionality:** What you see is what you get
+- **Community review:** Anyone can inspect for privacy issues
+
+### No Hidden Behavior
+
+NOVA does **not**:
+- Phone home
+- Send analytics
+- Track usage
+- Report errors to external services
+- Auto-update without your action
+
+---
+
+## Recommendations
+
+### For Maximum Privacy
+
+1. **Offline Mode:** Disable network after downloading dependencies
+2. **Encrypt Storage:** Use full-disk encryption (BitLocker, FileVault, LUKS)
+3. **Regular Cleanup:** Clear `memory.db` periodically if desired
+4. **Review Code:** Inspect the source before running
+
+### For Shared Devices
+
+1. **Enable Disclosure:** Set `always_disclose: true`
+2. **Separate Accounts:** Use OS user accounts to isolate data
+3. **Clear Conversations:** Delete history after sessions
+
+### For Development
+
+1. **Test Data Only:** Don't use real sensitive data for testing
+2. **Version Control:** Add `memory.db` and `checkpoints/` to `.gitignore`
+
+---
+
+## Contact for Privacy Concerns
+
+If you find privacy issues:
+
+- **GitHub Issues:** [github.com/yourusername/nova/issues](https://github.com/yourusername/nova/issues)
+- **Security:** Tag issues with `security` label
+
+---
+
+## Summary
+
+**NOVA is designed for local, private use.**
+
+✅ No data collection
+✅ No telemetry
+✅ No cloud dependencies
+✅ Complete user control
+✅ Open source and auditable
+
+**Your data stays on your device.**
+
+---
+
+**Last Updated:** 2025
+**Document Version:** 1.0