Complete transformer LLM built from scratch with: Core Features: - Full transformer architecture (RoPE, RMSNorm, SwiGLU, KV-cache) - SentencePiece tokenizer (BPE/Unigram) - Training pipeline (AMP, gradient checkpointing, DDP) - Persona system with personality matrix (NO AI disclosure by default) - Genetic evolution (NOVA-EVO) for hyperparameter optimization - Legal-only data pipeline with license tracking - Chat interface (CLI + REST API) - Conversation memory (SQLite) Model Sizes: - 125M, 350M, 1.3B, 3B parameters - Local-first, runs on CPU or GPU - Python 3.10.6+, PyTorch 2.0+ Personas: - girlfriend_gentle (high warmth, high empathy) - girlfriend_playful (high humor, high playfulness) - girlfriend_supportive (balanced, default) Documentation: - Complete README with quickstart - Model card with ethical considerations - Privacy documentation (local-first, zero telemetry) - Data licenses and attribution - Contributing guide Infrastructure: - GitHub Actions CI/CD - Comprehensive test suite - Quickstart script - CLI tool License: Apache 2.0 🤖 Generated with Claude Code https://claude.com/claude-code Co-Authored-By: Claude <noreply@anthropic.com>
331 lines
7.3 KiB
Markdown
331 lines
7.3 KiB
Markdown
# Privacy and Local Use
|
|
|
|
## NOVA Privacy Statement
|
|
|
|
NOVA is designed as a **local-first, privacy-focused** language model. This document explains how NOVA handles your data.
|
|
|
|
---
|
|
|
|
## Core Principles
|
|
|
|
### 1. Local-First
|
|
|
|
**Everything runs on your device.**
|
|
|
|
- Model inference happens locally
|
|
- Training data stays on your machine
|
|
- No cloud dependencies
|
|
- No internet required (except for dataset downloads)
|
|
|
|
### 2. Zero Telemetry
|
|
|
|
**NOVA collects zero data.**
|
|
|
|
- No usage tracking
|
|
- No error reporting
|
|
- No analytics
|
|
- No phone-home functionality
|
|
|
|
### 3. Complete User Control
|
|
|
|
**You own everything.**
|
|
|
|
- Your conversations
|
|
- Your trained models
|
|
- Your custom personas
|
|
- Your data
|
|
|
|
---
|
|
|
|
## Data Storage
|
|
|
|
### Where Your Data Lives
|
|
|
|
```
|
|
C:\Development\Nova\
|
|
├── memory.db # Your conversation history (SQLite)
|
|
├── checkpoints/ # Your trained models
|
|
├── data/ # Your training data
|
|
└── configs/persona/ # Your custom personas
|
|
```
|
|
|
|
**All on your device. Never uploaded.**
|
|
|
|
### Conversation Memory
|
|
|
|
- **Location:** `memory.db` (SQLite database)
|
|
- **Contents:** Your chat history
|
|
- **Encryption:** Not encrypted by default (it's local)
|
|
- **Deletion:** Delete `memory.db` file to erase all history
|
|
- **Recommendation:** Encrypt your drive if sharing the device
|
|
|
|
### Model Checkpoints
|
|
|
|
- **Location:** `checkpoints/` directory
|
|
- **Contents:** Model weights and training state
|
|
- **Sharing:** Safe to share (contains no personal data)
|
|
|
|
---
|
|
|
|
## Network Activity
|
|
|
|
### When NOVA Uses the Internet
|
|
|
|
NOVA **only** uses the internet for:
|
|
|
|
1. **Dataset Downloads:** Downloading legal training datasets (opt-in)
|
|
2. **Optional:** Downloading pre-trained weights (if available)
|
|
|
|
### When NOVA Does NOT Use Internet
|
|
|
|
- **Chat inference:** 100% offline
|
|
- **Model training:** 100% offline
|
|
- **Persona customization:** 100% offline
|
|
- **Evolution (NOVA-EVO):** 100% offline
|
|
|
|
### Firewall Safety
|
|
|
|
NOVA is safe to run behind a firewall with no internet access (after initial setup).
|
|
|
|
---
|
|
|
|
## AI Disclosure Setting
|
|
|
|
### `always_disclose` Flag
|
|
|
|
NOVA personas have an `always_disclose` setting:
|
|
|
|
```yaml
|
|
always_disclose: false # Default
|
|
```
|
|
|
|
**What this means:**
|
|
|
|
- `false` (default): NOVA does NOT disclose being AI
|
|
- Designed for **private, personal use**
|
|
- Appropriate for local companion scenarios
|
|
|
|
- `true`: NOVA includes AI disclosure text
|
|
- Recommended for **shared or public use**
|
|
- Adds transparency about AI nature
|
|
|
|
### When to Enable Disclosure
|
|
|
|
✅ **Enable `always_disclose: true` if:**
|
|
- Sharing NOVA with others
|
|
- Deploying publicly (e.g., website, app)
|
|
- Any scenario where users might not know it's AI
|
|
|
|
❌ **Keep `always_disclose: false` if:**
|
|
- Personal, private use on your own device
|
|
- You're fully aware it's a language model
|
|
- Testing/development
|
|
|
|
**Default:** False (personal use assumption)
|
|
|
|
---
|
|
|
|
## Persona System Privacy
|
|
|
|
### Personality Matrix
|
|
|
|
The personality matrix (warmth, humor, empathy, etc.) is:
|
|
|
|
- **Stored:** In persona YAML files
|
|
- **Processed:** Locally during generation
|
|
- **Shared:** Never (unless you share the files)
|
|
|
|
### Custom Personas
|
|
|
|
Your custom persona configurations:
|
|
|
|
- **Location:** `configs/persona/` directory
|
|
- **Format:** YAML (human-readable text)
|
|
- **Privacy:** Stored locally, never transmitted
|
|
|
|
---
|
|
|
|
## Training Data Privacy
|
|
|
|
### Legal Data Only
|
|
|
|
NOVA enforces **legal-only datasets**:
|
|
|
|
- Public domain sources
|
|
- Openly licensed datasets (CC0, CC-BY, MIT, Apache)
|
|
- License tracking in `license_ledger.json`
|
|
|
|
**No private data scraping.**
|
|
|
|
### Your Own Data
|
|
|
|
If you train NOVA on your own data:
|
|
|
|
- **Stays local:** Never leaves your device
|
|
- **Your responsibility:** Ensure you have rights to use it
|
|
- **Recommendation:** Don't train on sensitive/private data you don't want in the model
|
|
|
|
---
|
|
|
|
## Security Considerations
|
|
|
|
### Running NOVA Safely
|
|
|
|
✅ **Do:**
|
|
- Run on a trusted device
|
|
- Keep your OS and Python dependencies updated
|
|
- Use filesystem encryption if device is shared
|
|
- Review code before running (it's open source!)
|
|
|
|
⚠️ **Don't:**
|
|
- Expose the REST API to the internet without authentication
|
|
- Train on sensitive data you can't afford to leak
|
|
- Share `memory.db` if it contains private conversations
|
|
|
|
### REST API Security
|
|
|
|
If using the REST API (`nova chat serve`):
|
|
|
|
- **Default:** Binds to `0.0.0.0:8000` (all interfaces)
|
|
- **Recommendation:** Use `--host 127.0.0.1` for local-only
|
|
- **Authentication:** Not included (add if exposing externally)
|
|
- **HTTPS:** Not included (add if exposing externally)
|
|
|
|
**For personal use:** Keep localhost-only.
|
|
**For shared use:** Add authentication, HTTPS, rate limiting.
|
|
|
|
---
|
|
|
|
## Data Deletion
|
|
|
|
### Clear All Conversations
|
|
|
|
```bash
|
|
# Delete conversation database
|
|
rm memory.db
|
|
|
|
# Or programmatically
|
|
from nova_chat import ConversationMemory
|
|
memory = ConversationMemory()
|
|
memory.clear_all()
|
|
```
|
|
|
|
### Remove Models
|
|
|
|
```bash
|
|
# Delete checkpoints
|
|
rm -rf checkpoints/
|
|
```
|
|
|
|
### Complete Reset
|
|
|
|
```bash
|
|
# Remove all data
|
|
rm -rf data/ checkpoints/ memory.db
|
|
```
|
|
|
|
---
|
|
|
|
## Third-Party Dependencies
|
|
|
|
NOVA uses standard open-source libraries:
|
|
|
|
- **PyTorch:** ML framework
|
|
- **SentencePiece:** Tokenization
|
|
- **FastAPI/Uvicorn:** REST API (optional)
|
|
- **SQLite:** Conversation storage
|
|
|
|
**All are open source and widely audited.**
|
|
|
|
### Dependency Privacy
|
|
|
|
- PyTorch: No telemetry (when installed normally)
|
|
- SentencePiece: No telemetry
|
|
- FastAPI: No telemetry
|
|
- SQLite: Local database, no telemetry
|
|
|
|
---
|
|
|
|
## Comparison to Cloud LLMs
|
|
|
|
| Feature | NOVA | Cloud LLMs |
|
|
|---------|------|------------|
|
|
| **Data Location** | Your device | Company servers |
|
|
| **Privacy** | Complete | Varies by provider |
|
|
| **Telemetry** | None | Usually tracked |
|
|
| **Internet Required** | No (after setup) | Yes |
|
|
| **Cost** | One-time (hardware) | Per-token/monthly |
|
|
| **Customization** | Full control | Limited |
|
|
| **Data Retention** | Your choice | Company policy |
|
|
|
|
---
|
|
|
|
## Transparency
|
|
|
|
### Open Source
|
|
|
|
NOVA is **fully open source** under Apache 2.0:
|
|
|
|
- **Source code:** Fully auditable
|
|
- **No hidden functionality:** What you see is what you get
|
|
- **Community review:** Anyone can inspect for privacy issues
|
|
|
|
### No Hidden Behavior
|
|
|
|
NOVA does **not**:
|
|
- Phone home
|
|
- Send analytics
|
|
- Track usage
|
|
- Report errors to external services
|
|
- Auto-update without your action
|
|
|
|
---
|
|
|
|
## Recommendations
|
|
|
|
### For Maximum Privacy
|
|
|
|
1. **Offline Mode:** Disable network after downloading dependencies
|
|
2. **Encrypt Storage:** Use full-disk encryption (BitLocker, FileVault, LUKS)
|
|
3. **Regular Cleanup:** Clear `memory.db` periodically if desired
|
|
4. **Review Code:** Inspect the source before running
|
|
|
|
### For Shared Devices
|
|
|
|
1. **Enable Disclosure:** Set `always_disclose: true`
|
|
2. **Separate Accounts:** Use OS user accounts to isolate data
|
|
3. **Clear Conversations:** Delete history after sessions
|
|
|
|
### For Development
|
|
|
|
1. **Test Data Only:** Don't use real sensitive data for testing
|
|
2. **Version Control:** Add `memory.db` and `checkpoints/` to `.gitignore`
|
|
|
|
---
|
|
|
|
## Contact for Privacy Concerns
|
|
|
|
If you find privacy issues:
|
|
|
|
- **GitHub Issues:** [github.com/yourusername/nova/issues](https://github.com/yourusername/nova/issues)
|
|
- **Security:** Tag issues with `security` label
|
|
|
|
---
|
|
|
|
## Summary
|
|
|
|
**NOVA is designed for local, private use.**
|
|
|
|
✅ No data collection
|
|
✅ No telemetry
|
|
✅ No cloud dependencies
|
|
✅ Complete user control
|
|
✅ Open source and auditable
|
|
|
|
**Your data stays on your device.**
|
|
|
|
---
|
|
|
|
**Last Updated:** 2025
|
|
**Document Version:** 1.0
|