Initial commit: NOVA - Neuro-Optimizing Versatile Agent

Complete transformer LLM built from scratch with:

Core Features:
- Full transformer architecture (RoPE, RMSNorm, SwiGLU, KV-cache)
- SentencePiece tokenizer (BPE/Unigram)
- Training pipeline (AMP, gradient checkpointing, DDP)
- Persona system with personality matrix (NO AI disclosure by default)
- Genetic evolution (NOVA-EVO) for hyperparameter optimization
- Legal-only data pipeline with license tracking
- Chat interface (CLI + REST API)
- Conversation memory (SQLite)

Model Sizes:
- 125M, 350M, 1.3B, 3B parameters
- Local-first, runs on CPU or GPU
- Python 3.10.6+, PyTorch 2.0+

Personas:
- girlfriend_gentle (high warmth, high empathy)
- girlfriend_playful (high humor, high playfulness)
- girlfriend_supportive (balanced, default)

Documentation:
- Complete README with quickstart
- Model card with ethical considerations
- Privacy documentation (local-first, zero telemetry)
- Data licenses and attribution
- Contributing guide

Infrastructure:
- GitHub Actions CI/CD
- Comprehensive test suite
- Quickstart script
- CLI tool

License: Apache 2.0

🤖 Generated with Claude Code https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-12 20:56:37 -04:00
commit a7f091aa45
50 changed files with 6437 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,371 @@
+# NOVA - Neuro-Optimizing Versatile Agent
+
+**A local-first transformer LLM built from scratch with genetic evolution and persona support**
+
+[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
+[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
+[![PyTorch](https://img.shields.io/badge/PyTorch-2.0+-red.svg)](https://pytorch.org/)
+
+---
+
+## 🌟 Features
+
+- **Built from Scratch**: Complete transformer implementation (RoPE, RMSNorm, SwiGLU, KV-cache)
+- **Local-First**: Runs on consumer hardware (CPU or GPU), no cloud dependencies
+- **Persona System**: Girlfriend-style companion personas with NO AI disclosure by default
+- **Genetic Evolution** (NOVA-EVO): Automatic hyperparameter and architecture optimization
+- **Legal Data Only**: Built-in license tracking, only uses properly licensed datasets
+- **Training and Export Tooling**: AMP, gradient checkpointing, and DDP for training; TorchScript export and INT8 quantization (the full export suite is still on the roadmap)
+
+---
+
+## 🚀 Quick Start
+
+### Installation
+
+```bash
+# Clone repository
+git clone https://github.com/yourusername/nova.git
+cd nova
+
+# Create virtual environment (Python 3.10.6+)
+python -m venv venv
+source venv/bin/activate  # On Windows: venv\Scripts\activate
+
+# Install dependencies
+pip install -r requirements.txt
+pip install -e .
+```
+
+### Initialize Project
+
+```bash
+# Initialize NOVA with toy dataset
+python scripts/cli.py init
+
+# Train tokenizer
+python scripts/cli.py tokenizer train --input data/toy_dataset/toy.txt --output tokenizer
+
+# Train 125M model (requires proper dataset)
+python scripts/cli.py train --size 125m
+```
+
+### Chat with NOVA
+
+```bash
+# CLI chat (requires trained model)
+python scripts/cli.py chat cli --persona configs/persona/girlfriend_supportive.yaml
+
+# REST API server
+python scripts/cli.py chat serve --port 8000
+```
+
+---
+
+## 📁 Project Structure
+
+```
+nova/
+├── nova_core/              # Transformer architecture
+│   ├── model.py           # Main NOVA transformer
+│   ├── attention.py       # Multi-head attention + KV-cache
+│   ├── layers.py          # Transformer blocks
+│   ├── rope.py            # Rotary position embeddings
+│   ├── normalization.py   # RMSNorm / LayerNorm
+│   └── activations.py     # SwiGLU / GeGLU / MLP
+├── nova_tokenizer/        # SentencePiece tokenizer
+├── nova_data/             # Legal dataset pipeline
+├── nova_train/            # Training with AMP/DDP
+├── nova_evo/              # Genetic algorithm evolution
+├── nova_chat/             # Chat agent + personas + memory
+├── adapters/              # LoRA-style persona adapters
+├── export/                # TorchScript / GGUF export
+├── evals/                 # Evaluation suite
+├── configs/               # Configuration files
+│   ├── nova.yml          # Master config
+│   ├── model/            # Model size configs
+│   └── persona/          # Persona definitions
+├── docs/                  # Documentation
+├── scripts/              # CLI and utilities
+└── tests/                # Test suite
+```
+
+---
+
+## 🎭 Persona System
+
+NOVA uses a **personality matrix** of eight numeric traits to shape conversational style **without AI self-disclosure** by default.
+
+### Default Personas
+
+1. **girlfriend_gentle**: Warm, empathetic, deeply supportive
+2. **girlfriend_playful**: Flirty, fun-loving, spontaneous
+3. **girlfriend_supportive**: Balanced, caring, understanding (default)
+
+### Personality Matrix
+
+Each persona has 8 personality traits (0.0-1.0) that modulate generation:
+
+- **warmth**: How warm and affectionate
+- **humor**: How playful and funny
+- **empathy**: How understanding and supportive
+- **decisiveness**: How direct vs thoughtful
+- **creativity**: How creative and imaginative
+- **intimacy**: How personal and close
+- **playfulness**: How flirty and playful
+- **formality**: How formal vs casual
+
+These traits automatically adjust the sampling settings used at generation time: temperature, top-p, repetition penalty, and response length.
+
+### Important: No AI Disclosure
+
+**By default, `always_disclose` is set to `false`** in persona configs. This is a private, local system for personal use. To enable AI disclosure, set `always_disclose: true` in the persona's YAML file.
+
+---
+
+## 🧬 NOVA-EVO: Genetic Evolution
+
+NOVA includes an optional genetic algorithm that evolves model configurations:
+
+```bash
+# Run evolution with small budget
+python scripts/cli.py evo run --budget small
+```
+
+**What it evolves:**
+- Hyperparameters: learning rate, batch size, warmup, weight decay
+- Architecture: RoPE theta, activation functions, normalization types
+
+**How candidates are scored:** multi-objective fitness over loss, latency, memory, and chat quality
+
+Results are saved to a hall of fame with lineage tracking.
+
+---
+
+## ⚖️ Legal Data Only
+
+NOVA uses **only properly licensed datasets**:
+
+- ✅ Public domain (Project Gutenberg)
+- ✅ CC0, CC BY, CC BY-SA (Wikipedia is CC BY-SA)
+- ✅ Open licenses (MIT, Apache, ODC-BY for C4)
+
+All data sources are tracked in `data/processed/license_ledger.json`.
+
+```bash
+# List available legal sources
+python scripts/cli.py data build
+
+# Download specific source (with license verification)
+python scripts/cli.py data build --source wikipedia-en
+```
+
+---
+
+## 🏗️ Model Sizes
+
+| Size | Params | Layers | Hidden | Heads | Context | Memory (FP32) |
+|------|--------|--------|--------|-------|---------|---------------|
+| 125M | 125M   | 12     | 768    | 12    | 2048    | ~500 MB       |
+| 350M | 350M   | 24     | 1024   | 16    | 2048    | ~1.4 GB       |
+| 1.3B | 1.3B   | 24     | 2048   | 32    | 2048    | ~5 GB         |
+| 3B   | 3B     | 32     | 2560   | 32    | 4096    | ~12 GB        |
+
+All sizes support:
+- CPU inference (INT8 quantization available)
+- GPU acceleration (CUDA 12+)
+- KV-cache for fast generation
+- Gradient checkpointing for training
+
+---
+
+## 🔧 Configuration
+
+Master config: `configs/nova.yml`
+
+```yaml
+# Hardware
+hardware:
+  device: auto  # cpu, cuda, cuda:0
+  allow_cuda: true
+
+# Persona
+persona:
+  default: girlfriend_supportive
+  always_disclose: false  # NO AI disclosure
+
+# Evolution
+evolution:
+  enabled: false  # Opt-in
+  budget: small
+
+# Data
+data:
+  legal_only: true  # Enforced
+```
+
+---
+
+## 📊 Training
+
+```python
+from nova_core import NovaTransformer, MODEL_125M
+from nova_train import NovaTrainer, TrainingConfig
+
+# Create model
+model = NovaTransformer(MODEL_125M)
+
+# Training config
+config = TrainingConfig(
+    batch_size=8,
+    learning_rate=3e-4,
+    use_amp=True,  # Mixed precision
+    gradient_checkpointing=True,
+)
+
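+# Train
+# train_loader and val_loader are not defined in this snippet; they are assumed
+# to be data loaders (e.g. torch.utils.data.DataLoader) built from your tokenized dataset.
+trainer = NovaTrainer(model, config, train_loader, val_loader)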
+trainer.train()
+```
+
+---
+
+## 💬 Chat Interface
+
+### Python API
+
+```python
+from nova_chat import ChatAgent, PersonaLoader
+from nova_core import NovaTransformer
+from nova_tokenizer import NovaTokenizer
+
+# Load model and tokenizer
+model = NovaTransformer.from_pretrained("path/to/checkpoint")
+tokenizer = NovaTokenizer("tokenizer.model")
+
+# Create agent with persona
+persona = PersonaLoader.create_girlfriend_supportive()
+agent = ChatAgent(model, tokenizer, persona)
+
+# Chat
+agent.start_conversation()
+response = agent.chat("Hey! How are you?")
+print(response)
+```
+
+### REST API
+
+```bash
+# Start server
+python -m nova_chat.api
+
+# Chat
+curl -X POST http://localhost:8000/chat \
+  -H "Content-Type: application/json" \
+  -d '{"message": "Hello!"}'
+```
+
+---
+
+## 🧪 Testing
+
+```bash
+# Run tests
+pytest tests/
+
+# With coverage
+pytest --cov=nova_core --cov=nova_tokenizer --cov=nova_train
+```
+
+---
+
+## 📦 Export
+
+```bash
+# TorchScript (CPU optimized)
+python -m export.torchscript_export \
+  --model path/to/model.pt \
+  --output nova_cpu.pt
+
+# INT8 quantization
+python -m export.quantize \
+  --model nova_cpu.pt \
+  --output nova_int8.pt
+
+# GGUF (optional, for llama.cpp compatibility)
+python -m export.gguf_converter \
+  --model path/to/model.pt \
+  --output nova.gguf
+```
+
+---
+
+## 🤝 Contributing
+
+See [CONTRIBUTING.md](docs/CONTRIBUTING.md)
+
+---
+
+## 📄 License
+
+Apache License 2.0 - See [LICENSE](LICENSE)
+
+Copyright 2025 NOVA Project Contributors
+
+---
+
+## 🎯 Roadmap
+
+- [x] Core transformer architecture
+- [x] SentencePiece tokenizer
+- [x] Training pipeline (AMP, DDP)
+- [x] Persona system
+- [x] Genetic evolution
+- [x] Legal data pipeline
+- [x] Chat interface (CLI + REST)
+- [ ] Full export suite (TorchScript, GGUF)
+- [ ] Comprehensive eval suite
+- [ ] Pre-trained checkpoints (125M, 350M)
+- [ ] LoRA fine-tuning support
+- [ ] Multi-language support
+- [ ] Voice interface
+- [ ] Mobile deployment
+
+---
+
+## 🌟 Philosophy
+
+NOVA is built on these principles:
+
+1. **Local-First**: Your data stays on your device
+2. **Transparent**: Open source, auditable, no telemetry
+3. **Ethical**: Legal data only, proper attribution
+4. **Private**: No AI disclosure required for personal use
+5. **Practical**: Runs on consumer hardware
+
+---
+
+## 📚 Documentation
+
+- [Model Card](docs/MODEL_CARD.md)
+- [Data Licenses](docs/DATA_LICENSES.md)
+- [Privacy & Local Use](docs/PRIVACY_LOCAL.md)
+- [Contributing Guide](docs/CONTRIBUTING.md)
+- [Architecture Deep Dive](docs/ARCHITECTURE.md)
+
+---
+
+## ⚡ Quick Commands Reference
+
+```bash
+nova init                          # Initialize project
+nova tokenizer train               # Train tokenizer
+nova train --size 125m            # Train model
+nova chat cli                      # CLI chat
+nova chat serve                    # Start API server
+nova evo run --budget small       # Run evolution
+nova data build --source wikipedia-en  # Download legal data
+```
+
+---
+
+**Built with ❤️ for local, ethical, and powerful AI**
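
A note on the personality matrix described in the README above: it states that eight traits in the range 0.0-1.0 adjust temperature, top-p, repetition penalty, and response length, but this diff does not show the mapping itself. The sketch below is one way such a mapping could look. The `Traits` dataclass, the `lerp` helper, the `sampling_params` function, and every numeric range are assumptions made for illustration, not code taken from `nova_chat`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Traits:
    """Eight traits in [0.0, 1.0], named as in the README's personality matrix."""
    warmth: float = 0.5
    humor: float = 0.5
    empathy: float = 0.5
    decisiveness: float = 0.5
    creativity: float = 0.5
    intimacy: float = 0.5
    playfulness: float = 0.5
    formality: float = 0.5


def lerp(low: float, high: float, t: float) -> float:
    """Linear interpolation between low and high, with t clamped to [0, 1]."""
    t = min(1.0, max(0.0, t))
    return low + (high - low) * t


def sampling_params(traits: Traits) -> dict:
    """Hypothetical trait-to-sampling mapping; all ranges are illustrative."""
    # Creative, playful, humorous personas sample more freely.
    looseness = (traits.creativity + traits.playfulness + traits.humor) / 3
    # Warm, empathetic personas reply at more length; decisive ones are terser.
    verbosity = (traits.warmth + traits.empathy + (1.0 - traits.decisiveness)) / 3
    return {
        "temperature": round(lerp(0.6, 1.1, looseness), 3),
        "top_p": round(lerp(0.85, 0.97, looseness), 3),
        # Assumption: more formal personas are penalized harder for repetition.
        "repetition_penalty": round(lerp(1.05, 1.2, traits.formality), 3),
        "max_new_tokens": int(lerp(64, 256, verbosity)),
    }


if __name__ == "__main__":
    neutral = Traits()
    playful = Traits(humor=0.9, playfulness=0.9, creativity=0.8)
    print("neutral:", sampling_params(neutral))
    print("playful:", sampling_params(playful))
```

In this sketch, raising `creativity`, `playfulness`, or `humor` raises both temperature and top-p, so a persona with higher values for those traits samples more freely. Keeping the mapping as a pure function of the traits makes it easy to unit-test separately from the model.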