# NOVA - Neuro-Optimizing Versatile Agent

**A local-first transformer LLM built from scratch with genetic evolution and persona support**

[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![PyTorch](https://img.shields.io/badge/PyTorch-2.0+-red.svg)](https://pytorch.org/)

---

## 🌟 Features

- **Built from Zero**: Complete transformer implementation (RoPE, RMSNorm, SwiGLU, KV-cache)
- **Local-First**: Runs on consumer hardware (CPU or GPU), no cloud dependencies
- **Persona System**: Girlfriend-style companion personas with NO AI disclosure by default
- **Genetic Evolution** (NOVA-EVO): Automatic hyperparameter and architecture optimization
- **Legal Data Only**: Built-in license tracking, only uses properly licensed datasets
- **Production-Ready**: AMP, gradient checkpointing, DDP, TorchScript export, INT8 quantization

---

## 🚀 Quick Start

### Installation

```bash
# Clone repository
git clone https://github.com/yourusername/nova.git
cd nova

# Create virtual environment (Python 3.10.6+)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
pip install -e .
```

### Initialize Project

```bash
# Initialize NOVA with toy dataset
python scripts/cli.py init

# Train tokenizer
python scripts/cli.py tokenizer train --input data/toy_dataset/toy.txt --output tokenizer

# Train 125M model (requires proper dataset)
python scripts/cli.py train --size 125m
```

### Chat with NOVA

```bash
# CLI chat (requires trained model)
python scripts/cli.py chat cli --persona configs/persona/girlfriend_supportive.yaml

# REST API server
python scripts/cli.py chat serve --port 8000
```

---

## 📁 Project Structure

```
nova/
├── nova_core/           # Transformer architecture
│   ├── model.py         # Main NOVA transformer
│   ├── attention.py     # Multi-head attention + KV-cache
│   ├── layers.py        # Transformer blocks
│   ├── rope.py          # Rotary position embeddings
│   ├── normalization.py # RMSNorm / LayerNorm
│   └── activations.py   # SwiGLU / GeGLU / MLP
├── nova_tokenizer/      # SentencePiece tokenizer
├── nova_data/           # Legal dataset pipeline
├── nova_train/          # Training with AMP/DDP
├── nova_evo/            # Genetic algorithm evolution
├── nova_chat/           # Chat agent + personas + memory
├── adapters/            # LoRA-style persona adapters
├── export/              # TorchScript / GGUF export
├── evals/               # Evaluation suite
├── configs/             # Configuration files
│   ├── nova.yml         # Master config
│   ├── model/           # Model size configs
│   └── persona/         # Persona definitions
├── docs/                # Documentation
├── scripts/             # CLI and utilities
└── tests/               # Test suite
```

---

## 🎭 Persona System

NOVA features a unique **personality matrix** system that shapes conversational style **without AI self-disclosure**:

### Default Personas

1. **girlfriend_gentle**: Warm, empathetic, deeply supportive
2. **girlfriend_playful**: Flirty, fun-loving, spontaneous
3. **girlfriend_supportive**: Balanced, caring, understanding (default)

### Personality Matrix

Each persona has 8 personality traits (0.0-1.0) that modulate generation:

- **warmth**: How warm and affectionate
- **humor**: How playful and funny
- **empathy**: How understanding and supportive
- **decisiveness**: How direct vs. thoughtful
- **creativity**: How creative and imaginative
- **intimacy**: How personal and close
- **playfulness**: How flirty and playful
- **formality**: How formal vs. casual

These traits automatically adjust temperature, top-p, repetition penalty, and response length.
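Exactly how the traits map onto decoding settings is internal to `nova_chat`; the sketch below only illustrates the general idea, with hypothetical names and scaling factors rather than NOVA's actual mapping:

```python
from dataclasses import dataclass


@dataclass
class SamplingParams:
    """Decoding settings a chat agent could pass to the generator."""
    temperature: float = 0.8
    top_p: float = 0.9
    repetition_penalty: float = 1.1
    max_new_tokens: int = 256


def params_from_traits(traits: dict[str, float]) -> SamplingParams:
    """Hypothetical trait -> sampling-parameter mapping (illustrative only).

    Higher creativity and playfulness loosen sampling, formality tightens
    repetition, and empathy lengthens replies.
    """
    p = SamplingParams()
    p.temperature += 0.4 * (traits.get("creativity", 0.5) - 0.5)
    p.top_p = min(0.98, 0.85 + 0.1 * traits.get("playfulness", 0.5))
    p.repetition_penalty += 0.1 * traits.get("formality", 0.5)
    p.max_new_tokens = int(128 + 256 * traits.get("empathy", 0.5))
    return p


print(params_from_traits({"creativity": 0.9, "playfulness": 0.8, "empathy": 0.7}))
```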
### Important: No AI Disclosure

**By default, `always_disclose` is set to `false`** in persona configs. This is a private, local system for personal use. To enable AI disclosure, set `always_disclose: true` in the persona YAML.

---

## 🧬 NOVA-EVO: Genetic Evolution

NOVA includes an optional genetic algorithm that evolves model configurations:

```bash
# Run evolution with small budget
python scripts/cli.py evo run --budget small
```

**What it evolves:**

- Hyperparameters: learning rate, batch size, warmup, weight decay
- Architecture: RoPE theta, activation functions, normalization types
- Multi-objective fitness: loss, latency, memory, chat quality

Results are saved to a hall of fame with lineage tracking.

---

## ⚖️ Legal Data Only

NOVA uses **only properly licensed datasets**:

- ✅ Public domain (Project Gutenberg)
- ✅ CC0, CC-BY (Wikipedia, C4)
- ✅ Open licenses (MIT, Apache)

All data sources are tracked in `data/processed/license_ledger.json`.

```bash
# List available legal sources
python scripts/cli.py data build

# Download specific source (with license verification)
python scripts/cli.py data build --source wikipedia-en
```

---

## 🏗️ Model Sizes

| Size | Params | Layers | Hidden | Heads | Context | Memory (FP16) |
|------|--------|--------|--------|-------|---------|---------------|
| 125M | 125M   | 12     | 768    | 12    | 2048    | ~500 MB       |
| 350M | 350M   | 24     | 1024   | 16    | 2048    | ~1.4 GB       |
| 1.3B | 1.3B   | 24     | 2048   | 32    | 2048    | ~5 GB         |
| 3B   | 3B     | 32     | 2560   | 32    | 4096    | ~12 GB        |

All sizes support:

- CPU inference (INT8 quantization available)
- GPU acceleration (CUDA 12+)
- KV-cache for fast generation
- Gradient checkpointing for training

---

## 🔧 Configuration

Master config: `configs/nova.yml`

```yaml
# Hardware
hardware:
  device: auto        # cpu, cuda, cuda:0
  allow_cuda: true

# Persona
persona:
  default: girlfriend_supportive
  always_disclose: false  # NO AI disclosure

# Evolution
evolution:
  enabled: false  # Opt-in
  budget: small

# Data
data:
  legal_only: true  # Enforced
```

---

## 📊 Training

```python
from nova_core import NovaTransformer, MODEL_125M
from nova_train import NovaTrainer, TrainingConfig

# Create model
model = NovaTransformer(MODEL_125M)

# Training config
config = TrainingConfig(
    batch_size=8,
    learning_rate=3e-4,
    use_amp=True,  # Mixed precision
    gradient_checkpointing=True,
)

# Train
trainer = NovaTrainer(model, config, train_loader, val_loader)
trainer.train()
```
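The snippet above assumes `train_loader` and `val_loader` already exist; in NOVA they would come from the `nova_data` pipeline. A rough sketch of what such a loader could look like with plain PyTorch follows (the batch format `NovaTrainer` actually expects is an assumption here):

```python
import torch
from torch.utils.data import DataLoader, Dataset


class PackedTextDataset(Dataset):
    """Chops one long stream of token ids into fixed-length LM samples."""

    def __init__(self, token_ids: list[int], seq_len: int = 2048):
        self.seq_len = seq_len
        n_samples = (len(token_ids) - 1) // seq_len
        self.data = torch.tensor(token_ids[: n_samples * seq_len + 1], dtype=torch.long)

    def __len__(self) -> int:
        return (len(self.data) - 1) // self.seq_len

    def __getitem__(self, idx: int) -> dict:
        start = idx * self.seq_len
        chunk = self.data[start : start + self.seq_len + 1]
        # Next-token prediction: labels are the inputs shifted by one position.
        return {"input_ids": chunk[:-1], "labels": chunk[1:]}


# token_ids would normally come from NovaTokenizer.encode(...) over the corpus;
# a range of placeholder ids keeps the sketch self-contained.
token_ids = list(range(1_000_000))
train_loader = DataLoader(PackedTextDataset(token_ids), batch_size=8, shuffle=True)
val_loader = DataLoader(PackedTextDataset(token_ids[:100_000]), batch_size=8)
```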
How are you?") print(response) ``` ### REST API ```bash # Start server python -m nova_chat.api # Chat curl -X POST http://localhost:8000/chat \ -H "Content-Type: application/json" \ -d '{"message": "Hello!"}' ``` --- ## ๐Ÿงช Testing ```bash # Run tests pytest tests/ # With coverage pytest --cov=nova_core --cov=nova_tokenizer --cov=nova_train ``` --- ## ๐Ÿ“ฆ Export ```bash # TorchScript (CPU optimized) python -m export.torchscript_export \ --model path/to/model.pt \ --output nova_cpu.pt # INT8 quantization python -m export.quantize \ --model nova_cpu.pt \ --output nova_int8.pt # GGUF (optional, for llama.cpp compatibility) python -m export.gguf_converter \ --model path/to/model.pt \ --output nova.gguf ``` --- ## ๐Ÿค Contributing See [CONTRIBUTING.md](docs/CONTRIBUTING.md) --- ## ๐Ÿ“„ License Apache License 2.0 - See [LICENSE](LICENSE) Copyright 2025 NOVA Project Contributors --- ## ๐ŸŽฏ Roadmap - [x] Core transformer architecture - [x] SentencePiece tokenizer - [x] Training pipeline (AMP, DDP) - [x] Persona system - [x] Genetic evolution - [x] Legal data pipeline - [x] Chat interface (CLI + REST) - [ ] Full export suite (TorchScript, GGUF) - [ ] Comprehensive eval suite - [ ] Pre-trained checkpoints (125M, 350M) - [ ] LoRA fine-tuning support - [ ] Multi-language support - [ ] Voice interface - [ ] Mobile deployment --- ## ๐ŸŒŸ Philosophy NOVA is built on these principles: 1. **Local-First**: Your data stays on your device 2. **Transparent**: Open source, auditable, no telemetry 3. **Ethical**: Legal data only, proper attribution 4. **Private**: No AI disclosure required for personal use 5. **Practical**: Runs on consumer hardware --- ## ๐Ÿ“š Documentation - [Model Card](docs/MODEL_CARD.md) - [Data Licenses](docs/DATA_LICENSES.md) - [Privacy & Local Use](docs/PRIVACY_LOCAL.md) - [Contributing Guide](docs/CONTRIBUTING.md) - [Architecture Deep Dive](docs/ARCHITECTURE.md) --- ## โšก Quick Commands Reference ```bash nova init # Initialize project nova tokenizer train # Train tokenizer nova train --size 125m # Train model nova chat cli # CLI chat nova chat serve # Start API server nova evo run --budget small # Run evolution nova data build --source wiki # Download legal data ``` --- **Built with โค๏ธ for local, ethical, and powerful AI**