Complete transformer LLM built from scratch with: Core Features: - Full transformer architecture (RoPE, RMSNorm, SwiGLU, KV-cache) - SentencePiece tokenizer (BPE/Unigram) - Training pipeline (AMP, gradient checkpointing, DDP) - Persona system with personality matrix (NO AI disclosure by default) - Genetic evolution (NOVA-EVO) for hyperparameter optimization - Legal-only data pipeline with license tracking - Chat interface (CLI + REST API) - Conversation memory (SQLite) Model Sizes: - 125M, 350M, 1.3B, 3B parameters - Local-first, runs on CPU or GPU - Python 3.10.6+, PyTorch 2.0+ Personas: - girlfriend_gentle (high warmth, high empathy) - girlfriend_playful (high humor, high playfulness) - girlfriend_supportive (balanced, default) Documentation: - Complete README with quickstart - Model card with ethical considerations - Privacy documentation (local-first, zero telemetry) - Data licenses and attribution - Contributing guide Infrastructure: - GitHub Actions CI/CD - Comprehensive test suite - Quickstart script - CLI tool License: Apache 2.0 🤖 Generated with Claude Code https://claude.com/claude-code Co-Authored-By: Claude <noreply@anthropic.com>
372 lines
9.3 KiB
Markdown
372 lines
9.3 KiB
Markdown
# NOVA - Neuro-Optimizing Versatile Agent
|
|
|
|
**A local-first transformer LLM built from scratch with genetic evolution and persona support**
|
|
|
|
[](https://opensource.org/licenses/Apache-2.0)
|
|
[](https://www.python.org/downloads/)
|
|
[](https://pytorch.org/)
|
|
|
|
---
|
|
|
|
## 🌟 Features
|
|
|
|
- **Built from Zero**: Complete transformer implementation (RoPE, RMSNorm, SwiGLU, KV-cache)
|
|
- **Local-First**: Runs on consumer hardware (CPU or GPU), no cloud dependencies
|
|
- **Persona System**: Girlfriend-style companion personas with NO AI disclosure by default
|
|
- **Genetic Evolution** (NOVA-EVO): Automatic hyperparameter and architecture optimization
|
|
- **Legal Data Only**: Built-in license tracking, only uses properly licensed datasets
|
|
- **Production-Ready**: AMP, gradient checkpointing, DDP, TorchScript export, INT8 quantization
|
|
|
|
---
|
|
|
|
## 🚀 Quick Start
|
|
|
|
### Installation
|
|
|
|
```bash
|
|
# Clone repository
|
|
git clone https://github.com/yourusername/nova.git
|
|
cd nova
|
|
|
|
# Create virtual environment (Python 3.10.6+)
|
|
python -m venv venv
|
|
source venv/bin/activate # On Windows: venv\Scripts\activate
|
|
|
|
# Install dependencies
|
|
pip install -r requirements.txt
|
|
pip install -e .
|
|
```
|
|
|
|
### Initialize Project
|
|
|
|
```bash
|
|
# Initialize NOVA with toy dataset
|
|
python scripts/cli.py init
|
|
|
|
# Train tokenizer
|
|
python scripts/cli.py tokenizer train --input data/toy_dataset/toy.txt --output tokenizer
|
|
|
|
# Train 125M model (requires proper dataset)
|
|
python scripts/cli.py train --size 125m
|
|
```
|
|
|
|
### Chat with NOVA
|
|
|
|
```bash
|
|
# CLI chat (requires trained model)
|
|
python scripts/cli.py chat cli --persona configs/persona/girlfriend_supportive.yaml
|
|
|
|
# REST API server
|
|
python scripts/cli.py chat serve --port 8000
|
|
```
|
|
|
|
---
|
|
|
|
## 📁 Project Structure
|
|
|
|
```
|
|
nova/
|
|
├── nova_core/ # Transformer architecture
|
|
│ ├── model.py # Main NOVA transformer
|
|
│ ├── attention.py # Multi-head attention + KV-cache
|
|
│ ├── layers.py # Transformer blocks
|
|
│ ├── rope.py # Rotary position embeddings
|
|
│ ├── normalization.py # RMSNorm / LayerNorm
|
|
│ └── activations.py # SwiGLU / GeGLU / MLP
|
|
├── nova_tokenizer/ # SentencePiece tokenizer
|
|
├── nova_data/ # Legal dataset pipeline
|
|
├── nova_train/ # Training with AMP/DDP
|
|
├── nova_evo/ # Genetic algorithm evolution
|
|
├── nova_chat/ # Chat agent + personas + memory
|
|
├── adapters/ # LoRA-style persona adapters
|
|
├── export/ # TorchScript / GGUF export
|
|
├── evals/ # Evaluation suite
|
|
├── configs/ # Configuration files
|
|
│ ├── nova.yml # Master config
|
|
│ ├── model/ # Model size configs
|
|
│ └── persona/ # Persona definitions
|
|
├── docs/ # Documentation
|
|
├── scripts/ # CLI and utilities
|
|
└── tests/ # Test suite
|
|
```
|
|
|
|
---
|
|
|
|
## 🎭 Persona System
|
|
|
|
NOVA features a unique **personality matrix** system that shapes conversational style **without AI self-disclosure**:
|
|
|
|
### Default Personas
|
|
|
|
1. **girlfriend_gentle**: Warm, empathetic, deeply supportive
|
|
2. **girlfriend_playful**: Flirty, fun-loving, spontaneous
|
|
3. **girlfriend_supportive**: Balanced, caring, understanding (default)
|
|
|
|
### Personality Matrix
|
|
|
|
Each persona has 8 personality traits (0.0-1.0) that modulate generation:
|
|
|
|
- **warmth**: How warm and affectionate
|
|
- **humor**: How playful and funny
|
|
- **empathy**: How understanding and supportive
|
|
- **decisiveness**: How direct vs thoughtful
|
|
- **creativity**: How creative and imaginative
|
|
- **intimacy**: How personal and close
|
|
- **playfulness**: How flirty and playful
|
|
- **formality**: How formal vs casual
|
|
|
|
These traits automatically adjust temperature, top-p, repetition penalty, and response length!
|
|
|
|
### Important: No AI Disclosure
|
|
|
|
**By default, `always_disclose` is set to `false`** in persona configs. This is a private, local system for personal use. To enable AI disclosure, set `always_disclose: true` in persona YAML.
|
|
|
|
---
|
|
|
|
## 🧬 NOVA-EVO: Genetic Evolution
|
|
|
|
NOVA includes an optional genetic algorithm that evolves model configurations:
|
|
|
|
```bash
|
|
# Run evolution with small budget
|
|
python scripts/cli.py evo run --budget small
|
|
```
|
|
|
|
**What it evolves:**
|
|
- Hyperparameters: learning rate, batch size, warmup, weight decay
|
|
- Architecture: RoPE theta, activation functions, normalization types
|
|
- Multi-objective fitness: loss, latency, memory, chat quality
|
|
|
|
Results saved to hall of fame with lineage tracking!
|
|
|
|
---
|
|
|
|
## ⚖️ Legal Data Only
|
|
|
|
NOVA uses **only properly licensed datasets**:
|
|
|
|
- ✅ Public domain (Project Gutenberg)
|
|
- ✅ CC0, CC-BY (Wikipedia, C4)
|
|
- ✅ Open licenses (MIT, Apache)
|
|
|
|
All data sources tracked in `data/processed/license_ledger.json`
|
|
|
|
```bash
|
|
# List available legal sources
|
|
python scripts/cli.py data build
|
|
|
|
# Download specific source (with license verification)
|
|
python scripts/cli.py data build --source wikipedia-en
|
|
```
|
|
|
|
---
|
|
|
|
## 🏗️ Model Sizes
|
|
|
|
| Size | Params | Layers | Hidden | Heads | Context | Memory (FP16) |
|
|
|------|--------|--------|--------|-------|---------|---------------|
|
|
| 125M | 125M | 12 | 768 | 12 | 2048 | ~500 MB |
|
|
| 350M | 350M | 24 | 1024 | 16 | 2048 | ~1.4 GB |
|
|
| 1.3B | 1.3B | 24 | 2048 | 32 | 2048 | ~5 GB |
|
|
| 3B | 3B | 32 | 2560 | 32 | 4096 | ~12 GB |
|
|
|
|
All sizes support:
|
|
- CPU inference (INT8 quantization available)
|
|
- GPU acceleration (CUDA 12+)
|
|
- KV-cache for fast generation
|
|
- Gradient checkpointing for training
|
|
|
|
---
|
|
|
|
## 🔧 Configuration
|
|
|
|
Master config: `configs/nova.yml`
|
|
|
|
```yaml
|
|
# Hardware
|
|
hardware:
|
|
device: auto # cpu, cuda, cuda:0
|
|
allow_cuda: true
|
|
|
|
# Persona
|
|
persona:
|
|
default: girlfriend_supportive
|
|
always_disclose: false # NO AI disclosure
|
|
|
|
# Evolution
|
|
evolution:
|
|
enabled: false # Opt-in
|
|
budget: small
|
|
|
|
# Data
|
|
data:
|
|
legal_only: true # Enforced
|
|
```
|
|
|
|
---
|
|
|
|
## 📊 Training
|
|
|
|
```python
|
|
from nova_core import NovaTransformer, MODEL_125M
|
|
from nova_train import NovaTrainer, TrainingConfig
|
|
|
|
# Create model
|
|
model = NovaTransformer(MODEL_125M)
|
|
|
|
# Training config
|
|
config = TrainingConfig(
|
|
batch_size=8,
|
|
learning_rate=3e-4,
|
|
use_amp=True, # Mixed precision
|
|
gradient_checkpointing=True,
|
|
)
|
|
|
|
# Train
|
|
trainer = NovaTrainer(model, config, train_loader, val_loader)
|
|
trainer.train()
|
|
```
|
|
|
|
---
|
|
|
|
## 💬 Chat Interface
|
|
|
|
### Python API
|
|
|
|
```python
|
|
from nova_chat import ChatAgent, PersonaLoader
|
|
from nova_core import NovaTransformer
|
|
from nova_tokenizer import NovaTokenizer
|
|
|
|
# Load model and tokenizer
|
|
model = NovaTransformer.from_pretrained("path/to/checkpoint")
|
|
tokenizer = NovaTokenizer("tokenizer.model")
|
|
|
|
# Create agent with persona
|
|
persona = PersonaLoader.create_girlfriend_supportive()
|
|
agent = ChatAgent(model, tokenizer, persona)
|
|
|
|
# Chat
|
|
agent.start_conversation()
|
|
response = agent.chat("Hey! How are you?")
|
|
print(response)
|
|
```
|
|
|
|
### REST API
|
|
|
|
```bash
|
|
# Start server
|
|
python -m nova_chat.api
|
|
|
|
# Chat
|
|
curl -X POST http://localhost:8000/chat \
|
|
-H "Content-Type: application/json" \
|
|
-d '{"message": "Hello!"}'
|
|
```
|
|
|
|
---
|
|
|
|
## 🧪 Testing
|
|
|
|
```bash
|
|
# Run tests
|
|
pytest tests/
|
|
|
|
# With coverage
|
|
pytest --cov=nova_core --cov=nova_tokenizer --cov=nova_train
|
|
```
|
|
|
|
---
|
|
|
|
## 📦 Export
|
|
|
|
```bash
|
|
# TorchScript (CPU optimized)
|
|
python -m export.torchscript_export \
|
|
--model path/to/model.pt \
|
|
--output nova_cpu.pt
|
|
|
|
# INT8 quantization
|
|
python -m export.quantize \
|
|
--model nova_cpu.pt \
|
|
--output nova_int8.pt
|
|
|
|
# GGUF (optional, for llama.cpp compatibility)
|
|
python -m export.gguf_converter \
|
|
--model path/to/model.pt \
|
|
--output nova.gguf
|
|
```
|
|
|
|
---
|
|
|
|
## 🤝 Contributing
|
|
|
|
See [CONTRIBUTING.md](docs/CONTRIBUTING.md)
|
|
|
|
---
|
|
|
|
## 📄 License
|
|
|
|
Apache License 2.0 - See [LICENSE](LICENSE)
|
|
|
|
Copyright 2025 NOVA Project Contributors
|
|
|
|
---
|
|
|
|
## 🎯 Roadmap
|
|
|
|
- [x] Core transformer architecture
|
|
- [x] SentencePiece tokenizer
|
|
- [x] Training pipeline (AMP, DDP)
|
|
- [x] Persona system
|
|
- [x] Genetic evolution
|
|
- [x] Legal data pipeline
|
|
- [x] Chat interface (CLI + REST)
|
|
- [ ] Full export suite (TorchScript, GGUF)
|
|
- [ ] Comprehensive eval suite
|
|
- [ ] Pre-trained checkpoints (125M, 350M)
|
|
- [ ] LoRA fine-tuning support
|
|
- [ ] Multi-language support
|
|
- [ ] Voice interface
|
|
- [ ] Mobile deployment
|
|
|
|
---
|
|
|
|
## 🌟 Philosophy
|
|
|
|
NOVA is built on these principles:
|
|
|
|
1. **Local-First**: Your data stays on your device
|
|
2. **Transparent**: Open source, auditable, no telemetry
|
|
3. **Ethical**: Legal data only, proper attribution
|
|
4. **Private**: No AI disclosure required for personal use
|
|
5. **Practical**: Runs on consumer hardware
|
|
|
|
---
|
|
|
|
## 📚 Documentation
|
|
|
|
- [Model Card](docs/MODEL_CARD.md)
|
|
- [Data Licenses](docs/DATA_LICENSES.md)
|
|
- [Privacy & Local Use](docs/PRIVACY_LOCAL.md)
|
|
- [Contributing Guide](docs/CONTRIBUTING.md)
|
|
- [Architecture Deep Dive](docs/ARCHITECTURE.md)
|
|
|
|
---
|
|
|
|
## ⚡ Quick Commands Reference
|
|
|
|
```bash
|
|
nova init # Initialize project
|
|
nova tokenizer train # Train tokenizer
|
|
nova train --size 125m # Train model
|
|
nova chat cli # CLI chat
|
|
nova chat serve # Start API server
|
|
nova evo run --budget small # Run evolution
|
|
nova data build --source wiki # Download legal data
|
|
```
|
|
|
|
---
|
|
|
|
**Built with ❤️ for local, ethical, and powerful AI**
|