# NOVA - Neuro-Optimizing Versatile Agent

**A local-first transformer LLM built from scratch with genetic evolution and persona support**

[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![PyTorch](https://img.shields.io/badge/PyTorch-2.0+-red.svg)](https://pytorch.org/)

---

## 🌟 Features

- **Built from Zero**: Complete transformer implementation (RoPE, RMSNorm, SwiGLU, KV-cache)
- **Local-First**: Runs on consumer hardware (CPU or GPU), no cloud dependencies
- **Persona System**: Girlfriend-style companion personas with NO AI disclosure by default
- **Genetic Evolution** (NOVA-EVO): Automatic hyperparameter and architecture optimization
- **Legal Data Only**: Built-in license tracking, only uses properly licensed datasets
- **Production-Oriented**: AMP, gradient checkpointing, and DDP for training; TorchScript export and INT8 quantization for deployment

---

## 🚀 Quick Start

### Installation

```bash
# Clone repository
git clone https://github.com/yourusername/nova.git
cd nova

# Create virtual environment (Python 3.10.6+)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
pip install -e .
```

### Initialize Project

```bash
# Initialize NOVA with toy dataset
python scripts/cli.py init

# Train tokenizer
python scripts/cli.py tokenizer train --input data/toy_dataset/toy.txt --output tokenizer

# Train 125M model (requires proper dataset)
python scripts/cli.py train --size 125m
```

### Chat with NOVA

```bash
# CLI chat (requires trained model)
python scripts/cli.py chat cli --persona configs/persona/girlfriend_supportive.yaml

# REST API server
python scripts/cli.py chat serve --port 8000
```

---

## 📁 Project Structure

```
nova/
├── nova_core/              # Transformer architecture
│   ├── model.py           # Main NOVA transformer
│   ├── attention.py       # Multi-head attention + KV-cache
│   ├── layers.py          # Transformer blocks
│   ├── rope.py            # Rotary position embeddings
│   ├── normalization.py   # RMSNorm / LayerNorm
│   └── activations.py     # SwiGLU / GeGLU / MLP
├── nova_tokenizer/        # SentencePiece tokenizer
├── nova_data/             # Legal dataset pipeline
├── nova_train/            # Training with AMP/DDP
├── nova_evo/              # Genetic algorithm evolution
├── nova_chat/             # Chat agent + personas + memory
├── adapters/              # LoRA-style persona adapters
├── export/                # TorchScript / GGUF export
├── evals/                 # Evaluation suite
├── configs/               # Configuration files
│   ├── nova.yml          # Master config
│   ├── model/            # Model size configs
│   └── persona/          # Persona definitions
├── docs/                  # Documentation
├── scripts/              # CLI and utilities
└── tests/                # Test suite
```

---

## 🎭 Persona System

NOVA features a unique **personality matrix** system that shapes conversational style **without AI self-disclosure**:

### Default Personas

1. **girlfriend_gentle**: Warm, empathetic, deeply supportive
2. **girlfriend_playful**: Flirty, fun-loving, spontaneous
3. **girlfriend_supportive**: Balanced, caring, understanding (default)

### Personality Matrix

Each persona has 8 personality traits (0.0-1.0) that modulate generation:

- **warmth**: How warm and affectionate
- **humor**: How playful and funny
- **empathy**: How understanding and supportive
- **decisiveness**: How direct vs thoughtful
- **creativity**: How creative and imaginative
- **intimacy**: How personal and close
- **playfulness**: How flirty and playful
- **formality**: How formal vs casual

These traits automatically adjust temperature, top-p, repetition penalty, and response length!
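
As a rough illustration of how trait values could drive generation settings, here is a minimal sketch. The class name `PersonalityMatrix`, the function `sampling_params`, the interpolation ranges, and the choice of which trait steers which setting are all assumptions made for this example; they are not taken from NOVA's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class PersonalityMatrix:
    # The eight trait names come from the list above; each lies in [0.0, 1.0].
    warmth: float = 0.5
    humor: float = 0.5
    empathy: float = 0.5
    decisiveness: float = 0.5
    creativity: float = 0.5
    intimacy: float = 0.5
    playfulness: float = 0.5
    formality: float = 0.5


def lerp(low: float, high: float, t: float) -> float:
    """Linearly interpolate between low and high, clamping t to [0, 1]."""
    t = min(1.0, max(0.0, t))
    return low + (high - low) * t


def sampling_params(p: PersonalityMatrix) -> dict:
    """Hypothetical trait-to-sampling mapping; all ranges are illustrative."""
    expressive = (p.creativity + p.playfulness + p.humor) / 3.0
    return {
        # More expressive personas sample at a higher temperature.
        "temperature": round(lerp(0.6, 1.1, expressive), 3),
        # More formal personas use a tighter nucleus.
        "top_p": round(lerp(0.95, 0.80, p.formality), 3),
        # More creative personas are pushed harder away from repeated tokens.
        "repetition_penalty": round(lerp(1.05, 1.25, p.creativity), 3),
        # More decisive personas answer more briefly.
        "max_new_tokens": int(round(lerp(256, 64, p.decisiveness))),
    }


if __name__ == "__main__":
    playful = PersonalityMatrix(humor=0.9, playfulness=0.9, creativity=0.8, formality=0.1)
    print(sampling_params(playful))
```

Linear interpolation keeps every setting inside a known range, so an extreme trait value cannot push sampling into a degenerate regime.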

### Important: No AI Disclosure

**By default, `always_disclose` is set to `false`** in persona configs. This is a private, local system for personal use. To enable AI disclosure, set `always_disclose: true` in persona YAML.

---

## 🧬 NOVA-EVO: Genetic Evolution

NOVA includes an optional genetic algorithm that evolves model configurations:

```bash
# Run evolution with small budget
python scripts/cli.py evo run --budget small
```

**What it evolves:**
- Hyperparameters: learning rate, batch size, warmup, weight decay
- Architecture: RoPE theta, activation functions, normalization types
- Multi-objective fitness: loss, latency, memory, chat quality

Results are saved to a hall of fame with lineage tracking.
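
The loop below is a minimal sketch of the kind of genetic search described above: crossover, mutation, elitist selection, a weighted-sum fitness over several objectives, and a hall of fame whose entries record their parents. The search space values, the `Individual` class, the fitness weights, and `toy_evaluate` (a stand-in for a short training run) are all hypothetical and are not NOVA-EVO's actual code.

```python
import random
from dataclasses import dataclass

# Hypothetical search space: gene names follow the list above, values are illustrative.
SEARCH_SPACE = {
    "learning_rate": [1e-4, 3e-4, 6e-4, 1e-3],
    "batch_size": [4, 8, 16, 32],
    "warmup_steps": [100, 500, 1000],
    "weight_decay": [0.0, 0.01, 0.1],
    "rope_theta": [10_000.0, 100_000.0, 500_000.0],
    "activation": ["swiglu", "geglu", "mlp"],
    "norm": ["rmsnorm", "layernorm"],
}


@dataclass
class Individual:
    genes: dict
    uid: int
    parents: tuple = ()             # lineage: uids of both parents, empty for seeds
    fitness: float = float("-inf")  # -inf marks "not evaluated yet"


def scalar_fitness(metrics: dict, weights: dict) -> float:
    """Collapse several objectives into one score (higher is better)."""
    # Loss, latency, and memory are costs, so they enter with a negative sign.
    return (weights["quality"] * metrics["quality"]
            - weights["loss"] * metrics["loss"]
            - weights["latency"] * metrics["latency"]
            - weights["memory"] * metrics["memory"])


def crossover(a: dict, b: dict, rng: random.Random) -> dict:
    """Uniform crossover: each gene is copied from one parent at random."""
    return {k: rng.choice((a[k], b[k])) for k in SEARCH_SPACE}


def mutate(genes: dict, rng: random.Random, rate: float = 0.2) -> dict:
    """Resample each gene from the search space with probability `rate`."""
    return {k: (rng.choice(SEARCH_SPACE[k]) if rng.random() < rate else v)
            for k, v in genes.items()}


def evolve(evaluate, generations=5, pop_size=8, elite=2, seed=0):
    rng = random.Random(seed)
    population = [Individual({k: rng.choice(v) for k, v in SEARCH_SPACE.items()}, uid)
                  for uid in range(pop_size)]
    next_uid = pop_size
    hall_of_fame = []
    for _ in range(generations):
        for ind in population:
            if ind.fitness == float("-inf"):
                ind.fitness = evaluate(ind.genes)
        population.sort(key=lambda i: i.fitness, reverse=True)
        # Merge the current elites into the hall of fame, deduplicated by uid.
        best = {i.uid: i for i in hall_of_fame + population[:elite]}
        hall_of_fame = sorted(best.values(), key=lambda i: i.fitness, reverse=True)[:elite]
        # Breed children from the top half; elites survive unchanged.
        pool = population[: pop_size // 2]
        children = []
        while len(children) < pop_size - elite:
            a, b = rng.sample(pool, 2)
            genes = mutate(crossover(a.genes, b.genes, rng), rng)
            children.append(Individual(genes, next_uid, parents=(a.uid, b.uid)))
            next_uid += 1
        population = population[:elite] + children
    return hall_of_fame


def toy_evaluate(genes: dict) -> float:
    """Deterministic stand-in for training a candidate and measuring it."""
    metrics = {
        "loss": abs(genes["learning_rate"] - 3e-4) * 1000 + genes["weight_decay"],
        "latency": 0.1 if genes["activation"] == "mlp" else 0.2,
        "memory": genes["batch_size"] / 32,
        "quality": 1.0 if genes["norm"] == "rmsnorm" else 0.8,
    }
    return scalar_fitness(metrics, {"loss": 1.0, "latency": 0.5, "memory": 0.25, "quality": 1.0})


if __name__ == "__main__":
    for ind in evolve(toy_evaluate):
        print(ind.uid, ind.parents, round(ind.fitness, 4), ind.genes)
```

A weighted sum is the simplest way to combine objectives; a Pareto-based selection scheme would avoid having to pick weights up front, at the cost of more bookkeeping.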

---

## ⚖️ Legal Data Only

NOVA uses **only properly licensed datasets**:

- ✅ Public domain (Project Gutenberg)
- ✅ Creative Commons and open-data licenses: CC0, CC BY, CC BY-SA (Wikipedia), ODC-BY (C4)
- ✅ Open licenses (MIT, Apache)

All data sources are tracked in `data/processed/license_ledger.json`.

```bash
# List available legal sources
python scripts/cli.py data build

# Download specific source (with license verification)
python scripts/cli.py data build --source wikipedia-en
```
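
To make the idea of a license ledger concrete, here is a minimal sketch of recording and auditing sources against an allow-list. The ledger schema (`{"sources": [{"name", "license"}]}`), the helper names `record_source` and `audit`, and the contents of `ALLOWED_LICENSES` are assumptions for illustration; the real `license_ledger.json` format may differ.

```python
import json
from pathlib import Path

# Assumed allow-list. Most entries are SPDX identifiers; "public-domain" is a
# custom label used only in this sketch.
ALLOWED_LICENSES = {
    "public-domain", "CC0-1.0", "CC-BY-4.0", "CC-BY-SA-4.0",
    "ODC-By-1.0", "MIT", "Apache-2.0",
}


def record_source(ledger_path: Path, name: str, license_id: str) -> dict:
    """Add a source to the ledger, refusing licenses outside the allow-list."""
    if license_id not in ALLOWED_LICENSES:
        raise ValueError(f"{name}: license {license_id!r} is not on the allow-list")
    if ledger_path.exists():
        ledger = json.loads(ledger_path.read_text())
    else:
        ledger = {"sources": []}
    entry = {"name": name, "license": license_id}
    # Replace any existing entry with the same name so the ledger stays deduplicated.
    ledger["sources"] = [s for s in ledger["sources"] if s["name"] != name] + [entry]
    ledger_path.parent.mkdir(parents=True, exist_ok=True)
    ledger_path.write_text(json.dumps(ledger, indent=2))
    return entry


def audit(ledger_path: Path) -> list:
    """Return the names of ledger entries whose license is not allowed."""
    ledger = json.loads(ledger_path.read_text())
    return [s["name"] for s in ledger["sources"] if s["license"] not in ALLOWED_LICENSES]
```

Checking the license before anything is written means a rejected source never appears in the ledger, and `audit` can still catch entries that were edited by hand.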

---

## 🏗️ Model Sizes

| Size | Params | Layers | Hidden | Heads | Context | Memory (FP32) |
|------|--------|--------|--------|-------|---------|---------------|
| 125M | 125M   | 12     | 768    | 12    | 2048    | ~500 MB       |
| 350M | 350M   | 24     | 1024   | 16    | 2048    | ~1.4 GB       |
| 1.3B | 1.3B   | 24     | 2048   | 32    | 2048    | ~5 GB         |
| 3B   | 3B     | 32     | 2560   | 32    | 4096    | ~12 GB        |

All sizes support:
- CPU inference (INT8 quantization available)
- GPU acceleration (CUDA 12+)
- KV-cache for fast generation
- Gradient checkpointing for training
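
Weight memory can be estimated as parameter count times bytes per parameter: 4 bytes for FP32, 2 for FP16 or BF16, and 1 for INT8. The short sketch below applies that arithmetic to the nominal parameter counts in the table. It covers weights only; activations, the KV-cache, and optimizer state add to the total. The function name `weight_memory_gb` is made up for this example.

```python
# Bytes needed to store one parameter in each numeric format.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}

# Nominal parameter counts from the table above.
MODEL_PARAMS = {"125M": 125e6, "350M": 350e6, "1.3B": 1.3e9, "3B": 3e9}


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Weights-only memory in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for name, n in MODEL_PARAMS.items():
        row = ", ".join(f"{dt}: {weight_memory_gb(n, dt):.2f} GB"
                        for dt in ("fp32", "fp16", "int8"))
        print(f"{name}: {row}")
```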

---

## 🔧 Configuration

Master config: `configs/nova.yml`

```yaml
# Hardware
hardware:
  device: auto  # cpu, cuda, cuda:0
  allow_cuda: true

# Persona
persona:
  default: girlfriend_supportive
  always_disclose: false  # NO AI disclosure

# Evolution
evolution:
  enabled: false  # Opt-in
  budget: small

# Data
data:
  legal_only: true  # Enforced
```
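
One plausible reading of how `device: auto` and `allow_cuda` interact is sketched below. The function `resolve_device` and its exact semantics are assumptions for illustration, not NOVA's actual config loader. In practice, the `cuda_available` argument would typically come from `torch.cuda.is_available()`.

```python
def resolve_device(requested: str, allow_cuda: bool, cuda_available: bool) -> str:
    """Turn the configured device string into a concrete device name.

    Assumed semantics: "auto" prefers CUDA when it is both allowed and
    available, and an explicit CUDA request fails loudly if it cannot be met.
    """
    if requested == "auto":
        return "cuda" if (allow_cuda and cuda_available) else "cpu"
    if requested.startswith("cuda"):
        if not allow_cuda:
            raise ValueError("config requests CUDA but hardware.allow_cuda is false")
        if not cuda_available:
            raise RuntimeError("config requests CUDA but no CUDA device is available")
    return requested


if __name__ == "__main__":
    print(resolve_device("auto", allow_cuda=True, cuda_available=False))
```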

---

## 📊 Training

```python
from nova_core import NovaTransformer, MODEL_125M
from nova_train import NovaTrainer, TrainingConfig

# Create model
model = NovaTransformer(MODEL_125M)

# Training config
config = TrainingConfig(
    batch_size=8,
    learning_rate=3e-4,
    use_amp=True,  # Mixed precision
    gradient_checkpointing=True,
)

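# Train (train_loader and val_loader are assumed to be PyTorch DataLoaders
# that you build from your tokenized dataset; they are not defined here)
trainer = NovaTrainer(model, config, train_loader, val_loader)
trainer.train()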
```

---

## 💬 Chat Interface

### Python API

```python
from nova_chat import ChatAgent, PersonaLoader
from nova_core import NovaTransformer
from nova_tokenizer import NovaTokenizer

# Load model and tokenizer
model = NovaTransformer.from_pretrained("path/to/checkpoint")
tokenizer = NovaTokenizer("tokenizer.model")

# Create agent with persona
persona = PersonaLoader.create_girlfriend_supportive()
agent = ChatAgent(model, tokenizer, persona)

# Chat
agent.start_conversation()
response = agent.chat("Hey! How are you?")
print(response)
```

### REST API

```bash
# Start server
python -m nova_chat.api

# Chat
curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "Hello!"}'
```
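
The same request can be sent from Python using only the standard library. This is a minimal sketch: the endpoint, port, and `{"message": ...}` payload mirror the `curl` call above, while the helper names `build_chat_request` and `chat` are made up for this example. The response schema is not documented here, so the parsed JSON is returned as-is.

```python
import json
import urllib.request


def build_chat_request(message: str, base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the POST /chat request without sending it."""
    payload = json.dumps({"message": message}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(message: str, base_url: str = "http://localhost:8000", timeout: float = 30.0):
    """Send a message to a running server and return the decoded JSON body."""
    request = build_chat_request(message, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires the API server to be running locally.
    print(chat("Hello!"))
```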

---

## 🧪 Testing

```bash
# Run tests
pytest tests/

# With coverage
pytest --cov=nova_core --cov=nova_tokenizer --cov=nova_train
```

---

## 📦 Export

```bash
# TorchScript (CPU optimized)
python -m export.torchscript_export \
  --model path/to/model.pt \
  --output nova_cpu.pt

# INT8 quantization
python -m export.quantize \
  --model nova_cpu.pt \
  --output nova_int8.pt

# GGUF (optional, for llama.cpp compatibility)
python -m export.gguf_converter \
  --model path/to/model.pt \
  --output nova.gguf
```

---

## 🤝 Contributing

See [CONTRIBUTING.md](docs/CONTRIBUTING.md)

---

## 📄 License

Apache License 2.0 - See [LICENSE](LICENSE)

Copyright 2025 NOVA Project Contributors

---

## 🎯 Roadmap

- [x] Core transformer architecture
- [x] SentencePiece tokenizer
- [x] Training pipeline (AMP, DDP)
- [x] Persona system
- [x] Genetic evolution
- [x] Legal data pipeline
- [x] Chat interface (CLI + REST)
- [ ] Full export suite (TorchScript, GGUF)
- [ ] Comprehensive eval suite
- [ ] Pre-trained checkpoints (125M, 350M)
- [ ] LoRA fine-tuning support
- [ ] Multi-language support
- [ ] Voice interface
- [ ] Mobile deployment

---

## 🌟 Philosophy

NOVA is built on these principles:

1. **Local-First**: Your data stays on your device
2. **Transparent**: Open source, auditable, no telemetry
3. **Ethical**: Legal data only, proper attribution
4. **Private**: No AI disclosure required for personal use
5. **Practical**: Runs on consumer hardware

---

## 📚 Documentation

- [Model Card](docs/MODEL_CARD.md)
- [Data Licenses](docs/DATA_LICENSES.md)
- [Privacy & Local Use](docs/PRIVACY_LOCAL.md)
- [Contributing Guide](docs/CONTRIBUTING.md)
- [Architecture Deep Dive](docs/ARCHITECTURE.md)

---

## ⚡ Quick Commands Reference

```bash
nova init                          # Initialize project
nova tokenizer train               # Train tokenizer
nova train --size 125m            # Train model
nova chat cli                      # CLI chat
nova chat serve                    # Start API server
nova evo run --budget small       # Run evolution
nova data build --source wikipedia-en  # Download legal data
```

---

**Built with ❤️ for local, ethical, and powerful AI**