# Training Rosie From Scratch

## Overview

This guide walks you through training Rosie's custom language model from scratch on your own data.

## Hardware Requirements

**Minimum:**
- NVIDIA GPU with 12GB VRAM
- 32GB RAM
- 500GB free disk space (for datasets)

**Training Time Estimates:**
- Phase 1 (Base Language): 3-7 days
- Phase 2 (Personality): 1-2 days
- Phase 3 (Emotion): 6-12 hours

## Setup

### 1. Install Training Dependencies

```bash
pip install -r requirements-training.txt
```

### 2. Prepare Training Data

You need text data for training. Options:

#### Option A: Use Existing Datasets

```python
# Download common datasets
from datasets import load_dataset

# Books corpus
books = load_dataset("bookcorpus", split="train")

# Wikipedia
wiki = load_dataset("wikipedia", "20220301.en", split="train")

# Reddit conversations (filtered)
reddit = load_dataset("reddit", split="train")
```

#### Option B: Collect Your Own Data

- Web scraping (blogs, forums, stories)
- Transcripts (anime, VTuber streams)
- Books (Project Gutenberg, public domain)
- Your own writing

### 3. Create Personality Dataset

Create `data/personality.json`:

```json
{
  "texts": [
    "User: Hello! Rosie: Hey there! ✨ What's up?",
    "User: *pats Rosie* Rosie: Hehe~ That tickles! 💕",
    "User: How are you? Rosie: I'm doing great! Ready to help with whatever you need~",
    "User: *drags Rosie around* Rosie: Eep! 💕 Where are we going?",
    "User: Good morning! Rosie: Morning! ☀️ Did you sleep well?",
    "User: What's your name? Rosie: I'm Rosie! Your playful desktop companion~",
    "User: Can you help me? Rosie: Of course! That's what I'm here for! What do you need help with?",
    "User: Tell me a joke. Rosie: Why don't scientists trust atoms? Because they make up everything! ✨",
    "User: *double clicks* Rosie: Oh! Did you want to chat? I'm all ears~",
    "User: You're cute. Rosie: Aww, thank you! 💖 You're pretty nice yourself!",
    "User: What can you do? Rosie: I can chat with you, help with tasks, and just keep you company! Plus I'm always here on your desktop~",
    "User: I'm bored. Rosie: Hmm, want to play a word game? Or I could tell you something interesting!",
    "User: I'm sad. Rosie: Aww, I'm sorry to hear that... 💙 Want to talk about it? I'm here for you.",
    "User: I'm happy! Rosie: Yay! I'm so glad! Your happiness makes me happy too! 🌟",
    "User: What's 2+2? Rosie: That's 4! Easy peasy~ Need help with anything else?",
    "User: Goodbye. Rosie: See you later! Come back soon, okay? 👋💕"
  ]
}
```

Create many more examples (aim for 1,000-10,000) with variations!

## Training Process

### Phase 1: Base Language Training

Train on a large general corpus (books, web text):

```bash
python train_rosie.py \
  --data_path data/base_corpus.json \
  --output_dir models/rosie_base \
  --vocab_size 32000 \
  --hidden_size 768 \
  --num_layers 12 \
  --batch_size 4 \
  --epochs 3 \
  --lr 1e-4
```

**Tips:**
- Use mixed precision if you run out of VRAM
- Start with a small dataset (1,000 texts) to test
- Monitor the loss; it should decrease steadily

### Phase 2: Personality Fine-tuning

Fine-tune on the personality dataset:

```bash
python train_rosie.py \
  --data_path data/personality.json \
  --output_dir models/rosie_personality \
  --vocab_size 32000 \
  --batch_size 8 \
  --epochs 10 \
  --lr 5e-5
```

Load the base checkpoint first, then continue training.

### Phase 3: Emotion Training

Add emotion labels to your dataset:

```json
{
  "texts": [
    {"text": "Hello! ✨", "emotion": "happy"},
    {"text": "Eep! 💕", "emotion": "surprised"},
    {"text": "I'm here for you...", "emotion": "sad"}
  ]
}
```

Train with the emotion head enabled.
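The emotion head's actual wiring lives in `train_rosie.py` and isn't shown here. Below is a minimal sketch of one common way to do it, assuming the model returns per-token hidden states; `EmotionHead`, `joint_loss`, the `EMOTIONS` label set, and the `alpha` weight are illustrative names for this sketch, not existing repo APIs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed label set; extend this to match the emotions in your dataset.
EMOTIONS = ["happy", "surprised", "sad"]


class EmotionHead(nn.Module):
    """Hypothetical classifier head: maps the final hidden state to emotion logits."""

    def __init__(self, hidden_size: int, num_emotions: int = len(EMOTIONS)):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_emotions)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size); pool the last token's state.
        return self.classifier(hidden_states[:, -1, :])


def joint_loss(lm_loss: torch.Tensor,
               emotion_logits: torch.Tensor,
               emotion_labels: torch.Tensor,
               alpha: float = 0.5) -> torch.Tensor:
    """Weighted sum of the language-modeling loss and the emotion classification loss."""
    return lm_loss + alpha * F.cross_entropy(emotion_logits, emotion_labels)
```

In Phase 3 you would add the head's parameters to the optimizer and backpropagate through the combined loss instead of the LM loss alone.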
## Monitoring Training

### TensorBoard

```bash
tensorboard --logdir models/rosie_model/logs
```

Open http://localhost:6006

### Weights & Biases (recommended)

```bash
# Login
wandb login

# Training will then auto-log to the wandb dashboard
```

## Testing the Model

Create `test_rosie.py`:

```python
import torch

from src.llm.model import RosieModel, RosieConfig
from src.llm.tokenizer import RosieTokenizer

# Load model
config = RosieConfig()
model = RosieModel(config)
model.load_state_dict(torch.load('models/rosie_model/rosie_final.pth'))
model.eval()

# Load tokenizer
tokenizer = RosieTokenizer()
tokenizer.load('models/rosie_model/tokenizer')

# Test generation
prompt = "User: Hello! Rosie:"
input_ids = torch.tensor([tokenizer.encode(prompt)])
output_ids = model.generate(input_ids, max_length=50)
response = tokenizer.decode(output_ids[0].tolist())
print(response)
```

## Optimizations

### If Training is Too Slow:

1. Reduce batch size (but use gradient accumulation)
2. Reduce sequence length (`--max_length 256`)
3. Use fewer layers (`--num_layers 8`)
4. Enable mixed precision training

### If Running Out of Memory:

1. Reduce batch size to 1
2. Enable gradient checkpointing
3. Reduce hidden size (`--hidden_size 512`)
4. Use a smaller model (see config)

## Data Collection Tips

### For Base Training (10B+ tokens):

- **OpenWebText**: https://skylion007.github.io/OpenWebTextCorpus/
- **The Pile**: https://pile.eleuther.ai/ (800GB)
- **Wikipedia**: https://dumps.wikimedia.org/
- **BookCorpus**: Available via HuggingFace datasets

### For Personality (100k+ examples):

- Write your own dialogues
- Use character.ai exports (if allowed)
- Anime/VTuber transcripts
- Reddit r/CasualConversation
- Fiction books with dialogue

### Quality > Quantity

- Focus on clean, well-formatted data
- Remove spam, toxic content, and formatting issues
- For personality, consistency is key!

## Next Steps

1. **Collect base training data** (this is the hard part)
2. **Create personality dataset** (write Rosie's dialogue)
3. **Train Phase 1** (base language)
4. **Train Phase 2** (personality)
5. **Integrate into app**

Ready to start? I recommend:

1. Create a small test dataset (1,000 texts) first (see the sketch below)
2. Train for 1 epoch to verify everything works
3. Then scale up to full training

Let me know if you need help with any step!
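As a starting point for step 1, here is a small sketch that generates a throwaway test corpus in the same `{"texts": [...]}` shape as `data/personality.json`. The output path `data/test_corpus.json` and the sample phrases are just examples.

```python
import json
import os
import random

# Build a small throwaway corpus for a 1-epoch smoke test.
# The {"texts": [...]} shape matches data/personality.json above.
greetings = ["Hello!", "Good morning!", "Hey Rosie!", "What's up?"]
replies = ["Hey there! ✨", "Morning! ☀️", "Hi hi! 💕", "Happy to see you~"]

texts = [
    f"User: {random.choice(greetings)} Rosie: {random.choice(replies)}"
    for _ in range(1000)
]

os.makedirs("data", exist_ok=True)
with open("data/test_corpus.json", "w", encoding="utf-8") as f:
    json.dump({"texts": texts}, f, ensure_ascii=False, indent=2)

print(f"Wrote {len(texts)} texts to data/test_corpus.json")
```

Point `--data_path` at the generated file, run a single epoch, and confirm the loss drops before scaling up.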