Rosie/TRAINING_GUIDE.md
Dani c7ce0085fb feat: implement custom Rosie transformer model from scratch
Architecture:
- Custom GPT-style decoder-only transformer (~110M params at this config)
- 768 hidden size, 12 layers, 12 attention heads
- 32k vocabulary with BPE tokenizer
- Built-in emotion classification head
- 2048 token context window

Components:
- Multi-head self-attention mechanism
- Feed-forward networks with GELU activation
- Layer normalization and residual connections
- Custom tokenizer with special tokens for emotions/actions
- Generation with temperature, top-k, and nucleus sampling

Training Infrastructure:
- Full training script with data loading
- Gradient clipping and mixed precision support
- Checkpoint management
- Training guide with 3-phase approach:
  * Phase 1: Base language (10-50B tokens, 3-7 days)
  * Phase 2: Personality fine-tuning (100k-500k examples, 1-2 days)
  * Phase 3: Emotion training (50k-100k examples, 6-12 hours)

Integration:
- Inference engine for real-time generation
- Emotion detection from responses
- Conversation history management
- Ready for desktop app and Discord bot integration

No external model dependencies - 100% custom, with behavior shaped entirely by your training data


Training Rosie From Scratch

Overview

This guide will help you train Rosie's custom language model from scratch using your own data.

Hardware Requirements

Minimum:

  • NVIDIA GPU with 12GB VRAM (your setup)
  • 32GB RAM
  • 500GB free disk space (for datasets)

Training Time Estimates:

  • Phase 1 (Base Language): 3-7 days
  • Phase 2 (Personality): 1-2 days
  • Phase 3 (Emotion): 6-12 hours

Setup

1. Install Training Dependencies

pip install -r requirements-training.txt

2. Prepare Training Data

You need text data for training. Options:

Option A: Use Existing Datasets

# Download common datasets with the Hugging Face `datasets` library
from datasets import load_dataset

# Books corpus (a script-based dataset; recent `datasets` versions
# may require trust_remote_code=True)
books = load_dataset("bookcorpus", split="train")

# Wikipedia (pass streaming=True if you don't want the full dump on disk)
wiki = load_dataset("wikipedia", "20220301.en", split="train")

# Reddit conversations (filter before training)
reddit = load_dataset("reddit", split="train")
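
The training commands below read a JSON file such as data/base_corpus.json. Assuming it uses the same {"texts": [...]} layout as the personality dataset in step 3 (the guide doesn't show the exact schema train_rosie.py expects), a sketch for converting a downloaded dataset into that shape:

# Convert a Hugging Face dataset into the {"texts": [...]} JSON layout
# used elsewhere in this guide (assumed format; adjust if train_rosie.py
# expects something different).
import json
from datasets import load_dataset

# Slice off 1% for a quick test run before committing to the full dump
wiki = load_dataset("wikipedia", "20220301.en", split="train[:1%]")

# Keep only reasonably long articles
texts = [row["text"] for row in wiki if len(row["text"]) > 200]

with open("data/base_corpus.json", "w", encoding="utf-8") as f:
    json.dump({"texts": texts}, f, ensure_ascii=False)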

Option B: Collect Your Own Data

  • Web scraping (blogs, forums, stories)
  • Transcripts (anime, VTuber streams)
  • Books (Project Gutenberg, public domain)
  • Your own writing

3. Create Personality Dataset

Create data/personality.json:

{
  "texts": [
    "User: Hello! Rosie: Hey there! ✨ What's up?",
    "User: *pats Rosie* Rosie: Hehe~ That tickles! 💕",
    "User: How are you? Rosie: I'm doing great! Ready to help with whatever you need~",
    "User: *drags Rosie around* Rosie: Eep! 💕 Where are we going?",
    "User: Good morning! Rosie: Morning! ☀️ Did you sleep well?",
    "User: What's your name? Rosie: I'm Rosie! Your playful desktop companion~",
    "User: Can you help me? Rosie: Of course! That's what I'm here for! What do you need help with?",
    "User: Tell me a joke. Rosie: Why don't scientists trust atoms? Because they make up everything! ✨",
    "User: *double clicks* Rosie: Oh! Did you want to chat? I'm all ears~",
    "User: You're cute. Rosie: Aww, thank you! 💖 You're pretty nice yourself!",
    "User: What can you do? Rosie: I can chat with you, help with tasks, and just keep you company! Plus I'm always here on your desktop~",
    "User: I'm bored. Rosie: Hmm, want to play a word game? Or I could tell you something interesting!",
    "User: I'm sad. Rosie: Aww, I'm sorry to hear that... 💙 Want to talk about it? I'm here for you.",
    "User: I'm happy! Rosie: Yay! I'm so glad! Your happiness makes me happy too! 🌟",
    "User: What's 2+2? Rosie: That's 4! Easy peasy~ Need help with anything else?",
    "User: Goodbye. Rosie: See you later! Come back soon, okay? 👋💕"
  ]
}

Create MORE examples with variations! 1,000-10,000 hand-written examples are enough for a first training run; for the strongest personality, scale toward the 100k+ range mentioned under Data Collection Tips below.
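
Before training, it's worth a quick sanity check on the file. A small sketch, assuming the {"texts": [...]} layout above, that counts examples and drops exact duplicates:

# Validate data/personality.json and remove exact duplicates
import json

with open("data/personality.json", encoding="utf-8") as f:
    data = json.load(f)

texts = data["texts"]
unique = list(dict.fromkeys(texts))  # preserves order, drops duplicates

print(f"{len(texts)} examples, {len(unique)} unique")
assert all("Rosie:" in t for t in unique), "every example needs a Rosie reply"

with open("data/personality.json", "w", encoding="utf-8") as f:
    json.dump({"texts": unique}, f, ensure_ascii=False, indent=2)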

Training Process

Phase 1: Base Language Training

Train on large general corpus (books, web text):

python train_rosie.py \
  --data_path data/base_corpus.json \
  --output_dir models/rosie_base \
  --vocab_size 32000 \
  --hidden_size 768 \
  --num_layers 12 \
  --batch_size 4 \
  --epochs 3 \
  --lr 1e-4

Tips:

  • Use mixed precision if you run out of VRAM (see the sketch after this list)
  • Start with small dataset to test (1000 texts)
  • Monitor loss - should decrease steadily
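
If mixed precision isn't already wired into train_rosie.py, the standard torch.cuda.amp pattern is short. In this sketch, model, optimizer, loader, and compute_loss are placeholders for whatever the training script defines; only the autocast/GradScaler pattern is the point:

# Standard PyTorch automatic mixed precision (AMP) training step
import torch

scaler = torch.cuda.amp.GradScaler()

for batch in loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = compute_loss(model, batch)  # placeholder for the script's loss fn
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)             # so clipping sees unscaled gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)
    scaler.update()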

Phase 2: Personality Fine-tuning

Fine-tune on personality dataset:

python train_rosie.py \
  --data_path data/personality.json \
  --output_dir models/rosie_personality \
  --vocab_size 32000 \
  --batch_size 8 \
  --epochs 10 \
  --lr 5e-5

Load the base checkpoint first, then continue training.
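
In PyTorch terms, "load the base checkpoint first" means restoring the Phase 1 weights into the model before fine-tuning begins. A sketch, assuming the class names from the test script below and that Phase 1 wrote a rosie_final.pth into its output directory:

# Warm-start Phase 2 from the Phase 1 (base language) checkpoint
import torch
from src.llm.model import RosieModel, RosieConfig

config = RosieConfig()
model = RosieModel(config)

# Restore base weights, then continue training on the personality dataset
state = torch.load("models/rosie_base/rosie_final.pth", map_location="cpu")
model.load_state_dict(state)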

Phase 3: Emotion Training

Add emotion labels to your dataset:

{
  "texts": [
    {"text": "Hello! ✨", "emotion": "happy"},
    {"text": "Eep! 💕", "emotion": "surprised"},
    {"text": "I'm here for you...", "emotion": "sad"}
  ]
}

Train with emotion head enabled.
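
How the emotion head is trained isn't spelled out by the flags above; the standard recipe for this kind of setup is a joint objective: the usual next-token cross-entropy plus a classification loss on the emotion label. A minimal sketch, where emotion_logits, the label set, and the alpha weight are assumptions rather than anything train_rosie.py guarantees:

# Joint Phase 3 objective (assumed design: the model returns both
# next-token logits and emotion logits from its classification head)
import torch
import torch.nn.functional as F

EMOTIONS = ["happy", "sad", "surprised"]  # extend to the full label set
EMOTION_TO_ID = {name: i for i, name in enumerate(EMOTIONS)}

def phase3_loss(lm_logits, labels, emotion_logits, emotion_ids, alpha=0.5):
    # Next-token prediction: shift logits against the targets
    lm_loss = F.cross_entropy(
        lm_logits[:, :-1].reshape(-1, lm_logits.size(-1)),
        labels[:, 1:].reshape(-1),
    )
    # Emotion classification on the head's output
    emo_loss = F.cross_entropy(emotion_logits, emotion_ids)
    return lm_loss + alpha * emo_loss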

Monitoring Training

TensorBoard

tensorboard --logdir models/rosie_model/logs

Open http://localhost:6006

Weights & Biases

# Login
wandb login

# Will auto-log to the wandb dashboard

Testing the Model

Create test_rosie.py:

import torch
from src.llm.model import RosieModel, RosieConfig
from src.llm.tokenizer import RosieTokenizer

# Load model (map_location lets this run on CPU-only machines too)
config = RosieConfig()
model = RosieModel(config)
model.load_state_dict(torch.load('models/rosie_model/rosie_final.pth', map_location='cpu'))
model.eval()

# Load tokenizer
tokenizer = RosieTokenizer()
tokenizer.load('models/rosie_model/tokenizer')

# Test generation (no_grad skips gradient tracking during inference)
prompt = "User: Hello! Rosie:"
input_ids = torch.tensor([tokenizer.encode(prompt)])
with torch.no_grad():
    output_ids = model.generate(input_ids, max_length=50)
response = tokenizer.decode(output_ids[0].tolist())

print(response)

Optimizations

If Training is Too Slow:

  1. Reduce batch size, but use gradient accumulation (see the sketch after this list)
  2. Reduce sequence length (--max_length 256)
  3. Use fewer layers (--num_layers 8)
  4. Enable mixed precision training
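
Gradient accumulation keeps the effective batch size constant while cutting per-step memory: gradients are summed over several micro-batches before each optimizer step. Placeholder names as in the mixed-precision sketch above:

# Gradient accumulation: effective batch = micro_batch_size * accum_steps
import torch

accum_steps = 8

optimizer.zero_grad()
for step, batch in enumerate(loader):
    # Divide so the accumulated gradient averages over micro-batches
    loss = compute_loss(model, batch) / accum_steps
    loss.backward()
    if (step + 1) % accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        optimizer.zero_grad()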

If Running Out of Memory:

  1. Reduce batch size to 1
  2. Enable gradient checkpointing (see the sketch after this list)
  3. Reduce hidden size (--hidden_size 512)
  4. Use smaller model (see config)
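
Gradient checkpointing trades compute for memory by recomputing each block's activations during the backward pass instead of storing them. Whether RosieModel exposes a switch for this isn't shown; the generic torch.utils.checkpoint pattern inside a decoder's forward looks like this (self.layers as a list of decoder blocks is an assumption about the model's internals):

# Recompute each block's activations on backward instead of storing them
import torch.utils.checkpoint as cp

def forward(self, hidden_states):
    for layer in self.layers:  # assumed: the model's list of decoder blocks
        hidden_states = cp.checkpoint(layer, hidden_states, use_reentrant=False)
    return hidden_states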

Data Collection Tips

For Base Training (10B+ tokens):

  • The datasets from Option A above (BookCorpus, Wikipedia, Reddit)
  • Public-domain books (Project Gutenberg)
  • Clean web text from your own scraping

For Personality (100k+ examples):

  • Write your own dialogues
  • Use character.ai exports (if allowed)
  • Anime/VTuber transcripts
  • Reddit r/casualconversation
  • Fiction books with dialogue

Quality > Quantity

  • Focus on clean, well-formatted data
  • Remove spam, toxic content, formatting issues
  • For personality, consistency is key!

Next Steps

  1. Collect base training data (this is the hard part)
  2. Create personality dataset (write Rosie's dialogue)
  3. Train Phase 1 (base language)
  4. Train Phase 2 (personality)
  5. Integrate into app

Ready to start? I recommend:

  1. Create a small test dataset (1000 texts) first
  2. Train for 1 epoch to verify everything works
  3. Then scale up to full training

Let me know if you need help with any step!