Rosie Custom Model Design
Architecture Overview
- Model Type: Custom Transformer-based Language Model
- Size: Small (~500M-1B parameters)
- Framework: PyTorch
- Training: From scratch
- Personality: Playful Assistant/Friend
Model Specifications
Architecture
- Type: Decoder-only Transformer (GPT-style)
- Layers: 12-16 transformer blocks
- Hidden Size: 768-1024
- Attention Heads: 12-16
- Context Window: 2048 tokens
- Vocabulary Size: 32k tokens (BPE tokenizer)
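For concreteness, these hyperparameters could live in a single config object. A minimal sketch (field names and the emotion-head size are illustrative, not final):

```python
from dataclasses import dataclass

@dataclass
class RosieConfig:
    # Core transformer dimensions (lower end of the ranges above)
    n_layers: int = 12          # 12-16 transformer blocks
    hidden_size: int = 768      # 768-1024
    n_heads: int = 12           # 12-16 attention heads
    context_window: int = 2048  # max sequence length in tokens
    vocab_size: int = 32_000    # BPE vocabulary

    # Rosie-specific extras (sizes are assumptions)
    n_emotions: int = 8         # outputs of the emotion classification head
    dropout: float = 0.1
```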
Special Features
- Emotion Head: Separate classification head for emotion detection
- Memory Attention: Special attention mechanism for long-term memory
- Personality Embedding: Learned embeddings for consistent personality traits
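One straightforward way to realize the emotion head is a small classifier over the decoder's final hidden states. A minimal sketch, assuming the last token's representation is used as the pooled summary:

```python
import torch
import torch.nn as nn

class EmotionHead(nn.Module):
    """Classifies the emotion of a response from the decoder's hidden states."""

    def __init__(self, hidden_size: int, n_emotions: int):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size, hidden_size // 2),
            nn.GELU(),
            nn.Linear(hidden_size // 2, n_emotions),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size)
        # Pool on the last token, which has attended to the full context.
        pooled = hidden_states[:, -1, :]
        return self.classifier(pooled)  # (batch, n_emotions) logits
```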
Training Strategy
Phase 1: Base Language Understanding
Data Sources:
- Common Crawl (filtered for appropriate content)
- Books corpus
- Reddit conversations (filtered)
- Estimated tokens: 10-50B
Goal: Learn basic language, grammar, world knowledge
Phase 2: Personality Fine-tuning
Data Sources:
- Custom dialogue dataset (we'll create)
- Anime/VTuber transcripts (playful personality)
- Assistant conversations (helpful responses)
- Estimated examples: 100k-500k conversations
Goal: Develop Rosie's playful assistant personality
Phase 3: Emotion & Memory Training
Data Sources:
- Conversations labeled with emotions
- Multi-turn dialogues with context
- Estimated examples: 50k-100k
Goal: Emotion detection and contextual memory
Data Collection Plan
What We Need to Create
1. Personality Dataset (~10k examples)
   - Playful greetings
   - Helpful responses
   - Reactions to being touched/moved
   - Idle conversation starters
   - Emotional responses

2. Conversation Templates
   - User: "Hello!"
     Rosie: "Hey there! ✨ What's up?"
   - User: *drags Rosie*
     Rosie: "Eep! 💕 Where are we going?"
   - User: "How are you?"
     Rosie: "I'm doing great! Ready to help with whatever you need~"

3. Emotion Labels
   - Map responses to emotion states (happy, sad, surprised, etc.)
   - Train emotion classifier alongside text generation
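Those labels could feed a joint objective: the usual next-token loss plus a cross-entropy term from the emotion head. A hedged sketch (the label set and loss weight are placeholders):

```python
import torch.nn.functional as F

EMOTIONS = ["happy", "sad", "surprised", "excited", "annoyed", "neutral"]  # placeholder set
EMOTION_TO_ID = {name: i for i, name in enumerate(EMOTIONS)}

def joint_loss(lm_logits, target_ids, emotion_logits, emotion_ids, emotion_weight=0.5):
    """Combine the language-modeling loss with the emotion classification loss."""
    lm_loss = F.cross_entropy(
        lm_logits.reshape(-1, lm_logits.size(-1)),
        target_ids.reshape(-1),
        ignore_index=-100,  # padded positions are excluded from the LM loss
    )
    emo_loss = F.cross_entropy(emotion_logits, emotion_ids)
    return lm_loss + emotion_weight * emo_loss
```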
Training Hardware Requirements
Your Setup (12GB VRAM)
- ✅ Can train 500M model with batch size 4-8
- ✅ Use gradient accumulation for effective larger batches
- ✅ Mixed precision training (FP16)
- ⚠️ May need gradient checkpointing for 1B model
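A minimal sketch of mixed-precision training with gradient accumulation and gradient clipping on that setup, assuming `model`, `optimizer`, and `loader` already exist and `compute_loss` is a placeholder for the forward pass:

```python
import torch

scaler = torch.cuda.amp.GradScaler()  # mixed precision (FP16)
accum_steps = 8                       # effective batch = batch_size * accum_steps

for step, batch in enumerate(loader):
    with torch.cuda.amp.autocast():
        # compute_loss is a placeholder for the model's forward pass + loss.
        loss = compute_loss(model, batch) / accum_steps
    scaler.scale(loss).backward()

    if (step + 1) % accum_steps == 0:
        scaler.unscale_(optimizer)  # so clipping sees real gradient magnitudes
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
```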
Estimated Training Time
- Phase 1 (base): 3-7 days on single GPU
- Phase 2 (personality): 1-2 days
- Phase 3 (emotion): 6-12 hours
Model Files Structure
```
models/
├── rosie_model/
│   ├── config.json          # Model architecture config
│   ├── tokenizer/           # BPE tokenizer files
│   ├── weights/
│   │   ├── base.pth         # Base language model
│   │   ├── personality.pth  # Fine-tuned personality
│   │   └── final.pth        # Final trained model
│   └── checkpoints/         # Training checkpoints
```
Implementation Plan
Step 1: Create Model Architecture
- Custom transformer implementation
- Emotion classification head
- Memory attention mechanism
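A minimal pre-norm decoder block in the spirit of the spec (masked self-attention, GELU feed-forward, residual connections); names and details are illustrative:

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One GPT-style block: masked self-attention + GELU feed-forward, pre-norm residuals."""

    def __init__(self, hidden_size: int, n_heads: int, dropout: float = 0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(hidden_size)
        self.attn = nn.MultiheadAttention(hidden_size, n_heads, dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(hidden_size)
        self.ffn = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
            nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal mask: each position attends only to itself and earlier tokens.
        seq_len = x.size(1)
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1
        )
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask, need_weights=False)
        x = x + attn_out
        x = x + self.ffn(self.ln2(x))
        return x
```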
Step 2: Create Tokenizer
- Train BPE tokenizer on diverse text
- 32k vocab size
- Special tokens for emotions/actions
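Using the Hugging Face `tokenizers` library, training could look roughly like this (file paths and the emotion/action marker tokens are placeholders):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = ByteLevel()

trainer = BpeTrainer(
    vocab_size=32_000,
    special_tokens=[
        "<pad>", "<bos>", "<eos>", "<unk>",
        "<emotion>", "</emotion>",   # placeholder emotion markers
        "<action>", "</action>",     # placeholder action markers
    ],
)

# Corpus files are placeholders for the diverse text gathered in Phase 1.
tokenizer.train(files=["data/corpus_part1.txt", "data/corpus_part2.txt"], trainer=trainer)
tokenizer.save("models/rosie_model/tokenizer/tokenizer.json")
```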
Step 3: Data Pipeline
- Download/prepare base training data
- Create custom personality dataset
- Build efficient data loaders
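A minimal loader sketch for pre-tokenized conversations (the dummy data and padding scheme are placeholders; padded positions would normally be masked out of the loss):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ConversationDataset(Dataset):
    """Pre-tokenized conversations padded to a fixed length for next-token training."""

    def __init__(self, token_ids, context_window=2048, pad_id=0):
        self.examples = [ids[:context_window] for ids in token_ids]
        self.context_window = context_window
        self.pad_id = pad_id

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        ids = self.examples[idx]
        padded = ids + [self.pad_id] * (self.context_window - len(ids))
        x = torch.tensor(padded, dtype=torch.long)
        # Inputs and labels are shifted by one token; padded positions would
        # normally be replaced with an ignore index before computing the loss.
        return {"input_ids": x[:-1], "labels": x[1:]}

# Tiny dummy data standing in for the output of the Step 2 tokenizer.
tokenized_conversations = [[1, 5, 9, 2], [1, 7, 3, 8, 2]]
loader = DataLoader(ConversationDataset(tokenized_conversations),
                    batch_size=4, shuffle=True, num_workers=2)
```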
Step 4: Training Loop
- Implement training script
- Add logging (wandb/tensorboard)
- Checkpoint management
- Evaluation metrics
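A rough skeleton of that loop, assuming `train_step` (the mixed-precision step sketched earlier) and `evaluate` are hypothetical helpers:

```python
import torch
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("runs/rosie")  # or wandb.init(...)
best_eval_loss = float("inf")

for epoch in range(num_epochs):
    for step, batch in enumerate(loader):
        loss = train_step(model, batch, optimizer)  # hypothetical helper
        if step % 100 == 0:
            writer.add_scalar("train/loss", loss, epoch * len(loader) + step)

    eval_loss = evaluate(model, val_loader)  # hypothetical evaluation helper
    writer.add_scalar("eval/loss", eval_loss, epoch)

    # Keep a per-epoch checkpoint plus the best weights so far.
    torch.save({"epoch": epoch,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict()},
               f"models/rosie_model/checkpoints/epoch_{epoch}.pth")
    if eval_loss < best_eval_loss:
        best_eval_loss = eval_loss
        torch.save(model.state_dict(), "models/rosie_model/weights/base.pth")
```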
Step 5: Integration
- Load model in app
- Inference optimization (quantization, caching)
- Real-time response generation
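For real-time generation, a single-step sampler with temperature, top-k, and nucleus (top-p) filtering could look like this; parameter defaults are illustrative:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.9):
    """Pick the next token id with temperature, top-k, and nucleus filtering."""
    logits = logits / temperature

    # Top-k: keep only the k most likely tokens.
    if top_k > 0:
        kth_value = torch.topk(logits, top_k).values[..., -1, None]
        logits = logits.masked_fill(logits < kth_value, float("-inf"))

    # Nucleus: keep the smallest set of tokens whose cumulative probability exceeds top_p.
    probs = F.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    cutoff = cumulative - sorted_probs > top_p  # drop tokens past the nucleus
    sorted_probs[cutoff] = 0.0
    sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)

    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx.gather(-1, choice)
```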
Alternative: Bootstrap Approach
If training from scratch takes too long, we can:
- Start with a small pre-trained model (Phi-2, TinyLlama)
- Fine-tune heavily on personality data
- Add emotion head on top
- Much faster (hours instead of days)
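A quick sketch of loading such a base model with Hugging Face `transformers` (the model IDs are examples; the emotion head from the custom design would be attached on top afterwards):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # or "microsoft/phi-2"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Sanity-check generation before fine-tuning on the personality dataset.
prompt = "User: Hello!\nRosie:"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=60, do_sample=True,
                        temperature=0.8, top_p=0.9)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```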
Recommendation: Start with bootstrap approach, transition to full custom model later if needed.
Next Steps
- Choose approach (from-scratch vs bootstrap)
- Set up training environment
- Create initial personality dataset
- Implement model architecture
- Begin training
What do you think? Should we go full custom from scratch, or bootstrap from a small existing model?