# Rosie Custom Model Design

## Architecture Overview

**Model Type:** Custom Transformer-based Language Model

**Size:** Small (~500M-1B parameters)

**Framework:** PyTorch

**Training:** From scratch

**Personality:** Playful Assistant/Friend

## Model Specifications

### Architecture

- **Type:** Decoder-only Transformer (GPT-style)
- **Layers:** 12-16 transformer blocks
- **Hidden Size:** 768-1024
- **Attention Heads:** 12-16
- **Context Window:** 2048 tokens
- **Vocabulary Size:** 32k tokens (BPE tokenizer)

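For reference, these hyperparameters could be collected in one small config object. A minimal sketch; the class name, field names, and emotion-class count are assumptions, not existing code:

```python
from dataclasses import dataclass

@dataclass
class RosieConfig:
    """Hypothetical hyperparameter container mirroring the specs above."""
    vocab_size: int = 32_000      # BPE vocabulary
    n_layers: int = 12            # 12-16 transformer blocks
    hidden_size: int = 768        # 768-1024
    n_heads: int = 12             # 12-16 attention heads
    context_window: int = 2048    # max sequence length
    num_emotions: int = 8         # emotion-head classes (assumed count)
    dropout: float = 0.1
```
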
### Special Features

1. **Emotion Head:** Separate classification head for emotion detection
2. **Memory Attention:** Special attention mechanism for long-term memory
3. **Personality Embedding:** Learned embeddings for consistent personality traits

## Training Strategy

### Phase 1: Base Language Understanding

**Data Sources:**

- Common Crawl (filtered for appropriate content)
- Books corpus
- Reddit conversations (filtered)
- Estimated tokens: 10-50B

**Goal:** Learn basic language, grammar, and world knowledge

### Phase 2: Personality Fine-tuning

**Data Sources:**

- Custom dialogue dataset (we'll create this)
- Anime/VTuber transcripts (playful personality)
- Assistant conversations (helpful responses)
- Estimated examples: 100k-500k conversations

**Goal:** Develop Rosie's playful assistant personality

### Phase 3: Emotion & Memory Training

**Data Sources:**

- Conversations labeled with emotions
- Multi-turn dialogues with context
- Estimated examples: 50k-100k

**Goal:** Emotion detection and contextual memory

## Data Collection Plan

### What We Need to Create

1. **Personality Dataset (~10k examples)**
   - Playful greetings
   - Helpful responses
   - Reactions to being touched/moved
   - Idle conversation starters
   - Emotional responses

2. **Conversation Templates**
   - User: "Hello!"
   - Rosie: "Hey there! ✨ What's up?"

   - User: *drags Rosie*
   - Rosie: "Eep! 💕 Where are we going?"

   - User: "How are you?"
   - Rosie: "I'm doing great! Ready to help with whatever you need~"

3. **Emotion Labels**
   - Map responses to emotion states (happy, sad, surprised, etc.)
   - Train the emotion classifier alongside text generation (see the example format after this list)

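One way these examples could be stored is a JSONL file with one labeled exchange per line. The field names and file path below are assumptions, not an existing format:

```python
import json

# Hypothetical training record: one exchange plus an emotion label.
example = {
    "user": "Hello!",
    "rosie": "Hey there! ✨ What's up?",
    "emotion": "happy",   # label consumed by the emotion head
    "action": None,       # e.g. "dragged" for *drags Rosie* events
}

# Append to the personality dataset (path is illustrative).
with open("data/personality.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example, ensure_ascii=False) + "\n")
```
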
## Training Hardware Requirements

### Your Setup (12GB VRAM)

- ✅ Can train a 500M model with batch size 4-8
- ✅ Use gradient accumulation for effective larger batches
- ✅ Mixed precision training (FP16)
- ⚠️ May need gradient checkpointing for a 1B model

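A minimal sketch of how mixed precision and gradient accumulation can be combined to stay within 12GB, assuming `model`, `optimizer`, and `train_loader` already exist and that the model call returns a scalar loss:

```python
import torch

scaler = torch.cuda.amp.GradScaler()
accum_steps = 8  # effective batch = per-step batch * accum_steps

for step, (input_ids, labels) in enumerate(train_loader):
    with torch.cuda.amp.autocast():                    # FP16 forward pass
        loss = model(input_ids.cuda(), labels.cuda())  # assumed to return a scalar loss
        loss = loss / accum_steps                      # rescale for accumulation
    scaler.scale(loss).backward()

    if (step + 1) % accum_steps == 0:
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
```
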
### Estimated Training Time

- Phase 1 (base): 3-7 days on a single GPU
- Phase 2 (personality): 1-2 days
- Phase 3 (emotion): 6-12 hours

## Model Files Structure

```
models/
├── rosie_model/
│   ├── config.json          # Model architecture config
│   ├── tokenizer/           # BPE tokenizer files
│   ├── weights/
│   │   ├── base.pth         # Base language model
│   │   ├── personality.pth  # Fine-tuned personality
│   │   └── final.pth        # Final trained model
│   └── checkpoints/         # Training checkpoints
```
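
The `config.json` in this tree could simply be the serialized hyperparameter object. A minimal sketch, assuming the `RosieConfig` dataclass from the Architecture section:

```python
import json
from dataclasses import asdict

config = RosieConfig()  # the dataclass sketched earlier
with open("models/rosie_model/config.json", "w") as f:
    json.dump(asdict(config), f, indent=2)

# Later, reload it to rebuild the model with the identical architecture.
with open("models/rosie_model/config.json") as f:
    config = RosieConfig(**json.load(f))
```
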
## Implementation Plan

### Step 1: Create Model Architecture

- Custom transformer implementation
- Emotion classification head
- Memory attention mechanism

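A rough skeleton of how this step could look in PyTorch, reusing `nn.TransformerEncoderLayer` with a causal mask as a stand-in for hand-rolled blocks and the `RosieConfig` sketched earlier. The memory-attention mechanism is omitted, and all names are assumptions:

```python
import torch
import torch.nn as nn

class RosieModel(nn.Module):
    """Decoder-only transformer with an extra emotion classification head (sketch)."""

    def __init__(self, cfg: RosieConfig):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg.vocab_size, cfg.hidden_size)
        self.pos_emb = nn.Embedding(cfg.context_window, cfg.hidden_size)
        block = nn.TransformerEncoderLayer(
            d_model=cfg.hidden_size, nhead=cfg.n_heads,
            dim_feedforward=4 * cfg.hidden_size, activation="gelu",
            dropout=cfg.dropout, batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=cfg.n_layers)
        self.lm_head = nn.Linear(cfg.hidden_size, cfg.vocab_size, bias=False)
        self.emotion_head = nn.Linear(cfg.hidden_size, cfg.num_emotions)

    def forward(self, input_ids):
        seq_len = input_ids.size(1)
        pos = torch.arange(seq_len, device=input_ids.device)
        x = self.tok_emb(input_ids) + self.pos_emb(pos)
        causal = nn.Transformer.generate_square_subsequent_mask(seq_len).to(input_ids.device)
        h = self.blocks(x, mask=causal)                # causal self-attention
        logits = self.lm_head(h)                       # next-token prediction
        emotion_logits = self.emotion_head(h[:, -1])   # classify from last token state
        return logits, emotion_logits
```
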
### Step 2: Create Tokenizer

- Train BPE tokenizer on diverse text
- 32k vocab size
- Special tokens for emotions/actions

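One way to train such a tokenizer is with the Hugging Face `tokenizers` library; the corpus path and the emotion/action token names below are assumptions:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=32_000,
    special_tokens=[
        "<pad>", "<unk>", "<bos>", "<eos>",
        # Hypothetical emotion/action markers for Rosie.
        "<happy>", "<sad>", "<surprised>", "<action>",
    ],
)
tokenizer.train(files=["data/base_corpus.txt"], trainer=trainer)
tokenizer.save("models/rosie_model/tokenizer/tokenizer.json")
```
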
### Step 3: Data Pipeline

- Download/prepare base training data
- Create custom personality dataset
- Build efficient data loaders

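A minimal sketch of the loader side, assuming the corpus has already been tokenized into one long id tensor; all names are placeholders:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class PackedTextDataset(Dataset):
    """Serves fixed-length windows from a pre-tokenized id stream."""

    def __init__(self, token_ids: torch.Tensor, seq_len: int = 2048):
        self.ids = token_ids
        self.seq_len = seq_len

    def __len__(self):
        return (len(self.ids) - 1) // self.seq_len

    def __getitem__(self, i):
        chunk = self.ids[i * self.seq_len : (i + 1) * self.seq_len + 1]
        return chunk[:-1], chunk[1:]   # inputs, next-token targets

# Usage sketch: `ids` would come from the trained tokenizer.
# loader = DataLoader(PackedTextDataset(ids), batch_size=8, shuffle=True, num_workers=2)
```
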
### Step 4: Training Loop

- Implement training script
- Add logging (wandb/tensorboard)
- Checkpoint management
- Evaluation metrics

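The training step could combine the language-modeling loss with the emotion-classification loss. A sketch assuming the two-headed model from Step 1; the loss weight, batch layout, and checkpoint path are assumptions:

```python
import torch
import torch.nn.functional as F

def training_step(model, batch, optimizer, emotion_weight=0.5):
    input_ids, labels, emotion_ids = batch           # emotion_ids: one label per sequence
    logits, emotion_logits = model(input_ids)
    lm_loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
    emo_loss = F.cross_entropy(emotion_logits, emotion_ids)
    loss = lm_loss + emotion_weight * emo_loss       # joint objective
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return loss.item()

def save_checkpoint(model, optimizer, step, path="models/rosie_model/checkpoints/latest.pth"):
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "step": step}, path)
```
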
### Step 5: Integration

- Load the model in the app
- Inference optimization (quantization, caching)
- Real-time response generation

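Real-time generation can be a plain sampling loop over the model's logits. This sketch uses temperature and top-k sampling and assumes the two-output model and the `tokenizers`-based tokenizer from earlier; all parameter values are illustrative:

```python
import torch

@torch.no_grad()
def generate(model, tokenizer, prompt, max_new_tokens=64, temperature=0.8, top_k=50):
    ids = torch.tensor([tokenizer.encode(prompt).ids])        # (1, T)
    for _ in range(max_new_tokens):
        logits, _ = model(ids[:, -2048:])                     # stay inside the context window
        logits = logits[:, -1, :] / temperature               # last position, temperature-scaled
        topk_vals, topk_idx = torch.topk(logits, top_k)
        probs = torch.softmax(topk_vals, dim=-1)
        next_id = topk_idx.gather(-1, torch.multinomial(probs, 1))
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == tokenizer.token_to_id("<eos>"):
            break
    return tokenizer.decode(ids[0].tolist())
```
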
## Alternative: Bootstrap Approach

If training from scratch takes too long, we can:

1. Start with a small pre-trained model (Phi-2, TinyLlama); see the sketch after this list
2. Fine-tune heavily on personality data
3. Add the emotion head on top
4. Get results much faster (hours instead of days)

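A sketch of what the bootstrap route could look like with Hugging Face `transformers`; the TinyLlama checkpoint name is one public example, and the emotion-head wiring is an assumption mirroring the earlier design:

```python
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

base_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"   # example public checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_name)
model = AutoModelForCausalLM.from_pretrained(base_name)

# Hypothetical add-on: an emotion classifier over the base model's hidden states.
emotion_head = nn.Linear(model.config.hidden_size, 8)

out = model(**tokenizer("Hello!", return_tensors="pt"), output_hidden_states=True)
emotion_logits = emotion_head(out.hidden_states[-1][:, -1])  # classify from last token
```
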
**Recommendation:** Start with the bootstrap approach and transition to a full custom model later if needed.

## Next Steps

1. Choose approach (from-scratch vs bootstrap)
2. Set up training environment
3. Create initial personality dataset
4. Implement model architecture
5. Begin training

What do you think? Should we go full custom from scratch, or bootstrap from a small existing model?