diff --git a/MODEL_DESIGN.md b/MODEL_DESIGN.md new file mode 100644 index 0000000..d7b1a6e --- /dev/null +++ b/MODEL_DESIGN.md @@ -0,0 +1,152 @@ +# Rosie Custom Model Design + +## Architecture Overview + +**Model Type:** Custom Transformer-based Language Model +**Size:** Small (~500M-1B parameters) +**Framework:** PyTorch +**Training:** From scratch +**Personality:** Playful Assistant/Friend + +## Model Specifications + +### Architecture +- **Type:** Decoder-only Transformer (GPT-style) +- **Layers:** 12-16 transformer blocks +- **Hidden Size:** 768-1024 +- **Attention Heads:** 12-16 +- **Context Window:** 2048 tokens +- **Vocabulary Size:** 32k tokens (BPE tokenizer) + +### Special Features +1. **Emotion Head:** Separate classification head for emotion detection +2. **Memory Attention:** Special attention mechanism for long-term memory +3. **Personality Embedding:** Learned embeddings for consistent personality traits + +## Training Strategy + +### Phase 1: Base Language Understanding +**Data Sources:** +- Common Crawl (filtered for appropriate content) +- Books corpus +- Reddit conversations (filtered) +- Estimated tokens: 10-50B + +**Goal:** Learn basic language, grammar, world knowledge + +### Phase 2: Personality Fine-tuning +**Data Sources:** +- Custom dialogue dataset (we'll create) +- Anime/VTuber transcripts (playful personality) +- Assistant conversations (helpful responses) +- Estimated examples: 100k-500k conversations + +**Goal:** Develop Rosie's playful assistant personality + +### Phase 3: Emotion & Memory Training +**Data Sources:** +- Conversations labeled with emotions +- Multi-turn dialogues with context +- Estimated examples: 50k-100k + +**Goal:** Emotion detection and contextual memory + +## Data Collection Plan + +### What We Need to Create + +1. **Personality Dataset (~10k examples)** + - Playful greetings + - Helpful responses + - Reactions to being touched/moved + - Idle conversation starters + - Emotional responses + +2. **Conversation Templates** + - User: "Hello!" + - Rosie: "Hey there! ✨ What's up?" + + - User: *drags Rosie* + - Rosie: "Eep! 💕 Where are we going?" + + - User: "How are you?" + - Rosie: "I'm doing great! Ready to help with whatever you need~" + +3. **Emotion Labels** + - Map responses to emotion states (happy, sad, surprised, etc.) 
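+   - One possible record shape, mirroring the Phase 3 example in TRAINING_GUIDE.md (field names are a sketch, not final):
+
+     ```json
+     {"text": "Eep! 💕", "emotion": "surprised"}
+     ```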
+ - Train emotion classifier alongside text generation + +## Training Hardware Requirements + +### Your Setup (12GB VRAM) +- ✅ Can train 500M model with batch size 4-8 +- ✅ Use gradient accumulation for effective larger batches +- ✅ Mixed precision training (FP16) +- ⚠️ May need gradient checkpointing for 1B model + +### Estimated Training Time +- Phase 1 (base): 3-7 days on single GPU +- Phase 2 (personality): 1-2 days +- Phase 3 (emotion): 6-12 hours + +## Model Files Structure + +``` +models/ +├── rosie_model/ +│ ├── config.json # Model architecture config +│ ├── tokenizer/ # BPE tokenizer files +│ ├── weights/ +│ │ ├── base.pth # Base language model +│ │ ├── personality.pth # Fine-tuned personality +│ │ └── final.pth # Final trained model +│ └── checkpoints/ # Training checkpoints +``` + +## Implementation Plan + +### Step 1: Create Model Architecture +- Custom transformer implementation +- Emotion classification head +- Memory attention mechanism + +### Step 2: Create Tokenizer +- Train BPE tokenizer on diverse text +- 32k vocab size +- Special tokens for emotions/actions + +### Step 3: Data Pipeline +- Download/prepare base training data +- Create custom personality dataset +- Build efficient data loaders + +### Step 4: Training Loop +- Implement training script +- Add logging (wandb/tensorboard) +- Checkpoint management +- Evaluation metrics + +### Step 5: Integration +- Load model in app +- Inference optimization (quantization, caching) +- Real-time response generation + +## Alternative: Bootstrap Approach + +If training from scratch takes too long, we can: +1. Start with a small pre-trained model (Phi-2, TinyLlama) +2. Fine-tune heavily on personality data +3. Add emotion head on top +4. Much faster (hours instead of days) + +**Recommendation:** Start with bootstrap approach, transition to full custom model later if needed. + +## Next Steps + +1. Choose approach (from-scratch vs bootstrap) +2. Set up training environment +3. Create initial personality dataset +4. Implement model architecture +5. Begin training + +What do you think? Should we go full custom from scratch, or bootstrap from a small existing model? diff --git a/TRAINING_GUIDE.md b/TRAINING_GUIDE.md new file mode 100644 index 0000000..88c7b48 --- /dev/null +++ b/TRAINING_GUIDE.md @@ -0,0 +1,230 @@ +# Training Rosie From Scratch + +## Overview + +This guide will help you train Rosie's custom language model from scratch using your own data. + +## Hardware Requirements + +**Minimum:** +- NVIDIA GPU with 12GB VRAM (your setup) +- 32GB RAM +- 500GB free disk space (for datasets) + +**Training Time Estimates:** +- Phase 1 (Base Language): 3-7 days +- Phase 2 (Personality): 1-2 days +- Phase 3 (Emotion): 6-12 hours + +## Setup + +### 1. Install Training Dependencies + +```bash +pip install -r requirements-training.txt +``` + +### 2. Prepare Training Data + +You need text data for training. Options: + +#### Option A: Use Existing Datasets +```python +# Download common datasets +from datasets import load_dataset + +# Books corpus +books = load_dataset("bookcorpus", split="train") + +# Wikipedia +wiki = load_dataset("wikipedia", "20220301.en", split="train") + +# Reddit conversations (filtered) +reddit = load_dataset("reddit", split="train") +``` + +#### Option B: Collect Your Own Data +- Web scraping (blogs, forums, stories) +- Transcripts (anime, VTuber streams) +- Books (Project Gutenberg, public domain) +- Your own writing + +### 3. 
Create Personality Dataset + +Create `data/personality.json`: + +```json +{ + "texts": [ + "User: Hello! Rosie: Hey there! ✨ What's up?", + "User: *pats Rosie* Rosie: Hehe~ That tickles! 💕", + "User: How are you? Rosie: I'm doing great! Ready to help with whatever you need~", + "User: *drags Rosie around* Rosie: Eep! 💕 Where are we going?", + "User: Good morning! Rosie: Morning! ☀️ Did you sleep well?", + "User: What's your name? Rosie: I'm Rosie! Your playful desktop companion~", + "User: Can you help me? Rosie: Of course! That's what I'm here for! What do you need help with?", + "User: Tell me a joke. Rosie: Why don't scientists trust atoms? Because they make up everything! ✨", + "User: *double clicks* Rosie: Oh! Did you want to chat? I'm all ears~", + "User: You're cute. Rosie: Aww, thank you! 💖 You're pretty nice yourself!", + "User: What can you do? Rosie: I can chat with you, help with tasks, and just keep you company! Plus I'm always here on your desktop~", + "User: I'm bored. Rosie: Hmm, want to play a word game? Or I could tell you something interesting!", + "User: I'm sad. Rosie: Aww, I'm sorry to hear that... 💙 Want to talk about it? I'm here for you.", + "User: I'm happy! Rosie: Yay! I'm so glad! Your happiness makes me happy too! 🌟", + "User: What's 2+2? Rosie: That's 4! Easy peasy~ Need help with anything else?", + "User: Goodbye. Rosie: See you later! Come back soon, okay? 👋💕" + ] +} +``` + +Create MORE examples (aim for 1000-10000) with variations! + +## Training Process + +### Phase 1: Base Language Training + +Train on large general corpus (books, web text): + +```bash +python train_rosie.py \ + --data_path data/base_corpus.json \ + --output_dir models/rosie_base \ + --vocab_size 32000 \ + --hidden_size 768 \ + --num_layers 12 \ + --batch_size 4 \ + --epochs 3 \ + --lr 1e-4 +``` + +**Tips:** +- Use mixed precision if you run out of VRAM +- Start with small dataset to test (1000 texts) +- Monitor loss - should decrease steadily + +### Phase 2: Personality Fine-tuning + +Fine-tune on personality dataset: + +```bash +python train_rosie.py \ + --data_path data/personality.json \ + --output_dir models/rosie_personality \ + --vocab_size 32000 \ + --batch_size 8 \ + --epochs 10 \ + --lr 5e-5 +``` + +Load the base checkpoint first, then continue training. + +### Phase 3: Emotion Training + +Add emotion labels to your dataset: + +```json +{ + "texts": [ + {"text": "Hello! ✨", "emotion": "happy"}, + {"text": "Eep! 💕", "emotion": "surprised"}, + {"text": "I'm here for you...", "emotion": "sad"} + ] +} +``` + +Train with emotion head enabled. + +## Monitoring Training + +### TensorBoard + +```bash +tensorboard --logdir models/rosie_model/logs +``` + +Open http://localhost:6006 + +### Weights & Biases (recommended) + +```bash +# Login +wandb login + +# Will auto-log to wandb dashboard +``` + +## Testing the Model + +Create `test_rosie.py`: + +```python +import torch +from src.llm.model import RosieModel, RosieConfig +from src.llm.tokenizer import RosieTokenizer + +# Load model +config = RosieConfig() +model = RosieModel(config) +model.load_state_dict(torch.load('models/rosie_model/rosie_final.pth')) +model.eval() + +# Load tokenizer +tokenizer = RosieTokenizer() +tokenizer.load('models/rosie_model/tokenizer') + +# Test generation +prompt = "User: Hello! 
Rosie:" +input_ids = torch.tensor([tokenizer.encode(prompt)]) +output_ids = model.generate(input_ids, max_length=50) +response = tokenizer.decode(output_ids[0].tolist()) + +print(response) +``` + +## Optimizations + +### If Training is Too Slow: +1. Reduce batch size (but use gradient accumulation) +2. Reduce sequence length (--max_length 256) +3. Use fewer layers (--num_layers 8) +4. Enable mixed precision training + +### If Running Out of Memory: +1. Reduce batch size to 1 +2. Enable gradient checkpointing +3. Reduce hidden size (--hidden_size 512) +4. Use smaller model (see config) + +## Data Collection Tips + +### For Base Training (10B+ tokens): +- **OpenWebText**: https://skylion007.github.io/OpenWebTextCorpus/ +- **The Pile**: https://pile.eleuther.ai/ (800GB) +- **Wikipedia**: https://dumps.wikimedia.org/ +- **BookCorpus**: Available via HuggingFace datasets + +### For Personality (100k+ examples): +- Write your own dialogues +- Use character.ai exports (if allowed) +- Anime/VTuber transcripts +- Reddit r/casualconversation +- Fiction books with dialogue + +### Quality > Quantity +- Focus on clean, well-formatted data +- Remove spam, toxic content, formatting issues +- For personality, consistency is key! + +## Next Steps + +1. **Collect base training data** (this is the hard part) +2. **Create personality dataset** (write Rosie's dialogue) +3. **Train Phase 1** (base language) +4. **Train Phase 2** (personality) +5. **Integrate into app** + +Ready to start? I recommend: +1. Create a small test dataset (1000 texts) first +2. Train for 1 epoch to verify everything works +3. Then scale up to full training + +Let me know if you need help with any step! diff --git a/requirements-training.txt b/requirements-training.txt new file mode 100644 index 0000000..9fb02b6 --- /dev/null +++ b/requirements-training.txt @@ -0,0 +1,27 @@ +# Additional requirements for model training +# Install with: pip install -r requirements-training.txt + +# Deep Learning +torch>=2.0.0 +torchvision>=0.15.0 +torchaudio>=2.0.0 + +# Training utilities +wandb>=0.15.0 # Experiment tracking +tensorboard>=2.13.0 # Tensorboard logging +tqdm>=4.65.0 # Progress bars + +# Data processing +datasets>=2.13.0 # HuggingFace datasets +transformers>=4.30.0 # For comparison/reference only +sentencepiece>=0.1.99 # Alternative tokenizer +tokenizers>=0.13.3 # Fast tokenizers + +# Optimization +apex # NVIDIA apex for mixed precision (optional, requires CUDA) +accelerate>=0.20.0 # Multi-GPU training + +# Data collection +requests>=2.31.0 +beautifulsoup4>=4.12.0 +lxml>=4.9.0 diff --git a/src/llm/inference.py b/src/llm/inference.py new file mode 100644 index 0000000..6551dde --- /dev/null +++ b/src/llm/inference.py @@ -0,0 +1,224 @@ +""" +Rosie Inference Engine +Handles text generation and emotion detection for the desktop waifu +""" +import torch +import os +from typing import Optional, Tuple, List +from src.llm.model import RosieModel, RosieConfig +from src.llm.tokenizer import RosieTokenizer +from src.core.state_manager import EmotionState + + +class RosieInference: + """Inference engine for Rosie model""" + + def __init__(self, model_path: str, device: str = 'cuda'): + """ + Initialize inference engine + + Args: + model_path: Path to model directory (containing model files and tokenizer) + device: Device to run on ('cuda' or 'cpu') + """ + self.device = torch.device(device if torch.cuda.is_available() else 'cpu') + print(f"Loading Rosie model from {model_path}...") + print(f"Using device: {self.device}") + + # Load tokenizer + 
tokenizer_path = os.path.join(model_path, 'tokenizer') + self.tokenizer = RosieTokenizer() + self.tokenizer.load(tokenizer_path) + + # Load model config + config_path = os.path.join(model_path, 'config.json') + if os.path.exists(config_path): + import json + with open(config_path, 'r') as f: + config_dict = json.load(f) + self.config = RosieConfig(**config_dict) + else: + # Default config + self.config = RosieConfig(vocab_size=len(self.tokenizer.vocab)) + + # Create and load model + self.model = RosieModel(self.config) + + model_file = os.path.join(model_path, 'rosie_final.pth') + if not os.path.exists(model_file): + # Try checkpoint + checkpoints = [f for f in os.listdir(model_path) if f.startswith('checkpoint_epoch_')] + if checkpoints: + checkpoints.sort() + model_file = os.path.join(model_path, checkpoints[-1]) + print(f"Using checkpoint: {model_file}") + else: + raise FileNotFoundError(f"No model file found in {model_path}") + + state_dict = torch.load(model_file, map_location=self.device) + + # Handle checkpoint format + if 'model_state_dict' in state_dict: + state_dict = state_dict['model_state_dict'] + + self.model.load_state_dict(state_dict) + self.model.to(self.device) + self.model.eval() + + print("Rosie model loaded successfully!") + + # Emotion mapping + self.emotion_map = { + 0: EmotionState.NEUTRAL, + 1: EmotionState.HAPPY, + 2: EmotionState.SAD, + 3: EmotionState.SURPRISED, + 4: EmotionState.THINKING, + 5: EmotionState.EXCITED, + 6: EmotionState.ANNOYED, + } + + def generate_response( + self, + prompt: str, + max_length: int = 100, + temperature: float = 0.8, + top_k: int = 50, + top_p: float = 0.9, + detect_emotion: bool = True, + ) -> Tuple[str, Optional[EmotionState]]: + """ + Generate a response from Rosie + + Args: + prompt: Input text prompt + max_length: Maximum tokens to generate + temperature: Sampling temperature (higher = more creative) + top_k: Top-k sampling + top_p: Nucleus sampling threshold + detect_emotion: Whether to detect emotion from response + + Returns: + (response_text, detected_emotion) + """ + # Encode prompt + input_ids = self.tokenizer.encode(prompt, add_special_tokens=True) + input_tensor = torch.tensor([input_ids]).to(self.device) + + # Generate + with torch.no_grad(): + output_ids = self.model.generate( + input_tensor, + max_length=max_length, + temperature=temperature, + top_k=top_k, + top_p=top_p, + ) + + # Decode response + full_text = self.tokenizer.decode(output_ids[0].tolist(), skip_special_tokens=True) + + # Extract just the response (after prompt) + response = full_text[len(prompt):].strip() + + # Detect emotion if requested + emotion = None + if detect_emotion: + emotion = self.detect_emotion(response) + + return response, emotion + + def detect_emotion(self, text: str) -> EmotionState: + """ + Detect emotion from text using emotion head + + Args: + text: Input text + + Returns: + Detected emotion state + """ + # Encode text + input_ids = self.tokenizer.encode(text, add_special_tokens=True) + input_tensor = torch.tensor([input_ids]).to(self.device) + + # Forward pass with emotion detection + with torch.no_grad(): + _, emotion_logits = self.model(input_tensor, return_emotion=True) + + # Get predicted emotion + emotion_idx = torch.argmax(emotion_logits, dim=-1).item() + return self.emotion_map.get(emotion_idx, EmotionState.NEUTRAL) + + def chat( + self, + message: str, + conversation_history: Optional[List[str]] = None, + ) -> Tuple[str, EmotionState]: + """ + Chat with Rosie (handles conversation context) + + Args: + message: User 
message + conversation_history: Previous conversation turns + + Returns: + (response, emotion) + """ + # Build prompt with history + if conversation_history: + # Include last few turns for context + context = "\n".join(conversation_history[-5:]) + prompt = f"{context}\nUser: {message}\nRosie:" + else: + prompt = f"User: {message}\nRosie:" + + # Generate response + response, emotion = self.generate_response( + prompt, + max_length=80, + temperature=0.8, + ) + + # Clean up response (remove extra dialogue markers) + response = response.split("\n")[0] # Take first line + response = response.split("User:")[0] # Stop at next user input + response = response.strip() + + return response, emotion + + +# Global inference engine instance +_rosie_engine: Optional[RosieInference] = None + + +def get_rosie_engine(model_path: Optional[str] = None) -> Optional[RosieInference]: + """Get or create global Rosie inference engine""" + global _rosie_engine + + if _rosie_engine is None and model_path: + try: + _rosie_engine = RosieInference(model_path) + except Exception as e: + print(f"Failed to load Rosie model: {e}") + return None + + return _rosie_engine + + +def chat_with_rosie(message: str, history: Optional[List[str]] = None) -> Tuple[str, EmotionState]: + """ + Convenience function to chat with Rosie + + Args: + message: User message + history: Conversation history + + Returns: + (response, emotion) + """ + engine = get_rosie_engine() + if engine is None: + return "Sorry, I'm not available right now... (Model not loaded)", EmotionState.NEUTRAL + + return engine.chat(message, history) diff --git a/src/llm/model.py b/src/llm/model.py new file mode 100644 index 0000000..d5653b4 --- /dev/null +++ b/src/llm/model.py @@ -0,0 +1,325 @@ +""" +Rosie Custom Transformer Model +Built from scratch for Desktop Waifu +""" +import torch +import torch.nn as nn +import torch.nn.functional as F +import math +from typing import Optional, Tuple + +class RosieConfig: + """Configuration for Rosie model""" + def __init__( + self, + vocab_size: int = 32000, + hidden_size: int = 768, + num_layers: int = 12, + num_heads: int = 12, + intermediate_size: int = 3072, + max_position_embeddings: int = 2048, + dropout: float = 0.1, + num_emotions: int = 7, # neutral, happy, sad, surprised, thinking, excited, annoyed + ): + self.vocab_size = vocab_size + self.hidden_size = hidden_size + self.num_layers = num_layers + self.num_heads = num_heads + self.intermediate_size = intermediate_size + self.max_position_embeddings = max_position_embeddings + self.dropout = dropout + self.num_emotions = num_emotions + + +class MultiHeadAttention(nn.Module): + """Multi-head self-attention mechanism""" + + def __init__(self, config: RosieConfig): + super().__init__() + self.num_heads = config.num_heads + self.hidden_size = config.hidden_size + self.head_dim = config.hidden_size // config.num_heads + + assert self.head_dim * config.num_heads == config.hidden_size, \ + "hidden_size must be divisible by num_heads" + + # Query, Key, Value projections + self.q_proj = nn.Linear(config.hidden_size, config.hidden_size) + self.k_proj = nn.Linear(config.hidden_size, config.hidden_size) + self.v_proj = nn.Linear(config.hidden_size, config.hidden_size) + + # Output projection + self.out_proj = nn.Linear(config.hidden_size, config.hidden_size) + + self.dropout = nn.Dropout(config.dropout) + + def forward( + self, + hidden_states: torch.Tensor, + attention_mask: Optional[torch.Tensor] = None, + ) -> torch.Tensor: + batch_size, seq_length, _ = hidden_states.size() + + 
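+        # Shapes: hidden_states is [batch_size, seq_length, hidden_size].
+        # Each projection below keeps that shape; the reshape/transpose that
+        # follows yields [batch_size, num_heads, seq_length, head_dim].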
# Project to Q, K, V + q = self.q_proj(hidden_states) + k = self.k_proj(hidden_states) + v = self.v_proj(hidden_states) + + # Reshape for multi-head attention + q = q.view(batch_size, seq_length, self.num_heads, self.head_dim).transpose(1, 2) + k = k.view(batch_size, seq_length, self.num_heads, self.head_dim).transpose(1, 2) + v = v.view(batch_size, seq_length, self.num_heads, self.head_dim).transpose(1, 2) + + # Scaled dot-product attention + scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.head_dim) + + # Apply attention mask (for causal/autoregressive generation) + if attention_mask is not None: + scores = scores + attention_mask + + attn_weights = F.softmax(scores, dim=-1) + attn_weights = self.dropout(attn_weights) + + # Apply attention to values + attn_output = torch.matmul(attn_weights, v) + + # Reshape back + attn_output = attn_output.transpose(1, 2).contiguous() + attn_output = attn_output.view(batch_size, seq_length, self.hidden_size) + + # Output projection + output = self.out_proj(attn_output) + + return output + + +class FeedForward(nn.Module): + """Position-wise feed-forward network""" + + def __init__(self, config: RosieConfig): + super().__init__() + self.fc1 = nn.Linear(config.hidden_size, config.intermediate_size) + self.fc2 = nn.Linear(config.intermediate_size, config.hidden_size) + self.dropout = nn.Dropout(config.dropout) + + def forward(self, x: torch.Tensor) -> torch.Tensor: + x = self.fc1(x) + x = F.gelu(x) # GELU activation + x = self.dropout(x) + x = self.fc2(x) + return x + + +class TransformerBlock(nn.Module): + """Single transformer decoder block""" + + def __init__(self, config: RosieConfig): + super().__init__() + self.attention = MultiHeadAttention(config) + self.feed_forward = FeedForward(config) + self.ln1 = nn.LayerNorm(config.hidden_size) + self.ln2 = nn.LayerNorm(config.hidden_size) + self.dropout = nn.Dropout(config.dropout) + + def forward( + self, + hidden_states: torch.Tensor, + attention_mask: Optional[torch.Tensor] = None, + ) -> torch.Tensor: + # Self-attention with residual connection + residual = hidden_states + hidden_states = self.ln1(hidden_states) + hidden_states = self.attention(hidden_states, attention_mask) + hidden_states = self.dropout(hidden_states) + hidden_states = residual + hidden_states + + # Feed-forward with residual connection + residual = hidden_states + hidden_states = self.ln2(hidden_states) + hidden_states = self.feed_forward(hidden_states) + hidden_states = self.dropout(hidden_states) + hidden_states = residual + hidden_states + + return hidden_states + + +class RosieModel(nn.Module): + """ + Rosie - Custom Transformer Language Model + Built from scratch for Desktop Waifu companion + """ + + def __init__(self, config: RosieConfig): + super().__init__() + self.config = config + + # Token embeddings + self.token_embeddings = nn.Embedding(config.vocab_size, config.hidden_size) + + # Positional embeddings (learned) + self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size) + + # Transformer blocks + self.blocks = nn.ModuleList([ + TransformerBlock(config) for _ in range(config.num_layers) + ]) + + # Final layer norm + self.ln_f = nn.LayerNorm(config.hidden_size) + + # Language modeling head (predict next token) + self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False) + + # Emotion classification head + self.emotion_head = nn.Sequential( + nn.Linear(config.hidden_size, config.hidden_size // 2), + nn.ReLU(), + nn.Dropout(config.dropout), + 
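+            # final Linear maps the hidden_size // 2 bottleneck to one logit
+            # per emotion class (num_emotions = 7 in RosieConfig)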
nn.Linear(config.hidden_size // 2, config.num_emotions) + ) + + # Initialize weights + self.apply(self._init_weights) + + def _init_weights(self, module): + """Initialize weights (Xavier/He initialization)""" + if isinstance(module, nn.Linear): + torch.nn.init.normal_(module.weight, mean=0.0, std=0.02) + if module.bias is not None: + torch.nn.init.zeros_(module.bias) + elif isinstance(module, nn.Embedding): + torch.nn.init.normal_(module.weight, mean=0.0, std=0.02) + elif isinstance(module, nn.LayerNorm): + torch.nn.init.ones_(module.weight) + torch.nn.init.zeros_(module.bias) + + def forward( + self, + input_ids: torch.Tensor, + attention_mask: Optional[torch.Tensor] = None, + return_emotion: bool = False, + ) -> Tuple[torch.Tensor, Optional[torch.Tensor]]: + """ + Forward pass + + Args: + input_ids: Token IDs [batch_size, seq_length] + attention_mask: Attention mask [batch_size, seq_length] + return_emotion: Whether to return emotion predictions + + Returns: + logits: Next token predictions [batch_size, seq_length, vocab_size] + emotion_logits: Emotion predictions [batch_size, num_emotions] (if return_emotion=True) + """ + batch_size, seq_length = input_ids.size() + + # Create causal attention mask (lower triangular) + if attention_mask is None: + causal_mask = torch.triu( + torch.ones(seq_length, seq_length, device=input_ids.device) * float('-inf'), + diagonal=1 + ) + attention_mask = causal_mask + + # Get embeddings + token_embeds = self.token_embeddings(input_ids) + position_ids = torch.arange(seq_length, device=input_ids.device).unsqueeze(0) + position_embeds = self.position_embeddings(position_ids) + + # Combine embeddings + hidden_states = token_embeds + position_embeds + + # Pass through transformer blocks + for block in self.blocks: + hidden_states = block(hidden_states, attention_mask) + + # Final layer norm + hidden_states = self.ln_f(hidden_states) + + # Language modeling head + logits = self.lm_head(hidden_states) + + # Emotion classification (using last token's representation) + emotion_logits = None + if return_emotion: + last_hidden = hidden_states[:, -1, :] # Take last token + emotion_logits = self.emotion_head(last_hidden) + + return logits, emotion_logits + + def generate( + self, + input_ids: torch.Tensor, + max_length: int = 100, + temperature: float = 1.0, + top_k: int = 50, + top_p: float = 0.9, + ) -> torch.Tensor: + """ + Generate text autoregressively + + Args: + input_ids: Starting token IDs [batch_size, seq_length] + max_length: Maximum tokens to generate + temperature: Sampling temperature (higher = more random) + top_k: Keep only top k tokens for sampling + top_p: Nucleus sampling threshold + + Returns: + generated_ids: Generated token IDs [batch_size, seq_length + generated] + """ + self.eval() + generated = input_ids + + with torch.no_grad(): + for _ in range(max_length): + # Forward pass + logits, _ = self.forward(generated) + + # Get logits for next token (last position) + next_token_logits = logits[:, -1, :] / temperature + + # Apply top-k filtering + if top_k > 0: + indices_to_remove = next_token_logits < torch.topk(next_token_logits, top_k)[0][..., -1, None] + next_token_logits[indices_to_remove] = float('-inf') + + # Apply top-p (nucleus) filtering + if top_p < 1.0: + sorted_logits, sorted_indices = torch.sort(next_token_logits, descending=True) + cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1) + + # Remove tokens with cumulative probability above the threshold + sorted_indices_to_remove = cumulative_probs > top_p + 
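+                        # Shift the mask right so the first token crossing the
+                        # top_p threshold is still kept; this guarantees at least
+                        # one candidate survives even when a single token's
+                        # probability already exceeds top_p.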
sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone() + sorted_indices_to_remove[..., 0] = 0 + + indices_to_remove = sorted_indices_to_remove.scatter(1, sorted_indices, sorted_indices_to_remove) + next_token_logits[indices_to_remove] = float('-inf') + + # Sample next token + probs = F.softmax(next_token_logits, dim=-1) + next_token = torch.multinomial(probs, num_samples=1) + + # Append to generated sequence + generated = torch.cat([generated, next_token], dim=1) + + # Stop if we exceed max context length + if generated.size(1) >= self.config.max_position_embeddings: + break + + return generated + + +def create_rosie_model(config: Optional[RosieConfig] = None) -> RosieModel: + """Create a Rosie model with default or custom config""" + if config is None: + config = RosieConfig() + + model = RosieModel(config) + + # Print model size + num_params = sum(p.numel() for p in model.parameters()) + print(f"Rosie model created: {num_params:,} parameters ({num_params/1e6:.1f}M)") + + return model diff --git a/src/llm/tokenizer.py b/src/llm/tokenizer.py new file mode 100644 index 0000000..52ed1d0 --- /dev/null +++ b/src/llm/tokenizer.py @@ -0,0 +1,262 @@ +""" +Rosie BPE Tokenizer +Custom tokenizer for Desktop Waifu +""" +import json +import os +from typing import List, Dict, Optional +from collections import Counter +import re + +class RosieTokenizer: + """ + Byte-Pair Encoding (BPE) tokenizer for Rosie + """ + + def __init__(self, vocab_size: int = 32000): + self.vocab_size = vocab_size + self.vocab: Dict[str, int] = {} + self.inv_vocab: Dict[int, str] = {} + self.merges: List[tuple] = [] + + # Special tokens + self.pad_token = "<|pad|>" + self.unk_token = "<|unk|>" + self.bos_token = "<|startoftext|>" + self.eos_token = "<|endoftext|>" + + # Emotion tokens (for explicit emotion control) + self.emotion_tokens = [ + "<|neutral|>", + "<|happy|>", + "<|sad|>", + "<|surprised|>", + "<|thinking|>", + "<|excited|>", + "<|annoyed|>", + ] + + # Action tokens (for describing interactions) + self.action_tokens = [ + "<|grabbed|>", + "<|released|>", + "<|patted|>", + "<|dragged|>", + ] + + self.special_tokens = ( + [self.pad_token, self.unk_token, self.bos_token, self.eos_token] + + self.emotion_tokens + + self.action_tokens + ) + + # Token IDs + self.pad_token_id = 0 + self.unk_token_id = 1 + self.bos_token_id = 2 + self.eos_token_id = 3 + + def train(self, texts: List[str], save_path: Optional[str] = None): + """ + Train BPE tokenizer on corpus + + Args: + texts: List of text strings to train on + save_path: Path to save tokenizer files + """ + print(f"Training tokenizer on {len(texts)} texts...") + + # Initialize vocabulary with special tokens + self.vocab = {token: idx for idx, token in enumerate(self.special_tokens)} + next_id = len(self.special_tokens) + + # Add individual characters (base vocabulary) + char_counts = Counter() + for text in texts: + char_counts.update(text) + + # Add most common characters to vocab + for char, _ in char_counts.most_common(): + if next_id >= self.vocab_size: + break + if char not in self.vocab: + self.vocab[char] = next_id + next_id += 1 + + # Byte-pair encoding: merge most frequent pairs + print("Learning BPE merges...") + word_freqs = self._get_word_freqs(texts) + + while len(self.vocab) < self.vocab_size: + # Find most frequent pair + pairs = self._get_stats(word_freqs) + if not pairs: + break + + best_pair = max(pairs, key=pairs.get) + + # Merge the pair + word_freqs = self._merge_pair(best_pair, word_freqs) + self.merges.append(best_pair) + + # 
Add merged token to vocab + merged_token = ''.join(best_pair) + if merged_token not in self.vocab: + self.vocab[merged_token] = next_id + next_id += 1 + + if len(self.vocab) % 1000 == 0: + print(f" Vocabulary size: {len(self.vocab)}") + + # Create inverse vocabulary + self.inv_vocab = {v: k for k, v in self.vocab.items()} + + print(f"Tokenizer trained: {len(self.vocab)} tokens, {len(self.merges)} merges") + + if save_path: + self.save(save_path) + + def _get_word_freqs(self, texts: List[str]) -> Dict[tuple, int]: + """Get word frequencies with characters as tuples""" + word_freqs = Counter() + for text in texts: + words = text.split() + for word in words: + word_freqs[tuple(word)] += 1 + return dict(word_freqs) + + def _get_stats(self, word_freqs: Dict[tuple, int]) -> Dict[tuple, int]: + """Get pair frequencies from word frequencies""" + pairs = Counter() + for word, freq in word_freqs.items(): + for i in range(len(word) - 1): + pairs[(word[i], word[i + 1])] += freq + return pairs + + def _merge_pair(self, pair: tuple, word_freqs: Dict[tuple, int]) -> Dict[tuple, int]: + """Merge a pair in all words""" + new_word_freqs = {} + bigram = ''.join(pair) + + for word, freq in word_freqs.items(): + new_word = [] + i = 0 + while i < len(word): + if i < len(word) - 1 and word[i] == pair[0] and word[i + 1] == pair[1]: + new_word.append(bigram) + i += 2 + else: + new_word.append(word[i]) + i += 1 + new_word_freqs[tuple(new_word)] = freq + + return new_word_freqs + + def encode(self, text: str, add_special_tokens: bool = True) -> List[int]: + """ + Encode text to token IDs + + Args: + text: Input text + add_special_tokens: Whether to add BOS/EOS tokens + + Returns: + List of token IDs + """ + if not self.vocab: + raise ValueError("Tokenizer not trained. Call train() first.") + + tokens = [] + + if add_special_tokens: + tokens.append(self.bos_token_id) + + # Apply BPE merges + words = text.split() + for word in words: + word_tokens = list(word) + + # Apply merges + for merge in self.merges: + i = 0 + while i < len(word_tokens) - 1: + if word_tokens[i] == merge[0] and word_tokens[i + 1] == merge[1]: + word_tokens = word_tokens[:i] + [''.join(merge)] + word_tokens[i + 2:] + else: + i += 1 + + # Convert to IDs + for token in word_tokens: + tokens.append(self.vocab.get(token, self.unk_token_id)) + + # Add space token (if exists) + if ' ' in self.vocab: + tokens.append(self.vocab[' ']) + + if add_special_tokens: + tokens.append(self.eos_token_id) + + return tokens + + def decode(self, token_ids: List[int], skip_special_tokens: bool = True) -> str: + """ + Decode token IDs to text + + Args: + token_ids: List of token IDs + skip_special_tokens: Whether to skip special tokens in output + + Returns: + Decoded text string + """ + if not self.inv_vocab: + raise ValueError("Tokenizer not trained. 
Call train() first.") + + tokens = [] + for token_id in token_ids: + token = self.inv_vocab.get(token_id, self.unk_token) + + if skip_special_tokens and token in self.special_tokens: + continue + + tokens.append(token) + + return ''.join(tokens) + + def save(self, save_dir: str): + """Save tokenizer to directory""" + os.makedirs(save_dir, exist_ok=True) + + # Save vocabulary + with open(os.path.join(save_dir, 'vocab.json'), 'w') as f: + json.dump(self.vocab, f) + + # Save merges + with open(os.path.join(save_dir, 'merges.txt'), 'w') as f: + for merge in self.merges: + f.write(f"{merge[0]} {merge[1]}\n") + + print(f"Tokenizer saved to {save_dir}") + + def load(self, save_dir: str): + """Load tokenizer from directory""" + # Load vocabulary + with open(os.path.join(save_dir, 'vocab.json'), 'r') as f: + self.vocab = json.load(f) + + self.inv_vocab = {v: k for k, v in self.vocab.items()} + + # Load merges + self.merges = [] + with open(os.path.join(save_dir, 'merges.txt'), 'r') as f: + for line in f: + parts = line.strip().split() + if len(parts) == 2: + self.merges.append((parts[0], parts[1])) + + print(f"Tokenizer loaded from {save_dir}") + + +def create_tokenizer(vocab_size: int = 32000) -> RosieTokenizer: + """Create a new Rosie tokenizer""" + return RosieTokenizer(vocab_size=vocab_size) diff --git a/train_rosie.py b/train_rosie.py new file mode 100644 index 0000000..c556e13 --- /dev/null +++ b/train_rosie.py @@ -0,0 +1,188 @@ +""" +Rosie Training Script +Train the custom transformer model from scratch +""" +import os +import torch +import torch.nn as nn +import torch.optim as optim +from torch.utils.data import Dataset, DataLoader +from typing import List, Dict +import json +from tqdm import tqdm +import argparse + +from src.llm.model import RosieModel, RosieConfig, create_rosie_model +from src.llm.tokenizer import RosieTokenizer, create_tokenizer + + +class TextDataset(Dataset): + """Dataset for language modeling""" + + def __init__(self, texts: List[str], tokenizer: RosieTokenizer, max_length: int = 512): + self.tokenizer = tokenizer + self.max_length = max_length + self.examples = [] + + print(f"Tokenizing {len(texts)} texts...") + for text in tqdm(texts): + token_ids = tokenizer.encode(text, add_special_tokens=True) + + # Split into chunks of max_length + for i in range(0, len(token_ids), max_length): + chunk = token_ids[i:i + max_length] + if len(chunk) > 1: # Need at least 2 tokens (input + target) + self.examples.append(chunk) + + print(f"Created {len(self.examples)} training examples") + + def __len__(self): + return len(self.examples) + + def __getitem__(self, idx): + tokens = self.examples[idx] + + # Pad to max_length + if len(tokens) < self.max_length: + tokens = tokens + [self.tokenizer.pad_token_id] * (self.max_length - len(tokens)) + + # Input and target (shifted by 1) + input_ids = torch.tensor(tokens[:-1]) + target_ids = torch.tensor(tokens[1:]) + + return input_ids, target_ids + + +def train_epoch( + model: RosieModel, + dataloader: DataLoader, + optimizer: optim.Optimizer, + device: torch.device, + epoch: int, +): + """Train for one epoch""" + model.train() + total_loss = 0 + criterion = nn.CrossEntropyLoss(ignore_index=0) # Ignore padding + + progress_bar = tqdm(dataloader, desc=f"Epoch {epoch}") + + for batch_idx, (input_ids, target_ids) in enumerate(progress_bar): + input_ids = input_ids.to(device) + target_ids = target_ids.to(device) + + # Forward pass + optimizer.zero_grad() + logits, _ = model(input_ids) + + # Calculate loss + loss = criterion(logits.view(-1, 
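+            # view() flattens logits to [batch*seq, vocab_size] against
+            # [batch*seq] targets, so cross-entropy is computed per token
+            # (padding positions skipped via ignore_index=0)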
model.config.vocab_size), target_ids.view(-1)) + + # Backward pass + loss.backward() + torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) # Gradient clipping + optimizer.step() + + total_loss += loss.item() + + # Update progress bar + progress_bar.set_postfix({'loss': loss.item()}) + + avg_loss = total_loss / len(dataloader) + return avg_loss + + +def main(): + parser = argparse.ArgumentParser(description="Train Rosie model") + parser.add_argument('--data_path', type=str, required=True, help="Path to training data (JSON file)") + parser.add_argument('--output_dir', type=str, default='./models/rosie_model', help="Output directory") + parser.add_argument('--vocab_size', type=int, default=32000, help="Vocabulary size") + parser.add_argument('--hidden_size', type=int, default=768, help="Hidden size") + parser.add_argument('--num_layers', type=int, default=12, help="Number of layers") + parser.add_argument('--num_heads', type=int, default=12, help="Number of attention heads") + parser.add_argument('--max_length', type=int, default=512, help="Maximum sequence length") + parser.add_argument('--batch_size', type=int, default=4, help="Batch size") + parser.add_argument('--epochs', type=int, default=10, help="Number of epochs") + parser.add_argument('--lr', type=float, default=1e-4, help="Learning rate") + parser.add_argument('--device', type=str, default='cuda', help="Device (cuda/cpu)") + args = parser.parse_args() + + # Create output directory + os.makedirs(args.output_dir, exist_ok=True) + + # Load training data + print(f"Loading training data from {args.data_path}...") + with open(args.data_path, 'r', encoding='utf-8') as f: + data = json.load(f) + + if isinstance(data, list): + texts = data + elif isinstance(data, dict) and 'texts' in data: + texts = data['texts'] + else: + raise ValueError("Data must be a list of texts or dict with 'texts' key") + + print(f"Loaded {len(texts)} texts") + + # Create/load tokenizer + tokenizer_path = os.path.join(args.output_dir, 'tokenizer') + if os.path.exists(tokenizer_path): + print(f"Loading existing tokenizer from {tokenizer_path}") + tokenizer = create_tokenizer(args.vocab_size) + tokenizer.load(tokenizer_path) + else: + print("Training new tokenizer...") + tokenizer = create_tokenizer(args.vocab_size) + tokenizer.train(texts, save_path=tokenizer_path) + + # Create dataset + dataset = TextDataset(texts, tokenizer, max_length=args.max_length) + dataloader = DataLoader(dataset, batch_size=args.batch_size, shuffle=True, num_workers=0) + + # Create model + config = RosieConfig( + vocab_size=len(tokenizer.vocab), + hidden_size=args.hidden_size, + num_layers=args.num_layers, + num_heads=args.num_heads, + max_position_embeddings=args.max_length, + ) + model = create_rosie_model(config) + + # Move to device + device = torch.device(args.device if torch.cuda.is_available() else 'cpu') + print(f"Using device: {device}") + model = model.to(device) + + # Optimizer + optimizer = optim.AdamW(model.parameters(), lr=args.lr, weight_decay=0.01) + + # Training loop + print(f"\nStarting training for {args.epochs} epochs...") + print(f"Batch size: {args.batch_size}") + print(f"Total batches per epoch: {len(dataloader)}") + print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}\n") + + for epoch in range(1, args.epochs + 1): + avg_loss = train_epoch(model, dataloader, optimizer, device, epoch) + print(f"Epoch {epoch}/{args.epochs} - Average Loss: {avg_loss:.4f}") + + # Save checkpoint every epoch + checkpoint_path = os.path.join(args.output_dir, 
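+                                       # NOTE: epoch numbers are not zero-padded, so a
+                                       # plain lexicographic sort puts epoch_10 before
+                                       # epoch_2; sort checkpoints numerically when
+                                       # resuming or picking the latest one.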
f'checkpoint_epoch_{epoch}.pth') + torch.save({ + 'epoch': epoch, + 'model_state_dict': model.state_dict(), + 'optimizer_state_dict': optimizer.state_dict(), + 'loss': avg_loss, + 'config': config.__dict__, + }, checkpoint_path) + print(f"Checkpoint saved to {checkpoint_path}\n") + + # Save final model + final_path = os.path.join(args.output_dir, 'rosie_final.pth') + torch.save(model.state_dict(), final_path) + print(f"\nTraining complete! Model saved to {final_path}") + + +if __name__ == "__main__": + main()
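+
+
+# Example invocations (paths are illustrative; flags match the argparse options above):
+#   python train_rosie.py --data_path data/base_corpus.json --output_dir models/rosie_base \
+#       --hidden_size 768 --num_layers 12 --batch_size 4 --epochs 3 --lr 1e-4
+#   python train_rosie.py --data_path data/personality.json --output_dir models/rosie_personality \
+#       --batch_size 8 --epochs 10 --lr 5e-5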