# Training Rosie From Scratch

## Overview

This guide walks you through training Rosie's custom language model from scratch on your own data.

## Hardware Requirements

**Minimum:**
- NVIDIA GPU with 12GB VRAM
- 32GB RAM
- 500GB free disk space (for datasets)

**Training Time Estimates:**
- Phase 1 (Base Language): 3-7 days
- Phase 2 (Personality): 1-2 days
- Phase 3 (Emotion): 6-12 hours

## Setup

### 1. Install Training Dependencies

```bash
pip install -r requirements-training.txt
```

### 2. Prepare Training Data

You need text data for training. Options:

#### Option A: Use Existing Datasets

```python
# Download common datasets
from datasets import load_dataset

# Books corpus
books = load_dataset("bookcorpus", split="train")

# Wikipedia
wiki = load_dataset("wikipedia", "20220301.en", split="train")

# Reddit conversations (filtered)
reddit = load_dataset("reddit", split="train")
```

#### Option B: Collect Your Own Data

- Web scraping (blogs, forums, stories)
- Transcripts (anime, VTuber streams)
- Books (Project Gutenberg, public domain)
- Your own writing

### 3. Create Personality Dataset

Create `data/personality.json`:

```json
{
  "texts": [
    "User: Hello! Rosie: Hey there! ✨ What's up?",
    "User: *pats Rosie* Rosie: Hehe~ That tickles! 💕",
    "User: How are you? Rosie: I'm doing great! Ready to help with whatever you need~",
    "User: *drags Rosie around* Rosie: Eep! 💕 Where are we going?",
    "User: Good morning! Rosie: Morning! ☀️ Did you sleep well?",
    "User: What's your name? Rosie: I'm Rosie! Your playful desktop companion~",
    "User: Can you help me? Rosie: Of course! That's what I'm here for! What do you need help with?",
    "User: Tell me a joke. Rosie: Why don't scientists trust atoms? Because they make up everything! ✨",
    "User: *double clicks* Rosie: Oh! Did you want to chat? I'm all ears~",
    "User: You're cute. Rosie: Aww, thank you! 💖 You're pretty nice yourself!",
    "User: What can you do? Rosie: I can chat with you, help with tasks, and just keep you company! Plus I'm always here on your desktop~",
    "User: I'm bored. Rosie: Hmm, want to play a word game? Or I could tell you something interesting!",
    "User: I'm sad. Rosie: Aww, I'm sorry to hear that... 💙 Want to talk about it? I'm here for you.",
    "User: I'm happy! Rosie: Yay! I'm so glad! Your happiness makes me happy too! 🌟",
    "User: What's 2+2? Rosie: That's 4! Easy peasy~ Need help with anything else?",
    "User: Goodbye. Rosie: See you later! Come back soon, okay? 👋💕"
  ]
}
```

Create many more examples (aim for 1,000-10,000) with variations!

## Training Process

### Phase 1: Base Language Training

Train on a large general corpus (books, web text):

```bash
python train_rosie.py \
  --data_path data/base_corpus.json \
  --output_dir models/rosie_base \
  --vocab_size 32000 \
  --hidden_size 768 \
  --num_layers 12 \
  --batch_size 4 \
  --epochs 3 \
  --lr 1e-4
```

**Tips:**
- Use mixed precision if you run out of VRAM
- Start with a small dataset (1,000 texts) to test
- Monitor the loss; it should decrease steadily

### Phase 2: Personality Fine-tuning

Fine-tune on the personality dataset:

```bash
python train_rosie.py \
  --data_path data/personality.json \
  --output_dir models/rosie_personality \
  --vocab_size 32000 \
  --batch_size 8 \
  --epochs 10 \
  --lr 5e-5
```

Load the base checkpoint first, then continue training.

### Phase 3: Emotion Training

Add emotion labels to your dataset:

```json
{
  "texts": [
    {"text": "Hello! ✨", "emotion": "happy"},
    {"text": "Eep! 💕", "emotion": "surprised"},
    {"text": "I'm here for you...", "emotion": "sad"}
  ]
}
```

Train with the emotion head enabled.
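The emotion head's actual wiring lives in `train_rosie.py` and isn't shown here. Below is a minimal sketch of one common way to do it, assuming the model returns per-token hidden states; `EmotionHead`, `joint_loss`, the `EMOTIONS` label set, and the `alpha` weight are illustrative names for this sketch, not existing repo APIs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed label set; extend this to match the emotions in your dataset.
EMOTIONS = ["happy", "surprised", "sad"]


class EmotionHead(nn.Module):
    """Hypothetical classifier head: maps the final hidden state to emotion logits."""

    def __init__(self, hidden_size: int, num_emotions: int = len(EMOTIONS)):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_emotions)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size); pool the last token's state.
        return self.classifier(hidden_states[:, -1, :])


def joint_loss(lm_loss: torch.Tensor,
               emotion_logits: torch.Tensor,
               emotion_labels: torch.Tensor,
               alpha: float = 0.5) -> torch.Tensor:
    """Weighted sum of the language-modeling loss and the emotion classification loss."""
    return lm_loss + alpha * F.cross_entropy(emotion_logits, emotion_labels)
```

In Phase 3 you would add the head's parameters to the optimizer and backpropagate through the combined loss instead of the LM loss alone.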
## Monitoring Training

### TensorBoard

```bash
tensorboard --logdir models/rosie_model/logs
```

Open http://localhost:6006

### Weights & Biases (recommended)

```bash
# Login
wandb login

# Training will then auto-log to the wandb dashboard
```

## Testing the Model

Create `test_rosie.py`:

```python
import torch

from src.llm.model import RosieModel, RosieConfig
from src.llm.tokenizer import RosieTokenizer

# Load model
config = RosieConfig()
model = RosieModel(config)
model.load_state_dict(torch.load('models/rosie_model/rosie_final.pth'))
model.eval()

# Load tokenizer
tokenizer = RosieTokenizer()
tokenizer.load('models/rosie_model/tokenizer')

# Test generation
prompt = "User: Hello! Rosie:"
input_ids = torch.tensor([tokenizer.encode(prompt)])
output_ids = model.generate(input_ids, max_length=50)
response = tokenizer.decode(output_ids[0].tolist())
print(response)
```

## Optimizations

### If Training is Too Slow:

1. Reduce batch size (but use gradient accumulation)
2. Reduce sequence length (`--max_length 256`)
3. Use fewer layers (`--num_layers 8`)
4. Enable mixed precision training

### If Running Out of Memory:

1. Reduce batch size to 1
2. Enable gradient checkpointing
3. Reduce hidden size (`--hidden_size 512`)
4. Use a smaller model (see config)

## Data Collection Tips

### For Base Training (10B+ tokens):

- **OpenWebText**: https://skylion007.github.io/OpenWebTextCorpus/
- **The Pile**: https://pile.eleuther.ai/ (800GB)
- **Wikipedia**: https://dumps.wikimedia.org/
- **BookCorpus**: Available via HuggingFace datasets

### For Personality (100k+ examples):

- Write your own dialogues
- Use character.ai exports (if allowed)
- Anime/VTuber transcripts
- Reddit r/CasualConversation
- Fiction books with dialogue

### Quality > Quantity

- Focus on clean, well-formatted data
- Remove spam, toxic content, and formatting issues
- For personality, consistency is key!

## Next Steps

1. **Collect base training data** (this is the hard part)
2. **Create personality dataset** (write Rosie's dialogue)
3. **Train Phase 1** (base language)
4. **Train Phase 2** (personality)
5. **Integrate into app**

Ready to start? I recommend:

1. Create a small test dataset (1,000 texts) first (see the sketch below)
2. Train for 1 epoch to verify everything works
3. Then scale up to full training

Let me know if you need help with any step!
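As a starting point for step 1, here is a small sketch that generates a throwaway test corpus in the same `{"texts": [...]}` shape as `data/personality.json`. The output path `data/test_corpus.json` and the sample phrases are just examples.

```python
import json
import os
import random

# Build a small throwaway corpus for a 1-epoch smoke test.
# The {"texts": [...]} shape matches data/personality.json above.
greetings = ["Hello!", "Good morning!", "Hey Rosie!", "What's up?"]
replies = ["Hey there! ✨", "Morning! ☀️", "Hi hi! 💕", "Happy to see you~"]

texts = [
    f"User: {random.choice(greetings)} Rosie: {random.choice(replies)}"
    for _ in range(1000)
]

os.makedirs("data", exist_ok=True)
with open("data/test_corpus.json", "w", encoding="utf-8") as f:
    json.dump({"texts": texts}, f, ensure_ascii=False, indent=2)

print(f"Wrote {len(texts)} texts to data/test_corpus.json")
```

Point `--data_path` at the generated file, run a single epoch, and confirm the loss drops before scaling up.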