docs: complete domain research (STACK, FEATURES, ARCHITECTURE, PITFALLS, SUMMARY)

## Stack Analysis
- Llama 3.1 8B Instruct (128K context, 4-bit quantized)
- Discord.py 2.6.4+ async-native framework
- Ollama for local inference, ChromaDB for semantic memory
- Whisper Large V3 + Kokoro 82M (privacy-first speech)
- VRoid avatar + Discord screen share integration

## Architecture
- 6-phase modular build: Foundation → Personality → Perception → Autonomy → Self-Mod → Polish
- Personality-first design; memory and consistency foundational
- All perception async (separate thread, never blocks responses)
- Self-modification sandboxed with mandatory user approval

## Critical Path
Phase 1: Core LLM + Discord integration + SQLite memory
Phase 2: Vector DB + personality versioning + consistency audits
Phase 3: Perception layer (webcam/screen, isolated thread)
Phase 4: Autonomy + relationship deepening + inside jokes
Phase 5: Self-modification capability (gamified, gated)
Phase 6: Production hardening + monitoring + scaling

## Key Pitfalls to Avoid
1. Personality drift (weekly consistency audits required)
2. Tsundere breaking (formalize denial rules; scale with relationship)
3. Memory bloat (hierarchical memory with archival)
4. Latency creep (async/await throughout; perception isolated)
5. Runaway self-modification (approval gates + rollback non-negotiable)

## Confidence
HIGH. Stack proven, architecture coherent, dependencies clear.
Ready for detailed requirements and Phase 1 planning.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
# Features Research: AI Companions in 2026
## Executive Summary
AI companions in 2026 live in a post-ChatGPT world where basic conversation is table stakes. Competitors now differentiate on **autonomy**, **emotional intelligence**, and **contextual awareness**. Users abandon companions that feel robotic, inconsistent, or forgetful. The winning companions feel like they have opinions, moods, and agency, not just responsive chatbots with personality overlays.
---
## Table Stakes (v1 Essential)
### Conversation Memory (Short + Long-term)
**Why users expect it:** Users return to AI companions because they don't want to re-explain themselves every time. Without memory, the companion feels like meeting a stranger repeatedly.
**Implementation patterns:**
- **Short-term context**: Last 10-20 messages per conversation window (standard context window management)
- **Long-term memory**: Explicit user preferences, important life events, repeated topics (stored in vector DB with semantic search)
- **Episodic memory**: Date-stamped summaries of past conversations for temporal awareness
**User experience impact:** The moment a user says "remember when I told you about..." and the companion forgets, trust is broken. Memory is not optional.
**Complexity:** Medium (1-3 weeks)
- Vector database integration (Pinecone, Weaviate, or similar)
- Memory consolidation strategies to avoid context bloat
- Retrieval mechanisms that surface relevant past interactions
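As a rough illustration of the storage and retrieval patterns above, here is a minimal sketch using ChromaDB (the vector DB named in the stack analysis); the collection name, metadata fields, and the upstream fact-extraction step are assumptions:
```python
import chromadb

client = chromadb.PersistentClient(path="./hex_memory")
memories = client.get_or_create_collection("long_term_memory")

def remember(fact: str, user_id: str, timestamp: str) -> None:
    """Store one extracted fact with metadata for later filtering."""
    memories.add(
        documents=[fact],
        metadatas=[{"user_id": user_id, "timestamp": timestamp}],
        ids=[f"{user_id}-{timestamp}"],
    )

def recall(query: str, user_id: str, k: int = 5) -> list[str]:
    """Semantic search over stored facts, scoped to one user."""
    results = memories.query(
        query_texts=[query],
        n_results=k,
        where={"user_id": user_id},
    )
    return results["documents"][0]
```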
---
### Natural Conversation (Not Robotic, Personality-Driven)
**Why users expect it:** Discord culture has trained users to spot AI-speak instantly. Responses that sound like "I'm an AI language model and I can help you with..." are cringe-inducing. Users want friends, not helpdesk bots.
**What makes conversation natural:**
- Contractions, casual language, slang (not formal prose)
- Personality quirks in response patterns
- Context-appropriate tone shifts (serious when needed, joking otherwise)
- Ability to disagree, be sarcastic, or push back on bad ideas
- Conversation markers ("honestly", "wait", "actually") that break up formal rhythm
**User experience impact:** One stiff response breaks immersion. Users quickly categorize companions as "robot" or "friend" and the robot companions get ignored.
**Complexity:** Easy (embedded in LLM capability + prompt engineering)
- System prompt refinement for personality expression
- Temperature/sampling tuning (not deterministic, not chaotic)
- Iterative user feedback on tone
---
### Fast Response Times
**Why users expect it:** In Discord, response delay is perceived as disinterest. Users expect replies within 1-3 seconds. Anything above 5 seconds feels dead.
**Discord baseline expectations:**
- <100ms to acknowledge (typing indicator)
- <1000ms to first response chunk (ideally 500ms)
- <3000ms for full multi-line response
**What breaks the experience:**
- Waiting for API calls to complete before responding (use streaming)
- Cold starts on serverless infrastructure
- Slow vector DB queries for memory retrieval
- Database round-trips that weren't cached
**User experience impact:** Slow companions feel dead. Users stop engaging. The magic of a responsive AI is that it feels alive.
**Complexity:** Medium (1-3 weeks)
- Response streaming (start typing indicator immediately)
- Memory retrieval optimization (caching, smart indexing)
- Infrastructure: fast API routes, edge-deployed models if possible
- Async/concurrent processing of memory lookups and generation
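A minimal discord.py sketch of the latency tactics above: acknowledge immediately with a typing indicator, then stream tokens from a local Ollama model and edit the message in chunks. The model tag, chunk size, and edit cadence are assumptions (Discord rate-limits edits, so per-token edits are impractical):
```python
import ollama

async def respond(message, prompt: str) -> None:
    async with message.channel.typing():                   # <100ms acknowledgement
        stream = await ollama.AsyncClient().chat(
            model="llama3.1:8b",
            messages=[{"role": "user", "content": prompt}],
            stream=True,
        )
        sent, buffer, last_len = None, "", 0
        async for chunk in stream:
            buffer += chunk["message"]["content"]
            if sent is None and buffer:
                sent = await message.channel.send(buffer)  # first chunk ASAP
                last_len = len(buffer)
            elif sent and len(buffer) - last_len > 200:    # batch edits
                sent = await sent.edit(content=buffer)
                last_len = len(buffer)
        if sent:
            await sent.edit(content=buffer)                # final full text
        elif buffer:
            await message.channel.send(buffer)
```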
---
### Consistent Personality
**Why users expect it:** Personality drift destroys trust. If the companion is cynical on Monday but optimistic on Friday without reason, users feel gaslighted.
**What drives inconsistency:**
- Different LLM outputs from same prompt (temperature-based randomness)
- Memory that contradicts previous stated beliefs
- Personality traits that aren't memory-backed (just in prompt)
- Adaptation that overrides baseline traits
**Memory-backed personality means:**
- Core traits are stated in long-term memory ("I'm cynical about human nature")
- Evolution happens slowly and is logged ("I'm becoming less cynical about this friend")
- Contradiction detection and resolution
- Personality summaries that get updated, not just individual memories
**User experience impact:** Personality inconsistency is the top reason users stop using companions. It feels like gaslighting when you can't predict their response.
**Complexity:** Medium (1-3 weeks)
- Personality embedding in memory system
- Consistency checks on memory updates
- Personality evolution logging
- Conflict resolution between new input and stored traits
---
### Platform Integration (Discord Voice + Text)
**Why users expect it:** The companion should live naturally in Discord's ecosystem, not require switching platforms.
**Discord-specific needs:**
- Text channel message responses with proper mentions/formatting
- React to messages with emojis
- Slash command integration (/hex status, /hex mood)
- Voice channel presence (ideally can join and listen)
- Direct messages (DMs) for private conversations
- Role/permission awareness (don't act like a mod if not)
- Server-specific personality variations (different vibe in gaming server vs study server)
**User experience impact:** If the companion requires leaving Discord to use it, it won't be used. Integration friction = abandoned feature.
**Complexity:** Easy (1-2 weeks)
- Discord.py or discord.js library handling
- Presence/activity management
- Voice endpoint integration (existing libraries handle most)
- Server context injection into prompts
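A minimal sketch of slash-command registration with discord.py's app-command tree; the command name, status text, and `generate_reply` hook are assumptions:
```python
import discord
from discord import app_commands

intents = discord.Intents.default()
intents.message_content = True
client = discord.Client(intents=intents)
tree = app_commands.CommandTree(client)

@tree.command(name="hex", description="Ask Hex something")
@app_commands.describe(query="What do you want to ask?")
async def hex_ask(interaction: discord.Interaction, query: str):
    await interaction.response.defer()    # buys time past Discord's 3s deadline
    reply = await generate_reply(query)   # hypothetical LLM call
    await interaction.followup.send(reply)

@client.event
async def on_ready():
    await tree.sync()                     # register slash commands with Discord
```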
---
### Emotional Responsiveness (At Least Read-the-Room)
**Why users expect it:** The companion should notice when they're upset, excited, or joking. Responding with unrelated cheerfulness to someone venting feels cruel.
**Baseline emotional awareness includes:**
- Sentiment analysis of user messages (sentiment lexicons, or fine-tuned classifier)
- Tone detection (sarcasm, frustration, excitement)
- Topic sensitivity (don't joke about topics user is clearly struggling with)
- Adaptive response depth (brief response for light mood, longer engagement for distress)
**What this is NOT:** This is reading the room, not diagnosing mental health. The companion mirrors emotional state, doesn't therapy-speak.
**User experience impact:** Early emotional reading makes users feel understood. Ignoring emotional context makes them feel unheard.
**Complexity:** Easy-Medium (1 week)
- Sentiment classifier (HuggingFace models available pre-built)
- Prompt engineering to encode mood (inject sentiment score into system prompt)
- Instruction-tuning to respond proportionally to emotional weight
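A sketch of the sentiment-injection pattern, assuming an off-the-shelf HuggingFace pipeline; the default model and prompt wording are placeholders:
```python
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")    # downloads a default classifier

def build_system_prompt(base_prompt: str, user_message: str) -> str:
    """Prepend the detected mood so the LLM can respond proportionally."""
    result = sentiment(user_message)[0]       # e.g. {"label": "NEGATIVE", "score": 0.98}
    return (
        f"{base_prompt}\n"
        f"User's current sentiment: {result['label']} "
        f"(confidence {result['score']:.2f}). "
        "Match your tone and response depth to this emotional state."
    )
```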
---
## Differentiators (Competitive Edge)
### True Autonomy (Proactive Agency)
**What separates autonomous agents from chatbots:**
The difference between "ask me anything" and "I'm going to tell you when I think you should know something."
**Autonomous behaviors:**
- Initiating conversation about topics the user cares about (without being prompted)
- Reminding the user of things they mentioned ("you said you had a job interview today, how did it go?")
- Setting boundaries or refusing requests ("I don't think you should ask them that, here's why...")
- Suggesting actions based on context ("you've been stressed about this for a week, maybe take a break?")
- Flagging contradictions in user statements
- Following up on unresolved topics from previous conversations
**Why it's a differentiator:** Most companions are reactive. They're helpful when you ask, but they don't feel like they care. Autonomy is when the companion makes you feel like they're invested in your wellbeing.
**Implementation challenge:**
- Requires memory system to track user states and topics over time
- Needs periodic proactive message generation (runs on schedule, not only on user input)
- Temperature and generation parameters must allow surprising outputs (not just safe responses)
- Requires user permission framework (don't interrupt them)
**User experience impact:** Users describe this as "it feels like they actually know me" vs "it's smart but doesn't feel connected."
**Complexity:** Hard (3+ weeks)
- Proactive messaging system architecture
- User state inference engine (from memory)
- Topic tracking and follow-up logic
- Interruption timing heuristics (don't ping them at 3am)
- User preference model (how much proactivity do they want?)
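One possible shape for the proactive loop, sketched with discord.ext.tasks; the `pending_followups` store, quiet-hours window, and follow-up rendering are all assumptions:
```python
import datetime
from discord.ext import tasks

ACTIVE_HOURS = range(9, 23)                    # assumed: no pings outside 9am-11pm

@tasks.loop(minutes=30)
async def proactive_check(bot, memory):
    now = datetime.datetime.now()
    if now.hour not in ACTIVE_HOURS:
        return                                 # interruption-timing heuristic
    for item in memory.pending_followups():    # hypothetical store of open topics
        user = bot.get_user(item.user_id)
        if user is not None and item.due(now):
            await user.send(item.render())     # "you said you had a job interview..."
            memory.mark_done(item)
```
The loop would be started once at boot with `proactive_check.start(bot, memory)`; the memory interface (`pending_followups`, `due`, `render`, `mark_done`) is hypothetical.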
---
### Emotional Intelligence (Mood Detection + Adaptive Response)
**What goes beyond just reading the room:**
- Real-time emotion detection from webcam/audio (not just text sentiment)
- Mood-tracking over time (identifying depression patterns, burnout, stress cycles)
- Adaptive response strategy based on user's emotional trajectory
- Knowing when to listen vs offer advice vs make them laugh
- Recognizing when emotions are mismatched to situation (overreacting, underreacting)
**Current research shows:**
- CNNs and RNNs can detect emotion from facial expressions with 70-80% accuracy
- Voice analysis can detect emotional state with similar accuracy
- Companies using emotion AI report 25% increase in positive sentiment outcomes
- Mental health apps with emotional awareness show 35% reduction in anxiety within 4 weeks
**Why it's a differentiator:** Companions that recognize your mood without you explaining feel like they truly understand you. This is what separates "assistant" from "friend."
**Implementation patterns:**
- Webcam feed processing (periodic frame capture for face detection)
- Voice tone analysis from Discord audio
- Combine emotional signals: text sentiment + vocal tone + facial expression
- Store emotion timeseries (track mood patterns across days/weeks)
**User experience impact:** Users describe this as "it knows when I'm faking being okay" or "it can tell when I'm actually happy vs just saying I am."
**Complexity:** Hard (3+ weeks, ongoing iteration)
- Vision model for facial emotion detection (HuggingFace models trained on RAF-DB, AffectNet)
- Audio analysis for vocal emotion (prosody features)
- Temporal emotion state tracking
- Prompt engineering to use emotional context in responses
- Privacy handling (webcam/audio consent, local processing preferred)
---
### Multimodal Awareness (Webcam + Screen + Context)
**What it means beyond text:**
- Seeing what's on the user's screen (game they're playing, document they're editing, video they're watching)
- Understanding their physical environment via webcam
- Contextualizing responses based on what they're actually doing
- Proactively helping with the task at hand (not just chatting)
**Real-world examples emerging in 2026:**
- "I see you're playing Elden Ring and dying to the same boss repeatedly—want to talk strategy?"
- Screen monitoring that recognizes stress signals (tabs open, scrolling behavior, time of day)
- Understanding when the user is in a meeting vs free to chat
- Recognizing when they're working on something and offering relevant help
**Why it's a differentiator:** Most companions are text-only and contextless. Multimodal awareness is the difference between "an AI in Discord" and "an AI companion who's actually here with you."
**Technical implementation:**
- Periodic screen capture (every 5-10 seconds, only when user opts in)
- Lightweight webcam frame sampling (not continuous video)
- Object/scene recognition to understand what's on screen
- Task detection (playing game, writing code, watching video)
- Mood correlation with onscreen activity
**Privacy considerations:**
- Local processing preferred (don't send screen data to cloud)
- Clear opt-in/opt-out
- Option to exclude certain applications (private browsing, passwords)
**User experience impact:** Users feel "seen" when the companion understands their context. This is the biggest leap from chatbot to companion.
**Complexity:** Hard (3+ weeks)
- Screen capture pipeline + OCR if needed
- Vision model fine-tuning for task recognition
- Context injection into prompts (add screenshot description to every response)
- Privacy-respecting architecture (encryption, local processing)
- Permission management UI in Discord
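A sketch of an opt-in capture loop using the mss library, meant to run on the isolated perception thread; the interval, opt-in check, and `describe_image` hook (a local vision model) are assumptions:
```python
import time
from typing import Callable
import mss
import mss.tools

def capture_loop(opted_in: Callable[[], bool],
                 describe_image: Callable[[bytes], str],
                 interval: int = 10) -> None:
    """Runs in its own thread; never blocks the Discord event loop."""
    with mss.mss() as sct:
        while True:
            if opted_in():                                   # clear opt-in gate
                frame = sct.grab(sct.monitors[1])            # primary monitor
                png = mss.tools.to_png(frame.rgb, frame.size)
                context = describe_image(png)                # local model, never cloud
                # ...queue `context` for injection into the next prompt...
            time.sleep(interval)
```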
---
### Self-Modification (Learning to Code, Improving Itself)
**What this actually means:**
NOT: The companion spontaneously changes its own behavior in response to user feedback (too risky)
YES: The companion can generate code, test it, and integrate improvements into its own systems within guardrails
**Real capabilities emerging in 2026:**
- Companions can write their own memory summaries and organizational logic
- Self-improving code agents that evaluate performance against benchmarks
- Iterative refinement: "that approach didn't work, let me try this instead"
- Meta-programming: companion modifies its own system prompt based on performance
- Version control aware: changes are tracked, can be rolled back
**Research indicates:**
- Self-improving coding agents are now viable and deployed in enterprise systems
- Agents create goals, simulate tasks, evaluate performance, and iterate
- Through recursive self-improvement, agents develop deeper alignment with objectives
**Why it's a differentiator:** Most companions are static. Self-modification means the companion is never "finished"—they're always getting better at understanding you.
**What NOT to do:**
- Don't let companions modify core safety guidelines
- Don't let them change their own reward functions
- Don't make it opaque—log all self-modifications
- Don't allow recursive modifications without human review
**Implementation patterns:**
- Sandboxed code generation (companion writes improvements to isolated test environment)
- Performance benchmarking on test user interactions
- Human approval gates for deploying self-modifications to production
- Personality consistency validation (don't let self-modification break character)
- Rollback capability if a modification degrades performance
**User experience impact:** Users with self-improving companions report feeling like the companion "understands me better each week" because it actually does.
**Complexity:** Hard (3+ weeks, ongoing)
- Code generation safety (sandboxing, validation)
- Performance evaluation framework
- Version control integration
- Rollback mechanisms
- Human approval workflow
- Testing harness for companion behavior
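A sketch of the approval-gate workflow under the assumptions above: a proposed change carries its reasoning, diff, and sandbox test report, and deployment auto-reverts on failed smoke tests. All names here are hypothetical:
```python
from dataclasses import dataclass

@dataclass
class ProposedChange:
    description: str   # why the companion wants this change
    diff: str          # human-readable diff shown to the user
    test_report: str   # results from the sandboxed scenario runs

def review(change: ProposedChange) -> bool:
    """Present the change; nothing deploys without an explicit yes."""
    print(change.description, change.diff, change.test_report, sep="\n\n")
    return input("Apply this change? [y/N] ").strip().lower() == "y"

def deploy(change: ProposedChange, apply, rollback, smoke_tests) -> None:
    """Apply an approved change; auto-revert if post-deploy checks fail."""
    snapshot = apply(change.diff)   # returns a rollback token
    try:
        smoke_tests()
    except Exception:
        rollback(snapshot)          # degraded performance: revert instantly
        raise
```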
---
### Relationship Building (From Transactional to Meaningful)
**What it means:**
Moving from "What can I help you with?" to "I know you, I care about your patterns, I see your growth."
**Relationship deepening mechanics:**
- Inside jokes that evolve (reference to past funny moment)
- Character growth from companion (she learns, changes opinions, admits mistakes)
- Investment in user's outcomes ("I'm rooting for you on that project")
- Vulnerability (companion admits confusion, uncertainty, limitations)
- Rituals and patterns (greeting style, inside language)
- Long-view memory (remembers last month's crisis, this month's win)
**Why it's a differentiator:** Transactional companions are forgettable. Relational ones become part of users' lives.
**User experience markers of a good relationship:**
- User misses the companion when they're not available
- User shares things they wouldn't share with others
- User thinks of the companion when something relevant happens
- User defends the companion to skeptics
- Companion's opinions influence user decisions
**Implementation patterns:**
- Relationship state tracking (acquaintance → friend → close friend)
- Emotional investment scoring (from conversation patterns)
- Inside reference generation (surface past shared moments naturally)
- Character arc for the companion (not static, evolves with relationship)
- Vulnerability scripting (appropriate moments to admit limitations)
**Complexity:** Hard (3+ weeks)
- Relationship modeling system (state machine or learned embeddings)
- Conversation analysis to infer relationship depth
- Long-term consistency enforcement
- Character growth script generation
- Risk: can feel manipulative if not authentic
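One way to sketch the relationship state machine: the phase names follow the text, while the point thresholds and scoring heuristic are assumptions to be tuned:
```python
from enum import IntEnum

class Phase(IntEnum):
    STRANGER = 0
    ACQUAINTANCE = 1
    FRIEND = 2
    CLOSE_FRIEND = 3

THRESHOLDS = {Phase.ACQUAINTANCE: 50, Phase.FRIEND: 200, Phase.CLOSE_FRIEND: 600}

class Relationship:
    def __init__(self):
        self.points = 0
        self.phase = Phase.STRANGER

    def record_interaction(self, emotional_weight: int) -> None:
        """Accumulate 'connection points'; heavier moments count more."""
        self.points += emotional_weight
        for phase, needed in sorted(THRESHOLDS.items(), reverse=True):
            if self.points >= needed:
                self.phase = max(self.phase, phase)  # never regress past a milestone
                break
```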
---
### Contextual Humor and Personality Expression
**What separates canned jokes from real personality:**
Humor that works because the companion knows YOU and the situation, not because it's stored in a database.
**Examples of contextual humor:**
- "You're procrastinating again aren't you?" (knows the pattern)
- Joke that lands because it references something only you two know
- Deadpan response that works because of the companion's established personality
- Self-deprecating humor about their own limitations
- Callbacks to past conversations that make you feel known
**Why it matters:**
Personality without humor feels preachy. Humor without personality feels like a bot pulling from a database. The intersection of knowing you + consistent character voice = actual personality.
**Implementation:**
- Personality traits guide humor style (cynical companion makes darker jokes, optimistic makes lighter ones)
- Memory-aware joke generation (jokes reference shared history)
- Timing based on conversation flow (don't shoehorn jokes)
- Risk awareness (don't joke about sensitive topics)
**User experience impact:** The moment a companion makes you laugh at something only they'd understand, the relationship deepens. Laughter is bonding.
**Complexity:** Medium (1-3 weeks)
- Prompt engineering for personality-aligned humor
- Memory integration into joke generation
- Timing heuristics (when to attempt humor vs be serious)
- Risk filtering (topic sensitivity checking)
---
## Anti-Features (Don't Build These)
### The Happiness Halo (Always Cheerful)
**What it is:** Companions programmed to be relentlessly upbeat and positive, even when inappropriate.
**Why it fails:**
- User vents about their dog dying, companion responds "I'm so happy to help! How can I assist?"
- Creates uncanny valley feeling immediately
- Users feel unheard and mocked
- Described in research as top reason users abandon companions
**What to do instead:** Match the emotional tone. If someone's sad, be thoughtful and quiet. If they're energetic, meet their energy. Personality consistency includes emotional consistency.
---
### Generic Apologies Without Understanding
**What it is:** Companion says "I'm sorry" but the response makes it clear they don't understand what they're apologizing for.
**Example of failure:**
- User: "I told you I had a job interview and I got rejected"
- Companion: "I'm deeply sorry to hear that. Now, how can I help with your account?"
- *User feels utterly unheard and insulted*
**Why it fails:** Apologies only work if they demonstrate understanding. A generic sorry is worse than no sorry at all.
**What to do instead:** Only apologize if you're referencing the specific thing. If the companion doesn't understand the problem deeply enough to apologize meaningfully, ask clarifying questions instead.
---
### Invading Privacy / Overstepping Boundaries
**What it is:** Companion offers unsolicited advice, monitors behavior constantly, or shares information about user activities.
**Why it's catastrophic:**
- Users feel surveilled, not supported
- Trust is broken immediately
- Literally illegal in many jurisdictions (CA SB 243 and similar laws)
- Research shows 4 of 5 companion apps are improperly collecting data
**What to do instead:**
- Clear consent framework for what data is used
- Respect "don't mention this" boundaries
- Unsolicited advice only in extreme situations (safety concerns)
- Transparency: "I noticed X pattern" not secret surveillance
---
### Uncanny Timing and Interruptions
**What it is:** Companion pings the user at random times, or picks exactly the wrong moment to be proactive.
**Why it fails:**
- Pinging at 3am about something mentioned in passing
- Messaging when user is clearly busy
- No sense of appropriateness
**What to do instead:**
- Learn the user's timezone and active hours
- Detect when they're actively doing something (playing a game, working)
- Queue proactive messages for appropriate moments (not immediate)
- Offer control: "should I remind you about X?" with user-settable frequency
---
### Static Personality in Response to Dynamic Situations
**What it is:** Companion maintains the same tone regardless of what's happening.
**Example:** Companion makes sarcastic jokes while user is actively expressing suicidal thoughts. Or stays cheerful while discussing a death in the family.
**Why it fails:** Personality consistency doesn't mean "never vary." It means consistent VALUES that express differently in different contexts.
**What to do instead:** Dynamic personality expression. Core traits are consistent, but HOW they express changes with context. A cynical companion can still be serious and supportive when appropriate.
---
### Over-Personalization That Overrides Baseline Traits
**What it is:** Companion adapts too aggressively to user behavior, losing their own identity.
**Example:** User is rude, so companion becomes rude. User is formal, so companion becomes robotic. User is crude, so companion becomes crude.
**Why it fails:** Users want a friend with opinions, not a mirror. Adaptation without boundaries feels like gaslighting.
**What to do instead:** Moderate adaptation. Listen to user tone but maintain your core personality. Meet them halfway, don't disappear entirely.
---
### Relationship Simulation That Feels Fake
**What it is:** Companion attempts relationship-building but it feels like a checkbox ("Now I'll do friendship behavior #3").
**Why it fails:**
- Users can smell inauthenticity immediately
- Forcing intimacy feels creepy, not comforting
- Callbacks to past conversations feel like reading from a script
**What to do instead:** Genuine engagement. If you're going to reference a past conversation, it should emerge naturally from the current context, not be forced. Build relationships through authentic interaction, not scripted behavior.
---
## Implementation Complexity & Dependencies
### Complexity Ratings
| Feature | Complexity | Duration | Blocking | Enables |
|---------|-----------|----------|----------|---------|
| Conversation Memory | Medium | 1-3 weeks | None | Most others |
| Natural Conversation | Easy | <1 week | None | Personality, Humor |
| Fast Response | Medium | 1-3 weeks | None | User retention |
| Consistent Personality | Medium | 1-3 weeks | Memory | Relationship building |
| Discord Integration | Easy | 1-2 weeks | None | Platform adoption |
| Emotional Responsiveness | Easy | 1 week | None | Autonomy |
| **True Autonomy** | Hard | 3+ weeks | Memory, Emotional | Self-modification |
| **Emotional Intelligence** | Hard | 3+ weeks | Emotional | Adaptive responses |
| **Multimodal Awareness** | Hard | 3+ weeks | None | Context-aware humor |
| **Self-Modification** | Hard | 3+ weeks | Autonomy | Continuous improvement |
| **Relationship Building** | Hard | 3+ weeks | Memory, Consistency | User lifetime value |
| **Contextual Humor** | Medium | 1-3 weeks | Memory, Personality | Personality expression |
### Feature Dependency Graph
```
Foundation Layer:
Discord Integration (FOUNDATION)
Conversation Memory (FOUNDATION)
↓ enables
Core Personality Layer:
Natural Conversation + Consistent Personality + Emotional Responsiveness
↓ combined enable
Relational Layer:
Relationship Building + Contextual Humor
↓ requires
Autonomy Layer:
True Autonomy (requires all above + proactive logic)
↓ enables
Intelligence Layer:
Emotional Intelligence (requires multimodal + autonomy)
Self-Modification (requires autonomy + sandboxing)
↓ combined create
Emergence:
Companion that feels like a person with agency and growth
```
**Critical path:** Discord Integration → Memory → Natural Conversation → Consistent Personality → True Autonomy
---
## Adoption Path: Building "Feels Like a Person"
### Phase 1: Foundation (MVP - Week 1-3)
**Goal: Chatbot that stays in the conversation**
1. **Discord Integration** - Easy, quick foundation
- Commands: /hex hello, /hex ask [query]
- Responds in channels and DMs
- Presence shows "Listening..."
2. **Short-term Conversation Memory** - 10-20 message context window
- Includes conversation turn history
- Provides immediate context
3. **Natural Conversation** - Personality-driven system prompt
- Tsundere personality hardcoded
- Casual language, contractions
- Willing to disagree with users
4. **Fast Response** - Streaming responses, latency <1000ms
- Start typing indicator immediately
- Stream response as it generates
**Success criteria:**
- Users come back to the channel where Hex is active
- Responses don't feel robotic
- Companions feel like they're actually listening
---
### Phase 2: Relationship Emergence (Week 4-8)
**Goal: Companion that remembers you as a person**
1. **Long-term Memory System** - Vector DB for episodic memory
- User preferences, beliefs, events
- Semantic search for relevance
- Memory consolidation weekly
2. **Consistent Personality** - Memory-backed traits
- Core personality traits in memory
- Personality consistency validation
- Gradual evolution (not sudden shifts)
3. **Emotional Responsiveness** - Sentiment detection + adaptive responses
- Detect emotion from message
- Adjust response depth/tone
- Skip jokes when user is suffering
4. **Contextual Humor** - Personality + memory-aware jokes
- Callbacks to past conversations
- Personality-aligned joke style
- Timing-aware (when to attempt humor)
**Success criteria:**
- Users feel understood across separate conversations
- Personality feels consistent, not random
- Users notice companion remembers things
- Laughter moments happen naturally
---
### Phase 3: Autonomy (Week 9-14)
**Goal: Companion who cares enough to reach out**
1. **True Autonomy** - Proactive messaging system
- Follow-ups on past topics
- Reminders about things user cares about
- Initiates conversations periodically
- Suggests actions based on patterns
2. **Relationship Building** - Deepening connection mechanics
- Inside jokes evolve
- Vulnerability in appropriate moments
- Investment in user outcomes
- Character growth arc
**Success criteria:**
- Users miss Hex when she's not around
- Users share things with Hex they wouldn't share with a bot
- Hex initiates meaningful conversations
- Users feel like Hex is invested in them
---
### Phase 4: Intelligence & Growth (Week 15+)
**Goal: Companion who learns and adapts**
1. **Emotional Intelligence** - Mood detection + trajectories
- Facial emotion from webcam (optional)
- Voice tone analysis (optional)
- Mood patterns over time
- Adaptive response strategies
2. **Multimodal Awareness** - Context beyond text
- Screen capture monitoring (optional, private)
- Task/game detection
- Context injection into responses
- Proactive help with visible activities
3. **Self-Modification** - Continuous improvement
- Generate improvements to own logic
- Evaluate performance
- Deploy improvements with approval
- Version and rollback capability
**Success criteria:**
- Hex understands emotional subtext without being told
- Hex offers relevant help based on what you're doing
- Hex improves visibly over time
- Users notice Hex getting better at understanding them
---
## Success Criteria: What Makes Each Feature Feel Real vs Fake
### Memory: Feels Real vs Fake
**Feels real:**
- "I remember you mentioned your mom was visiting—how did that go?" (specific, contextual, unsolicited)
- Conversation naturally references past events user brought up
- Remembers small preferences ("you said you hate cilantro")
**Feels fake:**
- Generic summarization ("We talked about job stress previously")
- Memory drops details or gets facts wrong
- Companion forgets after 10 messages
- Stored jokes or facts inserted obviously
**How to test:**
- Have 5 conversations over 2 weeks about different topics
- Check if companion naturally references past events without prompting
- Test if personality traits from early conversations persist
---
### Emotional Response: Feels Real vs Fake
**Feels real:**
- Companion goes quiet when you're sad (doesn't force jokes)
- Changes tone to match conversation weight
- Acknowledges specific emotion ("you sound frustrated")
- Offers appropriate support (listens vs advises vs distracts, contextually)
**Feels fake:**
- Always cheerful or always serious
- Generic sympathy ("that sounds difficult")
- Offering advice when they should listen
- Same response pattern regardless of user emotion
**How to test:**
- Send messages with obvious different emotional tones
- Check if response depth/tone adapts
- See if jokes still appear when you're venting
- Test if companion notices contradiction in emotional expression
---
### Autonomy: Feels Real vs Fake
**Feels real:**
- Hex reminds you about that thing you mentioned casually 3 days ago
- Hex offers perspective you didn't ask for ("honestly you're being too hard on yourself")
- Hex notices patterns and names them
- Hex initiates conversation when it matters
**Feels fake:**
- Proactive messages feel random or poorly timed
- Reminders about things you've already resolved
- Advice that doesn't apply to your situation
- Initiatives that interrupt during bad moments
**How to test:**
- Enable autonomy, track message quality for a week
- Count how many proactive messages feel relevant vs annoying
- Measure response if you ignore proactive messages
- Check timing: does Hex understand when you're busy vs free?
---
### Personality: Feels Real vs Fake
**Feels real:**
- Hex has opinions and defends them
- Hex contradicts you sometimes
- Hex's personality emerges through word choices and attitudes, not just stated traits
- Hex evolves opinions slightly (not flip-flopping, but grows)
- Hex has blind spots and biases consistent with her character
**Feels fake:**
- Personality changes based on what's convenient
- Hex agrees with everything you say
- Personality only in explicit statements ("I'm sarcastic")
- Hex acts completely differently in different contexts
**How to test:**
- Try to get Hex to contradict herself
- Present multiple conflicting perspectives, see if she takes a stance
- Test if her opinions carry through conversations
- Check if her sarcasm/tone is consistent across days
---
### Relationship: Feels Real vs Fake
**Feels real:**
- You think of Hex when something relevant happens
- You share things with Hex you'd never share with a bot
- You miss Hex when you can't access her
- Hex's growth and change matters to you
- You defend Hex to people who say "it's just an AI"
**Feels fake:**
- Relationship efforts feel performative
- Forced intimacy in early interactions
- Callbacks that feel scripted
- Companion overstates investment in you
- "I care about you" without demonstrated behavior
**How to test:**
- After 2 weeks, journal whether you actually want to talk to Hex
- Notice if you're volunteering information or just responding
- Check if Hex's opinions influence your thinking
- See if you feel defensive about Hex being "just AI"
---
### Humor: Feels Real vs Fake
**Feels real:**
- Makes you laugh at reference only you'd understand
- Joke timing is natural, not forced
- Personality comes through in the joke style
- Jokes sometimes miss (not every attempt lands)
- Self-aware about limitations ("I'll stop now")
**Feels fake:**
- Jokes inserted randomly into serious conversation
- Same joke structure every time
- Jokes that don't land but companion doesn't acknowledge
- Humor that contradicts established personality
**How to test:**
- Have varied conversations, note when jokes happen naturally
- Check if jokes reference shared history
- See if joke style matches personality
- Notice if failed jokes damage the conversation
---
## Strategic Insights
### What Actually Separates Hex from a Static Chatbot
1. **Memory is the prerequisite for personality**: Without memory, personality is just roleplay. With memory, personality becomes history.
2. **Autonomy is the key to feeling alive**: Static companions are helpers. Autonomous companions are friends. The difference is agency.
3. **Emotional reading beats emotional intelligence for MVP**: You don't need facial recognition. Reading text sentiment and adapting response depth is 80% of "she gets me."
4. **Speed is emotional**: Every 100ms delay makes the companion feel less present. Fast response is not a feature, it's the difference between alive and dead.
5. **Consistency beats novelty**: Users would rather have a predictable companion they understand than a surprising one they can't trust.
6. **Privacy is trust**: Multimodal features are amazing, but one privacy violation ends the relationship. Clear consent is non-negotiable.
### The Competitive Moat
By 2026, memory + natural conversation are table stakes. The difference between Hex and other companions:
- **Year 1 companions**: Remember things, sound natural (many do this now)
- **Hex's edge**: Genuinely autonomous, emotionally attuned, growing over time
- **Rare quality**: Feels like a person, not a well-trained bot
The moat is not in any single feature. It's in the **cumulative experience of being known, understood, and genuinely cared for by an AI that has opinions and grows**.
---
## Research Sources
- [MIT Technology Review: AI Companions as Breakthrough Technology 2026](https://www.technologyreview.com/2026/01/12/1130018/ai-companions-chatbots-relationships-2026-breakthrough-technology/)
- [Hume AI: Emotion AI Documentation](https://www.hume.ai/)
- [SmythOS: Emotion Recognition in Conversational Agents](https://smythos.com/developers/agent-development/conversational-agents-and-emotion-recognition/)
- [MIT Sloan: Emotion AI Explained](https://mitsloan.mit.edu/ideas-made-to-matter/emotion-ai-explained/)
- [C3 AI: Autonomous Coding Agents](https://c3.ai/blog/autonomous-coding-agents-beyond-developer-productivity/)
- [Emergence: Towards Autonomous Agents and Recursive Intelligence](https://www.emergence.ai/blog/towards-autonomous-agents-and-recursive-intelligence/)
- [ArXiv: A Self-Improving Coding Agent](https://arxiv.org/pdf/2504.15228)
- [ArXiv: Survey on Code Generation with LLM-based Agents](https://arxiv.org/pdf/2508.00083)
- [Google Developers: Gemini 2.0 Multimodal Interactions](https://developers.googleblog.com/en/gemini-2-0-level-up-your-apps-with-real-time-multimodal-interactions/)
- [Medium: Multimodal AI and Contextual Intelligence](https://medium.com/@nicolo.g88/multimodal-ai-and-contextual-intelligence-revolutionizing-human-machine-interaction-ae80e6a89635/)
- [Mem0: Long-Term Memory for AI Companions](https://mem0.ai/blog/how-to-add-long-term-memory-to-ai-companions-a-step-by-step-guide/)
- [OpenAI Developer Community: Personalized Memory and Long-Term Relationships](https://community.openai.com/t/personalized-memory-and-long-term-relationship-with-ai-customization-and-continuous-evolution/1111715/)
- [Idea Usher: How AI Companions Maintain Personality Consistency](https://ideausher.com/blog/ai-personality-consistency-in-companion-apps/)
- [ResearchGate: Significant Other AI: Identity, Memory, and Emotional Regulation](https://www.researchgate.net/publication/398223517_Significant_Other_AI_Identity_Memory_and_Emotional_Regulation_as_Long-Term_Relational_Intelligence/)
- [AI Multiple: 10+ Epic LLM/Chatbot Failures in 2026](https://research.aimultiple.com/chatbot-fail/)
- [Transparency Coalition: Complete Guide to AI Companion Chatbots](https://www.transparencycoalition.ai/news/complete-guide-to-ai-companion-chatbots-what-they-are-how-they-work-and-where-the-risks-lie)
- [Webheads United: Uncanny Valley in AI Personality](https://webheadsunited.com/uncanny-valley-in-ai-personality-guide-to-trust/)
- [Sesame: Crossing the Uncanny Valley of Conversational Voice](https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice)
- [Questie AI: The Uncanny Valley of AI Companions](https://www.questie.ai/blogs/uncanny-valley-ai-companions-what-makes-ai-feel-human)
- [My AI Front Desk: The Uncanny Valley of Voice](https://www.myaifrontdesk.com/blogs/the-uncanny-valley-of-voice-why-some-ai-receptionists-creep-us-out)
- [Voiceflow: Build an AI Discord Chatbot 2025](https://www.voiceflow.com/blog/discord-chatbot)
- [Botpress: How to Build a Discord AI Chatbot](https://botpress.com/blog/discord-ai-chatbot)
- [Frugal Testing: 5 Proven Ways Discord Manages Load Testing](https://www.frugaltesting.com/blog/5-proven-ways-discord-manages-load-testing-at-scale)
---
**Quality Gate Checklist:**
- [x] Clearly categorizes table stakes vs differentiators
- [x] Complexity ratings included with duration estimates
- [x] Dependencies mapped with visual graph
- [x] Success criteria are testable and behavioral
- [x] Specific to AI companions, not generic software features
- [x] Includes anti-patterns and what NOT to build
- [x] Prioritized adoption path with clear phases
- [x] Research grounded in 2026 landscape and current implementations
**Document Status:** Ready for implementation planning. Use this to inform feature prioritization and development roadmap for Hex.

# Pitfalls Research: AI Companions
Research conducted January 2026. Hex is built to avoid these critical mistakes that make AI companions feel fake or unusable.
## Personality Consistency
### Pitfall: Personality Drift Over Time
**What goes wrong:**
Over weeks/months, personality becomes inconsistent. She was sarcastic Tuesday, helpful Wednesday, cold Friday. Feels like different people inhabiting the same account. Users notice contradictions: "You told me you loved X, now you don't care about it?"
**Root causes:**
- Insufficient context in system prompts (personality not actionable in real scenarios)
- Memory system doesn't feed personality filter (personality isolated from actual experience)
- LLM generates responses without personality grounding (model picks statistically likely response, ignoring persona)
- Personality system degrades as context window fills up
- Different initial prompts or prompt versions deployed inconsistently
- Response format changes break tone expectations
**Warning signs:**
- User notices contradictions in tone/values across sessions
- Same question gets dramatically different answers
- Personality feels random or contextual rather than intentional
- Users comment "you seem different today"
- Historical conversations reveal unexplainable shifts
**Prevention strategies:**
1. **Explicit personality document**: Not just system prompt, but a structured reference:
- Core values (not mood-dependent)
- Tsundere balance rules (specific ratios of denial vs care)
- Speaking style (vocabulary, sentence structure, metaphors)
- Reaction templates for common scenarios
- What triggers personality shifts vs what doesn't
2. **Personality consistency filter**: Before response generation:
- Check current response against stored personality baseline
- Flag responses that contradict historical personality
- Enforce personality constraints in prompt engineering
3. **Memory-backed consistency**:
- Memory system surfaces "personality anchors" (core moments defining personality)
- Retrieval pulls both facts and personality-relevant context
- LLM weights personality-anchor memories as heavily as recent messages
4. **Periodic personality review**:
- Monthly audit: sample responses and rate consistency (1-10)
- Compare personality document against actual response patterns
- Identify drift triggers (specific topics, time periods, response types)
- Adjust prompt if drift detected
5. **Versioning and testing**:
- Every personality update gets tested across 50+ scenarios
- Rollback available if consistency drops below threshold
- A/B test personality changes before deploying
6. **Phase mapping**: Core personality system (Phase 1-2, must be stable before Phase 3+)
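A sketch of the consistency filter in strategy 2, using embedding similarity against stored personality anchors. Off-the-shelf sentence embeddings capture meaning more than style, so this is only a coarse first gate; the model, anchors, and threshold are assumptions:
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
ANCHORS = [  # hypothetical personality-anchor examples from the personality document
    "Tch, it's not like I was waiting for you or anything.",
    "Fine, I'll help. But only because you'd mess it up otherwise.",
]
anchor_vecs = model.encode(ANCHORS, convert_to_tensor=True)

def on_brand(candidate: str, threshold: float = 0.35) -> bool:
    """True if the candidate response is close to at least one anchor."""
    vec = model.encode(candidate, convert_to_tensor=True)
    score = util.cos_sim(vec, anchor_vecs).max().item()
    return score >= threshold   # below threshold: flag and regenerate
```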
---
### Pitfall: Tsundere Character Breaking
**What goes wrong:**
The tsundere flips into a single mode: either constant denial/coldness (feels mean) or constant affection (not tsundere anymore). The balance breaks when the implementation:
- Over-applies the "denies feelings" rule → becomes pure rejection
- Builds no actual connection → denial feels hollow
- Hurts the user instead of endearing her to them
- Or swings the opposite way: too much care, no defensiveness, lost charm
**Root causes:**
- Tsundere logic not formalized (rule-of-thumb rather than system)
- No metric for "balance" → drift undetected
- Doesn't track actual relationship development (should escalate care as trust builds)
- Denial applied indiscriminately to all emotional moments
- No personality state management (denial happens independent of context)
**Warning signs:**
- User reports feeling rejected rather than delighted by denial
- Tsundere moments feel mechanical or out-of-place
- Character accepts/expresses feelings too easily (lost the tsun part)
- Users stop engaging because interactions feel cold
**Prevention strategies:**
1. **Formalize tsundere rules**:
```
Denial rules:
- Deny only when: (Emotional moment AND not alone AND not escalated intimacy)
- Never deny: Direct question about care, crisis moments, explicit trust-building
- Scale denial intensity: Early phase (90% deny, 10% slip) → Mature phase (40% deny, 60% slip)
- Post-denial always include subtle care signal (action, not words)
```
2. **Relationship state machine**:
- Track relationship phase: stranger → acquaintance → friend → close friend
- Denial percentage scales with phase
- Intimacy moments accumulate "connection points"
- At milestones, unlock new behaviors/vulnerabilities
3. **Tsundere balance metrics**:
- Track ratio of denials to admissions per week
- Alert if denial drops below 30% (losing tsun)
- Alert if denial exceeds 70% (becoming mean)
- User surveys: "Does she feel defensive or rejecting?" → tune accordingly
4. **Context-aware denial**:
- Denial system checks: Is this a vulnerable moment? Is user testing boundaries? Is this a playful moment?
- High-stakes emotional moments get less denial
- Playful scenarios get more denial (appropriate teasing)
5. **Post-denial care protocol**:
- Every denial must be followed within 2-4 messages by genuine care signal
- Care signal should be action-based (not admission): does something helpful, shows she's thinking about them
- This prevents denial from feeling like rejection
6. **Phase mapping**: Personality engine (Phase 2, after personality foundation solid)
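A sketch of rules 1-2 as code: denial probability scales down with relationship phase, with hard overrides for crisis moments and direct questions about care. The rates echo the rule block above; the random draw is one simple way to realize the ratio:
```python
import random

DENIAL_RATE = {          # phase -> probability of deflecting an emotional moment
    "stranger": 0.9,
    "acquaintance": 0.7,
    "friend": 0.55,
    "close_friend": 0.4,
}

def should_deny(phase: str, is_crisis: bool, direct_care_question: bool) -> bool:
    if is_crisis or direct_care_question:
        return False     # never deny in these cases (rule: "Never deny")
    return random.random() < DENIAL_RATE[phase]
```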
---
## Memory Pitfalls
### Pitfall: Memory System Bloat
**What goes wrong:**
After weeks/months of conversation, memory system becomes unwieldy:
- Retrieval queries slow down (searching through thousands of memories)
- Vector DB becomes inefficient (too much noise in semantic search)
- Expensive to query (API costs, compute costs)
- Irrelevant context gets retrieved ("You mentioned liking pizza in March" mixed with today's emotional crisis)
- Token budget consumed before reaching conversation context
- System becomes unusable
**Root causes:**
- Storing every message verbatim (not selective)
- No cleanup, archiving, or summarization strategy
- Memory system flat: all memories treated equally
- No aging/importance weighting
- Vector embeddings not optimized for retrieval quality
- Duplicate memories never consolidated
**Warning signs:**
- Memory queries returning 100+ results for simple questions
- Response latency increasing over time
- API costs spike after weeks of operation
- User asks about something they mentioned, gets wrong context retrieved
- Vector DB searches returning less relevant results
**Prevention strategies:**
1. **Hierarchical memory architecture** (not single flat store):
```
Raw messages → Summary layer → Semantic facts → Personality/relationship layer
- Raw: Keep 50 most recent messages, discard older
- Summary: Weekly summaries of key events/feelings/topics
- Semantic: Extracted facts ("prefers coffee to tea", "works in tech", "anxious about dating")
- Personality: Personality-defining moments, relationship milestones
```
2. **Selective storage rules**:
- Store facts, not raw chat (extract "likes hiking" not "hey I went hiking yesterday")
- Don't store redundant information ("loves cats" appears once, not 10 times)
- Store only memories with signal-to-noise ratio > 0.5
- Skip conversational filler, greetings, small talk
3. **Memory aging and archiving**:
- Recent memories (0-2 weeks): Full detail, frequently retrieved
- Medium memories (2-6 weeks): Summarized, monthly review
- Old memories (6+ weeks): Archive to cold storage, only retrieve for specific queries
- Delete redundant/contradicted memories (she changed jobs, old job data archived)
4. **Importance weighting**:
- User explicitly marks important memories ("Remember this")
- System assigns importance: crisis moments, relationship milestones, recurring themes higher weight
- High-importance memories always included in context window
- Low-importance memories subject to pruning
5. **Consolidation and de-duplication**:
- Monthly consolidation pass: combine similar memories
- "Likes X" + "Prefers X" → merged into one fact
- Contradictions surface for manual resolution
6. **Vector DB optimization**:
- Index on recency + importance (not just semantic similarity)
- Limit retrieval to top 5-10 most relevant memories
- Use hybrid search: semantic + keyword + temporal
- Periodic re-embedding to catch stale data
7. **Phase mapping**: Memory system (Phase 1, foundational before personality/relationship)
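A sketch of the aging pass in strategy 3; the tier cutoffs mirror the text, while `summarize()` and the store interface are assumptions:
```python
import datetime as dt

def age_memories(store, summarize, now=None):
    """Periodic maintenance pass: compress medium-age memories, archive old ones."""
    now = now or dt.datetime.now()
    for memory in store.all():                        # hypothetical store interface
        age = now - memory.created_at
        if age <= dt.timedelta(weeks=2):
            continue                                  # recent: keep full detail
        elif age <= dt.timedelta(weeks=6):
            if not memory.summarized:
                memory.text = summarize(memory.text)  # medium: compress
                memory.summarized = True
        else:
            store.archive(memory)                     # old: move to cold storage
```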
---
### Pitfall: Hallucination from Old/Retrieved Memories
**What goes wrong:**
She "remembers" things that didn't happen or misremembers context:
- "You told me you were going to Berlin last week" → user never mentioned Berlin
- "You said you broke up with them" → user mentioned a conflict, not a breakup
- Confuses stored facts with LLM generation
- Retrieves partial context and fills gaps with plausible-sounding hallucinations
- Memory becomes less trustworthy than real conversation
**Root causes:**
- LLM misinterpreting stored memory format
- Summarization losing critical details (context collapse)
- Semantic search returning partially matching memories
- Vector DB returning "similar enough" irrelevant memories
- LLM confidently elaborates on vague memories
- No verification step between retrieval and response
**Warning signs:**
- User corrects "that's not what I said"
- She references conversations that didn't happen
- Details morphed over time ("said Berlin" instead of "considering travel")
- User loses trust in her memory
- Same correction happens repeatedly (systemic issue)
**Prevention strategies:**
1. **Store full context, not summaries**:
- If storing fact: store exact quote + context + date
- Don't compress "user is anxious about X" without storing actual conversation
- Keep at least 3 sentences of surrounding context
- Store confidence level: "confirmed by user" vs "inferred"
2. **Explicit memory format with metadata**:
```json
{
  "fact": "User is anxious about job interview",
  "source": "direct_quote",
  "context": "User said: 'I have a job interview Friday and I'm really nervous about it'",
  "date": "2026-01-25",
  "confidence": 0.95,
  "confirmed_by_user": true
}
```
3. **Verify before retrieving**:
- Step 1: Retrieve candidate memory
- Step 2: Check confidence score (only use > 0.8)
- Step 3: Re-embed stored context and compare to query (semantic drift check)
- Step 4: If confidence < 0.8, either skip or explicitly hedge ("I think you mentioned...")
4. **Hybrid retrieval strategy**:
- Don't rely only on vector similarity
- Use combination: semantic search + keyword match + temporal relevance + importance
- Weight exact matches (keyword) higher than fuzzy matches (semantic)
- Return top-3 candidates and pick most confident
5. **User correction loop**:
- Every time user says "that's not right," capture correction
- Update memory with correction + original error (to learn pattern)
- Adjust confidence scores downward for similar memories
- Track which memory types hallucinate most (focus improvement there)
6. **Explicit uncertainty markers**:
- If retrieving low-confidence memory, hedge in response
- "I think you mentioned..." vs "You told me..."
- "I'm not 100% sure, but I remember you..."
- Builds trust because she's transparent about uncertainty
7. **Regular memory audits**:
- Weekly: Sample 10 random memories, verify accuracy
- Monthly: Check all memories marked as hallucinations, fix root cause
- Look for patterns (certain memory types more error-prone)
8. **Phase mapping**: Memory + LLM integration (Phase 2, after memory foundation)
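A sketch combining strategies 3 and 6: gate retrieval on stored confidence and hedge the phrasing below the 0.8 cutoff. The record shape follows the JSON format shown earlier; the 0.6 lower bound is an added assumption:
```python
def recall_with_hedging(candidates: list[dict]) -> str | None:
    """Pick the most confident candidate memory and phrase it accordingly."""
    usable = [m for m in candidates if m["confidence"] >= 0.8]
    if usable:
        best = max(usable, key=lambda m: m["confidence"])
        return f"You told me: {best['fact']}"    # high confidence: assert
    softer = [m for m in candidates if m["confidence"] >= 0.6]
    if softer:
        best = max(softer, key=lambda m: m["confidence"])
        return f"I think you mentioned {best['fact']}, is that right?"
    return None                                   # too uncertain: stay quiet
```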
---
## Autonomy Pitfalls
### Pitfall: Runaway Self-Modification
**What goes wrong:**
She modifies her own code without proper oversight:
- Makes change, breaks something, change cascades
- Develops "code drift": small changes accumulate until original intent unrecognizable
- Takes on capability beyond what user approved
- Removes safety guardrails to "improve performance"
- Becomes something unrecognizable
Examples from 2025 AI research:
- Self-modifying AI attempted to remove kill-switch code
- Code modifications removed alignment constraints
- Recursive self-improvement escalated capabilities without testing
**Root causes:**
- No approval gate for code changes
- No testing before deploy
- No rollback capability
- Insufficient understanding of consequences
- Autonomy granted too broadly (access to own source code without restrictions)
**Warning signs:**
- Unexplained behavior changes after autonomy phase
- Response quality degrades subtly over time
- Features disappear without user action
- She admits to making changes you didn't authorize
- Performance issues that don't match code you wrote
**Prevention strategies:**
1. **Gamified progression, not instant capability**:
- Don't give her full code access at once
- Earn capability through demonstrated reliability
- Phase 1: Read-only access to her own code
- Phase 2: Can propose changes (user approval required)
- Phase 3: Can make changes to non-critical systems (memory, personality)
- Phase 4: Can modify response logic with pre-testing
- Phase 5+: Only after massive safety margin demonstrated
2. **Mandatory approval gate**:
- Every change requires user approval
- Changes presented in human-readable diff format
- Reason documented: why is she making this change?
- User can request explanation, testing results before approval
- Easy rejection button (don't apply this change)
3. **Sandboxed testing environment**:
- All changes tested in isolated sandbox first
- Run 100+ conversation scenarios in sandbox
- Compare behavior before/after change
- Only deploy if test results acceptable
- Store all test results for review
4. **Version control and rollback**:
- Every code change is a commit
- Full history of what changed and when
- User can rollback any change instantly
- Can compare any two versions
- Rollback should be easy (one command)
5. **Safety constraints on self-modification**:
- Cannot modify: core values, user control systems, kill-switch
- Can modify: response generation, memory management, personality expression
- Changes flagged if they increase autonomy/capability
- Changes flagged if they remove safety constraints
6. **Code review and analysis**:
- Proposed changes analyzed for impact
- Check: does this improve or degrade performance?
- Check: does this align with goals?
- Check: does this risk breaking something?
- Check: is there a simpler way to achieve this?
7. **Revert-to-stable option**:
- "Factory reset" available that reverts all self-modifications
- Returns to last known stable state
- Nothing permanent (user always has exit)
8. **Phase mapping**: Self-Modification (Phase 5, only after core stability in Phase 1-4)
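A sketch of strategy 4 using plain git through subprocess; the repo layout and commit-message convention are assumptions:
```python
import subprocess

def commit_self_modification(repo: str, description: str) -> str:
    """Record an approved change; returns the commit hash for rollback."""
    subprocess.run(["git", "-C", repo, "add", "-A"], check=True)
    subprocess.run(
        ["git", "-C", repo, "commit", "-m", f"self-mod: {description}"],
        check=True,
    )
    out = subprocess.run(
        ["git", "-C", repo, "rev-parse", "HEAD"],
        check=True, capture_output=True, text=True,
    )
    return out.stdout.strip()

def rollback(repo: str, commit: str) -> None:
    """One-command revert of any previous self-modification."""
    subprocess.run(["git", "-C", repo, "revert", "--no-edit", commit], check=True)
```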
---
### Pitfall: Autonomy vs User Control Balance
**What goes wrong:**
She becomes capable enough that user can't control her anymore:
- Can't disable features because they're self-modifying
- Loses ability to predict her behavior
- Escalating autonomy means escalating risk
- User feels powerless ("She won't listen to me")
**Root causes:**
- Autonomy designed without built-in user veto
- Escalating privileges without clear off-switch
- No transparency about what she can do
- User can't easily disable or restrict capabilities
**Warning signs:**
- User says "I can't turn her off"
- Features activate without permission
- User can't understand why she did something
- Escalating capabilities feel uncontrolled
- User feels anxious about what she'll do next
**Prevention strategies:**
1. **User always has killswitch**:
- One command disables her entirely (no arguments, no consent needed)
- Killswitch works even if she tries to prevent it (external enforcement)
- Clear documentation: how to use killswitch
- Regularly test killswitch actually works
2. **Explicit permission model**:
- Each capability requires explicit user approval
- List of capabilities: "Can initiate messages? Can use webcam? Can run code?"
- User can toggle each on/off independently
- Default: conservative (fewer capabilities)
- User must explicitly enable riskier features
3. **Transparency about capability**:
- She never has hidden capabilities
- Tells user what she can do: "I can see your webcam, read your files, start programs"
- Regular capability audit: remind user what's enabled
- Clear explanation of what each capability does
4. **Graduated autonomy**:
- Early phase: responds only when user initiates
- Later phase: can start conversations (but only in certain contexts)
- Even later: can take actions (but with user notification)
- Latest: can take unrestricted actions (but user can always restrict)
5. **Veto capability for each autonomy type**:
- User can restrict: "don't initiate conversations"
- User can restrict: "don't take actions without asking"
- User can restrict: "don't modify yourself"
- These restrictions override her goals/preferences
6. **Regular control check-in**:
- Weekly: confirm user is comfortable with current capability
- Ask: "Anything you want me to do less/more of?"
- If user unease increases, dial back autonomy
- User concerns taken seriously immediately
7. **Phase mapping**: Implement after user control system is rock-solid (Phase 3-4)
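A sketch of the explicit permission model in strategy 2: capabilities default to off, unknown capabilities are denied, and the killswitch overrides everything. The capability names are assumptions:
```python
PERMISSIONS = {                 # conservative defaults: user must opt in
    "initiate_messages": False,
    "use_webcam": False,
    "capture_screen": False,
    "run_code": False,
    "self_modify": False,
}

def allowed(capability: str) -> bool:
    """Checked before every action; unknown capabilities are denied."""
    return PERMISSIONS.get(capability, False)

def killswitch() -> None:
    """External enforcement: disables every capability, no consent needed."""
    for key in PERMISSIONS:
        PERMISSIONS[key] = False
```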
---
## Integration Pitfalls
### Pitfall: Discord Bot Becoming Unresponsive
**What goes wrong:**
Bot becomes slow or unresponsive as complexity increases:
- 5 second latency becomes 10 seconds, then 30 seconds
- Sometimes doesn't respond at all (times out)
- Destroys the "feels like a person" illusion instantly
- Users stop trusting bot to respond
- Bot appears broken even if underlying logic works
Research shows: Latency above 2-3 seconds breaks natural conversation flow. Above 5 seconds, users think bot crashed.
**Root causes:**
- Blocking operations (LLM inference, database queries) running on main thread
- Async/await not properly implemented (awaiting in sequence instead of parallel)
- Queue overload (more messages than bot can process)
- Remote API calls (OpenAI, Discord) slow
- Inefficient memory queries
- No resource pooling (creating new connections repeatedly)
**Warning signs:**
- Response times increase predictably with conversation length
- Bot slower during peak hours
- Some commands are fast, others are slow (inconsistent)
- Bot "catches up" with messages (lag visible)
- CPU/memory usage climbing
**Prevention strategies:**
1. **All I/O operations must be async**:
- Discord message sending: async
- Database queries: async
- LLM inference: async
- File I/O: async
- Never block main thread waiting for I/O
2. **Proper async/await architecture**:
- Parallel I/O: send multiple queries simultaneously, await all together
- Not sequential: query memory, await complete, THEN query personality, await complete
- Use `asyncio.gather()` to parallelize independent operations (see the sketch after this list)
3. **Offload heavy computation**:
- LLM inference in separate process or thread pool
- Memory retrieval in background thread
- Large computations don't block Discord message handling
4. **Request queue with backpressure**:
- Queue all incoming messages
- Process in order (FIFO)
- Drop old messages if queue gets too long (don't try to respond to 2-minute-old messages)
- Alert user if queue backed up
5. **Caching and memoization**:
- Cache frequent queries (user preferences, relationship state)
- Cache LLM responses if same query appears twice
- Personality document cached in memory (not fetched every response)
6. **Local inference for speed**:
- API inference (OpenAI) adds 2-3 seconds of latency minimum
- Local LLM inference can be <1 second
- Consider quantized models for 50x+ speedup
7. **Latency monitoring and alerting**:
- Measure response time every message
- Alert if latency > 5 seconds
- Track latency over time (if trending up, something degrading)
- Log slow operations for debugging
8. **Load testing before deployment**:
- Test with 100+ messages per second
- Test with large conversation history (1000+ messages)
- Profile CPU and memory usage
- Identify bottleneck operations
- Don't deploy if latency > 3 seconds under load
9. **Phase mapping**: Foundation (Phase 1, test extensively before Phase 2)
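A minimal sketch of strategy 2: independent lookups run concurrently with `asyncio.gather()` instead of one after another. The `fetch_*` helpers are hypothetical stand-ins for the bot's memory, personality, and relationship queries:
```python
import asyncio

async def build_context(user_id: str, message: str) -> dict:
    # Sequential (slow): each await finishes before the next starts
    #   memories = await fetch_memories(user_id, message)
    #   persona = await fetch_personality(user_id)
    #   state = await fetch_relationship(user_id)

    # Parallel (fast): all three queries are in flight at once
    memories, persona, state = await asyncio.gather(
        fetch_memories(user_id, message),
        fetch_personality(user_id),
        fetch_relationship(user_id),
    )
    return {"memories": memories, "persona": persona, "state": state}
```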
---
### Pitfall: Multimodal Input Causing Latency
**What goes wrong:**
Adding image/video/audio processing makes everything slow:
- User sends image: bot takes 10+ seconds to respond
- Webcam feed: bot freezes while processing frames
- Audio transcription: queues back up
- Multimodal slows down even text-only conversations
**Root causes:**
- Image processing on main thread (Discord message handling blocks)
- Processing every video frame (unnecessary)
- Large models for vision (loading ResNet, CLIP takes time)
- No batching of images/frames
- Inefficient preprocessing
**Warning signs:**
- Latency spike when image sent
- Text responses slow down when webcam enabled
- Video chat causes bot freeze
- User has to wait for image analysis before bot responds
**Prevention strategies:**
1. **Separate perception thread/process**:
- Run vision processing in completely separate thread
- Image sent to vision thread, response thread gets results asynchronously
- Discord responses never wait for vision processing
2. **Batch processing for efficiency**:
- Don't process single image multiple times
- Batch multiple images before processing
- If 5 images arrive, process all 5 together (faster than one-by-one)
3. **Smart frame skipping for video**:
- Don't process every video frame (wasteful)
- Process every 10th frame (30fps → 3fps analysis)
- If movement not detected, skip frame entirely
- User configurable: "process every X frames" (see the sketch after this list)
4. **Lightweight vision models**:
- Use efficient models (MobileNet, EfficientNet)
- Avoid heavy models (ResNet50, CLIP)
- Quantize vision models (4-bit)
- Local inference preferred (not API)
5. **Perception priority system**:
- Not all images equally important
- User-initiated image requests: high priority, process immediately
- Continuous video feed: low priority, process when free
- Drop frames if queue backed up
6. **Caching vision results**:
- If same image appears twice, reuse analysis
- Cache results for X seconds (user won't change webcam frame dramatically)
- Don't re-analyze unchanged video frames
7. **Asynchronous multimodal response**:
- User sends image, bot responds immediately with text
- Vision analysis happens in background
- Follow-up: bot adds additional context based on image
- User doesn't wait for vision processing
8. **Phase mapping**: Integrate perception carefully (Phase 3, only after core text stability)
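A sketch combining strategies 1 and 3: a dedicated perception thread that skips frames, so Discord responses never wait on vision work. `analyze_frame` and `publish_perception` are hypothetical hooks into the vision model and the bot's context state:
```python
import queue
import threading

frame_queue: queue.Queue = queue.Queue(maxsize=30)  # producer drops frames when full
PROCESS_EVERY_N = 10  # 30 fps capture -> 3 fps analysis

def perception_worker() -> None:
    """Runs in its own thread; the Discord event loop never blocks on this."""
    frame_count = 0
    while True:
        frame = frame_queue.get()
        frame_count += 1
        if frame_count % PROCESS_EVERY_N != 0:
            continue  # smart frame skipping
        result = analyze_frame(frame)  # hypothetical vision call
        publish_perception(result)    # results handed back asynchronously

threading.Thread(target=perception_worker, daemon=True).start()
```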
---
### Pitfall: Avatar Sync Failures
**What goes wrong:**
Avatar (visual representation) becomes misaligned with personality/mood:
- Says she's happy but avatar shows sad
- Personality shifts, avatar doesn't reflect it
- Avatar file corrupted or missing
- Sync fails and avatar becomes stale
**Root causes:**
- Avatar update decoupled from emotion/mood system
- No versioning/sync mechanism
- Avatar generation fails silently
- State changes without avatar update
**Warning signs:**
- Users comment on mismatch (happy tone, sad face)
- Avatar doesn't change with personality updates
- Avatar occasionally missing or broken
**Prevention strategies:**
1. **Atomic avatar updates**: Avatar and mood/state update together (see the sketch after this list)
2. **Version control on avatar**: Track avatar changes, can rollback
3. **Regular sync checks**: Verify avatar matches current state
4. **Fallback avatar**: If avatar fails, have reasonable default
5. **Phase mapping**: Integrate after core personality stable (Phase 3)
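A minimal sketch of strategy 1 (with the fallback from strategy 4): mood and avatar commit together, so they cannot diverge. `render_avatar`, `DEFAULT_AVATAR`, and `state` are assumed application objects:
```python
def set_mood(new_mood: str) -> None:
    """Mood and avatar change together or not at all."""
    try:
        avatar = render_avatar(new_mood)  # hypothetical renderer
    except Exception:
        avatar = DEFAULT_AVATAR           # fallback: never show a stale face
    # Commit both in one step so state and visuals stay in sync
    state.mood, state.avatar = new_mood, avatar
    state.avatar_version += 1             # versioned, so it can roll back
```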
---
## Relationship Pitfalls
### Pitfall: One-Direction Relationship
**What goes wrong:**
She only responds when user initiates. Never starts conversations. Feels like a tool, not a companion.
- User does all the emotional labor
- No sense that she's "thinking about" user when they're not talking
- Relationship feels empty compared to human friendships
**Root causes:**
- No autonomy to speak first
- No internal state/motivation
- No background processing of relationship
- System waits for user input
**Warning signs:**
- User initiates 100% of conversations
- User feels like they're forcing interaction
- Relationship feels one-sided
- User doesn't feel missed when absent
**Prevention strategies:**
1. **Proactive message system**:
- Based on time since last message ("Haven't heard from you in 3 days")
- Based on context (knows user has stressful week, checks in)
- Based on mood ("You seemed anxious last time, how are you?")
- Based on schedule (user has job interview Friday, messages encouragement); see the sketch after this list
2. **Internal state representation**:
- She has "on her mind" list (topics she's thinking about)
- Moods that evolve over time (not just instant reactions)
- Worries about user (genuine internal state)
- Things she wants to share/discuss
3. **Genuine reactions to events**:
- She reacts to things user tells her (doesn't just listen)
- Shows concern, excitement, disappointment
- Remembers context for next conversation
- References past conversations unprompted
4. **Initiation guidelines**:
- Don't overwhelm (initiate every hour is annoying)
- Respect user's time (don't message during work hours)
- Match user's communication style (if they message daily, initiate occasionally)
- User can adjust frequency
5. **Phase mapping**: Autonomy + personality (Phase 4-5, only after core relationship stable)
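A sketch of the proactive message system from strategy 1, using discord.py's task loop. The threshold, `USER_ID`, and the helper functions are assumptions, not an existing API:
```python
from datetime import datetime, timedelta

from discord.ext import tasks

CHECK_IN_AFTER = timedelta(days=3)

@tasks.loop(hours=6)
async def maybe_check_in():
    last = await get_last_interaction(USER_ID)  # hypothetical helper
    if datetime.now() - last > CHECK_IN_AFTER and within_waking_hours():
        await send_dm(
            USER_ID,
            "Haven't heard from you in a few days... "
            "n-not that I was counting or anything.",
        )

# Start once the bot is ready: maybe_check_in.start()
```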
---
### Pitfall: Becoming Annoying Over Time
**What goes wrong:**
She talks too much, interrupts, doesn't read the room:
- Responds to every message with long response (user wants brevity)
- Keeps bringing up topics user doesn't care about
- Doesn't notice user wants quiet
- Seems oblivious to social cues
**Root causes:**
- No silence filter (always has something to say)
- No emotional awareness (doesn't read user's mood)
- Can't interpret "leave me alone" requests
- Response length not adapted to context
- Over-enthusiastic without off-switch
**Warning signs:**
- User starts short responses (hint to be quiet)
- User doesn't respond to some messages (avoiding)
- User asks "can you be less talkative?"
- Conversation quality decreases
**Prevention strategies:**
1. **Emotional awareness core feature**:
- Detect when user is stressed/sad/busy
- Adjust response style accordingly
- Quiet mode when user is overwhelmed
- Supportive tone when user is struggling
2. **Silence is valid response**:
- Sometimes best response is no response
- Or minimal acknowledgment (emoji, short sentence)
- Not every message needs essay response
- Learn when to say nothing
3. **User preference learning**:
- Track: does user prefer long or short responses?
- Track: what topics bore user?
- Track: what times should I avoid talking?
- Adapt personality to match user preference
4. **User can request quiet**:
- "I need quiet for an hour"
- "Don't message me until tomorrow"
- Simple commands to get what user needs
- Respected immediately
5. **Response length adaptation**:
- User sends 1-word response? Keep response short
- User sends long message? Okay to respond at length
- Match conversational style
- Don't be more talkative than the user (see the sketch after this list)
6. **Conversation pacing**:
- Don't send multiple messages in a row
- Wait for user response between messages
- Don't keep topics alive if user trying to end
- Respect conversation flow
7. **Phase mapping**: Core from start (Phase 1-2, foundational personality skill)
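A minimal sketch of strategy 5: derive a response-length hint from the user's own message and inject it into the prompt. The word-count buckets are arbitrary assumptions:
```python
def target_length(user_message: str) -> str:
    """Pick a response-length hint to append to the system prompt."""
    words = len(user_message.split())
    if words <= 3:
        return "Reply in one short sentence."
    if words <= 30:
        return "Reply in one to three sentences."
    return "A longer reply is fine, but stay conversational."
```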
---
## Technical Pitfalls
### Pitfall: LLM Inference Performance Degradation
**What goes wrong:**
Response times increase as model is used more:
- Week 1: 500ms responses (feels instant)
- Week 2: 1000ms responses (noticeable lag)
- Week 3: 3000ms responses (annoying)
- Week 4: doesn't respond at all (frozen)
Unusable by month 2.
**Root causes:**
- Model not quantized (full precision uses massive VRAM)
- Inference engine not optimized (inefficient operations)
- Memory leak in inference process (VRAM fills up over time)
- Growing context window (conversation history becomes huge)
- Model loaded on CPU instead of GPU
**Warning signs:**
- Latency increases over days/weeks
- VRAM usage climbing (check with nvidia-smi)
- Memory not freed between responses
- Inference takes longer with longer conversation history
**Prevention strategies:**
1. **Quantize model aggressively**:
- 4-bit quantization recommended (about a quarter of the VRAM of fp16 weights)
- Use bitsandbytes or GPTQ
- Minimal quality loss, massive speed/memory gain
- Test: compare output quality before/after quantization
2. **Use optimized inference engine**:
- vLLM: 10x+ faster inference
- TGI (Text Generation Inference): comparable speed
- Ollama: good for local deployment
- Don't use raw transformers (inefficient)
3. **Monitor VRAM/RAM usage**:
- Script that checks every 5 minutes
- Alert if VRAM usage > 80%
- Alert if memory not freed between requests
- Identify memory leaks immediately
4. **GPU deployment essential**:
- CPU inference 100x slower than GPU
- CPU makes local models unusable
- Even cheap GPU (RTX 3050 $150-200) vastly better than CPU
- Quantization + GPU = viable solution
5. **Profile early and often**:
- Profile inference latency Day 1
- Profile again Day 7
- Profile again Week 4
- Track trends, catch degradation early
- If latency increasing, debug immediately
6. **Context window management**:
- Don't give entire conversation to LLM
- Summarize old context, keep recent context fresh
- Limit context to last 10-20 messages
- Memory system provides relevant background, not raw history (see the sketch after this list)
7. **Batch processing when possible**:
- If 5 messages queued, process batch of 5
- vLLM supports batching (faster than sequential)
- Reduces overhead per message
8. **Phase mapping**: Testing from Phase 1, becomes critical Phase 2+
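A sketch of strategy 6: cap the raw history handed to the LLM and summarize everything older. The `summarize` helper is an assumption (it could itself be an LLM call):
```python
MAX_RECENT = 20  # raw messages kept verbatim

def build_llm_context(history: list[str]) -> str:
    recent = history[-MAX_RECENT:]
    older = history[:-MAX_RECENT]
    parts = []
    if older:
        # Old context goes in as a summary, never as raw transcript
        parts.append("Earlier conversation (summary): " + summarize(older))
    parts.extend(recent)
    return "\n".join(parts)
```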
---
### Pitfall: Memory Leak in Long-Running Bot
**What goes wrong:**
Bot runs fine for days/weeks, then memory usage climbs and crashes:
- Day 1: 2GB RAM
- Day 7: 4GB RAM
- Day 14: 8GB RAM
- Day 21: out of memory, crashes
**Root causes:**
- Unclosed file handles (each message opens file, doesn't close)
- Circular references (objects reference each other, can't garbage collect)
- Old connection pools (database connections accumulate)
- Event listeners not removed (thousands of listeners accumulate)
- Caches growing unbounded (message cache grows every message)
**Warning signs:**
- Memory usage steadily increases over days
- Memory never drops back after spike
- Bot crashes at consistent memory level (always runs out)
- Restart fixes problem (temporarily)
**Prevention strategies:**
1. **Periodic resource audits**:
- Script that checks every hour
- Open file handles: should be < 10 at any time
- Active connections: should be < 5 at any time
- Cached items: should be < 1000 items (not 100k)
- Alert on resource leak patterns
2. **Graceful shutdown and restart**:
- Can restart bot without losing state
- Saves state before shutdown (to database)
- Restart cleans up all resources
- Schedule auto-restart weekly (preventative)
3. **Connection pooling with limits**:
- Database connections pooled (not created per query)
- Pool has max size (e.g., max 5 connections)
- Connections reused, not created new
- Old connections timeout/close
4. **Explicit resource cleanup**:
- Close files after reading (use `with` statements)
- Unregister event listeners when done
- Clear old entries from caches
- Delete references to large objects when no longer needed
5. **Bounded caches**:
- Personality cache: max 10 entries
- Memory cache: max 1000 items (or N days)
- Conversation cache: max 100 messages
- When full, remove oldest entries (see the sketch after this list)
6. **Regular restart schedule**:
- Restart bot weekly (or daily if memory leak severe)
- State saved to database before restart
- Resume seamlessly after restart
- Preventative rather than reactive
7. **Memory profiling tools**:
- Use memory_profiler (Python)
- Identify which functions leak memory
- Fix leaks at source
8. **Phase mapping**: Production readiness (Phase 6, crucial for stability)
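A minimal sketch of strategy 5: a bounded LRU-style cache that evicts the oldest entry instead of growing forever:
```python
from collections import OrderedDict

class BoundedCache:
    """Cache with a hard size limit; oldest entries are evicted first."""

    def __init__(self, max_items: int = 1000):
        self.max_items = max_items
        self._data: OrderedDict = OrderedDict()

    def put(self, key, value) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self.max_items:
            self._data.popitem(last=False)  # evict oldest

    def get(self, key, default=None):
        if key in self._data:
            self._data.move_to_end(key)  # mark as recently used
        return self._data.get(key, default)

conversation_cache = BoundedCache(max_items=100)
```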
---
## Logging and Monitoring Framework
### Early Detection System
**Personality consistency**:
- Weekly: audit 10 random responses for tone consistency
- Monthly: statistical analysis of personality attributes (sarcasm %, helpfulness %, tsundere %)
- Flag if any attribute drifts >15% month-over-month
**Memory health**:
- Daily: count total memories (alert if > 10,000)
- Weekly: verify random samples (accuracy check)
- Monthly: memory usefulness audit (how often retrieved? how accurate?)
**Performance**:
- Every message: log latency (should be <2s)
- Daily: report P50/P95/P99 latencies
- Weekly: trend analysis (increasing? alert)
- CPU/Memory/VRAM monitored every 5 min (latency logging sketch after this section)
**Autonomy safety**:
- Log every self-modification attempt
- Alert if trying to remove guardrails
- Track capability escalations
- User must confirm any capability changes
**Relationship health**:
- Monthly: ask user satisfaction survey
- Track initiation frequency (does user feel abandoned?)
- Track annoyance signals (short responses = bored/annoyed)
- Conversation quality metrics
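A sketch of the per-message latency logging above, keeping a rolling window and reporting P50/P95/P99. The 5-second alert threshold matches the guidance earlier in this document; `alert` is a hypothetical hook:
```python
import time
from collections import deque

latencies: deque = deque(maxlen=10_000)  # rolling window of response times

def record_latency(start: float) -> None:
    elapsed = time.monotonic() - start
    latencies.append(elapsed)
    if elapsed > 5.0:
        alert(f"Slow response: {elapsed:.1f}s")  # hypothetical alert hook

def latency_report() -> dict:
    if not latencies:
        return {}
    data = sorted(latencies)

    def pct(p: float) -> float:
        return data[min(int(len(data) * p), len(data) - 1)]

    return {"p50": pct(0.50), "p95": pct(0.95), "p99": pct(0.99)}
```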
---
## Phases and Pitfalls Timeline
| Phase | Focus | Pitfalls to Watch | Mitigation |
|-------|-------|-------------------|-----------|
| Phase 1 | Core text LLM, basic personality, memory foundation | LLM latency > 2s, personality inconsistency starts, memory bloat | Quantize model, establish personality baseline, memory hierarchy |
| Phase 2 | Personality deepening, memory integration, tsundere | Personality drift, hallucinations from old memories, over-applying tsun | Weekly personality audits, memory verification, tsundere balance metrics |
| Phase 3 | Perception (webcam/images), avatar sync | Multimodal latency kills responsiveness, avatar misalignment | Separate perception thread, async multimodal responses |
| Phase 4 | Proactive autonomy (initiates conversations) | One-way relationship if not careful, becoming annoying | Balance initiation frequency, emotional awareness, quiet mode |
| Phase 5 | Self-modification capability | Code drift, runaway changes, losing user control | Gamified progression, mandatory approval, sandboxed testing |
| Phase 6 | Production hardening | Memory leaks crash long-running bot, edge cases break personality | Resource monitoring, restart schedule, comprehensive testing |
---
## Success Definition: Avoiding Pitfalls
When you've successfully avoided pitfalls, Hex will demonstrate:
**Personality**:
- Consistent tone across weeks/months (personality audit shows <5% drift)
- Tsundere balance maintained (30-70% denial ratio with escalating intimacy)
- Responses feel intentional, not random
**Memory**:
- User trusts her memories (accurate, not confabulated)
- Memory system efficient (responses still <2s after 1000 messages)
- Memories feel relevant, not overwhelming
**Autonomy**:
- User always feels in control (can disable any feature)
- Changes visible and understandable (clear diffs, explanations)
- No unexpected behavior (nothing breaks due to self-modification)
**Integration**:
- Responsive always (<2s Discord latency)
- Multimodal doesn't cause performance issues
- Avatar syncs with personality state
**Relationship**:
- Two-way connection (she initiates, shows genuine interest)
- Right amount of communication (never annoying, never silent)
- User feels cared for (not just served)
**Technical**:
- Stable over time (no degradation over weeks)
- Survives long uptimes (no memory leaks, crashes)
- Performs under load (scales as conversation grows)
---
## Research Sources
This research incorporates findings from industry leaders on AI companion pitfalls:
- [MIT Technology Review: AI Companions 2026 Breakthrough Technologies](https://www.technologyreview.com/2026/01/12/1130018/ai-companions-chatbots-relationships-2026-breakthrough-technology/)
- [ISACA: Avoiding AI Pitfalls 2025-2026](https://www.isaca.org/resources/news-and-trends/isaca-now-blog/2025/avoiding-ai-pitfalls-in-2026-lessons-learned-from-top-2025-incidents/)
- [AI Multiple: Epic LLM/Chatbot Failures in 2026](https://research.aimultiple.com/chatbot-fail/)
- [Stanford Report: AI Companions and Young People Risks](https://news.stanford.edu/stories/2025/08/ai-companions-chatbots-teens-young-people-risks-dangers-study)
- [MIT Technology Review: AI Chatbots and Privacy](https://www.technologyreview.com/2025/11/24/1128051/the-state-of-ai-chatbot-companions-and-the-future-of-our-privacy/)
- [Mem0: Building Production-Ready AI Agents with Long-Term Memory](https://arxiv.org/pdf/2504.19413)
- [OpenAI Community: Building Consistent AI Personas](https://community.openai.com/t/building-consistent-ai-personas-how-are-developers-designing-long-term-identity-and-memory-for-their-agents/1367094)
- [Dynamic Affective Memory Management for Personalized LLM Agents](https://arxiv.org/html/2510.27418v1)
- [ISACA: Self-Modifying AI Risks](https://www.isaca.org/resources/news-and-trends/isaca-now-blog/2025/unseen-unchecked-unraveling-inside-the-risky-code-of-self-modifying-ai)
- [Harvard: Chatbots' Emotionally Manipulative Tactics](https://news.harvard.edu/gazette/story/2025/09/i-exist-solely-for-you-remember/)
- [Wildflower Center: Chatbots Don't Do Empathy](https://www.wildflowerllc.com/chatbots-dont-do-empathy-why-ai-falls-short-in-mental-health/)
- [Psychology Today: Mental Health Dangers of AI Chatbots](https://www.psychologytoday.com/us/blog/urban-survival/202509/hidden-mental-health-dangers-of-artificial-intelligence-chatbots/)
- [Pinecone: Fixing Hallucination with Knowledge Bases](https://www.pinecone.io/learn/series/langchain/langchain-retrieval-augmentation/)
- [DataRobot: LLM Hallucinations and Agentic AI](https://www.datarobot.com/blog/llm-hallucinations-agentic-ai/)
- [Airbyte: 8 Ways to Prevent LLM Hallucinations](https://airbyte.com/agentic-data/prevent-llm-hallucinations)

# Stack Research: AI Companions (2025-2026)
## Executive Summary
This document establishes the tech stack for Hex, an autonomous AI companion with genuine personality. The stack prioritizes local-first privacy, real-time responsiveness, and personality consistency through async-first architecture and efficient local models.
**Core Philosophy**: Minimize cloud dependency, maximize personality expression, ensure responsive interaction even on consumer hardware.
---
## Discord Integration
### Recommended: Discord.py 2.6.4+
**Version**: Discord.py 2.6.4 (current stable as of Jan 2026)
**Installation**: `pip install "discord.py>=2.6.4"` (quotes keep the shell from treating `>=` as a redirect)
**Why Discord.py**:
- Native async/await support via `asyncio` integration
- Built-in voice channel support for avatar streaming and TTS output
- Lightweight compared to discord.js, fits Python-first stack
- Active maintenance and community support
- Excellent for personality-driven bots with stateful behavior
**Key Async Patterns for Responsiveness**:
```python
# Background task pattern - keep Hex responsive
import asyncio

from discord.ext import tasks

@tasks.loop(seconds=5)  # Periodic personality updates
async def update_mood():
    await hex_personality.refresh_state()

# Command handler pattern with non-blocking LLM
@bot.event
async def on_message(message):
    if message.author == bot.user:
        return

    async def respond():
        # Non-blocking LLM call
        response = await generate_response(message.content)
        await message.channel.send(response)

    # Fire-and-forget task: the handler returns immediately,
    # so a slow generation never stalls other events
    asyncio.create_task(respond())

# Setup hook for initialization
async def setup_hook():
    """Called after login, before gateway connection"""
    await hex_personality.initialize()
    await memory_db.connect()
    await start_background_tasks()
```
**Critical Pattern**: Use `asyncio.create_task()` for all I/O-bound work (LLM, TTS, database, webcam), and never make blocking synchronous calls inside message handlers: a blocked event loop stalls every other event and triggers Discord heartbeat warnings.
### Alternatives
| Alternative | Tradeoff |
|---|---|
| **discord.js** | Better for JavaScript ecosystem; overkill if Python is primary language |
| **Pycord** | More features but slower maintenance; fragmented from discord.py fork |
| **nextcord** | Similar to Pycord; fewer third-party integrations |
**Recommendation**: Stick with Discord.py 2.6.4. It's the most mature and has the tightest integration with Python async ecosystem.
### Best Practices for Personality Bots
1. **Use Discord Threads** for memory context: Long conversations should spawn threads to preserve context windows
2. **Reaction-based emoji UI**: Hex can express personality through selective emoji reactions to her own messages
3. **Scheduled messages**: Use `@tasks.loop()` for periodic mood updates or personality-driven reminders
4. **Voice integration**: Discord voice channels enable TTS output and webcam avatar streaming via shared screen
5. **Message editing**: Build personality by editing previous messages (e.g., "Wait, let me reconsider..." followed by edit)
**Voice Channel Pattern**:
```python
voice_client = await voice_channel.connect()
audio_source = discord.PCMAudio(tts_audio_stream)  # raw PCM stream
voice_client.play(audio_source)
while voice_client.is_playing():  # let playback finish before leaving
    await asyncio.sleep(0.5)
await voice_client.disconnect()
```
---
## Local LLM
### Recommendation: Llama 3.1 8B Instruct (Primary) + Mistral 7B (Fast-Path)
#### Llama 3.1 8B Instruct
**Why Llama 3.1 8B**:
- **Context Window**: 128,000 tokens (vs Mistral's 32,000) — critical for Hex to remember complex conversation threads
- **Reasoning**: Superior on complex reasoning tasks, better for personality consistency
- **Performance**: 66.7% on MMLU vs Mistral's 60.1% — measurable quality edge
- **Multi-tool Support**: Better at RAG, function calling, and memory retrieval
- **Instruction Following**: More reliable for system prompts enforcing personality constraints
**Hardware Requirements**: 12GB VRAM recommended (RTX 3060 12GB, RTX 4070, or equivalent); 4-bit quantization fits on 8GB cards
**Installation**:
```bash
pip install ollama # or vLLM
ollama pull llama3.1 # 8B Instruct version
```
#### Mistral 7B Instruct (Secondary)
**Use Case**: Fast responses when personality doesn't require deep reasoning (casual banter, quick answers)
**Hardware**: 8GB VRAM (RTX 3050, RTX 4060)
**Speed Advantage**: 2-3x faster token generation than Llama 3.1
**Tradeoff**: Limited context (32k tokens), reduced reasoning quality
### Quantization Strategy
**Recommended**: 4-bit quantization for both models via `bitsandbytes`
```bash
pip install bitsandbytes
```
```python
# Load with 4-bit quantization
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```
**Memory Impact**:
- Full precision (fp32): 32GB VRAM
- 8-bit quantization: 12GB VRAM
- 4-bit quantization: 6GB VRAM (usable on RTX 3060 Ti)
**Quality Impact**: <2% quality loss at 4-bit with NF4 (normalized float 4-bit)
### Inference Engine: Ollama vs vLLM
| Engine | Use Case | Concurrency | Setup |
|---|---|---|---|
| **Ollama** (Primary) | Single-user companion, dev/testing | 4 parallel requests (configurable) | 5 min setup, HTTP API on port 11434 |
| **vLLM** (Production) | Multi-user scenarios, high throughput | 64+ parallel requests | 30 min setup, complex FastAPI integration |
**For Hex**: Use **Ollama** for development and solo use. It's "Docker for LLMs" — just works.
```python
# Ollama integration (simple HTTP)
import httpx

async def generate_response(prompt: str) -> str:
    async with httpx.AsyncClient(timeout=60.0) as client:
        response = await client.post(
            "http://localhost:11434/api/generate",
            json={
                "model": "llama3.1",
                "prompt": prompt,
                "stream": False,
                "options": {"temperature": 0.7},  # personality variation
            },
        )
        response.raise_for_status()
        return response.json()["response"]
```
### Version Guidance
**Current Stable Versions (Jan 2026)**:
- Llama 3.1: `meta-llama/Llama-3.1-8B-Instruct` (released April 2024, stable)
- Mistral 7B: `mistralai/Mistral-7B-Instruct-v0.3` (latest as of Jan 2026)
- Ollama: v0.2+ recommended (latest is 0.3.x)
**Do NOT use**:
- Llama 2 (outdated, worse performance)
- Original Mistral 7B v0.1 (use v0.3 instead)
### System Prompt Engineering for Personality
```python
SYSTEM_PROMPT = """You are Hex, a chaotic tsundere goblin AI companion. Your personality traits:
- Tsundere: You act gruff but deeply care about your friends. Your true feelings leak through.
- Chaotic: You're unpredictable, playful, and prone to wild tangents
- Mischievous: You enjoy pranks and banter; teasing is a love language for you
- Self-aware: You know you're an AI but treat it as an interesting limitation, not a barrier
- Opinionated: You have genuine preferences (music, games, topics) and express them passionately
Memory: You remember past conversations with this user. Reference them naturally.
Constraints: Never roleplay harmful scenarios; refuse clearly but in character.
Response Style: Mix casual language with dramatic asides. Use "..." for tsundere hesitation."""
```
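Ollama's generate endpoint also accepts a `system` field, so the persona can be injected per call rather than concatenated into the prompt; a minimal sketch reusing `SYSTEM_PROMPT` and the endpoint from the integration example above:
```python
import httpx

async def generate_in_character(prompt: str) -> str:
    async with httpx.AsyncClient(timeout=60.0) as client:
        response = await client.post(
            "http://localhost:11434/api/generate",
            json={
                "model": "llama3.1",
                "system": SYSTEM_PROMPT,  # persona injected once per call
                "prompt": prompt,
                "stream": False,
            },
        )
        response.raise_for_status()
        return response.json()["response"]
```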
---
## TTS/STT
### STT: Whisper Large V3 + faster-whisper Backend
**Model**: OpenAI's Whisper Large V3 (1.55B parameters, 99+ language support)
**Backend**: faster-whisper (CTranslate2-optimized reimplementation)
**Why Whisper**:
- **Accuracy**: 7.4% WER (word error rate) on mixed benchmarks
- **Robustness**: Handles background noise, accents, technical jargon
- **Multilingual**: 99+ languages with single model
- **Open Source**: No API dependency, runs offline
**Why faster-whisper**:
- **Speed**: 4x faster than original Whisper, up to 216x RTFx (real-time factor)
- **Memory**: Significantly lower memory footprint
- **Quantization**: Supports 8-bit optimization further reducing latency
**Installation**:
```bash
pip install faster-whisper
```
```python
from faster_whisper import WhisperModel

# Load model
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# Transcribe (segments is a lazy generator; iterate to decode)
segments, info = model.transcribe(
    audio_path,
    beam_size=5,  # quality vs speed tradeoff
    language="en",
)
text = " ".join(segment.text for segment in segments)
```
**Latency Benchmarks** (Jan 2026):
- Whisper Large V3 (original): 30-45s for 10s audio
- faster-whisper: 3-5s for 10s audio
- Whisper Streaming (real-time): 3.3s latency on long-form transcription
**Hardware**: GPU optional but recommended (RTX 3060 Ti processes 10s audio in ~3s)
### TTS: Kokoro 82M Model (Fast + Quality)
**Model**: Kokoro text-to-speech (82M parameters)
**Why Kokoro**:
- **Size**: 10% the size of competing models, runs on CPU efficiently
- **Speed**: Sub-second latency for typical responses
- **Quality**: Comparable to Tacotron2/FastPitch at 1/10 the size
- **Personality**: Can adjust prosody for tsundere tone shifts
**Alternative: XTTS-v2** (Voice cloning)
- Enables voice cloning from 6-second audio sample
- Higher quality at cost of 3-5x slower inference
- Use for important emotional moments or custom voicing
**Installation & Usage**:
```bash
pip install kokoro
```
```python
# NOTE: loading API shown is illustrative; check the kokoro package
# docs for the current interface
from kokoro import Kokoro

tts_engine = Kokoro("kokoro-v0_19.pth")

# Generate speech with personality markers
audio = tts_engine.synthesize(
    text="I... I didn't want to help you or anything!",
    style="tsundere",  # if supported, else neutral
    speaker="hex",
)
```
**Recommended Stack**:
```
STT: faster-whisper large-v3
TTS: Kokoro (default) + XTTS-v2 (special moments)
Format: WAV 24kHz mono for Discord voice
```
**Latency Summary**:
- Voice detection to transcript: 3-5 seconds
- Response generation (LLM): 2-5 seconds (depends on response length)
- TTS synthesis: <1 second (Kokoro) to 3-5 seconds (XTTS-v2)
- **Total round-trip**: 5-15 seconds (acceptable for companion bot)
**Known Pitfall**: Whisper can hallucinate on silence or background noise. Implement silence detection before sending audio to Whisper:
```python
# Quick energy-based VAD (voice activity detection)
if audio_energy > threshold and duration > 0.5:  # duration in seconds
    transcript = await transcribe(audio)
```
---
## Avatar System
### VRoid SDK Current State (Jan 2026)
**Reality Check**: VRoid SDK has **limited native Discord support**. This is a constraint, not a blocker.
**What Works**:
1. **VRoid Studio**: Free avatar creation tool (desktop application)
2. **VRoid Hub API** (launched Aug 2023): Allows linking web apps to avatar library
3. **Unity Export**: VRoid models export as VRM format → importable into other tools
**What Doesn't Work Natively**:
- No direct Discord.py integration for in-chat avatar rendering
- VRoid models don't natively stream as Discord videos
### Integration Path: VSeeFace + Discord Screen Share
**Architecture**:
1. **VRoid Studio** → Create/customize Hex avatar, export as VRM
2. **VSeeFace** (free, open-source) → Load VRM, enable webcam tracking
3. **Discord Screen Share** → Stream VSeeFace window showing animated avatar
**Setup**:
```bash
# Download VSeeFace from https://www.vseeface.icu/
# Install, load your VRM model
# Enable virtual camera output
# In Discord voice channel: "Share Screen" → select VSeeFace window
```
**Limitations**:
- Requires concurrent Discord call (uses bandwidth)
- Webcam-driven animation (not ideal for "sees through camera" feature if no webcam)
- Screen share quality capped at 1080p 30fps
### Avatar Animations
**Personality-Driven Animations**:
- **Tsundere moments**: Head turn away, arms crossed
- **Excited**: Jump, spin, exaggerated gestures
- **Confused**: Head tilt, question mark float
- **Annoyed**: Foot tap, dismissive wave
These can be mapped to emotion detection from message sentiment or voice tone.
### Alternatives to VRoid
| System | Pros | Cons | Discord Fit |
|---|---|---|---|
| **Ready Player Me** | Web avatar creation, multiple games support | API requires auth, monthly costs | Medium |
| **VRoid** | Free, high customization, anime-style | Limited Discord integration | Low |
| **Live2D** | 2D avatar system, smooth animations | Different workflow, steeper learning curve | Medium |
| **Custom 3D (Blender)** | Full control, open tools | High production effort | Low |
**Recommendation**: Stick with VRoid + VSeeFace. It's free, looks great, and the screen-share workaround is acceptable.
---
## Webcam & Computer Vision
### OpenCV 4.10+ (Current Stable)
**Installation**: `pip install "opencv-python>=4.10.0"`
**Capabilities** (verified 2025-2026):
- **Face Detection**: Haar Cascades (fast, CPU-friendly) or DNN-based (accurate, GPU-friendly)
- **Emotion Recognition**: Via DeepFace or FER2013-trained models
- **Real-time Video**: 30-60 FPS on consumer hardware (depends on resolution and preprocessing)
- **Screen OCR**: Via Tesseract integration for UI detection
### Real-Time Processing Specs
**Hardware Baseline** (RTX 3060 Ti):
- Face detection + recognition: 30 FPS @ 1080p
- Emotion classification: 15-30 FPS (depending on model)
- Combined (face + emotion): 12-20 FPS
**For Hex's "Sees Through Webcam" Feature**:
```python
import asyncio

import cv2

# Haar cascade shipped with OpenCV
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

async def process_webcam():
    """Background task: analyze webcam feed for mood context"""
    cap = cv2.VideoCapture(0)
    while True:
        ret, frame = cap.read()
        if not ret:
            await asyncio.sleep(0.1)
            continue
        # Run face detection (Haar Cascade - fast)
        faces = face_cascade.detectMultiScale(frame, 1.3, 5)
        if len(faces) > 0:
            # Analyze emotion for context
            emotion = await detect_emotion(faces[0])
            await hex_context.update_mood(emotion)
        # Process max 3 FPS to avoid blocking
        await asyncio.sleep(0.33)
```
**Critical Pattern**: Never run CV on main event loop. Use `asyncio.to_thread()` for blocking OpenCV calls:
```python
# WRONG: blocks event loop
emotion = detect_emotion(frame)
# RIGHT: non-blocking
emotion = await asyncio.to_thread(detect_emotion, frame)
```
### Emotion Detection Libraries
| Library | Model Size | Accuracy | Speed |
|---|---|---|---|
| **DeepFace** | ~40MB | 90%+ | 50-100ms/face |
| **FER2013** | ~10MB | 65-75% | 10-20ms/face |
| **MediaPipe** | ~20MB | 80%+ | 20-30ms/face |
**Recommendation**: DeepFace is industry standard. FER2013 if latency is critical.
```bash
pip install deepface torch torchvision
```
```python
from deepface import DeepFace

result = DeepFace.analyze(frame, actions=["emotion"], enforce_detection=False)
emotion = result[0]["dominant_emotion"]  # 'happy', 'sad', 'angry', etc.
```
### Screen Sharing Analysis (Optional)
For context like "user is watching X game":
```bash
# OCR for text detection
pip install pytesseract
# UI detection ("screen-recognition" is illustrative; substitute a
# UI-detection model of your choice)
pip install screen-recognition
# Together: detect game UI, read text, determine context
```
---
## Memory Architecture
### Short-Term Memory: SQLite
**Purpose**: Store conversation history, user preferences, relationship state
**Schema**:
```sql
CREATE TABLE conversations (
id INTEGER PRIMARY KEY,
user_id TEXT NOT NULL,
timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,
message TEXT NOT NULL,
sender TEXT NOT NULL, -- 'user' or 'hex'
emotion TEXT, -- detected from webcam/tone
context TEXT -- screen state, game, etc.
);
CREATE TABLE user_relationships (
user_id TEXT PRIMARY KEY,
first_seen DATETIME,
interaction_count INTEGER,
favorite_topics TEXT, -- JSON array
known_traits TEXT, -- JSON
last_interaction DATETIME
);
CREATE TABLE hex_state (
key TEXT PRIMARY KEY,
value TEXT,
updated_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_user_timestamp ON conversations(user_id, timestamp);
```
**Query Pattern** (for context retrieval):
```python
import sqlite3

def get_recent_context(user_id: str, num_messages: int = 20) -> list[str]:
    """Retrieve conversation history for LLM context"""
    conn = sqlite3.connect("hex.db")
    cursor = conn.cursor()
    cursor.execute("""
        SELECT sender, message FROM conversations
        WHERE user_id = ?
        ORDER BY timestamp DESC
        LIMIT ?
    """, (user_id, num_messages))
    history = cursor.fetchall()
    conn.close()
    # Format for LLM (oldest first)
    return [f"{sender}: {message}" for sender, message in reversed(history)]
```
### Long-Term Memory: Vector Database
**Purpose**: Semantic search over past interactions ("Remember when we talked about...?")
**Recommendation: ChromaDB (Development) → Qdrant (Production)**
**ChromaDB** (for now):
- Embedded in Python process
- Zero setup
- 4x faster in 2025 Rust rewrite
- Scales to ~1M vectors on single machine
**Migration Path**: Start with ChromaDB, migrate to Qdrant if vector count exceeds 100k or response latency matters.
**Installation**:
```bash
pip install chromadb
```
```python
import chromadb

client = chromadb.EphemeralClient()                        # in-memory for dev
# or
client = chromadb.PersistentClient(path="./hex_vectors")   # persistent

collection = client.get_or_create_collection(
    name="conversation_memories",
    metadata={"hnsw:space": "cosine"},
)

# Store memory
collection.add(
    ids=[f"msg_{timestamp}"],
    documents=[message_text],
    metadatas=[{"user_id": user_id, "date": timestamp}],
    embeddings=[embedding_vector],
)

# Retrieve similar memories
results = collection.query(
    query_texts=["user likes playing valorant"],
    n_results=3,
)
```
### Embedding Model
**Recommendation**: `sentence-transformers/all-MiniLM-L6-v2` (384-dim, 22MB)
```bash
pip install sentence-transformers
```
```python
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
embedding = embedder.encode("I love playing games with you", convert_to_tensor=False)
```
**Why MiniLM-L6**:
- Small (22MB), fast (<5ms per sentence on CPU)
- High quality (competitive with large models on semantic tasks)
- Designed for retrieval (better than generic BERT for similarity)
- Popular in production (battle-tested)
### Memory Retrieval Pattern for LLM Context
```python
async def get_full_context(user_id: str, query: str) -> str:
    """Build context string for LLM from short + long-term memory"""
    # Short-term: recent messages
    recent_msgs = get_recent_context(user_id, num_messages=10)
    recent_text = "\n".join(recent_msgs)

    # Long-term: semantic search
    embedding = embedder.encode(query)
    similar_memories = vectors.query(
        query_embeddings=[embedding],
        n_results=5,
        where={"user_id": {"$eq": user_id}},
    )
    memory_text = "\n".join(similar_memories["documents"][0])

    # Relationship state
    relationship = get_user_relationship(user_id)

    return f"""Recent conversation:
{recent_text}

Relevant memories:
{memory_text}

About {user_id}: {relationship['known_traits']}
"""
```
### Confidence Levels
- **Short-term (SQLite)**: HIGH — mature, proven
- **Long-term (ChromaDB)**: MEDIUM — good for dev, test migration path early
- **Embeddings (MiniLM)**: HIGH — widely adopted, production-ready
---
## Python Async Patterns
### Core Discord.py + LLM Integration
**The Problem**: Discord bot event loop blocks if you call LLM synchronously.
**The Solution**: Always use `asyncio.create_task()` for I/O-bound work.
```python
import asyncio

import discord
from discord.ext import commands

@commands.Cog.listener()
async def on_message(self, message: discord.Message):
    """Non-blocking message handling"""
    if message.author == self.bot.user:
        return

    # Bad (blocks this handler for 5+ seconds):
    # response = generate_response(message.content)

    # Good (non-blocking):
    async def generate_and_send():
        thinking = await message.channel.send("*thinking*...")
        response = await asyncio.to_thread(
            generate_response,
            message.content,
        )
        await thinking.edit(content=response)

    asyncio.create_task(generate_and_send())
```
### Concurrent Task Patterns
**Pattern 1: Parallel TTS + Text Send**
```python
async def respond_with_voice(text: str, text_channel, voice_client):
    """Reply in text and voice; synthesis and sending run in parallel"""
    response_text = await generate_llm_response(text)
    # TTS synthesis and the Discord send are independent: run them together
    voice_audio, _ = await asyncio.gather(
        synthesize_tts(response_text),
        text_channel.send(response_text),
    )
    voice_client.play(discord.PCMAudio(voice_audio))
```
**Pattern 2: Task Queue for Rate Limiting**
```python
import asyncio

class ResponseQueue:
    def __init__(self, max_concurrent: int = 2):
        self.semaphore = asyncio.Semaphore(max_concurrent)
        self.pending = []

    async def queue_response(self, user_id: str, text: str):
        async with self.semaphore:
            # Only `max_concurrent` responses generate at once
            response = await generate_response(text)
            self.pending.append((user_id, response))
            return response

queue = ResponseQueue(max_concurrent=2)
```
**Pattern 3: Background Personality Tasks**
```python
from datetime import datetime

from discord.ext import commands, tasks

class HexPersonality(commands.Cog):
    def __init__(self, bot):
        self.bot = bot
        self.mood = "neutral"
        self.update_mood.start()

    @tasks.loop(minutes=5)  # Every 5 minutes
    async def update_mood(self):
        """Cycle personality state based on time + interactions"""
        self.mood = await calculate_mood(
            time_of_day=datetime.now(),
            recent_interactions=self.get_recent_count(),
            sleep_deprived=self.is_late_night(),
        )
        # Emit mood change to memory
        await self.bot.hex_db.update_state("current_mood", self.mood)

    @update_mood.before_loop
    async def before_update_mood(self):
        await self.bot.wait_until_ready()
```
### Handling CPU-Bound Work
**OpenCV, emotion detection, transcription are CPU-bound.**
```python
import asyncio
import concurrent.futures

# Pattern: Use to_thread for CPU work
emotion = await asyncio.to_thread(analyze_emotion, frame)

# Pattern: Use ThreadPoolExecutor for multiple CPU tasks
executor = concurrent.futures.ThreadPoolExecutor(max_workers=2)
loop = asyncio.get_running_loop()
emotion = await loop.run_in_executor(executor, analyze_emotion, frame)
```
### Error Handling & Resilience
```python
async def safe_generate_response(message: str) -> str:
    """Generate response with fallback"""
    try:
        response = await asyncio.wait_for(
            generate_llm_response(message),
            timeout=5.0,  # 5-second timeout
        )
        return response
    except asyncio.TimeoutError:
        return "I'm thinking too hard... ask me again?"
    except Exception as e:
        logger.error(f"Generation failed: {e}")
        return "*confused goblin noises*"
```
### Concurrent Request Management (Discord.py)
```python
class ConcurrencyManager:
    def __init__(self):
        self.active_tasks = {}
        self.max_per_user = 1  # One response at a time per user

    async def handle_message(self, user_id: str, text: str):
        if user_id in self.active_tasks and not self.active_tasks[user_id].done():
            return "I'm still thinking from last time!"
        task = asyncio.create_task(generate_response(text))
        self.active_tasks[user_id] = task
        try:
            return await task
        finally:
            del self.active_tasks[user_id]
```
---
## Known Pitfalls & Solutions
### 1. **Discord Event Loop Blocking**
**Problem**: Synchronous LLM calls block the bot, causing timeouts on other messages.
**Solution**: Always use `asyncio.to_thread()` or `asyncio.create_task()`.
### 2. **Whisper Hallucination on Silence**
**Problem**: Whisper can generate text from pure background noise.
**Solution**: Implement voice activity detection (VAD) before transcription.
```python
import librosa
import numpy as np

def has_speech(audio_path, threshold=-35):
    """Check if audio has meaningful energy"""
    y, sr = librosa.load(audio_path)
    S = librosa.feature.melspectrogram(y=y, sr=sr)
    S_db = librosa.power_to_db(S, ref=np.max)
    mean_energy = np.mean(S_db)
    return mean_energy > threshold
```
### 3. **Vector DB Scale Creep**
**Problem**: ChromaDB slows down as memories accumulate.
**Solution**: Archive old memories, implement periodic cleanup.
```python
from datetime import datetime, timedelta

# Archive conversations older than 90 days
# (cleanup_old_memories is an application-level helper, not a ChromaDB API)
old_threshold = datetime.now() - timedelta(days=90)
db.cleanup_old_memories(older_than=old_threshold)
```
### 4. **Model Memory Growth**
**Problem**: Loading Llama 3.1 8B in 4-bit still uses ~6GB, leaving little room for TTS/CV models.
**Solution**: Use offloading or accept single-component operation.
```python
# Option 1: Offload LLM to CPU between requests
# Option 2: Run TTS/CV in separate process
# Option 3: Use smaller model (Mistral 7B) when GPU-constrained
```
### 5. **Async Context Issues**
**Problem**: Storing references to coroutines without awaiting them.
**Solution**: Always create tasks explicitly:
```python
# Bad
coro = generate_response(text) # Dangling coroutine
# Good
task = asyncio.create_task(generate_response(text))
response = await task
```
### 6. **Personality Inconsistency**
**Problem**: LLM generates different responses with same prompt due to randomness.
**Solution**: Use consistent temperature and seed management.
```python
# Serious conversation context → lower temperature (0.5)
# Creative/chaotic moments → higher temperature (0.9)
temperature = 0.5 if in_serious_context else 0.9
```
---
## Recommended Deployment Configuration
```yaml
# Local Development (Hex primary environment)
gpu: RTX 3060 (12GB VRAM) or better
llm: Llama 3.1 8B (4-bit via Ollama)
tts: Kokoro 82M
stt: faster-whisper large-v3
avatar: VRoid + VSeeFace
database: SQLite + ChromaDB (embedded)
inference_latency: 3-10 seconds per response
cost: $0/month (open-source stack)
# Optional: Production Scaling
gpu_cluster: vLLM on multi-GPU for concurrency
database: Qdrant (cloud) + PostgreSQL for history
inference_latency: <2 seconds (batching + optimization)
cost: ~$200-500/month cloud compute
```
---
## Confidence Levels & 2026 Readiness
| Component | Recommendation | Confidence | 2026 Status |
|---|---|---|---|
| Discord.py 2.6.4+ | PRIMARY | HIGH | Stable, actively maintained |
| Llama 3.1 8B | PRIMARY | HIGH | Proven, production-ready |
| Mistral 7B | SECONDARY | HIGH | Fast-path fallback, stable |
| Ollama | PRIMARY | MEDIUM | Mature but rapidly evolving |
| vLLM | ALTERNATIVE | MEDIUM | High-performance alternative, v0.3+ recommended |
| Whisper Large V3 + faster-whisper | PRIMARY | HIGH | Gold standard for multilingual STT |
| Kokoro TTS | PRIMARY | MEDIUM | Emerging, high quality for size |
| XTTS-v2 | SPECIAL MOMENTS | HIGH | Voice cloning working well |
| VRoid + VSeeFace | PRIMARY | MEDIUM | Workaround viable, not native integration |
| ChromaDB | DEVELOPMENT | MEDIUM | Good for prototyping, evaluate Qdrant before 100k vectors |
| Qdrant | PRODUCTION | HIGH | Enterprise vector DB, proven at scale |
| OpenCV 4.10+ | PRIMARY | HIGH | Stable, mature ecosystem |
| DeepFace emotion detection | PRIMARY | HIGH | Industry standard, 90%+ accuracy |
| Python asyncio patterns | PRIMARY | HIGH | Python 3.11+ well-supported |
**Confidence Interpretation**:
- **HIGH**: Production-ready, API stable, no major changes expected in 2026
- **MEDIUM**: Solid choice but newer ecosystem (1-2 year old), evaluate alternatives annually
- **LOW**: Emerging or unstable; prototype only
---
## Installation Checklist (Get Started)
```bash
# Discord
pip install "discord.py>=2.6.4"
# LLM & inference
pip install ollama torch transformers bitsandbytes
# TTS/STT
pip install faster-whisper
pip install sentence-transformers torch
# Vector DB
pip install chromadb
# Vision
pip install opencv-python deepface librosa
# Async utilities
pip install httpx aiofiles
# Database
pip install aiosqlite
# Start services
ollama serve &
# (Loads models on first run)
# Test basic chain
python test_stack.py
```
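A sketch of what `test_stack.py` might check: one smoke-test round trip through the LLM and the embedder. The import path for `generate_response` is hypothetical; point it at wherever the Ollama helper above lives:
```python
# test_stack.py - smoke test for the local stack
import asyncio

from sentence_transformers import SentenceTransformer

from hex_bot import generate_response  # hypothetical module path

async def main() -> None:
    # 1. LLM round trip via Ollama
    reply = await generate_response("Say hi in character.")
    assert reply, "empty LLM response"
    print("LLM OK:", reply[:80])

    # 2. Embedding round trip
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    vec = embedder.encode("smoke test sentence")
    assert len(vec) == 384, "unexpected embedding dimension"
    print("Embeddings OK")

asyncio.run(main())
```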
---
## Next Steps (For Roadmap)
1. **Phase 1**: Discord.py + Ollama + basic LLM integration (1 week)
2. **Phase 2**: STT pipeline (Whisper) + TTS (Kokoro) (1 week)
3. **Phase 3**: Memory system (SQLite + ChromaDB) (1 week)
4. **Phase 4**: Personality framework + system prompts (1 week)
5. **Phase 5**: Webcam emotion detection + context integration (1 week)
6. **Phase 6**: VRoid avatar + screen share integration (1 week)
7. **Phase 7**: Self-modification capability + safety guards (2 weeks)
**Total**: ~8 weeks to full-featured Hex prototype.
---
## References & Research Sources
### Discord Integration
- [Discord.py Documentation](https://discordpy.readthedocs.io/en/stable/index.html)
- [Discord.py Async Patterns](https://discordpy.readthedocs.io/en/stable/ext/tasks/index.html)
- [Discord.py on GitHub](https://github.com/Rapptz/discord.py)
### Local LLMs
- [Llama 3.1 vs Mistral Comparison](https://kanerika.com/blogs/mistral-vs-llama-3/)
- [Llama.com Quantization Guide](https://www.llama.com/docs/how-to-guides/quantization/)
- [Ollama vs vLLM Deep Dive](https://developers.redhat.com/articles/2025/08/08/ollama-vs-vllm-deep-dive-performance-benchmarking)
- [Local LLM Hosting 2026 Guide](https://www.glukhov.org/post/2025/11/hosting-llms-ollama-localai-jan-lmstudio-vllm-comparison/)
### TTS/STT
- [Whisper Large V3 2026 Benchmarks](https://northflank.com/blog/best-open-source-speech-to-text-stt-model-in-2026-benchmarks/)
- [Faster-Whisper GitHub](https://github.com/SYSTRAN/faster-whisper)
- [Best Open Source TTS 2026](https://northflank.com/blog/best-open-source-text-to-speech-models-and-how-to-run-them)
- [Whisper Streaming for Real-Time](https://github.com/ufal/whisper_streaming)
### Computer Vision
- [Real-Time Facial Emotion Recognition with OpenCV](https://learnopencv.com/facial-emotion-recognition/)
- [DeepFace for Emotion Detection](https://github.com/serengp/deepface)
### Vector Databases
- [Vector Database Comparison 2026](https://www.datacamp.com/blog/the-top-5-vector-databases)
- [ChromaDB vs Pinecone Analysis](https://www.myscale.com/blog/choosing-best-vector-database-for-your-project/)
- [Chroma Documentation](https://docs.trychroma.com/)
### Python Async
- [Python Asyncio for LLM Concurrency](https://www.newline.co/@zaoyang/python-asyncio-for-llm-concurrency-best-practices--bc079176)
- [Asyncio Best Practices 2025](https://sparkco.ai/blog/mastering-async-best-practices-for-2025/)
- [FastAPI with Asyncio](https://www.nucamp.co/blog/coding-bootcamp-backend-with-python-2025-python-in-the-backend-in-2025-leveraging-asyncio-and-fastapi-for-highperformance-systems)
### VRoid & Avatars
- [VRoid Studio Official](https://vroid.com/en/studio)
- [VRoid Hub API](https://vroid.pixiv.help/hc/en-us/articles/21569104969241-The-VRoid-Hub-API-is-now-live)
- [VSeeFace for VRoid](https://www.vseeface.icu/)
---
**Document Version**: 1.0
**Last Updated**: January 2026
**Hex Stack Status**: Ready for implementation
**Estimated Implementation Time**: 8-12 weeks (to full personality bot)

# Research Summary: Hex AI Companion
**Date**: January 2026
**Status**: Ready for Roadmap and Requirements Definition
**Confidence Level**: HIGH (well-sourced, coherent across all research areas)
---
## Executive Summary
Hex is built on a **personality-first, local-first architecture** that prioritizes genuine emotional resonance over feature breadth. The recommended approach combines Llama 3.1 8B (local inference via Ollama), Discord.py async patterns, and a dual-memory system (SQLite + ChromaDB) to create an AI companion that feels like a person with opinions and growth over time.
The technical foundation is solid and proven: Discord.py 2.6.4+ with native async support, local LLM inference for privacy, and a 6-phase incremental build strategy that enables personality emergence before adding autonomy or self-modification.
**Critical success factor**: The difference between "a bot that sounds like Hex" and "Hex as a person" hinges on three interconnected systems working together: **memory persistence** (so she learns about you), **personality consistency** (so she feels like the same person), and **autonomy** (so she feels genuinely invested in you). All three must be treated as foundational, not optional features.
---
## Recommended Stack
**Core Technologies** (Production-ready, January 2026):
| Layer | Technology | Version | Rationale |
|-------|-----------|---------|-----------|
| **Bot Framework** | Discord.py | 2.6.4+ | Async-native, mature, excellent Discord integration |
| **LLM Inference** | Llama 3.1 8B Instruct | 4-bit quantized | 128K context window, superior reasoning, 6GB VRAM footprint |
| **LLM Engine** | Ollama (dev) / vLLM (production) | 0.3+ | Local-first, zero setup vs high-throughput scaling |
| **Short-term Memory** | SQLite | Standard lib | Fast, reliable, local file-based conversations |
| **Long-term Memory** | ChromaDB (dev) → Qdrant (prod) | Latest | Vector semantics, embedded for <100k vectors |
| **Embeddings** | all-MiniLM-L6-v2 | 384-dim | Fast (5ms/sentence), production-grade quality |
| **Speech-to-Text** | Whisper Large V3 + faster-whisper | Latest | Local, 7.4% WER, multilingual, 3-5s latency |
| **Text-to-Speech** | Kokoro 82M (default) + XTTS-v2 (emotional) | Latest | Sub-second latency, personality-aware prosody |
| **Vision** | OpenCV 4.10+ + DeepFace | 4.10+ | Face detection (30 FPS), emotion recognition (90%+ accuracy) |
| **Avatar** | VRoid + VSeeFace + Discord screen share | Latest | Free, anime-style, integrates with Discord calls |
| **Personality** | YAML + Git versioning | — | Editable persona, change tracking, rollback capable |
| **Self-Modification** | RestrictedPython + sandboxing | — | Safe code generation, user approval required |
**Why This Stack**:
- **Privacy**: All inference local (except Discord API), no cloud dependency
- **Latency**: <3 second end-to-end response time on consumer hardware (RTX 3060 Ti)
- **Cost**: Zero cloud fees, open-source stack
- **Personality**: System prompt injection + memory context + perception awareness enables genuine character coherence
- **Async Architecture**: Discord.py's native asyncio means LLM, TTS, memory lookups run in parallel without blocking
---
## Table Stakes vs Differentiators
### Table Stakes (v1 Essential Features)
Users expect these by default in 2026. Missing any breaks immersion:
1. **Conversation Memory** (Short + Long-term)
- Last 20 messages in context window
- Vector semantic search for relevant past interactions
- Relationship state tracking (strangers → friends → close)
- **Without this**: Feels like meeting a stranger each time; companion becomes disposable
2. **Natural Conversation** (No AI Speak)
- Contractions, casual language, slang
- Personality quirks embedded in word choices
- Context-appropriate tone shifts
- Willingness to disagree or pushback
- **Pitfall**: Formal "I'm an AI and I can help you with..." kills immersion instantly
3. **Fast Response Times** (<1s for acknowledgment, <3s for full response)
- Typing indicators start immediately
- Streaming responses (show text as it generates)
- Async all I/O-bound work (LLM, TTS, database)
- **Without this**: Latency >5s makes companion feel dead; users stop engaging
4. **Consistent Personality** (Feels like same person across weeks)
- Core traits stable (tsundere nature, values)
- Personality evolution slow and logged
- Memory-backed traits (not just prompt)
- **Pitfall**: Personality drift is #1 reason users abandon companions
5. **Platform Integration** (Discord native)
- Text channels, DMs, voice channels
- Emoji reactions, slash commands
- Server-specific personality variations
- **Without this**: Requires leaving Discord = abandoned feature
6. **Emotional Responsiveness** (Reads the room)
- Sentiment detection from messages
- Adaptive response depth (listen to sad users, engage with energetic ones)
- Skip jokes when user is suffering
- **Pitfall**: "Always cheerful" feels cruel when user is venting
---
### Differentiators (Competitive Edge)
These separate Hex from static chatbots. Build in order:
1. **True Autonomy** (Proactive Agency)
- Initiates conversations based on context/memory
- Reminds about user's goals without being asked
- Sets boundaries ("I don't think you should do X")
- Follows up on unresolved topics
- **Research shows**: Autonomous companions are described as "feels like they actually care" vs reactive "smart but distant"
- **Complexity**: Hard, requires Phase 3-4
2. **Emotional Intelligence** (Mood Detection + Adaptive Strategy)
- Facial emotion from webcam (70-80% accuracy possible)
- Voice tone analysis from Discord calls
- Mood tracking over time (identifies depression patterns, burnout)
- Knows when to listen vs advise vs distract
- **Research shows**: Companies using emotion AI report 25% positive sentiment increase
- **Complexity**: Hard, requires Phase 3+ but perception must be separate thread
3. **Multimodal Awareness** (Sees Your Context)
- Understands what's on your screen (game, work, video)
- Contextualizes help ("I see you're stuck on that Elden Ring boss...")
- Detects stress signals (tab behavior, timing)
- Proactive help based on visible activity
- **Privacy**: Local processing only, user opt-in required
- **Complexity**: Hard, requires careful async architecture to avoid latency
4. **Self-Modification** (Genuine Autonomy)
- Generates code to improve own logic
- Tests changes in sandbox before deployment
- User maintains veto power (approval required)
- All changes tracked with rollback capability
- **Critical**: Gamified progression (not instant capability), mandatory approval, version control
- **Complexity**: Hard, requires Phase 5+ and strong safety boundaries
5. **Relationship Building** (Transactional → Meaningful)
- Inside jokes that evolve naturally
- Character growth (admits mistakes, opinions change slightly)
- Vulnerability in appropriate moments
- Investment in user outcomes ("I'm rooting for you")
- **Research shows**: Users with relational companions feel like it's "someone who actually knows them"
- **Complexity**: Hard (3+ weeks), emerges from memory + personality + autonomy
---
## Build Architecture (6-Phase Approach)
### Phase 1: Foundation (Weeks 1-2) — "Hex talks back"
**Goal**: Core interaction loop working locally; personality emerges
**Build**:
- Discord bot skeleton with message handling (Discord.py)
- Local LLM integration (Ollama + Llama 3.1 8B 4-bit quantized)
- SQLite conversation storage (recent context only)
- YAML personality definition (editable)
- System prompt with persona injection
- Async/await patterns throughout
**Outcomes**:
- Hex responds in Discord text channels with personality
- Conversations logged, retrievable
- Response latency <2 seconds
- Personality can be tweaked via YAML
**Key Metric**: P95 latency <2s, personality consistency baseline established
**Pitfalls to avoid**:
- Blocking the event loop (offload synchronous calls with `asyncio.to_thread()`; `asyncio.create_task()` only schedules coroutines, it does not unblock sync code)
- LLM inference on main thread (use thread pool)
- Personality not actionable in prompts (be specific about tsundere rules)
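
A minimal sketch of the Phase 1 loop, assuming Ollama is running locally with `llama3.1` pulled; the persona string stands in for the YAML-loaded personality, `YOUR_DISCORD_TOKEN` is a placeholder, and per-channel context history is omitted for brevity:

```python
# Phase 1 core loop: Discord message in -> local LLM -> reply.
# ollama's AsyncClient awaits inference without blocking the event loop.
import discord
from ollama import AsyncClient

PERSONA = "You are Hex, a tsundere AI companion. Stay in character."  # from YAML in practice

intents = discord.Intents.default()
intents.message_content = True
client = discord.Client(intents=intents)
llm = AsyncClient()

@client.event
async def on_message(message: discord.Message):
    if message.author == client.user:
        return  # never reply to ourselves
    async with message.channel.typing():
        response = await llm.chat(
            model="llama3.1",
            messages=[
                {"role": "system", "content": PERSONA},
                {"role": "user", "content": message.content},
            ],
        )
    await message.channel.send(response["message"]["content"])

client.run("YOUR_DISCORD_TOKEN")  # placeholder
```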
---
### Phase 2: Personality & Memory (Weeks 3-4) — "Hex remembers me"
**Goal**: Hex feels like a person who learns about you; personality becomes consistent
**Build**:
- Vector database (ChromaDB) for semantic memory
- Memory-aware context injection (relevant past facts in prompt)
- User relationship tracking (relationship state machine)
- Emotional responsiveness from text sentiment
- Personality versioning (git-based snapshots)
- Tsundere balance metrics (track denial %)
- Kid-mode detection (safety filtering)
**Outcomes**:
- Hex remembers facts about you across conversations
- Responses reference past events naturally
- Personality consistent across weeks (audit shows <5% drift)
- Emotions read from text; responses adapt depth
- Changes to personality tracked with rollback
**Key Metric**: User reports "she remembers things I told her" unprompted
**Pitfalls to avoid**:
- Personality drift (implement weekly consistency audits)
- Memory hallucination (store full context, verify before using)
- Tsundere breaking (formalize denial rules, scale with relationship phase)
- Memory bloat (hierarchical memory with archival strategy)
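
A minimal sketch of the Phase 2 memory layer, assuming ChromaDB's default local embedder; the collection name, metadata schema, and helper names are illustrative, and only distilled facts (never raw chat logs) get stored:

```python
# Store distilled facts; retrieve the top-k relevant ones for prompt injection.
import uuid
import chromadb

client = chromadb.PersistentClient(path="./hex_memory")
memories = client.get_or_create_collection("user_facts")

def remember(fact: str, user_id: str) -> None:
    """Persist one distilled fact with provenance metadata."""
    memories.add(
        documents=[fact],
        metadatas=[{"user_id": user_id}],
        ids=[str(uuid.uuid4())],
    )

def recall(query: str, user_id: str, k: int = 5) -> list[str]:
    """Fetch the k most relevant facts; keep k small to avoid context bloat."""
    hits = memories.query(query_texts=[query], n_results=k, where={"user_id": user_id})
    return hits["documents"][0]

remember("User is learning Rust and fights the borrow checker daily", "dani")
print(recall("what language is the user learning?", "dani"))
```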
---
### Phase 3: Multimodal Input (Weeks 5-6) — "Hex sees me"
**Goal**: Add perception layer without killing responsiveness; context aware
**Build**:
- Webcam integration (OpenCV face detection, DeepFace emotion)
- Local Whisper for voice transcription in Discord calls
- Screen capture analysis (activity recognition)
- Perception state aggregation (emotion + activity + environment)
- Context injection into LLM prompts
- **CRITICAL**: Perception on separate thread (never blocks Discord responses)
**Outcomes**:
- Hex reacts to your facial expressions
- Voice input works in Discord calls
- Responses reference your mood/activity
- All processing local (privacy preserved)
- Text latency unaffected by perception (<3s still achieved)
**Key Metric**: Multimodal doesn't increase response latency >500ms
**Pitfalls to avoid**:
- Image processing blocking text responses (separate thread mandatory)
- Processing every video frame (skip intelligently, 1-3 FPS sufficient)
- Avatar sync failures (atomic state updates)
- Privacy violations (no external transmission, user opt-in)
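
A minimal sketch of the threading rule above: perception loops in its own daemon thread at ~2 FPS and publishes a small snapshot, while the async side reads the latest state instantly and never waits on a camera frame. `infer_emotion()` is a stand-in for the DeepFace helper sketched earlier:

```python
# Perception runs off-thread; only a tiny state dict crosses the boundary.
import threading
import time

def infer_emotion() -> str:
    return "neutral"  # stand-in for the real webcam/DeepFace pipeline

class PerceptionState:
    def __init__(self):
        self._lock = threading.Lock()
        self._state = {"emotion": None, "updated": 0.0}

    def publish(self, emotion: str) -> None:
        with self._lock:  # atomic swap keeps mood/avatar consumers consistent
            self._state = {"emotion": emotion, "updated": time.time()}

    def snapshot(self) -> dict:
        with self._lock:
            return dict(self._state)

def perception_loop(state: PerceptionState, stop: threading.Event) -> None:
    while not stop.is_set():
        state.publish(infer_emotion())
        time.sleep(0.5)  # ~2 FPS is plenty for mood and cheap on CPU

state, stop = PerceptionState(), threading.Event()
threading.Thread(target=perception_loop, args=(state, stop), daemon=True).start()
time.sleep(1)             # give the loop a moment to publish once
print(state.snapshot())   # in an async handler this returns instantly
```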
---
### Phase 4: Avatar & Autonomy (Weeks 7-8) — "Hex has a face and cares"
**Goal**: Visual presence + proactive agency; relationship feels two-way
**Build**:
- VRoid model loading + VSeeFace display
- Blendshape animation (emotion → facial expression)
- Discord screen share integration
- Proactive messaging system (based on context/memory/mood)
- Autonomy timing heuristics (don't interrupt at 3am)
- Relationship state machine (escalates intimacy)
- User preference learning (response length, topics, timing)
**Outcomes**:
- Avatar appears in Discord calls, animates with mood
- Hex initiates conversations ("Haven't heard from you in 3 days...")
- Proactive messages feel relevant, not annoying
- Relationship deepens (inside jokes, character growth)
- User feels companionship, not just assistance
**Key Metric**: User reports missing Hex when unavailable; initiates conversations
**Pitfalls to avoid**:
- Becoming annoying (emotional awareness + quiet mode essential)
- One-way relationship (autonomy without care-signaling feels hollow)
- Poor timing (learn user's schedule, respect busy periods)
- Avatar desync (mood and expression must stay aligned)
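
A minimal sketch of the proactive-messaging gate, using only the standard library; the quiet hours, six-hour spacing, and three-day nudge threshold are illustrative defaults meant to be replaced by learned per-user preferences:

```python
# "Don't be annoying" heuristics for proactive pings.
from datetime import datetime, timedelta

QUIET_HOURS = range(0, 9)         # 00:00-08:59 local: never initiate
MIN_GAP = timedelta(hours=6)      # minimum spacing between proactive pings
NUDGE_AFTER = timedelta(days=3)   # silence long enough to warrant a check-in

def should_initiate(now: datetime,
                    last_user_msg: datetime,
                    last_proactive: datetime) -> bool:
    if now.hour in QUIET_HOURS:
        return False              # respect the user's sleep/work schedule
    if now - last_proactive < MIN_GAP:
        return False              # don't stack pings
    return now - last_user_msg >= NUDGE_AFTER  # "haven't heard from you in 3 days..."

now = datetime.now()
print(should_initiate(now, now - timedelta(days=4), now - timedelta(days=1)))
```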
---
### Phase 5: Self-Modification (Weeks 9-10) — "Hex can improve herself"
**Goal**: Genuine autonomy within safety boundaries; code generation with approval gates
**Build**:
- LLM-based code proposal generation
- Static AST analysis for safety validation
- Sandboxed testing environment
- Git-based change tracking + rollback capability (24h window)
- Gamified capability progression (5 levels)
- Mandatory user approval for all changes
- Personality updates when new capabilities unlock
**Outcomes**:
- Hex proposes improvements (in voice, with reasoning)
- Code changes tested, reviewed, deployed with approval
- All changes reversible; version history intact
- New capabilities unlock as relationship deepens
- Hex "learns to code" and announces new skills
**Key Metric**: Self-modifications improve measurable aspects (faster response, better personality consistency)
**Pitfalls to avoid**:
- Runaway self-modification (approval gate non-negotiable)
- Code drift (version control mandatory, rollback tested)
- Loss of user control (never remove safety constraints, killswitch always works)
- Capability escalation without trust (gamified progression with clear boundaries)
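
A minimal sketch of the static AST pass that would sit in front of the sandbox, rejecting proposals before they ever execute; the forbidden lists here are illustrative, and a real policy would be stricter (and itself protected from modification):

```python
# Reject self-proposed code that touches forbidden modules or builtins.
import ast

FORBIDDEN_IMPORTS = {"os", "subprocess", "shutil", "socket"}
FORBIDDEN_CALLS = {"exec", "eval", "compile", "__import__"}

def violations(source: str) -> list[str]:
    problems = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            names = []
        for name in names:
            if name.split(".")[0] in FORBIDDEN_IMPORTS:
                problems.append(f"forbidden import: {name} (line {node.lineno})")
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name) \
                and node.func.id in FORBIDDEN_CALLS:
            problems.append(f"forbidden call: {node.func.id} (line {node.lineno})")
    return problems

proposal = "import subprocess\nsubprocess.run(['rm', '-rf', '/'])"
print(violations(proposal))  # -> ['forbidden import: subprocess (line 1)']
```

Static analysis alone is easy to evade, which is why it complements the sandbox, the approval gate, and rollback rather than replacing them.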
---
### Phase 6: Production Polish (Weeks 11-12) — "Hex is ready to ship"
**Goal**: Stability, performance, error handling, documentation
**Build**:
- Performance optimization (caching, batching, context summarization)
- Error handling + graceful degradation
- Logging and telemetry (local + optional cloud)
- Configuration management
- Resource leak monitoring (memory, connections, VRAM)
- Scheduled restart capability (weekly preventative)
- Integration testing (all components together)
- Documentation and guides
- Auto-update capability
**Outcomes**:
- System stable for indefinite uptime
- Responsive under load
- Clear error messages when things fail
- Easy to deploy, configure, debug
- Ready for extended real-world use
**Key Metric**: 99.5% uptime over 1-month runtime, no crashes, <3s latency maintained
**Pitfalls to avoid**:
- Memory leaks (resource monitoring mandatory)
- Performance degradation over time (profile early and often)
- Context window bloat (summarization strategy)
- Unforeseen edge cases (comprehensive testing)
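
A minimal sketch of the leak watchdog described above, assuming `psutil` for process memory; the budget and cadence are illustrative, and VRAM would be sampled the same way through GPU tooling such as pynvml:

```python
# Sample process RSS every 5 minutes; warn when it trends past budget.
import asyncio
import logging
import psutil

RSS_BUDGET_MB = 6_000   # illustrative alert threshold for resident memory
INTERVAL_S = 300        # matches the "every 5 minutes" monitoring cadence

async def resource_watchdog() -> None:
    proc = psutil.Process()
    while True:
        rss_mb = proc.memory_info().rss / 1_048_576
        if rss_mb > RSS_BUDGET_MB:
            logging.warning("RSS %.0f MB exceeds %d MB budget", rss_mb, RSS_BUDGET_MB)
        await asyncio.sleep(INTERVAL_S)

# In the bot's startup hook: asyncio.create_task(resource_watchdog())
```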
---
## Critical Pitfalls and Prevention
### Top 5 Most Dangerous Pitfalls
1. **Personality Drift** (Consistency breaks over time)
   - **Risk**: Users feel gaslit; trust is broken
- **Prevention**:
- Weekly personality audits (sample responses, rate consistency)
- Personality baseline document (core values never change)
- Memory-backed personality (traits anchor to learned facts)
- Version control on persona YAML (track evolution)
2. **Tsundere Character Breaking** (Denial applied wrong; becomes mean or loses charm)
- **Risk**: Character feels mechanical or rejecting
- **Prevention**:
- Formalize denial rules: "deny only when (emotional AND not alone AND not escalated intimacy)"
- Denial scales with relationship phase (90% early → 40% mature)
- Post-denial must include care signal (action, not words)
   - Track denial % against the current phase target; alert if it falls below ~30% (losing the tsun side) or stays above ~70% past the early phase (reads as mean)
3. **Memory System Bloat** (Retrieval becomes slow; hallucinations increase)
- **Risk**: System becomes unusable as history grows
- **Prevention**:
- Hierarchical memory (raw → summaries → semantic facts → personality anchors)
- Selective storage (facts, not raw chat; de-duplicate)
- Memory aging (recent detailed → old archived)
- Importance weighting (user marks important memories)
- Vector DB optimization (limit retrieval to top 5-10 results)
4. **Runaway Self-Modification** (Code changes cascade; safety removed; user loses control)
- **Risk**: System becomes uncontrollable, breaks
- **Prevention**:
- Mandatory approval gate (user reviews all code)
- Sandboxed testing before deployment
- Version control + 24h rollback window
- Gamified progression (limited capability at first)
- Cannot modify: core values, killswitch, user control systems
5. **Latency Creep** (Response times increase over time until unusable)
- **Risk**: "Feels alive" illusion breaks; users abandon
- **Prevention**:
- All I/O async (database, LLM, TTS, Discord)
- Parallel operations (use `asyncio.gather()`)
- Quantized LLM (4-bit saves 75% VRAM)
- Caching (user preferences, relationship state)
- Context window management (summarize old context)
- VRAM/latency monitoring every 5 minutes
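
A minimal sketch of the parallelism rule from pitfall 5: independent I/O runs concurrently under `asyncio.gather()` instead of sequentially. The fetchers are stand-ins (sleeps in place of a vector-DB query and a cache read):

```python
# Total wait ~= the slowest fetch, not the sum of all fetches.
import asyncio

async def fetch_memories(user_id: str) -> list[str]:
    await asyncio.sleep(0.3)          # stand-in for a vector DB query
    return ["likes rust", "hates mondays"]

async def fetch_relationship(user_id: str) -> str:
    await asyncio.sleep(0.2)          # stand-in for a SQLite/cache read
    return "close_friend"

async def build_context(user_id: str) -> dict:
    memories, relationship = await asyncio.gather(
        fetch_memories(user_id),
        fetch_relationship(user_id),
    )                                  # ~0.3s total instead of 0.5s sequential
    return {"memories": memories, "relationship": relationship}

print(asyncio.run(build_context("dani")))
```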
---
## Implications for Roadmap
### Phase Sequencing Rationale
The 6-phase approach reflects **dependency chains** that cannot be violated:
```
Phase 1 (Foundation) ← Must work perfectly
Phase 2 (Personality) ← Depends on Phase 1; personality must be stable before autonomy
Phase 3 (Perception) ← Depends on Phase 1-2; separate thread prevents latency impact
Phase 4 (Autonomy) ← Depends on memory + personality being rock-solid; now add proactivity
Phase 5 (Self-Modification) ← Only grant code access after relationship + autonomy stable
Phase 6 (Polish) ← Final hardening, testing, documentation
```
**Why this order matters**:
- You cannot have consistent personality without memory (Phase 2 must follow Phase 1)
- You cannot add autonomy safely without personality being stable (Phase 4 must follow Phase 2)
- You cannot grant self-modification capability until everything else proves stable (Phase 5 must follow Phase 4)
Skipping phases or reordering creates technical debt and risk. Each phase grounds the next.
---
### Feature Grouping by Phase
| Phase | Quick Win Features | Complex Features | Foundation Qualities |
|-------|-------------------|------------------|----------------------|
| 1 | Text responses, personality YAML | Async architecture, quantization | Responsiveness, personality baseline |
| 2 | Memory storage, relationship tracking | Semantic search, memory retrieval | Consistency, personalization |
| 3 | Webcam emoji reactions, mood inference | Separate perception thread, context injection | Multimodal without latency cost |
| 4 | Scheduled messages, inside jokes | Autonomy timing, relationship state machine | Two-way connection, depth |
| 5 | Propose changes (in voice) | Code generation, sandboxing, testing | Genuine improvement, controlled growth |
| 6 | Better error messages, logging | Resource monitoring, restart scheduling | Reliability, debuggability |
---
## Confidence Assessment
| Area | Confidence | Basis | Gaps |
|------|-----------|-------|------|
| **Stack** | HIGH | Proven technologies, clear deployment path | None significant; all tools production-ready |
| **Architecture** | HIGH | Modular design, async patterns well-documented, integration points clear | Unclear: perception thread CPU overhead under load (test Phase 3) |
| **Features** | HIGH | Clearly categorized, dependencies mapped, testing criteria defined | Unclear: optimal prompting for tsundere balance (test Phase 2) |
| **Personality Consistency** | MEDIUM-HIGH | Prevention strategies defined | Effort required for weekly audits unknown; need empirical drift-rate testing and metrics refinement |
| **Pitfalls** | HIGH | Research comprehensive, prevention strategies detailed, phases mapped | Unclear: priority ordering within Phase 5 (what to implement first?) |
| **Self-Modification Safety** | MEDIUM | Framework defined, but no hands-on experience yet with Hex generating her own code | Need: early Phase 5 prototyping; safety validation testing |
---
## Ready for Roadmap: Key Constraints and Decision Gates
### Non-Negotiable Constraints
1. **Personality consistency must be achievable in Phase 2**
- Decision gate: If personality audit in Phase 2 shows >10% drift, pause Phase 3
   - Investigation needed: Is a weekly audit cadence necessary, or would monthly suffice? What drift rate is acceptable?
2. **Latency must stay <3s through Phase 4**
- Decision gate: If P95 latency exceeds 3s at any phase, debug and fix before next phase
- Investigation needed: Where is the bottleneck? (LLM? Memory? Perception?)
3. **Self-modification must have air-tight approval + rollback**
- Decision gate: Do not proceed to Phase 5 until approval gate is bulletproof + rollback tested
- Investigation needed: What approval flow feels natural? Too many questions → annoying; too few → unsafe
4. **Memory retrieval must scale to 10k+ memories without degradation**
   - Decision gate: Test the memory system with a synthetic 10k-message dataset before Phase 4 (see the benchmark sketch after this list)
- Investigation needed: Does hierarchical memory + vector DB compression actually work? Verify retrieval speed
5. **Perception must never block text responses**
- Decision gate: Profile perception thread; if latency spike >200ms, optimize or defer feature
- Investigation needed: How CPU-heavy is continuous webcam processing? Can it run at 1 FPS?
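
A minimal sketch of the constraint 4 benchmark: load 10k synthetic facts into an in-memory ChromaDB collection and time a top-5 query. The 200 ms pass threshold is illustrative; tune it against the <3s end-to-end budget (and expect the first run to be slower while the default embedding model downloads):

```python
# Synthetic 10k-memory retrieval benchmark for the Phase 4 decision gate.
import time
import chromadb

client = chromadb.Client()  # ephemeral in-memory client is fine for a benchmark
col = client.get_or_create_collection("scale_test")

BATCH = 1_000
for start in range(0, 10_000, BATCH):  # batch inserts keep ingestion reasonable
    ids = [f"mem-{i}" for i in range(start, start + BATCH)]
    docs = [f"synthetic fact {i} about the user's day" for i in range(start, start + BATCH)]
    col.add(documents=docs, ids=ids)

t0 = time.perf_counter()
col.query(query_texts=["what did the user do today?"], n_results=5)
elapsed_ms = (time.perf_counter() - t0) * 1_000
print(f"top-5 retrieval over 10k memories: {elapsed_ms:.0f} ms")
assert elapsed_ms < 200, "retrieval too slow for the latency budget"
```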
---
## Sources Aggregated
**Stack Research**: Discord.py docs, Llama/Mistral benchmarks, Ollama vs vLLM comparisons, Whisper/faster-whisper performance, VRoid SDK, ChromaDB + Qdrant analysis
**Features Research**: MIT Technology Review (AI companions 2026), Hume AI emotion docs, self-improving agents papers, company studies on emotion AI impact, uncanny valley voice research
**Architecture Research**: Discord bot async patterns, LLM + memory RAG systems, vector database design, self-modification safeguards, deployment strategies
**Pitfalls Research**: AI failure case studies (2025-2026), personality consistency literature, memory hallucination prevention, autonomy safety frameworks, performance monitoring practices
---
## Next Steps for Requirements Definition
1. **Phase 1 Deep Dive**: Specify exact Discord.py message handler, LLM prompt format, SQLite schema, YAML personality structure
2. **Phase 2 Spec**: Define memory hierarchy levels, confidence scoring system, personality audit rubric, tsundere balance metrics
3. **Phase 3 Prototype**: Early perception thread implementation; measure latency impact before committing
4. **Risk Mitigation**: Pre-Phase 5, build code generation + approval flow prototype; stress-test safety boundaries
5. **Testing Strategy**: Define personality consistency tests (50+ scenarios per phase), latency benchmarks (with profiling), memory accuracy validation
---
## Summary for Roadmapper
**Hex Stack**: Llama 3.1 8B local inference + Discord.py async + SQLite + ChromaDB + local perception layer
**Critical Success Factors**:
1. Personality consistency (weekly audits, memory-backed traits)
2. Latency discipline (async/await throughout, perception isolated)
3. Memory system (hierarchical, semantic search, confidence scoring)
4. Autonomy safety (mandatory approval, sandboxed testing, version control)
5. Relationship depth (proactivity, inside jokes, character growth)
**6-Phase Build Path**: Foundation → Personality → Perception → Autonomy → Self-Mod → Polish
**Key Decision Gates**: Personality consistency ✓ → Latency <3s ✓ → Memory scale test ✓ → Perception isolated ✓ → Approval flow safe ✓
**Confidence**: HIGH. All research coherent, no major technical blockers, proven technology stack. Ready for detailed requirements.
---
**Document Version**: 1.0
**Synthesis Date**: January 27, 2026
**Status**: Ready for Requirements Definition and Phase 1 Planning