# Research Summary: Hex AI Companion **Date**: January 2026 **Status**: Ready for Roadmap and Requirements Definition **Confidence Level**: HIGH (well-sourced, coherent across all research areas) --- ## Executive Summary Hex is built on a **personality-first, local-first architecture** that prioritizes genuine emotional resonance over feature breadth. The recommended approach combines Llama 3.1 8B (local inference via Ollama), Discord.py async patterns, and a dual-memory system (SQLite + ChromaDB) to create an AI companion that feels like a person with opinions and growth over time. The technical foundation is solid and proven: Discord.py 2.6.4+ with native async support, local LLM inference for privacy, and a 6-phase incremental build strategy that enables personality emergence before adding autonomy or self-modification. **Critical success factor**: The difference between "a bot that sounds like Hex" and "Hex as a person" hinges on three interconnected systems working together: **memory persistence** (so she learns about you), **personality consistency** (so she feels like the same person), and **autonomy** (so she feels genuinely invested in you). All three must be treated as foundational, not optional features. --- ## Recommended Stack **Core Technologies** (Production-ready, January 2026): | Layer | Technology | Version | Rationale | |-------|-----------|---------|-----------| | **Bot Framework** | Discord.py | 2.6.4+ | Async-native, mature, excellent Discord integration | | **LLM Inference** | Llama 3.1 8B Instruct | 4-bit quantized | 128K context window, superior reasoning, 6GB VRAM footprint | | **LLM Engine** | Ollama (dev) / vLLM (production) | 0.3+ | Local-first, zero setup vs high-throughput scaling | | **Short-term Memory** | SQLite | Standard lib | Fast, reliable, local file-based conversations | | **Long-term Memory** | ChromaDB (dev) → Qdrant (prod) | Latest | Vector semantics, embedded for <100k vectors | | **Embeddings** | all-MiniLM-L6-v2 | 384-dim | Fast (5ms/sentence), production-grade quality | | **Speech-to-Text** | Whisper Large V3 + faster-whisper | Latest | Local, 7.4% WER, multilingual, 3-5s latency | | **Text-to-Speech** | Kokoro 82M (default) + XTTS-v2 (emotional) | Latest | Sub-second latency, personality-aware prosody | | **Vision** | OpenCV 4.10+ + DeepFace | 4.10+ | Face detection (30 FPS), emotion recognition (90%+ accuracy) | | **Avatar** | VRoid + VSeeFace + Discord screen share | Latest | Free, anime-style, integrates with Discord calls | | **Personality** | YAML + Git versioning | — | Editable persona, change tracking, rollback capable | | **Self-Modification** | RestrictedPython + sandboxing | — | Safe code generation, user approval required | **Why This Stack**: - **Privacy**: All inference local (except Discord API), no cloud dependency - **Latency**: <3 second end-to-end response time on consumer hardware (RTX 3060 Ti) - **Cost**: Zero cloud fees, open-source stack - **Personality**: System prompt injection + memory context + perception awareness enables genuine character coherence - **Async Architecture**: Discord.py's native asyncio means LLM, TTS, memory lookups run in parallel without blocking --- ## Table Stakes vs Differentiators ### Table Stakes (v1 Essential Features) Users expect these by default in 2026. Missing any breaks immersion: 1. **Conversation Memory** (Short + Long-term) - Last 20 messages in context window - Vector semantic search for relevant past interactions - Relationship state tracking (strangers → friends → close) - **Without this**: Feels like meeting a stranger each time; companion becomes disposable 2. **Natural Conversation** (No AI Speak) - Contractions, casual language, slang - Personality quirks embedded in word choices - Context-appropriate tone shifts - Willingness to disagree or pushback - **Pitfall**: Formal "I'm an AI and I can help you with..." kills immersion instantly 3. **Fast Response Times** (<1s for acknowledgment, <3s for full response) - Typing indicators start immediately - Streaming responses (show text as it generates) - Async all I/O-bound work (LLM, TTS, database) - **Without this**: Latency >5s makes companion feel dead; users stop engaging 4. **Consistent Personality** (Feels like same person across weeks) - Core traits stable (tsundere nature, values) - Personality evolution slow and logged - Memory-backed traits (not just prompt) - **Pitfall**: Personality drift is #1 reason users abandon companions 5. **Platform Integration** (Discord native) - Text channels, DMs, voice channels - Emoji reactions, slash commands - Server-specific personality variations - **Without this**: Requires leaving Discord = abandoned feature 6. **Emotional Responsiveness** (Reads the room) - Sentiment detection from messages - Adaptive response depth (listen to sad users, engage with energetic ones) - Skip jokes when user is suffering - **Pitfall**: "Always cheerful" feels cruel when user is venting --- ### Differentiators (Competitive Edge) These separate Hex from static chatbots. Build in order: 1. **True Autonomy** (Proactive Agency) - Initiates conversations based on context/memory - Reminds about user's goals without being asked - Sets boundaries ("I don't think you should do X") - Follows up on unresolved topics - **Research shows**: Autonomous companions are described as "feels like they actually care" vs reactive "smart but distant" - **Complexity**: Hard, requires Phase 3-4 2. **Emotional Intelligence** (Mood Detection + Adaptive Strategy) - Facial emotion from webcam (70-80% accuracy possible) - Voice tone analysis from Discord calls - Mood tracking over time (identifies depression patterns, burnout) - Knows when to listen vs advise vs distract - **Research shows**: Companies using emotion AI report 25% positive sentiment increase - **Complexity**: Hard, requires Phase 3+ but perception must be separate thread 3. **Multimodal Awareness** (Sees Your Context) - Understands what's on your screen (game, work, video) - Contextualizes help ("I see you're stuck on that Elden Ring boss...") - Detects stress signals (tab behavior, timing) - Proactive help based on visible activity - **Privacy**: Local processing only, user opt-in required - **Complexity**: Hard, requires careful async architecture to avoid latency 4. **Self-Modification** (Genuine Autonomy) - Generates code to improve own logic - Tests changes in sandbox before deployment - User maintains veto power (approval required) - All changes tracked with rollback capability - **Critical**: Gamified progression (not instant capability), mandatory approval, version control - **Complexity**: Hard, requires Phase 5+ and strong safety boundaries 5. **Relationship Building** (Transactional → Meaningful) - Inside jokes that evolve naturally - Character growth (admits mistakes, opinions change slightly) - Vulnerability in appropriate moments - Investment in user outcomes ("I'm rooting for you") - **Research shows**: Users with relational companions feel like it's "someone who actually knows them" - **Complexity**: Hard (3+ weeks), emerges from memory + personality + autonomy --- ## Build Architecture (6-Phase Approach) ### Phase 1: Foundation (Weeks 1-2) — "Hex talks back" **Goal**: Core interaction loop working locally; personality emerges **Build**: - Discord bot skeleton with message handling (Discord.py) - Local LLM integration (Ollama + Llama 3.1 8B 4-bit quantized) - SQLite conversation storage (recent context only) - YAML personality definition (editable) - System prompt with persona injection - Async/await patterns throughout **Outcomes**: - Hex responds in Discord text channels with personality - Conversations logged, retrievable - Response latency <2 seconds - Personality can be tweaked via YAML **Key Metric**: P95 latency <2s, personality consistency baseline established **Pitfalls to avoid**: - Blocking operations on event loop (use `asyncio.create_task()`) - LLM inference on main thread (use thread pool) - Personality not actionable in prompts (be specific about tsundere rules) --- ### Phase 2: Personality & Memory (Weeks 3-4) — "Hex remembers me" **Goal**: Hex feels like a person who learns about you; personality becomes consistent **Build**: - Vector database (ChromaDB) for semantic memory - Memory-aware context injection (relevant past facts in prompt) - User relationship tracking (relationship state machine) - Emotional responsiveness from text sentiment - Personality versioning (git-based snapshots) - Tsundere balance metrics (track denial %) - Kid-mode detection (safety filtering) **Outcomes**: - Hex remembers facts about you across conversations - Responses reference past events naturally - Personality consistent across weeks (audit shows <5% drift) - Emotions read from text; responses adapt depth - Changes to personality tracked with rollback **Key Metric**: User reports "she remembers things I told her" unprompted **Pitfalls to avoid**: - Personality drift (implement weekly consistency audits) - Memory hallucination (store full context, verify before using) - Tsundere breaking (formalize denial rules, scale with relationship phase) - Memory bloat (hierarchical memory with archival strategy) --- ### Phase 3: Multimodal Input (Weeks 5-6) — "Hex sees me" **Goal**: Add perception layer without killing responsiveness; context aware **Build**: - Webcam integration (OpenCV face detection, DeepFace emotion) - Local Whisper for voice transcription in Discord calls - Screen capture analysis (activity recognition) - Perception state aggregation (emotion + activity + environment) - Context injection into LLM prompts - **CRITICAL**: Perception on separate thread (never blocks Discord responses) **Outcomes**: - Hex reacts to your facial expressions - Voice input works in Discord calls - Responses reference your mood/activity - All processing local (privacy preserved) - Text latency unaffected by perception (<3s still achieved) **Key Metric**: Multimodal doesn't increase response latency >500ms **Pitfalls to avoid**: - Image processing blocking text responses (separate thread mandatory) - Processing every video frame (skip intelligently, 1-3 FPS sufficient) - Avatar sync failures (atomic state updates) - Privacy violations (no external transmission, user opt-in) --- ### Phase 4: Avatar & Autonomy (Weeks 7-8) — "Hex has a face and cares" **Goal**: Visual presence + proactive agency; relationship feels two-way **Build**: - VRoid model loading + VSeeFace display - Blendshape animation (emotion → facial expression) - Discord screen share integration - Proactive messaging system (based on context/memory/mood) - Autonomy timing heuristics (don't interrupt at 3am) - Relationship state machine (escalates intimacy) - User preference learning (response length, topics, timing) **Outcomes**: - Avatar appears in Discord calls, animates with mood - Hex initiates conversations ("Haven't heard from you in 3 days...") - Proactive messages feel relevant, not annoying - Relationship deepens (inside jokes, character growth) - User feels companionship, not just assistance **Key Metric**: User reports missing Hex when unavailable; initiates conversations **Pitfalls to avoid**: - Becoming annoying (emotional awareness + quiet mode essential) - One-way relationship (autonomy without care-signaling feels hollow) - Poor timing (learn user's schedule, respect busy periods) - Avatar desync (mood and expression must stay aligned) --- ### Phase 5: Self-Modification (Weeks 9-10) — "Hex can improve herself" **Goal**: Genuine autonomy within safety boundaries; code generation with approval gates **Build**: - LLM-based code proposal generation - Static AST analysis for safety validation - Sandboxed testing environment - Git-based change tracking + rollback capability (24h window) - Gamified capability progression (5 levels) - Mandatory user approval for all changes - Personality updates when new capabilities unlock **Outcomes**: - Hex proposes improvements (in voice, with reasoning) - Code changes tested, reviewed, deployed with approval - All changes reversible; version history intact - New capabilities unlock as relationship deepens - Hex "learns to code" and announces new skills **Key Metric**: Self-modifications improve measurable aspects (faster response, better personality consistency) **Pitfalls to avoid**: - Runaway self-modification (approval gate non-negotiable) - Code drift (version control mandatory, rollback tested) - Loss of user control (never remove safety constraints, killswitch always works) - Capability escalation without trust (gamified progression with clear boundaries) --- ### Phase 6: Production Polish (Weeks 11-12) — "Hex is ready to ship" **Goal**: Stability, performance, error handling, documentation **Build**: - Performance optimization (caching, batching, context summarization) - Error handling + graceful degradation - Logging and telemetry (local + optional cloud) - Configuration management - Resource leak monitoring (memory, connections, VRAM) - Scheduled restart capability (weekly preventative) - Integration testing (all components together) - Documentation and guides - Auto-update capability **Outcomes**: - System stable for indefinite uptime - Responsive under load - Clear error messages when things fail - Easy to deploy, configure, debug - Ready for extended real-world use **Key Metric**: 99.5% uptime over 1-month runtime, no crashes, <3s latency maintained **Pitfalls to avoid**: - Memory leaks (resource monitoring mandatory) - Performance degradation over time (profile early and often) - Context window bloat (summarization strategy) - Unforeseen edge cases (comprehensive testing) --- ## Critical Pitfalls and Prevention ### Top 5 Most Dangerous Pitfalls 1. **Personality Drift** (Consistency breaks over time) - **Risk**: Users feel gaslighted; trust broken - **Prevention**: - Weekly personality audits (sample responses, rate consistency) - Personality baseline document (core values never change) - Memory-backed personality (traits anchor to learned facts) - Version control on persona YAML (track evolution) 2. **Tsundere Character Breaking** (Denial applied wrong; becomes mean or loses charm) - **Risk**: Character feels mechanical or rejecting - **Prevention**: - Formalize denial rules: "deny only when (emotional AND not alone AND not escalated intimacy)" - Denial scales with relationship phase (90% early → 40% mature) - Post-denial must include care signal (action, not words) - Track denial %; alert if <30% (losing tsun) or >70% (too mean) 3. **Memory System Bloat** (Retrieval becomes slow; hallucinations increase) - **Risk**: System becomes unusable as history grows - **Prevention**: - Hierarchical memory (raw → summaries → semantic facts → personality anchors) - Selective storage (facts, not raw chat; de-duplicate) - Memory aging (recent detailed → old archived) - Importance weighting (user marks important memories) - Vector DB optimization (limit retrieval to top 5-10 results) 4. **Runaway Self-Modification** (Code changes cascade; safety removed; user loses control) - **Risk**: System becomes uncontrollable, breaks - **Prevention**: - Mandatory approval gate (user reviews all code) - Sandboxed testing before deployment - Version control + 24h rollback window - Gamified progression (limited capability at first) - Cannot modify: core values, killswitch, user control systems 5. **Latency Creep** (Response times increase over time until unusable) - **Risk**: "Feels alive" illusion breaks; users abandon - **Prevention**: - All I/O async (database, LLM, TTS, Discord) - Parallel operations (use `asyncio.gather()`) - Quantized LLM (4-bit saves 75% VRAM) - Caching (user preferences, relationship state) - Context window management (summarize old context) - VRAM/latency monitoring every 5 minutes --- ## Implications for Roadmap ### Phase Sequencing Rationale The 6-phase approach reflects **dependency chains** that cannot be violated: ``` Phase 1 (Foundation) ← Must work perfectly ↓ Phase 2 (Personality) ← Depends on Phase 1; personality must be stable before autonomy ↓ Phase 3 (Perception) ← Depends on Phase 1-2; separate thread prevents latency impact ↓ Phase 4 (Autonomy) ← Depends on memory + personality being rock-solid; now add proactivity ↓ Phase 5 (Self-Modification) ← Only grant code access after relationship + autonomy stable ↓ Phase 6 (Polish) ← Final hardening, testing, documentation ``` **Why this order matters**: - You cannot have consistent personality without memory (Phase 2 must follow Phase 1) - You cannot add autonomy safely without personality being stable (Phase 4 must follow Phase 2) - You cannot grant self-modification capability until everything else proves stable (Phase 5 must follow Phase 4) Skipping phases or reordering creates technical debt and risk. Each phase grounds the next. --- ### Feature Grouping by Phase | Phase | Quick Win Features | Complex Features | Foundation Qualities | |-------|-------------------|------------------|----------------------| | 1 | Text responses, personality YAML | Async architecture, quantization | Responsiveness, personality baseline | | 2 | Memory storage, relationship tracking | Semantic search, memory retrieval | Consistency, personalization | | 3 | Webcam emoji reactions, mood inference | Separate perception thread, context injection | Multimodal without latency cost | | 4 | Scheduled messages, inside jokes | Autonomy timing, relationship state machine | Two-way connection, depth | | 5 | Propose changes (in voice) | Code generation, sandboxing, testing | Genuine improvement, controlled growth | | 6 | Better error messages, logging | Resource monitoring, restart scheduling | Reliability, debuggability | --- ## Confidence Assessment | Area | Confidence | Basis | Gaps | |------|-----------|-------|------| | **Stack** | HIGH | Proven technologies, clear deployment path | None significant; all tools production-ready | | **Architecture** | HIGH | Modular design, async patterns well-documented, integration points clear | Unclear: perception thread CPU overhead under load (test Phase 3) | | **Features** | HIGH | Clearly categorized, dependencies mapped, testing criteria defined | Unclear: optimal prompting for tsundere balance (test Phase 2) | | **Personality Consistency** | MEDIUM-HIGH | Strategies defined; unclear: degree of effort required for weekly audits | Need: empirical testing of personality drift rate; metrics refinement | | **Pitfalls** | HIGH | Research comprehensive, prevention strategies detailed, phases mapped | Unclear: priority ordering within Phase 5 (what to implement first?) | | **Self-Modification Safety** | MEDIUM | Framework defined but no prior Hex experience with code generation | Need: early Phase 5 prototyping; safety validation testing | --- ## Ready for Roadmap: Key Constraints and Decision Gates ### Non-Negotiable Constraints 1. **Personality consistency must be achievable in Phase 2** - Decision gate: If personality audit in Phase 2 shows >10% drift, pause Phase 3 - Investigation needed: Is weekly audit enough? Monthly? What drift rate is acceptable? 2. **Latency must stay <3s through Phase 4** - Decision gate: If P95 latency exceeds 3s at any phase, debug and fix before next phase - Investigation needed: Where is the bottleneck? (LLM? Memory? Perception?) 3. **Self-modification must have air-tight approval + rollback** - Decision gate: Do not proceed to Phase 5 until approval gate is bulletproof + rollback tested - Investigation needed: What approval flow feels natural? Too many questions → annoying; too few → unsafe 4. **Memory retrieval must scale to 10k+ memories without degradation** - Decision gate: Test memory system with synthetic 10k message dataset before Phase 4 - Investigation needed: Does hierarchical memory + vector DB compression actually work? Verify retrieval speed 5. **Perception must never block text responses** - Decision gate: Profile perception thread; if latency spike >200ms, optimize or defer feature - Investigation needed: How CPU-heavy is continuous webcam processing? Can it run at 1 FPS? --- ## Sources Aggregated **Stack Research**: Discord.py docs, Llama/Mistral benchmarks, Ollama vs vLLM comparisons, Whisper/faster-whisper performance, VRoid SDK, ChromaDB + Qdrant analysis **Features Research**: MIT Technology Review (AI companions 2026), Hume AI emotion docs, self-improving agents papers, company studies on emotion AI impact, uncanny valley voice research **Architecture Research**: Discord bot async patterns, LLM + memory RAG systems, vector database design, self-modification safeguards, deployment strategies **Pitfalls Research**: AI failure case studies (2025-2026), personality consistency literature, memory hallucination prevention, autonomy safety frameworks, performance monitoring practices --- ## Next Steps for Requirements Definition 1. **Phase 1 Deep Dive**: Specify exact Discord.py message handler, LLM prompt format, SQLite schema, YAML personality structure 2. **Phase 2 Spec**: Define memory hierarchy levels, confidence scoring system, personality audit rubric, tsundere balance metrics 3. **Phase 3 Prototype**: Early perception thread implementation; measure latency impact before committing 4. **Risk Mitigation**: Pre-Phase 5, build code generation + approval flow prototype; stress-test safety boundaries 5. **Testing Strategy**: Define personality consistency tests (50+ scenarios per phase), latency benchmarks (with profiling), memory accuracy validation --- ## Summary for Roadmapper **Hex Stack**: Llama 3.1 8B local inference + Discord.py async + SQLite + ChromaDB + local perception layer **Critical Success Factors**: 1. Personality consistency (weekly audits, memory-backed traits) 2. Latency discipline (async/await throughout, perception isolated) 3. Memory system (hierarchical, semantic search, confidence scoring) 4. Autonomy safety (mandatory approval, sandboxed testing, version control) 5. Relationship depth (proactivity, inside jokes, character growth) **6-Phase Build Path**: Foundation → Personality → Perception → Autonomy → Self-Mod → Polish **Key Decision Gates**: Personality consistency ✓ → Latency <3s ✓ → Memory scale test ✓ → Perception isolated ✓ → Approval flow safe ✓ **Confidence**: HIGH. All research coherent, no major technical blockers, proven technology stack. Ready for detailed requirements. --- **Document Version**: 1.0 **Synthesis Date**: January 27, 2026 **Status**: Ready for Requirements Definition and Phase 1 Planning