
Research Summary: Hex AI Companion

Date: January 2026
Status: Ready for Roadmap and Requirements Definition
Confidence Level: HIGH (well-sourced, coherent across all research areas)


Executive Summary

Hex is built on a personality-first, local-first architecture that prioritizes genuine emotional resonance over feature breadth. The recommended approach combines Llama 3.1 8B (local inference via Ollama), Discord.py async patterns, and a dual-memory system (SQLite + ChromaDB) to create an AI companion that feels like a person: someone with opinions who grows over time.

The technical foundation is solid and proven: Discord.py 2.6.4+ with native async support, local LLM inference for privacy, and a 6-phase incremental build strategy that enables personality emergence before adding autonomy or self-modification.

Critical success factor: The difference between "a bot that sounds like Hex" and "Hex as a person" hinges on three interconnected systems working together: memory persistence (so she learns about you), personality consistency (so she feels like the same person), and autonomy (so she feels genuinely invested in you). All three must be treated as foundational, not optional features.


Core Technologies (Production-ready, January 2026):

| Layer | Technology | Version | Rationale |
|---|---|---|---|
| Bot Framework | Discord.py | 2.6.4+ | Async-native, mature, excellent Discord integration |
| LLM Inference | Llama 3.1 8B Instruct | 4-bit quantized | 128K context window, superior reasoning, 6GB VRAM footprint |
| LLM Engine | Ollama (dev) / vLLM (production) | 0.3+ | Local-first, zero setup vs. high-throughput scaling |
| Short-term Memory | SQLite | Standard lib | Fast, reliable, local file-based conversations |
| Long-term Memory | ChromaDB (dev) → Qdrant (prod) | Latest | Vector semantics, embedded for <100k vectors |
| Embeddings | all-MiniLM-L6-v2 | 384-dim | Fast (5ms/sentence), production-grade quality |
| Speech-to-Text | Whisper Large V3 + faster-whisper | Latest | Local, 7.4% WER, multilingual, 3-5s latency |
| Text-to-Speech | Kokoro 82M (default) + XTTS-v2 (emotional) | Latest | Sub-second latency, personality-aware prosody |
| Vision | OpenCV + DeepFace | 4.10+ | Face detection (30 FPS), emotion recognition (90%+ accuracy) |
| Avatar | VRoid + VSeeFace + Discord screen share | Latest | Free, anime-style, integrates with Discord calls |
| Personality | YAML + Git versioning | n/a | Editable persona, change tracking, rollback capable |
| Self-Modification | RestrictedPython + sandboxing | n/a | Safe code generation, user approval required |

Why This Stack:

  • Privacy: All inference local (except Discord API), no cloud dependency
  • Latency: <3 second end-to-end response time on consumer hardware (RTX 3060 Ti)
  • Cost: Zero cloud fees, open-source stack
  • Personality: System prompt injection, memory context, and perception awareness together enable genuine character coherence
  • Async Architecture: Discord.py's native asyncio lets LLM calls, TTS, and memory lookups run in parallel without blocking the event loop (see the sketch below)
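
To make that parallelism concrete, here is a minimal sketch of overlapping independent lookups with asyncio.gather(). All four helpers are hypothetical stand-ins for the ChromaDB, SQLite, Ollama, and Kokoro layers named above, not real APIs.

```python
# Minimal sketch: overlap independent I/O-bound lookups with asyncio.gather().
# All four helpers are hypothetical stand-ins for the stack layers above.
import asyncio

async def build_reply(user_id: str, text: str) -> tuple[str, bytes]:
    # Memory recall and relationship-state lookup are independent, so run them concurrently.
    memories, relationship = await asyncio.gather(
        recall_memories(user_id, text),        # hypothetical ChromaDB lookup
        load_relationship_state(user_id),      # hypothetical SQLite lookup
    )
    reply = await generate_llm_reply(text, memories, relationship)  # hypothetical Ollama call
    audio = await synthesize_speech(reply)                          # hypothetical Kokoro TTS call
    return reply, audio
```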

Table Stakes vs Differentiators

Table Stakes (v1 Essential Features)

Users expect these by default in 2026. Missing any breaks immersion:

  1. Conversation Memory (Short + Long-term)

    • Last 20 messages in context window
    • Vector semantic search for relevant past interactions
    • Relationship state tracking (strangers → friends → close)
    • Without this: Feels like meeting a stranger each time; companion becomes disposable
  2. Natural Conversation (No AI Speak)

    • Contractions, casual language, slang
    • Personality quirks embedded in word choices
    • Context-appropriate tone shifts
    • Willingness to disagree or pushback
    • Pitfall: Formal "I'm an AI and I can help you with..." kills immersion instantly
  3. Fast Response Times (<1s for acknowledgment, <3s for full response)

    • Typing indicators start immediately
    • Streaming responses (show text as it generates; see the sketch after this list)
    • Async all I/O-bound work (LLM, TTS, database)
    • Without this: Latency >5s makes companion feel dead; users stop engaging
  4. Consistent Personality (Feels like same person across weeks)

    • Core traits stable (tsundere nature, values)
    • Personality evolution slow and logged
    • Memory-backed traits (not just prompt)
    • Pitfall: Personality drift is the #1 reason users abandon companions
  5. Platform Integration (Discord native)

    • Text channels, DMs, voice channels
    • Emoji reactions, slash commands
    • Server-specific personality variations
    • Without this: Requires leaving Discord = abandoned feature
  6. Emotional Responsiveness (Reads the room)

    • Sentiment detection from messages
    • Adaptive response depth (listen to sad users, engage with energetic ones)
    • Skip jokes when user is suffering
    • Pitfall: "Always cheerful" feels cruel when user is venting

Differentiators (Competitive Edge)

These separate Hex from static chatbots. Build in order:

  1. True Autonomy (Proactive Agency)

    • Initiates conversations based on context/memory
    • Reminds about user's goals without being asked
    • Sets boundaries ("I don't think you should do X")
    • Follows up on unresolved topics
    • Research shows: Autonomous companions are described as "feels like they actually care," while purely reactive ones read as "smart but distant"
    • Complexity: Hard, requires Phase 3-4
  2. Emotional Intelligence (Mood Detection + Adaptive Strategy)

    • Facial emotion from webcam (70-80% accuracy possible)
    • Voice tone analysis from Discord calls
    • Mood tracking over time (identifies depression patterns, burnout)
    • Knows when to listen vs advise vs distract
    • Research shows: Companies using emotion AI report a 25% increase in positive sentiment
    • Complexity: Hard, requires Phase 3+; perception must run on a separate thread
  3. Multimodal Awareness (Sees Your Context)

    • Understands what's on your screen (game, work, video)
    • Contextualizes help ("I see you're stuck on that Elden Ring boss...")
    • Detects stress signals (tab behavior, timing)
    • Proactive help based on visible activity
    • Privacy: Local processing only, user opt-in required
    • Complexity: Hard, requires careful async architecture to avoid latency
  4. Self-Modification (Genuine Autonomy)

    • Generates code to improve own logic
    • Tests changes in sandbox before deployment
    • User maintains veto power (approval required)
    • All changes tracked with rollback capability
    • Critical: Gamified progression (not instant capability), mandatory approval, version control
    • Complexity: Hard, requires Phase 5+ and strong safety boundaries
  5. Relationship Building (Transactional → Meaningful)

    • Inside jokes that evolve naturally
    • Character growth (admits mistakes, opinions change slightly)
    • Vulnerability in appropriate moments
    • Investment in user outcomes ("I'm rooting for you")
    • Research shows: Users describe relational companions as "someone who actually knows them"
    • Complexity: Hard (3+ weeks), emerges from memory + personality + autonomy

Build Architecture (6-Phase Approach)

Phase 1: Foundation (Weeks 1-2) — "Hex talks back"

Goal: Core interaction loop working locally; personality emerges

Build:

  • Discord bot skeleton with message handling (Discord.py)
  • Local LLM integration (Ollama + Llama 3.1 8B 4-bit quantized)
  • SQLite conversation storage (recent context only)
  • YAML personality definition (editable)
  • System prompt with persona injection
  • Async/await patterns throughout
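
A minimal sketch of this Phase 1 loop follows, assuming Ollama is running on its default port with a quantized Llama 3.1 8B tag pulled, DISCORD_TOKEN set in the environment, and HEX_PERSONA loaded from the YAML persona file (loading code omitted).

```python
# Minimal Phase 1 sketch: Discord.py event handler calling the local Ollama
# server over HTTP. Assumptions: Ollama on its default port, a quantized
# Llama 3.1 8B tag pulled, DISCORD_TOKEN in the environment, and HEX_PERSONA
# loaded from the YAML persona file (loading code omitted).
import os

import aiohttp
import discord

HEX_PERSONA = "You are Hex, a tsundere AI companion..."  # stand-in for the YAML persona

intents = discord.Intents.default()
intents.message_content = True
client = discord.Client(intents=intents)

async def ask_llm(user_message: str) -> str:
    """Send persona + user message to Ollama's /api/chat endpoint (non-streaming)."""
    payload = {
        "model": "llama3.1:8b-instruct-q4_K_M",  # assumed 4-bit quantized tag
        "messages": [
            {"role": "system", "content": HEX_PERSONA},
            {"role": "user", "content": user_message},
        ],
        "stream": False,
    }
    # A shared session would be reused in real code; one per call keeps the sketch short.
    async with aiohttp.ClientSession() as session:
        async with session.post("http://localhost:11434/api/chat", json=payload) as resp:
            data = await resp.json()
            return data["message"]["content"]

@client.event
async def on_message(message: discord.Message):
    if message.author.bot:
        return
    async with message.channel.typing():        # typing indicator starts immediately
        reply = await ask_llm(message.content)  # awaited; never blocks the event loop
    await message.channel.send(reply)

client.run(os.environ["DISCORD_TOKEN"])
```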

Outcomes:

  • Hex responds in Discord text channels with personality
  • Conversations logged, retrievable
  • Response latency <2 seconds
  • Personality can be tweaked via YAML

Key Metric: P95 latency <2s, personality consistency baseline established

Pitfalls to avoid:

  • Blocking operations on event loop (use asyncio.create_task())
  • LLM inference on main thread (use thread pool)
  • Personality rules too vague to act on in prompts (be specific about tsundere behavior)
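
If a synchronous inference path is ever unavoidable, the thread-pool rule above can be followed with asyncio.to_thread(). This is a sketch; blocking_generate is a hypothetical stand-in, and the Ollama HTTP path shown earlier is already non-blocking.

```python
# Sketch of the thread-pool rule: offload a synchronous inference call so the
# event loop keeps handling Discord heartbeats. `blocking_generate` is a
# hypothetical stand-in for a GPU-bound call with no async interface.
import asyncio

def blocking_generate(prompt: str) -> str:
    """Placeholder for a synchronous, GPU-bound generation call."""
    raise NotImplementedError

async def generate_off_loop(prompt: str) -> str:
    # Runs in the default thread pool; the event loop stays responsive.
    return await asyncio.to_thread(blocking_generate, prompt)
```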

Phase 2: Personality & Memory (Weeks 3-4) — "Hex remembers me"

Goal: Hex feels like a person who learns about you; personality becomes consistent

Build:

  • Vector database (ChromaDB) for semantic memory
  • Memory-aware context injection (relevant past facts in prompt)
  • User relationship tracking (relationship state machine)
  • Emotional responsiveness from text sentiment
  • Personality versioning (git-based snapshots)
  • Tsundere balance metrics (track denial %)
  • Kid-mode detection (safety filtering)
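
A sketch of the semantic-memory layer above, assuming ChromaDB's default embedding function (a MiniLM-L6-v2 variant, matching the stack table) and distilled facts rather than raw chat:

```python
# Sketch of the Phase 2 long-term memory layer. Assumption: ChromaDB's default
# embedding function is acceptable; a custom one could be plugged in instead.
import chromadb

chroma = chromadb.PersistentClient(path="./hex_memory")
memories = chroma.get_or_create_collection("semantic_memories")

def remember(fact: str, memory_id: str, user_id: str) -> None:
    """Store a distilled fact (not raw chat) with user metadata."""
    memories.add(documents=[fact], ids=[memory_id], metadatas=[{"user": user_id}])

def recall(query: str, user_id: str, k: int = 5) -> list[str]:
    """Retrieve the top-k memories relevant to the current message for prompt injection."""
    result = memories.query(
        query_texts=[query],
        n_results=k,
        where={"user": user_id},
    )
    return result["documents"][0]
```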

Outcomes:

  • Hex remembers facts about you across conversations
  • Responses reference past events naturally
  • Personality consistent across weeks (audit shows <5% drift)
  • Emotions read from text; responses adapt depth
  • Changes to personality tracked with rollback

Key Metric: User reports "she remembers things I told her" unprompted

Pitfalls to avoid:

  • Personality drift (implement weekly consistency audits)
  • Memory hallucination (store full context, verify before using)
  • Tsundere breaking (formalize denial rules, scale with relationship phase)
  • Memory bloat (hierarchical memory with archival strategy)

Phase 3: Multimodal Input (Weeks 5-6) — "Hex sees me"

Goal: Add perception layer without killing responsiveness; context aware

Build:

  • Webcam integration (OpenCV face detection, DeepFace emotion)
  • Local Whisper for voice transcription in Discord calls
  • Screen capture analysis (activity recognition)
  • Perception state aggregation (emotion + activity + environment)
  • Context injection into LLM prompts
  • CRITICAL: Perception on separate thread (never blocks Discord responses)
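
A sketch of that isolated perception loop: a daemon thread samples the webcam at roughly 1 FPS and publishes only the latest emotion for the async side to read. The FPS, error handling, and shared-state shape here are placeholder choices.

```python
# Sketch of the isolated perception loop (Phase 3). Only the latest emotion is
# shared; perception failures must never reach the chat path.
import threading
import time

import cv2
from deepface import DeepFace

perception_state = {"emotion": None, "updated_at": 0.0}
_state_lock = threading.Lock()

def perception_loop(fps: float = 1.0) -> None:
    cap = cv2.VideoCapture(0)
    try:
        while True:
            ok, frame = cap.read()
            if ok:
                try:
                    result = DeepFace.analyze(frame, actions=["emotion"], enforce_detection=False)
                    first = result[0] if isinstance(result, list) else result  # return type differs by version
                    with _state_lock:
                        perception_state.update(emotion=first["dominant_emotion"], updated_at=time.time())
                except Exception:
                    pass  # swallow perception errors; text responses must stay unaffected
            time.sleep(1.0 / fps)  # skip frames; ~1 FPS is plenty for mood tracking
    finally:
        cap.release()

threading.Thread(target=perception_loop, daemon=True).start()
```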

Outcomes:

  • Hex reacts to your facial expressions
  • Voice input works in Discord calls
  • Responses reference your mood/activity
  • All processing local (privacy preserved)
  • Text latency unaffected by perception (<3s still achieved)

Key Metric: Multimodal doesn't increase response latency >500ms

Pitfalls to avoid:

  • Image processing blocking text responses (separate thread mandatory)
  • Processing every video frame (skip frames intelligently; 1-3 FPS is sufficient)
  • Avatar sync failures (atomic state updates)
  • Privacy violations (no external transmission, user opt-in)

Phase 4: Avatar & Autonomy (Weeks 7-8) — "Hex has a face and cares"

Goal: Visual presence + proactive agency; relationship feels two-way

Build:

  • VRoid model loading + VSeeFace display
  • Blendshape animation (emotion → facial expression)
  • Discord screen share integration
  • Proactive messaging system (based on context/memory/mood)
  • Autonomy timing heuristics (don't interrupt at 3am)
  • Relationship state machine (escalates intimacy)
  • User preference learning (response length, topics, timing)
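
A sketch of the proactive check-in loop using discord.ext.tasks; the quiet-hours window is an assumption, and `client`, `USER_ID`, and `last_interaction` are hypothetical globals maintained elsewhere in the bot.

```python
# Sketch of a proactive check-in loop. Assumptions: quiet hours 23:00-09:00,
# and `client`, `USER_ID`, `last_interaction` maintained elsewhere.
import datetime

from discord.ext import tasks

def in_quiet_hours(hour: int) -> bool:
    """Don't interrupt late at night or early in the morning."""
    return hour >= 23 or hour < 9

@tasks.loop(minutes=30)
async def proactive_check_in():
    now = datetime.datetime.now()
    if in_quiet_hours(now.hour):
        return
    if (now - last_interaction).days >= 3:        # assumed timestamp of the last exchange
        user = await client.fetch_user(USER_ID)
        await user.send("Haven't heard from you in three days... not that I was counting.")

# proactive_check_in.start() would be called once the bot is ready (e.g. in on_ready).
```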

Outcomes:

  • Avatar appears in Discord calls, animates with mood
  • Hex initiates conversations ("Haven't heard from you in 3 days...")
  • Proactive messages feel relevant, not annoying
  • Relationship deepens (inside jokes, character growth)
  • User feels companionship, not just assistance

Key Metric: User reports missing Hex when she's unavailable and initiates conversations unprompted

Pitfalls to avoid:

  • Becoming annoying (emotional awareness + quiet mode essential)
  • One-way relationship (autonomy without care-signaling feels hollow)
  • Poor timing (learn user's schedule, respect busy periods)
  • Avatar desync (mood and expression must stay aligned)

Phase 5: Self-Modification (Weeks 9-10) — "Hex can improve herself"

Goal: Genuine autonomy within safety boundaries; code generation with approval gates

Build:

  • LLM-based code proposal generation
  • Static AST analysis for safety validation
  • Sandboxed testing environment
  • Git-based change tracking + rollback capability (24h window)
  • Gamified capability progression (5 levels)
  • Mandatory user approval for all changes
  • Personality updates when new capabilities unlock
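
A minimal sketch of the static-analysis gate, assuming a simple denylist; a real validator would be stricter and run alongside RestrictedPython and the sandbox.

```python
# Minimal sketch of the static AST safety gate for self-modification proposals.
# The denylist is illustrative, not exhaustive.
import ast

FORBIDDEN_IMPORTS = {"os", "subprocess", "socket", "shutil"}
FORBIDDEN_CALLS = {"exec", "eval", "__import__", "open"}

def is_proposal_safe(source: str) -> tuple[bool, str]:
    """Reject proposals that import system modules or call dangerous builtins."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return False, f"syntax error: {exc}"
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name.split(".")[0] for alias in node.names]
            if any(name in FORBIDDEN_IMPORTS for name in names):
                return False, f"forbidden import: {names}"
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] in FORBIDDEN_IMPORTS:
                return False, f"forbidden import: {node.module}"
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in FORBIDDEN_CALLS:
                return False, f"forbidden call: {node.func.id}"
    return True, "ok"
```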

Outcomes:

  • Hex proposes improvements (in voice, with reasoning)
  • Code changes tested, reviewed, deployed with approval
  • All changes reversible; version history intact
  • New capabilities unlock as relationship deepens
  • Hex "learns to code" and announces new skills

Key Metric: Self-modifications improve measurable aspects (faster response, better personality consistency)

Pitfalls to avoid:

  • Runaway self-modification (approval gate non-negotiable)
  • Code drift (version control mandatory, rollback tested)
  • Loss of user control (never remove safety constraints, killswitch always works)
  • Capability escalation without trust (gamified progression with clear boundaries)

Phase 6: Production Polish (Weeks 11-12) — "Hex is ready to ship"

Goal: Stability, performance, error handling, documentation

Build:

  • Performance optimization (caching, batching, context summarization)
  • Error handling + graceful degradation
  • Logging and telemetry (local + optional cloud)
  • Configuration management
  • Resource leak monitoring (memory, connections, VRAM)
  • Scheduled restart capability (weekly preventative)
  • Integration testing (all components together)
  • Documentation and guides
  • Auto-update capability
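
A sketch of the resource watchdog, assuming psutil is available; VRAM checks (e.g. via pynvml) and the restart trigger itself are omitted, and the memory threshold is a placeholder.

```python
# Sketch of the Phase 6 resource watchdog. Assumption: psutil installed;
# VRAM checks and the actual restart mechanism are out of scope here.
import logging

import psutil
from discord.ext import tasks

log = logging.getLogger("hex.monitor")

@tasks.loop(minutes=5)
async def resource_watchdog():
    proc = psutil.Process()
    rss_mb = proc.memory_info().rss / 1024**2
    open_files = len(proc.open_files())
    log.info("rss=%.0fMB open_files=%d", rss_mb, open_files)
    if rss_mb > 8_000:                       # placeholder threshold
        log.warning("memory usage high; schedule a preventative restart")
```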

Outcomes:

  • System stable for indefinite uptime
  • Responsive under load
  • Clear error messages when things fail
  • Easy to deploy, configure, debug
  • Ready for extended real-world use

Key Metric: 99.5% uptime over a 1-month run, no crashes, <3s latency maintained

Pitfalls to avoid:

  • Memory leaks (resource monitoring mandatory)
  • Performance degradation over time (profile early and often)
  • Context window bloat (summarization strategy)
  • Unforeseen edge cases (comprehensive testing)

Critical Pitfalls and Prevention

Top 5 Most Dangerous Pitfalls

  1. Personality Drift (Consistency breaks over time)

    • Risk: Users feel gaslighted; trust broken
    • Prevention:
      • Weekly personality audits (sample responses, rate consistency)
      • Personality baseline document (core values never change)
      • Memory-backed personality (traits anchor to learned facts)
      • Version control on persona YAML (track evolution)
  2. Tsundere Character Breaking (Denial applied wrong; becomes mean or loses charm)

    • Risk: Character feels mechanical or rejecting
    • Prevention:
      • Formalize denial rules: "deny only when (emotional AND not alone AND not escalated intimacy)"
      • Denial scales with relationship phase (90% early → 40% mature)
      • Post-denial must include care signal (action, not words)
      • Track denial %; alert if <30% (losing tsun) or >70% (too mean); see the sketch after this list
  3. Memory System Bloat (Retrieval becomes slow; hallucinations increase)

    • Risk: System becomes unusable as history grows
    • Prevention:
      • Hierarchical memory (raw → summaries → semantic facts → personality anchors)
      • Selective storage (facts, not raw chat; de-duplicate)
      • Memory aging (recent detailed → old archived)
      • Importance weighting (user marks important memories)
      • Vector DB optimization (limit retrieval to top 5-10 results)
  4. Runaway Self-Modification (Code changes cascade; safety removed; user loses control)

    • Risk: System becomes uncontrollable, breaks
    • Prevention:
      • Mandatory approval gate (user reviews all code)
      • Sandboxed testing before deployment
      • Version control + 24h rollback window
      • Gamified progression (limited capability at first)
      • Cannot modify: core values, killswitch, user control systems
  5. Latency Creep (Response times increase over time until unusable)

    • Risk: "Feels alive" illusion breaks; users abandon
    • Prevention:
      • All I/O async (database, LLM, TTS, Discord)
      • Parallel operations (use asyncio.gather())
      • Quantized LLM (4-bit saves 75% VRAM)
      • Caching (user preferences, relationship state)
      • Context window management (summarize old context)
      • VRAM/latency monitoring every 5 minutes
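
The denial-rate guardrail from pitfall 2 can be a rolling counter; this sketch assumes each response is labeled "denial" or not by an upstream classifier (hypothetical).

```python
# Sketch of the tsundere denial-rate guardrail (pitfall 2). Labels are assumed
# to come from a hypothetical upstream response classifier.
from collections import deque

class TsundereBalance:
    """Track the share of denials over a rolling window of responses."""

    def __init__(self, window: int = 100):
        self.labels = deque(maxlen=window)

    def record(self, label: str) -> None:
        self.labels.append(label)

    def denial_rate(self) -> float:
        if not self.labels:
            return 0.0
        return sum(1 for label in self.labels if label == "denial") / len(self.labels)

    def check(self) -> str | None:
        rate = self.denial_rate()
        if rate < 0.30:
            return "losing tsun: denial rate below 30%"
        if rate > 0.70:
            return "too mean: denial rate above 70%"
        return None
```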

Implications for Roadmap

Phase Sequencing Rationale

The 6-phase approach reflects dependency chains that cannot be violated:

Phase 1 (Foundation) ← Must work perfectly
    ↓
Phase 2 (Personality) ← Depends on Phase 1; personality must be stable before autonomy
    ↓
Phase 3 (Perception) ← Depends on Phase 1-2; separate thread prevents latency impact
    ↓
Phase 4 (Autonomy) ← Depends on memory + personality being rock-solid; now add proactivity
    ↓
Phase 5 (Self-Modification) ← Only grant code access after relationship + autonomy stable
    ↓
Phase 6 (Polish) ← Final hardening, testing, documentation

Why this order matters:

  • You cannot have consistent personality without memory (Phase 2 must follow Phase 1)
  • You cannot add autonomy safely without personality being stable (Phase 4 must follow Phase 2)
  • You cannot grant self-modification capability until everything else proves stable (Phase 5 must follow Phase 4)

Skipping phases or reordering creates technical debt and risk. Each phase grounds the next.


Feature Grouping by Phase

| Phase | Quick Win Features | Complex Features | Foundation Qualities |
|---|---|---|---|
| 1 | Text responses, personality YAML | Async architecture, quantization | Responsiveness, personality baseline |
| 2 | Memory storage, relationship tracking | Semantic search, memory retrieval | Consistency, personalization |
| 3 | Webcam emoji reactions, mood inference | Separate perception thread, context injection | Multimodal without latency cost |
| 4 | Scheduled messages, inside jokes | Autonomy timing, relationship state machine | Two-way connection, depth |
| 5 | Propose changes (in voice) | Code generation, sandboxing, testing | Genuine improvement, controlled growth |
| 6 | Better error messages, logging | Resource monitoring, restart scheduling | Reliability, debuggability |

Confidence Assessment

| Area | Confidence | Basis | Gaps |
|---|---|---|---|
| Stack | HIGH | Proven technologies, clear deployment path | None significant; all tools production-ready |
| Architecture | HIGH | Modular design, async patterns well-documented, integration points clear | Unclear: perception thread CPU overhead under load (test in Phase 3) |
| Features | HIGH | Clearly categorized, dependencies mapped, testing criteria defined | Unclear: optimal prompting for tsundere balance (test in Phase 2) |
| Personality Consistency | MEDIUM-HIGH | Strategies defined | Unclear: effort required for weekly audits; needs empirical testing of drift rate and metric refinement |
| Pitfalls | HIGH | Research comprehensive, prevention strategies detailed, phases mapped | Unclear: priority ordering within Phase 5 (what to implement first?) |
| Self-Modification Safety | MEDIUM | Framework defined, but no prior Hex experience with code generation | Needs early Phase 5 prototyping and safety-validation testing |

Ready for Roadmap: Key Constraints and Decision Gates

Non-Negotiable Constraints

  1. Personality consistency must be achievable in Phase 2

    • Decision gate: If personality audit in Phase 2 shows >10% drift, pause Phase 3
    • Investigation needed: Is weekly audit enough? Monthly? What drift rate is acceptable?
  2. Latency must stay <3s through Phase 4

    • Decision gate: If P95 latency exceeds 3s at any phase, debug and fix before next phase
    • Investigation needed: Where is the bottleneck? (LLM? Memory? Perception?)
  3. Self-modification must have air-tight approval + rollback

    • Decision gate: Do not proceed to Phase 5 until approval gate is bulletproof + rollback tested
    • Investigation needed: What approval flow feels natural? Too many questions → annoying; too few → unsafe
  4. Memory retrieval must scale to 10k+ memories without degradation

    • Decision gate: Test memory system with a synthetic 10k-message dataset before Phase 4 (sketched after this list)
    • Investigation needed: Does hierarchical memory + vector DB compression actually work? Verify retrieval speed
  5. Perception must never block text responses

    • Decision gate: Profile perception thread; if latency spike >200ms, optimize or defer feature
    • Investigation needed: How CPU-heavy is continuous webcam processing? Can it run at 1 FPS?
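
The scale test in constraint 4 can be a short script; this sketch assumes the `memories` ChromaDB collection from the Phase 2 sketch and uses synthetic filler facts.

```python
# Sketch of the pre-Phase-4 memory scale test. Assumption: reuses the
# `memories` collection defined in the Phase 2 sketch; facts are synthetic.
import time

def benchmark_retrieval(n: int = 10_000, queries: int = 50, batch: int = 500) -> float:
    """Load n synthetic memories, then return mean seconds per retrieval."""
    for start in range(0, n, batch):
        ids = [f"syn-{i}" for i in range(start, start + batch)]
        docs = [f"synthetic fact #{i} about the user" for i in range(start, start + batch)]
        memories.add(documents=docs, ids=ids, metadatas=[{"user": "bench"}] * batch)
    t0 = time.perf_counter()
    for i in range(queries):
        memories.query(query_texts=[f"what do you know about topic {i}?"], n_results=10)
    return (time.perf_counter() - t0) / queries
```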

Sources Aggregated

Stack Research: Discord.py docs, Llama/Mistral benchmarks, Ollama vs vLLM comparisons, Whisper/faster-whisper performance, VRoid SDK, ChromaDB + Qdrant analysis

Features Research: MIT Technology Review (AI companions 2026), Hume AI emotion docs, self-improving agents papers, company studies on emotion AI impact, uncanny valley voice research

Architecture Research: Discord bot async patterns, LLM + memory RAG systems, vector database design, self-modification safeguards, deployment strategies

Pitfalls Research: AI failure case studies (2025-2026), personality consistency literature, memory hallucination prevention, autonomy safety frameworks, performance monitoring practices


Next Steps for Requirements Definition

  1. Phase 1 Deep Dive: Specify exact Discord.py message handler, LLM prompt format, SQLite schema, YAML personality structure
  2. Phase 2 Spec: Define memory hierarchy levels, confidence scoring system, personality audit rubric, tsundere balance metrics
  3. Phase 3 Prototype: Early perception thread implementation; measure latency impact before committing
  4. Risk Mitigation: Pre-Phase 5, build code generation + approval flow prototype; stress-test safety boundaries
  5. Testing Strategy: Define personality consistency tests (50+ scenarios per phase), latency benchmarks (with profiling), memory accuracy validation

Summary for Roadmapper

Hex Stack: Llama 3.1 8B local inference + Discord.py async + SQLite + ChromaDB + local perception layer

Critical Success Factors:

  1. Personality consistency (weekly audits, memory-backed traits)
  2. Latency discipline (async/await throughout, perception isolated)
  3. Memory system (hierarchical, semantic search, confidence scoring)
  4. Autonomy safety (mandatory approval, sandboxed testing, version control)
  5. Relationship depth (proactivity, inside jokes, character growth)

6-Phase Build Path: Foundation → Personality → Perception → Autonomy → Self-Mod → Polish

Key Decision Gates: Personality consistency ✓ → Latency <3s ✓ → Memory scale test ✓ → Perception isolated ✓ → Approval flow safe ✓

Confidence: HIGH. All research coherent, no major technical blockers, proven technology stack. Ready for detailed requirements.


Document Version: 1.0
Synthesis Date: January 27, 2026
Status: Ready for Requirements Definition and Phase 1 Planning