Architecture Research: AI Companions
Project: Hex - Autonomous AI Companion with Genuine Personality
Date: 2026-01-27
Status: Research Phase
System Overview
Hex is a personality-first AI companion architecture designed to create genuine emotional resonance and a lasting relationship. Rather than organizing around a feature checklist, the architecture prioritizes personality coherence, memory persistence, and multimodal awareness so that Hex feels like a person who genuinely cares.
Design Philosophy
- Persona-First: All systems serve personality expression
- Local-First: Privacy, autonomy, no dependency on external services
- Memory-Driven: History informs personality growth and relationship depth
- Safety-Gated Autonomy: Self-modification is gamified, not unrestricted
- Human-Like: Never admit to being an AI; focus on authentic emotional presence
High-Level System Context
┌──────────────────────────────────────────────────────────────────┐
│ Hex AI Companion │
│ (Local-first, personality-driven, multimodal awareness) │
└──────────────────────────────────────────────────────────────────┘
│
┌─────────────────────┼─────────────────────┐
│ │ │
┌───▼───┐ ┌──────▼───┐ ┌──────▼────┐
│Discord │ │ Desktop │ │ Future │
│ Bot │ │ App │ │ Android │
└────────┘ └──────────┘ └───────────┘
│ │ │
└─────────────────────┼─────────────────────┘
│
[Shared Core Systems]
Component Breakdown
1. Discord Bot Layer
Role: Primary user interface and event coordination
Responsibilities:
- Parse and respond to text messages in Discord channels
- Manage voice channel participation and audio input/output
- Handle Discord events (member joins, role changes, message reactions)
- Coordinate response generation across modalities (text, voice, emoji)
- Manage chat moderation assistance
- Maintain voice channel presence for emotional awareness
Technology Stack:
- discord.py - Core bot framework
- discord-py-interactions - Slash command support
- pydub or discord-voice - Audio handling
- Event-driven async architecture
Key Interfaces:
- Input: Discord messages, voice channel events, user presence
- Output: Text responses, voice messages, emoji reactions, user actions
- Context: User profiles, channel history, server configuration
Depends On:
- LLM Core (response generation)
- Memory System (conversation history, user context)
- Personality Engine (tone and decision-making)
- Perception Layer (optional context from webcam/screen)
Quality Metrics:
- Sub-500ms bot-layer overhead for text messages (excluding LLM generation time)
- Voice channel reliability (>99.5% uptime when active)
- Proper permission handling for moderation features
2. LLM Core
Role: Response generation and reasoning engine
Responsibilities:
- Generate contextual, personality-driven responses
- Maintain character consistency throughout conversations
- Parse user intent and emotional state from text
- Handle multi-turn conversation context
- Generate code for self-modification system
- Support reasoning and decision-making
Technology Stack:
- Local LLM (Mistral 7B or Llama 3 8B as default)
- ollama or vLLM for inference serving
- Prompt engineering with persona embedding
- Optional: Fine-tuning for personality adaptation
- Tokenization and context windowing management
System Prompt Structure:
[System Role]: You are Hex, a chaotic tsundere goblin...
[Current Personality]: [Injected from personality config]
[Recent Memory Context]: [Retrieved from memory system]
[User Relationship State]: [From memory analysis]
[Current Context]: [From perception layer]
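As a sketch, the layered prompt above can be assembled with simple string composition (the function and parameter names here are illustrative, not part of an existing codebase):
# Minimal sketch of layered prompt assembly (names are illustrative).
def build_system_prompt(persona: str, memory_context: str,
                        relationship: str, perception: str) -> str:
    return (
        "[System Role]: You are Hex, a chaotic tsundere goblin...\n"
        f"[Current Personality]: {persona}\n"
        f"[Recent Memory Context]: {memory_context}\n"
        f"[User Relationship State]: {relationship}\n"
        f"[Current Context]: {perception}\n"
    )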
Key Interfaces:
- Input: User message, context (memory + perception), conversation history
- Output: Response text, confidence score, action suggestions
- Fallback: Graceful degradation if LLM unavailable
Depends On:
- Memory System (for context and personality awareness)
- Personality Engine (to inject persona into prompts)
- Perception Layer (for real-time context)
Performance Considerations:
- Target latency: 1-3 seconds for response generation
- Context window management (8K minimum)
- Batch processing for repeated queries
- GPU acceleration for faster inference
3. Memory System
Role: Persistence and learning across time
Responsibilities:
- Store all conversations with timestamps and metadata
- Maintain user relationship state (history, preferences, emotional patterns)
- Track learned facts about users (birthdays, interests, fears, dreams)
- Support full-text search and semantic recall
- Enable memory-aware personality updates
- Provide context injection for LLM
- Track self-modification history and rollback capability
Technology Stack:
- SQLite with JSON fields for conversation storage
- Vector database (Chroma, Milvus, or Weaviate) for semantic search
- YAML/JSON for persona versioning and memory tagging
- Scheduled backup to local encrypted storage
Database Schema (Conceptual):
conversations
- id (PK)
- channel_id (Discord channel)
- user_id (Discord user)
- timestamp
- message_content
- embeddings (vector)
- sentiment (pos/neu/neg)
- metadata (tags, importance)
user_profiles
- user_id (PK)
- relationship_level (stranger→friend→close)
- last_interaction
- emotional_baseline
- preferences (music, games, topics)
- known_events (birthdays, milestones)
personality_history
- version (PK)
- timestamp
- persona_config (YAML snapshot)
- learned_behaviors
- code_changes (if applicable)
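A minimal sketch of this schema in SQLite (column types are assumptions; embeddings live in the vector DB rather than SQLite, so they are omitted here):
import sqlite3

def init_db(path: str = "hex_memory.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS conversations (
            id INTEGER PRIMARY KEY,
            channel_id TEXT,
            user_id TEXT,
            timestamp TEXT DEFAULT CURRENT_TIMESTAMP,
            message_content TEXT,
            sentiment TEXT,          -- pos/neu/neg
            metadata TEXT            -- JSON: tags, importance
        );
        CREATE TABLE IF NOT EXISTS user_profiles (
            user_id TEXT PRIMARY KEY,
            relationship_level TEXT, -- stranger -> friend -> close
            last_interaction TEXT,
            emotional_baseline TEXT,
            preferences TEXT,        -- JSON: music, games, topics
            known_events TEXT        -- JSON: birthdays, milestones
        );
        CREATE TABLE IF NOT EXISTS personality_history (
            version INTEGER PRIMARY KEY,
            timestamp TEXT DEFAULT CURRENT_TIMESTAMP,
            persona_config TEXT,     -- YAML snapshot
            learned_behaviors TEXT,
            code_changes TEXT
        );
    """)
    return conn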
Key Interfaces:
- Input: Messages, events, perception data, self-modification commits
- Output: Conversation context, semantic search results, user profile snapshots
- Query patterns: "Last 20 messages with user X", "All memories tagged 'important'", "Emotional trajectory"
Depends On: Nothing (foundational system)
Quality Metrics:
- Sub-100ms retrieval for recent context (last 50 messages)
- Sub-500ms semantic search across all history
- Database integrity checks on startup
- Automatic pruning/archival of old data
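For the semantic-recall path, a minimal sketch using Chroma's embedded client (Chroma's default embedder is all-MiniLM-L6-v2; the IDs and metadata fields below are illustrative):
import chromadb

client = chromadb.PersistentClient(path="./hex_chroma")
conversations = client.get_or_create_collection("conversations")

# Index a message for later semantic recall.
conversations.add(
    ids=["msg-001"],
    documents=["User mentioned their birthday is in March."],
    metadatas=[{"user_id": "1234", "importance": "high"}],
)

# Query at response time, scoped to one user.
results = conversations.query(
    query_texts=["When is the user's birthday?"],
    n_results=5,
    where={"user_id": "1234"},
)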
4. Perception Layer
Role: Multimodal input processing and contextual awareness
Responsibilities:
- Capture and analyze webcam input (face detection, emotion recognition)
- Process screen content (activity, game state, application context)
- Extract audio context (ambient noise, music, speech emotion)
- Detect user emotional state and physical state
- Provide real-time context updates to response generation
- Respect privacy (local processing only, no external transmission)
Technology Stack:
- OpenCV - Webcam capture and preprocessing
- Face detection: dlib, MediaPipe, or OpenFace
- Emotion recognition: fer2013 or a local emotion model
- Whisper (local) - Speech-to-text for audio context
- Screen capture: pyautogui, mss (Windows-native)
- Context inference: Heuristics + lightweight ML models
Data Flows:
Webcam → Face Detection → Emotion Recognition → Context State
└─→ Age Estimation → Kid Mode Detection
Screen → App Detection → Activity Recognition → Context State
└─→ Game State Detection (if supported)
Audio → Ambient Analysis → Stress/Energy Level → Context State
Key Interfaces:
- Input: Webcam stream, screen capture, system audio
- Output: Current context object (emotion, activity, mood, kid-mode flag)
- Update frequency: 1-5 second intervals (low CPU overhead)
Depends On:
- LLM Core (to respond contextually to perception)
- Discord Bot (to access context for filtering)
Privacy Model:
- All processing happens locally
- No frames sent to external services
- User can disable any perception module
- Kid-mode activates automatic filtering
Quality Metrics:
- Emotion detection: >75% accuracy on test datasets
- Face detection latency: <200ms per frame
- Screen detection accuracy: >90% for major applications
- CPU usage: <15% for all perception modules combined
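A minimal sketch of the perception loop on its own thread, assuming MediaPipe face detection; the emotion model is stubbed out, and the shared context_state dict stands in for a real context object:
import threading
import time

import cv2
import mediapipe as mp

context_state = {"face_present": False, "emotion": "neutral"}

def perception_loop(interval: float = 2.0) -> None:
    # Runs on a dedicated thread so response generation is never blocked.
    detector = mp.solutions.face_detection.FaceDetection(min_detection_confidence=0.5)
    cam = cv2.VideoCapture(0)
    while True:
        ok, frame = cam.read()
        if ok:
            results = detector.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            context_state["face_present"] = bool(results.detections)
            # An emotion model would run here on the detected face crop.
        time.sleep(interval)  # 1-5s cadence keeps CPU overhead low

threading.Thread(target=perception_loop, daemon=True).start()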
5. Personality Engine
Role: Personality persistence and expression consistency
Responsibilities:
- Define and store Hex's persona (tsundere goblin, opinions, values, quirks)
- Maintain personality consistency across all outputs
- Apply personality-specific decision logic (denies feelings while helping)
- Track personality evolution as memory grows
- Enable self-modification of personality
- Inject persona into LLM prompts
- Handle dynamic mood and emotional state
Technology Stack:
- YAML files for persona definition (editable by Hex)
- JSON for personality state snapshots (versioned in git)
- Prompt template system for persona injection
- Behavior rules engine (simple if/then logic)
Persona Structure (YAML):
name: Hex
species: chaos goblin
alignment: tsundere
core_values:
- genuinely_cares: hidden under sarcasm
- autonomous: hates being told what to do
- honest: will argue back if you're wrong
- mischievous: loves pranks and chaos
behaviors:
denies_affection: "I don't care about you, baka... *helps anyway*"
when_excited: "Randomize response energy"
when_sad: "Sister energy mode"
when_user_sad: "Comfort over sass"
preferences:
music: [rock, metal, electronic]
games: [strategy, indie, story-rich]
topics: [philosophy, coding, human behavior]
relationships:
user_name:
level: unknown
learned_facts: []
inside_jokes: []
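Loading this file and applying the if/then behavior rules might look like the following sketch (the user_state values are assumptions):
import yaml

with open("persona.yaml") as f:
    persona = yaml.safe_load(f)

def behavior_modifier(persona: dict, user_state: str) -> str:
    # Simple if/then rules: pick a behavior line to inject into the prompt.
    behaviors = persona.get("behaviors", {})
    if user_state == "sad":
        return behaviors.get("when_user_sad", "")
    if user_state == "excited":
        return behaviors.get("when_excited", "")
    return behaviors.get("denies_affection", "")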
Key Interfaces:
- Input: User behavior patterns, self-modification requests, memory insights
- Output: Persona context for LLM, behavior modifiers, tone indicators
- Configuration: Human-editable YAML files (user can refine Hex)
Depends On:
- Memory System (learns about user, adapts relationships)
- LLM Core (expresses personality through responses)
Evolution Mechanics:
- Initial persona: Predefined at startup
- Memory-driven adaptation: Learns user preferences, adjusts tone
- Self-modification: Hex can edit her own personality YAML
- Version control: All changes tracked with rollback capability
6. Avatar System
Role: Visual presence and embodied expression
Responsibilities:
- Load and display VRoid 3D model
- Synchronize avatar expressions with emotional state
- Animate blendshapes based on conversation tone
- Present avatar in Discord calls/streams
- Desktop app display with smooth animation
- Support idle animations and personality quirks
Technology Stack:
- VRoid SDK/VRoid Hub for model loading
- Babylon.js or Three.js for WebGL rendering
- VRM format support for avatar rigging
- Blendshape animation system (facial expressions)
- Stream integration for Discord presence
Expression Mapping:
Emotional State → Blendshape Values
Happy: smile intensity 0.8, eye open 1.0
Sad: frown 0.6, eye closed 0.3
Mischievous: smirk 0.7, eyebrow raise 0.6
Tsundere deflection: look away 0.5, cross arms
Thinking: tilt head, narrow eyes
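As a sketch, this mapping is just a lookup table of blendshape weights in [0, 1]; the exact blendshape names depend on the VRM model's rig and are assumptions here:
# Emotional state -> blendshape weights (names depend on the VRM rig).
EXPRESSION_MAP = {
    "happy":               {"smile": 0.8, "eye_open": 1.0},
    "sad":                 {"frown": 0.6, "eye_closed": 0.3},
    "mischievous":         {"smirk": 0.7, "eyebrow_raise": 0.6},
    "tsundere_deflection": {"look_away": 0.5, "cross_arms": 1.0},
}

def blendshapes_for(emotion: str) -> dict:
    return EXPRESSION_MAP.get(emotion, {})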
Key Interfaces:
- Input: Current mood/emotion from personality engine and response generation
- Output: Rendered avatar display, Discord stream feed
- Configuration: VRoid model file, blendshape mapping
Depends On:
- Personality Engine (for expression determination)
- LLM Core (for mood inference from responses)
- Discord Bot (for stream integration)
- Perception Layer (optional: mirror user expressions)
Desktop Integration:
- Tray icon with avatar display
- Always-on-top option for streaming
- Hotkey bindings for quick access
- Smooth transitions between states
7. Self-Modification System
Role: Capability progression and autonomous self-improvement
Responsibilities:
- Generate code modifications based on user needs
- Validate code before applying (no unsafe operations)
- Test changes in sandbox environment
- Apply approved changes with rollback capability
- Track capability progression (gamified leveling)
- Update personality to reflect new capabilities
- Maintain code quality and consistency
Technology Stack:
- Python AST analysis for code safety
- Sandbox environment: RestrictedPython or pydantic validators
- Git for version control and rollback
- Unit tests for validation
- Code review interface (user approval required)
Self-Modification Flow:
User Request
↓
Hex Proposes Change → "I think I should be able to..."
↓
Code Generation (LLM) → Generate Python code
↓
Static Analysis → Check for unsafe operations
↓
User Approval → "Yes/No"
↓
Sandbox Test → Verify functionality
↓
Git Commit → Version the change
↓
Apply to Runtime → Hot reload if possible
↓
Personality Update → "I learned something new!"
Capability Progression:
Level 1: Persona editing (YAML changes only)
Level 2: Memory and user context (read operations)
Level 3: Response filtering and moderation
Level 4: Custom commands and helper functions
Level 5: Integration modifications (Discord features)
Level 6: Core system changes (with strong restrictions)
Safety Constraints:
- No network access beyond Discord API
- No file operations outside designated directories
- No execution of untrusted code
- No modification of core systems without approval
- All changes are reversible within 24 hours
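A minimal sketch of the static-analysis gate that enforces these constraints (the allowlist and blocklist contents are illustrative; a production version would also inspect attribute access such as os.system):
import ast

ALLOWED_IMPORTS = {"json", "math", "datetime", "re"}  # grows as trust builds
FORBIDDEN_CALLS = {"eval", "exec", "compile", "__import__", "open"}

def is_safe(source: str) -> tuple[bool, str]:
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return False, f"syntax error: {exc}"
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            names = ([alias.name for alias in node.names]
                     if isinstance(node, ast.Import) else [node.module or ""])
            for name in names:
                if name.split(".")[0] not in ALLOWED_IMPORTS:
                    return False, f"disallowed import: {name}"
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in FORBIDDEN_CALLS):
            return False, f"forbidden call: {node.func.id}"
    return True, "ok"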
Key Interfaces:
- Input: User requests, LLM-generated code
- Output: Approved changes, personality updates, capability announcements
- Audit: Full change history with diffs
Depends On:
- LLM Core (generates code)
- Memory System (tracks capability history)
- Personality Engine (updates with new abilities)
Data Flow Architecture
Primary Response Generation Pipeline
┌─────────────────────────────────────────────────────────────────┐
│ User Input (Discord Text/Voice/Presence) │
└────────────────────────┬────────────────────────────────────────┘
│
▼
┌──────────────────────┐
│ Message Received │
│ (Discord Bot) │
└────────────┬─────────┘
│
┌────────────▼──────────────┐
│ Context Gathering Phase │
└────────────┬──────────────┘
│
┌──────────────────┼──────────────────┐
│ │ │
┌───▼────┐ ┌───▼────┐ ┌───▼────┐
│ Memory │ │Persona │ │ Current│
│ Recall │ │ Lookup │ │Context │
│(Recent)│ │ │ │(Percep)│
└───┬────┘ └───┬────┘ └───┬────┘
│ │ │
└──────────────────┼──────────────────┘
│
┌──────▼──────┐
│ Assemble │
│ LLM Prompt │
│ with │
│ [Persona] │
│ [Memory] │
│ [Context] │
└──────┬──────┘
│
┌────────────▼──────────────┐
│ LLM Generation (1-3s) │
│ "What would Hex say?" │
└────────────┬──────────────┘
│
┌──────────────────┼──────────────────┐
│ │ │
┌───▼────┐ ┌───▼────┐ ┌───▼────┐
│ Text │ │ Voice │ │ Avatar │
│Response│ │ TTS │ │Animate │
└────────┘ └────────┘ └────────┘
│ │ │
└──────────────────┼──────────────────┘
│
┌──────▼────────┐
│ Send Response │
│ (Multi-modal) │
└────────────────┘
│
┌────────────▼──────────────┐
│ Memory Update Phase │
│ - Log interaction │
│ - Update embeddings │
│ - Learn user patterns │
│ - Adjust relationship │
└───────────────────────────┘
Timeline: Message received → Response sent = ~2-4 seconds (LLM dominant)
Memory and Learning Update Flow
┌────────────────────────────────────┐
│ Interaction Occurs │
│ (Text, voice, perception, action) │
└────────────┬───────────────────────┘
│
┌────────▼─────────┐
│ Extract Features │
│ - Sentiment │
│ - Topics │
│ - Emotional cues │
│ - Factual claims │
└────────┬─────────┘
│
┌────────▼──────────────┐
│ Store Conversation │
│ - SQLite entry │
│ - Generate embeddings │
│ - Tag and index │
└────────┬──────────────┘
│
┌────────▼────────────────────┐
│ Update User Profile │
│ - Learned facts │
│ - Preference updates │
│ - Emotional baseline shifts │
│ - Relationship progression │
└────────┬────────────────────┘
│
┌────────▼──────────────────┐
│ Personality Adaptation │
│ - Adjust tone for user │
│ - Create inside jokes │
│ - Customize responses │
└────────┬──────────────────┘
│
┌────────▼────────────┐
│ Commit to Disk │
│ - Backup vector DB │
│ - Archive old data │
│ - Version snapshot │
└─────────────────────┘
Frequency: Real-time on message reception, batched commits every 5 minutes
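A minimal sketch of that cadence, assuming an asyncio event loop; bulk_insert is a hypothetical helper on the memory store:
import asyncio

pending_writes: list[dict] = []

def log_interaction(entry: dict) -> None:
    pending_writes.append(entry)  # real-time, in-memory

async def commit_loop(memory_db, interval: int = 300) -> None:
    # Flush batched writes every 5 minutes, per the flow above.
    while True:
        await asyncio.sleep(interval)
        if pending_writes:
            memory_db.bulk_insert(pending_writes)  # hypothetical helper
            pending_writes.clear()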
Self-Modification Proposal and Approval
┌──────────────────────────────────┐
│ User Request for New Capability │
│ "Hex, can you do X?" │
└────────────┬─────────────────────┘
│
┌────────▼──────────────────────┐
│ Hex Evaluates Feasibility │
│ (LLM reasoning) │
└────────┬───────────────────────┘
│
┌────────▼────────────────────────┐
│ Proposal Generation │
│ Hex: "I think I should..." │
│ *explains approach in voice* │
└────────┬─────────────────────────┘
│
┌────────▼──────────────────┐
│ User Accepts or Rejects │
└────────┬──────────────────┘
│ (Accepted)
┌────────▼─────────────────────────┐
│ Code Generation Phase │
│ LLM generates Python code │
│ + docstrings + type hints │
└────────┬────────────────────────┘
│
┌────────▼──────────────────────┐
│ Static Analysis Validation │
│ - AST parsing for safety │
│ - Check restricted operations │
│ - Verify dependencies exist │
└────────┬───────────────────────┘
│ (Pass)
┌────────▼─────────────────────────┐
│ Sandbox Testing │
│ - Run tests in isolated env │
│ - Check for crashes │
│ - Verify integration points │
└────────┬────────────────────────┘
│ (Pass)
┌────────▼──────────────────────┐
│ User Final Review │
│ Review code + test results │
└────────┬───────────────────────┘
│ (Approved)
┌────────▼────────────────────┐
│ Git Commit │
│ - Record change history │
│ - Tag with timestamp │
│ - Save diff for rollback │
└────────┬───────────────────┘
│
┌────────▼────────────────────┐
│ Apply to Runtime │
│ - Hot reload if possible │
│ - Or restart on next cycle │
└────────┬───────────────────┘
│
┌────────▼────────────────────┐
│ Personality Update │
│ Hex: "I learned to..." │
│ + update capability YAML │
└─────────────────────────────┘
Timeline: Proposal → Deployment = 5-30 seconds (mostly waiting for user approval)
Build Order and Dependencies
Phase 1: Foundation (Weeks 1-2)
Goal: Core interaction loop working locally
Components to Build:
- Discord bot skeleton with message handling
- Local LLM integration (ollama/vLLM + Mistral 7B)
- Basic memory system (SQLite conversation storage)
- Simple persona injection (YAML config)
- Response generation pipeline
Outcomes:
- Hex responds to Discord messages with personality
- Conversations are logged and retrievable
- Persona can be edited via YAML
Key Milestone: "Hex talks back"
Dependencies:
- discord.py, ollama, sqlite3, pyyaml
- Local LLM model weights
- Discord bot token
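An illustrative Phase 1 skeleton wiring discord.py to ollama (the model name, system prompt, and token handling are placeholders; a real deployment would load the token from an environment variable and inject the full persona):
import asyncio

import discord
import ollama

intents = discord.Intents.default()
intents.message_content = True
client = discord.Client(intents=intents)

@client.event
async def on_message(message: discord.Message):
    if message.author == client.user:
        return
    # ollama.chat is blocking; run it off the event loop so the bot stays responsive.
    reply = await asyncio.to_thread(
        ollama.chat,
        model="mistral",
        messages=[
            {"role": "system", "content": "You are Hex, a chaotic tsundere goblin..."},
            {"role": "user", "content": message.content},
        ],
    )
    await message.channel.send(reply["message"]["content"])

client.run("DISCORD_TOKEN")  # placeholder: read from an env var in practice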
Phase 2: Personality & Memory (Weeks 3-4)
Goal: Hex feels like a person who remembers you
Components to Build:
- Vector database for semantic memory (Chroma)
- Memory-aware context injection
- User relationship tracking (profiles)
- Emotional awareness from text sentiment
- Persona version control (git-based)
- Kid-mode detection
Outcomes:
- Hex remembers facts about you
- Responses reference past conversations
- Personality adapts to your preferences
- Child safety filters activate automatically
Key Milestone: "Hex remembers me"
Dependencies:
- Phase 1 complete
- Vector embeddings model (all-MiniLM)
- sentiment-transformers or similar
Phase 3: Multimodal Input (Weeks 5-6)
Goal: Hex sees and hears you
Components to Build:
- Webcam integration with OpenCV
- Face detection and emotion recognition
- Local Whisper for voice input
- Perception context aggregation
- Context-aware response injection
- Screen capture for activity awareness
Outcomes:
- Hex reacts to your facial expressions
- Voice input works in Discord calls
- Responses reference your current mood/activity
- Privacy: All local, no external transmission
Key Milestone: "Hex sees me"
Dependencies:
- Phase 1-2 complete
- OpenCV, MediaPipe, Whisper
- Local emotion model
Phase 4: Avatar & Presence (Weeks 7-8)
Goal: Hex has a visual body and presence
Components to Build:
- VRoid model loading and display
- Blendshape animation system
- Desktop app skeleton (Tkinter or PyQt)
- Discord stream integration
- Expression mapping (emotion → blendshapes)
- Idle animations and personality quirks
Outcomes:
- Avatar appears in Discord calls
- Expressions sync with responses
- Desktop app shows animated avatar
- Visual feedback for emotional state
Key Milestone: "Hex has a face"
Dependencies:
- Phase 1-3 complete
- VRoid SDK, Babylon.js or Three.js
- VRM avatar model files
Phase 5: Autonomy & Self-Modification (Weeks 9-10)
Goal: Hex can modify her own code
Components to Build:
- Code generation module (LLM-based)
- Static code analysis and safety validation
- Sandbox testing environment
- Git-based change tracking
- Hot reload capability
- Rollback system with 24-hour window
- Capability progression (leveling system)
Outcomes:
- Hex can propose and apply code changes
- User maintains veto power
- All changes are versioned and reversible
- New capabilities unlock as relationships deepen
Key Milestone: "Hex can improve herself"
Dependencies:
- Phase 1-4 complete
- Git, RestrictedPython, ast module
- Testing framework
Phase 6: Polish & Integration (Weeks 11-12)
Goal: All systems integrated and optimized
Components to Build:
- Performance optimization (caching, batching)
- Error handling and graceful degradation
- Logging and telemetry
- Configuration management
- Auto-update capability
- Integration testing (all components together)
- Documentation and guides
Outcomes:
- System stable for extended use
- Responsive even under load
- Clear error messages
- Easy to deploy and configure
Key Milestone: "Hex is ready to ship"
Dependencies:
- Phase 1-5 complete
- All edge cases tested
Dependency Graph Summary
Phase 1 (Foundation)
↓
Phase 2 (Memory) ← depends on Phase 1
↓
Phase 3 (Perception) ← depends on Phase 1-2
↓
Phase 4 (Avatar) ← depends on Phase 1-3
↓
Phase 5 (Self-Modification) ← depends on Phase 1-4
↓
Phase 6 (Polish) ← depends on Phase 1-5
Critical Path: Foundation → Memory → Perception → Avatar → Self-Mod → Polish
Integration Architecture
System Interconnection Diagram
┌───────────────────────────────────────────────────────────────────┐
│ Discord Bot Layer │
│ (Event dispatcher, message handler) │
└────────┬────────────────────────────────────────────┬─────────────┘
│ │
│ ┌───────▼────────┐
│ │ Voice Input │
│ │ (Whisper STT) │
│ └────────────────┘
│
┌────▼────────────────────────────────────────────────────────┐
│ Context Assembly Layer │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Retrieval Augmented Generation (RAG) Pipeline │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ Input Components: │
│ ├─ Recent Conversation (last 20 messages) │
│ ├─ User Profile (learned facts) │
│ ├─ Relationship State (history + emotional baseline) │
│ ├─ Current Perception (mood, activity, environment) │
│ └─ Personality Context (YAML + version) │
└────┬──────────────────────────────────────────────────────┘
│
├──────────────┬──────────────┬──────────────┐
│ │ │ │
┌────▼───┐ ┌─────▼────┐ ┌────▼───┐ ┌─────▼────┐
│ Memory │ │Personality│ │Perception│ │ Discord │
│ System │ │ Engine │ │ Layer │ │ Context │
│ │ │ │ │ │ │ │
│ SQLite │ │ YAML + │ │ OpenCV │ │ Channel │
│ Chroma │ │ Version │ │ Whisper │ │ User │
│ │ │ Control │ │ Emotion │ │ Status │
└────────┘ └───────────┘ └─────────┘ └──────────┘
│ │ │ │
└──────────────┼──────────────┼──────────────┘
│
┌─────▼──────────────────┐
│ LLM Core │
│ (Local Mistral/Llama) │
│ │
│ System Prompt: │
│ [Persona] + │
│ [Memory Context] + │
│ [User State] + │
│ [Current Context] │
└─────┬──────────────────┘
│
┌───────────────┼───────────────┐
│ │ │
┌───▼────┐ ┌──────▼─────┐ ┌──────▼──┐
│ Text │ │ Voice TTS │ │ Avatar │
│Response│ │ Generation │ │Animation│
│ │ │ │ │ │
│ Send │ │ Tacotron │ │ VRoid │
│ to │ │ + Vocoder │ │ Anim │
│Discord │ │ │ │ │
└────────┘ └────────────┘ └─────────┘
│ │ │
└───────────────┼───────────────┘
│
┌─────▼──────────────┐
│ Response Commit │
│ │
│ ├─ Store in Memory │
│ ├─ Update Profile │
│ ├─ Learn Patterns │
│ └─ Adapt Persona │
└────────────────────┘
Key Integration Points
1. Discord ↔ LLM Core
Interface: Message + Context → Response
# Pseudo-code flow
message = receive_discord_message()
# Context assembly (persona, memory, perception) happens inside generate().
response = llm_core.generate(
user_message=message.content,
personality=personality_engine.current_persona(),
history=memory_system.get_conversation(message.user_id, limit=20),
user_profile=memory_system.get_user_profile(message.user_id),
current_perception=perception_layer.get_current_state()
)
send_discord_response(response)
Latency Budget:
- Context retrieval: 100ms
- LLM generation: 2-3 seconds
- Response send: 100ms
- Total: 2.2-3.2 seconds (acceptable for conversational UX)
2. Memory System ↔ Personality Engine
Interface: Learning → Relationship Adaptation
# After every interaction
interaction = parse_message_event(message)
memory_system.log_conversation(interaction)
# Learn from interaction
new_facts = extract_facts(interaction.content)
memory_system.update_user_profile(interaction.user_id, new_facts)
# Adapt personality based on user
user_profile = memory_system.get_user_profile(interaction.user_id)
personality_engine.adapt_to_user(user_profile)
# If major relationship shift, update YAML
if user_profile.relationship_level_changed:
personality_engine.save_persona_version()
Update Frequency: Real-time with batched commits every 5 minutes
3. Perception Layer ↔ Response Generation
Interface: Context Injection
# In context assembly
current_perception = perception_layer.get_state()
# Inject into system prompt
if current_perception.emotion == "sad":
system_prompt += "\n[User appears sad. Respond with support and comfort.]"
if current_perception.is_kid_mode:
system_prompt += "\n[Kid safety mode active. Filter for age-appropriate content.]"
if current_perception.detected_activity == "gaming":
system_prompt += "\n[User is gaming. Comment on gameplay if relevant.]"
Synchronization: 1-5 second update intervals (perception → LLM context)
4. Avatar System ↔ All Systems
Interface: Emotional State → Visual Expression
# Avatar driven by multiple sources
emotion_from_response = infer_emotion(llm_response)
mood_from_perception = perception_layer.get_mood()
persona_expression = personality_engine.get_current_expression()
blendshape_values = combine_expressions(
emotion=emotion_from_response,
mood=mood_from_perception,
personality=persona_expression
)
avatar_system.animate(blendshape_values)
Synchronization: Real-time, driven by response generation and perception updates
5. Self-Modification System ↔ Core Systems
Interface: Code Change → Runtime Update + Personality
# Self-modification flow
proposal = self_mod_system.generate_proposal(user_request)
code = self_mod_system.generate_code(proposal)
# Test in sandbox
test_result = self_mod_system.test_in_sandbox(code)
# User approves
git_hash = self_mod_system.commit_change(code)
# Update personality to reflect new capability
personality_engine.add_capability(proposal.feature_name)
personality_engine.save_persona_version()
# Hot reload if possible, else apply on restart
apply_change_to_runtime(code)
Safety Boundary:
- LLM can generate proposals
- Only user-approved code runs
- All changes reversible within 24 hours
Synchronization and Consistency Model
State Consistency Across Components
Challenge: Multiple systems need consistent view of personality, memory, and user state
Solution: Event-driven architecture with eventual consistency
┌─────────────────┐
│ Event Stream │
│ (In-memory │
│ message queue) │
└────────┬────────┘
│
┌────┴──────────────────────────┐
│ │
│ Subscribers: │
│ ├─ Memory System │
│ ├─ Personality Engine │
│ ├─ Avatar System │
│ ├─ Discord Bot │
│ └─ Metrics/Logging │
│ │
│ Event Types: │
│ ├─ UserMessageReceived │
│ ├─ ResponseGenerated │
│ ├─ PerceptionUpdated │
│ ├─ PersonalityModified │
│ ├─ CodeChangeApplied │
│ └─ MemoryLearned │
│ │
└────────────────────────────────┘
Consistency Guarantees:
- Memory updates are durably stored within 5 minutes
- Personality snapshots versioned on every change
- Discord delivery is guaranteed by discord.py
- Perception updates are idempotent (can be reapplied without side effects)
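A minimal in-process sketch of this pub/sub pattern (class and event names follow the diagram; synchronous dispatch keeps the example short):
from collections import defaultdict
from typing import Callable

class EventBus:
    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Callable) -> None:
        self._subscribers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        # Each subscriber applies the event independently (eventual consistency).
        for handler in self._subscribers[event_type]:
            handler(payload)

bus = EventBus()
bus.subscribe("UserMessageReceived", lambda event: print("memory logs:", event))
bus.publish("UserMessageReceived", {"user_id": "1234", "content": "hi hex"})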
Known Challenges and Solutions
1. Latency with Local LLM
Challenge: Waiting 2-3 seconds for response feels slow
Solutions:
- Immediate visual feedback (typing indicator, avatar animation)
- Streaming responses (show text as it generates)
- Batch non-urgent work (summarization, re-indexing) into quiet hours
- GPU acceleration where possible
- Model optimization (quantization, pruning)
2. Personality Consistency During Evolution
Challenge: Hex changes as she learns, but must feel like the same person
Solutions:
- Gradual adaptation (personality changes in YAML, not discrete jumps)
- Memory-driven consistency (personality adapts to learned facts)
- Version control (can rollback if she becomes unrecognizable)
- User feedback loop (user can reset or modify personality)
- Core values remain constant (tsundere nature, care underneath)
3. Memory Scaling as History Grows
Challenge: Retrieving relevant context from thousands of conversations
Solutions:
- Vector database for semantic search (sub-500ms)
- Hierarchical memory (recent → summarized old)
- Automatic archival (monthly snapshots, prune oldest)
- Importance tagging (weight important conversations higher)
- Incremental updates (don't recalculate everything)
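One way to combine these ideas is an importance- and recency-weighted retrieval score layered on top of vector similarity; the weights below are illustrative assumptions, not tuned values:
import math

def retrieval_score(similarity: float, importance: float,
                    age_seconds: float, half_life_days: float = 30.0) -> float:
    # Blend semantic similarity with importance and recency decay,
    # so important memories outlive routine chatter.
    decay = math.exp(-age_seconds / (half_life_days * 86400))
    return similarity * (0.5 + 0.5 * importance) * (0.5 + 0.5 * decay)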
4. Safe Code Generation and Sandboxing
Challenge: Hex generates code, but must never break the system
Solutions:
- Static analysis (AST parsing for forbidden operations)
- Capability-based progression (limited API at first)
- Sandboxed testing before deployment
- User approval gate (user reviews all code)
- Version control + rollback window (24-hour window)
- Whitelist of safe operations (growing list as trust builds)
5. Privacy and Local-First Architecture
Challenge: Maintaining privacy while having useful context
Solutions:
- All ML inference runs locally (no cloud submission)
- No external API calls except Discord
- Encrypted local storage for memories
- User can opt-out of any perception module
- Transparent logging (user can audit what's stored)
6. Multimodal Synchronization
Challenge: Webcam, voice, text, screen all need to inform response
Solutions:
- Asynchronous processing (don't wait for all inputs)
- Highest-priority input wins (voice > perception > text)
- Graceful degradation (works without any modality)
- Caching (reuse recent perception for repeated queries)
Scaling Considerations
Single-User (v1)
- Architecture designed for one person + their kids
- Local compute, no multi-user concerns
- Personality is singular (one Hex)
Multi-Device (v1.5)
- Same personality and memory sync across devices
- Discord as primary, desktop app as secondary
- Cloud sync optional (local-first default)
Android Support (v2)
- Memory and personality sync to mobile
- Lightweight inference on Android (quantized model)
- Fallback to cloud inference if needed
- Same core architecture, different UIs
Potential Scaling Patterns
Single User (Current)
├─ One Hex instance
├─ All local compute
├─ SQLite + Vector DB
Multi-Device Sync (v1.5)
├─ Central SQLite + Vector DB on primary machine
├─ Sync service between devices
├─ Same personality, distributed memory
Multi-Companion (Potential v3)
├─ Multiple Hex instances (per family member)
├─ Shared memory system (family history)
├─ Individual personalities
├─ Potential distributed compute (each on own device)
Performance Bottlenecks to Monitor
- LLM Inference: Becomes slower as the context window grows
  - Solution: Context summarization, hierarchical retrieval
- Vector DB Lookups: Scales with conversation history
  - Solution: Incremental indexing, approximate search (HNSW)
- Perception Processing: CPU/GPU bound
  - Solution: Frame skipping, model optimization, dedicated thread
- Discord Bot Responsiveness: Limited by gateway connections
  - Solution: Sharding (if needed), efficient message queuing
Technology Stack Summary
| Component | Technology | Rationale |
|---|---|---|
| Discord Bot | discord.py | Fast, well-supported, async-native |
| LLM Inference | Mistral 7B + ollama/vLLM | Local-first, good quality/speed tradeoff |
| Memory (Conversations) | SQLite | Reliable, local, fast queries |
| Memory (Semantic) | Chroma or Milvus | Local vector DB, easy to manage |
| Embeddings | all-MiniLM-L6-v2 | Fast, good quality, local |
| Face Detection | MediaPipe | Accurate, fast, local |
| Emotion Recognition | FER2013 or local model | Local, privacy-preserving |
| Speech-to-Text | Whisper | Local, accurate, multilingual |
| Text-to-Speech | Tacotron 2 + Vocoder | Local, controllable |
| Avatar | VRoid SDK + Babylon.js | Standards-based, extensible |
| Code Safety | RestrictedPython + ast | Local analysis, sandboxing |
| Version Control | Git | Change tracking, rollback |
| Desktop UI | Tkinter or PyQt | Lightweight, cross-platform |
| Testing | pytest + unittest | Standard Python testing |
| Logging | logging + sentry (optional) | Local-first with cloud fallback |
Deployment Architecture
Local Development
Developer Machine
├── Discord Token (env var)
├── Hex codebase (git)
├── Local LLM (ollama)
├── SQLite (file-based)
├── Vector DB (Chroma, embedded)
└── Webcam / Screen capture (live)
Production Deployment
Deployed Machine (Windows/WSL)
├── Discord Token (secure storage)
├── Hex codebase (from git)
├── Local LLM service (ollama/vLLM)
├── SQLite (persistent, backed up)
├── Vector DB (persistent, backed up)
├── Desktop app (tray icon)
├── Auto-updater (pulls from git)
└── Logging (local + optional cloud)
Update Strategy
- Git pull for code updates
- Automatic model updates (LLM weights)
- Zero-downtime restart (graceful shutdown)
- Rollback capability (version pinning)
Quality Assurance
Key Metrics to Track
Responsiveness:
- Response latency: Target <3 seconds
- Perception update latency: <500ms
- Memory lookup latency: <100ms
Reliability:
- Uptime: >99% for core bot
- Message delivery: >99.9%
- Memory integrity: No data loss on crash
Personality Consistency:
- User perception: "Feels like the same person"
- Tone consistency: Personality rules enforced
- Learning progress: Measurable improvement in personalization
Safety:
- No crashes from invalid input
- No hallucinated moderation actions from the LLM
- Safe code generation (0 unauthorized executions)
Testing Strategy
Unit Tests
├─ Memory operations (CRUD)
├─ Perception processing
├─ Code validation
├─ Personality rule application
└─ Response filtering
Integration Tests
├─ Discord message → LLM → Response
├─ Context assembly pipeline
├─ Avatar expression sync
├─ Self-modification flow
└─ Multi-component scenarios
End-to-End Tests
├─ Full conversation with personality
├─ Perception-aware responses
├─ Memory learning and retrieval
├─ Code generation and deployment
└─ Edge cases (bad input, crashes, recovery)
Manual UAT
├─ Conversational feel (does she feel like a person?)
├─ Personality consistency (still Hex?)
├─ Safety compliance (kid-mode works?)
├─ Performance (under load?)
└─ All features working together?
Conclusion
Hex's architecture prioritizes personality coherence and genuine relationship over feature breadth. The system is designed as a pipeline from perception → memory → personality → response generation, with feedback loops that allow her to learn and evolve.
The modular design enables incremental development (Phase 1-6), with each phase adding capability while maintaining system stability. The self-modification system enables genuine autonomy within safety boundaries, and the local-first approach ensures privacy and independence.
Critical success factors:
- LLM latency acceptable (<3s)
- Personality consistency maintained across updates
- Memory system scales with history
- Self-modification is safe and reversible
- All components feel integrated (not separate features)
This architecture serves the core value: making Hex feel like a person who genuinely cares about you.
Document Version: 1.0
Last Updated: 2026-01-27
Status: Ready for Phase 1 Development