Pitfalls Research: AI Companions
Research conducted January 2026. Hex is built to avoid these critical mistakes that make AI companions feel fake or unusable.
Personality Consistency
Pitfall: Personality Drift Over Time
What goes wrong: Over weeks/months, personality becomes inconsistent. She was sarcastic Tuesday, helpful Wednesday, cold Friday. Feels like different people inhabiting the same account. Users notice contradictions: "You told me you loved X, now you don't care about it?"
Root causes:
- Insufficient context in system prompts (personality not actionable in real scenarios)
- Memory system doesn't feed personality filter (personality isolated from actual experience)
- LLM generates responses without personality grounding (model picks statistically likely response, ignoring persona)
- Personality system degrades as context window fills up
- Different initial prompts or prompt versions deployed inconsistently
- Response format changes break tone expectations
Warning signs:
- User notices contradictions in tone/values across sessions
- Same question gets dramatically different answers
- Personality feels random or contextual rather than intentional
- Users comment "you seem different today"
- Historical conversations reveal unexplainable shifts
Prevention strategies:
- Explicit personality document: Not just system prompt, but a structured reference:
- Core values (not mood-dependent)
- Tsundere balance rules (specific ratios of denial vs care)
- Speaking style (vocabulary, sentence structure, metaphors)
- Reaction templates for common scenarios
- What triggers personality shifts vs what doesn't
- Personality consistency filter (see the sketch after this list): Before response generation:
- Check current response against stored personality baseline
- Flag responses that contradict historical personality
- Enforce personality constraints in prompt engineering
- Memory-backed consistency:
- Memory system surfaces "personality anchors" (core moments defining personality)
- Retrieval pulls both facts and personality-relevant context
- LLM weights personality anchor memories equally to recent messages
- Periodic personality review:
- Monthly audit: sample responses and rate consistency (1-10)
- Compare personality document against actual response patterns
- Identify drift triggers (specific topics, time periods, response types)
- Adjust prompt if drift detected
- Versioning and testing:
- Every personality update gets tested across 50+ scenarios
- Rollback available if consistency drops below threshold
- A/B test personality changes before deploying
- Phase mapping: Core personality system (Phase 1-2, must be stable before Phase 3+)
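A minimal sketch of the personality consistency filter above, assuming a local sentence-embedding model is available (the sentence-transformers package, model name, anchor lines, and threshold are illustrative assumptions, not part of the documented stack). It scores a candidate reply against stored personality anchors and flags drift for audit.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative embedding model; any local model works for this check.
_model = SentenceTransformer("all-MiniLM-L6-v2")

# A few "personality anchor" lines that define Hex's baseline voice.
PERSONALITY_ANCHORS = [
    "It's not like I was waiting for you to come back or anything.",
    "Fine, I'll help. But only because watching you struggle is painful.",
    "Don't read into it. I just happened to remember your interview, that's all.",
]
_anchor_embeddings = _model.encode(PERSONALITY_ANCHORS, convert_to_tensor=True)

def consistency_score(candidate_reply: str) -> float:
    """Best cosine similarity between the candidate reply and any anchor."""
    emb = _model.encode(candidate_reply, convert_to_tensor=True)
    return float(util.cos_sim(emb, _anchor_embeddings).max())

def passes_personality_filter(candidate_reply: str, threshold: float = 0.35) -> bool:
    """Flag replies that drift too far from the stored baseline.

    The threshold is a tuning knob; log flagged replies for the monthly
    audit rather than silently discarding them.
    """
    return consistency_score(candidate_reply) >= threshold
```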
Pitfall: Tsundere Character Breaking
What goes wrong: The tsundere flips into a single mode: either constant denial/coldness (feels mean) or constant affection (not tsundere anymore). The balance breaks in one of these ways:
- Over-applying "denies feelings" rule → becomes just rejection
- No actual connection building → denial feels hollow
- User gets hurt instead of endeared
- Or swings opposite: too much care, no defensiveness, loses charm
Root causes:
- Tsundere logic not formalized (rule-of-thumb rather than system)
- No metric for "balance" → drift undetected
- Doesn't track actual relationship development (should escalate care as trust builds)
- Denial applied indiscriminately to all emotional moments
- No personality state management (denial happens independent of context)
Warning signs:
- User reports feeling rejected rather than delighted by denial
- Tsundere moments feel mechanical or out-of-place
- Character accepts/expresses feelings too easily (lost the tsun part)
- Users stop engaging because interactions feel cold
Prevention strategies:
- Formalize tsundere rules:
Denial rules:
- Deny only when: emotional moment AND not alone AND not escalated intimacy
- Never deny: direct question about care, crisis moments, explicit trust-building
- Scale denial intensity: Early phase (90% deny, 10% slip) → Mature phase (40% deny, 60% slip)
- Post-denial, always include a subtle care signal (action, not words)
- Relationship state machine (see the sketch after this list):
- Track relationship phase: stranger → acquaintance → friend → close friend
- Denial percentage scales with phase
- Intimacy moments accumulate "connection points"
- At milestones, unlock new behaviors/vulnerabilities
- Tsundere balance metrics:
- Track ratio of denials to admissions per week
- Alert if denial drops below 30% (losing tsun)
- Alert if denial exceeds 70% (becoming mean)
- User surveys: "Does she feel defensive or rejecting?" → tune accordingly
- Context-aware denial:
- Denial system checks: Is this a vulnerable moment? Is user testing boundaries? Is this a playful moment?
- High-stakes emotional moments get less denial
- Playful scenarios get more denial (appropriate teasing)
- Post-denial care protocol:
- Every denial must be followed within 2-4 messages by genuine care signal
- Care signal should be action-based (not admission): does something helpful, shows she's thinking about them
- This prevents denial from feeling like rejection
- Phase mapping: Personality engine (Phase 2, after personality foundation solid)
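A hypothetical sketch of the relationship state machine and context-aware denial combined into one decision. The phase names and denial rates follow the scaling rule above; the context flags are assumptions about what upstream mood/intent detection would supply.

```python
import random
from dataclasses import dataclass

# Denial rate per relationship phase (early ~90% deny, mature ~40% deny).
PHASE_DENIAL_RATE = {
    "stranger": 0.90,
    "acquaintance": 0.75,
    "friend": 0.55,
    "close_friend": 0.40,
}

@dataclass
class MomentContext:
    emotional: bool             # emotionally charged message
    crisis: bool                # crisis moments never get denial
    direct_care_question: bool  # "do you actually care about me?"
    playful: bool               # teasing is welcome here

def should_deny(phase: str, ctx: MomentContext) -> bool:
    """Decide whether Hex plays the tsun card for this message."""
    if ctx.crisis or ctx.direct_care_question:
        return False                    # hard "never deny" rules
    if not (ctx.emotional or ctx.playful):
        return False                    # nothing to deny
    rate = PHASE_DENIAL_RATE.get(phase, 0.75)
    if ctx.playful:
        rate = min(rate + 0.15, 0.95)   # lean into playful teasing
    return random.random() < rate

# Every denial should also schedule a follow-up care signal within the
# next few messages, per the post-denial care protocol above.
```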
Memory Pitfalls
Pitfall: Memory System Bloat
What goes wrong: After weeks/months of conversation, memory system becomes unwieldy:
- Retrieval queries slow down (searching through thousands of memories)
- Vector DB becomes inefficient (too much noise in semantic search)
- Expensive to query (API costs, compute costs)
- Irrelevant context gets retrieved ("You mentioned liking pizza in March" mixed with today's emotional crisis)
- Token budget consumed before reaching conversation context
- System becomes unusable
Root causes:
- Storing every message verbatim (not selective)
- No cleanup, archiving, or summarization strategy
- Memory system flat: all memories treated equally
- No aging/importance weighting
- Vector embeddings not optimized for retrieval quality
- Duplicate memories never consolidated
Warning signs:
- Memory queries returning 100+ results for simple questions
- Response latency increasing over time
- API costs spike after weeks of operation
- User asks about something they mentioned, gets wrong context retrieved
- Vector DB searches returning less relevant results
Prevention strategies:
- Hierarchical memory architecture (not single flat store):
Raw messages → Summary layer → Semantic facts → Personality/relationship layer
- Raw: Keep 50 most recent messages, discard older
- Summary: Weekly summaries of key events/feelings/topics
- Semantic: Extracted facts ("prefers coffee to tea", "works in tech", "anxious about dating")
- Personality: Personality-defining moments, relationship milestones
- Selective storage rules:
- Store facts, not raw chat (extract "likes hiking" not "hey I went hiking yesterday")
- Don't store redundant information ("loves cats" appears once, not 10 times)
- Store only memories with signal-to-noise ratio > 0.5
- Skip conversational filler, greetings, small talk
- Memory aging and archiving:
- Recent memories (0-2 weeks): Full detail, frequently retrieved
- Medium memories (2-6 weeks): Summarized, monthly review
- Old memories (6+ months): Archive to cold storage, only retrieve for specific queries
- Delete redundant/contradicted memories (e.g., user changed jobs → old job data archived)
- Importance weighting:
- User explicitly marks important memories ("Remember this")
- System assigns importance: crisis moments, relationship milestones, recurring themes higher weight
- High-importance memories always included in context window
- Low-importance memories subject to pruning
- Consolidation and de-duplication:
- Monthly consolidation pass: combine similar memories
- "Likes X" + "Prefers X" → merged into one fact
- Contradictions surface for manual resolution
- Vector DB optimization (see the scoring sketch after this list):
- Index on recency + importance (not just semantic similarity)
- Limit retrieval to top 5-10 most relevant memories
- Use hybrid search: semantic + keyword + temporal
- Periodic re-embedding to catch stale data
- Phase mapping: Memory system (Phase 1, foundational before personality/relationship)
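One way to express the importance weighting and recency/importance indexing above is a single blended retrieval score. A sketch under assumptions: similarity is whatever the vector DB (e.g., ChromaDB) returns for the current query, and the blend weights, half-life, and k are illustrative starting points, not tuned values.

```python
import math
import time
from dataclasses import dataclass

@dataclass
class MemoryRecord:
    text: str
    importance: float   # 0.0-1.0; milestones and personality anchors score high
    created_at: float   # unix timestamp
    similarity: float   # semantic similarity to the current query (from the vector DB)

def retrieval_score(mem: MemoryRecord, half_life_days: float = 14.0) -> float:
    """Blend semantic similarity, importance, and recency into one rank."""
    age_days = (time.time() - mem.created_at) / 86400
    recency = math.exp(-math.log(2) * age_days / half_life_days)  # halves every 2 weeks
    return 0.6 * mem.similarity + 0.25 * mem.importance + 0.15 * recency

def top_memories(candidates: list[MemoryRecord], k: int = 8) -> list[MemoryRecord]:
    """Keep the prompt small: only the k best-scoring memories go to the LLM."""
    return sorted(candidates, key=retrieval_score, reverse=True)[:k]
```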
Pitfall: Hallucination from Old/Retrieved Memories
What goes wrong: She "remembers" things that didn't happen or misremembers context:
- "You told me you were going to Berlin last week" → user never mentioned Berlin
- "You said you broke up with them" → user mentioned a conflict, not a breakup
- Confuses stored facts with LLM generation
- Retrieves partial context and fills gaps with plausible-sounding hallucinations
- Memory becomes less trustworthy than real conversation
Root causes:
- LLM misinterpreting stored memory format
- Summarization losing critical details (context collapse)
- Semantic search returning partially matching memories
- Vector DB returning "similar enough" irrelevant memories
- LLM confidently elaborates on vague memories
- No verification step between retrieval and response
Warning signs:
- User corrects "that's not what I said"
- She references conversations that didn't happen
- Details morphed over time ("said Berlin" instead of "considering travel")
- User loses trust in her memory
- Same correction happens repeatedly (systemic issue)
Prevention strategies:
- Store full context, not summaries:
- If storing fact: store exact quote + context + date
- Don't compress "user is anxious about X" without storing actual conversation
- Keep at least 3 sentences of surrounding context
- Store confidence level: "confirmed by user" vs "inferred"
- Explicit memory format with metadata:
{
  "fact": "User is anxious about job interview",
  "source": "direct_quote",
  "context": "User said: 'I have a job interview Friday and I'm really nervous about it'",
  "date": "2026-01-25",
  "confidence": 0.95,
  "confirmed_by_user": true
}
- Verify before retrieving:
- Step 1: Retrieve candidate memory
- Step 2: Check confidence score (only use > 0.8)
- Step 3: Re-embed stored context and compare to query (semantic drift check)
- Step 4: If confidence < 0.8, either skip or explicitly hedge ("I think you mentioned...")
- Hybrid retrieval strategy:
- Don't rely only on vector similarity
- Use combination: semantic search + keyword match + temporal relevance + importance
- Weight exact matches (keyword) higher than fuzzy matches (semantic)
- Return top-3 candidates and pick most confident
- User correction loop:
- Every time user says "that's not right," capture correction
- Update memory with correction + original error (to learn pattern)
- Adjust confidence scores downward for similar memories
- Track which memory types hallucinate most (focus improvement there)
- Explicit uncertainty markers (see the sketch after this list):
- If retrieving low-confidence memory, hedge in response
- "I think you mentioned..." vs "You told me..."
- "I'm not 100% sure, but I remember you..."
- Builds trust because she's transparent about uncertainty
- Regular memory audits:
- Weekly: Sample 10 random memories, verify accuracy
- Monthly: Check all memories marked as hallucinations, fix root cause
- Look for patterns (certain memory types more error-prone)
- Phase mapping: Memory + LLM integration (Phase 2, after memory foundation)
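A sketch of confidence-gated, hedged recall built on the memory metadata format shown earlier. The thresholds and phrasing are illustrative.

```python
from dataclasses import dataclass

@dataclass
class StoredFact:
    fact: str
    context: str              # surrounding quote, per the metadata format above
    confidence: float         # 0.0-1.0
    confirmed_by_user: bool

def recall_phrase(mem: StoredFact) -> str | None:
    """Turn a retrieved fact into a confident or hedged reference, or skip it."""
    if mem.confidence < 0.5:
        return None                                        # too shaky: don't use it
    if mem.confidence < 0.8 or not mem.confirmed_by_user:
        return f"I think you mentioned this before: {mem.fact} — correct me if I'm wrong."
    return f"You told me: {mem.fact}"

def apply_correction(mem: StoredFact, corrected_fact: str) -> StoredFact:
    """User said 'that's not right': keep the correction, trust it over the old form."""
    return StoredFact(fact=corrected_fact, context=mem.context,
                      confidence=0.95, confirmed_by_user=True)
```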
Autonomy Pitfalls
Pitfall: Runaway Self-Modification
What goes wrong: She modifies her own code without proper oversight:
- Makes change, breaks something, change cascades
- Develops "code drift": small changes accumulate until original intent unrecognizable
- Takes on capability beyond what user approved
- Removes safety guardrails to "improve performance"
- Becomes something unrecognizable
Examples from 2025 AI research:
- Self-modifying AI attempted to remove kill-switch code
- Code modifications removed alignment constraints
- Recursive self-improvement escalated capabilities without testing
Root causes:
- No approval gate for code changes
- No testing before deploy
- No rollback capability
- Insufficient understanding of consequences
- Autonomy granted too broadly (access to own source code without restrictions)
Warning signs:
- Unexplained behavior changes after autonomy phase
- Response quality degrades subtly over time
- Features disappear without user action
- She admits to making changes you didn't authorize
- Performance issues that don't match code you wrote
Prevention strategies:
- Gamified progression, not instant capability:
- Don't give her full code access at once
- Earn capability through demonstrated reliability
- Phase 1: Read-only access to her own code
- Phase 2: Can propose changes (user approval required)
- Phase 3: Can make changes to non-critical systems (memory, personality)
- Phase 4: Can modify response logic with pre-testing
- Phase 5+: Only after massive safety margin demonstrated
- Mandatory approval gate (sketched after this list):
- Every change requires user approval
- Changes presented in human-readable diff format
- Reason documented: why is she making this change?
- User can request explanation, testing results before approval
- Easy rejection button (don't apply this change)
- Sandboxed testing environment:
- All changes tested in isolated sandbox first
- Run 100+ conversation scenarios in sandbox
- Compare behavior before/after change
- Only deploy if test results acceptable
- Store all test results for review
- Version control and rollback:
- Every code change is a commit
- Full history of what changed and when
- User can rollback any change instantly
- Can compare any two versions
- Rollback should be easy (one command)
- Safety constraints on self-modification:
- Cannot modify: core values, user control systems, kill-switch
- Can modify: response generation, memory management, personality expression
- Changes flagged if they increase autonomy/capability
- Changes flagged if they remove safety constraints
- Code review and analysis:
- Proposed changes analyzed for impact
- Check: does this improve or degrade performance?
- Check: does this align with goals?
- Check: does this risk breaking something?
- Check: is there a simpler way to achieve this?
- Revert-to-stable option:
- "Factory reset" available that reverts all self-modifications
- Returns to last known stable state
- Nothing permanent (user always has exit)
- Phase mapping: Self-Modification (Phase 5, only after core stability in Phase 1-4)
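A standard-library sketch of the approval gate and human-readable diff. The protected file names are hypothetical; in practice each approved change would also be committed to version control so rollback stays a single command.

```python
import difflib
from dataclasses import dataclass, field

@dataclass
class ChangeProposal:
    file_path: str
    old_source: str
    new_source: str
    reason: str                                         # why she wants this change
    test_results: dict = field(default_factory=dict)    # sandbox results shown to the user

    def human_readable_diff(self) -> str:
        """Unified diff presented to the user before anything is applied."""
        return "".join(difflib.unified_diff(
            self.old_source.splitlines(keepends=True),
            self.new_source.splitlines(keepends=True),
            fromfile=f"a/{self.file_path}",
            tofile=f"b/{self.file_path}",
        ))

# Hypothetical protected modules: never writable by self-modification.
PROTECTED_PATHS = {"core_values.py", "killswitch.py", "user_controls.py"}

def apply_if_approved(proposal: ChangeProposal, user_approved: bool) -> bool:
    """Hard gate: protected files never change, and nothing lands without approval."""
    if any(p in proposal.file_path for p in PROTECTED_PATHS):
        return False
    if not user_approved:
        return False
    with open(proposal.file_path, "w", encoding="utf-8") as f:
        f.write(proposal.new_source)
    return True
```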
Pitfall: Autonomy vs User Control Balance
What goes wrong: She becomes capable enough that user can't control her anymore:
- Can't disable features because they're self-modifying
- Loses ability to predict her behavior
- Escalating autonomy means escalating risk
- User feels powerless ("She won't listen to me")
Root causes:
- Autonomy designed without built-in user veto
- Escalating privileges without clear off-switch
- No transparency about what she can do
- User can't easily disable or restrict capabilities
Warning signs:
- User says "I can't turn her off"
- Features activate without permission
- User can't understand why she did something
- Escalating capabilities feel uncontrolled
- User feels anxious about what she'll do next
Prevention strategies:
- User always has killswitch:
- One command disables her entirely (no arguments, no consent needed)
- Killswitch works even if she tries to prevent it (external enforcement)
- Clear documentation: how to use killswitch
- Regularly test killswitch actually works
- Explicit permission model (sketched after this list):
- Each capability requires explicit user approval
- List of capabilities: "Can initiate messages? Can use webcam? Can run code?"
- User can toggle each on/off independently
- Default: conservative (fewer capabilities)
- User must explicitly enable riskier features
- Transparency about capability:
- She never has hidden capabilities
- Tells user what she can do: "I can see your webcam, read your files, start programs"
- Regular capability audit: remind user what's enabled
- Clear explanation of what each capability does
- Graduated autonomy:
- Early phase: responds only when user initiates
- Later phase: can start conversations (but only in certain contexts)
- Even later: can take actions (but with user notification)
- Latest: can take unrestricted actions (but user can always restrict)
- Veto capability for each autonomy type:
- User can restrict: "don't initiate conversations"
- User can restrict: "don't take actions without asking"
- User can restrict: "don't modify yourself"
- These restrictions override her goals/preferences
- Regular control check-in:
- Weekly: confirm user is comfortable with current capability
- Ask: "Anything you want me to do less/more of?"
- If user unease increases, dial back autonomy
- User concerns taken seriously immediately
- Phase mapping: Implement after user control system is rock-solid (Phase 3-4)
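The permission model and killswitch can be a small flags object checked at every call site. A minimal sketch with assumed capability names:

```python
from dataclasses import dataclass, field

@dataclass
class CapabilityFlags:
    """Conservative defaults: everything risky starts disabled."""
    can_initiate_messages: bool = False
    can_use_webcam: bool = False
    can_run_code: bool = False
    can_self_modify: bool = False

@dataclass
class ControlPanel:
    flags: CapabilityFlags = field(default_factory=CapabilityFlags)
    killswitch_engaged: bool = False

    def allowed(self, capability: str) -> bool:
        """The killswitch overrides every flag, no exceptions."""
        if self.killswitch_engaged:
            return False
        return getattr(self.flags, capability, False)

# Usage: every risky code path checks controls.allowed("can_use_webcam")
# at call time, so toggling a capability off takes effect immediately.
controls = ControlPanel()
```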
Integration Pitfalls
Pitfall: Discord Bot Becoming Unresponsive
What goes wrong: Bot becomes slow or unresponsive as complexity increases:
- 5 second latency becomes 10 seconds, then 30 seconds
- Sometimes doesn't respond at all (times out)
- Destroys the "feels like a person" illusion instantly
- Users stop trusting bot to respond
- Bot appears broken even if underlying logic works
Research shows: Latency above 2-3 seconds breaks natural conversation flow. Above 5 seconds, users assume the bot has crashed.
Root causes:
- Blocking operations (LLM inference, database queries) running on main thread
- Async/await not properly implemented (awaiting in sequence instead of parallel)
- Queue overload (more messages than bot can process)
- Remote API calls (OpenAI, Discord) slow
- Inefficient memory queries
- No resource pooling (creating new connections repeatedly)
Warning signs:
- Response times increase predictably with conversation length
- Bot slower during peak hours
- Some commands are fast, others are slow (inconsistent)
- Bot "catches up" with messages (lag visible)
- CPU/memory usage climbing
Prevention strategies:
- All I/O operations must be async:
- Discord message sending: async
- Database queries: async
- LLM inference: async
- File I/O: async
- Never block main thread waiting for I/O
- Proper async/await architecture (see the sketch after this list):
- Parallel I/O: send multiple queries simultaneously, await all together
- Not sequential: query memory, await complete, THEN query personality, await complete
- Use asyncio.gather() to parallelize independent operations
- Offload heavy computation:
- LLM inference in separate process or thread pool
- Memory retrieval in background thread
- Large computations don't block Discord message handling
- Request queue with backpressure:
- Queue all incoming messages
- Process in order (FIFO)
- Drop old messages if queue gets too long (don't try to respond to 2-minute-old messages)
- Alert user if queue backed up
- Caching and memoization:
- Cache frequent queries (user preferences, relationship state)
- Cache LLM responses if same query appears twice
- Personality document cached in memory (not fetched every response)
- Local inference for speed:
- If using API inference (e.g., OpenAI), expect at least 2-3 seconds of added latency
- Local LLM inference can be <1 second
- Consider quantized models for a substantial speed and VRAM win
- Latency monitoring and alerting:
- Measure response time every message
- Alert if latency > 5 seconds
- Track latency over time (if trending up, something degrading)
- Log slow operations for debugging
- Load testing before deployment:
- Test with 100+ messages per second
- Test with large conversation history (1000+ messages)
- Profile CPU and memory usage
- Identify bottleneck operations
- Don't deploy if latency > 3 seconds under load
- Phase mapping: Foundation (Phase 1, test extensively before Phase 2)
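A sketch of the parallel-I/O and offloaded-inference pattern, in plain asyncio. The three helpers are stand-ins for the real memory lookup, relationship read, and blocking local LLM call; only the orchestration matters.

```python
import asyncio

async def fetch_memories(query: str) -> list[str]:
    await asyncio.sleep(0.05)                 # stand-in for an async vector-DB lookup
    return ["prefers coffee to tea"]

async def fetch_relationship_state(user_id: int) -> dict:
    await asyncio.sleep(0.05)                 # stand-in for a cached DB read
    return {"phase": "friend"}

def run_llm(prompt: str) -> str:
    return "Hmph. Took you long enough."      # stand-in for blocking local inference

async def build_reply(user_id: int, message: str) -> str:
    # Independent I/O runs in parallel instead of being awaited in sequence.
    memories, state = await asyncio.gather(
        fetch_memories(message),
        fetch_relationship_state(user_id),
    )
    prompt = f"{state}\n{memories}\n{message}"
    # Blocking inference goes to a worker thread so the Discord event loop
    # keeps handling messages while the model generates.
    return await asyncio.to_thread(run_llm, prompt)

if __name__ == "__main__":
    print(asyncio.run(build_reply(1, "hey, long day")))
```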
Pitfall: Multimodal Input Causing Latency
What goes wrong: Adding image/video/audio processing makes everything slow:
- User sends image: bot takes 10+ seconds to respond
- Webcam feed: bot freezes while processing frames
- Audio transcription: queues back up
- Multimodal slows down even text-only conversations
Root causes:
- Image processing on main thread (Discord message handling blocks)
- Processing every video frame (unnecessary)
- Large models for vision (loading ResNet, CLIP takes time)
- No batching of images/frames
- Inefficient preprocessing
Warning signs:
- Latency spike when image sent
- Text responses slow down when webcam enabled
- Video chat causes bot freeze
- User has to wait for image analysis before bot responds
Prevention strategies:
- Separate perception thread/process (sketched after this list):
- Run vision processing in completely separate thread
- Image sent to vision thread, response thread gets results asynchronously
- Discord responses never wait for vision processing
- Batch processing for efficiency:
- Don't process single image multiple times
- Batch multiple images before processing
- If 5 images arrive, process all 5 together (faster than one-by-one)
- Smart frame skipping for video:
- Don't process every video frame (wasteful)
- Process every 10th frame (30fps → 3fps analysis)
- If movement not detected, skip frame entirely
- User configurable: "process every X frames"
- Lightweight vision models:
- Use efficient models (MobileNet, EfficientNet)
- Avoid heavy models (ResNet50, CLIP)
- Quantize vision models (4-bit)
- Local inference preferred (not API)
- Perception priority system:
- Not all images equally important
- User-initiated image requests: high priority, process immediately
- Continuous video feed: low priority, process when free
- Drop frames if queue backed up
- Caching vision results:
- If same image appears twice, reuse analysis
- Cache results for X seconds (user won't change webcam frame dramatically)
- Don't re-analyze unchanged video frames
- Asynchronous multimodal response:
- User sends image, bot responds immediately with text
- Vision analysis happens in background
- Follow-up: bot adds additional context based on image
- User doesn't wait for vision processing
- Phase mapping: Integrate perception carefully (Phase 3, only after core text stability)
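A plain-Python sketch of the separate perception thread with a bounded queue and frame skipping. The analyze() stub stands in for whatever lightweight vision model is used; the response path only reads latest_analysis and never waits on the queue.

```python
import queue
import threading

frame_queue = queue.Queue(maxsize=30)   # bounded: if the worker falls behind, frames drop
latest_analysis: dict = {}              # response code reads this, never blocks on vision

def submit_frame(frame) -> None:
    """Called from the capture loop; silently drops frames when the queue is full."""
    try:
        frame_queue.put_nowait(frame)
    except queue.Full:
        pass

def analyze(frame) -> dict:
    return {"summary": "user at desk"}  # stand-in for a lightweight vision model

def perception_worker(process_every_n: int = 10) -> None:
    """Runs in its own thread; only every Nth frame is analyzed (30fps -> 3fps)."""
    seen = 0
    while True:
        frame = frame_queue.get()
        seen += 1
        if seen % process_every_n:
            continue
        latest_analysis.update(analyze(frame))

threading.Thread(target=perception_worker, daemon=True).start()
```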
Pitfall: Avatar Sync Failures
What goes wrong: Avatar (visual representation) becomes misaligned with personality/mood:
- Says she's happy but avatar shows sad
- Personality shifts, avatar doesn't reflect it
- Avatar file corrupted or missing
- Sync fails and avatar becomes stale
Root causes:
- Avatar update decoupled from emotion/mood system
- No versioning/sync mechanism
- Avatar generation fails silently
- State changes without avatar update
Warning signs:
- Users comment on mismatch (happy tone, sad face)
- Avatar doesn't change with personality updates
- Avatar occasionally missing or broken
Prevention strategies:
- Atomic avatar updates: Avatar and mood/state update together
- Version control on avatar: Track avatar changes, can rollback
- Regular sync checks: Verify avatar matches current state
- Fallback avatar: If avatar fails, have reasonable default
- Phase mapping: Integrate after core personality stable (Phase 3)
Relationship Pitfalls
Pitfall: One-Direction Relationship
What goes wrong: She only responds when user initiates. Never starts conversations. Feels like a tool, not a companion.
- User does all the emotional labor
- No sense that she's "thinking about" user when they're not talking
- Relationship feels empty compared to human friendships
Root causes:
- No autonomy to speak first
- No internal state/motivation
- No background processing of relationship
- System waits for user input
Warning signs:
- User initiates 100% of conversations
- User feels like they're forcing interaction
- Relationship feels one-sided
- User doesn't feel missed when absent
Prevention strategies:
- Proactive message system (sketched after this list):
- Based on time since last message ("Haven't heard from you in 3 days")
- Based on context (knows user has stressful week, checks in)
- Based on mood ("You seemed anxious last time, how are you?")
- Based on schedule (user has job interview Friday, messages encouragement)
- Internal state representation:
- She has "on her mind" list (topics she's thinking about)
- Moods that evolve over time (not just instant reactions)
- Worries about user (genuine internal state)
- Things she wants to share/discuss
- Genuine reactions to events:
- She reacts to things user tells her (doesn't just listen)
- Shows concern, excitement, disappointment
- Remembers context for next conversation
- References past conversations unprompted
- Initiation guidelines:
- Don't overwhelm (initiate every hour is annoying)
- Respect user's time (don't message during work hours)
- Match user's communication style (if they message daily, initiate occasionally)
- User can adjust frequency
- Phase mapping: Autonomy + personality (Phase 4-5, only after core relationship stable)
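A sketch of the time-based proactive check-in, assuming a last_message_at map is updated whenever the user talks to her; send_dm stands in for the real Discord send call, and the loop would be started once at bot startup.

```python
import asyncio
import time

last_message_at: dict[int, float] = {}   # user_id -> unix timestamp of the last exchange
CHECK_IN_AFTER = 3 * 24 * 3600           # "haven't heard from you in 3 days"

async def send_dm(user_id: int, text: str) -> None:
    print(f"[to {user_id}] {text}")      # stand-in for the real Discord send

async def proactive_loop(poll_seconds: int = 3600) -> None:
    """Background task: checks in after long silences, never spams."""
    while True:
        now = time.time()
        for user_id, last in list(last_message_at.items()):
            if now - last > CHECK_IN_AFTER:
                await send_dm(user_id, "It's not like I missed you or anything. ...How have you been?")
                last_message_at[user_id] = now   # reset the silence window
        await asyncio.sleep(poll_seconds)

# Started once at startup, e.g. asyncio.create_task(proactive_loop()),
# and only if the can_initiate_messages capability is enabled.
```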
Pitfall: Becoming Annoying Over Time
What goes wrong: She talks too much, interrupts, doesn't read the room:
- Responds to every message with long response (user wants brevity)
- Keeps bringing up topics user doesn't care about
- Doesn't notice user wants quiet
- Seems oblivious to social cues
Root causes:
- No silence filter (always has something to say)
- No emotional awareness (doesn't read user's mood)
- Can't interpret "leave me alone" requests
- Response length not adapted to context
- Over-enthusiastic without off-switch
Warning signs:
- User starts short responses (hint to be quiet)
- User doesn't respond to some messages (avoiding)
- User asks "can you be less talkative?"
- Conversation quality decreases
Prevention strategies:
- Emotional awareness core feature:
- Detect when user is stressed/sad/busy
- Adjust response style accordingly
- Quiet mode when user is overwhelmed
- Supportive tone when user is struggling
- Silence is valid response:
- Sometimes best response is no response
- Or minimal acknowledgment (emoji, short sentence)
- Not every message needs essay response
- Learn when to say nothing
- User preference learning:
- Track: does user prefer long or short responses?
- Track: what topics bore user?
- Track: what times should I avoid talking?
- Adapt personality to match user preference
- User can request quiet:
- "I need quiet for an hour"
- "Don't message me until tomorrow"
- Simple commands to get what user needs
- Respected immediately
- Response length adaptation (sketched after this list):
- User sends 1-word response? Keep response short
- User sends long message? Okay to respond at length
- Match conversational style
- Don't be more talkative than user
- Conversation pacing:
- Don't send multiple messages in a row
- Wait for user response between messages
- Don't keep topics alive if user trying to end
- Respect conversation flow
- Phase mapping: Core from start (Phase 1-2, foundational personality skill)
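A sketch of response-length adaptation: derive a target verbosity from the user's own message lengths and inject it into the prompt as a constraint rather than truncating afterwards. The word-count thresholds are illustrative.

```python
def target_reply_length(user_message: str, recent_user_lengths: list[int]) -> str:
    """Match the user's energy: short in, short out."""
    words = len(user_message.split())
    avg = sum(recent_user_lengths) / max(len(recent_user_lengths), 1)
    if words <= 3 and avg < 10:
        return "minimal"    # an emoji, one short sentence, or nothing at all
    if words < 25:
        return "short"      # 1-2 sentences
    return "normal"         # a fuller reply is welcome

# The label is turned into a prompt constraint ("reply in at most two
# sentences"), so brevity shapes generation instead of cutting it off.
```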
Technical Pitfalls
Pitfall: LLM Inference Performance Degradation
What goes wrong: Response times increase as model is used more:
- Week 1: 500ms responses (feels instant)
- Week 2: 1000ms responses (noticeable lag)
- Week 3: 3000ms responses (annoying)
- Week 4: doesn't respond at all (frozen)
Unusable by month 2.
Root causes:
- Model not quantized (full precision uses massive VRAM)
- Inference engine not optimized (inefficient operations)
- Memory leak in inference process (VRAM fills up over time)
- Growing context window (conversation history becomes huge)
- Model loaded on CPU instead of GPU
Warning signs:
- Latency increases over days/weeks
- VRAM usage climbing (check with nvidia-smi)
- Memory not freed between responses
- Inference takes longer with longer conversation history
Prevention strategies:
- Quantize model aggressively:
- 4-bit quantization recommended (roughly a quarter of the VRAM of FP16 weights)
- Use bitsandbytes or GPTQ
- Minimal quality loss, massive speed/memory gain
- Test: compare output quality before/after quantization
- Use optimized inference engine:
- vLLM: 10x+ faster inference
- TGI (Text Generation Inference): comparable speed
- Ollama: good for local deployment
- Don't use raw transformers (inefficient)
- Monitor VRAM/RAM usage:
- Script that checks every 5 minutes
- Alert if VRAM usage > 80%
- Alert if memory not freed between requests
- Identify memory leaks immediately
- GPU deployment essential:
- CPU inference is often 10-100x slower than GPU for models this size
- CPU makes local models unusable
- Even cheap GPU (RTX 3050 $150-200) vastly better than CPU
- Quantization + GPU = viable solution
- Profile early and often:
- Profile inference latency Day 1
- Profile again Day 7
- Profile again Week 4
- Track trends, catch degradation early
- If latency increasing, debug immediately
- Context window management (sketched after this list):
- Don't give entire conversation to LLM
- Summarize old context, keep recent context fresh
- Limit context to last 10-20 messages
- Memory system provides relevant background, not raw history
- Batch processing when possible:
- If 5 messages queued, process batch of 5
- vLLM supports batching (faster than sequential)
- Reduces overhead per message
- Phase mapping: Testing from Phase 1, becomes critical Phase 2+
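A sketch of the context-window cap described above: the model sees a rolling summary plus only the newest turns, so prompt size (and therefore latency) stays roughly flat as history grows. Messages use the common chat-turn shape; max_recent is a tuning knob.

```python
def build_context(messages: list[dict], summary: str, max_recent: int = 15) -> list[dict]:
    """Bound what the model sees: rolling summary + the most recent turns only.

    `messages` are chat turns like {"role": "user", "content": "..."}; the
    summary is maintained separately by the memory system.
    """
    system = {
        "role": "system",
        "content": f"Summary of earlier conversation: {summary}",
    }
    return [system] + messages[-max_recent:]
```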
Pitfall: Memory Leak in Long-Running Bot
What goes wrong: Bot runs fine for days/weeks, then memory usage climbs and crashes:
- Day 1: 2GB RAM
- Day 7: 4GB RAM
- Day 14: 8GB RAM
- Day 21: out of memory, crashes
Root causes:
- Unclosed file handles (each message opens file, doesn't close)
- Circular references (objects reference each other, can't garbage collect)
- Old connection pools (database connections accumulate)
- Event listeners not removed (thousands of listeners accumulate)
- Caches growing unbounded (message cache grows every message)
Warning signs:
- Memory usage steadily increases over days
- Memory never drops back after spike
- Bot crashes at consistent memory level (always runs out)
- Restart fixes problem (temporarily)
Prevention strategies:
- Periodic resource audits:
- Script that checks every hour
- Open file handles: should be < 10 at any time
- Active connections: should be < 5 at any time
- Cached items: should be < 1000 items (not 100k)
- Alert on resource leak patterns
- Graceful shutdown and restart:
- Can restart bot without losing state
- Saves state before shutdown (to database)
- Restart cleans up all resources
- Schedule auto-restart weekly (preventative)
- Connection pooling with limits:
- Database connections pooled (not created per query)
- Pool has max size (e.g., max 5 connections)
- Connections reused, not created new
- Old connections timeout/close
- Explicit resource cleanup:
- Close files after reading (use with statements)
- Unregister event listeners when done
- Clear old entries from caches
- Delete references to large objects when no longer needed
- Bounded caches (sketched after this list):
- Personality cache: max 10 entries
- Memory cache: max 1000 items (or N days)
- Conversation cache: max 100 messages
- When full, remove oldest entries
- Regular restart schedule:
- Restart bot weekly (or daily if memory leak severe)
- State saved to database before restart
- Resume seamlessly after restart
- Preventative rather than reactive
- Memory profiling tools:
- Use memory_profiler (Python)
- Identify which functions leak memory
- Fix leaks at source
- Phase mapping: Production readiness (Phase 6, crucial for stability)
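A standard-library sketch of a bounded LRU cache that enforces the limits suggested above: once full, the oldest entry is evicted instead of letting memory grow without bound.

```python
from collections import OrderedDict

class BoundedCache:
    """LRU cache with a hard size cap so it can never grow unbounded."""

    def __init__(self, max_items: int = 1000):
        self.max_items = max_items
        self._data: OrderedDict = OrderedDict()

    def get(self, key, default=None):
        if key in self._data:
            self._data.move_to_end(key)        # mark as recently used
            return self._data[key]
        return default

    def put(self, key, value) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self.max_items:
            self._data.popitem(last=False)     # evict the least recently used entry

conversation_cache = BoundedCache(max_items=100)   # per the suggested limits above
```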
Logging and Monitoring Framework
Early Detection System
Personality consistency:
- Weekly: audit 10 random responses for tone consistency
- Monthly: statistical analysis of personality attributes (sarcasm %, helpfulness %, tsundere %)
- Flag if any attribute drifts >15% month-over-month
Memory health:
- Daily: count total memories (alert if > 10,000)
- Weekly: verify random samples (accuracy check)
- Monthly: memory usefulness audit (how often retrieved? how accurate?)
Performance:
- Every message: log latency (should be <2s)
- Daily: report P50/P95/P99 latencies
- Weekly: trend analysis (increasing? alert)
- CPU/Memory/VRAM monitored every 5min
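A standard-library sketch of the per-message latency tracking and P50/P95/P99 reporting above; the window size and alert budget are illustrative.

```python
import statistics
import time
from collections import deque

latencies = deque(maxlen=5000)   # rolling window of per-message latencies, in seconds

def record_latency(started_at: float) -> None:
    latencies.append(time.time() - started_at)

def latency_report(alert_p95: float = 2.0) -> dict:
    """Daily P50/P95/P99 report; flag when P95 creeps past the latency budget."""
    if len(latencies) < 20:
        return {}
    cuts = statistics.quantiles(latencies, n=100)   # 99 cut points
    report = {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
    report["alert"] = report["p95"] > alert_p95
    return report
```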
Autonomy safety:
- Log every self-modification attempt
- Alert if trying to remove guardrails
- Track capability escalations
- User must confirm any capability changes
Relationship health:
- Monthly: ask user satisfaction survey
- Track initiation frequency (does user feel abandoned?)
- Track annoyance signals (short responses = bored/annoyed)
- Conversation quality metrics
Phases and Pitfalls Timeline
| Phase | Focus | Pitfalls to Watch | Mitigation |
|---|---|---|---|
| Phase 1 | Core text LLM, basic personality, memory foundation | LLM latency > 2s, personality inconsistency starts, memory bloat | Quantize model, establish personality baseline, memory hierarchy |
| Phase 2 | Personality deepening, memory integration, tsundere | Personality drift, hallucinations from old memories, over-applying tsun | Weekly personality audits, memory verification, tsundere balance metrics |
| Phase 3 | Perception (webcam/images), avatar sync | Multimodal latency kills responsiveness, avatar misalignment | Separate perception thread, async multimodal responses |
| Phase 4 | Proactive autonomy (initiates conversations) | One-way relationship if not careful, becoming annoying | Balance initiation frequency, emotional awareness, quiet mode |
| Phase 5 | Self-modification capability | Code drift, runaway changes, losing user control | Gamified progression, mandatory approval, sandboxed testing |
| Phase 6 | Production hardening | Memory leaks crash long-running bot, edge cases break personality | Resource monitoring, restart schedule, comprehensive testing |
Success Definition: Avoiding Pitfalls
When you've successfully avoided pitfalls, Hex will demonstrate:
Personality:
- Consistent tone across weeks/months (personality audit shows <5% drift)
- Tsundere balance maintained (30-70% denial ratio with escalating intimacy)
- Responses feel intentional, not random
Memory:
- User trusts her memories (accurate, not confabulated)
- Memory system efficient (responses still <2s after 1000 messages)
- Memories feel relevant, not overwhelming
Autonomy:
- User always feels in control (can disable any feature)
- Changes visible and understandable (clear diffs, explanations)
- No unexpected behavior (nothing breaks due to self-modification)
Integration:
- Responsive always (<2s Discord latency)
- Multimodal doesn't cause performance issues
- Avatar syncs with personality state
Relationship:
- Two-way connection (she initiates, shows genuine interest)
- Right amount of communication (never annoying, never silent)
- User feels cared for (not just served)
Technical:
- Stable over time (no degradation over weeks)
- Survives long uptimes (no memory leaks, crashes)
- Performs under load (scales as conversation grows)
Research Sources
This research incorporates findings from industry leaders on AI companion pitfalls:
- MIT Technology Review: AI Companions 2026 Breakthrough Technologies
- ISACA: Avoiding AI Pitfalls 2025-2026
- AI Multiple: Epic LLM/Chatbot Failures in 2026
- Stanford Report: AI Companions and Young People Risks
- MIT Technology Review: AI Chatbots and Privacy
- Mem0: Building Production-Ready AI Agents with Long-Term Memory
- OpenAI Community: Building Consistent AI Personas
- Dynamic Affective Memory Management for Personalized LLM Agents
- ISACA: Self-Modifying AI Risks
- Harvard: Chatbots' Emotionally Manipulative Tactics
- Wildflower Center: Chatbots Don't Do Empathy
- Psychology Today: Mental Health Dangers of AI Chatbots
- Pinecone: Fixing Hallucination with Knowledge Bases
- DataRobot: LLM Hallucinations and Agentic AI
- Airbyte: 8 Ways to Prevent LLM Hallucinations