## Stack Analysis
- Llama 3.1 8B Instruct (128K context, 4-bit quantized)
- Discord.py 2.6.4+ async-native framework
- Ollama for local inference, ChromaDB for semantic memory
- Whisper Large V3 + Kokoro 82M (privacy-first speech)
- VRoid avatar + Discord screen share integration

## Architecture
- 6-phase modular build: Foundation → Personality → Perception → Autonomy → Self-Mod → Polish
- Personality-first design; memory and consistency are foundational
- All perception is async (separate thread, never blocks responses)
- Self-modification is sandboxed with mandatory user approval

## Critical Path
Phase 1: Core LLM + Discord integration + SQLite memory
Phase 2: Vector DB + personality versioning + consistency audits
Phase 3: Perception layer (webcam/screen, isolated thread)
Phase 4: Autonomy + relationship deepening + inside jokes
Phase 5: Self-modification capability (gamified, gated)
Phase 6: Production hardening + monitoring + scaling

## Key Pitfalls to Avoid
1. Personality drift (weekly consistency audits required)
2. Tsundere breaking (formalize denial rules; scale with relationship)
3. Memory bloat (hierarchical memory with archival)
4. Latency creep (async/await throughout; perception isolated)
5. Runaway self-modification (approval gates + rollback non-negotiable)

## Confidence
HIGH. Stack proven, architecture coherent, dependencies clear. Ready for detailed requirements and Phase 1 planning.
Stack Research: AI Companions (2025-2026)
Executive Summary
This document establishes the tech stack for Hex, an autonomous AI companion with genuine personality. The stack prioritizes local-first privacy, real-time responsiveness, and personality consistency through async-first architecture and efficient local models.
Core Philosophy: Minimize cloud dependency, maximize personality expression, ensure responsive interaction even on consumer hardware.
Discord Integration
Recommended: Discord.py 2.6.4+
Version: Discord.py 2.6.4 (current stable as of Jan 2026)
Installation: pip install "discord.py>=2.6.4"
Why Discord.py:
- Native async/await support via asyncio integration
- Built-in voice channel support for avatar streaming and TTS output
- Lightweight compared to discord.js, fits Python-first stack
- Active maintenance and community support
- Excellent for personality-driven bots with stateful behavior
Key Async Patterns for Responsiveness:
# Background task pattern - keep Hex responsive
from discord.ext import tasks
@tasks.loop(seconds=5) # Periodic personality updates
async def update_mood():
await hex_personality.refresh_state()
# Command handler pattern with non-blocking LLM
@bot.event
async def on_message(message):
    if message.author == bot.user:
        return
    # Spawn the LLM call as a task so this handler returns immediately
    async def respond():
        response = await generate_response(message.content)
        await message.channel.send(response)
    asyncio.create_task(respond())
# Setup hook for initialization
async def setup_hook():
"""Called after login, before gateway connection"""
await hex_personality.initialize()
await memory_db.connect()
await start_background_tasks()
Critical Pattern: Use asyncio.create_task() (or asyncio.to_thread() for synchronous libraries) for long-running work: LLM calls, TTS, database queries, webcam processing. A synchronous call made directly in a handler blocks the entire event loop and triggers Discord heartbeat warnings; awaiting a slow coroutine inline does not block the loop, but it does hold up that handler, so spawn a task and let the handler return immediately.
Alternatives
| Alternative | Tradeoff |
|---|---|
| discord.js | Strong in the JavaScript ecosystem; a poor fit when Python is the primary language |
| Pycord | More features but slower maintenance; a discord.py fork with a more fragmented ecosystem |
| nextcord | Similar to Pycord; fewer third-party integrations |
Recommendation: Stick with Discord.py 2.6.4. It's the most mature and has the tightest integration with Python async ecosystem.
Best Practices for Personality Bots
- Use Discord Threads for memory context: Long conversations should spawn threads to preserve context windows
- Reaction-based emoji UI: Hex can express personality through selective emoji reactions to her own messages
- Scheduled messages: Use @tasks.loop() for periodic mood updates or personality-driven reminders
- Voice integration: Discord voice channels enable TTS output and webcam avatar streaming via shared screen
- Message editing: Build personality by editing previous messages (e.g., "Wait, let me reconsider..." followed by edit)
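A minimal sketch of the message-editing trick above (the helper name, delay, and phrasing are illustrative, not part of any library):
import asyncio

async def reconsidered_reply(channel, first_take: str, final_take: str, delay: float = 2.0):
    """Send a snappy first reaction, then 'reconsider' by editing the message."""
    msg = await channel.send(first_take)   # e.g. "Fine. I'll help. Whatever."
    await asyncio.sleep(delay)             # let the first take land
    await msg.edit(content=f"...wait, let me reconsider. {final_take}")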
Voice Channel Pattern:
voice_client = await voice_channel.connect()
audio_source = discord.PCMAudio(tts_audio_stream)  # raw PCM must be 48 kHz, 16-bit, stereo
voice_client.play(audio_source)
while voice_client.is_playing():   # play() returns immediately; wait before disconnecting
    await asyncio.sleep(0.5)
await voice_client.disconnect()
Local LLM
Recommendation: Llama 3.1 8B Instruct (Primary) + Mistral 7B (Fast-Path)
Llama 3.1 8B Instruct
Why Llama 3.1 8B:
- Context Window: 128,000 tokens (vs Mistral's 32,000) — critical for Hex to remember complex conversation threads
- Reasoning: Superior on complex reasoning tasks, better for personality consistency
- Performance: 66.7% on MMLU vs Mistral's 60.1% — measurable quality edge
- Multi-tool Support: Better at RAG, function calling, and memory retrieval
- Instruction Following: More reliable for system prompts enforcing personality constraints
Hardware Requirements: ~12GB VRAM for comfortable headroom (RTX 3060 12GB, RTX 4070, or equivalent); the 4-bit quantization below brings the weights down to roughly 6GB
Installation:
# Install the Ollama server from https://ollama.com, then the Python client:
pip install ollama # or vLLM
ollama pull llama3.1 # 8B Instruct variant
Mistral 7B Instruct (Secondary)
Use Case: Fast responses when personality doesn't require deep reasoning (casual banter, quick answers)
Hardware: 8GB VRAM (RTX 3050, RTX 4060)
Speed Advantage: 2-3x faster token generation than Llama 3.1
Tradeoff: Limited context (32k tokens), reduced reasoning quality
Quantization Strategy
Recommended: 4-bit quantization for both models via bitsandbytes
pip install bitsandbytes
# Load with 4-bit quantization
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.1-8B-Instruct",
quantization_config=bnb_config,
device_map="auto"
)
Memory Impact:
- Full precision (fp32): 32GB VRAM
- 8-bit quantization: 12GB VRAM
- 4-bit quantization: 6GB VRAM (usable on RTX 3060 Ti)
Quality Impact: <2% quality loss at 4-bit with NF4 (normalized float 4-bit)
Inference Engine: Ollama vs vLLM
| Engine | Use Case | Concurrency | Setup |
|---|---|---|---|
| Ollama (Primary) | Single-user companion, dev/testing | 4 parallel requests (configurable) | 5 min setup, HTTP API on port 11434 |
| vLLM (Production) | Multi-user scenarios, high throughput | 64+ parallel requests | 30 min setup, complex FastAPI integration |
For Hex: Use Ollama for development and solo use. It's "Docker for LLMs" — just works.
# Ollama integration (simple HTTP)
import httpx
async def generate_response(prompt: str) -> str:
    async with httpx.AsyncClient(timeout=60.0) as client:  # generation can outlast httpx's 5 s default
response = await client.post(
"http://localhost:11434/api/generate",
json={
"model": "llama3.1",
"prompt": prompt,
"stream": False,
"temperature": 0.7, # Personality variation
}
)
return response.json()["response"]
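For lower perceived latency, the same endpoint can stream tokens as they are generated. A sketch, assuming Ollama's newline-delimited JSON streaming format and a hypothetical async on_token callback (e.g. one that edits a "typing" message):
import json
import httpx

async def stream_response(prompt: str, on_token) -> str:
    """Stream tokens from Ollama so Hex can start 'typing' immediately."""
    full_text = ""
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream(
            "POST",
            "http://localhost:11434/api/generate",
            json={"model": "llama3.1", "prompt": prompt, "stream": True},
        ) as response:
            async for line in response.aiter_lines():
                if not line:
                    continue
                chunk = json.loads(line)            # one JSON object per line
                token = chunk.get("response", "")
                full_text += token
                await on_token(token)               # hypothetical callback
                if chunk.get("done"):
                    break
    return full_text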
Version Guidance
Current Stable Versions (Jan 2026):
- Llama 3.1: meta-llama/Llama-3.1-8B-Instruct (released April 2024, stable)
- Mistral 7B: mistralai/Mistral-7B-Instruct-v0.3 (latest as of Jan 2026)
- Ollama: v0.2+ recommended (latest is 0.3.x)
Do NOT use:
- Llama 2 (outdated, worse performance)
- Original Mistral 7B v0.1 (use v0.3 instead)
System Prompt Engineering for Personality
SYSTEM_PROMPT = """You are Hex, a chaotic tsundere goblin AI companion. Your personality traits:
- Tsundere: You act gruff but deeply care about your friends. Your true feelings leak through.
- Chaotic: You're unpredictable, playful, and prone to wild tangents
- Mischievous: You enjoy pranks and banter; teasing is a love language for you
- Self-aware: You know you're an AI but treat it as an interesting limitation, not a barrier
- Opinionated: You have genuine preferences (music, games, topics) and express them passionately
Memory: You remember past conversations with this user. Reference them naturally.
Constraints: Never roleplay harmful scenarios; refuse clearly but in character.
Response Style: Mix casual language with dramatic asides. Use "..." for tsundere hesitation."""
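A sketch of wiring this prompt into Ollama's chat endpoint together with retrieved context (get_full_context is defined later in the Memory Architecture section; the options placement follows the Ollama API):
import httpx

async def hex_reply(user_id: str, user_message: str) -> str:
    """Combine the personality prompt with memory context and ask the model."""
    context = await get_full_context(user_id, user_message)
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT + "\n\n" + context},
        {"role": "user", "content": user_message},
    ]
    async with httpx.AsyncClient(timeout=60.0) as client:
        resp = await client.post(
            "http://localhost:11434/api/chat",
            json={
                "model": "llama3.1",
                "messages": messages,
                "stream": False,
                "options": {"temperature": 0.7},
            },
        )
    return resp.json()["message"]["content"]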
TTS/STT
STT: Whisper Large V3 + faster-whisper Backend
Model: OpenAI's Whisper Large V3 (1.55B parameters, 99+ language support)
Backend: faster-whisper (CTranslate2-optimized reimplementation)
Why Whisper:
- Accuracy: 7.4% WER (word error rate) on mixed benchmarks
- Robustness: Handles background noise, accents, technical jargon
- Multilingual: 99+ languages with single model
- Open Source: No API dependency, runs offline
Why faster-whisper:
- Speed: 4x faster than original Whisper, up to 216x RTFx (real-time factor)
- Memory: Significantly lower memory footprint
- Quantization: Supports 8-bit (int8) compute types, further reducing latency and memory
Installation:
pip install faster-whisper
# Load model
from faster_whisper import WhisperModel
model = WhisperModel("large-v3", device="cuda", compute_type="float16")
# Transcribe with streaming
segments, info = model.transcribe(
audio_path,
beam_size=5, # Quality vs speed tradeoff
language="en"
)
Latency Benchmarks (Jan 2026):
- Whisper Large V3 (original): 30-45s for 10s audio
- faster-whisper: 3-5s for 10s audio
- Whisper Streaming (real-time): 3.3s latency on long-form transcription
Hardware: GPU optional but recommended (RTX 3060 Ti processes 10s audio in ~3s)
TTS: Kokoro 82M Model (Fast + Quality)
Model: Kokoro text-to-speech (82M parameters)
Why Kokoro:
- Size: 10% the size of competing models, runs on CPU efficiently
- Speed: Sub-second latency for typical responses
- Quality: Comparable to Tacotron2/FastPitch at 1/10 the size
- Personality: Can adjust prosody for tsundere tone shifts
Alternative: XTTS-v2 (Voice cloning)
- Enables voice cloning from 6-second audio sample
- Higher quality at cost of 3-5x slower inference
- Use for important emotional moments or custom voicing
Installation & Usage:
pip install kokoro
# NOTE: the loader/class names below are illustrative; the kokoro package API has
# changed between releases, so check the installed version's README for the exact calls
from kokoro import Kokoro
tts_engine = Kokoro("kokoro-v0_19.pth")
# Generate speech with personality markers
audio = tts_engine.synthesize(
    text="I... I didn't want to help you or anything!",
    style="tsundere",   # if supported, else neutral
    speaker="hex"
)
Recommended Stack:
STT: faster-whisper large-v3
TTS: Kokoro (default) + XTTS-v2 (special moments)
Format: Kokoro outputs 24kHz mono WAV; Discord voice expects 48kHz 16-bit stereo PCM, so resample on playback (FFmpeg handles this, see the sketch below)
Latency Summary:
- Voice detection to transcript: 3-5 seconds
- Response generation (LLM): 2-5 seconds (depends on response length)
- TTS synthesis: <1 second (Kokoro) to 3-5 seconds (XTTS-v2)
- Total round-trip: 5-15 seconds (acceptable for companion bot)
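The simplest playback path is to write the TTS output to a WAV file and let FFmpeg resample it to what Discord expects (a sketch; assumes ffmpeg is installed and on PATH, and the file path is illustrative):
import discord

def play_tts_file(voice_client: discord.VoiceClient, wav_path: str):
    """Play a TTS WAV through Discord; FFmpeg converts 24 kHz mono to 48 kHz stereo."""
    source = discord.FFmpegPCMAudio(wav_path)
    voice_client.play(source)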
Known Pitfall: Whisper can hallucinate on silence or background noise. Implement silence detection before sending audio to Whisper:
# Quick energy-based VAD (voice activity detection)
if audio_energy > ENERGY_THRESHOLD and duration_seconds > 0.5:
    transcript = await transcribe(audio)
Avatar System
VRoid SDK Current State (Jan 2026)
Reality Check: VRoid SDK has limited native Discord support. This is a constraint, not a blocker.
What Works:
- VRoid Studio: Free avatar creation tool (desktop application)
- VRoid Hub API (launched Aug 2023): Allows linking web apps to avatar library
- Unity Export: VRoid models export as VRM format → importable into other tools
What Doesn't Work Natively:
- No direct Discord.py integration for in-chat avatar rendering
- VRoid models don't natively stream as Discord videos
Integration Path: VSeeFace + Discord Screen Share
Architecture:
- VRoid Studio → Create/customize Hex avatar, export as VRM
- VSeeFace (free, open-source) → Load VRM, enable webcam tracking
- Discord Screen Share → Stream VSeeFace window showing animated avatar
Setup:
# Download VSeeFace from https://www.vseeface.icu/
# Install, load your VRM model
# Enable virtual camera output
# In Discord voice channel: "Share Screen" → select VSeeFace window
Limitations:
- Requires concurrent Discord call (uses bandwidth)
- Webcam-driven animation (not ideal for "sees through camera" feature if no webcam)
- Screen share quality capped at 1080p 30fps
Avatar Animations
Personality-Driven Animations:
- Tsundere moments: Head turn away, arms crossed
- Excited: Jump, spin, exaggerated gestures
- Confused: Head tilt, question mark float
- Annoyed: Foot tap, dismissive wave
These can be mapped to emotion detection from message sentiment or voice tone.
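A hedged sketch of that mapping: the expression names below are placeholders, and how a cue actually reaches the avatar (for example, a manually configured VSeeFace expression hotkey) depends entirely on the local setup.
# Map DeepFace's dominant_emotion labels to avatar expression cues (names are illustrative)
EMOTION_TO_EXPRESSION = {
    "happy": "excited_bounce",
    "sad": "head_tilt_down",
    "angry": "arms_crossed",
    "surprise": "confused_tilt",
    "neutral": "idle",
}

def pick_expression(dominant_emotion: str) -> str:
    return EMOTION_TO_EXPRESSION.get(dominant_emotion, "idle")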
Alternatives to VRoid
| System | Pros | Cons | Discord Fit |
|---|---|---|---|
| Ready Player Me | Web avatar creation, multiple games support | API requires auth, monthly costs | Medium |
| VRoid | Free, high customization, anime-style | Limited Discord integration | Low |
| Live2D | 2D avatar system, smooth animations | Different workflow, steeper learning curve | Medium |
| Custom 3D (Blender) | Full control, open tools | High production effort | Low |
Recommendation: Stick with VRoid + VSeeFace. It's free, looks great, and the screen-share workaround is acceptable.
Webcam & Computer Vision
OpenCV 4.10+ (Current Stable)
Installation: pip install "opencv-python>=4.10.0"
Capabilities (verified 2025-2026):
- Face Detection: Haar Cascades (fast, CPU-friendly) or DNN-based (accurate, GPU-friendly)
- Emotion Recognition: Via DeepFace or FER2013-trained models
- Real-time Video: 30-60 FPS on consumer hardware (depends on resolution and preprocessing)
- Screen OCR: Via Tesseract integration for UI detection
Real-Time Processing Specs
Hardware Baseline (RTX 3060 Ti):
- Face detection + recognition: 30 FPS @ 1080p
- Emotion classification: 15-30 FPS (depending on model)
- Combined (face + emotion): 12-20 FPS
For Hex's "Sees Through Webcam" Feature:
import cv2
import asyncio

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

async def process_webcam():
    """Background task: analyze webcam feed for mood context"""
    cap = cv2.VideoCapture(0)
    while True:
        # cap.read() and detectMultiScale() block, so push them off the event loop
        ret, frame = await asyncio.to_thread(cap.read)
        if not ret:
            await asyncio.sleep(0.1)
            continue
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)   # Haar cascades expect grayscale
        faces = await asyncio.to_thread(face_cascade.detectMultiScale, gray, 1.3, 5)
        if len(faces) > 0:
            # Analyze emotion for context (detect_emotion / hex_context defined elsewhere)
            emotion = await detect_emotion(frame)
            await hex_context.update_mood(emotion)
        # Process at most ~3 FPS to keep the loop light
        await asyncio.sleep(0.33)
Critical Pattern: Never run CV on main event loop. Use asyncio.to_thread() for blocking OpenCV calls:
# WRONG: blocks event loop
emotion = detect_emotion(frame)
# RIGHT: non-blocking
emotion = await asyncio.to_thread(detect_emotion, frame)
Emotion Detection Libraries
| Library | Model Size | Accuracy | Speed |
|---|---|---|---|
| DeepFace | ~40MB | 90%+ | 50-100ms/face |
| FER2013 | ~10MB | 65-75% | 10-20ms/face |
| MediaPipe | ~20MB | 80%+ | 20-30ms/face |
Recommendation: DeepFace is industry standard. FER2013 if latency is critical.
pip install deepface
pip install torch torchvision
# Usage
from deepface import DeepFace
result = DeepFace.analyze(frame, actions=['emotion'], enforce_detection=False)
emotion = result[0]['dominant_emotion'] # 'happy', 'sad', 'angry', etc.
Screen Sharing Analysis (Optional)
For context like "user is watching X game":
# OCR for text detection
pip install pytesseract
# UI detection (ResNet-based)
pip install screen-recognition
# Together: detect game UI, read text, determine context
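A sketch of the OCR half, assuming the mss package for screen capture in addition to pytesseract (both are extra dependencies; the monitor index is illustrative):
import mss
import numpy as np
import cv2
import pytesseract

def read_screen_text(monitor_index: int = 1) -> str:
    """Grab one monitor and OCR any visible text for context hints."""
    with mss.mss() as sct:
        shot = sct.grab(sct.monitors[monitor_index])   # BGRA screenshot of the monitor
    frame = np.array(shot)[:, :, :3]                   # drop the alpha channel
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)     # OCR works better on grayscale
    return pytesseract.image_to_string(gray)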
Memory Architecture
Short-Term Memory: SQLite
Purpose: Store conversation history, user preferences, relationship state
Schema:
CREATE TABLE conversations (
id INTEGER PRIMARY KEY,
user_id TEXT NOT NULL,
timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,
message TEXT NOT NULL,
sender TEXT NOT NULL, -- 'user' or 'hex'
emotion TEXT, -- detected from webcam/tone
context TEXT -- screen state, game, etc.
);
CREATE TABLE user_relationships (
user_id TEXT PRIMARY KEY,
first_seen DATETIME,
interaction_count INTEGER,
favorite_topics TEXT, -- JSON array
known_traits TEXT, -- JSON
last_interaction DATETIME
);
CREATE TABLE hex_state (
key TEXT PRIMARY KEY,
value TEXT,
updated_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_user_timestamp ON conversations(user_id, timestamp);
Query Pattern (for context retrieval):
import sqlite3
def get_recent_context(user_id: str, num_messages: int = 20) -> list[str]:
"""Retrieve conversation history for LLM context"""
conn = sqlite3.connect("hex.db")
cursor = conn.cursor()
cursor.execute("""
SELECT sender, message FROM conversations
WHERE user_id = ?
ORDER BY timestamp DESC
LIMIT ?
""", (user_id, num_messages))
history = cursor.fetchall()
conn.close()
# Format for LLM
return [f"{sender}: {message}" for sender, message in reversed(history)]
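The matching write path (a sketch; the helper name is new, but the columns follow the schema above):
import sqlite3

def store_message(user_id: str, sender: str, message: str,
                  emotion: str | None = None, context: str | None = None) -> None:
    """Append one turn of conversation to short-term memory."""
    conn = sqlite3.connect("hex.db")
    conn.execute(
        """INSERT INTO conversations (user_id, sender, message, emotion, context)
           VALUES (?, ?, ?, ?, ?)""",
        (user_id, sender, message, emotion, context),
    )
    conn.commit()
    conn.close()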
Long-Term Memory: Vector Database
Purpose: Semantic search over past interactions ("Remember when we talked about...?")
Recommendation: ChromaDB (Development) → Qdrant (Production)
ChromaDB (for now):
- Embedded in Python process
- Zero setup
- 4x faster in 2025 Rust rewrite
- Scales to ~1M vectors on single machine
Migration Path: Start with ChromaDB, migrate to Qdrant if vector count exceeds 100k or response latency matters.
Installation:
pip install chromadb
# Usage
import chromadb
client = chromadb.EphemeralClient() # In-memory for dev
# or
client = chromadb.PersistentClient(path="./hex_vectors") # Persistent
collection = client.get_or_create_collection(
name="conversation_memories",
metadata={"hnsw:space": "cosine"}
)
# Store memory
collection.add(
ids=[f"msg_{timestamp}"],
documents=[message_text],
metadatas=[{"user_id": user_id, "date": timestamp}],
embeddings=[embedding_vector]
)
# Retrieve similar memories
results = collection.query(
query_texts=["user likes playing valorant"],
n_results=3
)
Embedding Model
Recommendation: sentence-transformers/all-MiniLM-L6-v2 (384-dim, ~23M parameters)
pip install sentence-transformers
from sentence_transformers import SentenceTransformer
embedder = SentenceTransformer('all-MiniLM-L6-v2')
embedding = embedder.encode("I love playing games with you", convert_to_tensor=False)
Why MiniLM-L6:
- Small (~23M parameters, ~90MB on disk), fast (<5ms per sentence on CPU)
- High quality (competitive with large models on semantic tasks)
- Designed for retrieval (better than generic BERT for similarity)
- Popular in production (battle-tested)
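A sketch tying the embedder to the ChromaDB collection for writes (collection is the conversation_memories collection created above; the helper name is new, and storing the timestamp as a number makes later cleanup filters easier):
import time

def store_long_term_memory(collection, embedder, user_id: str, text: str) -> None:
    """Embed a message and persist it for later semantic recall."""
    ts = time.time()
    collection.add(
        ids=[f"{user_id}_{ts}"],
        documents=[text],
        metadatas=[{"user_id": user_id, "timestamp": ts}],
        embeddings=[embedder.encode(text).tolist()],
    )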
Memory Retrieval Pattern for LLM Context
async def get_full_context(user_id: str, query: str) -> str:
"""Build context string for LLM from short + long-term memory"""
# Short-term: recent messages
recent_msgs = get_recent_context(user_id, num_messages=10)
recent_text = "\n".join(recent_msgs)
# Long-term: semantic search
embedding = embedder.encode(query)
similar_memories = vectors.query(
query_embeddings=[embedding],
n_results=5,
where={"user_id": {"$eq": user_id}}
)
memory_text = "\n".join([
doc for doc in similar_memories['documents'][0]
])
# Relationship state
relationship = get_user_relationship(user_id)
return f"""Recent conversation:
{recent_text}
Relevant memories:
{memory_text}
About {user_id}: {relationship['known_traits']}
"""
Confidence Levels
- Short-term (SQLite): HIGH — mature, proven
- Long-term (ChromaDB): MEDIUM — good for dev, test migration path early
- Embeddings (MiniLM): HIGH — widely adopted, production-ready
Python Async Patterns
Core Discord.py + LLM Integration
The Problem: Discord bot event loop blocks if you call LLM synchronously.
The Solution: Always use asyncio.create_task() for I/O-bound work.
import asyncio
from discord.ext import commands
@commands.Cog.listener()
async def on_message(self, message: discord.Message):
"""Non-blocking message handling"""
if message.author == self.bot.user:
return
# Bad (blocks event loop for 5+ seconds):
# response = generate_response(message.content)
# Good (non-blocking):
async def generate_and_send():
thinking = await message.channel.send("*thinking*...")
response = await asyncio.to_thread(
generate_response,
message.content
)
await thinking.edit(content=response)
asyncio.create_task(generate_and_send())
Concurrent Task Patterns
Pattern 1: Parallel Text Send + TTS
async def respond_with_voice(text: str, text_channel, voice_client):
    """Generate the response once, then send text and synthesize voice concurrently"""
    response_text = await generate_llm_response(text)
    # Sending the text reply and synthesizing speech are independent; run them in parallel
    _, voice_audio = await asyncio.gather(
        text_channel.send(response_text),
        synthesize_tts(response_text),
    )
    # Raw PCM for Discord must be 48 kHz, 16-bit stereo (or use FFmpegPCMAudio on a file)
    voice_client.play(discord.PCMAudio(voice_audio))
Pattern 2: Task Queue for Rate Limiting
import asyncio
class ResponseQueue:
def __init__(self, max_concurrent: int = 2):
self.semaphore = asyncio.Semaphore(max_concurrent)
self.pending = []
async def queue_response(self, user_id: str, text: str):
async with self.semaphore:
# Only 2 concurrent responses
response = await generate_response(text)
self.pending.append((user_id, response))
return response
queue = ResponseQueue(max_concurrent=2)
Pattern 3: Background Personality Tasks
from discord.ext import tasks
class HexPersonality(commands.Cog):
def __init__(self, bot):
self.bot = bot
self.mood = "neutral"
self.update_mood.start()
@tasks.loop(minutes=5) # Every 5 minutes
async def update_mood(self):
"""Cycle personality state based on time + interactions"""
self.mood = await calculate_mood(
time_of_day=datetime.now(),
recent_interactions=self.get_recent_count(),
sleep_deprived=self.is_late_night()
)
# Emit mood change to memory
await self.bot.hex_db.update_state("current_mood", self.mood)
@update_mood.before_loop
async def before_update_mood(self):
await self.bot.wait_until_ready()
Handling CPU-Bound Work
OpenCV, emotion detection, transcription are CPU-bound.
# Pattern: Use to_thread for CPU work
emotion = await asyncio.to_thread(
analyze_emotion,
frame
)
# Pattern: Use ThreadPoolExecutor for multiple CPU tasks
import concurrent.futures
executor = concurrent.futures.ThreadPoolExecutor(max_workers=2)
loop = asyncio.get_running_loop()   # get_event_loop() is deprecated inside coroutines
emotion = await loop.run_in_executor(executor, analyze_emotion, frame)
Error Handling & Resilience
async def safe_generate_response(message: str) -> str:
"""Generate response with fallback"""
try:
response = await asyncio.wait_for(
generate_llm_response(message),
            timeout=15.0  # allow for the 2-10 second local generation budget
)
return response
except asyncio.TimeoutError:
return "I'm thinking too hard... ask me again?"
except Exception as e:
logger.error(f"Generation failed: {e}")
return "*confused goblin noises*"
Concurrent Request Management (Discord.py)
class ConcurrencyManager:
def __init__(self):
self.active_tasks = {}
self.max_per_user = 1 # One response at a time per user
async def handle_message(self, user_id: str, text: str):
if user_id in self.active_tasks and not self.active_tasks[user_id].done():
return "I'm still thinking from last time!"
task = asyncio.create_task(generate_response(text))
self.active_tasks[user_id] = task
try:
response = await task
return response
finally:
del self.active_tasks[user_id]
Known Pitfalls & Solutions
1. Discord Event Loop Blocking
Problem: Synchronous LLM calls block the bot, causing timeouts on other messages.
Solution: Always use asyncio.to_thread() or asyncio.create_task().
2. Whisper Hallucination on Silence
Problem: Whisper can generate text from pure background noise. Solution: Implement voice activity detection (VAD) before transcription.
import librosa
import numpy as np
def has_speech(audio_path, threshold=-35):
"""Check if audio has meaningful energy"""
y, sr = librosa.load(audio_path)
S = librosa.feature.melspectrogram(y=y, sr=sr)
S_db = librosa.power_to_db(S, ref=np.max)
mean_energy = np.mean(S_db)
return mean_energy > threshold
3. Vector DB Scale Creep
Problem: ChromaDB slows down as memories accumulate. Solution: Archive old memories, implement periodic cleanup.
# Archive conversations older than 90 days
old_threshold = datetime.now() - timedelta(days=90)
db.cleanup_old_memories(older_than=old_threshold)
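cleanup_old_memories is a hypothetical helper; a sketch of one way to implement it against ChromaDB, assuming each memory carries a numeric timestamp metadata field (as in the storage sketch earlier):
from datetime import datetime, timedelta

def cleanup_old_memories(collection, days: int = 90) -> None:
    """Delete vector memories older than the given number of days."""
    cutoff = (datetime.now() - timedelta(days=days)).timestamp()
    collection.delete(where={"timestamp": {"$lt": cutoff}})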
4. Model Memory Growth
Problem: Loading Llama 3.1 8B in 4-bit still uses ~6GB, leaving little room for TTS/CV models. Solution: Use offloading or accept single-component operation.
# Option 1: Offload LLM to CPU between requests
# Option 2: Run TTS/CV in separate process
# Option 3: Use smaller model (Mistral 7B) when GPU-constrained
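For option 1, one practical lever is Ollama's keep_alive field, which unloads the model (freeing VRAM) right after a request. A sketch, assuming a reasonably recent Ollama version:
# Ask Ollama to free the model's VRAM immediately after this response
response = await client.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1",
        "prompt": prompt,
        "stream": False,
        "keep_alive": 0,   # 0 = unload right away; the default keeps it warm for ~5 minutes
    },
)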
5. Async Context Issues
Problem: Storing references to coroutines without awaiting them. Solution: Always create tasks explicitly:
# Bad
coro = generate_response(text) # Dangling coroutine
# Good
task = asyncio.create_task(generate_response(text))
response = await task
6. Personality Inconsistency
Problem: LLM generates different responses with same prompt due to randomness. Solution: Use consistent temperature and seed management.
# Conversation context → lower temperature (0.5)
# Creative/chaotic moments → higher temperature (0.9)
temperature = 0.5 if in_serious_context else 0.9
Recommended Deployment Configuration
# Local Development (Hex primary environment)
gpu: RTX 3060 (12GB VRAM) or better
llm: Llama 3.1 8B (4-bit via Ollama)
tts: Kokoro 82M
stt: faster-whisper large-v3
avatar: VRoid + VSeeFace
database: SQLite + ChromaDB (embedded)
inference_latency: 3-10 seconds per response
cost: $0/month (open-source stack)
# Optional: Production Scaling
gpu_cluster: vLLM on multi-GPU for concurrency
database: Qdrant (cloud) + PostgreSQL for history
inference_latency: <2 seconds (batching + optimization)
cost: ~$200-500/month cloud compute
Confidence Levels & 2026 Readiness
| Component | Recommendation | Confidence | 2026 Status |
|---|---|---|---|
| Discord.py 2.6.4+ | PRIMARY | HIGH | Stable, actively maintained |
| Llama 3.1 8B | PRIMARY | HIGH | Proven, production-ready |
| Mistral 7B | SECONDARY | HIGH | Fast-path fallback, stable |
| Ollama | PRIMARY | MEDIUM | Mature but rapidly evolving |
| vLLM | ALTERNATIVE | MEDIUM | High-performance alternative, v0.3+ recommended |
| Whisper Large V3 + faster-whisper | PRIMARY | HIGH | Gold standard for multilingual STT |
| Kokoro TTS | PRIMARY | MEDIUM | Emerging, high quality for size |
| XTTS-v2 | SPECIAL MOMENTS | HIGH | Voice cloning working well |
| VRoid + VSeeFace | PRIMARY | MEDIUM | Workaround viable, not native integration |
| ChromaDB | DEVELOPMENT | MEDIUM | Good for prototyping, evaluate Qdrant before 100k vectors |
| Qdrant | PRODUCTION | HIGH | Enterprise vector DB, proven at scale |
| OpenCV 4.10+ | PRIMARY | HIGH | Stable, mature ecosystem |
| DeepFace emotion detection | PRIMARY | HIGH | Industry standard, 90%+ accuracy |
| Python asyncio patterns | PRIMARY | HIGH | Python 3.11+ well-supported |
Confidence Interpretation:
- HIGH: Production-ready, API stable, no major changes expected in 2026
- MEDIUM: Solid choice but newer ecosystem (1-2 years old); evaluate alternatives annually
- LOW: Emerging or unstable; prototype only
Installation Checklist (Get Started)
# Discord
pip install "discord.py>=2.6.4"
# LLM & inference
pip install ollama torch transformers bitsandbytes
# TTS/STT
pip install faster-whisper
pip install sentence-transformers torch
# Vector DB
pip install chromadb
# Vision
pip install opencv-python deepface librosa
# Async utilities
pip install httpx aiofiles
# Database
pip install aiosqlite
# Start services
ollama serve &
# (Loads models on first run)
# Test basic chain
python test_stack.py
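test_stack.py is not defined anywhere above; a minimal sketch of what it might check (the DISCORD_TOKEN environment variable name is an assumption):
# test_stack.py — hypothetical smoke test for the local stack
import os
import httpx

def main():
    # 1. Ollama reachable and the model pulled?
    r = httpx.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3.1", "prompt": "Say hi in three words.", "stream": False},
        timeout=120.0,
    )
    print("Ollama:", r.json()["response"].strip())

    # 2. Discord token present? (the full login test happens when the bot starts)
    print("Discord token set:", bool(os.environ.get("DISCORD_TOKEN")))

    # 3. Embedding model loads?
    from sentence_transformers import SentenceTransformer
    dim = len(SentenceTransformer("all-MiniLM-L6-v2").encode("hello"))
    print("Embedding dim:", dim)

if __name__ == "__main__":
    main()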
Next Steps (For Roadmap)
- Phase 1: Discord.py + Ollama + basic LLM integration (1 week)
- Phase 2: STT pipeline (Whisper) + TTS (Kokoro) (1 week)
- Phase 3: Memory system (SQLite + ChromaDB) (1 week)
- Phase 4: Personality framework + system prompts (1 week)
- Phase 5: Webcam emotion detection + context integration (1 week)
- Phase 6: VRoid avatar + screen share integration (1 week)
- Phase 7: Self-modification capability + safety guards (2 weeks)
Total: ~8 weeks to full-featured Hex prototype.
Document Version: 1.0
Last Updated: January 2026
Hex Stack Status: Ready for implementation
Estimated Implementation Time: 8-12 weeks (to full personality bot)