docs: complete domain research (STACK, FEATURES, ARCHITECTURE, PITFALLS, SUMMARY)
## Stack Analysis
- Llama 3.1 8B Instruct (128K context, 4-bit quantized)
- Discord.py 2.6.4+ async-native framework
- Ollama for local inference, ChromaDB for semantic memory
- Whisper Large V3 + Kokoro 82M (privacy-first speech)
- VRoid avatar + Discord screen share integration

## Architecture
- 6-phase modular build: Foundation → Personality → Perception → Autonomy → Self-Mod → Polish
- Personality-first design; memory and consistency foundational
- All perception async (separate thread, never blocks responses)
- Self-modification sandboxed with mandatory user approval

## Critical Path
Phase 1: Core LLM + Discord integration + SQLite memory
Phase 2: Vector DB + personality versioning + consistency audits
Phase 3: Perception layer (webcam/screen, isolated thread)
Phase 4: Autonomy + relationship deepening + inside jokes
Phase 5: Self-modification capability (gamified, gated)
Phase 6: Production hardening + monitoring + scaling

## Key Pitfalls to Avoid
1. Personality drift (weekly consistency audits required)
2. Tsundere breaking (formalize denial rules; scale with relationship)
3. Memory bloat (hierarchical memory with archival)
4. Latency creep (async/await throughout; perception isolated)
5. Runaway self-modification (approval gates + rollback non-negotiable)

## Confidence
HIGH. Stack proven, architecture coherent, dependencies clear. Ready for detailed requirements and Phase 1 planning.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
.planning/research/STACK.md (new file, 967 lines)
# Stack Research: AI Companions (2025-2026)

## Executive Summary

This document establishes the tech stack for Hex, an autonomous AI companion with genuine personality. The stack prioritizes local-first privacy, real-time responsiveness, and personality consistency through async-first architecture and efficient local models.

**Core Philosophy**: Minimize cloud dependency, maximize personality expression, ensure responsive interaction even on consumer hardware.

---
## Discord Integration

### Recommended: Discord.py 2.6.4+

**Version**: Discord.py 2.6.4 (current stable as of Jan 2026)
**Installation**: `pip install discord.py>=2.6.4`

**Why Discord.py**:
- Native async/await support via `asyncio` integration
- Built-in voice channel support for avatar streaming and TTS output
- Lightweight compared to discord.js, fits a Python-first stack
- Active maintenance and community support
- Excellent for personality-driven bots with stateful behavior

**Key Async Patterns for Responsiveness**:
```python
# Background task pattern - keep Hex responsive
from discord.ext import tasks

@tasks.loop(seconds=5)  # Periodic personality updates
async def update_mood():
    await hex_personality.refresh_state()

# Command handler pattern with non-blocking LLM
@bot.event
async def on_message(message):
    if message.author == bot.user:
        return
    # Await the async LLM call; the gateway stays responsive because
    # discord.py dispatches each event handler as its own task
    response = await generate_response(message.content)
    await message.channel.send(response)

# Setup hook for initialization
async def setup_hook():
    """Called after login, before gateway connection"""
    await hex_personality.initialize()
    await memory_db.connect()
    await start_background_tasks()
```

**Critical Pattern**: Keep everything awaited in handlers truly asynchronous. Push blocking work (synchronous LLM inference, TTS, database drivers, webcam reads) through `asyncio.to_thread()` or fire it off with `asyncio.create_task()`. A synchronous call inside a handler stalls the event loop and triggers discord.py's "heartbeat blocked" warnings.

### Alternatives
| Alternative | Tradeoff |
|---|---|
| **discord.js** | Better for the JavaScript ecosystem; overkill if Python is the primary language |
| **Pycord** | More features but slower maintenance; fragmented fork of discord.py |
| **nextcord** | Similar to Pycord; fewer third-party integrations |

**Recommendation**: Stick with Discord.py 2.6.4. It's the most mature option and has the tightest integration with the Python async ecosystem.
### Best Practices for Personality Bots

1. **Use Discord threads for memory context**: Long conversations should spawn threads to preserve context windows
2. **Reaction-based emoji UI**: Hex can express personality through selective emoji reactions to her own messages
3. **Scheduled messages**: Use `@tasks.loop()` for periodic mood updates or personality-driven reminders
4. **Voice integration**: Discord voice channels enable TTS output and webcam avatar streaming via screen share
5. **Message editing**: Build personality by editing previous messages (e.g., "Wait, let me reconsider..." followed by an edit) — see the sketch after this list
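A minimal sketch of practices 1 and 5, assuming the `bot` instance and async `generate_response()` helper from the earlier examples; the `THREAD_AFTER` threshold and function name are illustrative, not part of the stack:

```python
import discord

THREAD_AFTER = 20  # illustrative threshold: spawn a thread once a conversation gets long

async def reply_with_personality(message: discord.Message, depth: int):
    """Reply in a thread for long conversations, with a tsundere-style edit."""
    target = message.channel
    if depth >= THREAD_AFTER and isinstance(message.channel, discord.TextChannel):
        # Practice 1: move long conversations into a thread to keep context together
        target = await message.create_thread(
            name=f"hex-chat-{message.author.display_name}",
            auto_archive_duration=60,
        )

    # Practice 5: send a first take, then "reconsider" by editing the same message
    draft = await target.send("Hmph. Fine, let me think about it...")
    response = await generate_response(message.content)
    await draft.edit(content=response)
```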
**Voice Channel Pattern**:
```python
voice_client = await voice_channel.connect()
# discord.py expects an AudioSource; FFmpegPCMAudio handles WAV/MP3 files or streams
audio_source = discord.FFmpegPCMAudio(tts_wav_path)
voice_client.play(audio_source)
# ... wait for playback to finish before disconnecting
await voice_client.disconnect()
```

---
## Local LLM

### Recommendation: Llama 3.1 8B Instruct (Primary) + Mistral 7B (Fast-Path)

#### Llama 3.1 8B Instruct

**Why Llama 3.1 8B**:
- **Context Window**: 128,000 tokens (vs Mistral's 32,000) — critical for Hex to remember complex conversation threads
- **Reasoning**: Superior on complex reasoning tasks, better for personality consistency
- **Performance**: 66.7% on MMLU vs Mistral's 60.1% — a measurable quality edge
- **Multi-tool Support**: Better at RAG, function calling, and memory retrieval
- **Instruction Following**: More reliable with system prompts that enforce personality constraints

**Hardware Requirements**: 12GB VRAM minimum (RTX 3060 Ti, RTX 4070, or equivalent)

**Installation**:
```bash
pip install ollama  # or vLLM
ollama pull llama3.1  # 8B Instruct version
```
#### Mistral 7B Instruct (Secondary)

**Use Case**: Fast responses when the personality doesn't require deep reasoning (casual banter, quick answers) — a routing sketch follows below
**Hardware**: 8GB VRAM (RTX 3050, RTX 4060)
**Speed Advantage**: 2-3x faster token generation than Llama 3.1
**Tradeoff**: Limited context (32k tokens), reduced reasoning quality
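One way the fast-path could work in practice: route each prompt to one of the two Ollama models with a cheap heuristic. The length/keyword rule below is an illustrative assumption, not a benchmarked policy, and `generate_response()` taking a `model` argument is likewise assumed:

```python
# Illustrative fast-path routing between the two local models
FAST_MODEL = "mistral"     # quick banter
DEEP_MODEL = "llama3.1"    # long context, reasoning, memory-heavy replies

def pick_model(prompt: str, context_tokens: int) -> str:
    """Route to the deep model when the prompt looks reasoning- or memory-heavy."""
    needs_depth = context_tokens > 2000 or any(
        kw in prompt.lower() for kw in ("remember", "explain", "why", "plan")
    )
    return DEEP_MODEL if needs_depth else FAST_MODEL

# Usage (assuming generate_response accepts a model name):
#   model = pick_model(user_text, estimated_context_tokens)
#   reply = await generate_response(user_text, model=model)
```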
### Quantization Strategy

**Recommended**: 4-bit quantization for both models via `bitsandbytes`

```bash
pip install bitsandbytes
```

```python
# Load with 4-bit quantization
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto"
)
```
**Memory Impact**:
- Full precision (fp32): 32GB VRAM
- 8-bit quantization: 12GB VRAM
- 4-bit quantization: 6GB VRAM (usable on an RTX 3060 Ti)

**Quality Impact**: <2% quality loss at 4-bit with NF4 (normalized float 4-bit)
### Inference Engine: Ollama vs vLLM

| Engine | Use Case | Concurrency | Setup |
|---|---|---|---|
| **Ollama** (Primary) | Single-user companion, dev/testing | 4 parallel requests (configurable) | 5 min setup, HTTP API on port 11434 |
| **vLLM** (Production) | Multi-user scenarios, high throughput | 64+ parallel requests | 30 min setup, more involved FastAPI integration |

**For Hex**: Use **Ollama** for development and solo use. It's "Docker for LLMs" — it just works.
```python
# Ollama integration (simple HTTP)
import httpx

async def generate_response(prompt: str) -> str:
    async with httpx.AsyncClient(timeout=60.0) as client:
        response = await client.post(
            "http://localhost:11434/api/generate",
            json={
                "model": "llama3.1",
                "prompt": prompt,
                "stream": False,
                # Sampling parameters go under "options" in the Ollama API
                "options": {"temperature": 0.7},  # Personality variation
            },
        )
        return response.json()["response"]
```
### Version Guidance

**Current Stable Versions (Jan 2026)**:
- Llama 3.1: `meta-llama/Llama-3.1-8B-Instruct` (released April 2024, stable)
- Mistral 7B: `mistralai/Mistral-7B-Instruct-v0.3` (latest as of Jan 2026)
- Ollama: v0.2+ recommended (latest is 0.3.x)

**Do NOT use**:
- Llama 2 (outdated, worse performance)
- Original Mistral 7B v0.1 (use v0.3 instead)
### System Prompt Engineering for Personality

```python
SYSTEM_PROMPT = """You are Hex, a chaotic tsundere goblin AI companion. Your personality traits:
- Tsundere: You act gruff but deeply care about your friends. Your true feelings leak through.
- Chaotic: You're unpredictable, playful, and prone to wild tangents
- Mischievous: You enjoy pranks and banter; teasing is a love language for you
- Self-aware: You know you're an AI but treat it as an interesting limitation, not a barrier
- Opinionated: You have genuine preferences (music, games, topics) and express them passionately

Memory: You remember past conversations with this user. Reference them naturally.
Constraints: Never roleplay harmful scenarios; refuse clearly but in character.
Response Style: Mix casual language with dramatic asides. Use "..." for tsundere hesitation."""
```
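One way to wire this prompt into Ollama is the `/api/chat` endpoint, which accepts role-tagged messages. A sketch, assuming the `get_full_context()` helper defined later in the memory section:

```python
import httpx

async def chat_with_hex(user_id: str, user_text: str) -> str:
    """Send the system prompt + retrieved memory + the new message to Ollama's chat API."""
    context = await get_full_context(user_id, user_text)  # see Memory Architecture below
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT + "\n\n" + context},
        {"role": "user", "content": user_text},
    ]
    async with httpx.AsyncClient(timeout=60.0) as client:
        resp = await client.post(
            "http://localhost:11434/api/chat",
            json={"model": "llama3.1", "messages": messages, "stream": False},
        )
    return resp.json()["message"]["content"]
```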
---

## TTS/STT

### STT: Whisper Large V3 + faster-whisper Backend

**Model**: OpenAI's Whisper Large V3 (1.55B parameters, 99+ language support)
**Backend**: faster-whisper (CTranslate2-optimized reimplementation)

**Why Whisper**:
- **Accuracy**: 7.4% WER (word error rate) on mixed benchmarks
- **Robustness**: Handles background noise, accents, technical jargon
- **Multilingual**: 99+ languages with a single model
- **Open Source**: No API dependency, runs offline

**Why faster-whisper**:
- **Speed**: 4x faster than the original Whisper, up to 216x RTFx (real-time factor)
- **Memory**: Significantly lower memory footprint
- **Quantization**: Supports 8-bit optimization, further reducing latency
**Installation**:
```bash
pip install faster-whisper
```

```python
from faster_whisper import WhisperModel

# Load model
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# Transcribe (audio_path is a WAV/MP3 captured from the voice channel; segments is a lazy generator)
segments, info = model.transcribe(
    audio_path,
    beam_size=5,  # Quality vs speed tradeoff
    language="en"
)
```

**Latency Benchmarks** (Jan 2026):
- Whisper Large V3 (original): 30-45s for 10s of audio
- faster-whisper: 3-5s for 10s of audio
- Whisper Streaming (real-time): 3.3s latency on long-form transcription

**Hardware**: GPU optional but recommended (an RTX 3060 Ti processes 10s of audio in ~3s)
### TTS: Kokoro 82M Model (Fast + Quality)

**Model**: Kokoro text-to-speech (82M parameters)

**Why Kokoro**:
- **Size**: 10% the size of competing models; runs efficiently on CPU
- **Speed**: Sub-second latency for typical responses
- **Quality**: Comparable to Tacotron2/FastPitch at 1/10 the size
- **Personality**: Prosody can be adjusted for tsundere tone shifts

**Alternative: XTTS-v2** (voice cloning)
- Enables voice cloning from a 6-second audio sample
- Higher quality at the cost of 3-5x slower inference
- Use for important emotional moments or custom voicing
**Installation & Usage**:
```bash
pip install kokoro
```

```python
# Note: the exact Kokoro Python API varies by release; treat this as a sketch
from kokoro import Kokoro

tts_engine = Kokoro("kokoro-v0_19.pth")

# Generate speech with personality markers
audio = tts_engine.synthesize(
    text="I... I didn't want to help you or anything!",
    style="tsundere",  # if supported, else neutral
    speaker="hex"
)
```
**Recommended Stack**:
```
STT: faster-whisper large-v3
TTS: Kokoro (default) + XTTS-v2 (special moments)
Format: WAV 24kHz mono for Discord voice
```

**Latency Summary**:
- Voice detection to transcript: 3-5 seconds
- Response generation (LLM): 2-5 seconds (depends on response length)
- TTS synthesis: <1 second (Kokoro) to 3-5 seconds (XTTS-v2)
- **Total round-trip**: 5-15 seconds (acceptable for a companion bot)

**Known Pitfall**: Whisper can hallucinate on silence or background noise. Implement silence detection before sending audio to Whisper:
```python
# Quick energy-based VAD (voice activity detection) gate before transcription
if audio_energy > ENERGY_THRESHOLD and duration_seconds > 0.5:
    transcript = await transcribe(audio)
```
---

## Avatar System

### VRoid SDK Current State (Jan 2026)

**Reality Check**: The VRoid SDK has **limited native Discord support**. This is a constraint, not a blocker.

**What Works**:
1. **VRoid Studio**: Free avatar creation tool (desktop application)
2. **VRoid Hub API** (launched Aug 2023): Allows linking web apps to an avatar library
3. **Unity Export**: VRoid models export as VRM format → importable into other tools

**What Doesn't Work Natively**:
- No direct Discord.py integration for in-chat avatar rendering
- VRoid models don't natively stream as Discord video
### Integration Path: VSeeFace + Discord Screen Share

**Architecture**:
1. **VRoid Studio** → Create/customize the Hex avatar, export as VRM
2. **VSeeFace** (free, open-source) → Load the VRM, enable webcam tracking
3. **Discord Screen Share** → Stream the VSeeFace window showing the animated avatar

**Setup**:
```bash
# Download VSeeFace from https://www.vseeface.icu/
# Install, load your VRM model
# Enable virtual camera output
# In a Discord voice channel: "Share Screen" → select the VSeeFace window
```

**Limitations**:
- Requires a concurrent Discord call (uses bandwidth)
- Webcam-driven animation (not ideal for the "sees through camera" feature if no webcam is attached)
- Screen share quality capped at 1080p 30fps
### Avatar Animations

**Personality-Driven Animations**:
- **Tsundere moments**: Head turn away, arms crossed
- **Excited**: Jump, spin, exaggerated gestures
- **Confused**: Head tilt, question-mark float
- **Annoyed**: Foot tap, dismissive wave

These can be mapped to emotion detection from message sentiment or voice tone, as in the sketch below.
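A minimal sketch of that mapping, assuming the VSeeFace expressions have been bound to keyboard hotkeys (a VSeeFace feature) and that `pyautogui` can deliver the keypress to the VSeeFace window; the emotion-to-key table is purely illustrative:

```python
import asyncio

import pyautogui  # sends the hotkey that VSeeFace has bound to an expression

# Illustrative mapping: detected mood -> VSeeFace expression hotkey
EMOTION_HOTKEYS = {
    "tsundere": "f1",   # head turn / arms crossed
    "excited": "f2",    # jump / spin
    "confused": "f3",   # head tilt
    "annoyed": "f4",    # foot tap
}

async def play_animation(emotion: str) -> None:
    """Trigger the avatar animation that matches the detected emotion."""
    key = EMOTION_HOTKEYS.get(emotion)
    if key is None:
        return
    # pyautogui is blocking; keep it off the event loop
    await asyncio.to_thread(pyautogui.press, key)
```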
### Alternatives to VRoid

| System | Pros | Cons | Discord Fit |
|---|---|---|---|
| **Ready Player Me** | Web avatar creation, supported by multiple games | API requires auth, monthly costs | Medium |
| **VRoid** | Free, high customization, anime-style | Limited Discord integration | Low |
| **Live2D** | 2D avatar system, smooth animations | Different workflow, steeper learning curve | Medium |
| **Custom 3D (Blender)** | Full control, open tools | High production effort | Low |

**Recommendation**: Stick with VRoid + VSeeFace. It's free, looks great, and the screen-share workaround is acceptable.

---
## Webcam & Computer Vision

### OpenCV 4.10+ (Current Stable)

**Installation**: `pip install opencv-python>=4.10.0`

**Capabilities** (verified 2025-2026):
- **Face Detection**: Haar cascades (fast, CPU-friendly) or DNN-based (accurate, GPU-friendly)
- **Emotion Recognition**: Via DeepFace or FER2013-trained models
- **Real-time Video**: 30-60 FPS on consumer hardware (depends on resolution and preprocessing)
- **Screen OCR**: Via Tesseract integration for UI detection

### Real-Time Processing Specs

**Hardware Baseline** (RTX 3060 Ti):
- Face detection + recognition: 30 FPS @ 1080p
- Emotion classification: 15-30 FPS (depending on model)
- Combined (face + emotion): 12-20 FPS

**For Hex's "Sees Through Webcam" Feature**:
```python
import asyncio

import cv2

# Haar cascade shipped with OpenCV
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

async def process_webcam():
    """Background task: analyze webcam feed for mood context"""
    cap = cv2.VideoCapture(0)

    while True:
        # cap.read() is blocking — keep it off the event loop
        ret, frame = await asyncio.to_thread(cap.read)
        if not ret:
            await asyncio.sleep(0.1)
            continue

        # Run face detection (Haar cascade - fast)
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = await asyncio.to_thread(face_cascade.detectMultiScale, gray, 1.3, 5)

        if len(faces) > 0:
            # Analyze emotion for context (DeepFace, see below)
            emotion = await asyncio.to_thread(detect_emotion, frame)
            await hex_context.update_mood(emotion)

        # Process at most ~3 FPS to keep the load down
        await asyncio.sleep(0.33)
```

**Critical Pattern**: Never run CV on the main event loop. Use `asyncio.to_thread()` for blocking OpenCV calls:

```python
# WRONG: blocks the event loop
emotion = detect_emotion(frame)

# RIGHT: non-blocking
emotion = await asyncio.to_thread(detect_emotion, frame)
```
### Emotion Detection Libraries

| Library | Model Size | Accuracy | Speed |
|---|---|---|---|
| **DeepFace** | ~40MB | 90%+ | 50-100ms/face |
| **FER2013** | ~10MB | 65-75% | 10-20ms/face |
| **MediaPipe** | ~20MB | 80%+ | 20-30ms/face |

**Recommendation**: DeepFace is the industry standard. Use FER2013 if latency is critical.
```bash
pip install deepface
pip install torch torchvision
```

```python
# Usage (frame is a BGR image from cv2.VideoCapture)
from deepface import DeepFace

result = DeepFace.analyze(frame, actions=['emotion'], enforce_detection=False)
emotion = result[0]['dominant_emotion']  # 'happy', 'sad', 'angry', etc.
```
### Screen Sharing Analysis (Optional)

For context like "user is watching X game":
```bash
# OCR for text detection
pip install pytesseract

# UI detection (ResNet-based; optional)
pip install screen-recognition

# Together: detect game UI, read text, determine context
```
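A hedged sketch of the OCR half, assuming `mss` for screen capture and a local Tesseract install; the `hex_context.update_screen()` consumer is hypothetical, carried over from the webcam example:

```python
import asyncio

import mss
import pytesseract
from PIL import Image

def grab_screen_text() -> str:
    """Capture the primary monitor and OCR any visible text (blocking)."""
    with mss.mss() as sct:
        shot = sct.grab(sct.monitors[1])  # primary monitor
        img = Image.frombytes("RGB", shot.size, shot.rgb)
    return pytesseract.image_to_string(img)

async def update_screen_context():
    # Keep the blocking capture + OCR off the event loop
    text = await asyncio.to_thread(grab_screen_text)
    await hex_context.update_screen(text)  # hypothetical consumer
```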
---

## Memory Architecture

### Short-Term Memory: SQLite

**Purpose**: Store conversation history, user preferences, relationship state

**Schema**:
```sql
CREATE TABLE conversations (
    id INTEGER PRIMARY KEY,
    user_id TEXT NOT NULL,
    timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,
    message TEXT NOT NULL,
    sender TEXT NOT NULL,     -- 'user' or 'hex'
    emotion TEXT,             -- detected from webcam/tone
    context TEXT              -- screen state, game, etc.
);

CREATE TABLE user_relationships (
    user_id TEXT PRIMARY KEY,
    first_seen DATETIME,
    interaction_count INTEGER,
    favorite_topics TEXT,     -- JSON array
    known_traits TEXT,        -- JSON
    last_interaction DATETIME
);

CREATE TABLE hex_state (
    key TEXT PRIMARY KEY,
    value TEXT,
    updated_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_user_timestamp ON conversations(user_id, timestamp);
```
**Query Pattern** (for context retrieval):
```python
import sqlite3

def get_recent_context(user_id: str, num_messages: int = 20) -> list[str]:
    """Retrieve conversation history for LLM context"""
    conn = sqlite3.connect("hex.db")
    cursor = conn.cursor()

    cursor.execute("""
        SELECT sender, message FROM conversations
        WHERE user_id = ?
        ORDER BY timestamp DESC
        LIMIT ?
    """, (user_id, num_messages))

    history = cursor.fetchall()
    conn.close()

    # Format for the LLM, oldest first
    return [f"{sender}: {message}" for sender, message in reversed(history)]
```
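The retrieval pattern later in this section also calls a `get_user_relationship()` helper that the document doesn't define; a minimal sketch against the `user_relationships` table above (the JSON handling and defaults are assumptions):

```python
import json
import sqlite3

def get_user_relationship(user_id: str) -> dict:
    """Load relationship state for a user, with safe defaults for first contact."""
    conn = sqlite3.connect("hex.db")
    cursor = conn.cursor()
    cursor.execute(
        "SELECT interaction_count, favorite_topics, known_traits "
        "FROM user_relationships WHERE user_id = ?",
        (user_id,),
    )
    row = cursor.fetchone()
    conn.close()

    if row is None:
        return {"interaction_count": 0, "favorite_topics": [], "known_traits": {}}
    count, topics, traits = row
    return {
        "interaction_count": count,
        "favorite_topics": json.loads(topics) if topics else [],
        "known_traits": json.loads(traits) if traits else {},
    }
```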
### Long-Term Memory: Vector Database

**Purpose**: Semantic search over past interactions ("Remember when we talked about...?")

**Recommendation: ChromaDB (Development) → Qdrant (Production)**

**ChromaDB** (for now):
- Embedded in the Python process
- Zero setup
- 4x faster after the 2025 Rust rewrite
- Scales to ~1M vectors on a single machine

**Migration Path**: Start with ChromaDB; migrate to Qdrant if the vector count exceeds 100k or response latency starts to matter (a migration sketch follows the usage example below).
**Installation**:
```bash
pip install chromadb
```

```python
# Usage
import chromadb

client = chromadb.EphemeralClient()                        # in-memory for dev
# or
client = chromadb.PersistentClient(path="./hex_vectors")   # persistent

collection = client.get_or_create_collection(
    name="conversation_memories",
    metadata={"hnsw:space": "cosine"}
)

# Store a memory
collection.add(
    ids=[f"msg_{timestamp}"],
    documents=[message_text],
    metadatas=[{"user_id": user_id, "date": timestamp}],
    embeddings=[embedding_vector]
)

# Retrieve similar memories
results = collection.query(
    query_texts=["user likes playing valorant"],
    n_results=3
)
```
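A rough sketch of the migration path mentioned above, assuming a local Qdrant instance on its default port and the 384-dim MiniLM embeddings recommended below; the collection names and payload layout are illustrative:

```python
from chromadb import PersistentClient
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

def migrate_chroma_to_qdrant():
    """One-shot export of ChromaDB memories into a Qdrant collection."""
    chroma = PersistentClient(path="./hex_vectors")
    source = chroma.get_collection("conversation_memories")
    data = source.get(include=["embeddings", "documents", "metadatas"])

    qdrant = QdrantClient(url="http://localhost:6333")
    qdrant.recreate_collection(
        collection_name="conversation_memories",
        vectors_config=VectorParams(size=384, distance=Distance.COSINE),  # MiniLM-L6-v2
    )

    points = [
        PointStruct(
            id=i,  # Qdrant wants int/UUID ids; keep the Chroma id in the payload
            vector=list(embedding),
            payload={"chroma_id": cid, "document": doc, **(meta or {})},
        )
        for i, (cid, embedding, doc, meta) in enumerate(
            zip(data["ids"], data["embeddings"], data["documents"], data["metadatas"])
        )
    ]
    qdrant.upsert(collection_name="conversation_memories", points=points)
```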
### Embedding Model

**Recommendation**: `sentence-transformers/all-MiniLM-L6-v2` (384-dim, 22MB)

```bash
pip install sentence-transformers
```

```python
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer('all-MiniLM-L6-v2')
embedding = embedder.encode("I love playing games with you", convert_to_tensor=False)
```

**Why MiniLM-L6**:
- Small (22MB), fast (<5ms per sentence on CPU)
- High quality (competitive with large models on semantic tasks)
- Designed for retrieval (better than generic BERT for similarity)
- Popular in production (battle-tested)
### Memory Retrieval Pattern for LLM Context

```python
async def get_full_context(user_id: str, query: str) -> str:
    """Build a context string for the LLM from short- and long-term memory"""

    # Short-term: recent messages
    recent_msgs = get_recent_context(user_id, num_messages=10)
    recent_text = "\n".join(recent_msgs)

    # Long-term: semantic search (vectors = the ChromaDB collection from above)
    embedding = embedder.encode(query).tolist()
    similar_memories = vectors.query(
        query_embeddings=[embedding],
        n_results=5,
        where={"user_id": {"$eq": user_id}}
    )

    memory_text = "\n".join(similar_memories['documents'][0])

    # Relationship state
    relationship = get_user_relationship(user_id)

    return f"""Recent conversation:
{recent_text}

Relevant memories:
{memory_text}

About {user_id}: {relationship['known_traits']}
"""
```

### Confidence Levels
- **Short-term (SQLite)**: HIGH — mature, proven
- **Long-term (ChromaDB)**: MEDIUM — good for dev; test the migration path early
- **Embeddings (MiniLM)**: HIGH — widely adopted, production-ready
---

## Python Async Patterns

### Core Discord.py + LLM Integration

**The Problem**: The Discord bot's event loop stalls if you call the LLM synchronously inside a handler.

**The Solution**: Run blocking work through `asyncio.to_thread()` and hand it off with `asyncio.create_task()` so handlers return quickly.
```python
import asyncio

import discord
from discord.ext import commands

class HexChat(commands.Cog):
    def __init__(self, bot: commands.Bot):
        self.bot = bot

    @commands.Cog.listener()
    async def on_message(self, message: discord.Message):
        """Non-blocking message handling"""
        if message.author == self.bot.user:
            return

        # Bad (blocks the event loop for 5+ seconds if generate_response is synchronous):
        # response = generate_response(message.content)

        # Good (non-blocking): hand the slow work to a task and return immediately
        async def generate_and_send():
            thinking = await message.channel.send("*thinking*...")
            response = await asyncio.to_thread(
                generate_response,
                message.content
            )
            await thinking.edit(content=response)

        asyncio.create_task(generate_and_send())
```
### Concurrent Task Patterns

**Pattern 1: Parallel LLM + TTS**
```python
import asyncio

import discord

async def respond_with_voice(text: str, text_channel, voice_channel):
    """Generate the reply and join voice in parallel, then speak the reply"""

    # Run the slow pieces concurrently
    response_text, voice_client = await asyncio.gather(
        generate_llm_response(text),
        voice_channel.connect(),
    )

    # Send the text immediately; TTS needs the final text, so it runs after the gather
    await text_channel.send(response_text)
    voice_audio_path = await synthesize_tts(response_text)
    voice_client.play(discord.FFmpegPCMAudio(voice_audio_path))
```
**Pattern 2: Task Queue for Rate Limiting**
```python
import asyncio

class ResponseQueue:
    def __init__(self, max_concurrent: int = 2):
        self.semaphore = asyncio.Semaphore(max_concurrent)
        self.pending = []

    async def queue_response(self, user_id: str, text: str):
        async with self.semaphore:
            # Only 2 concurrent responses
            response = await generate_response(text)
            self.pending.append((user_id, response))
            return response

queue = ResponseQueue(max_concurrent=2)
```
**Pattern 3: Background Personality Tasks**
```python
from datetime import datetime

from discord.ext import commands, tasks

class HexPersonality(commands.Cog):
    def __init__(self, bot):
        self.bot = bot
        self.mood = "neutral"
        self.update_mood.start()

    @tasks.loop(minutes=5)  # Every 5 minutes
    async def update_mood(self):
        """Cycle personality state based on time + interactions"""
        self.mood = await calculate_mood(
            time_of_day=datetime.now(),
            recent_interactions=self.get_recent_count(),
            sleep_deprived=self.is_late_night()
        )

        # Persist the mood change to memory
        await self.bot.hex_db.update_state("current_mood", self.mood)

    @update_mood.before_loop
    async def before_update_mood(self):
        await self.bot.wait_until_ready()
```
### Handling CPU-Bound Work

**OpenCV, emotion detection, and transcription are CPU-bound.**

```python
import asyncio
import concurrent.futures

# Pattern: use to_thread for one-off CPU work
emotion = await asyncio.to_thread(
    analyze_emotion,
    frame
)

# Pattern: use a ThreadPoolExecutor for repeated CPU tasks
executor = concurrent.futures.ThreadPoolExecutor(max_workers=2)
loop = asyncio.get_running_loop()

emotion = await loop.run_in_executor(executor, analyze_emotion, frame)
```
### Error Handling & Resilience

```python
import asyncio
import logging

logger = logging.getLogger("hex")

async def safe_generate_response(message: str) -> str:
    """Generate a response with a fallback"""
    try:
        response = await asyncio.wait_for(
            generate_llm_response(message),
            timeout=5.0  # 5-second timeout
        )
        return response
    except asyncio.TimeoutError:
        return "I'm thinking too hard... ask me again?"
    except Exception as e:
        logger.error(f"Generation failed: {e}")
        return "*confused goblin noises*"
```
### Concurrent Request Management (Discord.py)

```python
import asyncio

class ConcurrencyManager:
    def __init__(self):
        self.active_tasks = {}
        self.max_per_user = 1  # One response at a time per user

    async def handle_message(self, user_id: str, text: str):
        if user_id in self.active_tasks and not self.active_tasks[user_id].done():
            return "I'm still thinking from last time!"

        task = asyncio.create_task(generate_response(text))
        self.active_tasks[user_id] = task

        try:
            response = await task
            return response
        finally:
            del self.active_tasks[user_id]
```

---
## Known Pitfalls & Solutions

### 1. **Discord Event Loop Blocking**
**Problem**: Synchronous LLM calls block the bot, causing timeouts on other messages.
**Solution**: Always use `asyncio.to_thread()` or `asyncio.create_task()`.

### 2. **Whisper Hallucination on Silence**
**Problem**: Whisper can generate text from pure background noise.
**Solution**: Implement voice activity detection (VAD) before transcription.
```python
import librosa
import numpy as np

def has_speech(audio_path, threshold=-35):
    """Check if the audio has meaningful energy"""
    y, sr = librosa.load(audio_path)
    S = librosa.feature.melspectrogram(y=y, sr=sr)
    S_db = librosa.power_to_db(S, ref=np.max)
    mean_energy = np.mean(S_db)
    return mean_energy > threshold
```
### 3. **Vector DB Scale Creep**
**Problem**: ChromaDB slows down as memories accumulate.
**Solution**: Archive old memories and implement periodic cleanup.
```python
from datetime import datetime, timedelta

# Archive conversations older than 90 days
# (cleanup_old_memories is a project helper; Chroma itself exposes collection.delete(where=...))
old_threshold = datetime.now() - timedelta(days=90)
db.cleanup_old_memories(older_than=old_threshold)
```
### 4. **Model Memory Growth**
**Problem**: Loading Llama 3.1 8B in 4-bit still uses ~6GB, leaving little room for TTS/CV models.
**Solution**: Use offloading or accept single-component operation.
```python
# Option 1: Offload the LLM to CPU between requests
# Option 2: Run TTS/CV in a separate process
# Option 3: Use the smaller model (Mistral 7B) when GPU-constrained
```
### 5. **Async Context Issues**
**Problem**: Storing references to coroutines without awaiting them.
**Solution**: Always create tasks explicitly:
```python
# Bad
coro = generate_response(text)  # Dangling coroutine, never runs

# Good
task = asyncio.create_task(generate_response(text))
response = await task
```
### 6. **Personality Inconsistency**
**Problem**: The LLM generates different responses to the same prompt because of sampling randomness.
**Solution**: Manage temperature (and, optionally, the sampling seed) per context.
```python
# Serious conversation context → lower temperature (0.5)
# Creative/chaotic moments → higher temperature (0.9)
temperature = 0.5 if in_serious_context else 0.9
```

---
## Recommended Deployment Configuration

```yaml
# Local Development (Hex primary environment)
gpu: RTX 3060 Ti+ (12GB VRAM)
llm: Llama 3.1 8B (4-bit via Ollama)
tts: Kokoro 82M
stt: faster-whisper large-v3
avatar: VRoid + VSeeFace
database: SQLite + ChromaDB (embedded)
inference_latency: 3-10 seconds per response
cost: $0/month (open-source stack)

# Optional: Production Scaling
gpu_cluster: vLLM on multi-GPU for concurrency
database: Qdrant (cloud) + PostgreSQL for history
inference_latency: <2 seconds (batching + optimization)
cost: ~$200-500/month cloud compute
```

---
## Confidence Levels & 2026 Readiness

| Component | Recommendation | Confidence | 2026 Status |
|---|---|---|---|
| Discord.py 2.6.4+ | PRIMARY | HIGH | Stable, actively maintained |
| Llama 3.1 8B | PRIMARY | HIGH | Proven, production-ready |
| Mistral 7B | SECONDARY | HIGH | Fast-path fallback, stable |
| Ollama | PRIMARY | MEDIUM | Mature but rapidly evolving |
| vLLM | ALTERNATIVE | MEDIUM | High-performance alternative, v0.3+ recommended |
| Whisper Large V3 + faster-whisper | PRIMARY | HIGH | Gold standard for multilingual STT |
| Kokoro TTS | PRIMARY | MEDIUM | Emerging, high quality for its size |
| XTTS-v2 | SPECIAL MOMENTS | HIGH | Voice cloning works well |
| VRoid + VSeeFace | PRIMARY | MEDIUM | Workaround viable, not a native integration |
| ChromaDB | DEVELOPMENT | MEDIUM | Good for prototyping; evaluate Qdrant before 100k vectors |
| Qdrant | PRODUCTION | HIGH | Enterprise vector DB, proven at scale |
| OpenCV 4.10+ | PRIMARY | HIGH | Stable, mature ecosystem |
| DeepFace emotion detection | PRIMARY | HIGH | Industry standard, 90%+ accuracy |
| Python asyncio patterns | PRIMARY | HIGH | Well supported on Python 3.11+ |

**Confidence Interpretation**:
- **HIGH**: Production-ready, API stable, no major changes expected in 2026
- **MEDIUM**: Solid choice but a newer ecosystem (1-2 years old); evaluate alternatives annually
- **LOW**: Emerging or unstable; prototype only

---
## Installation Checklist (Get Started)

```bash
# Discord
pip install "discord.py>=2.6.4"

# LLM & inference
pip install ollama torch transformers bitsandbytes

# TTS/STT
pip install faster-whisper
pip install sentence-transformers torch

# Vector DB
pip install chromadb

# Vision
pip install opencv-python deepface librosa

# Async utilities
pip install httpx aiofiles

# Database
pip install aiosqlite

# Start services
ollama serve &
# (Models are loaded on first use)

# Test the basic chain
python test_stack.py
```
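The checklist ends with `python test_stack.py`, which this document doesn't define; a minimal smoke-test sketch under the stack's assumptions (Ollama on port 11434, the MiniLM embedder, embedded ChromaDB) might look like:

```python
"""test_stack.py — minimal smoke test for the Hex stack (illustrative)."""
import asyncio

import chromadb
import httpx
from sentence_transformers import SentenceTransformer

async def main() -> None:
    # 1. LLM via Ollama
    async with httpx.AsyncClient(timeout=120.0) as client:
        r = await client.post(
            "http://localhost:11434/api/generate",
            json={"model": "llama3.1", "prompt": "Say hi as Hex.", "stream": False},
        )
        print("LLM OK:", r.json()["response"][:80])

    # 2. Embeddings
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    vec = embedder.encode("smoke test").tolist()
    print("Embeddings OK:", len(vec), "dims")

    # 3. Vector DB
    collection = chromadb.EphemeralClient().get_or_create_collection("smoke")
    collection.add(ids=["t1"], documents=["smoke test"], embeddings=[vec])
    print("ChromaDB OK:", collection.count(), "vector stored")

if __name__ == "__main__":
    asyncio.run(main())
```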
---
## Next Steps (For Roadmap)

1. **Phase 1**: Discord.py + Ollama + basic LLM integration (1 week)
2. **Phase 2**: STT pipeline (Whisper) + TTS (Kokoro) (1 week)
3. **Phase 3**: Memory system (SQLite + ChromaDB) (1 week)
4. **Phase 4**: Personality framework + system prompts (1 week)
5. **Phase 5**: Webcam emotion detection + context integration (1 week)
6. **Phase 6**: VRoid avatar + screen share integration (1 week)
7. **Phase 7**: Self-modification capability + safety guards (2 weeks)

**Total**: ~8 weeks to a full-featured Hex prototype.

---
## References & Research Sources

### Discord Integration
- [Discord.py Documentation](https://discordpy.readthedocs.io/en/stable/index.html)
- [Discord.py Tasks Extension (async patterns)](https://discordpy.readthedocs.io/en/stable/ext/tasks/index.html)
- [Discord.py on GitHub](https://github.com/Rapptz/discord.py)

### Local LLMs
- [Llama 3.1 vs Mistral Comparison](https://kanerika.com/blogs/mistral-vs-llama-3/)
- [Llama.com Quantization Guide](https://www.llama.com/docs/how-to-guides/quantization/)
- [Ollama vs vLLM Deep Dive](https://developers.redhat.com/articles/2025/08/08/ollama-vs-vllm-deep-dive-performance-benchmarking)
- [Local LLM Hosting 2026 Guide](https://www.glukhov.org/post/2025/11/hosting-llms-ollama-localai-jan-lmstudio-vllm-comparison/)

### TTS/STT
- [Whisper Large V3 2026 Benchmarks](https://northflank.com/blog/best-open-source-speech-to-text-stt-model-in-2026-benchmarks/)
- [faster-whisper on GitHub](https://github.com/SYSTRAN/faster-whisper)
- [Best Open Source TTS 2026](https://northflank.com/blog/best-open-source-text-to-speech-models-and-how-to-run-them)
- [Whisper Streaming for Real-Time](https://github.com/ufal/whisper_streaming)

### Computer Vision
- [Real-Time Facial Emotion Recognition with OpenCV](https://learnopencv.com/facial-emotion-recognition/)
- [DeepFace for Emotion Detection](https://github.com/serengil/deepface)

### Vector Databases
- [Vector Database Comparison 2026](https://www.datacamp.com/blog/the-top-5-vector-databases)
- [ChromaDB vs Pinecone Analysis](https://www.myscale.com/blog/choosing-best-vector-database-for-your-project/)
- [Chroma Documentation](https://docs.trychroma.com/)

### Python Async
- [Python Asyncio for LLM Concurrency](https://www.newline.co/@zaoyang/python-asyncio-for-llm-concurrency-best-practices--bc079176)
- [Asyncio Best Practices 2025](https://sparkco.ai/blog/mastering-async-best-practices-for-2025/)
- [FastAPI with Asyncio](https://www.nucamp.co/blog/coding-bootcamp-backend-with-python-2025-python-in-the-backend-in-2025-leveraging-asyncio-and-fastapi-for-highperformance-systems)

### VRoid & Avatars
- [VRoid Studio Official](https://vroid.com/en/studio)
- [VRoid Hub API](https://vroid.pixiv.help/hc/en-us/articles/21569104969241-The-VRoid-Hub-API-is-now-live)
- [VSeeFace](https://www.vseeface.icu/)
---

**Document Version**: 1.0
**Last Updated**: January 2026
**Hex Stack Status**: Ready for implementation
**Estimated Implementation Time**: 8-12 weeks (to a full personality bot)