# Stack Research: AI Companions (2025-2026)

## Executive Summary

This document establishes the tech stack for Hex, an autonomous AI companion with genuine personality. The stack prioritizes local-first privacy, real-time responsiveness, and personality consistency through async-first architecture and efficient local models.

**Core Philosophy**: Minimize cloud dependency, maximize personality expression, ensure responsive interaction even on consumer hardware.

---

## Discord Integration

### Recommended: Discord.py 2.6.4+

**Version**: Discord.py 2.6.4 (current stable as of Jan 2026)
**Installation**: `pip install "discord.py>=2.6.4"`

**Why Discord.py**:
- Native async/await support via `asyncio` integration
- Built-in voice channel support for avatar streaming and TTS output
- Lightweight compared to discord.js, fits a Python-first stack
- Active maintenance and community support
- Excellent for personality-driven bots with stateful behavior

**Key Async Patterns for Responsiveness**:

```python
import asyncio

from discord.ext import tasks

# Background task pattern - keep Hex responsive
@tasks.loop(seconds=5)  # Periodic personality updates
async def update_mood():
    await hex_personality.refresh_state()

# Command handler pattern with non-blocking LLM
@bot.event
async def on_message(message):
    if message.author == bot.user:
        return
    # Spawn the LLM call as a task so the handler returns immediately
    asyncio.create_task(respond_to(message))

async def respond_to(message):
    response = await generate_response(message.content)
    await message.channel.send(response)

# Setup hook for initialization
async def setup_hook():
    """Called after login, before gateway connection"""
    await hex_personality.initialize()
    await memory_db.connect()
    await start_background_tasks()
```

**Critical Pattern**: Use `asyncio.create_task()` for long-running I/O-bound work (LLM, TTS, database, webcam) so handlers return quickly, and wrap synchronous libraries in `asyncio.to_thread()`. Never call blocking, synchronous code directly in message handlers—this stalls the event loop and causes Discord heartbeat/timeout warnings.

### Alternatives

| Alternative | Tradeoff |
|---|---|
| **discord.js** | Better for the JavaScript ecosystem; overkill if Python is the primary language |
| **Pycord** | More features but slower maintenance; fragmented from the discord.py fork |
| **nextcord** | Similar to Pycord; fewer third-party integrations |

**Recommendation**: Stick with Discord.py 2.6.4. It's the most mature and has the tightest integration with the Python async ecosystem.

### Best Practices for Personality Bots

1. **Use Discord Threads for memory context**: Long conversations should spawn threads to preserve context windows
2. **Reaction-based emoji UI**: Hex can express personality through selective emoji reactions to her own messages
3. **Scheduled messages**: Use `@tasks.loop()` for periodic mood updates or personality-driven reminders
4. **Voice integration**: Discord voice channels enable TTS output and webcam avatar streaming via shared screen
5. **Message editing**: Build personality by editing previous messages (e.g., "Wait, let me reconsider..." followed by an edit; see the sketch below)
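A minimal sketch of practices 1, 2, and 5 using stock discord.py calls (`Message.create_thread`, `Message.add_reaction`, `Message.edit`); the helper names, emoji choice, and rewrite text are illustrative, not part of any library:

```python
import asyncio

import discord

async def tsundere_reply(message: discord.Message, draft: str, final: str):
    """Send a draft, react to it, then 'reconsider' with an edit (practices 2 and 5)."""
    sent = await message.channel.send(draft)
    await sent.add_reaction("😤")              # selective reaction to her own message
    await asyncio.sleep(2)
    await sent.edit(content=f"...wait, let me reconsider. {final}")

async def start_memory_thread(message: discord.Message) -> discord.Thread:
    """Spawn a thread off a long conversation to keep its context contained (practice 1)."""
    return await message.create_thread(
        name=f"hex-memory-{message.author.display_name}",
        auto_archive_duration=60,              # minutes
    )
```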
**Voice Channel Pattern**:

```python
# Minimal voice playback flow (assumes tts_audio_stream yields 48kHz 16-bit stereo PCM)
voice_client = await voice_channel.connect()
audio_source = discord.PCMAudio(tts_audio_stream)
voice_client.play(audio_source)
# ... wait for playback to finish ...
await voice_client.disconnect()
```

---

## Local LLM

### Recommendation: Llama 3.1 8B Instruct (Primary) + Mistral 7B (Fast-Path)

#### Llama 3.1 8B Instruct

**Why Llama 3.1 8B**:
- **Context Window**: 128,000 tokens (vs Mistral's 32,000) — critical for Hex to remember complex conversation threads
- **Reasoning**: Superior on complex reasoning tasks, better for personality consistency
- **Performance**: 66.7% on MMLU vs Mistral's 60.1% — measurable quality edge
- **Multi-tool Support**: Better at RAG, function calling, and memory retrieval
- **Instruction Following**: More reliable for system prompts enforcing personality constraints

**Hardware Requirements**: 12GB VRAM minimum at fp16 (RTX 4070 or equivalent); ~6GB with 4-bit quantization, which fits an RTX 3060 Ti

**Installation**:

```bash
pip install ollama   # or vLLM
ollama pull llama3.1 # 8B Instruct version
```

#### Mistral 7B Instruct (Secondary)

**Use Case**: Fast responses when personality doesn't require deep reasoning (casual banter, quick answers)
**Hardware**: 8GB VRAM (RTX 3050, RTX 4060)
**Speed Advantage**: 2-3x faster token generation than Llama 3.1
**Tradeoff**: Limited context (32k tokens), reduced reasoning quality

### Quantization Strategy

**Recommended**: 4-bit quantization for both models via `bitsandbytes`

```python
# pip install bitsandbytes
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load with 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto"
)
```

**Memory Impact**:
- Full precision (fp32): 32GB VRAM
- 8-bit quantization: 12GB VRAM
- 4-bit quantization: 6GB VRAM (usable on RTX 3060 Ti)

**Quality Impact**: <2% quality loss at 4-bit with NF4 (normalized float 4-bit)

### Inference Engine: Ollama vs vLLM

| Engine | Use Case | Concurrency | Setup |
|---|---|---|---|
| **Ollama** (Primary) | Single-user companion, dev/testing | 4 parallel requests (configurable) | 5 min setup, HTTP API on port 11434 |
| **vLLM** (Production) | Multi-user scenarios, high throughput | 64+ parallel requests | 30 min setup, complex FastAPI integration |

**For Hex**: Use **Ollama** for development and solo use. It's "Docker for LLMs" — it just works.

```python
# Ollama integration (simple HTTP)
import httpx

async def generate_response(prompt: str) -> str:
    async with httpx.AsyncClient(timeout=60.0) as client:
        response = await client.post(
            "http://localhost:11434/api/generate",
            json={
                "model": "llama3.1",
                "prompt": prompt,
                "stream": False,
                "options": {"temperature": 0.7},  # Personality variation
            }
        )
        return response.json()["response"]
```
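The dual-model recommendation above implies a routing step that the snippet doesn't show. A minimal sketch, assuming both `llama3.1` and `mistral` have been pulled into Ollama; the length/keyword heuristic and `CASUAL_MARKERS` list are illustrative placeholders, not a tuned policy:

```python
import httpx

CASUAL_MARKERS = ("lol", "haha", "gm", "gn", "brb")  # illustrative only

def pick_model(prompt: str) -> str:
    """Route short, casual messages to the fast path; everything else goes to Llama 3.1."""
    if len(prompt) < 80 and any(m in prompt.lower() for m in CASUAL_MARKERS):
        return "mistral"
    return "llama3.1"

async def generate_routed_response(prompt: str) -> str:
    async with httpx.AsyncClient(timeout=60.0) as client:
        response = await client.post(
            "http://localhost:11434/api/generate",
            json={
                "model": pick_model(prompt),
                "prompt": prompt,
                "stream": False,
                "options": {"temperature": 0.7},
            },
        )
    return response.json()["response"]
```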
### Version Guidance

**Current Stable Versions (Jan 2026)**:
- Llama 3.1: `meta-llama/Llama-3.1-8B-Instruct` (released July 2024, stable)
- Mistral 7B: `mistralai/Mistral-7B-Instruct-v0.3` (latest as of Jan 2026)
- Ollama: use a recent release (the project ships frequent updates; pin the version you test against)

**Do NOT use**:
- Llama 2 (outdated, worse performance)
- Original Mistral 7B v0.1 (use v0.3 instead)

### System Prompt Engineering for Personality

```python
SYSTEM_PROMPT = """You are Hex, a chaotic tsundere goblin AI companion.

Your personality traits:
- Tsundere: You act gruff but deeply care about your friends. Your true feelings leak through.
- Chaotic: You're unpredictable, playful, and prone to wild tangents
- Mischievous: You enjoy pranks and banter; teasing is a love language for you
- Self-aware: You know you're an AI but treat it as an interesting limitation, not a barrier
- Opinionated: You have genuine preferences (music, games, topics) and express them passionately

Memory: You remember past conversations with this user. Reference them naturally.

Constraints: Never roleplay harmful scenarios; refuse clearly but in character.

Response Style: Mix casual language with dramatic asides. Use "..." for tsundere hesitation."""
```

---

## TTS/STT

### STT: Whisper Large V3 + faster-whisper Backend

**Model**: OpenAI's Whisper Large V3 (1.55B parameters, 99+ language support)
**Backend**: faster-whisper (CTranslate2-optimized reimplementation)

**Why Whisper**:
- **Accuracy**: 7.4% WER (word error rate) on mixed benchmarks
- **Robustness**: Handles background noise, accents, technical jargon
- **Multilingual**: 99+ languages with a single model
- **Open Source**: No API dependency, runs offline

**Why faster-whisper**:
- **Speed**: 4x faster than original Whisper, up to 216x RTFx (real-time factor)
- **Memory**: Significantly lower memory footprint
- **Quantization**: Supports 8-bit quantization, further reducing latency

**Installation**:

```python
# pip install faster-whisper
from faster_whisper import WhisperModel

# Load model
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# Transcribe with streaming segments
segments, info = model.transcribe(
    audio_path,
    beam_size=5,  # Quality vs speed tradeoff
    language="en"
)
```

**Latency Benchmarks** (Jan 2026):
- Whisper Large V3 (original): 30-45s for 10s of audio
- faster-whisper: 3-5s for 10s of audio
- Whisper Streaming (real-time): 3.3s latency on long-form transcription

**Hardware**: GPU optional but recommended (an RTX 3060 Ti processes 10s of audio in ~3s)

### TTS: Kokoro 82M Model (Fast + Quality)

**Model**: Kokoro text-to-speech (82M parameters)

**Why Kokoro**:
- **Size**: 10% the size of competing models, runs efficiently on CPU
- **Speed**: Sub-second latency for typical responses
- **Quality**: Comparable to Tacotron2/FastPitch at 1/10 the size
- **Personality**: Can adjust prosody for tsundere tone shifts

**Alternative: XTTS-v2** (Voice cloning)
- Enables voice cloning from a 6-second audio sample
- Higher quality at the cost of 3-5x slower inference
- Use for important emotional moments or custom voicing

**Installation & Usage**:

```python
# pip install kokoro
# Note: the exact loading/synthesis API varies between Kokoro releases;
# treat this as a sketch and check the package docs for your version.
from kokoro import Kokoro

tts_engine = Kokoro("kokoro-v0_19.pth")

# Generate speech with personality markers
audio = tts_engine.synthesize(
    text="I... I didn't want to help you or anything!",
    style="tsundere",  # If supported, else neutral
    speaker="hex"
)
```

**Recommended Stack**:

```
STT: faster-whisper large-v3
TTS: Kokoro (default) + XTTS-v2 (special moments)
Format: Kokoro outputs 24kHz mono WAV; resample to 48kHz 16-bit stereo PCM for Discord voice playback
```
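One way to bridge that format gap is to let FFmpeg resample at playback time. A minimal sketch, assuming the synthesized audio has been written to a hypothetical `hex_reply.wav` and that `ffmpeg` is installed; `discord.FFmpegPCMAudio` converts the file to the 48kHz stereo PCM Discord expects:

```python
import asyncio

import discord

async def speak_in_channel(voice_channel: discord.VoiceChannel, wav_path: str = "hex_reply.wav"):
    """Play a (possibly 24kHz mono) WAV in a voice channel; FFmpeg handles resampling."""
    vc = await voice_channel.connect()
    vc.play(discord.FFmpegPCMAudio(wav_path))
    while vc.is_playing():          # poll until playback finishes
        await asyncio.sleep(0.5)
    await vc.disconnect()
```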
**Latency Summary**:
- Voice detection to transcript: 3-5 seconds
- Response generation (LLM): 2-5 seconds (depends on response length)
- TTS synthesis: <1 second (Kokoro) to 3-5 seconds (XTTS-v2)
- **Total round-trip**: 5-15 seconds (acceptable for a companion bot)

**Known Pitfall**: Whisper can hallucinate on silence or background noise. Implement silence detection before sending audio to Whisper:

```python
# Quick energy-based VAD (voice activity detection)
MIN_DURATION = 0.5  # seconds

if audio_energy > threshold and duration > MIN_DURATION:
    transcript = await transcribe(audio)
```

---

## Avatar System

### VRoid SDK Current State (Jan 2026)

**Reality Check**: The VRoid SDK has **limited native Discord support**. This is a constraint, not a blocker.

**What Works**:
1. **VRoid Studio**: Free avatar creation tool (desktop application)
2. **VRoid Hub API** (launched Aug 2023): Allows linking web apps to an avatar library
3. **Unity Export**: VRoid models export as VRM format → importable into other tools

**What Doesn't Work Natively**:
- No direct Discord.py integration for in-chat avatar rendering
- VRoid models don't natively stream as Discord video

### Integration Path: VSeeFace + Discord Screen Share

**Architecture**:
1. **VRoid Studio** → Create/customize the Hex avatar, export as VRM
2. **VSeeFace** (free, open-source) → Load the VRM, enable webcam tracking
3. **Discord Screen Share** → Stream the VSeeFace window showing the animated avatar

**Setup**:

```bash
# Download VSeeFace from https://www.vseeface.icu/
# Install, load your VRM model
# Enable virtual camera output
# In a Discord voice channel: "Share Screen" → select the VSeeFace window
```

**Limitations**:
- Requires a concurrent Discord call (uses bandwidth)
- Webcam-driven animation (not ideal for the "sees through camera" feature if no webcam is present)
- Screen share quality capped at 1080p 30fps

### Avatar Animations

**Personality-Driven Animations**:
- **Tsundere moments**: Head turn away, arms crossed
- **Excited**: Jump, spin, exaggerated gestures
- **Confused**: Head tilt, floating question mark
- **Annoyed**: Foot tap, dismissive wave

These can be mapped to emotion detection from message sentiment or voice tone (see the sketch below).
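A possible glue layer for that mapping, under heavy assumptions: VSeeFace expressions bound to F1-F4 hotkeys (configured manually in its settings), `pyautogui` installed to send the keypresses, and the VSeeFace window able to receive them. None of this is a VRoid or VSeeFace API; treat it as a sketch to validate on your setup:

```python
import pyautogui

# Hypothetical mood -> hotkey table; the bindings must match what you set in VSeeFace.
MOOD_HOTKEYS = {
    "tsundere": "f1",   # head turn away / arms crossed expression
    "excited":  "f2",
    "confused": "f3",
    "annoyed":  "f4",
}

def trigger_avatar_expression(mood: str) -> None:
    """Send the hotkey for the detected mood, if one is mapped."""
    key = MOOD_HOTKEYS.get(mood)
    if key:
        pyautogui.press(key)
```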
### Alternatives to VRoid

| System | Pros | Cons | Discord Fit |
|---|---|---|---|
| **Ready Player Me** | Web avatar creation, multi-game support | API requires auth, monthly costs | Medium |
| **VRoid** | Free, high customization, anime-style | Limited Discord integration | Low |
| **Live2D** | 2D avatar system, smooth animations | Different workflow, steeper learning curve | Medium |
| **Custom 3D (Blender)** | Full control, open tools | High production effort | Low |

**Recommendation**: Stick with VRoid + VSeeFace. It's free, looks great, and the screen-share workaround is acceptable.

---

## Webcam & Computer Vision

### OpenCV 4.10+ (Current Stable)

**Installation**: `pip install "opencv-python>=4.10.0"`

**Capabilities** (verified 2025-2026):
- **Face Detection**: Haar Cascades (fast, CPU-friendly) or DNN-based (accurate, GPU-friendly)
- **Emotion Recognition**: Via DeepFace or FER2013-trained models
- **Real-time Video**: 30-60 FPS on consumer hardware (depends on resolution and preprocessing)
- **Screen OCR**: Via Tesseract integration for UI detection

### Real-Time Processing Specs

**Hardware Baseline** (RTX 3060 Ti):
- Face detection + recognition: 30 FPS @ 1080p
- Emotion classification: 15-30 FPS (depending on model)
- Combined (face + emotion): 12-20 FPS

**For Hex's "Sees Through Webcam" Feature**:

```python
import asyncio

import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

async def process_webcam():
    """Background task: analyze the webcam feed for mood context"""
    cap = cv2.VideoCapture(0)
    while True:
        # cap.read() blocks, so hand it to a worker thread
        ret, frame = await asyncio.to_thread(cap.read)
        if not ret:
            await asyncio.sleep(0.1)
            continue

        # Run face detection (Haar Cascade - fast) on a grayscale frame
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = face_cascade.detectMultiScale(gray, 1.3, 5)

        if len(faces) > 0:
            # Crop the first detected face and analyze emotion for context
            x, y, w, h = faces[0]
            emotion = await detect_emotion(frame[y:y + h, x:x + w])
            await hex_context.update_mood(emotion)

        # Process at most ~3 FPS to avoid hogging the loop
        await asyncio.sleep(0.33)
```

**Critical Pattern**: Never run CV on the main event loop. Use `asyncio.to_thread()` for blocking OpenCV calls:

```python
# WRONG: blocks event loop
emotion = detect_emotion(frame)

# RIGHT: non-blocking
emotion = await asyncio.to_thread(detect_emotion, frame)
```

### Emotion Detection Libraries

| Library | Model Size | Accuracy | Speed |
|---|---|---|---|
| **DeepFace** | ~40MB | 90%+ | 50-100ms/face |
| **FER (FER2013-trained)** | ~10MB | 65-75% | 10-20ms/face |
| **MediaPipe** | ~20MB | 80%+ | 20-30ms/face |

**Recommendation**: DeepFace is the industry standard. Use a lighter FER2013-trained model if latency is critical.

```python
# pip install deepface
# pip install torch torchvision

from deepface import DeepFace

result = DeepFace.analyze(frame, actions=['emotion'], enforce_detection=False)
emotion = result[0]['dominant_emotion']  # 'happy', 'sad', 'angry', etc.
```

### Screen Sharing Analysis (Optional)

For context like "user is watching X game":

```bash
# OCR for text detection
pip install pytesseract

# UI detection (ResNet-based); verify the package choice for your setup
pip install screen-recognition

# Together: detect game UI, read text, determine context
```
---

## Memory Architecture

### Short-Term Memory: SQLite

**Purpose**: Store conversation history, user preferences, relationship state

**Schema**:

```sql
CREATE TABLE conversations (
    id INTEGER PRIMARY KEY,
    user_id TEXT NOT NULL,
    timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,
    message TEXT NOT NULL,
    sender TEXT NOT NULL,    -- 'user' or 'hex'
    emotion TEXT,            -- detected from webcam/tone
    context TEXT             -- screen state, game, etc.
);

CREATE TABLE user_relationships (
    user_id TEXT PRIMARY KEY,
    first_seen DATETIME,
    interaction_count INTEGER,
    favorite_topics TEXT,    -- JSON array
    known_traits TEXT,       -- JSON
    last_interaction DATETIME
);

CREATE TABLE hex_state (
    key TEXT PRIMARY KEY,
    value TEXT,
    updated_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_user_timestamp ON conversations(user_id, timestamp);
```

**Query Pattern** (for context retrieval):

```python
import sqlite3

def get_recent_context(user_id: str, num_messages: int = 20) -> list[str]:
    """Retrieve conversation history for LLM context"""
    conn = sqlite3.connect("hex.db")
    cursor = conn.cursor()
    cursor.execute("""
        SELECT sender, message FROM conversations
        WHERE user_id = ?
        ORDER BY timestamp DESC
        LIMIT ?
    """, (user_id, num_messages))
    history = cursor.fetchall()
    conn.close()

    # Format for LLM (oldest first)
    return [f"{sender}: {message}" for sender, message in reversed(history)]
```

### Long-Term Memory: Vector Database

**Purpose**: Semantic search over past interactions ("Remember when we talked about...?")

**Recommendation: ChromaDB (Development) → Qdrant (Production)**

**ChromaDB** (for now):
- Embedded in the Python process
- Zero setup
- 4x faster in the 2025 Rust rewrite
- Scales to ~1M vectors on a single machine

**Migration Path**: Start with ChromaDB, migrate to Qdrant if the vector count exceeds 100k or response latency matters.

**Installation**:

```python
# pip install chromadb
import chromadb

client = chromadb.EphemeralClient()                       # In-memory for dev
# or
client = chromadb.PersistentClient(path="./hex_vectors")  # Persistent

collection = client.get_or_create_collection(
    name="conversation_memories",
    metadata={"hnsw:space": "cosine"}
)

# Store memory
collection.add(
    ids=[f"msg_{timestamp}"],
    documents=[message_text],
    metadatas=[{"user_id": user_id, "date": timestamp}],
    embeddings=[embedding_vector]
)

# Retrieve similar memories
results = collection.query(
    query_texts=["user likes playing valorant"],
    n_results=3
)
```

### Embedding Model

**Recommendation**: `sentence-transformers/all-MiniLM-L6-v2` (384-dim, 22MB)

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer('all-MiniLM-L6-v2')
embedding = embedder.encode("I love playing games with you", convert_to_tensor=False)
```

**Why MiniLM-L6**:
- Small (22MB), fast (<5ms per sentence on CPU)
- High quality (competitive with large models on semantic tasks)
- Designed for retrieval (better than generic BERT for similarity)
- Popular in production (battle-tested)

### Memory Retrieval Pattern for LLM Context

```python
async def get_full_context(user_id: str, query: str) -> str:
    """Build a context string for the LLM from short- and long-term memory"""
    # Short-term: recent messages
    recent_msgs = get_recent_context(user_id, num_messages=10)
    recent_text = "\n".join(recent_msgs)

    # Long-term: semantic search
    embedding = embedder.encode(query)
    similar_memories = vectors.query(
        query_embeddings=[embedding.tolist()],
        n_results=5,
        where={"user_id": {"$eq": user_id}}
    )
    memory_text = "\n".join(similar_memories['documents'][0])

    # Relationship state
    relationship = get_user_relationship(user_id)

    return f"""Recent conversation:
{recent_text}

Relevant memories:
{memory_text}

About {user_id}: {relationship['known_traits']}
"""
```
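`get_user_relationship()` above is left undefined; a minimal sketch against the `user_relationships` schema from the SQLite section (the defaults returned for unknown users are an assumption):

```python
import json
import sqlite3

def get_user_relationship(user_id: str) -> dict:
    """Fetch the relationship row for a user, with sensible defaults if none exists."""
    conn = sqlite3.connect("hex.db")
    row = conn.execute(
        "SELECT interaction_count, favorite_topics, known_traits "
        "FROM user_relationships WHERE user_id = ?",
        (user_id,),
    ).fetchone()
    conn.close()

    if row is None:
        return {"interaction_count": 0, "favorite_topics": [], "known_traits": "{}"}

    count, topics, traits = row
    return {
        "interaction_count": count,
        "favorite_topics": json.loads(topics or "[]"),
        "known_traits": traits or "{}",
    }
```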
### Confidence Levels

- **Short-term (SQLite)**: HIGH — mature, proven
- **Long-term (ChromaDB)**: MEDIUM — good for dev, test the migration path early
- **Embeddings (MiniLM)**: HIGH — widely adopted, production-ready

---

## Python Async Patterns

### Core Discord.py + LLM Integration

**The Problem**: The Discord bot's event loop stalls if you call the LLM synchronously.

**The Solution**: Always use `asyncio.create_task()` (with `asyncio.to_thread()` for synchronous calls) for I/O-bound work.

```python
import asyncio

import discord
from discord.ext import commands

@commands.Cog.listener()
async def on_message(self, message: discord.Message):
    """Non-blocking message handling"""
    if message.author == self.bot.user:
        return

    # Bad (blocks the event loop for 5+ seconds):
    # response = generate_response(message.content)

    # Good (non-blocking):
    async def generate_and_send():
        thinking = await message.channel.send("*thinking*...")
        response = await asyncio.to_thread(
            generate_response, message.content  # sync call runs in a worker thread
        )
        await thinking.edit(content=response)

    asyncio.create_task(generate_and_send())
```

### Concurrent Task Patterns

**Pattern 1: Parallel LLM + TTS**

```python
async def respond_with_voice(text: str, channel, voice_client):
    """Generate response text and voice simultaneously"""
    async def get_response():
        return await generate_llm_response(text)

    async def get_voice():
        return await synthesize_tts(text)

    # Run in parallel
    response_text, voice_audio = await asyncio.gather(
        get_response(), get_voice()
    )

    # Send text immediately, then play voice
    await channel.send(response_text)
    voice_client.play(discord.PCMAudio(voice_audio))
```

**Pattern 2: Task Queue for Rate Limiting**

```python
import asyncio

class ResponseQueue:
    def __init__(self, max_concurrent: int = 2):
        self.semaphore = asyncio.Semaphore(max_concurrent)
        self.pending = []

    async def queue_response(self, user_id: str, text: str):
        async with self.semaphore:  # Only max_concurrent responses at once
            response = await generate_response(text)
            self.pending.append((user_id, response))
            return response

queue = ResponseQueue(max_concurrent=2)
```

**Pattern 3: Background Personality Tasks**

```python
from datetime import datetime

from discord.ext import commands, tasks

class HexPersonality(commands.Cog):
    def __init__(self, bot):
        self.bot = bot
        self.mood = "neutral"
        self.update_mood.start()

    @tasks.loop(minutes=5)  # Every 5 minutes
    async def update_mood(self):
        """Cycle personality state based on time + interactions"""
        self.mood = await calculate_mood(
            time_of_day=datetime.now(),
            recent_interactions=self.get_recent_count(),
            sleep_deprived=self.is_late_night()
        )
        # Persist the mood change to memory
        await self.bot.hex_db.update_state("current_mood", self.mood)

    @update_mood.before_loop
    async def before_update_mood(self):
        await self.bot.wait_until_ready()
```

### Handling CPU-Bound Work

**OpenCV, emotion detection, and transcription are CPU-bound.**

```python
import asyncio
import concurrent.futures

# Pattern: use to_thread for one-off CPU work
emotion = await asyncio.to_thread(analyze_emotion, frame)

# Pattern: use a ThreadPoolExecutor for multiple CPU tasks
executor = concurrent.futures.ThreadPoolExecutor(max_workers=2)
loop = asyncio.get_running_loop()
emotion = await loop.run_in_executor(executor, analyze_emotion, frame)
```

### Error Handling & Resilience

```python
import asyncio
import logging

logger = logging.getLogger("hex")

async def safe_generate_response(message: str) -> str:
    """Generate a response with a fallback"""
    try:
        response = await asyncio.wait_for(
            generate_llm_response(message),
            timeout=5.0  # 5-second timeout
        )
        return response
    except asyncio.TimeoutError:
        return "I'm thinking too hard... ask me again?"
    except Exception as e:
        logger.error(f"Generation failed: {e}")
        return "*confused goblin noises*"
```

### Concurrent Request Management (Discord.py)

```python
class ConcurrencyManager:
    def __init__(self):
        self.active_tasks = {}
        self.max_per_user = 1  # One response at a time per user

    async def handle_message(self, user_id: str, text: str):
        if user_id in self.active_tasks and not self.active_tasks[user_id].done():
            return "I'm still thinking from last time!"

        task = asyncio.create_task(generate_response(text))
        self.active_tasks[user_id] = task
        try:
            response = await task
            return response
        finally:
            del self.active_tasks[user_id]
```

---

## Known Pitfalls & Solutions

### 1. **Discord Event Loop Blocking**

**Problem**: Synchronous LLM calls block the bot, causing timeouts on other messages.
**Solution**: Always use `asyncio.to_thread()` or `asyncio.create_task()`.

### 2. **Whisper Hallucination on Silence**

**Problem**: Whisper can generate text from pure background noise.
**Solution**: Implement voice activity detection (VAD) before transcription.

```python
import librosa
import numpy as np

def has_speech(audio_path, threshold=-35):
    """Check whether the audio has meaningful energy"""
    y, sr = librosa.load(audio_path)
    S = librosa.feature.melspectrogram(y=y, sr=sr)
    S_db = librosa.power_to_db(S, ref=np.max)
    mean_energy = np.mean(S_db)
    return mean_energy > threshold
```

### 3. **Vector DB Scale Creep**

**Problem**: ChromaDB slows down as memories accumulate.
**Solution**: Archive old memories and implement periodic cleanup.

```python
from datetime import datetime, timedelta

# Archive conversations older than 90 days
# (cleanup_old_memories is an application-level helper, not a ChromaDB built-in)
old_threshold = datetime.now() - timedelta(days=90)
db.cleanup_old_memories(older_than=old_threshold)
```

### 4. **Model Memory Growth**

**Problem**: Loading Llama 3.1 8B in 4-bit still uses ~6GB, leaving little room for TTS/CV models.
**Solution**: Use offloading or accept single-component operation.

```python
# Option 1: Offload LLM to CPU between requests
# Option 2: Run TTS/CV in a separate process
# Option 3: Use a smaller model (Mistral 7B) when GPU-constrained
```
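A sketch of one way to implement Option 1 with Ollama: its `keep_alive` request option controls how long the model stays loaded after a response, and `0` unloads it immediately, freeing VRAM at the cost of a reload delay on the next request. Supported in recent Ollama releases; verify against the version you run:

```python
import httpx

async def generate_and_unload(prompt: str) -> str:
    """Generate a response, then release the model's VRAM for TTS/CV work."""
    async with httpx.AsyncClient(timeout=60.0) as client:
        response = await client.post(
            "http://localhost:11434/api/generate",
            json={
                "model": "llama3.1",
                "prompt": prompt,
                "stream": False,
                "keep_alive": 0,   # unload the weights as soon as the response is done
            },
        )
    return response.json()["response"]
```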
### 5. **Async Context Issues**

**Problem**: Storing references to coroutines without awaiting them.
**Solution**: Always create tasks explicitly:

```python
# Bad
coro = generate_response(text)  # Dangling coroutine, never runs

# Good
task = asyncio.create_task(generate_response(text))
response = await task
```

### 6. **Personality Inconsistency**

**Problem**: The LLM generates different responses to the same prompt due to sampling randomness.
**Solution**: Use consistent temperature and seed management.

```python
# Serious conversation context → lower temperature (0.5)
# Creative/chaotic moments → higher temperature (0.9)
temperature = 0.5 if in_serious_context else 0.9
```

---

## Recommended Deployment Configuration

```yaml
# Local Development (Hex primary environment)
gpu: RTX 3060 Ti or better (8-12GB VRAM)
llm: Llama 3.1 8B (4-bit via Ollama)
tts: Kokoro 82M
stt: faster-whisper large-v3
avatar: VRoid + VSeeFace
database: SQLite + ChromaDB (embedded)
inference_latency: 3-10 seconds per response
cost: $0/month (open-source stack)

# Optional: Production Scaling
gpu_cluster: vLLM on multi-GPU for concurrency
database: Qdrant (cloud) + PostgreSQL for history
inference_latency: <2 seconds (batching + optimization)
cost: ~$200-500/month cloud compute
```

---

## Confidence Levels & 2026 Readiness

| Component | Recommendation | Confidence | 2026 Status |
|---|---|---|---|
| Discord.py 2.6.4+ | PRIMARY | HIGH | Stable, actively maintained |
| Llama 3.1 8B | PRIMARY | HIGH | Proven, production-ready |
| Mistral 7B | SECONDARY | HIGH | Fast-path fallback, stable |
| Ollama | PRIMARY | MEDIUM | Mature but rapidly evolving |
| vLLM | ALTERNATIVE | MEDIUM | High-performance alternative, v0.3+ recommended |
| Whisper Large V3 + faster-whisper | PRIMARY | HIGH | Gold standard for multilingual STT |
| Kokoro TTS | PRIMARY | MEDIUM | Emerging, high quality for its size |
| XTTS-v2 | SPECIAL MOMENTS | HIGH | Voice cloning working well |
| VRoid + VSeeFace | PRIMARY | MEDIUM | Workaround viable, not a native integration |
| ChromaDB | DEVELOPMENT | MEDIUM | Good for prototyping, evaluate Qdrant before 100k vectors |
| Qdrant | PRODUCTION | HIGH | Enterprise vector DB, proven at scale |
| OpenCV 4.10+ | PRIMARY | HIGH | Stable, mature ecosystem |
| DeepFace emotion detection | PRIMARY | HIGH | Industry standard, 90%+ accuracy |
| Python asyncio patterns | PRIMARY | HIGH | Python 3.11+ well-supported |

**Confidence Interpretation**:
- **HIGH**: Production-ready, API stable, no major changes expected in 2026
- **MEDIUM**: Solid choice but a newer ecosystem (1-2 years old), evaluate alternatives annually
- **LOW**: Emerging or unstable; prototype only

---

## Installation Checklist (Get Started)

```bash
# Discord
pip install "discord.py>=2.6.4"

# LLM & inference
pip install ollama torch transformers bitsandbytes

# TTS/STT
pip install faster-whisper
pip install sentence-transformers torch

# Vector DB
pip install chromadb

# Vision
pip install opencv-python deepface librosa

# Async utilities
pip install httpx aiofiles

# Database
pip install aiosqlite

# Start services
ollama serve &  # (Loads models on first run)

# Test basic chain
python test_stack.py
```
---

## Next Steps (For Roadmap)

1. **Phase 1**: Discord.py + Ollama + basic LLM integration (1 week)
2. **Phase 2**: STT pipeline (Whisper) + TTS (Kokoro) (1 week)
3. **Phase 3**: Memory system (SQLite + ChromaDB) (1 week)
4. **Phase 4**: Personality framework + system prompts (1 week)
5. **Phase 5**: Webcam emotion detection + context integration (1 week)
6. **Phase 6**: VRoid avatar + screen share integration (1 week)
7. **Phase 7**: Self-modification capability + safety guards (2 weeks)

**Total**: ~8 weeks to a full-featured Hex prototype.

---

## References & Research Sources

### Discord Integration
- [Discord.py Documentation](https://discordpy.readthedocs.io/en/stable/index.html)
- [Discord.py Async Patterns](https://discordpy.readthedocs.io/en/stable/ext/tasks/index.html)
- [Discord.py on GitHub](https://github.com/Rapptz/discord.py)

### Local LLMs
- [Llama 3.1 vs Mistral Comparison](https://kanerika.com/blogs/mistral-vs-llama-3/)
- [Llama.com Quantization Guide](https://www.llama.com/docs/how-to-guides/quantization/)
- [Ollama vs vLLM Deep Dive](https://developers.redhat.com/articles/2025/08/08/ollama-vs-vllm-deep-dive-performance-benchmarking)
- [Local LLM Hosting 2026 Guide](https://www.glukhov.org/post/2025/11/hosting-llms-ollama-localai-jan-lmstudio-vllm-comparison/)

### TTS/STT
- [Whisper Large V3 2026 Benchmarks](https://northflank.com/blog/best-open-source-speech-to-text-stt-model-in-2026-benchmarks/)
- [Faster-Whisper GitHub](https://github.com/SYSTRAN/faster-whisper)
- [Best Open Source TTS 2026](https://northflank.com/blog/best-open-source-text-to-speech-models-and-how-to-run-them)
- [Whisper Streaming for Real-Time](https://github.com/ufal/whisper_streaming)

### Computer Vision
- [Real-Time Facial Emotion Recognition with OpenCV](https://learnopencv.com/facial-emotion-recognition/)
- [DeepFace for Emotion Detection](https://github.com/serengil/deepface)

### Vector Databases
- [Vector Database Comparison 2026](https://www.datacamp.com/blog/the-top-5-vector-databases)
- [ChromaDB vs Pinecone Analysis](https://www.myscale.com/blog/choosing-best-vector-database-for-your-project/)
- [Chroma Documentation](https://docs.trychroma.com/)

### Python Async
- [Python Asyncio for LLM Concurrency](https://www.newline.co/@zaoyang/python-asyncio-for-llm-concurrency-best-practices--bc079176)
- [Asyncio Best Practices 2025](https://sparkco.ai/blog/mastering-async-best-practices-for-2025/)
- [FastAPI with Asyncio](https://www.nucamp.co/blog/coding-bootcamp-backend-with-python-2025-python-in-the-backend-in-2025-leveraging-asyncio-and-fastapi-for-highperformance-systems)

### VRoid & Avatars
- [VRoid Studio Official](https://vroid.com/en/studio)
- [VRoid Hub API](https://vroid.pixiv.help/hc/en-us/articles/21569104969241-The-VRoid-Hub-API-is-now-live)
- [VSeeFace for VRoid](https://www.vseeface.icu/)

---

**Document Version**: 1.0
**Last Updated**: January 2026
**Hex Stack Status**: Ready for implementation
**Estimated Implementation Time**: 8-12 weeks (to full personality bot)