docs: map existing codebase
- STACK.md - Technologies and dependencies
- ARCHITECTURE.md - System design and patterns
- STRUCTURE.md - Directory layout
- CONVENTIONS.md - Code style and patterns
- TESTING.md - Test structure
- INTEGRATIONS.md - External services
- CONCERNS.md - Technical debt and issues

.planning/codebase/ARCHITECTURE.md (new file, 177 lines)

# Architecture

**Analysis Date:** 2026-01-26

## Pattern Overview

**Overall:** Layered modular architecture with clear separation of concerns

**Key Characteristics:**
- Modular layer separation (Model Interface, Memory, Conversation, Interfaces, Safety, Core Personality)
- Local-first, offline-capable design with graceful degradation
- Plugin-like interface system allowing CLI and Discord without tight coupling
- Sandboxed execution environment for self-improvement code
- Bidirectional feedback loops between conversation, memory, and personality

## Layers

**Model Interface (Inference Layer):**
- Purpose: Abstract model inference operations and handle model switching
- Location: `src/models/`
- Contains: Model adapters, resource monitoring, context management
- Depends on: Local Ollama/LMStudio, system resource API
- Used by: Conversation engine, core Mai reasoning

**Memory System (Persistence Layer):**
- Purpose: Store and retrieve conversation history, patterns, learned behaviors
- Location: `src/memory/`
- Contains: SQLite operations, vector search, compression logic, pattern extraction
- Depends on: Local SQLite database, embeddings generation
- Used by: Conversation engine for context retrieval, personality learning

**Conversation Engine (Reasoning Layer):**
- Purpose: Orchestrate multi-turn conversations with context awareness
- Location: `src/conversation/`
- Contains: Turn handling, context window management, clarifying question logic, reasoning transparency
- Depends on: Model Interface, Memory System, Personality System
- Used by: Interface layers (CLI, Discord)

**Personality System (Behavior Layer):**
- Purpose: Enforce core values and enable personality adaptation
- Location: `src/personality/`
- Contains: Core personality rules, learned behavior layers, guardrails, values enforcement
- Depends on: Configuration files (YAML), Memory System for learned patterns
- Used by: Conversation Engine for decision making and refusal logic

**Safety & Execution Sandbox (Security Layer):**
- Purpose: Validate and execute generated code safely with risk assessment
- Location: `src/safety/`
- Contains: Risk analysis, Docker sandbox management, AST validation, audit logging
- Depends on: Docker runtime, code analysis libraries
- Used by: Self-improvement system for generated code execution

**Self-Improvement System (Autonomous Layer):**
- Purpose: Analyze own code, generate improvements, manage review and approval workflow
- Location: `src/selfmod/`
- Contains: Code analysis, improvement generation, review coordination, git integration
- Depends on: Safety layer, second-agent review API, git operations, code parser
- Used by: Core Mai autonomous operation

**Interface Adapters (Presentation Layer):**
- Purpose: Translate between external communication channels and core conversation engine
- Location: `src/interfaces/`
- Contains: CLI handler, Discord bot, message queuing, approval workflow
- Depends on: Conversation Engine, self-improvement system
- Used by: External communication channels (terminal, Discord)

## Data Flow

**Conversation Flow:**

1. User message arrives via interface (CLI or Discord)
2. Message queued if offline, held in memory if online
3. Interface adapter passes to Conversation Engine
4. Conversation Engine queries Memory System for relevant context
5. Context + message passed to Model Interface with system prompt (includes personality)
6. Model generates response
7. Response returned to Conversation Engine
8. Conversation Engine stores turn in Memory System
9. Response sent back through interface to user
10. Memory System may trigger asynchronous compression if history grows
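
A minimal sketch of steps 3-9, assuming hypothetical collaborator objects (the class and method names are illustrative, not the project's actual API):

```python
# Hypothetical sketch of the conversation turn pipeline (steps 3-9 above);
# the collaborators and their methods are assumptions, not the real API.
class ConversationEngine:
    def __init__(self, model, memory, personality):
        self.model = model              # Model Interface layer
        self.memory = memory            # Memory System layer
        self.personality = personality  # supplies the system prompt

    async def handle_turn(self, user_message: str) -> str:
        # Step 4: pull relevant context from the Memory System
        context = await self.memory.retrieve_context(user_message)
        # Step 5: combine system prompt (personality), context, and message
        prompt = f"{self.personality.system_prompt()}\n{context}\n{user_message}"
        # Steps 6-7: model generates a response
        response = await self.model.generate(prompt)
        # Step 8: persist the turn before returning it to the interface (step 9)
        await self.memory.store_turn(user_message, response)
        return response
```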

**Self-Improvement Flow:**

1. Self-Improvement System analyzes own code (triggered by timer or explicit request)
2. Generates potential improvements as Python code patches
3. Performs AST validation and basic static analysis
4. Submits for second-agent review with risk classification
5. If LOW risk: auto-approved, sent to Safety layer for execution
6. If MEDIUM risk: user approval required via CLI or Discord reactions
7. If HIGH/BLOCKED risk: blocked, logged, user notified
8. Approved changes executed in Docker sandbox with resource limits
9. Execution results captured, logged, committed to git with clear message
10. Breaking changes require explicit user approval before commit
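
A sketch of the risk gate in steps 5-7; the execution and approval hooks are injected callables here because the real Safety-layer and interface APIs are not defined yet:

```python
from enum import Enum
from typing import Awaitable, Callable

class Risk(Enum):
    LOW = "LOW"
    MEDIUM = "MEDIUM"
    HIGH = "HIGH"
    BLOCKED = "BLOCKED"

async def gate_change(
    change: str,
    risk: Risk,
    execute: Callable[[str], Awaitable[str]],    # Safety-layer sandbox hook
    ask_user: Callable[[str], Awaitable[bool]],  # CLI/Discord approval hook
) -> str | None:
    if risk is Risk.LOW:
        return await execute(change)             # step 5: auto-approved
    if risk is Risk.MEDIUM and await ask_user(change):
        return await execute(change)             # step 6: user approved
    # step 7: HIGH/BLOCKED (or declined MEDIUM) changes never execute
    print(f"blocked ({risk.value}): change logged and user notified")
    return None
```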

**State Management:**
- Conversation state: Maintained in Memory System as persisted history
- Model state: Loaded fresh per request, no state persistence between calls
- Personality state: Mix of code-enforced rules and learned behavior layers in Memory
- Resource state: Monitored continuously, triggering model downgrade if limits approached
- Approval state: Tracked in git commits, audit log, and in-memory queue

## Key Abstractions

**ModelAdapter:**
- Purpose: Abstract different local model providers (Ollama, LMStudio)
- Examples: `src/models/ollama_adapter.py`, `src/models/model_manager.py`
- Pattern: Strategy pattern with resource-aware selection logic
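
A minimal sketch of the strategy pattern with resource-aware selection; the interface and the VRAM-based check are assumptions, not the project's actual API:

```python
from abc import ABC, abstractmethod

# Illustrative adapter interface; attribute and method names are assumed.
class ModelAdapter(ABC):
    name: str
    min_vram_gb: float

    @abstractmethod
    async def generate(self, prompt: str) -> str: ...

class ModelManager:
    def __init__(self, adapters: list[ModelAdapter]):
        # Largest model first, so selection degrades gracefully under pressure
        self.adapters = sorted(adapters, key=lambda a: a.min_vram_gb, reverse=True)

    def select(self, free_vram_gb: float) -> ModelAdapter:
        for adapter in self.adapters:
            if adapter.min_vram_gb <= free_vram_gb:
                return adapter
        raise RuntimeError("no model fits available VRAM")
```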

**ContextWindow:**
- Purpose: Manage token budget and conversation history within model limits
- Examples: `src/conversation/context_manager.py`
- Pattern: Intelligent windowing with semantic importance weighting
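
A minimal sketch of budget-aware windowing; the greedy importance-first selection and the whitespace token estimate are simplifying assumptions:

```python
# Sketch of importance-weighted windowing; turn dicts are assumed to carry
# `text`, `importance`, and `index` keys.
def fit_window(turns: list[dict], budget: int) -> list[dict]:
    """Keep the most important turns that fit the token budget."""
    def estimate(turn: dict) -> int:
        return len(turn["text"].split())  # crude whitespace token estimate

    chosen, used = [], 0
    for turn in sorted(turns, key=lambda t: t["importance"], reverse=True):
        cost = estimate(turn)
        if used + cost <= budget:
            chosen.append(turn)
            used += cost
    return sorted(chosen, key=lambda t: t["index"])  # restore chronology
```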

**MemoryStore:**
- Purpose: Unified interface to conversation history, patterns, and learned behaviors
- Examples: `src/memory/store.py`, `src/memory/vector_search.py`
- Pattern: Repository pattern with multiple index types

**PersonalityRules:**
- Purpose: Encode Mai's core values as evaluable constraints
- Examples: `src/personality/core_rules.py`, `config/personality.yaml`
- Pattern: Rule engine with value-based decision making

**SandboxExecutor:**
- Purpose: Execute generated code safely with resource limits and audit trail
- Examples: `src/safety/executor.py`, `src/safety/risk_analyzer.py`
- Pattern: Facade wrapping Docker API with security checks
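
A sketch of such a facade using docker-py's documented `containers.run` parameters; the image name and limits are placeholders, and audit logging is elided:

```python
import docker  # docker-py (Docker SDK for Python)

class SandboxExecutor:
    """Illustrative facade over the Docker API; not the project's real class."""

    def __init__(self, image: str = "python:3.12-slim"):
        self.client = docker.from_env()
        self.image = image

    def run(self, code: str) -> str:
        # network_disabled plus resource caps are the core of the sandbox
        output = self.client.containers.run(
            self.image,
            ["python", "-c", code],
            mem_limit="256m",
            nano_cpus=500_000_000,  # half a CPU core
            network_disabled=True,
            remove=True,            # clean up the container afterwards
        )
        return output.decode()
```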

**ApprovalWorkflow:**
- Purpose: Coordinate user and agent approval for code changes
- Examples: `src/interfaces/approval_handler.py`, `src/selfmod/reviewer.py`
- Pattern: State machine with async notification coordination

## Entry Points

**CLI Entry:**
- Location: `src/interfaces/cli.py` / `__main__.py`
- Triggers: `python -m mai` or `mai` command
- Responsibilities: Initialize conversation session, handle user input loop, display responses, manage approval prompts

**Discord Entry:**
- Location: `src/interfaces/discord_bot.py`
- Triggers: Discord message events
- Responsibilities: Extract message context, route to conversation engine, format response, handle reactions for approvals

**Self-Improvement Entry:**
- Location: `src/selfmod/scheduler.py`
- Triggers: Timer-based (periodic analysis) or explicit trigger from conversation
- Responsibilities: Analyze code, generate improvements, initiate review workflow

**Core Mai Entry:**
- Location: `src/mai.py` (main class)
- Triggers: System startup
- Responsibilities: Initialize all systems (models, memory, personality), coordinate between layers

## Error Handling

**Strategy:** Graceful degradation with clear user communication

**Patterns:**
- Model unavailable: Fall back to smaller model if available, notify user of reduced capabilities
- Memory retrieval failure: Continue conversation without historical context, log error
- Network error: Queue offline messages, retry on reconnection (Discord only)
- Unsafe code generated: Block execution, log with risk analysis, notify user
- Syntax error in generated code: Reject change, log, generate new proposal

## Cross-Cutting Concerns

**Logging:** Structured logging with severity levels throughout codebase. Use Python `logging` module with JSON formatter for production. Log all: model selections, memory operations, safety decisions, approval workflows, code changes.
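
A minimal stdlib sketch of such a JSON formatter (a library like python-json-logger would also work); field names are an assumption:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log record; field set is illustrative."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.getLogger().addHandler(handler)
```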

**Validation:** Input validation at interface boundaries. AST validation for generated code. Type hints throughout codebase with mypy enforcement.

**Authentication:** None required for local CLI. Discord bot authenticated via token (environment variable). API calls between services use simple function calls (single-process model).

---

*Architecture analysis: 2026-01-26*

.planning/codebase/CONCERNS.md (new file, 297 lines)

# Codebase Concerns

**Analysis Date:** 2026-01-26

## Tech Debt

**Incomplete Memory System Integration:**
- Issue: Memory manager initializes gracefully but may fail silently when dependencies are missing
- Files: `src/mai/memory/manager.py`
- Impact: Memory features degrade silently; users don't know compression or retrieval is disabled
- Fix approach: Add explicit logging and health checks on startup, expose memory system status in CLI

**Large Monolithic Memory Manager:**
- Issue: MemoryManager is 1036 lines with multiple responsibilities (storage, compression, retrieval orchestration)
- Files: `src/mai/memory/manager.py`
- Impact: Difficult to test individual memory subsystems; changes affect multiple concerns simultaneously
- Fix approach: Extract retrieval delegation and compression orchestration into separate coordinator classes

**Conversation Engine Complexity:**
- Issue: ConversationEngine is 648 lines handling timing, state, decomposition, reasoning, interruption, and metrics
- Files: `src/mai/conversation/engine.py`
- Impact: High cognitive load for maintainers; hard to isolate bugs in specific subsystems
- Fix approach: Separate concerns into a focused orchestrator (engine) and behavior modules (timing/reasoning/decomposition are already separated but loosely coupled)

**Permission/Approval System Fragility:**
- Issue: ApprovalSystem uses regex pattern matching for risk analysis with hardcoded patterns
- Files: `src/mai/sandbox/approval_system.py`
- Impact: Pattern-matching approach is fragile (false positives/negatives); patterns not maintainable as code evolves
- Fix approach: Replace regex with AST-based code analysis for more reliable risk detection; move risk patterns to configuration

**Docker Executor Dependency Chain:**
- Issue: DockerExecutor falls back silently to unavailable state if Docker isn't installed
- Files: `src/mai/sandbox/docker_executor.py`
- Impact: Approval system thinks code is sandboxed when Docker is missing, creating a false sense of security
- Fix approach: Require explicit Docker availability check at startup; block code execution if Docker unavailable and user requests sandboxing

## Known Bugs

**Session Persistence Restoration:**
- Symptoms: "ConversationState object has no attribute 'set_conversation_history'" error when restarting CLI
- Files: `src/mai/conversation/state.py`, `src/app/__main__.py`
- Trigger: Start conversation, exit CLI, restart CLI session
- Workaround: None - session restoration broken; users lose conversation history
- Status: Identified in Phase 6 UAT but remediation code not applied (commit c70ee88 "Complete fresh slate" removed implementation)

**Session File Feedback Missing:**
- Symptoms: Users don't see where/when session files are created
- Files: `src/app/__main__.py`
- Trigger: Create new session or use /session command
- Workaround: Manually check ~/.mai/ for session.json
- Status: Identified in Phase 6 UAT as major issue (test 3 failed)

**Resource Display Color Coding:**
- Symptoms: Resource monitoring displays plain text instead of color-coded status indicators
- Files: `src/app/__main__.py`
- Trigger: Run CLI and observe resource display during conversation
- Workaround: Parse output manually to understand resource status
- Status: Identified in Phase 6 UAT as minor issue (test 5 failed); root cause: Rich console loses color output in non-terminal environments

## Security Considerations

**Approval System Risk Analysis Insufficient:**
- Risk: Regex-based risk detection can be bypassed with obfuscated code (e.g., string concatenation to build dangerous commands)
- Files: `src/mai/sandbox/approval_system.py`
- Current mitigation: Hardcoded high-risk patterns (os.system, exec, eval); fallback to block on unrecognized patterns
- Recommendations:
  - Implement AST-based code analysis for more reliable detection
  - Add code deobfuscation step before risk analysis
  - Create risk assessment database with test cases and known bypasses
  - Require explicit Docker verification before allowing code execution

**Docker Fallback Security Gap:**
- Risk: Code could execute without actual sandboxing if Docker unavailable, creating a false sense of security
- Files: `src/mai/sandbox/docker_executor.py`
- Current mitigation: AuditLogger records all execution; approval system presents requests regardless
- Recommendations:
  - Fail-safe: Block code execution if Docker unavailable and user hasn't explicitly allowed non-sandboxed execution (see the sketch after this list)
  - Add warning dialog explaining sandbox unavailability
  - Log all non-sandboxed execution attempts explicitly
  - Require explicit override from user with confirmation
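
A sketch of the fail-safe check, using real docker-py calls; how the block is surfaced to the user is left out:

```python
import docker
from docker.errors import DockerException

def sandbox_available() -> bool:
    try:
        docker.from_env().ping()  # raises if the daemon is unreachable
        return True
    except DockerException:
        return False

def execute_change(code: str, allow_unsandboxed: bool = False) -> None:
    # Fail closed: refuse to run generated code without a sandbox unless
    # the user has explicitly opted in to unsandboxed execution.
    if not sandbox_available() and not allow_unsandboxed:
        raise RuntimeError(
            "Docker unavailable: refusing to execute generated code "
            "without a sandbox (explicit override required)"
        )
    ...  # hand off to the executor
```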

**Approval Preference Learning Risk:**
- Risk: User can set "auto_allow" on risky code patterns; once learned, code execution auto-approves without user intervention
- Files: `src/mai/sandbox/approval_system.py` (lines with `user_preferences` and `auto_allow`)
- Current mitigation: Auto-allow only applies to LOW risk level code
- Recommendations:
  - Require explicit user confirmation before enabling auto-allow (not just responding "a")
  - Log all auto-approved executions in audit trail with reason
  - Add periodic review mechanism for auto-allow rules (e.g., "You have X auto-approved rules, review them?" on startup)
  - Restrict auto-allow to strictly limited operation types (print, basic math, not file operations)

## Performance Bottlenecks

**Memory Retrieval Search Not Optimized:**
- Problem: ContextRetriever does full database scans for semantic similarity without indexing
- Files: `src/mai/memory/retrieval.py`
- Cause: Vector similarity search likely using brute-force nearest-neighbor without FAISS or similar
- Improvement path:
  - Add a FAISS vector index for semantic search acceleration (see the sketch after this list)
  - Implement result caching for frequent queries
  - Add search result pagination to avoid loading entire result sets
  - Benchmark retrieval latency and set targets (e.g., <500ms for top-10 similar conversations)
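
A minimal FAISS sketch; the embedding dimension and the random stand-in vectors are placeholders:

```python
import faiss
import numpy as np

dim = 384  # e.g., a small sentence-embedding model's output size
# IndexFlatL2 is exact (still exhaustive, but SIMD-optimized); an IVF index
# would give sublinear approximate search at larger scales.
index = faiss.IndexFlatL2(dim)

embeddings = np.random.rand(1000, dim).astype("float32")  # stand-in vectors
index.add(embeddings)

query = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query, 10)  # top-10 nearest conversations
```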

**Conversation State History Accumulation:**
- Problem: ConversationState.conversation_history grows unbounded during long sessions
- Files: `src/mai/conversation/state.py`
- Cause: No automatic truncation or archival of old turns; all conversation turns kept in memory
- Improvement path:
  - Implement a sliding window of recent turns (e.g., keep the last 50 turns in memory; see the sketch after this list)
  - Archive old turns to disk and load on demand
  - Add compression trigger at configurable message count
  - Monitor memory usage and alert when conversation history exceeds threshold
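
A minimal sketch of the sliding window; `archive_turn` is a hypothetical hook for the disk-archival step:

```python
from collections import deque

class ConversationHistory:
    """Illustrative bounded history; not the project's actual class."""

    def __init__(self, max_turns: int = 50):
        self.recent: deque = deque(maxlen=max_turns)

    def append(self, turn: dict) -> None:
        if len(self.recent) == self.recent.maxlen:
            archive_turn(self.recent[0])  # oldest turn falls out of the window
        self.recent.append(turn)

def archive_turn(turn: dict) -> None:
    ...  # write the evicted turn to disk for on-demand reload
```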

**Memory Manager Compression Not Scheduled:**
- Problem: Manual `compress_conversation()` calls required; no automatic compression scheduling
- Files: `src/mai/memory/manager.py`
- Cause: Compression is triggered manually or not at all; no background task or event-driven compression
- Improvement path:
  - Implement background compression task triggered by conversation age or message count
  - Add periodic compression sweep for all old conversations
  - Make compression interval configurable (e.g., compress every 500 messages or 24 hours)
  - Track compression effectiveness and adjust thresholds

## Fragile Areas

**Ollama Integration Dependency:**
- Files: `src/mai/model/ollama_client.py`, `src/mai/core/interface.py`
- Why fragile: Hard-coded Ollama endpoint assumption; no fallback model provider; no retry logic for model inference
- Safe modification:
  - Use dependency injection for model provider (interface-based)
  - Add configurable model provider endpoints
  - Implement retry logic with exponential backoff for transient failures
  - Add model availability detection at startup
- Test coverage: Limited tests for model switching and unavailability scenarios

**Git Integration Fragility:**
- Files: `src/mai/git/committer.py`, `src/mai/git/workflow.py`
- Why fragile: Assumes clean git state; no handling for merge conflicts, detached HEAD, or dirty working directory
- Safe modification:
  - Add pre-commit git status validation
  - Handle merge conflict detection and defer commits
  - Implement conflict resolution strategy (manual review or aborting)
  - Test against all git states (detached HEAD, dirty working tree, conflicted merge)
- Test coverage: No tests for edge cases like merge conflicts

**Conversation State Serialization Round-Trip:**
- Files: `src/mai/conversation/state.py`, `src/mai/models/conversation.py`
- Why fragile: ConversationTurn -> Ollama message -> ConversationTurn conversion can lose context
- Safe modification:
  - Add comprehensive unit tests for serialization round-trip
  - Document serialization format and invariants
  - Add validation after deserialization (verify message count, order, role integrity)
  - Create fixture tests with edge cases (unicode, very long messages, special characters)
- Test coverage: No existing tests for message serialization/deserialization

**Docker Configuration Hardcoding:**
- Files: `src/mai/sandbox/docker_executor.py`
- Why fragile: Docker image names, CPU limits, memory limits hardcoded as class constants
- Safe modification:
  - Move Docker config to configuration file
  - Add validation on startup that Docker limits match system resources
  - Document all Docker configuration assumptions
  - Make limits tunable per system resource profile
- Test coverage: Docker integration tests likely mocked; no testing on actual Docker variations

## Scaling Limits

**Memory Database Size Growth:**
- Current capacity: SQLite with no explicit limits; storage grows with every conversation
- Limit: SQLite performance degrades significantly above ~1GB; queries become slow
- Scaling path:
  - Implement database rotation (archive old conversations, start new DB periodically)
  - Add migration path to PostgreSQL for production deployments
  - Implement automatic old conversation archival (move to cold storage after 30 days)
  - Add database vacuum and index optimization on scheduled basis

**Conversation Context Window Management:**
- Current capacity: Model context window determined by Ollama model selection (varies)
- Limit: ConversationEngine doesn't prevent context overflow; will fail when history exceeds model limit
- Scaling path:
  - Track token count of conversation history and refuse new messages before overflow
  - Implement automatic compression trigger at 80% context usage (see the sketch after this list)
  - Add model switching logic to use larger-context models if available
  - Document context budget requirements per model
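
A sketch of the 80% trigger; the whitespace token count stands in for a real tokenizer, and `compress` is a hypothetical hook:

```python
COMPRESSION_THRESHOLD = 0.8  # compress once 80% of the context is used

def maybe_compress(history: list[str], context_limit: int) -> None:
    used = sum(len(turn.split()) for turn in history)  # crude token estimate
    if used >= COMPRESSION_THRESHOLD * context_limit:
        compress(history)  # summarize/evict older turns before overflow

def compress(history: list[str]) -> None:
    ...  # delegate to the memory system's compression logic
```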

**Approval History Unbounded Growth:**
- Current capacity: ApprovalSystem.approval_history list grows indefinitely
- Limit: Memory accumulation over time; each approval decision stored in memory forever
- Scaling path:
  - Archive approval history to database after threshold (e.g., 1000 decisions)
  - Implement approval history rotation with configurable retention
  - Add aggregate statistics (approval patterns) instead of storing raw history
  - Clean up approval history on startup or scheduled task

## Dependencies at Risk

**Ollama Dependency and Model Availability:**
- Risk: Hard requirement on Ollama being available and having models installed
- Impact: Mai cannot function without Ollama; no fallback to cloud inference or other providers
- Migration plan:
  - Implement abstract model provider interface
  - Add support for OpenAI/other cloud models as fallback (even if v1 is offline-first)
  - Document minimum Ollama model requirements
  - Add diagnostic tool to check Ollama health on startup

**Docker Dependency for Sandboxing:**
- Risk: Docker required for code execution safety; no alternative sandbox implementations
- Impact: Users without Docker can't safely execute generated code; no graceful degradation
- Migration plan:
  - Implement abstract executor interface (not just DockerExecutor)
  - Add noop executor for testing
  - Consider lightweight alternatives (seccomp, chroot, or bubblewrap) for Linux systems
  - Add explicit warning if Docker unavailable

**Rich Library Terminal Detection:**
- Risk: Rich disables colors in non-terminal environments; users see degraded UX
- Impact: Resource monitoring and status displays lack visual feedback in non-terminal contexts
- Migration plan:
  - Use Console(force_terminal=True) to force color output when desired
  - Add configuration option for color preference
  - Implement fallback emoji/unicode indicators for non-color environments
  - Test in various terminal emulators and SSH sessions

## Missing Critical Features

**Session Data Portability:**
- Problem: Session files are JSON but no export/import mechanism; can't back up or migrate sessions
- Blocks: Users can't back up conversations; losing ~/.mai/session.json loses all context
- Fix: Add export/import commands (/export, /import) and document session file format

**Conversation Memory Persistence:**
- Problem: Conversation history is session-scoped (stored in memory); not saved to memory system
- Blocks: Long-term pattern learning relies on memory system but conversations aren't automatically stored
- Fix: Implement automatic conversation archival to memory system after session ends

**User Preference Learning Audit Trail:**
- Problem: User preferences for auto-approval learned silently; no visibility into what patterns auto-approve
- Blocks: Users can't audit their own auto-approval rules; hard to recover from accidentally enabling auto-allow
- Fix: Add /preferences or /audit command to show all learned rules and allow revocation

**Resource Constraint Graceful Degradation:**
- Problem: System shows resource usage but doesn't adapt model selection or conversation behavior
- Blocks: Mai can't suggest switching to smaller models when resources are tight
- Fix: Implement resource-aware model recommendation system

**Approval Change Logging:**
- Problem: Approval decisions not tracked in git; can't audit "who approved what when"
- Blocks: No accountability trail for approval decisions
- Fix: Log all approval decisions to git with commit messages including timestamp and user

## Test Coverage Gaps

**Docker Executor Network Isolation:**
- What's not tested: Whether network access is actually restricted in Docker containers
- Files: `src/mai/sandbox/docker_executor.py`
- Risk: Code might have network access despite supposed isolation
- Priority: High (security-critical)

**Session Persistence Edge Cases:**
- What's not tested: Very large conversations (1000+ messages), unicode characters, special characters
- Files: `src/mai/conversation/state.py`, session persistence code
- Risk: Session files corrupt or lose data with edge case inputs
- Priority: High (data loss)

**Approval System Obfuscation Bypass:**
- What's not tested: Obfuscated code patterns, string concatenation attacks, bytecode approaches
- Files: `src/mai/sandbox/approval_system.py`
- Risk: Risky code could slip through as "low risk" via obfuscation
- Priority: High (security-critical)

**Memory Compression Round-Trip Data Loss:**
- What's not tested: Whether compressed conversations can be exactly reconstructed
- Files: `src/mai/memory/compression.py`, `src/mai/memory/storage.py`
- Risk: Compression could lose important context patterns; compression metrics may be misleading
- Priority: Medium (data integrity)

**Model Switching During Active Conversation:**
- What's not tested: Switching models mid-conversation, context migration, embedding space changes
- Files: `src/mai/model/switcher.py`, `src/mai/conversation/engine.py`
- Risk: Context might not transfer correctly when models switch
- Priority: Medium (feature reliability)

**Offline Queue Conflict Resolution:**
- What's not tested: What happens when offline messages conflict with new context on reconnection
- Files: `src/mai/conversation/engine.py` (offline queueing)
- Risk: Offline messages might create an incoherent conversation when reconnected
- Priority: Medium (conversation coherence)

**Resource Detector System Resource Edge Cases:**
- What's not tested: GPU detection on systems with unusual hardware, CPU count on virtual systems
- Files: `src/mai/model/resource_detector.py`
- Risk: Wrong model selection due to misdetected resources
- Priority: Low (graceful degradation usually handles this)

---

*Concerns audit: 2026-01-26*

.planning/codebase/CONVENTIONS.md (new file, 298 lines)

# Coding Conventions

**Analysis Date:** 2026-01-26

## Status

**Note:** This codebase is in planning phase. No source code has been written yet. These conventions are **prescriptive** for the Mai project and should be applied to all code from the first commit forward.

## Naming Patterns

**Files:**
- Python modules: `lowercase_with_underscores.py` (PEP 8)
- Configuration files: `config.yaml`, `.env.example`
- Test files: `test_module_name.py` (co-located with source)
- Example: `src/memory/storage.py`, `src/memory/test_storage.py`

**Functions:**
- Use `snake_case` for all function names (PEP 8)
- Private functions: Prefix with single underscore `_private_function()`
- Async functions: Use `async def async_operation()` naming
- Example: `def get_conversation_history()`, `async def stream_response()`

**Variables:**
- Use `snake_case` for all variable names
- Constants: `UPPERCASE_WITH_UNDERSCORES`
- Private module variables: Prefix with `_`
- Example: `conversation_history`, `MAX_CONTEXT_TOKENS`, `_internal_cache`

**Types:**
- Classes: `PascalCase`
- Enums: `PascalCase` (inherit from `Enum`)
- TypedDict: `PascalCase` with `Dict` suffix
- Example: `class ConversationManager`, `class ErrorLevel(Enum)`, `class MemoryConfigDict(TypedDict)`

**Directories:**
- Core modules: `src/[module_name]/` (lowercase, plural when appropriate)
- Example: `src/models/`, `src/memory/`, `src/safety/`, `src/interfaces/`

## Code Style

**Formatting:**
- Tool: **Ruff** (formatter and linter)
- Line length: 88 characters (Ruff default)
- Quote style: Double quotes (`"string"`)
- Indentation: 4 spaces (no tabs)

**Linting:**
- Tool: **Ruff**
- Configuration enforced via `.ruff.toml` (when created)
- All imports must pass ruff checks
- No unused imports allowed
- Type hints required for public functions

**Python Version:**
- Minimum: Python 3.10+
- Use modern type hints: built-in generics (`list[str]`, `dict[str, int]`) over `typing.List`/`typing.Dict`
- Use `str | None` instead of `Optional[str]` (union syntax)

## Import Organization

**Order:**
1. Standard library imports (`import os`, `import sys`)
2. Third-party imports (`import discord`, `import numpy`)
3. Local imports (`from src.memory import Storage`)
4. Blank line between each group

**Example:**

```python
import asyncio
import json
from pathlib import Path
from typing import Optional

import discord
from dotenv import load_dotenv

from src.memory import ConversationStorage
from src.models import ModelManager
```

**Path Aliases:**
- Use absolute imports rooted at `src/`
- Avoid deep relative imports (no `../../../`)
- Example: `from src.safety import SandboxExecutor` not `from ...safety import SandboxExecutor`

## Error Handling

**Patterns:**
- Define domain-specific exceptions in `src/exceptions.py`
- Use exception hierarchy (base `MaiException`, specific subclasses)
- Always include context in exceptions (error code, details, suggestions)
- Example:

```python
class MaiException(Exception):
    """Base exception for Mai framework."""

    def __init__(self, code: str, message: str, details: dict | None = None):
        self.code = code
        self.message = message
        self.details = details or {}
        super().__init__(f"[{code}] {message}")


class ModelError(MaiException):
    """Raised when model inference fails."""
    pass


class MaiMemoryError(MaiException):
    """Raised when memory operations fail.

    Named MaiMemoryError to avoid shadowing the built-in MemoryError.
    """
    pass
```

- Log before raising (see Logging section)
- Use context managers for cleanup (async context managers for async code)
- Never catch bare `Exception` - catch specific exceptions

## Logging

**Framework:** `logging` module (Python standard library)

**Patterns:**
- Create logger per module: `logger = logging.getLogger(__name__)`
- Log levels guide:
  - `DEBUG`: Detailed diagnostic info (token counts, decision trees)
  - `INFO`: Significant operational events (conversation started, model loaded)
  - `WARNING`: Unexpected but handled conditions (fallback triggered, retry)
  - `ERROR`: Failed operation (model error, memory access failed)
  - `CRITICAL`: System-level failures (cannot recover)
- Structured logging preferred (include operation context)
- Example:

```python
import logging

logger = logging.getLogger(__name__)


async def invoke_model(prompt: str, model: str) -> str:
    logger.debug(f"Invoking model={model} with token_count={len(prompt.split())}")
    try:
        response = await model_manager.generate(prompt)
        logger.info(f"Model response generated, length={len(response)}")
        return response
    except ModelError as e:
        logger.error(f"Model invocation failed: {e.code}", exc_info=True)
        raise
```

## Comments

**When to Comment:**
- Complex logic requiring explanation (multi-step algorithms, non-obvious decisions)
- Important context that code alone cannot convey (why a workaround exists)
- Do NOT comment obvious code (`x = 1  # set x to 1` is noise)
- Do NOT duplicate what the code already says

**Docstrings:**
- Use Google-style docstrings for all public functions/classes
- Include return type even if type hints exist (for readability)
- Example:

```python
async def get_memory_context(
    query: str,
    max_tokens: int = 2000,
) -> str:
    """Retrieve relevant memory context for a query.

    Performs vector similarity search on conversation history,
    compresses results to fit token budget, and returns formatted context.

    Args:
        query: The search query for memory retrieval.
        max_tokens: Maximum tokens in returned context (default 2000).

    Returns:
        Formatted memory context as markdown-structured string.

    Raises:
        MaiMemoryError: If database query fails or storage is corrupted.
    """
```

## Function Design

**Size:**
- Target: Functions under 50 lines (hard limit: 100 lines)
- Break complex logic into smaller helper functions
- One responsibility per function (single responsibility principle)

**Parameters:**
- Maximum 4 positional parameters
- Use keyword-only arguments for optional params: `def func(required, *, optional=None)`
- Use dataclasses or TypedDict for complex parameter groups
- Example:

```python
# Good: Clear structure
async def approve_change(
    change_id: str,
    *,
    reviewer_id: str,
    decision: Literal["approve", "reject"],
    reason: str | None = None,
) -> None:
    pass


# Bad: Too many params
async def approve_change(change_id, reviewer_id, decision, reason, timestamp, context, metadata):
    pass
```

**Return Values:**
- Explicitly return values (no implicit `None` returns unless documented)
- Use `Optional[T]` or `T | None` in type hints for nullable returns
- Prefer returning data objects over tuples: return `Result` not `(status, data, error)`
- Async functions return awaitable, not callbacks

## Module Design

**Exports:**
- Define `__all__` in each module to be explicit about public API
- Example in `src/memory/__init__.py`:

```python
from src.memory.storage import ConversationStorage
from src.memory.compression import MemoryCompressor

__all__ = ["ConversationStorage", "MemoryCompressor"]
```

**Barrel Files:**
- Use `__init__.py` to export key classes/functions from submodules
- Keep import chains shallow (max 2 levels deep)
- Example structure:

```
src/
├── memory/
│   ├── __init__.py (exports Storage, Compressor)
│   ├── storage.py
│   └── compression.py
```

**Async/Await:**
- All I/O operations (database, API calls, file I/O) must be async
- Use `asyncio` for concurrency, not threading
- Async context managers for resource management:

```python
async def process_request(prompt: str) -> str:
    async with model_manager.get_session() as session:
        response = await session.generate(prompt)
        return response
```

## Type Hints

**Requirements:**
- All public function signatures must have type hints
- Use `from __future__ import annotations` for forward references
- Prefer union syntax: `str | None` over `Optional[str]`
- Use `Literal` for string enums: `Literal["approve", "reject"]`
- Example:

```python
from __future__ import annotations

from typing import Literal


def evaluate_risk(code: str) -> Literal["LOW", "MEDIUM", "HIGH", "BLOCKED"]:
    """Evaluate code risk level."""
    pass
```

## Configuration

**Pattern:**
- Use YAML for human-editable config files
- Use environment variables for secrets (never commit `.env`)
- Validation at import time (fail fast if config invalid)
- Example:

```python
# config.py
import os
from pathlib import Path


class Config:
    DEBUG = os.getenv("DEBUG", "false").lower() == "true"
    MODELS_PATH = Path(os.getenv("MODELS_PATH", "~/.mai/models")).expanduser()
    MAX_CONTEXT_TOKENS = int(os.getenv("MAX_CONTEXT_TOKENS", "8000"))

    # Validate on import
    if not MODELS_PATH.exists():
        raise RuntimeError(f"Models path does not exist: {MODELS_PATH}")
```

---

*Convention guide: 2026-01-26*

*Status: Prescriptive for Mai v1 implementation*

.planning/codebase/INTEGRATIONS.md (new file, 129 lines)

# External Integrations

**Analysis Date:** 2026-01-26

## APIs & External Services

**Model Inference:**
- LMStudio - Local model server for inference and model switching
  - SDK/Client: LMStudio Python API
  - Auth: None (local service, no authentication required)
  - Configuration: model_path env var, endpoint URL
- Ollama - Alternative local model management system
  - SDK/Client: Ollama REST API (HTTP)
  - Auth: None (local service)
  - Purpose: Model loading, switching, inference with resource detection
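
A sketch of a non-streaming call against Ollama's documented `/api/generate` route; the model name is a placeholder:

```python
import requests

def ollama_generate(
    prompt: str,
    model: str = "llama3",                      # placeholder model name
    endpoint: str = "http://localhost:11434",   # Ollama's default port
) -> str:
    resp = requests.post(
        f"{endpoint}/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]  # full completion text
```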

**Communication & Approvals:**
- Discord - Bot interface for conversation and change approvals
  - SDK/Client: discord.py library
  - Auth: DISCORD_BOT_TOKEN env variable
  - Purpose: Multi-turn conversations, approval reactions (thumbs up/down), status updates
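
A sketch of reaction-based approval with discord.py; the emoji mapping and the `resolve_approval` hook are assumptions about Mai's workflow:

```python
import os

import discord

intents = discord.Intents.default()
intents.reactions = True
client = discord.Client(intents=intents)

async def resolve_approval(message_id: int, approved: bool) -> None:
    ...  # hand the decision to the approval workflow (hypothetical hook)

@client.event
async def on_reaction_add(reaction: discord.Reaction, user: discord.User):
    if user.bot:
        return  # ignore the bot's own reactions
    if str(reaction.emoji) == "👍":
        await resolve_approval(reaction.message.id, approved=True)
    elif str(reaction.emoji) == "👎":
        await resolve_approval(reaction.message.id, approved=False)

client.run(os.environ["DISCORD_BOT_TOKEN"])
```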

## Data Storage

**Databases:**
- SQLite3 (local file-based)
  - Connection: Local file path, no remote connection
  - Client: Python sqlite3 (stdlib) or SQLAlchemy ORM
  - Purpose: Persistent conversation history, memory compression, learned patterns
  - Location: Local filesystem (.db files)

**File Storage:**
- Local filesystem only - Git-tracked code changes, conversation history backups
- No cloud storage integration in v1

**Caching:**
- In-memory caching for current conversation context
- Redis: Not used in v1 (local-first constraint)
- Model context window management: Token-based cache within model inference

## Authentication & Identity

**Auth Provider:**
- Custom local auth - No external identity provider
- Implementation:
  - Discord user ID as conversation context identifier
  - Optional local password/PIN for CLI access
  - No OAuth/cloud identity providers (offline-first requirement)

## Monitoring & Observability

**Error Tracking:**
- None (local only, no error reporting service)
- Local audit logging to SQLite instead

**Logs:**
- File-based logging to `.logs/` directory
- Format: Structured JSON logs with timestamp, level, context
- Rotation: Size-based or time-based rotation strategy
- No external log aggregation (offline-first)

## CI/CD & Deployment

**Hosting:**
- Local machine only (desktop/laptop with RTX 3060+)
- No cloud hosting in v1

**CI Pipeline:**
- GitHub Actions for Discord webhook on push
- Workflow: `.github/workflows/discord_sync.yml`
- Trigger: Push events
- Action: POST to Discord webhook for notification

**Git Integration:**
- All Mai's self-modifications committed automatically with git
- Local git repo tracking all code changes
- Commit messages include decision context and review results

## Environment Configuration

**Required env vars:**
- `DISCORD_BOT_TOKEN` - Discord bot authentication
- `LMSTUDIO_ENDPOINT` - LMStudio API URL (default: localhost:8000)
- `OLLAMA_ENDPOINT` - Ollama API URL (optional alternative, default: localhost:11434)
- `DISCORD_USER_ID` - User Discord ID for approval requests
- `MEMORY_DB_PATH` - SQLite database file location
- `MODEL_CACHE_DIR` - Directory for model files
- `CPU_CORES_AVAILABLE` - System CPU count for resource management
- `GPU_VRAM_AVAILABLE` - VRAM in GB for model selection
- `SANDBOX_DOCKER_IMAGE` - Docker image ID for code sandbox execution

**Secrets location:**
- `.env` file (python-dotenv) for local development
- Environment variables for production/runtime
- Git-ignored: `.env` not committed

## Webhooks & Callbacks

**Incoming:**
- Discord message webhooks - Handled by discord.py bot event listeners
- No external webhook endpoints in v1

**Outgoing:**
- Discord webhook for git notifications (configured in GitHub Actions)
  - Endpoint: Stored in GitHub secrets as WEBHOOK
  - Triggered on: git push events
  - Payload: Git commit information (author, message, timestamp)

**Model Callback Handling:**
- LMStudio streaming callbacks for token-by-token responses
- Ollama streaming responses for incremental model output

## Code Execution Sandbox

**Sandbox Environment:**
- Docker container with resource limits
- SDK: Docker SDK for Python (docker-py)
- Environment: Isolated Linux container
- Resource limits: CPU cores, RAM, network restrictions

**Risk Assessment:**
- Multi-level risk evaluation (LOW/MEDIUM/HIGH/BLOCKED)
- AST validation before container execution
- Second-agent review via Claude/OpenCode API

---

*Integration audit: 2026-01-26*

.planning/codebase/STACK.md (new file, 93 lines)

# Technology Stack

**Analysis Date:** 2026-01-26

## Languages

**Primary:**
- Python 3.x - Core Mai agent codebase, local model inference, self-improvement system

**Secondary:**
- YAML - Configuration files for personality, behavior settings
- JSON - Configuration, metadata, API responses
- SQL - Memory storage and retrieval queries

## Runtime

**Environment:**
- Python (local execution, no remote runtime)
- LMStudio or Ollama - Local model inference server

**Package Manager:**
- pip - Python package management
- Lockfile: requirements.txt or poetry.lock (typical Python approach)

## Frameworks

**Core:**
- No web framework for v1 (CLI/Discord only)

**Model Inference:**
- LMStudio Python SDK - Local model switching and inference
- Ollama API - Alternative local model management per requirements

**Discord Integration:**
- discord.py - Discord bot API client

**CLI:**
- Click or Typer - Command-line interface building

**Testing:**
- pytest - Unit/integration test framework
- pytest-asyncio - Async test support for Discord bot testing

**Build/Dev:**
- Git - Version control for Mai's own code changes
- Docker - Sandbox execution environment for safety

## Key Dependencies

**Critical:**
- LMStudio Python Client - Model loading, switching, inference with token management
- discord.py - Discord bot functionality for approval workflows
- SQLite3 - Lightweight persistent storage (Python stdlib)
- Docker SDK for Python - Sandbox execution management

**Infrastructure:**
- requests - HTTP client for Discord API fallback and Ollama API communication
- PyYAML - Personality configuration parsing
- pydantic - Data validation for internal structures
- python-dotenv - Environment variable management for secrets
- GitPython - Programmatic git operations for committing self-improvements

## Configuration

**Environment:**
- .env file - Discord bot token, model paths, resource thresholds
- Environment variables - Runtime configuration loaded at startup
- personality.yaml - Core personality values and learned behavior layers
- config.json - Resource limits, model preferences, memory settings

**Build:**
- setup.py or pyproject.toml - Package metadata and dependency declaration
- Dockerfile - Sandbox execution environment specification
- .dockerignore - Docker build optimization

## Platform Requirements

**Development:**
- Python 3.10+ (required by the union-syntax type hints prescribed in CONVENTIONS.md)
- Git (for version control and self-modification tracking)
- Docker (for sandbox execution environment)
- LMStudio or Ollama running locally (for model inference)

**Production (Runtime):**
- RTX 3060 GPU minimum (per project constraints)
- 16GB+ RAM (for model loading and context management)
- Linux/macOS/Windows with Python 3.10+
- Docker daemon (for sandboxed code execution)
- Local LMStudio/Ollama instance (no cloud models)

---

*Stack analysis: 2026-01-26*

.planning/codebase/STRUCTURE.md (new file, 258 lines)

# Codebase Structure

**Analysis Date:** 2026-01-26

## Directory Layout

```
mai/
├── src/
│   ├── __main__.py              # CLI entry point
│   ├── mai.py                   # Core Mai class, orchestration
│   ├── models/
│   │   ├── __init__.py
│   │   ├── adapter.py           # Base model adapter interface
│   │   ├── ollama_adapter.py    # Ollama/LMStudio implementation
│   │   ├── model_manager.py     # Model selection and switching logic
│   │   └── resource_monitor.py  # CPU, RAM, GPU tracking
│   ├── memory/
│   │   ├── __init__.py
│   │   ├── store.py             # SQLite conversation store
│   │   ├── vector_search.py     # Semantic similarity search
│   │   ├── compression.py       # History compression and summarization
│   │   └── pattern_extractor.py # Learning and pattern recognition
│   ├── conversation/
│   │   ├── __init__.py
│   │   ├── engine.py            # Main conversation orchestration
│   │   ├── context_manager.py   # Token budget and window management
│   │   ├── turn_handler.py      # Single turn processing
│   │   └── reasoning.py         # Reasoning transparency and clarification
│   ├── personality/
│   │   ├── __init__.py
│   │   ├── core_rules.py        # Unshakeable core values enforcement
│   │   ├── learned_behaviors.py # Personality adaptation from interactions
│   │   ├── guardrails.py        # Safety constraints and refusal logic
│   │   └── config_loader.py     # YAML personality configuration
│   ├── safety/
│   │   ├── __init__.py
│   │   ├── executor.py          # Docker sandbox execution wrapper
│   │   ├── risk_analyzer.py     # Risk classification (LOW/MEDIUM/HIGH/BLOCKED)
│   │   ├── ast_validator.py     # Syntax and import validation
│   │   └── audit_log.py         # Immutable execution history
│   ├── selfmod/
│   │   ├── __init__.py
│   │   ├── analyzer.py          # Code analysis and improvement detection
│   │   ├── generator.py         # Improvement code generation
│   │   ├── scheduler.py         # Periodic and on-demand analysis trigger
│   │   ├── reviewer.py          # Second-agent review coordination
│   │   └── git_manager.py       # Git commit integration
│   ├── interfaces/
│   │   ├── __init__.py
│   │   ├── cli.py               # CLI chat interface
│   │   ├── discord_bot.py       # Discord bot implementation
│   │   ├── message_handler.py   # Shared message processing
│   │   ├── approval_handler.py  # Change approval workflow
│   │   └── offline_queue.py     # Message queueing during disconnection
│   └── utils/
│       ├── __init__.py
│       ├── config.py            # Configuration loading
│       ├── logging.py           # Structured logging setup
│       ├── validators.py        # Input validation helpers
│       └── helpers.py           # Shared utility functions
├── config/
│   ├── personality.yaml         # Core personality configuration
│   ├── models.yaml              # Model definitions and resource limits
│   ├── safety_rules.yaml        # Risk assessment rules
│   └── logging.yaml             # Logging configuration
├── tests/
│   ├── unit/
│   │   ├── test_models.py
│   │   ├── test_memory.py
│   │   ├── test_conversation.py
│   │   ├── test_personality.py
│   │   ├── test_safety.py
│   │   └── test_selfmod.py
│   ├── integration/
│   │   ├── test_conversation_flow.py
│   │   ├── test_selfmod_workflow.py
│   │   └── test_interfaces.py
│   └── fixtures/
│       ├── mock_models.py
│       ├── test_data.py
│       └── sample_conversations.json
├── scripts/
│   ├── setup_ollama.py          # Initial model downloading
│   ├── init_db.py               # Database schema initialization
│   └── verify_environment.py    # Pre-flight checks
├── docker/
│   └── Dockerfile               # Sandbox execution environment
├── .env.example                 # Environment variables template
├── pyproject.toml               # Project metadata and dependencies
├── requirements.txt             # Python dependencies
├── pytest.ini                   # Test configuration
├── Makefile                     # Development commands
└── README.md                    # Project overview
```
|
||||||
|
|
||||||
|
## Directory Purposes

**src/:**
- Purpose: All application code
- Contains: Python modules organized by architectural layer
- Key files: `mai.py` (core), `__main__.py` (CLI entry)

**src/models/:**
- Purpose: Model inference abstraction
- Contains: Adapter interfaces, Ollama client, resource monitoring
- Key files: `model_manager.py` (selection logic), `resource_monitor.py` (constraints)

**src/memory/:**
- Purpose: Persistent storage and retrieval
- Contains: SQLite operations, vector search, compression
- Key files: `store.py` (main interface), `vector_search.py` (semantic search)

**src/conversation/:**
- Purpose: Multi-turn conversation orchestration
- Contains: Turn handling, context windowing, reasoning transparency
- Key files: `engine.py` (main coordinator), `context_manager.py` (token budget)

**src/personality/:**
- Purpose: Values enforcement and personality adaptation
- Contains: Core rules, learned behaviors, guardrails
- Key files: `core_rules.py` (unshakeable values), `learned_behaviors.py` (adaptation)

**src/safety/:**
- Purpose: Code execution sandboxing and risk assessment
- Contains: Docker wrapper, AST validation, risk classification, audit logging
- Key files: `executor.py` (sandbox wrapper), `risk_analyzer.py` (classification)

**src/selfmod/:**
- Purpose: Autonomous code improvement and review
- Contains: Code analysis, improvement generation, approval workflow
- Key files: `analyzer.py` (detection), `reviewer.py` (second-agent coordination)

**src/interfaces/:**
- Purpose: External communication adapters
- Contains: CLI handler, Discord bot, approval system
- Key files: `cli.py` (terminal UI), `discord_bot.py` (Discord integration)

**src/utils/:**
- Purpose: Shared utilities and helpers
- Contains: Configuration loading, logging, validation
- Key files: `config.py` (env/file loading), `logging.py` (structured logs)

**config/:**
- Purpose: Non-code configuration files
- Contains: YAML definitions for personality, models, and safety rules
- Key files: `personality.yaml` (core values), `models.yaml` (resource profiles)

**tests/:**
- Purpose: Test suites organized by type
- Contains: Unit tests (layer isolation), integration tests (flows), fixtures (test data)
- Key files: Each test file mirrors the `src/` structure

**scripts/:**
- Purpose: One-off setup and maintenance scripts
- Contains: Model setup, database initialization, environment verification
- Key files: `setup_ollama.py` (first-time model setup)

**docker/:**
- Purpose: Container configuration for sandboxed execution
- Contains: Dockerfile for the isolation environment
- Key files: `Dockerfile` (build recipe)

## Key File Locations

**Entry Points:**
- `src/__main__.py`: CLI entry; `python -m mai` launches here (sketched below)
- `src/interfaces/discord_bot.py`: Discord bot main loop
- `src/mai.py`: Core Mai class, system initialization

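For orientation, a minimal sketch of what `src/__main__.py` could look like under this layout. The `Mai` constructor and a `run_cli` coroutine in `cli.py` are assumptions for illustration, not a settled API:

```python
# src/__main__.py -- hypothetical sketch; the Mai and run_cli APIs are assumed
import asyncio

from src.mai import Mai                  # assumed core class (src/mai.py)
from src.interfaces.cli import run_cli   # assumed CLI entry coroutine


def main() -> None:
    """Launch Mai with the CLI interface (`python -m mai`)."""
    mai = Mai()  # assumed to load config/ and initialize the layers
    asyncio.run(run_cli(mai))


if __name__ == "__main__":
    main()
```
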
**Configuration:**
- `config/personality.yaml`: Core values, interaction patterns, refusal rules
- `config/models.yaml`: Available models, resource requirements, context windows
- `.env.example`: Required environment variables template

**Core Logic:**
- `src/mai.py`: Main orchestration
- `src/conversation/engine.py`: Conversation turn processing
- `src/selfmod/analyzer.py`: Improvement opportunity detection
- `src/safety/executor.py`: Safe code execution

**Testing:**
- `tests/unit/`: Layer-isolated tests (no dependencies between layers)
- `tests/integration/`: End-to-end flow tests
- `tests/fixtures/`: Mock objects and test data

## Naming Conventions

**Files:**
- Module files: `snake_case.py` (e.g., `model_manager.py`)
- Entry points: `__main__.py` for packages, standalone scripts at package root
- Config files: `snake_case.yaml` (e.g., `personality.yaml`)
- Test files: `test_*.py` (e.g., `test_conversation.py`)

**Directories:**
- Feature areas: `snake_case` (e.g., `src/selfmod/`)
- No abbreviations except `selfmod` (self-modification), which is the project standard
- Each layer is a top-level directory under `src/`

**Functions/Classes:**
- Classes: `PascalCase` (e.g., `ModelManager`, `ConversationEngine`)
- Functions: `snake_case` (e.g., `generate_response()`, `validate_code()`)
- Constants: `UPPER_SNAKE_CASE` (e.g., `MAX_CONTEXT_TOKENS`)
- Private methods/functions: prefix with `_` (e.g., `_internal_method()`)

**Types:**
- Use type hints throughout: `def process(msg: str) -> str:`
- Complex types go in `src/utils/types.py` or stay local to the module

## Where to Add New Code

**New Feature (e.g., a new communication interface like Slack):**
- Primary code: `src/interfaces/slack_adapter.py` (new adapter following the `discord_bot.py` pattern; see the sketch after this list)
- Tests: `tests/unit/test_slack_adapter.py` and `tests/integration/test_slack_interface.py`
- Configuration: Add to `src/interfaces/__init__.py` imports and `config/interfaces.yaml` if needed
- Entry hook: Modify `src/mai.py` to initialize the new adapter

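A minimal sketch of what such an adapter could look like. This is hedged: no Slack library has been chosen, so the injected client, its `post_message` call, and the shared `handle_message` coroutine signature are all assumptions:

```python
# src/interfaces/slack_adapter.py -- hypothetical sketch, not a settled API
from src.interfaces.message_handler import handle_message  # assumed shared entry point


class SlackAdapter:
    """Adapter that forwards Slack events to the shared message handler."""

    def __init__(self, client) -> None:
        self.client = client  # injected Slack client (library not chosen yet)

    async def on_message(self, channel: str, user: str, text: str) -> None:
        # Route every inbound message through the same pipeline the CLI and
        # Discord interfaces use, so personality and safety checks apply
        # uniformly across interfaces.
        reply = await handle_message(user_id=user, content=text)
        await self.client.post_message(channel=channel, text=reply)
```
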
**New Component/Module (e.g., advanced memory with graph databases):**
- Implementation: `src/memory/graph_store.py` (new module in the appropriate layer)
- Interface: Follow existing patterns (e.g., inherit from the `src/memory/store.py` base; see the sketch below)
- Tests: Corresponding test in `tests/unit/test_memory.py`, or a new file if complex
- Integration: Modify `src/mai.py` initialization to use the new component behind a feature flag

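A sketch of how such a module could slot in, assuming `store.py` defines a base class with `store`/`get` methods. The base-class name, method signatures, and the `_link_related` helper are illustrative assumptions:

```python
# src/memory/graph_store.py -- hypothetical sketch under assumed interfaces
from src.memory.store import BaseStore  # assumed abstract base in store.py


class GraphStore(BaseStore):
    """Memory backend that adds relationship edges between conversations."""

    def store(self, conversation: dict) -> str:
        conversation_id = super().store(conversation)  # reuse base persistence
        self._link_related(conversation_id)            # hypothetical edge-building step
        return conversation_id

    def _link_related(self, conversation_id: str) -> None:
        # Private helper, per naming conventions; would connect this
        # conversation to semantically similar ones so retrieval can
        # walk the graph.
        ...
```
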
**Utilities (e.g., a new helper function):**
- Shared helpers: `src/utils/helpers.py` (functions), or a new file like `src/utils/math_utils.py` if substantial
- Internal helpers: Keep them in the module where they are used (don't over-extract)
- Tests: Add to `tests/unit/test_utils.py`

**Configuration:**
- Static rules: Add to the appropriate YAML file in `config/`
- Dynamic config: Load in `src/utils/config.py` (see the sketch below)
- Env-driven: Add to `.env.example` with documentation

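A minimal sketch of the YAML-plus-environment loading that `src/utils/config.py` could provide. The `MAI_*` variable naming and the precedence rule (environment overrides file) are assumptions, not documented decisions:

```python
# src/utils/config.py -- hypothetical sketch; precedence rules are assumed
import os
from pathlib import Path

import yaml  # PyYAML, a plausible dependency given the YAML configs


def load_config(name: str, config_dir: Path = Path("config")) -> dict:
    """Load config/<name>.yaml, letting MAI_* environment variables override.

    Example (assumed convention): MAI_MODELS_DEFAULT=llama3 would override
    the `default` key loaded from models.yaml.
    """
    with open(config_dir / f"{name}.yaml", "r", encoding="utf-8") as fh:
        config = yaml.safe_load(fh) or {}

    # Environment values win over file values (requires Python 3.9+ for removeprefix).
    prefix = f"MAI_{name.upper()}_"
    for key, value in os.environ.items():
        if key.startswith(prefix):
            config[key.removeprefix(prefix).lower()] = value
    return config
```
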
## Special Directories

**tests/fixtures/:**
- Purpose: Reusable test data and mock objects
- Generated: No, hand-created
- Committed: Yes, part of the repository

**config/:**
- Purpose: Non-code configuration
- Generated: No, hand-maintained
- Committed: Yes, except secrets (use `.env`)

**.env (not committed):**
- Purpose: Local environment overrides and secrets
- Generated: No, copied from `.env.example` and filled in locally
- Committed: No (in `.gitignore`)

**docker/:**
- Purpose: Sandbox environment for safe execution
- Generated: No, hand-maintained
- Committed: Yes

---

*Structure analysis: 2026-01-26*

415
.planning/codebase/TESTING.md
Normal file
@@ -0,0 +1,415 @@
# Testing Patterns

**Analysis Date:** 2026-01-26

## Status

**Note:** This codebase is in the planning phase; no tests have been written yet. These patterns are **prescriptive** for the Mai project and should be applied from the first test file forward.

## Test Framework

**Runner:**
- **pytest** - Test discovery and execution
- Version: Latest stable (6.x or higher)
- Config: `pytest.ini` or `pyproject.toml` (create with initial setup)

**Assertion Library:**
- Built-in `assert` statements
- `pytest` fixtures for setup/teardown
- `pytest.raises()` for exception testing

**Run Commands:**
```bash
pytest                              # Run all tests in the tests/ directory
pytest -v                           # Verbose output with test names
pytest -k "test_memory"             # Run tests matching a pattern
pytest --cov=src                    # Coverage report (requires pytest-cov)
pytest --cov=src --cov-report=html  # HTML coverage report
pytest -x                           # Stop on first failure
pytest -s                           # Show print output during tests
```

## Test File Organization

**Location:**
- **Co-located pattern**: Test files live next to source files, i.e. `src/[module]/test_[component].py`
- Alternative: All tests in a single `tests/` directory with a mirrored structure

**Recommended pattern for Mai:**
```
src/
├── memory/
│   ├── __init__.py
│   ├── storage.py
│   └── test_storage.py   # Co-located tests
├── models/
│   ├── __init__.py
│   ├── manager.py
│   └── test_manager.py
└── safety/
    ├── __init__.py
    ├── sandbox.py
    └── test_sandbox.py
```

**Naming:**
- Test files: `test_*.py` or `*_test.py`
- Test classes: `TestComponentName`
- Test functions: `test_specific_behavior_with_context`
- Example: `test_retrieves_conversation_history_within_token_limit`

**Test Organization:**
- One test class per component under test
- Group related tests in a single class
- One assertion per test (or tightly related assertions)

## Test Structure

**Suite Organization:**
```python
import pytest
from src.memory.storage import ConversationStorage


class TestConversationStorage:
    """Test suite for ConversationStorage."""

    @pytest.fixture
    def storage(self) -> ConversationStorage:
        """Provide a storage instance for testing."""
        return ConversationStorage(path=":memory:")  # Use in-memory DB

    @pytest.fixture
    def sample_conversation(self) -> dict:
        """Provide sample conversation data."""
        return {
            "messages": [
                {"role": "user", "content": "Hello"},
                {"role": "assistant", "content": "Hi there"},
            ]
        }

    def test_stores_and_retrieves_conversation(self, storage, sample_conversation):
        """Test that conversations can be stored and retrieved."""
        conversation_id = storage.store(sample_conversation)
        retrieved = storage.get(conversation_id)
        assert retrieved == sample_conversation

    def test_raises_error_on_missing_conversation(self, storage):
        """Test that missing conversations raise an appropriate error."""
        # Use KeyError (or a custom error) -- not the built-in MemoryError,
        # which Python reserves for out-of-memory conditions.
        with pytest.raises(KeyError):
            storage.get("nonexistent_id")
```

**Patterns:**

- **Setup pattern**: Use `@pytest.fixture` for setup; avoid `setUp()` methods
- **Teardown pattern**: Use fixture cleanup (yield pattern)
- **Assertion pattern**: One logical assertion per test (may involve multiple `assert` statements on related data)

```python
@pytest.fixture
def model_manager():
    """Set up model manager and clean up after the test."""
    manager = ModelManager()
    manager.initialize()
    yield manager
    manager.shutdown()  # Cleanup


def test_loads_available_models(model_manager):
    """Test model discovery and loading."""
    models = model_manager.list_available()
    assert len(models) > 0
    assert all(isinstance(m, str) for m in models)
```

## Async Testing

**Pattern:**
```python
import asyncio

import pytest


@pytest.mark.asyncio
async def test_async_model_invocation():
    """Test async model inference."""
    manager = ModelManager()
    response = await manager.generate("test prompt")
    assert len(response) > 0
    assert isinstance(response, str)


@pytest.mark.asyncio
async def test_concurrent_memory_access():
    """Test that memory handles concurrent access."""
    storage = ConversationStorage()
    tasks = [
        storage.store({"id": i, "text": f"msg {i}"})
        for i in range(10)
    ]
    ids = await asyncio.gather(*tasks)
    assert len(ids) == 10
```

- Use the `@pytest.mark.asyncio` decorator (provided by the `pytest-asyncio` plugin)
- Use `async def` for the test function signature
- Use `await` for async calls
- Async and sync fixtures can be mixed

## Mocking

**Framework:** `unittest.mock` (Python standard library)

**Patterns:**

```python
from unittest.mock import Mock, AsyncMock, patch
import pytest


def test_handles_model_error():
    """Test error handling when the model fails."""
    mock_model = Mock()
    mock_model.generate.side_effect = RuntimeError("Model offline")

    manager = ModelManager(model=mock_model)
    with pytest.raises(ModelError):
        manager.invoke("prompt")


@pytest.mark.asyncio
async def test_retries_on_transient_failure():
    """Test retry logic for transient failures."""
    mock_api = AsyncMock()
    mock_api.call.side_effect = [
        Exception("Temporary failure"),
        "success",
    ]

    result = await retry_with_backoff(mock_api.call, max_retries=2)
    assert result == "success"
    assert mock_api.call.call_count == 2


@patch("src.models.manager.requests.get")
def test_fetches_model_list(mock_get):
    """Test fetching the model list from the API."""
    mock_get.return_value.json.return_value = {"models": ["model1", "model2"]}

    manager = ModelManager()
    models = manager.get_remote_models()
    assert models == ["model1", "model2"]
```

**What to Mock:**
- External API calls (Discord, LMStudio API)
- Database operations (SQLite in production; use in-memory for tests)
- File I/O (use temporary directories)
- Slow operations (model inference can be stubbed)
- System resources (CPU, RAM monitoring); see the sketch below

**What NOT to Mock:**
- Core business logic (the logic you're testing)
- Data structure operations (dict, list operations)
- Internal module calls within the same component
- Internal helper functions

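For the system-resources case, a small sketch of patching a hypothetical `ResourceMonitor` from `src/models/resource_monitor.py`. The class name, `available_ram_gb()` method, `select_model()` call, and `size_gb` attribute are illustrative assumptions:

```python
# Hypothetical sketch: the resource_monitor API shown here is assumed
from unittest.mock import patch

from src.models.manager import ModelManager  # as in the examples above


@patch("src.models.resource_monitor.ResourceMonitor.available_ram_gb")
def test_selects_small_model_when_ram_is_low(mock_ram):
    """Test that model selection respects a simulated RAM ceiling."""
    mock_ram.return_value = 2.0  # pretend only 2 GB of RAM is free

    manager = ModelManager()
    model = manager.select_model()
    assert model.size_gb <= 2.0  # assumed attribute on the selected model
```
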
## Fixtures and Factories

**Test Data Pattern:**

```python
# conftest.py - shared fixtures
import pytest
from pathlib import Path
from src.memory.storage import ConversationStorage


@pytest.fixture
def temp_db():
    """Provide a temporary SQLite database."""
    db_path = Path("/tmp/test_mai.db")
    yield db_path
    if db_path.exists():
        db_path.unlink()


@pytest.fixture
def conversation_factory():
    """Factory for creating test conversations."""
    def _make_conversation(num_messages: int = 3) -> dict:
        messages = []
        for i in range(num_messages):
            role = "user" if i % 2 == 0 else "assistant"
            messages.append({
                "role": role,
                "content": f"Message {i+1}",
                # Wrap at 24 so the hour stays valid for long conversations
                "timestamp": f"2026-01-26T{i % 24:02d}:00:00Z"
            })
        return {"messages": messages}
    return _make_conversation


def test_stores_long_conversation(temp_db, conversation_factory):
    """Test storing conversations with many messages."""
    storage = ConversationStorage(path=temp_db)
    long_convo = conversation_factory(num_messages=100)

    conv_id = storage.store(long_convo)
    retrieved = storage.get(conv_id)
    assert len(retrieved["messages"]) == 100
```

**Location:**
- Shared fixtures: `tests/conftest.py` (pytest auto-discovers them)
- Component-specific fixtures: In test files or in subdirectory `conftest.py` files
- Factories: In `tests/factories.py` or within `conftest.py`

## Coverage

**Requirements:**
- **Target: 80% code coverage minimum** for core modules
- Critical paths (safety, memory, inference): 90%+ coverage
- UI/CLI: 70% (lower due to interaction complexity)

**View Coverage:**
```bash
pytest --cov=src --cov-report=term-missing
pytest --cov=src --cov-report=html
# Then open htmlcov/index.html in a browser
pytest --cov=src --cov-fail-under=80  # Fail the run below the 80% target
```

**Configure in `pyproject.toml`:**
```toml
[tool.pytest.ini_options]
testpaths = ["src", "tests"]
addopts = "--cov=src --cov-report=term-missing --cov-report=html"
```

## Test Types

**Unit Tests:**
- Scope: Single function or class method
- Dependencies: Mocked
- Speed: Fast (<100ms per test)
- Location: `test_component.py` in the source directory
- Example: `test_tokenizer_splits_input_correctly`

**Integration Tests:**
- Scope: Multiple components working together
- Dependencies: Real services (in-memory DB, local files)
- Speed: Medium (100ms - 1s per test)
- Location: `tests/integration/test_*.py`
- Example: `test_conversation_engine_with_memory_retrieval`

```python
# tests/integration/test_conversation_flow.py
@pytest.mark.asyncio
async def test_full_conversation_with_memory():
    """Test the complete conversation flow, including memory retrieval."""
    memory = ConversationStorage(path=":memory:")
    engine = ConversationEngine(memory=memory)

    # Store context
    memory.store({"id": "ctx1", "content": "User prefers Python"})

    # Have a conversation
    response = await engine.chat("What language should I use?")

    # Verify the context was used (case-insensitive match)
    assert "python" in response.lower()
```

**E2E Tests:**
- Scope: Full system, end to end
- Framework: **Not required for v1** (added in v2)
- Would test: CLI input → Model → Discord output
- Deferred until the Discord/CLI interfaces are complete

## Common Patterns

**Error Testing:**
```python
def test_invalid_input_raises_validation_error():
    """Test that validation catches malformed input."""
    with pytest.raises(ValueError) as exc_info:
        storage.store({"invalid": "structure"})
    assert "missing required field" in str(exc_info.value)


def test_logs_error_details():
    """Test that errors log useful debugging info."""
    with patch("src.logger") as mock_logger:
        try:
            risky_operation()
        except OperationError:
            pass
        mock_logger.error.assert_called_once()
        call_args = mock_logger.error.call_args
        assert "operation_id" in str(call_args)
```

**Performance Testing:**
```python
def test_memory_retrieval_within_performance_budget(benchmark):
    """Test that memory queries complete within the time budget."""
    storage = ConversationStorage()
    query = "what did we discuss earlier"

    result = benchmark(storage.retrieve_similar, query)
    assert len(result) > 0

# Run with: pytest --benchmark-only (requires the pytest-benchmark plugin)
```

**Data Validation Testing:**
```python
@pytest.mark.parametrize("input_val,expected", [
    ("hello", "hello"),
    ("HELLO", "hello"),
    (" hello ", "hello"),
    ("", ValueError),
])
def test_normalizes_input(input_val, expected):
    """Test input normalization with multiple cases."""
    if isinstance(expected, type) and issubclass(expected, Exception):
        with pytest.raises(expected):
            normalize(input_val)
    else:
        assert normalize(input_val) == expected
```

## Configuration

**pytest.ini (create at project root):**
```ini
[pytest]
testpaths = src tests
addopts = -v --tb=short --strict-markers
markers =
    asyncio: marks async tests
    slow: marks slow tests
    integration: marks integration tests
```

**Alternative: pyproject.toml:**
```toml
[tool.pytest.ini_options]
testpaths = ["src", "tests"]
addopts = "-v --tb=short"
markers = [
    "asyncio: async test",
    "slow: slow test",
    "integration: integration test",
]
```

## Test Execution in CI/CD

**GitHub Actions workflow (when created):**
```yaml
- name: Run tests
  run: pytest --cov=src --cov-report=xml

- name: Upload coverage
  uses: codecov/codecov-action@v3
  with:
    files: ./coverage.xml
```

---

*Testing guide: 2026-01-26*
*Status: Prescriptive for Mai v1 implementation*