diff --git a/.planning/codebase/ARCHITECTURE.md b/.planning/codebase/ARCHITECTURE.md new file mode 100644 index 0000000..0fdf66e --- /dev/null +++ b/.planning/codebase/ARCHITECTURE.md @@ -0,0 +1,177 @@ +# Architecture + +**Analysis Date:** 2026-01-26 + +## Pattern Overview + +**Overall:** Layered modular architecture with clear separation of concerns + +**Key Characteristics:** +- Modular layer separation (Model Interface, Memory, Conversation, Interfaces, Safety, Core Personality) +- Local-first, offline-capable design with graceful degradation +- Plugin-like interface system allowing CLI and Discord without tight coupling +- Sandboxed execution environment for self-improvement code +- Bidirectional feedback loops between conversation, memory, and personality + +## Layers + +**Model Interface (Inference Layer):** +- Purpose: Abstract model inference operations and handle model switching +- Location: `src/models/` +- Contains: Model adapters, resource monitoring, context management +- Depends on: Local Ollama/LMStudio, system resource API +- Used by: Conversation engine, core Mai reasoning + +**Memory System (Persistence Layer):** +- Purpose: Store and retrieve conversation history, patterns, learned behaviors +- Location: `src/memory/` +- Contains: SQLite operations, vector search, compression logic, pattern extraction +- Depends on: Local SQLite database, embeddings generation +- Used by: Conversation engine for context retrieval, personality learning + +**Conversation Engine (Reasoning Layer):** +- Purpose: Orchestrate multi-turn conversations with context awareness +- Location: `src/conversation/` +- Contains: Turn handling, context window management, clarifying question logic, reasoning transparency +- Depends on: Model Interface, Memory System, Personality System +- Used by: Interface layers (CLI, Discord) + +**Personality System (Behavior Layer):** +- Purpose: Enforce core values and enable personality adaptation +- Location: `src/personality/` +- Contains: Core personality rules, learned behavior layers, guardrails, values enforcement +- Depends on: Configuration files (YAML), Memory System for learned patterns +- Used by: Conversation Engine for decision making and refusal logic + +**Safety & Execution Sandbox (Security Layer):** +- Purpose: Validate and execute generated code safely with risk assessment +- Location: `src/safety/` +- Contains: Risk analysis, Docker sandbox management, AST validation, audit logging +- Depends on: Docker runtime, code analysis libraries +- Used by: Self-improvement system for generated code execution + +**Self-Improvement System (Autonomous Layer):** +- Purpose: Analyze own code, generate improvements, manage review and approval workflow +- Location: `src/selfmod/` +- Contains: Code analysis, improvement generation, review coordination, git integration +- Depends on: Safety layer, second-agent review API, git operations, code parser +- Used by: Core Mai autonomous operation + +**Interface Adapters (Presentation Layer):** +- Purpose: Translate between external communication channels and core conversation engine +- Location: `src/interfaces/` +- Contains: CLI handler, Discord bot, message queuing, approval workflow +- Depends on: Conversation Engine, self-improvement system +- Used by: External communication channels (terminal, Discord) + +## Data Flow + +**Conversation Flow:** + +1. User message arrives via interface (CLI or Discord) +2. Message is queued if the interface is offline, or held in memory for immediate processing if online +3. Interface adapter passes to Conversation Engine +4. Conversation Engine queries Memory System for relevant context +5. Context + message passed to Model Interface with system prompt (includes personality) +6. Model generates response +7. Response returned to Conversation Engine +8. Conversation Engine stores turn in Memory System +9. Response sent back through interface to user +10. Memory System may trigger asynchronous compression if history grows
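+ +To make the flow concrete, here is a minimal sketch of the turn-handling loop described above. The class and method names (`ConversationEngine.handle_turn`, `retrieve_context`, `store_turn`, `system_prompt`) are illustrative assumptions, not the final API: + +```python +# Minimal sketch of the conversation flow above; names are assumptions. +class ConversationEngine: +    def __init__(self, memory, model, personality): +        self.memory = memory  # Memory System (persistence layer) +        self.model = model  # Model Interface (inference layer) +        self.personality = personality  # Personality System (behavior layer) + +    async def handle_turn(self, user_message: str) -> str: +        # Steps 4-5: gather relevant context, then build the prompt around the system prompt +        context = await self.memory.retrieve_context(user_message) +        prompt = self.personality.system_prompt() + context + user_message +        # Steps 6-7: generate the response via the Model Interface +        response = await self.model.generate(prompt) +        # Step 8: persist the completed turn before returning it to the interface (step 9) +        await self.memory.store_turn(user_message, response) +        return response +```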
+ +**Self-Improvement Flow:** + +1. Self-Improvement System analyzes own code (triggered by timer or explicit request) +2. Generates potential improvements as Python code patches +3. Performs AST validation and basic static analysis +4. Submits for second-agent review with risk classification +5. If LOW risk: auto-approved, sent to Safety layer for execution +6. If MEDIUM risk: user approval required via CLI or Discord reactions +7. If HIGH/BLOCKED risk: blocked, logged, user notified +8. Approved changes executed in Docker sandbox with resource limits +9. Execution results captured, logged, committed to git with clear message +10. Breaking changes require explicit user approval before commit + +**State Management:** +- Conversation state: Maintained in Memory System as persisted history +- Model state: Loaded fresh per request, no state persistence between calls +- Personality state: Mix of code-enforced rules and learned behavior layers in Memory +- Resource state: Monitored continuously, triggering model downgrade if limits approached +- Approval state: Tracked in git commits, audit log, and in-memory queue + +## Key Abstractions + +**ModelAdapter:** +- Purpose: Abstract different model providers (Ollama local models) +- Examples: `src/models/ollama_adapter.py`, `src/models/model_manager.py` +- Pattern: Strategy pattern with resource-aware selection logic + +**ContextWindow:** +- Purpose: Manage token budget and conversation history within model limits +- Examples: `src/conversation/context_manager.py` +- Pattern: Intelligent windowing with semantic importance weighting + +**MemoryStore:** +- Purpose: Unified interface to conversation history, patterns, and learned behaviors +- Examples: `src/memory/store.py`, `src/memory/vector_search.py` +- Pattern: Repository pattern with multiple index types + +**PersonalityRules:** +- Purpose: Encode Mai's core values as evaluable constraints +- Examples: `src/personality/core_rules.py`, `config/personality.yaml` +- Pattern: Rule engine with value-based decision making + +**SandboxExecutor:** +- Purpose: Execute generated code safely with resource limits and audit trail +- Examples: `src/safety/executor.py`, `src/safety/risk_analyzer.py` +- Pattern: Facade wrapping Docker API with security checks + +**ApprovalWorkflow:** +- Purpose: Coordinate user and agent approval for code changes +- Examples: `src/interfaces/approval_handler.py`, `src/selfmod/reviewer.py` +- Pattern: State machine with async notification coordination (see the sketch below)
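+ +The ApprovalWorkflow state machine can be pictured with a sketch like the following; the state names and the transition table are assumptions for illustration, not the final implementation: + +```python +# Illustrative sketch of the ApprovalWorkflow state machine (assumed names). +from enum import Enum, auto + +class ApprovalState(Enum): +    PENDING = auto()  # change proposed, awaiting risk classification +    AUTO_APPROVED = auto()  # LOW risk: proceeds without user input +    AWAITING_USER = auto()  # MEDIUM risk: CLI prompt or Discord reaction required +    BLOCKED = auto()  # HIGH/BLOCKED risk: logged and refused + +def next_state(risk_level: str) -> ApprovalState: +    """Map a risk classification onto the next workflow state.""" +    transitions = { +        "LOW": ApprovalState.AUTO_APPROVED, +        "MEDIUM": ApprovalState.AWAITING_USER, +        "HIGH": ApprovalState.BLOCKED, +        "BLOCKED": ApprovalState.BLOCKED, +    } +    return transitions.get(risk_level, ApprovalState.BLOCKED)  # fail closed +```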
+ +## Entry Points + +**CLI Entry:** +- Location: `src/interfaces/cli.py` / `__main__.py` +- Triggers: `python -m mai` or `mai` command +- Responsibilities: Initialize conversation session, handle user input loop, display responses, manage approval prompts + +**Discord Entry:** +- Location: `src/interfaces/discord_bot.py` +- Triggers: Discord message events +- Responsibilities: Extract message context, route to conversation engine, format response, handle reactions for approvals + +**Self-Improvement Entry:** +- Location: `src/selfmod/scheduler.py` +- Triggers: Timer-based (periodic analysis) or explicit trigger from conversation +- Responsibilities: Analyze code, generate improvements, initiate review workflow + +**Core Mai Entry:** +- Location: `src/mai.py` (main class) +- Triggers: System startup +- Responsibilities: Initialize all systems (models, memory, personality), coordinate between layers + +## Error Handling + +**Strategy:** Graceful degradation with clear user communication + +**Patterns:** +- Model unavailable: Fall back to smaller model if available, notify user of reduced capabilities +- Memory retrieval failure: Continue conversation without historical context, log error +- Network error: Queue offline messages, retry on reconnection (Discord only) +- Unsafe code generated: Block execution, log with risk analysis, notify user +- Syntax error in generated code: Reject change, log, generate new proposal + +## Cross-Cutting Concerns + +**Logging:** Structured logging with severity levels throughout codebase. Use Python `logging` module with JSON formatter for production. Log all: model selections, memory operations, safety decisions, approval workflows, code changes. + +**Validation:** Input validation at interface boundaries. AST validation for generated code. Type hints throughout codebase with mypy enforcement. + +**Authentication:** None required for local CLI. Discord bot authenticated via token (environment variable). API calls between services use simple function calls (single-process model). + +--- + +*Architecture analysis: 2026-01-26* diff --git a/.planning/codebase/CONCERNS.md b/.planning/codebase/CONCERNS.md new file mode 100644 index 0000000..be209cb --- /dev/null +++ b/.planning/codebase/CONCERNS.md @@ -0,0 +1,297 @@ +# Codebase Concerns + +**Analysis Date:** 2026-01-26 + +## Tech Debt + +**Incomplete Memory System Integration:** +- Issue: Memory manager initializes gracefully but may fail silently when dependencies are missing +- Files: `src/mai/memory/manager.py` +- Impact: Memory features degrade silently; users don't know compression or retrieval is disabled +- Fix approach: Add explicit logging and health checks on startup, expose memory system status in CLI + +**Large Monolithic Memory Manager:** +- Issue: MemoryManager is 1036 lines with multiple responsibilities (storage, compression, retrieval orchestration) +- Files: `src/mai/memory/manager.py` +- Impact: Difficult to test individual memory subsystems; changes affect multiple concerns simultaneously +- Fix approach: Extract retrieval delegation and compression orchestration into separate coordinator classes + +**Conversation Engine Complexity:** +- Issue: ConversationEngine is 648 lines handling timing, state, decomposition, reasoning, interruption, and metrics +- Files: `src/mai/conversation/engine.py` +- Impact: High cognitive load for maintainers; hard to isolate bugs in specific subsystems +- Fix approach: Separate concerns into a focused orchestrator (engine) and behavior modules (timing/reasoning/decomposition are already separated but loosely coupled) + +**Permission/Approval System Fragility:** +- Issue: ApprovalSystem uses regex pattern matching for risk analysis with hardcoded patterns +- Files: `src/mai/sandbox/approval_system.py` +- Impact: Pattern-matching approach is fragile (false positives/negatives); patterns not maintainable as code evolves +- Fix approach: Replace regex with AST-based code analysis for more reliable risk detection (see the sketch below); move risk patterns to configuration
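+ +As a starting point for that fix approach, a minimal AST-based check might look like the sketch below; the function name and the banned-call tables are assumptions (real rules would live in configuration): + +```python +# Sketch of AST-based risk detection; assumed names, not the production analyzer. +import ast + +HIGH_RISK_CALLS = {"eval", "exec", "compile", "__import__"} +HIGH_RISK_ATTRS = {("os", "system"), ("subprocess", "run"), ("subprocess", "Popen")} + +def assess_risk(source: str) -> str: +    """Classify code as LOW/HIGH/BLOCKED by walking its AST.""" +    try: +        tree = ast.parse(source) +    except SyntaxError: +        return "BLOCKED"  # unparseable code is rejected outright +    for node in ast.walk(tree): +        if isinstance(node, ast.Call): +            func = node.func +            if isinstance(func, ast.Name) and func.id in HIGH_RISK_CALLS: +                return "HIGH" +            if (isinstance(func, ast.Attribute) +                    and isinstance(func.value, ast.Name) +                    and (func.value.id, func.attr) in HIGH_RISK_ATTRS): +                return "HIGH" +    return "LOW" +``` + +Unlike regex matching, this survives reformatting and whitespace tricks, though dynamically constructed calls still require the deobfuscation step recommended under Security Considerations below.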
+ +**Docker Executor Dependency Chain:** +- Issue: DockerExecutor falls back silently to unavailable state if Docker isn't installed +- Files: `src/mai/sandbox/docker_executor.py` +- Impact: Approval system assumes code is sandboxed when Docker is missing, creating a false sense of security +- Fix approach: Require explicit Docker availability check at startup; block code execution if Docker unavailable and user requests sandboxing + +## Known Bugs + +**Session Persistence Restoration:** +- Symptoms: "ConversationState object has no attribute 'set_conversation_history'" error when restarting CLI +- Files: `src/mai/conversation/state.py`, `src/app/__main__.py` +- Trigger: Start conversation, exit CLI, restart CLI session +- Workaround: None - session restoration broken; users lose conversation history +- Status: Identified in Phase 6 UAT but remediation code not applied (commit c70ee88 "Complete fresh slate" removed implementation) + +**Session File Feedback Missing:** +- Symptoms: Users don't see where/when session files are created +- Files: `src/app/__main__.py` +- Trigger: Create new session or use /session command +- Workaround: Manually check for ~/.mai/session.json +- Status: Identified in Phase 6 UAT as major issue (test 3 failed) + +**Resource Display Color Coding:** +- Symptoms: Resource monitoring displays plain text instead of color-coded status indicators +- Files: `src/app/__main__.py` +- Trigger: Run CLI and observe resource display during conversation +- Workaround: Parse output manually to understand resource status +- Status: Identified in Phase 6 UAT as minor issue (test 5 failed); root cause: Rich console loses color output in non-terminal environments + +## Security Considerations + +**Approval System Risk Analysis Insufficient:** +- Risk: Regex-based risk detection can be bypassed with obfuscated code (e.g., string concatenation to build dangerous commands) +- Files: `src/mai/sandbox/approval_system.py` +- Current mitigation: Hardcoded high-risk patterns (os.system, exec, eval); fallback to block on unrecognized patterns +- Recommendations: + - Implement AST-based code analysis for more reliable detection + - Add code deobfuscation step before risk analysis + - Create risk assessment database with test cases and known bypasses + - Require explicit Docker verification before allowing code execution + +**Docker Fallback Security Gap:** +- Risk: Code could execute without actual sandboxing if Docker unavailable, creating a false sense of security +- Files: `src/mai/sandbox/docker_executor.py` +- Current mitigation: AuditLogger records all execution; approval system presents requests regardless +- Recommendations: + - Fail-safe: Block code execution if Docker unavailable and the user hasn't explicitly allowed non-sandboxed execution (sketched below) + - Add warning dialog explaining sandbox unavailability + - Log all non-sandboxed execution attempts explicitly + - Require explicit override from user with confirmation
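+ +A fail-safe availability check might look like the following sketch using docker-py; `run_sandboxed`, the image name, and the limit values are illustrative assumptions, not the shipped configuration: + +```python +# Sketch of a fail-safe Docker check (docker-py); names and limits are illustrative. +import docker +from docker.errors import DockerException + +def run_sandboxed(code: str) -> str: +    """Refuse to execute rather than silently fall back when Docker is absent.""" +    try: +        client = docker.from_env() +        client.ping()  # raises if the daemon is unreachable +    except DockerException as exc: +        raise RuntimeError("Sandbox unavailable; blocking execution") from exc +    logs = client.containers.run( +        "python:3.12-slim",  # pinned sandbox image (placeholder) +        ["python", "-c", code], +        network_disabled=True,  # no network inside the sandbox +        mem_limit="256m",  # cap memory +        nano_cpus=1_000_000_000,  # cap CPU at one core +        remove=True,  # clean up the container afterwards +    ) +    return logs.decode() +```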
+ +**Approval Preference Learning Risk:** +- Risk: User can set "auto_allow" on risky code patterns; once learned, code execution auto-approves without user intervention +- Files: `src/mai/sandbox/approval_system.py` (lines with `user_preferences` and `auto_allow`) +- Current mitigation: Auto-allow only applies to LOW risk level code +- Recommendations: + - Require explicit user confirmation before enabling auto-allow (not just responding "a") + - Log all auto-approved executions in audit trail with reason + - Add periodic review mechanism for auto-allow rules (e.g., "You have X auto-approved rules, review them?" on startup) + - Restrict auto-allow to strictly limited operation types (print, basic math, not file operations) + +## Performance Bottlenecks + +**Memory Retrieval Search Not Optimized:** +- Problem: ContextRetriever does full database scans for semantic similarity without indexing +- Files: `src/mai/memory/retrieval.py` +- Cause: Vector similarity search likely using brute-force nearest-neighbor without FAISS or similar +- Improvement path: + - Add FAISS vector index for semantic search acceleration + - Implement result caching for frequent queries + - Add search result pagination to avoid loading entire result sets + - Benchmark retrieval latency and set targets (e.g., <500ms for top-10 similar conversations) + +**Conversation State History Accumulation:** +- Problem: ConversationState.conversation_history grows unbounded during long sessions +- Files: `src/mai/conversation/state.py` +- Cause: No automatic truncation or archival of old turns; all conversation turns kept in memory +- Improvement path: + - Implement sliding window of recent turns (e.g., keep last 50 turns in memory) + - Archive old turns to disk and load on demand + - Add compression trigger at configurable message count + - Monitor memory usage and alert when conversation history exceeds threshold + +**Memory Manager Compression Not Scheduled:** +- Problem: Manual `compress_conversation()` calls required; no automatic compression scheduling +- Files: `src/mai/memory/manager.py` +- Cause: Compression is triggered manually or not at all; no background task or event-driven compression +- Improvement path: + - Implement background compression task triggered by conversation age or message count + - Add periodic compression sweep for all old conversations + - Make compression interval configurable (e.g., compress every 500 messages or 24 hours) + - Track compression effectiveness and adjust thresholds + +## Fragile Areas + +**Ollama Integration Dependency:** +- Files: `src/mai/model/ollama_client.py`, `src/mai/core/interface.py` +- Why fragile: Hard-coded Ollama endpoint assumption; no fallback model provider; no retry logic for model inference +- Safe modification: + - Use dependency injection for model provider (interface-based) + - Add configurable model provider endpoints + - Implement retry logic with exponential backoff for transient failures + - Add model availability detection at startup +- Test coverage: Limited tests for model switching and unavailability scenarios + +**Git Integration Fragility:** +- Files: `src/mai/git/committer.py`, `src/mai/git/workflow.py` +- Why fragile: Assumes clean git state; no handling for merge conflicts, detached HEAD, or dirty working directory +- Safe modification: + - Add pre-commit git status validation + - Handle merge conflict detection and defer commits + - Implement conflict resolution strategy (manual review or aborting) + - Test against all git states (detached HEAD, dirty working tree, conflicted merge) +- Test coverage: No tests for edge cases like merge conflicts + +**Conversation State Serialization Round-Trip:** +- Files: `src/mai/conversation/state.py`, `src/mai/models/conversation.py` +- Why fragile: ConversationTurn -> Ollama message -> ConversationTurn conversion can lose context +- Safe modification: + - Add comprehensive unit tests for serialization round-trip + - Document serialization format and invariants + - Add validation after deserialization (verify message count, order, role integrity) + - Create fixture tests with edge cases (unicode, very long messages,
special characters) +- Test coverage: No existing tests for message serialization/deserialization + +**Docker Configuration Hardcoding:** +- Files: `src/mai/sandbox/docker_executor.py` +- Why fragile: Docker image names, CPU limits, memory limits hardcoded as class constants +- Safe modification: + - Move Docker config to configuration file + - Add validation on startup that Docker limits match system resources + - Document all Docker configuration assumptions + - Make limits tunable per system resource profile +- Test coverage: Docker integration tests likely mocked; no testing on actual Docker variations + +## Scaling Limits + +**Memory Database Size Growth:** +- Current capacity: SQLite with no explicit limits; storage grows with every conversation +- Limit: SQLite performance degrades significantly above ~1GB; queries become slow +- Scaling path: + - Implement database rotation (archive old conversations, start new DB periodically) + - Add migration path to PostgreSQL for production deployments + - Implement automatic old conversation archival (move to cold storage after 30 days) + - Add database vacuum and index optimization on scheduled basis + +**Conversation Context Window Management:** +- Current capacity: Model context window determined by Ollama model selection (varies) +- Limit: ConversationEngine doesn't prevent context overflow; will fail when history exceeds model limit +- Scaling path: + - Track token count of conversation history and refuse new messages before overflow + - Implement automatic compression trigger at 80% context usage + - Add model switching logic to use larger-context models if available + - Document context budget requirements per model + +**Approval History Unbounded Growth:** +- Current capacity: ApprovalSystem.approval_history list grows indefinitely +- Limit: Memory accumulation over time; each approval decision stored in memory forever +- Scaling path: + - Archive approval history to database after threshold (e.g., 1000 decisions) + - Implement approval history rotation with configurable retention + - Add aggregate statistics (approval patterns) instead of storing raw history + - Clean up approval history on startup or scheduled task + +## Dependencies at Risk + +**Ollama Dependency and Model Availability:** +- Risk: Hard requirement on Ollama being available and having models installed +- Impact: Mai cannot function without Ollama; no fallback to cloud inference or other providers +- Migration plan: + - Implement abstract model provider interface + - Add support for OpenAI/other cloud models as fallback (even if v1 is offline-first) + - Document minimum Ollama model requirements + - Add diagnostic tool to check Ollama health on startup + +**Docker Dependency for Sandboxing:** +- Risk: Docker required for code execution safety; no alternative sandbox implementations +- Impact: Users without Docker can't safely execute generated code; no graceful degradation +- Migration plan: + - Implement abstract executor interface (not just DockerExecutor) + - Add noop executor for testing + - Consider lightweight alternatives (seccomp, chroot, or bubblewrap) for Linux systems + - Add explicit warning if Docker unavailable + +**Rich Library Terminal Detection:** +- Risk: Rich disables colors in non-terminal environments; users see degraded UX +- Impact: Resource monitoring and status displays lack visual feedback in non-terminal contexts +- Migration plan: + - Use Console(force_terminal=True) to force color output when desired + - Add configuration option for color 
preference + - Implement fallback emoji/unicode indicators for non-color environments + - Test in various terminal emulators and SSH sessions + +## Missing Critical Features + +**Session Data Portability:** +- Problem: Session files are JSON but no export/import mechanism; can't backup or migrate sessions +- Blocks: Users can't back up conversations; losing ~/.mai/session.json loses all context +- Fix: Add export/import commands (/export, /import) and document session file format + +**Conversation Memory Persistence:** +- Problem: Conversation history is session-scoped (stored in memory); not saved to memory system +- Blocks: Long-term pattern learning relies on memory system but conversations aren't automatically stored +- Fix: Implement automatic conversation archival to memory system after session ends + +**User Preference Learning Audit Trail:** +- Problem: User preferences for auto-approval learned silently; no visibility into what patterns auto-approve +- Blocks: Users can't audit their own auto-approval rules; hard to recover from accidentally enabling auto-allow +- Fix: Add /preferences or /audit command to show all learned rules and allow revocation + +**Resource Constraint Graceful Degradation:** +- Problem: System shows resource usage but doesn't adapt model selection or conversation behavior +- Blocks: Mai can't suggest switching to smaller models when resources tight +- Fix: Implement resource-aware model recommendation system + +**Approval Change Logging:** +- Problem: Approval decisions not tracked in git; can't audit "who approved what when" +- Blocks: No accountability trail for approval decisions +- Fix: Log all approval decisions to git with commit messages including timestamp and user + +## Test Coverage Gaps + +**Docker Executor Network Isolation:** +- What's not tested: Whether network actually restricted in Docker containers +- Files: `src/mai/sandbox/docker_executor.py` +- Risk: Code might have network access despite supposed isolation +- Priority: High (security-critical) + +**Session Persistence Edge Cases:** +- What's not tested: Very large conversations (1000+ messages), unicode characters, special characters +- Files: `src/mai/conversation/state.py`, session persistence code +- Risk: Session files corrupt or lose data with edge case inputs +- Priority: High (data loss) + +**Approval System Obfuscation Bypass:** +- What's not tested: Obfuscated code patterns, string concatenation attacks, bytecode approaches +- Files: `src/mai/sandbox/approval_system.py` +- Risk: Risky code could slip through as "low risk" via obfuscation +- Priority: High (security-critical) + +**Memory Compression Round-Trip Data Loss:** +- What's not tested: Whether compressed conversations can be exactly reconstructed +- Files: `src/mai/memory/compression.py`, `src/mai/memory/storage.py` +- Risk: Compression could lose important context patterns; compression metrics may be misleading +- Priority: Medium (data integrity) + +**Model Switching During Active Conversation:** +- What's not tested: Switching models mid-conversation, context migration, embedding space changes +- Files: `src/mai/model/switcher.py`, `src/mai/conversation/engine.py` +- Risk: Context might not transfer correctly when models switch +- Priority: Medium (feature reliability) + +**Offline Queue Conflict Resolution:** +- What's not tested: What happens when offline messages conflict with new context when reconnecting +- Files: `src/mai/conversation/engine.py` (offline queueing) +- Risk: Offline messages might create 
incoherent conversation when reconnected +- Priority: Medium (conversation coherence) + +**Resource Detector System Resource Edge Cases:** +- What's not tested: GPU detection on systems with unusual hardware, CPU count on virtual systems +- Files: `src/mai/model/resource_detector.py` +- Risk: Wrong model selection due to misdetected resources +- Priority: Low (graceful degradation usually handles this) + +--- + +*Concerns audit: 2026-01-26* diff --git a/.planning/codebase/CONVENTIONS.md b/.planning/codebase/CONVENTIONS.md new file mode 100644 index 0000000..a6b995b --- /dev/null +++ b/.planning/codebase/CONVENTIONS.md @@ -0,0 +1,298 @@ +# Coding Conventions + +**Analysis Date:** 2026-01-26 + +## Status + +**Note:** This codebase is in the planning phase. No source code has been written yet. These conventions are **prescriptive** for the Mai project and should be applied to all code from the first commit forward. + +## Naming Patterns + +**Files:** +- Python modules: `lowercase_with_underscores.py` (PEP 8) +- Configuration files: `config.yaml`, `.env.example` +- Test files: `test_module_name.py` (co-located with source) +- Example: `src/memory/storage.py`, `src/memory/test_storage.py` + +**Functions:** +- Use `snake_case` for all function names (PEP 8) +- Private functions: Prefix with single underscore `_private_function()` +- Async functions: Use `async def async_operation()` naming +- Example: `def get_conversation_history()`, `async def stream_response()` + +**Variables:** +- Use `snake_case` for all variable names +- Constants: `UPPERCASE_WITH_UNDERSCORES` +- Private module variables: Prefix with `_` +- Example: `conversation_history`, `MAX_CONTEXT_TOKENS`, `_internal_cache` + +**Types:** +- Classes: `PascalCase` +- Enums: `PascalCase` (inherit from `Enum`) +- TypedDict: `PascalCase` with `Dict` suffix +- Example: `class ConversationManager`, `class ErrorLevel(Enum)`, `class MemoryConfigDict(TypedDict)` + +**Directories:** +- Core modules: `src/[module_name]/` (lowercase, plural when appropriate) +- Example: `src/models/`, `src/memory/`, `src/safety/`, `src/interfaces/` + +## Code Style + +**Formatting:** +- Tool: **Ruff** (formatter and linter) +- Line length: 88 characters (Ruff default) +- Quote style: Double quotes (`"string"`) +- Indentation: 4 spaces (no tabs) + +**Linting:** +- Tool: **Ruff** +- Configuration enforced via `.ruff.toml` (when created) +- All imports must pass ruff checks +- No unused imports allowed +- Type hints required for public functions + +**Python Version:** +- Minimum: Python 3.10+ +- Use modern type hints; import `typing` names explicitly (never `from typing import *`) +- Use `str | None` instead of `Optional[str]` (union syntax) + +## Import Organization + +**Order:** +1. Standard library imports (`import os`, `import sys`) +2. Third-party imports (`import discord`, `import numpy`) +3. Local imports (`from src.memory import Storage`) +4.
Blank line between each group + +**Example:** +```python +import asyncio +import json +from pathlib import Path +from typing import Optional + +import discord +from dotenv import load_dotenv + +from src.memory import ConversationStorage +from src.models import ModelManager +``` + +**Path Aliases:** +- Use relative imports from `src/` root +- Avoid deep relative imports (no `../../../`) +- Example: `from src.safety import SandboxExecutor` not `from ...safety import SandboxExecutor` + +## Error Handling + +**Patterns:** +- Define domain-specific exceptions in `src/exceptions.py` +- Use exception hierarchy (base `MaiException`, specific subclasses) +- Always include context in exceptions (error code, details, suggestions) +- Example: + +```python +class MaiException(Exception): +    """Base exception for Mai framework.""" +    def __init__(self, code: str, message: str, details: dict | None = None): +        self.code = code +        self.message = message +        self.details = details or {} +        super().__init__(f"[{code}] {message}") + +class ModelError(MaiException): +    """Raised when model inference fails.""" +    pass + +# Named MemoryStoreError to avoid shadowing Python's builtin MemoryError +class MemoryStoreError(MaiException): +    """Raised when memory operations fail.""" +    pass +``` + +- Log before raising (see Logging section) +- Use context managers for cleanup (async context managers for async code) +- Never catch bare `Exception` - catch specific exceptions + +## Logging + +**Framework:** `logging` module (Python standard library) + +**Patterns:** +- Create logger per module: `logger = logging.getLogger(__name__)` +- Log levels guide: + - `DEBUG`: Detailed diagnostic info (token counts, decision trees) + - `INFO`: Significant operational events (conversation started, model loaded) + - `WARNING`: Unexpected but handled conditions (fallback triggered, retry) + - `ERROR`: Failed operation (model error, memory access failed) + - `CRITICAL`: System-level failures (cannot recover) +- Structured logging preferred (include operation context) +- Example: + +```python +import logging + +logger = logging.getLogger(__name__) + +async def invoke_model(prompt: str, model: str) -> str: +    logger.debug(f"Invoking model={model} with token_count={len(prompt.split())}") +    try: +        response = await model_manager.generate(prompt) +        logger.info(f"Model response generated, length={len(response)}") +        return response +    except ModelError as e: +        logger.error(f"Model invocation failed: {e.code}", exc_info=True) +        raise +``` + +## Comments + +**When to Comment:** +- Complex logic requiring explanation (multi-step algorithms, non-obvious decisions) +- Important context that code alone cannot convey (why a workaround exists) +- Do NOT comment obvious code (`x = 1 # set x to 1` is noise) +- Do NOT duplicate what the code already says + +**Docstrings:** +- Use Google-style docstrings for all public functions/classes +- Include return type even if type hints exist (for readability) +- Example: + +```python +async def get_memory_context( +    query: str, +    max_tokens: int = 2000, +) -> str: +    """Retrieve relevant memory context for a query. + +    Performs vector similarity search on conversation history, +    compresses results to fit token budget, and returns formatted context. + +    Args: +        query: The search query for memory retrieval. +        max_tokens: Maximum tokens in returned context (default 2000). + +    Returns: +        Formatted memory context as markdown-structured string. + +    Raises: +        MemoryStoreError: If database query fails or storage is corrupted.
+ """ +``` + +## Function Design + +**Size:** +- Target: Functions under 50 lines (hard limit: 100 lines) +- Break complex logic into smaller helper functions +- One responsibility per function (single responsibility principle) + +**Parameters:** +- Maximum 4 positional parameters +- Use keyword-only arguments for optional params: `def func(required, *, optional=None)` +- Use dataclasses or TypedDict for complex parameter groups +- Example: + +```python +# Good: Clear structure +async def approve_change( + change_id: str, + *, + reviewer_id: str, + decision: Literal["approve", "reject"], + reason: str | None = None, +) -> None: + pass + +# Bad: Too many params +async def approve_change(change_id, reviewer_id, decision, reason, timestamp, context, metadata): + pass +``` + +**Return Values:** +- Explicitly return values (no implicit `None` returns unless documented) +- Use `Optional[T]` or `T | None` in type hints for nullable returns +- Prefer returning data objects over tuples: return `Result` not `(status, data, error)` +- Async functions return awaitable, not callbacks + +## Module Design + +**Exports:** +- Define `__all__` in each module to be explicit about public API +- Example in `src/memory/__init__.py`: + +```python +from src.memory.storage import ConversationStorage +from src.memory.compression import MemoryCompressor + +__all__ = ["ConversationStorage", "MemoryCompressor"] +``` + +**Barrel Files:** +- Use `__init__.py` to export key classes/functions from submodules +- Keep import chains shallow (max 2 levels deep) +- Example structure: + ``` + src/ + ├── memory/ + │ ├── __init__.py (exports Storage, Compressor) + │ ├── storage.py + │ └── compression.py + ``` + +**Async/Await:** +- All I/O operations (database, API calls, file I/O) must be async +- Use `asyncio` for concurrency, not threading +- Async context managers for resource management: + +```python +async def process_request(prompt: str) -> str: + async with model_manager.get_session() as session: + response = await session.generate(prompt) + return response +``` + +## Type Hints + +**Requirements:** +- All public function signatures must have type hints +- Use `from __future__ import annotations` for forward references +- Prefer union syntax: `str | None` over `Optional[str]` +- Use `Literal` for string enums: `Literal["approve", "reject"]` +- Example: + +```python +from __future__ import annotations +from typing import Literal + +def evaluate_risk(code: str) -> Literal["LOW", "MEDIUM", "HIGH", "BLOCKED"]: + """Evaluate code risk level.""" + pass +``` + +## Configuration + +**Pattern:** +- Use YAML for human-editable config files +- Use environment variables for secrets (never commit `.env`) +- Validation at import time (fail fast if config invalid) +- Example: + +```python +# config.py +import os +from pathlib import Path + +class Config: + DEBUG = os.getenv("DEBUG", "false").lower() == "true" + MODELS_PATH = Path(os.getenv("MODELS_PATH", "~/.mai/models")).expanduser() + MAX_CONTEXT_TOKENS = int(os.getenv("MAX_CONTEXT_TOKENS", "8000")) + + # Validate on import + if not MODELS_PATH.exists(): + raise RuntimeError(f"Models path does not exist: {MODELS_PATH}") +``` + +--- + +*Convention guide: 2026-01-26* +*Status: Prescriptive for Mai v1 implementation* diff --git a/.planning/codebase/INTEGRATIONS.md b/.planning/codebase/INTEGRATIONS.md new file mode 100644 index 0000000..1d011db --- /dev/null +++ b/.planning/codebase/INTEGRATIONS.md @@ -0,0 +1,129 @@ +# External Integrations + +**Analysis Date:** 2026-01-26 + +## APIs & 
External Services + +**Model Inference:** +- LMStudio - Local model server for inference and model switching + - SDK/Client: LMStudio Python API + - Auth: None (local service, no authentication required) + - Configuration: model_path env var, endpoint URL + +- Ollama - Alternative local model management system + - SDK/Client: Ollama REST API (HTTP) + - Auth: None (local service) + - Purpose: Model loading, switching, inference with resource detection + +**Communication & Approvals:** +- Discord - Bot interface for conversation and change approvals + - SDK/Client: discord.py library + - Auth: DISCORD_BOT_TOKEN env variable + - Purpose: Multi-turn conversations, approval reactions (thumbs up/down), status updates + +## Data Storage + +**Databases:** +- SQLite3 (local file-based) + - Connection: Local file path, no remote connection + - Client: Python sqlite3 (stdlib) or SQLAlchemy ORM + - Purpose: Persistent conversation history, memory compression, learned patterns + - Location: Local filesystem (.db files) + +**File Storage:** +- Local filesystem only - Git-tracked code changes, conversation history backups +- No cloud storage integration in v1 + +**Caching:** +- In-memory caching for current conversation context +- Redis: Not used in v1 (local-first constraint) +- Model context window management: Token-based cache within model inference + +## Authentication & Identity + +**Auth Provider:** +- Custom local auth - No external identity provider +- Implementation: + - Discord user ID as conversation context identifier + - Optional local password/PIN for CLI access + - No OAuth/cloud identity providers (offline-first requirement) + +## Monitoring & Observability + +**Error Tracking:** +- None (local only, no error reporting service) +- Local audit logging to SQLite instead + +**Logs:** +- File-based logging to `.logs/` directory +- Format: Structured JSON logs with timestamp, level, context +- Rotation: Size-based or time-based rotation strategy +- No external log aggregation (offline-first) + +## CI/CD & Deployment + +**Hosting:** +- Local machine only (desktop/laptop with RTX 3060+) +- No cloud hosting in v1 + +**CI Pipeline:** +- GitHub Actions for Discord webhook on push + - Workflow: `.github/workflows/discord_sync.yml` + - Trigger: Push events + - Action: POST to Discord webhook for notification + +**Git Integration:** +- All Mai's self-modifications committed automatically with git +- Local git repo tracking all code changes +- Commit messages include decision context and review results + +## Environment Configuration + +**Required env vars:** +- `DISCORD_BOT_TOKEN` - Discord bot authentication +- `LMSTUDIO_ENDPOINT` - LMStudio API URL (default: localhost:8000) +- `OLLAMA_ENDPOINT` - Ollama API URL (optional alternative, default: localhost:11434) +- `DISCORD_USER_ID` - User Discord ID for approval requests +- `MEMORY_DB_PATH` - SQLite database file location +- `MODEL_CACHE_DIR` - Directory for model files +- `CPU_CORES_AVAILABLE` - System CPU count for resource management +- `GPU_VRAM_AVAILABLE` - VRAM in GB for model selection +- `SANDBOX_DOCKER_IMAGE` - Docker image ID for code sandbox execution + +**Secrets location:** +- `.env` file (Python-dotenv) for local development +- Environment variables for production/runtime +- Git-ignored: `.env` not committed
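+ +A matching `.env.example` template, built only from the variables listed above (all values shown are placeholders to adjust per machine): + +``` +# .env.example - copy to .env and fill in real values (never commit .env) +DISCORD_BOT_TOKEN=replace-with-bot-token +DISCORD_USER_ID=000000000000000000 +LMSTUDIO_ENDPOINT=http://localhost:8000 +OLLAMA_ENDPOINT=http://localhost:11434 +MEMORY_DB_PATH=/path/to/memory.db +MODEL_CACHE_DIR=/path/to/models +CPU_CORES_AVAILABLE=8 +GPU_VRAM_AVAILABLE=12 +SANDBOX_DOCKER_IMAGE=python:3.12-slim +```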
+ +## Webhooks & Callbacks + +**Incoming:** +- Discord message webhooks - Handled by discord.py bot event listeners +- No external webhook endpoints in v1 + +**Outgoing:** +- Discord webhook for git notifications (configured in GitHub Actions) +- Endpoint: Stored in GitHub secrets as WEBHOOK +- Triggered on: git push events +- Payload: Git commit information (author, message, timestamp) + +**Model Callback Handling:** +- LMStudio streaming callbacks for token-by-token responses +- Ollama streaming responses for incremental model output + +## Code Execution Sandbox + +**Sandbox Environment:** +- Docker container with resource limits + - SDK: Docker SDK for Python (docker-py) + - Environment: Isolated Linux container + - Resource limits: CPU cores, RAM, network restrictions + +**Risk Assessment:** +- Multi-level risk evaluation (LOW/MEDIUM/HIGH/BLOCKED) +- AST validation before container execution +- Second-agent review via Claude/OpenCode API + +--- + +*Integration audit: 2026-01-26* diff --git a/.planning/codebase/STACK.md b/.planning/codebase/STACK.md new file mode 100644 index 0000000..d92b938 --- /dev/null +++ b/.planning/codebase/STACK.md @@ -0,0 +1,93 @@ +# Technology Stack + +**Analysis Date:** 2026-01-26 + +## Languages + +**Primary:** +- Python 3.x - Core Mai agent codebase, local model inference, self-improvement system + +**Secondary:** +- YAML - Configuration files for personality, behavior settings +- JSON - Configuration, metadata, API responses +- SQL - Memory storage and retrieval queries + +## Runtime + +**Environment:** +- Python (local execution, no remote runtime) +- LMStudio or Ollama - Local model inference server + +**Package Manager:** +- pip - Python package management +- Lockfile: requirements.txt or poetry.lock (typical Python approach) + +## Frameworks + +**Core:** +- No web framework for v1 (CLI/Discord only) + +**Model Inference:** +- LMStudio Python SDK - Local model switching and inference +- Ollama API - Alternative local model management per requirements + +**Discord Integration:** +- discord.py - Discord bot API client + +**CLI:** +- Click or Typer - Command-line interface building + +**Testing:** +- pytest - Unit/integration test framework +- pytest-asyncio - Async test support for Discord bot testing + +**Build/Dev:** +- Git - Version control for Mai's own code changes +- Docker - Sandbox execution environment for safety + +## Key Dependencies + +**Critical:** +- LMStudio Python Client - Model loading, switching, inference with token management +- discord.py - Discord bot functionality for approval workflows +- SQLite3 - Lightweight persistent storage (Python stdlib) +- Docker SDK for Python - Sandbox execution management + +**Infrastructure:** +- requests - HTTP client for Discord API fallback and Ollama API communication +- PyYAML - Personality configuration parsing +- pydantic - Data validation for internal structures +- python-dotenv - Environment variable management for secrets +- GitPython - Programmatic git operations for committing self-improvements + +## Configuration + +**Environment:** +- .env file - Discord bot token, model paths, resource thresholds +- environment variables - Runtime configuration loaded at startup +- personality.yaml - Core personality values and learned behavior layers +- config.json - Resource limits, model preferences, memory settings + +**Build:** +- setup.py or pyproject.toml - Package metadata and dependency declaration +- Dockerfile - Sandbox execution environment specification +- .dockerignore - Docker build optimization + +## Platform Requirements + +**Development:** +- Python 3.10+ (required for the union type-hint syntax mandated in CONVENTIONS.md, plus async/await) +- Git (for version control and self-modification tracking) +- Docker (for sandbox execution environment)
+- LMStudio or Ollama running locally (for model inference) + +**Production (Runtime):** +- RTX 3060 GPU minimum (per project constraints) +- 16GB+ RAM (for model loading and context management) +- Linux/macOS/Windows with Python 3.10+ +- Docker daemon (for sandboxed code execution) +- Local LMStudio/Ollama instance (no cloud models) + +--- + +*Stack analysis: 2026-01-26* diff --git a/.planning/codebase/STRUCTURE.md b/.planning/codebase/STRUCTURE.md new file mode 100644 index 0000000..0a7835f --- /dev/null +++ b/.planning/codebase/STRUCTURE.md @@ -0,0 +1,258 @@ +# Codebase Structure + +**Analysis Date:** 2026-01-26 + +## Directory Layout + +``` +mai/ +├── src/ +│ ├── __main__.py # CLI entry point +│ ├── mai.py # Core Mai class, orchestration +│ ├── models/ +│ │ ├── __init__.py +│ │ ├── adapter.py # Base model adapter interface +│ │ ├── ollama_adapter.py # Ollama/LMStudio implementation +│ │ ├── model_manager.py # Model selection and switching logic +│ │ └── resource_monitor.py # CPU, RAM, GPU tracking +│ ├── memory/ +│ │ ├── __init__.py +│ │ ├── store.py # SQLite conversation store +│ │ ├── vector_search.py # Semantic similarity search +│ │ ├── compression.py # History compression and summarization +│ │ └── pattern_extractor.py # Learning and pattern recognition +│ ├── conversation/ +│ │ ├── __init__.py +│ │ ├── engine.py # Main conversation orchestration +│ │ ├── context_manager.py # Token budget and window management +│ │ ├── turn_handler.py # Single turn processing +│ │ └── reasoning.py # Reasoning transparency and clarification +│ ├── personality/ +│ │ ├── __init__.py +│ │ ├── core_rules.py # Unshakeable core values enforcement +│ │ ├── learned_behaviors.py # Personality adaptation from interactions +│ │ ├── guardrails.py # Safety constraints and refusal logic +│ │ └── config_loader.py # YAML personality configuration +│ ├── safety/ +│ │ ├── __init__.py +│ │ ├── executor.py # Docker sandbox execution wrapper +│ │ ├── risk_analyzer.py # Risk classification (LOW/MEDIUM/HIGH/BLOCKED) +│ │ ├── ast_validator.py # Syntax and import validation +│ │ └── audit_log.py # Immutable execution history +│ ├── selfmod/ +│ │ ├── __init__.py +│ │ ├── analyzer.py # Code analysis and improvement detection +│ │ ├── generator.py # Improvement code generation +│ │ ├── scheduler.py # Periodic and on-demand analysis trigger +│ │ ├── reviewer.py # Second-agent review coordination +│ │ └── git_manager.py # Git commit integration +│ ├── interfaces/ +│ │ ├── __init__.py +│ │ ├── cli.py # CLI chat interface +│ │ ├── discord_bot.py # Discord bot implementation +│ │ ├── message_handler.py # Shared message processing +│ │ ├── approval_handler.py # Change approval workflow +│ │ └── offline_queue.py # Message queueing during disconnection +│ └── utils/ +│ ├── __init__.py +│ ├── config.py # Configuration loading +│ ├── logging.py # Structured logging setup +│ ├── validators.py # Input validation helpers +│ └── helpers.py # Shared utility functions +├── config/ +│ ├── personality.yaml # Core personality configuration +│ ├── models.yaml # Model definitions and resource limits +│ ├── safety_rules.yaml # Risk assessment rules +│ └── logging.yaml # Logging configuration +├── tests/ +│ ├── unit/ +│ │ ├── test_models.py +│ │ ├── test_memory.py +│ │ ├── test_conversation.py +│ │ ├── test_personality.py +│ │ ├── test_safety.py +│ │ └── test_selfmod.py +│ ├── integration/ +│ │ ├── test_conversation_flow.py +│ │ ├── test_selfmod_workflow.py +│ │ └── test_interfaces.py +│ └── fixtures/ +│ ├── mock_models.py +│ ├── test_data.py +│ └──
sample_conversations.json +├── scripts/ +│ ├── setup_ollama.py # Initial model downloading +│ ├── init_db.py # Database schema initialization +│ └── verify_environment.py # Pre-flight checks +├── docker/ +│ └── Dockerfile # Sandbox execution environment +├── .env.example # Environment variables template +├── pyproject.toml # Project metadata and dependencies +├── requirements.txt # Python dependencies +├── pytest.ini # Test configuration +├── Makefile # Development commands +└── README.md # Project overview +``` + +## Directory Purposes + +**src/:** +- Purpose: All application code +- Contains: Python modules organized by architectural layer +- Key files: `mai.py` (core), `__main__.py` (CLI entry) + +**src/models/:** +- Purpose: Model inference abstraction +- Contains: Adapter interfaces, Ollama client, resource monitoring +- Key files: `model_manager.py` (selection logic), `resource_monitor.py` (constraints) + +**src/memory/:** +- Purpose: Persistent storage and retrieval +- Contains: SQLite operations, vector search, compression +- Key files: `store.py` (main interface), `vector_search.py` (semantic search) + +**src/conversation/:** +- Purpose: Multi-turn conversation orchestration +- Contains: Turn handling, context windowing, reasoning transparency +- Key files: `engine.py` (main coordinator), `context_manager.py` (token budget) + +**src/personality/:** +- Purpose: Values enforcement and personality adaptation +- Contains: Core rules, learned behaviors, guardrails +- Key files: `core_rules.py` (unshakeable values), `learned_behaviors.py` (adaptation) + +**src/safety/:** +- Purpose: Code execution sandboxing and risk assessment +- Contains: Docker wrapper, AST validation, risk classification, audit logging +- Key files: `executor.py` (sandbox wrapper), `risk_analyzer.py` (classification) + +**src/selfmod/:** +- Purpose: Autonomous code improvement and review +- Contains: Code analysis, improvement generation, approval workflow +- Key files: `analyzer.py` (detection), `reviewer.py` (second-agent coordination) + +**src/interfaces/:** +- Purpose: External communication adapters +- Contains: CLI handler, Discord bot, approval system +- Key files: `cli.py` (terminal UI), `discord_bot.py` (Discord integration) + +**src/utils/:** +- Purpose: Shared utilities and helpers +- Contains: Configuration loading, logging, validation +- Key files: `config.py` (env/file loading), `logging.py` (structured logs) + +**config/:** +- Purpose: Non-code configuration files +- Contains: YAML personality, models, safety rules definitions +- Key files: `personality.yaml` (core values), `models.yaml` (resource profiles) + +**tests/:** +- Purpose: Test suites organized by type +- Contains: Unit tests (layer isolation), integration tests (flows), fixtures (test data) +- Key files: Each test file mirrors `src/` structure + +**scripts/:** +- Purpose: One-off setup and maintenance scripts +- Contains: Database initialization, environment verification +- Key files: `setup_ollama.py` (first-time model setup) + +**docker/:** +- Purpose: Container configuration for sandboxed execution +- Contains: Dockerfile for isolation environment +- Key files: `Dockerfile` (build recipe) + +## Key File Locations + +**Entry Points:** +- `src/__main__.py`: CLI entry, `python -m mai` launches here +- `src/interfaces/discord_bot.py`: Discord bot main loop +- `src/mai.py`: Core Mai class, system initialization + +**Configuration:** +- `config/personality.yaml`: Core values, interaction patterns, refusal rules +- `config/models.yaml`: 
Available models, resource requirements, context windows +- `.env.example`: Required environment variables template + +**Core Logic:** +- `src/mai.py`: Main orchestration +- `src/conversation/engine.py`: Conversation turn processing +- `src/selfmod/analyzer.py`: Improvement opportunity detection +- `src/safety/executor.py`: Safe code execution + +**Testing:** +- `tests/unit/`: Layer-isolated tests (no dependencies between layers) +- `tests/integration/`: End-to-end flow tests +- `tests/fixtures/`: Mock objects and test data + +## Naming Conventions + +**Files:** +- Module files: `snake_case.py` (e.g., `model_manager.py`) +- Entry points: `__main__.py` for packages, standalone scripts at package root +- Config files: `snake_case.yaml` (e.g., `personality.yaml`) +- Test files: `test_*.py` (e.g., `test_conversation.py`) + +**Directories:** +- Feature areas: `snake_case` (e.g., `src/selfmod/`) +- No abbreviations except `selfmod` (self-modification) which is project standard +- Each layer is a top-level directory under `src/` + +**Functions/Classes:** +- Classes: `PascalCase` (e.g., `ModelManager`, `ConversationEngine`) +- Functions: `snake_case` (e.g., `generate_response()`, `validate_code()`) +- Constants: `UPPER_SNAKE_CASE` (e.g., `MAX_CONTEXT_TOKENS`) +- Private methods/functions: prefix with `_` (e.g., `_internal_method()`) + +**Types:** +- Use type hints throughout: `def process(msg: str) -> str:` +- Complex types in `src/utils/types.py` or local to module + +## Where to Add New Code + +**New Feature (e.g., new communication interface like Slack):** +- Primary code: `src/interfaces/slack_adapter.py` (new adapter following discord_bot.py pattern) +- Tests: `tests/unit/test_slack_adapter.py` and `tests/integration/test_slack_interface.py` +- Configuration: Add to `src/interfaces/__init__.py` imports and `config/interfaces.yaml` if needed +- Entry hook: Modify `src/mai.py` to initialize new adapter + +**New Component/Module (e.g., advanced memory with graph databases):** +- Implementation: `src/memory/graph_store.py` (new module in appropriate layer) +- Interface: Follow existing patterns (e.g., inherit from `src/memory/store.py` base) +- Tests: Corresponding test in `tests/unit/test_memory.py` or new file if complex +- Integration: Modify `src/mai.py` initialization to use new component with feature flag + +**Utilities (e.g., new helper function):** +- Shared helpers: `src/utils/helpers.py` (functions) or new file like `src/utils/math_utils.py` if substantial +- Internal helpers: Keep in the module where used (don't over-extract) +- Tests: Add to `tests/unit/test_utils.py` + +**Configuration:** +- Static rules: Add to appropriate YAML in `config/` +- Dynamic config: Load in `src/utils/config.py` +- Env-driven: Add to `.env.example` with documentation + +## Special Directories + +**tests/fixtures/:** +- Purpose: Reusable test data and mock objects +- Generated: No, hand-created +- Committed: Yes, part of repository + +**config/:** +- Purpose: Non-code configuration +- Generated: No, hand-maintained +- Committed: Yes, except secrets (use `.env`) + +**.env (not committed):** +- Purpose: Local environment overrides and secrets +- Generated: No, copied from `.env.example` and filled locally +- Committed: No (in .gitignore) + +**docker/:** +- Purpose: Sandbox environment for safe execution +- Generated: No, hand-maintained +- Committed: Yes + +--- + +*Structure analysis: 2026-01-26* diff --git a/.planning/codebase/TESTING.md b/.planning/codebase/TESTING.md new file mode 100644 index 
0000000..39f1000 --- /dev/null +++ b/.planning/codebase/TESTING.md @@ -0,0 +1,415 @@ +# Testing Patterns + +**Analysis Date:** 2026-01-26 + +## Status + +**Note:** This codebase is in the planning phase. No tests have been written yet. These patterns are **prescriptive** for the Mai project and should be applied from the first test file forward. + +## Test Framework + +**Runner:** +- **pytest** - Test discovery and execution +- Version: Latest stable (6.x or higher) +- Config: `pytest.ini` or `pyproject.toml` (create with initial setup) + +**Assertion Library:** +- Built-in `assert` statements +- `pytest` fixtures for setup/teardown +- `pytest.raises()` for exception testing + +**Run Commands:** +```bash +pytest # Run all tests in tests/ directory +pytest -v # Verbose output with test names +pytest -k "test_memory" # Run tests matching pattern +pytest --cov=src # Generate coverage report +pytest --cov=src --cov-report=html # Generate HTML coverage +pytest -x # Stop on first failure +pytest -s # Show print output during tests +``` + +## Test File Organization + +**Location:** +- **Co-located pattern**: Unit test files live next to source files +- Structure: `src/[module]/test_[component].py` +- Integration tests and shared fixtures live in the top-level `tests/` directory (see Test Types below) + +**Recommended pattern for Mai:** +``` +src/ +├── memory/ +│ ├── __init__.py +│ ├── storage.py +│ └── test_storage.py # Co-located tests +├── models/ +│ ├── __init__.py +│ ├── manager.py +│ └── test_manager.py +└── safety/ + ├── __init__.py + ├── sandbox.py + └── test_sandbox.py +``` + +**Naming:** +- Test files: `test_*.py` or `*_test.py` +- Test classes: `TestComponentName` +- Test functions: `test_specific_behavior_with_context` +- Example: `test_retrieves_conversation_history_within_token_limit` + +**Test Organization:** +- One test class per component being tested +- Group related tests in a single class +- One assertion per test (or tightly related assertions) + +## Test Structure + +**Suite Organization:** +```python +import pytest +from src.exceptions import MemoryStoreError +from src.memory.storage import ConversationStorage + +class TestConversationStorage: +    """Test suite for ConversationStorage.""" + +    @pytest.fixture +    def storage(self) -> ConversationStorage: +        """Provide a storage instance for testing.""" +        return ConversationStorage(path=":memory:")  # Use in-memory DB + +    @pytest.fixture +    def sample_conversation(self) -> dict: +        """Provide sample conversation data.""" +        return { +            "messages": [ +                {"role": "user", "content": "Hello"}, +                {"role": "assistant", "content": "Hi there"}, +            ] +        } + +    def test_stores_and_retrieves_conversation(self, storage, sample_conversation): +        """Test that conversations can be stored and retrieved.""" +        conversation_id = storage.store(sample_conversation) +        retrieved = storage.get(conversation_id) +        assert retrieved == sample_conversation + +    def test_raises_error_on_missing_conversation(self, storage): +        """Test that missing conversations raise appropriate error.""" +        with pytest.raises(MemoryStoreError): +            storage.get("nonexistent_id") +``` + +**Patterns:** + +- **Setup pattern**: Use `@pytest.fixture` for setup, avoid `setUp()` methods +- **Teardown pattern**: Use fixture cleanup (yield pattern) +- **Assertion pattern**: One logical assertion per test (may involve multiple `assert` statements on related data) + +```python +@pytest.fixture +def model_manager(): +    """Set up model manager and clean up after test.""" +    manager = ModelManager() +    manager.initialize() +    yield manager +    manager.shutdown()  # Cleanup + +def
test_loads_available_models(model_manager): + """Test model discovery and loading.""" + models = model_manager.list_available() + assert len(models) > 0 + assert all(isinstance(m, str) for m in models) +``` + +## Async Testing + +**Pattern:** +```python +import pytest +import asyncio + +@pytest.mark.asyncio +async def test_async_model_invocation(): + """Test async model inference.""" + manager = ModelManager() + response = await manager.generate("test prompt") + assert len(response) > 0 + assert isinstance(response, str) + +@pytest.mark.asyncio +async def test_concurrent_memory_access(): + """Test that memory handles concurrent access.""" + storage = ConversationStorage() + tasks = [ + storage.store({"id": i, "text": f"msg {i}"}) + for i in range(10) + ] + ids = await asyncio.gather(*tasks) + assert len(ids) == 10 +``` + +- Use `@pytest.mark.asyncio` decorator +- Use `async def` for test function signature +- Use `await` for async calls +- Can mix async fixtures and sync fixtures + +## Mocking + +**Framework:** `unittest.mock` (Python standard library) + +**Patterns:** + +```python +from unittest.mock import Mock, AsyncMock, patch, MagicMock +import pytest + +def test_handles_model_error(): + """Test error handling when model fails.""" + mock_model = Mock() + mock_model.generate.side_effect = RuntimeError("Model offline") + + manager = ModelManager(model=mock_model) + with pytest.raises(ModelError): + manager.invoke("prompt") + +@pytest.mark.asyncio +async def test_retries_on_transient_failure(): + """Test retry logic for transient failures.""" + mock_api = AsyncMock() + mock_api.call.side_effect = [ + Exception("Temporary failure"), + "success" + ] + + result = await retry_with_backoff(mock_api.call, max_retries=2) + assert result == "success" + assert mock_api.call.call_count == 2 + +@patch("src.models.manager.requests.get") +def test_fetches_model_list(mock_get): + """Test fetching model list from API.""" + mock_get.return_value.json.return_value = {"models": ["model1", "model2"]} + + manager = ModelManager() + models = manager.get_remote_models() + assert models == ["model1", "model2"] +``` + +**What to Mock:** +- External API calls (Discord, LMStudio API) +- Database operations (SQLite in production, use in-memory for tests) +- File I/O (use temporary directories) +- Slow operations (model inference can be stubbed) +- System resources (CPU, RAM monitoring) + +**What NOT to Mock:** +- Core business logic (the logic you're testing) +- Data structure operations (dict, list operations) +- Internal module calls within the same component +- Internal helper functions + +## Fixtures and Factories + +**Test Data Pattern:** + +```python +# conftest.py - shared fixtures +import pytest +from pathlib import Path +from src.memory.storage import ConversationStorage + +@pytest.fixture +def temp_db(): + """Provide a temporary SQLite database.""" + db_path = Path("/tmp/test_mai.db") + yield db_path + if db_path.exists(): + db_path.unlink() + +@pytest.fixture +def conversation_factory(): + """Factory for creating test conversations.""" + def _make_conversation(num_messages: int = 3) -> dict: + messages = [] + for i in range(num_messages): + role = "user" if i % 2 == 0 else "assistant" + messages.append({ + "role": role, + "content": f"Message {i+1}", + "timestamp": f"2026-01-26T{i:02d}:00:00Z" + }) + return {"messages": messages} + return _make_conversation + +def test_stores_long_conversation(temp_db, conversation_factory): + """Test storing conversations with many messages.""" + storage = 
ConversationStorage(path=temp_db) + long_convo = conversation_factory(num_messages=100) + + conv_id = storage.store(long_convo) + retrieved = storage.get(conv_id) + assert len(retrieved["messages"]) == 100 +``` + +**Location:** +- Shared fixtures: `tests/conftest.py` (pytest auto-discovers) +- Component-specific fixtures: In test files or subdirectory `conftest.py` files +- Factories: In `tests/factories.py` or within `conftest.py` + +## Coverage + +**Requirements:** +- **Target: 80% code coverage minimum** for core modules +- Critical paths (safety, memory, inference): 90%+ coverage +- UI/CLI: 70% (lower due to interaction complexity) + +**View Coverage:** +```bash +pytest --cov=src --cov-report=term-missing +pytest --cov=src --cov-report=html +# Then open htmlcov/index.html in browser +``` + +**Configure in `pyproject.toml`:** +```toml +[tool.pytest.ini_options] +testpaths = ["src", "tests"] +addopts = "--cov=src --cov-report=term-missing --cov-report=html" +``` + +## Test Types + +**Unit Tests:** +- Scope: Single function or class method +- Dependencies: Mocked +- Speed: Fast (<100ms per test) +- Location: `test_component.py` in source directory +- Example: `test_tokenizer_splits_input_correctly` + +**Integration Tests:** +- Scope: Multiple components working together +- Dependencies: Real services (in-memory DB, local files) +- Speed: Medium (100ms - 1s per test) +- Location: `tests/integration/test_*.py` +- Example: `test_conversation_engine_with_memory_retrieval` + +```python +# tests/integration/test_conversation_flow.py +@pytest.mark.asyncio +async def test_full_conversation_with_memory(): + """Test complete conversation flow including memory retrieval.""" + memory = ConversationStorage(path=":memory:") + engine = ConversationEngine(memory=memory) + + # Store context + memory.store({"id": "ctx1", "content": "User prefers Python"}) + + # Have conversation + response = await engine.chat("What language should I use?") + + # Verify context was used + assert "Python" in response or "python" in response.lower() +``` + +**E2E Tests:** +- Scope: Full system end-to-end +- Framework: **Not required for v1** (added in v2) +- Would test: CLI input → Model → Discord output +- Deferred until Discord/CLI interfaces complete + +## Common Patterns + +**Error Testing:** +```python +def test_invalid_input_raises_validation_error(): + """Test that validation catches malformed input.""" + with pytest.raises(ValueError) as exc_info: + storage.store({"invalid": "structure"}) + assert "missing required field" in str(exc_info.value) + +def test_logs_error_details(): + """Test that errors log useful debugging info.""" + with patch("src.logger") as mock_logger: + try: + risky_operation() + except OperationError: + pass + mock_logger.error.assert_called_once() + call_args = mock_logger.error.call_args + assert "operation_id" in str(call_args) +``` + +**Performance Testing:** +```python +def test_memory_retrieval_within_performance_budget(benchmark): + """Test that memory queries complete within time budget.""" + storage = ConversationStorage() + query = "what did we discuss earlier" + + result = benchmark(storage.retrieve_similar, query) + assert len(result) > 0 + +# Run with: pytest --benchmark-only +``` + +**Data Validation Testing:** +```python +@pytest.mark.parametrize("input_val,expected", [ + ("hello", "hello"), + ("HELLO", "hello"), + (" hello ", "hello"), + ("", ValueError), +]) +def test_normalizes_input(input_val, expected): + """Test input normalization with multiple cases.""" + if 
isinstance(expected, type) and issubclass(expected, Exception): + with pytest.raises(expected): + normalize(input_val) + else: + assert normalize(input_val) == expected +``` + +## Configuration + +**pytest.ini (create at project root):** +```ini +[pytest] +testpaths = src tests +addopts = -v --tb=short --strict-markers +markers = + asyncio: marks async tests + slow: marks slow tests + integration: marks integration tests +``` + +**Alternative: pyproject.toml:** +```toml +[tool.pytest.ini_options] +testpaths = ["src", "tests"] +addopts = "-v --tb=short" +markers = [ + "asyncio: async test", + "slow: slow test", + "integration: integration test", +] +``` + +## Test Execution in CI/CD + +**GitHub Actions workflow (when created):** +```yaml +- name: Run tests + run: pytest --cov=src --cov-report=xml + +- name: Upload coverage + uses: codecov/codecov-action@v3 + with: + files: ./coverage.xml +``` + +--- + +*Testing guide: 2026-01-26* +*Status: Prescriptive for Mai v1 implementation*