docs(03): create phase plan

Phase 3: Resource Management - 4 plan(s) in 2 wave(s) - 2 parallel, 2 sequential - Ready for execution
2026-01-27 17:58:09 -05:00
parent a37b61acce
commit 1e071398ff
5 changed files with 623 additions and 0 deletions
--- a/.planning/phases/03-resource-management/03-01-PLAN.md
+++ b/.planning/phases/03-resource-management/03-01-PLAN.md
@@ -0,0 +1,113 @@
+---
+phase: 03-resource-management
+plan: 01
+type: execute
+wave: 1
+depends_on: []
+files_modified: [pyproject.toml, src/models/resource_monitor.py]
+autonomous: true
+user_setup: []
+
+must_haves:
+  truths:
+    - "Enhanced resource monitor can detect NVIDIA GPU VRAM using pynvml"
+    - "GPU detection falls back gracefully when GPU unavailable"
+    - "Resource monitoring remains cross-platform compatible"
+  artifacts:
+    - path: "src/models/resource_monitor.py"
+      provides: "Enhanced GPU detection with pynvml support"
+      contains: "pynvml"
+      min_lines: 250
+    - path: "pyproject.toml"
+      provides: "pynvml dependency for GPU monitoring"
+      contains: "pynvml"
+  key_links:
+    - from: "src/models/resource_monitor.py"
+      to: "pynvml library"
+      via: "import pynvml"
+      pattern: "import pynvml"
+---
+
+<objective>
+Enhance GPU detection and monitoring capabilities by integrating pynvml for precise NVIDIA GPU VRAM tracking while maintaining cross-platform compatibility and graceful fallbacks.
+
+Purpose: Provide accurate GPU resource detection for intelligent model selection and proactive scaling decisions.
+Output: Enhanced ResourceMonitor with reliable GPU VRAM monitoring across different hardware configurations.
+</objective>
+
+<execution_context>
+@~/.opencode/get-shit-done/workflows/execute-plan.md
+@~/.opencode/get-shit-done/templates/summary.md
+</execution_context>
+
+<context>
+@.planning/PROJECT.md
+@.planning/ROADMAP.md
+@.planning/STATE.md
+
+# Current implementation
+@src/models/resource_monitor.py
+@pyproject.toml
+</context>
+
+<tasks>
+
+<task type="auto">
+  <name>Add pynvml dependency to project</name>
+  <files>pyproject.toml</files>
+  <action>Add pynvml>=11.0.0 to the main dependencies array in pyproject.toml. This ensures NVIDIA GPU monitoring capabilities are available by default rather than being optional.</action>
+  <verify>grep -n "pynvml" pyproject.toml shows the dependency added correctly</verify>
+  <done>pynvml dependency is available for GPU monitoring</done>
+</task>
+
+<task type="auto">
+  <name>Enhance ResourceMonitor with pynvml GPU detection</name>
+  <files>src/models/resource_monitor.py</files>
+  <action>
+Enhance the _get_gpu_memory() method to use pynvml for precise NVIDIA GPU VRAM detection:
+
+1. Add pynvml import at the top of the file
+2. Replace the current _get_gpu_memory() implementation with pynvml-based detection:
+   - Initialize pynvml with proper error handling
+   - Get GPU handle and memory info using pynvml APIs
+   - Return total, used, and free VRAM in GB
+   - Handle NVMLError gracefully and fall back to the existing gpu-tracker logic
+   - Ensure pynvml.nvmlShutdown() is always called in a finally block after a successful pynvml.nvmlInit()
+3. Update get_current_resources() to include detailed GPU info:
+   - gpu_total_vram_gb: Total VRAM capacity
+   - gpu_used_vram_gb: Currently used VRAM
+   - gpu_free_vram_gb: Available VRAM
+   - gpu_utilization_percent: GPU utilization (if available)
+4. Add GPU temperature monitoring if available via pynvml
+5. Maintain backward compatibility with existing return format
+
+The enhanced GPU detection should:
+- Try pynvml first for NVIDIA GPUs
+- Fall back to gpu-tracker for other vendors
+- Return zero values for all GPU metrics if no GPU is detected
+- Handle all exceptions gracefully
+- Log GPU detection results at debug level
+</action>
+  <verify>python -c "from src.models.resource_monitor import ResourceMonitor; rm = ResourceMonitor(); resources = rm.get_current_resources(); print('GPU detection:', {k: v for k, v in resources.items() if 'gpu' in k})" returns GPU metrics without errors</verify>
+  <done>ResourceMonitor provides accurate GPU VRAM monitoring using pynvml with proper fallbacks</done>
+</task>
+
+</tasks>
+
+<verification>
+Test enhanced resource monitoring across different configurations:
+- Systems with NVIDIA GPUs (pynvml should work)
+- Systems with AMD/Intel GPUs (fallback to gpu-tracker)
+- Systems without GPUs (graceful zero values)
+- Cross-platform compatibility (Linux, Windows, macOS)
+
+Verify monitoring overhead remains < 1% CPU usage.
+</verification>
+
+<success_criteria>
+ResourceMonitor successfully detects and reports GPU VRAM using pynvml when available, falls back gracefully to other methods, maintains cross-platform compatibility, and provides detailed GPU metrics for intelligent model selection.
+</success_criteria>
+
+<output>
+After completion, create `.planning/phases/03-resource-management/03-01-SUMMARY.md`
+</output>
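
The detection order described in the second task of 03-01-PLAN.md (pynvml first, then a fallback, then zero values) could be sketched roughly as below. This is a minimal illustration under stated assumptions, not the project's actual `ResourceMonitor` code: the helper names `query_nvidia_vram` and `get_gpu_memory`, and the `fallback` callable standing in for the existing gpu-tracker logic, are hypothetical. Only the result keys are taken from the plan.

```python
import logging

try:
    import pynvml  # NVIDIA Management Library bindings
except ImportError:  # package not installed: treat as "no NVIDIA GPU"
    pynvml = None

logger = logging.getLogger(__name__)
BYTES_PER_GB = 1024 ** 3

# Zero-valued result; key names follow the plan's get_current_resources() fields.
NO_GPU = {
    "gpu_total_vram_gb": 0.0,
    "gpu_used_vram_gb": 0.0,
    "gpu_free_vram_gb": 0.0,
    "gpu_utilization_percent": 0.0,
}


def query_nvidia_vram(device_index=0):
    """Return VRAM metrics for one NVIDIA GPU, or None if NVML is unusable.

    Hypothetical helper; the name is an assumption for illustration.
    """
    if pynvml is None:
        return None
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError as exc:
        # Covers a missing driver or missing NVML shared library.
        logger.debug("NVML init failed: %s", exc)
        return None
    try:
        if pynvml.nvmlDeviceGetCount() <= device_index:
            return None
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)  # fields are in bytes
        metrics = {
            "gpu_total_vram_gb": mem.total / BYTES_PER_GB,
            "gpu_used_vram_gb": mem.used / BYTES_PER_GB,
            "gpu_free_vram_gb": mem.free / BYTES_PER_GB,
        }
        try:
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            metrics["gpu_utilization_percent"] = float(util.gpu)
        except pynvml.NVMLError:
            # Utilization is optional; some devices do not report it.
            metrics["gpu_utilization_percent"] = 0.0
        return metrics
    except pynvml.NVMLError as exc:
        logger.debug("NVML query failed: %s", exc)
        return None
    finally:
        # Runs only after a successful nvmlInit().
        try:
            pynvml.nvmlShutdown()
        except pynvml.NVMLError as exc:
            logger.debug("NVML shutdown failed: %s", exc)


def get_gpu_memory(fallback=None):
    """Try pynvml, then an optional fallback callable, then zero values."""
    metrics = query_nvidia_vram()
    if metrics is None and fallback is not None:
        try:
            metrics = fallback()  # stand-in for the gpu-tracker path
        except Exception as exc:
            logger.debug("Fallback GPU detection failed: %s", exc)
            metrics = None
    result = dict(NO_GPU)
    if metrics:
        result.update(metrics)
    logger.debug("GPU detection result: %s", result)
    return result
```

Guarding the import keeps the module loadable on machines where pynvml is not installed, and separating `nvmlInit()` from the query block means `nvmlShutdown()` is never called on an uninitialized library. Temperature monitoring (item 4 in the plan) is left out of the sketch for brevity.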