Phase 3: Resource Management - 4 plan(s) in 2 wave(s) - 2 parallel, 2 sequential - Ready for execution
.planning/phases/03-resource-management/03-01-PLAN.md
@@ -0,0 +1,113 @@
---
phase: 03-resource-management
plan: 01
type: execute
wave: 1
depends_on: []
files_modified: [pyproject.toml, src/models/resource_monitor.py]
autonomous: true
user_setup: []

must_haves:
  truths:
    - "Enhanced resource monitor can detect NVIDIA GPU VRAM using pynvml"
    - "GPU detection falls back gracefully when GPU unavailable"
    - "Resource monitoring remains cross-platform compatible"
  artifacts:
    - path: "src/models/resource_monitor.py"
      provides: "Enhanced GPU detection with pynvml support"
      contains: "pynvml"
      min_lines: 250
    - path: "pyproject.toml"
      provides: "pynvml dependency for GPU monitoring"
      contains: "pynvml"
  key_links:
    - from: "src/models/resource_monitor.py"
      to: "pynvml library"
      via: "import pynvml"
      pattern: "import pynvml"
---

<objective>
Enhance GPU detection and monitoring capabilities by integrating pynvml for precise NVIDIA GPU VRAM tracking while maintaining cross-platform compatibility and graceful fallbacks.

Purpose: Provide accurate GPU resource detection for intelligent model selection and proactive scaling decisions.
Output: Enhanced ResourceMonitor with reliable GPU VRAM monitoring across different hardware configurations.
</objective>

<execution_context>
@~/.opencode/get-shit-done/workflows/execute-plan.md
@~/.opencode/get-shit-done/templates/summary.md
</execution_context>

<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md

# Current implementation
@src/models/resource_monitor.py
@pyproject.toml
</context>

<tasks>

<task type="auto">
<name>Add pynvml dependency to project</name>
<files>pyproject.toml</files>
<action>Add pynvml>=11.0.0 to the main dependencies array in pyproject.toml. This ensures NVIDIA GPU monitoring capabilities are available by default rather than being optional.</action>
<verify>grep -n "pynvml" pyproject.toml shows the dependency added correctly</verify>
<done>pynvml dependency is available for GPU monitoring</done>
</task>

<task type="auto">
<name>Enhance ResourceMonitor with pynvml GPU detection</name>
<files>src/models/resource_monitor.py</files>
<action>
Enhance the _get_gpu_memory() method to use pynvml for precise NVIDIA GPU VRAM detection:

1. Add pynvml import at the top of the file
2. Replace the current _get_gpu_memory() implementation with pynvml-based detection:
   - Initialize pynvml with proper error handling
   - Get GPU handle and memory info using pynvml APIs
   - Return total, used, and free VRAM in GB (NVML reports these values in bytes)
   - Handle NVMLError gracefully and fall back to the existing gpu-tracker logic
   - Ensure pynvml.nvmlShutdown() is called in a finally block whenever nvmlInit() succeeded
3. Update get_current_resources() to include detailed GPU info:
   - gpu_total_vram_gb: Total VRAM capacity
   - gpu_used_vram_gb: Currently used VRAM
   - gpu_free_vram_gb: Available VRAM
   - gpu_utilization_percent: GPU utilization (if available)
4. Add GPU temperature monitoring if available via pynvml
5. Maintain backward compatibility with existing return format

The enhanced GPU detection should:
- Try pynvml first for NVIDIA GPUs
- Fall back to gpu-tracker for other vendors
- Return 0 values if no GPU detected
- Handle all exceptions gracefully
- Log GPU detection results at debug level
</action>
<verify>python -c "from src.models.resource_monitor import ResourceMonitor; rm = ResourceMonitor(); resources = rm.get_current_resources(); print('GPU detection:', {k: v for k, v in resources.items() if 'gpu' in k})" returns GPU metrics without errors</verify>
<done>ResourceMonitor provides accurate GPU VRAM monitoring using pynvml with proper fallbacks</done>
</task>
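
A minimal sketch of the detection order described in the task above (pynvml first, then a fallback, then zero values), not the actual ResourceMonitor code. The function names `get_nvidia_gpu_info` and `get_gpu_info`, the `fallback` parameter standing in for the existing gpu-tracker logic, and the `gpu_temperature_c` key are assumptions made for illustration. The pynvml import is guarded so the sketch also runs where the package is missing.

```python
import logging

try:
    import pynvml  # third-party; assumed installed via the pyproject.toml dependency
except ImportError:  # keeps the sketch usable on machines without the package
    pynvml = None

logger = logging.getLogger(__name__)
BYTES_PER_GB = 1024 ** 3


def _empty_gpu_info():
    """Zero values reported when no GPU can be detected."""
    return {
        "gpu_total_vram_gb": 0.0,
        "gpu_used_vram_gb": 0.0,
        "gpu_free_vram_gb": 0.0,
        "gpu_utilization_percent": 0.0,
        "gpu_temperature_c": None,  # hypothetical key, not named in the plan
    }


def get_nvidia_gpu_info(device_index=0):
    """Return VRAM figures for one NVIDIA GPU, or None if NVML is unusable."""
    if pynvml is None:
        return None
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError as exc:
        logger.debug("NVML init failed: %s", exc)
        return None
    try:
        if pynvml.nvmlDeviceGetCount() <= device_index:
            return None
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)  # byte counts
        info = _empty_gpu_info()
        info["gpu_total_vram_gb"] = mem.total / BYTES_PER_GB
        info["gpu_used_vram_gb"] = mem.used / BYTES_PER_GB
        info["gpu_free_vram_gb"] = mem.free / BYTES_PER_GB
        try:  # utilization is optional; some devices do not report it
            rates = pynvml.nvmlDeviceGetUtilizationRates(handle)
            info["gpu_utilization_percent"] = float(rates.gpu)
        except pynvml.NVMLError:
            pass
        try:  # temperature is optional as well
            info["gpu_temperature_c"] = float(
                pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
            )
        except pynvml.NVMLError:
            pass
        return info
    except pynvml.NVMLError as exc:
        logger.debug("NVML query failed: %s", exc)
        return None
    finally:
        # Only reached after nvmlInit() succeeded, so shutdown is valid here.
        try:
            pynvml.nvmlShutdown()
        except pynvml.NVMLError:
            pass


def get_gpu_info(fallback=None):
    """Try NVML, then an optional fallback callable, then zero values."""
    info = get_nvidia_gpu_info()
    if info is None and fallback is not None:
        try:
            info = fallback()  # stand-in for the existing gpu-tracker logic
        except Exception as exc:
            logger.debug("Fallback GPU detection failed: %s", exc)
            info = None
    result = _empty_gpu_info()
    if info:
        result.update(info)
    logger.debug("GPU detection result: %s", result)
    return result
```

Calling `get_gpu_info()` on a machine without an NVIDIA driver should return the zero-valued dictionary rather than raise. Merging its keys into the existing `get_current_resources()` result, instead of replacing that result, is one way to keep the current return format intact.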

</tasks>

<verification>
Test enhanced resource monitoring across different configurations:
- Systems with NVIDIA GPUs (pynvml should work)
- Systems with AMD/Intel GPUs (fallback to gpu-tracker)
- Systems without GPUs (graceful zero values)
- Cross-platform compatibility (Linux, Windows, macOS)

Verify monitoring overhead remains < 1% CPU usage.
</verification>
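
One rough way to check the overhead target above is to compare the CPU time the process consumes with the wall-clock time elapsed while sampling. The helper below is a sketch under that assumption. `measure_cpu_overhead` and its parameters are hypothetical names, and the result is a percentage of a single core for the whole process, not an isolated figure for the monitor.

```python
import time


def measure_cpu_overhead(sample_fn, interval_s=1.0, samples=5):
    """Return process CPU time as a percentage of wall-clock time while sampling."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    for _ in range(samples):
        sample_fn()             # e.g. the monitor's get_current_resources
        time.sleep(interval_s)  # sleeping adds wall-clock time but no CPU time
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    # Guard against a zero denominator if no samples were taken.
    return 100.0 * cpu_used / max(wall_used, 1e-9)
```

Passing `ResourceMonitor().get_current_resources` as `sample_fn`, with an interval that matches the intended polling rate, gives a number to compare against the 1% target. A shorter interval inflates the percentage, so the interval should reflect real usage.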

<success_criteria>
ResourceMonitor successfully detects and reports GPU VRAM using pynvml when available, falls back gracefully to other methods, maintains cross-platform compatibility, and provides detailed GPU metrics for intelligent model selection.
</success_criteria>

<output>
After completion, create `.planning/phases/03-resource-management/03-01-SUMMARY.md`
</output>