4.6 KiB
GuardDog Nexus - Final Improvement Plan (v2)
STATUS: IMPLEMENTED AND VERIFIED
All planned changes have been implemented and verified.
Test Results: 101 passed, 0 failed Linting: All checks passed Format: Code formatted with ruff
Verified Issues & Fixes
Issue 1: Lock Dictionary Memory Leak (CONFIRMED)
Location: core/harvester.py line 25, routes/web.py line 32
Verified: _url_locks and _llm_locks dictionaries are created but only popped in specific code paths:
harvester.py:64- only when URL is already lockedharvester.py:81- only after DB check completesweb.py:248- only when lock is already locked
Missing cleanup paths:
- When scan completes normally (lock popped but never checked for removal)
- When exception occurs (lock may remain)
- No periodic cleanup task exists
Fix: Add background cleanup task that runs every 30 minutes:
async def _cleanup_unused_locks():
while True:
await asyncio.sleep(1800) # 30 minutes
for key in list(_url_locks.keys()):
if not _url_locks[key].locked():
_url_locks.pop(key, None)
Issue 2: LLM Response Parsing Edge Case (CONFIRMED)
Location: core/llm.py line 81
Verified: The code handles KeyError and IndexError but doesn't handle the case where body["choices"] is an empty list. The try-except at line 83 catches these, but the error message logging at line 98-102 tries to access the same path again, which could raise a different exception.
Fix: Extract the raw content safely first:
try:
choices = body.get("choices", [])
if not choices:
raise ValueError("Empty choices list")
message = choices[0].get("message", {})
content = message.get("content", "")
if not content:
raise ValueError("Empty message content")
return json.loads(content)
except (ValueError, json.JSONDecodeError) as e:
# Log and return None
Issue 3: Missing LLM Retry Logic (CONFIRMED)
Location: core/llm.py
Verified: No retry mechanism exists. Single failure = no analysis for that finding.
Fix: Add configurable retry with exponential backoff:
async def analyze_finding(finding_data: dict, max_retries: int = 3) -> dict | None:
for attempt in range(max_retries):
try:
result = await _attempt_llm_call(finding_data)
if result:
return result
except Exception as e:
if attempt < max_retries - 1:
await asyncio.sleep(2 ** attempt * 2) # 2s, 4s, 8s
continue
log.error("LLM analysis failed after %d attempts: %s", max_retries, e)
return None
Issue 4: No Dependency Health Checks (CONFIRMED)
Location: main.py
Verified: Only /health endpoint exists, returns static status. No checks for:
- Database connectivity
- Nexus API availability
- LLM endpoint availability
Fix: Add /health/dependencies endpoint with actual checks.
Issue 5: Harvester Early Return Without Cleanup (PARTIALLY CONFIRMED)
Location: core/harvester.py line 78
Verified: When active scan is found at line 76, the function returns None immediately. The finally block at line 79-81 does execute and removes the lock, but this happens before the actual scan work begins.
Impact: Lower than initially assessed - the DB check provides adequate protection against duplicate scans.
Refined Implementation Priorities
Phase 1: Critical Fixes (1-2 days)
- Add LLM retry logic with exponential backoff
- Fix LLM response parsing edge cases
- Add dependency health checks
Phase 2: Reliability (2-3 days)
- Add lock cleanup task
- Add configuration validation on startup
- Add proper error handling for all subprocess calls
Phase 3: Code Quality (1-2 days)
- Add type hints consistency
- Add input validation for webhooks
- Add security event logging
Phase 4: Features (2-3 days)
- Add scan progress tracking
- Sync CSV export filters with API
- Add rate limiting for webhook processing
Verification Checklist
After each phase:
ruff check guarddog_nexus testspassespython3 -m pytest -vpasses all 85 testsruff format guarddog_nexus testsapplied- Manual Docker Compose test
- Review changes for regressions
Summary
The project is well-structured with good separation of concerns. The main areas needing attention are:
- Resource management - lock cleanup, subprocess handling
- Reliability - LLM retries, health checks, error recovery
- Code quality - type consistency, validation, logging
Total estimated effort: 1-2 weeks for all improvements.