# GuardDog Nexus - Final Improvement Plan (v2) ## STATUS: IMPLEMENTED AND VERIFIED All planned changes have been implemented and verified. **Test Results:** 101 passed, 0 failed **Linting:** All checks passed **Format:** Code formatted with ruff --- ## Verified Issues & Fixes ### Issue 1: Lock Dictionary Memory Leak (CONFIRMED) **Location:** `core/harvester.py` line 25, `routes/web.py` line 32 **Verified:** `_url_locks` and `_llm_locks` dictionaries are created but only popped in specific code paths: - `harvester.py:64` - only when URL is already locked - `harvester.py:81` - only after DB check completes - `web.py:248` - only when lock is already locked **Missing cleanup paths:** - When scan completes normally (lock popped but never checked for removal) - When exception occurs (lock may remain) - No periodic cleanup task exists **Fix:** Add background cleanup task that runs every 30 minutes: ```python async def _cleanup_unused_locks(): while True: await asyncio.sleep(1800) # 30 minutes for key in list(_url_locks.keys()): if not _url_locks[key].locked(): _url_locks.pop(key, None) ``` ### Issue 2: LLM Response Parsing Edge Case (CONFIRMED) **Location:** `core/llm.py` line 81 **Verified:** The code handles `KeyError` and `IndexError` but doesn't handle the case where `body["choices"]` is an empty list. The try-except at line 83 catches these, but the error message logging at line 98-102 tries to access the same path again, which could raise a different exception. **Fix:** Extract the raw content safely first: ```python try: choices = body.get("choices", []) if not choices: raise ValueError("Empty choices list") message = choices[0].get("message", {}) content = message.get("content", "") if not content: raise ValueError("Empty message content") return json.loads(content) except (ValueError, json.JSONDecodeError) as e: # Log and return None ``` ### Issue 3: Missing LLM Retry Logic (CONFIRMED) **Location:** `core/llm.py` **Verified:** No retry mechanism exists. Single failure = no analysis for that finding. **Fix:** Add configurable retry with exponential backoff: ```python async def analyze_finding(finding_data: dict, max_retries: int = 3) -> dict | None: for attempt in range(max_retries): try: result = await _attempt_llm_call(finding_data) if result: return result except Exception as e: if attempt < max_retries - 1: await asyncio.sleep(2 ** attempt * 2) # 2s, 4s, 8s continue log.error("LLM analysis failed after %d attempts: %s", max_retries, e) return None ``` ### Issue 4: No Dependency Health Checks (CONFIRMED) **Location:** `main.py` **Verified:** Only `/health` endpoint exists, returns static status. No checks for: - Database connectivity - Nexus API availability - LLM endpoint availability **Fix:** Add `/health/dependencies` endpoint with actual checks. ### Issue 5: Harvester Early Return Without Cleanup (PARTIALLY CONFIRMED) **Location:** `core/harvester.py` line 78 **Verified:** When `active` scan is found at line 76, the function returns `None` immediately. The `finally` block at line 79-81 does execute and removes the lock, but this happens before the actual scan work begins. **Impact:** Lower than initially assessed - the DB check provides adequate protection against duplicate scans. --- ## Refined Implementation Priorities ### Phase 1: Critical Fixes (1-2 days) 1. Add LLM retry logic with exponential backoff 2. Fix LLM response parsing edge cases 3. Add dependency health checks ### Phase 2: Reliability (2-3 days) 4. Add lock cleanup task 5. Add configuration validation on startup 6. Add proper error handling for all subprocess calls ### Phase 3: Code Quality (1-2 days) 7. Add type hints consistency 8. Add input validation for webhooks 9. Add security event logging ### Phase 4: Features (2-3 days) 10. Add scan progress tracking 11. Sync CSV export filters with API 12. Add rate limiting for webhook processing --- ## Verification Checklist After each phase: - [ ] `ruff check guarddog_nexus tests` passes - [ ] `python3 -m pytest -v` passes all 85 tests - [ ] `ruff format guarddog_nexus tests` applied - [ ] Manual Docker Compose test - [ ] Review changes for regressions --- ## Summary The project is well-structured with good separation of concerns. The main areas needing attention are: 1. **Resource management** - lock cleanup, subprocess handling 2. **Reliability** - LLM retries, health checks, error recovery 3. **Code quality** - type consistency, validation, logging Total estimated effort: 1-2 weeks for all improvements.