GuardDog Nexus - Final Improvement Plan (v2)

STATUS: IMPLEMENTED AND VERIFIED

All planned changes have been implemented and verified.

  • Test results: 101 passed, 0 failed
  • Linting: all checks passed
  • Formatting: code formatted with ruff


Verified Issues & Fixes

Issue 1: Lock Dictionary Memory Leak (CONFIRMED)

Location: core/harvester.py line 25, routes/web.py line 32

Verified: Entries are added to the _url_locks and _llm_locks dictionaries, but they are only popped in specific code paths:

  • harvester.py:64 - only when URL is already locked
  • harvester.py:81 - only after DB check completes
  • web.py:248 - only when lock is already locked

Missing cleanup paths:

  • When a scan completes normally (the lock is released, but nothing checks whether its entry can be removed)
  • When an exception occurs (the entry may remain)
  • No periodic cleanup task exists

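Fix: Add a background cleanup task that runs every 30 minutes:

import asyncio


async def _cleanup_unused_locks():
    """Periodically drop lock entries that no task currently holds."""
    while True:
        await asyncio.sleep(1800)  # 30 minutes
        # Iterate over a copy of the keys so entries can be removed safely.
        for key in list(_url_locks.keys()):
            if not _url_locks[key].locked():
                _url_locks.pop(key, None)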

Issue 2: LLM Response Parsing Edge Case (CONFIRMED)

Location: core/llm.py line 81

Verified: The code handles KeyError and IndexError, but it does not treat an empty body["choices"] list as a case of its own. The try-except at line 83 catches the resulting exception, but the error logging at lines 98-102 accesses the same path again, which could raise a second exception.

Fix: Extract the raw content safely first:

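try:
    choices = body.get("choices") or []
    if not choices:
        raise ValueError("Empty choices list")
    message = choices[0].get("message") or {}
    content = message.get("content") or ""
    if not content:
        raise ValueError("Empty message content")
    return json.loads(content)
except (ValueError, json.JSONDecodeError, AttributeError, TypeError) as e:
    # AttributeError/TypeError cover a body, choice, or message of the wrong type.
    # Log the parse error itself; do not index into body again here.
    log.warning("Could not parse LLM response: %s", e)
    return None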

Issue 3: Missing LLM Retry Logic (CONFIRMED)

Location: core/llm.py

Verified: No retry mechanism exists. A single failed call means no analysis for that finding.

Fix: Add configurable retry with exponential backoff:

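import asyncio


async def analyze_finding(finding_data: dict, max_retries: int = 3) -> dict | None:
    for attempt in range(max_retries):
        try:
            result = await _attempt_llm_call(finding_data)
            if result:
                return result
            log.warning("LLM returned no result (attempt %d/%d)", attempt + 1, max_retries)
        except Exception as e:
            log.warning("LLM call failed (attempt %d/%d): %s", attempt + 1, max_retries, e)
        # Back off before the next attempt; there is no sleep after the last one.
        if attempt < max_retries - 1:
            await asyncio.sleep(2 ** attempt * 2)  # 2s, then 4s with the default of 3 attempts
    log.error("LLM analysis failed after %d attempts", max_retries)
    return None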
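
To make the backoff arithmetic concrete: the expression `2 ** attempt * 2` parses as `(2 ** attempt) * 2`, and no sleep follows the final attempt, so a call with `max_retries=3` waits at most twice. The helper `backoff_delays` below is hypothetical and exists only to illustrate the schedule.

```python
def backoff_delays(max_retries: int, base: float = 2.0) -> list[float]:
    """Return the sleep (in seconds) taken after each failed attempt except the last."""
    return [base * 2 ** attempt for attempt in range(max_retries - 1)]


print(backoff_delays(3))  # [2.0, 4.0]
print(backoff_delays(4))  # [2.0, 4.0, 8.0]
```

An 8-second wait therefore only occurs when `max_retries` is raised to 4 or more.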

Issue 4: No Dependency Health Checks (CONFIRMED)

Location: main.py

Verified: Only the /health endpoint exists, and it returns a static status. There are no checks for:

  • Database connectivity
  • Nexus API availability
  • LLM endpoint availability

Fix: Add /health/dependencies endpoint with actual checks.
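
As a rough sketch of what "actual checks" could look like, the function below runs one probe per dependency concurrently, applies a timeout, and aggregates the results. The name `check_dependencies`, the probe signature, the 5-second default timeout, and the response shape are all assumptions made for illustration; the plan does not specify them, and the real probes (database, Nexus API, LLM endpoint) would need to be written against the project's own clients.

```python
import asyncio
import time
from typing import Awaitable, Callable

# A probe is any zero-argument coroutine function that raises on failure.
Check = Callable[[], Awaitable[None]]


async def check_dependencies(checks: dict[str, Check], timeout: float = 5.0) -> dict:
    """Run every probe concurrently and report a per-dependency status."""

    async def run_one(name: str, probe: Check) -> tuple[str, dict]:
        started = time.monotonic()
        try:
            await asyncio.wait_for(probe(), timeout=timeout)
            result = {"status": "ok"}
        except asyncio.TimeoutError:
            result = {"status": "timeout"}
        except Exception as exc:  # any probe failure is reported, not raised
            result = {"status": "error", "detail": str(exc)}
        result["latency_ms"] = round((time.monotonic() - started) * 1000, 1)
        return name, result

    pairs = await asyncio.gather(*(run_one(n, p) for n, p in checks.items()))
    results = dict(pairs)
    healthy = all(r["status"] == "ok" for r in results.values())
    return {"status": "ok" if healthy else "degraded", "dependencies": results}
```

A handler for /health/dependencies could return this dictionary directly and map "degraded" to a non-200 status code. Keeping the probes as plain coroutines means the existing static /health endpoint can stay cheap while the deeper check is only run on request.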

Issue 5: Harvester Early Return Without Cleanup (PARTIALLY CONFIRMED)

Location: core/harvester.py line 78

Verified: When an active scan is found at line 76, the function returns None immediately. The finally block at lines 79-81 does execute and removes the lock, but this happens before the actual scan work begins.

Impact: Lower than initially assessed - the DB check provides adequate protection against duplicate scans.


Refined Implementation Priorities

Phase 1: Critical Fixes (1-2 days)

  1. Add LLM retry logic with exponential backoff
  2. Fix LLM response parsing edge cases
  3. Add dependency health checks

Phase 2: Reliability (2-3 days)

  1. Add lock cleanup task
  2. Add configuration validation on startup
  3. Add proper error handling for all subprocess calls

Phase 3: Code Quality (1-2 days)

  1. Make type hints consistent
  2. Add input validation for webhooks
  3. Add security event logging

Phase 4: Features (2-3 days)

  1. Add scan progress tracking
  2. Sync CSV export filters with API
  3. Add rate limiting for webhook processing

Verification Checklist

After each phase:

  • ruff check guarddog_nexus tests passes
  • python3 -m pytest -v passes all tests (85 when this plan was written; the status section reports 101 after implementation)
  • ruff format guarddog_nexus tests applied
  • Manual Docker Compose test
  • Review changes for regressions

Summary

The project is well-structured with good separation of concerns. The main areas needing attention are:

  1. Resource management - lock cleanup, subprocess handling
  2. Reliability - LLM retries, health checks, error recovery
  3. Code quality - type consistency, validation, logging

Total estimated effort: 1-2 weeks for all improvements.