marker/guarddog-nexus


Marker689 04abe44ab4 refactor: uv-based deps, no nexus auth, LLM retries, lock cleanup, health checks, e2e tests

2026-05-11 19:27:56 +03:00


GuardDog Nexus - Final Improvement Plan (v2)

STATUS: IMPLEMENTED AND VERIFIED

All planned changes have been implemented and verified.

Test Results: 101 passed, 0 failed
Linting: All checks passed
Format: Code formatted with ruff

Verified Issues & Fixes

Issue 1: Lock Dictionary Memory Leak (CONFIRMED)

Location: core/harvester.py line 25, routes/web.py line 32

Verified: Entries are added to the _url_locks and _llm_locks dictionaries but are only popped in specific code paths:

harvester.py:64 - only when URL is already locked
harvester.py:81 - only after DB check completes
web.py:248 - only when lock is already locked

Missing cleanup paths:

When a scan completes normally (the lock is released but its dictionary entry is never removed)
When an exception occurs (the entry may remain)
No periodic cleanup task exists

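Fix: Add a background cleanup task that runs every 30 minutes and sweeps both lock dictionaries:

import asyncio

async def _cleanup_unused_locks():
    """Periodically drop lock entries that no task currently holds."""
    while True:
        await asyncio.sleep(1800)  # 30 minutes
        for locks in (_url_locks, _llm_locks):
            # Copy the keys first: the dict is mutated during the sweep.
            for key in list(locks.keys()):
                if not locks[key].locked():
                    locks.pop(key, None)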
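A single pass of the cleanup loop above can be factored into a plain function, which makes the sweep testable without waiting 30 minutes. The following is a minimal sketch, assuming the lock tables are ordinary dicts of asyncio.Lock objects keyed by URL; sweep_unused_locks and the example URLs are hypothetical names, not part of the project.

```python
import asyncio


def sweep_unused_locks(locks: dict[str, asyncio.Lock]) -> int:
    """Remove every lock that is not currently held; return how many were removed."""
    removed = 0
    for key in list(locks.keys()):  # copy keys: the dict is mutated while iterating
        if not locks[key].locked():
            locks.pop(key, None)
            removed += 1
    return removed


async def _demo() -> tuple[int, list[str]]:
    # Hypothetical keys standing in for package URLs.
    locks = {
        "https://a.example/pkg": asyncio.Lock(),
        "https://b.example/pkg": asyncio.Lock(),
    }
    async with locks["https://a.example/pkg"]:
        # "a" is held during the sweep, so only "b" is eligible for removal.
        removed = sweep_unused_locks(locks)
        return removed, sorted(locks)


removed, remaining = asyncio.run(_demo())
```

The periodic task would then only need to call this function on each dictionary after every sleep.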

Issue 2: LLM Response Parsing Edge Case (CONFIRMED)

Location: core/llm.py line 81

Verified: The code handles KeyError and IndexError, so a missing key or an empty body["choices"] list is caught by the try-except at line 83. However, the error logging at lines 98-102 accesses the same path again, which could raise a different exception.

Fix: Extract the raw content safely first:

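try:
    # "or" also covers keys that are present but set to None.
    choices = body.get("choices") or []
    if not choices:
        raise ValueError("Empty choices list")
    message = choices[0].get("message") or {}
    content = message.get("content") or ""
    if not content:
        raise ValueError("Empty message content")
    return json.loads(content)
except (ValueError, AttributeError, TypeError) as e:
    # json.JSONDecodeError is a subclass of ValueError; AttributeError and
    # TypeError cover bodies that are not shaped like dicts and lists.
    log.warning("Could not parse LLM response: %s", e)
    return None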

Issue 3: Missing LLM Retry Logic (CONFIRMED)

Location: core/llm.py

Verified: No retry mechanism exists. A single failure means no analysis for that finding.

Fix: Add configurable retry with exponential backoff:

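async def analyze_finding(finding_data: dict, max_retries: int = 3) -> dict | None:
    for attempt in range(max_retries):
        try:
            result = await _attempt_llm_call(finding_data)
            if result:
                return result
            log.warning("LLM returned no result (attempt %d/%d)", attempt + 1, max_retries)
        except Exception as e:
            log.warning("LLM call failed (attempt %d/%d): %s", attempt + 1, max_retries, e)
        # Back off after both exceptions and empty results, except after the last attempt.
        if attempt < max_retries - 1:
            await asyncio.sleep(2 ** attempt * 2)  # 2s, then 4s with the default 3 attempts
    log.error("LLM analysis failed after %d attempts", max_retries)
    return None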

Issue 4: No Dependency Health Checks (CONFIRMED)

Location: main.py

Verified: Only a /health endpoint exists, and it returns a static status. There are no checks for:

Database connectivity
Nexus API availability
LLM endpoint availability

Fix: Add /health/dependencies endpoint with actual checks.

Issue 5: Harvester Early Return Without Cleanup (PARTIALLY CONFIRMED)

Location: core/harvester.py line 78

Verified: When an active scan is found at line 76, the function returns None immediately. The finally block at lines 79-81 does execute and removes the lock, but this happens before the actual scan work begins.

Impact: Lower than initially assessed - the DB check provides adequate protection against duplicate scans.
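If the in-memory lock were ever meant to cover the whole scan rather than just the DB check, one option is a context manager that skips duplicates and removes the dictionary entry on every exit path. This is an alternative sketch, not the harvester's current logic: try_url_lock and the example URL are hypothetical, and the pattern assumes all callers use this non-blocking check so that no task ever waits on the lock.

```python
import asyncio
from collections.abc import AsyncIterator
from contextlib import asynccontextmanager

_url_locks: dict[str, asyncio.Lock] = {}


@asynccontextmanager
async def try_url_lock(url: str) -> AsyncIterator[bool]:
    """Yield True if this task owns the lock, False if a scan is already running."""
    lock = _url_locks.setdefault(url, asyncio.Lock())
    if lock.locked():
        yield False
        return
    async with lock:
        try:
            yield True
        finally:
            # Runs on normal completion and on exceptions alike.
            _url_locks.pop(url, None)


async def _demo() -> tuple[list[str], int]:
    events: list[str] = []

    async def scan(tag: str, crash: bool = False) -> None:
        async with try_url_lock("https://example.test/pkg") as acquired:
            if not acquired:
                events.append(f"{tag}:skipped")
                return
            events.append(f"{tag}:scanning")
            await asyncio.sleep(0)  # let the other task run while the lock is held
            if crash:
                raise RuntimeError("simulated scan failure")

    await asyncio.gather(scan("a", crash=True), scan("b"), return_exceptions=True)
    return events, len(_url_locks)


events, leftover = asyncio.run(_demo())
```

In the demo the second task is skipped while the first holds the lock, and the entry is gone afterwards even though the first scan raised.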

Refined Implementation Priorities

Phase 1: Critical Fixes (1-2 days)

Add LLM retry logic with exponential backoff
Fix LLM response parsing edge cases
Add dependency health checks

Phase 2: Reliability (2-3 days)

Add lock cleanup task
Add configuration validation on startup
Add proper error handling for all subprocess calls

Phase 3: Code Quality (1-2 days)

Make type hints consistent
Add input validation for webhooks
Add security event logging

Phase 4: Features (2-3 days)

Add scan progress tracking
Sync CSV export filters with API
Add rate limiting for webhook processing

Verification Checklist

After each phase:

ruff check guarddog_nexus tests passes
python3 -m pytest -v passes all tests (101 at the last verified run)
ruff format guarddog_nexus tests applied
Manual Docker Compose test
Review changes for regressions

Summary

The project is well-structured with good separation of concerns. The main areas needing attention are:

Resource management - lock cleanup, subprocess handling
Reliability - LLM retries, health checks, error recovery
Code quality - type consistency, validation, logging

Total estimated effort: 1-2 weeks for all improvements.
