marker/guarddog-nexus

Fork 0

Files

Marker689 04abe44ab4 refactor: uv-based deps, no nexus auth, LLM retries, lock cleanup, health checks, e2e tests

2026-05-11 19:27:56 +03:00

5.5 KiB

Raw Blame History

GuardDog Nexus - Improvement Plan

Status Check: Completed Analysis

After thoroughly reviewing the codebase, I've identified several areas for improvement. Below is the prioritized plan.

Priority 1: Critical Bug Fixes

1.1 Fix Harvester Race Condition (T4)

File: guarddog_nexus/core/harvester.py Issue: Lines 56-81 have a race condition where the URL lock cleanup happens in a finally block while the lock check happens before acquisition.

Current problematic flow:

# Line 56-65: Check if locked
async with _url_lock:
    if download_url not in _url_locks:
        _url_locks[download_url] = asyncio.Lock()

lock = _url_locks[download_url]
if lock.locked():
    # Skip...
    async with _url_lock:
        _url_locks.pop(download_url, None)  # Line 64
    return None

async with lock:
    # ... work ...
    finally:
        async with _url_lock:
            _url_locks.pop(download_url, None)  # Line 81

Fix: Ensure lock cleanup only happens after successful work completion, not on early returns.

1.2 Fix LLM Response Parsing (T4)

File: guarddog_nexus/core/llm.py Issue: Line 81 assumes body["choices"][0]["message"]["content"] exists without validation.

Fix: Add proper error handling:

try:
    choices = body.get("choices", [])
    if not choices:
        raise ValueError("No choices in response")
    message = choices[0].get("message", {})
    content = message.get("content", "")
    return json.loads(content)
except (KeyError, IndexError, json.JSONDecodeError) as e:
    ...

Priority 2: Reliability Improvements

2.1 Add LLM Retry Logic (T4)

File: guarddog_nexus/core/llm.py Issue: Failed LLM calls have no retry mechanism.

Fix: Add exponential backoff retry:

async def analyze_finding(finding_data: dict, max_retries: int = 3) -> dict | None:
    for attempt in range(max_retries):
        result = await _attempt_analysis(finding_data)
        if result:
            return result
        if attempt < max_retries - 1:
            await asyncio.sleep(2 ** attempt)
    return None

2.2 Add Dependency Health Checks (T4)

File: guarddog_nexus/main.py Issue: No health checks for database or external dependencies.

Fix: Add /health/dependencies endpoint:

@app.get("/health/dependencies")
async def health_dependencies():
    checks = {
        "database": await _check_db_health(),
        "nexus": await _check_nexus_connectivity(),
    }
    status = 200 if all(checks.values()) else 503
    return JSONResponse(status_code=status, content=checks)

2.3 Fix Lock Cleanup (T4)

Files: guarddog_nexus/core/harvester.py, guarddog_nexus/routes/web.py Issue: _url_locks and _llm_locks dicts grow indefinitely.

Fix: Add periodic cleanup using asyncio.create_task():

async def _cleanup_locks():
    while True:
        await asyncio.sleep(3600)  # Every hour
        for key in list(_url_locks.keys()):
            if not _url_locks[key].locked():
                _url_locks.pop(key, None)

Priority 3: Code Quality

3.1 Add Type Hints Consistency (T1)

Files: Multiple files Issue: Inconsistent use of dict vs Dict[str, Any] type hints.

3.2 Add Input Validation (T4)

File: guarddog_nexus/routes/webhooks.py Issue: Limited validation of webhook payload structure.

Fix: Add validation for required fields before processing.

3.3 Add Logging for Security Events (T4)

Files: guarddog_nexus/core/harvester.py, guarddog_nexus/routes/webhooks.py Issue: Security-related events not logged at appropriate levels.

Fix: Add WARNING/CRITICAL logging for:

Failed authentication attempts
Suspicious package patterns
Rate limiting triggers

Priority 4: Feature Enhancements

4.1 Add Scan Status Tracking (T4)

File: guarddog_nexus/core/harvester.py Issue: No visibility into scan progress for long-running packages.

Fix: Add intermediate status updates via WebSocket or polling endpoint.

4.2 Add Configuration Validation on Startup (T1)

File: guarddog_nexus/config.py Issue: Invalid configurations discovered only at runtime.

Fix: Add validation in Config.__post_init__():

def __post_init__(self):
    if not self.nexus_password:
        raise ValueError("NEXUS_PASSWORD is required")
    if self.llm_enabled and not self.llm_api_key:
        raise ValueError("LLM_API_KEY required when LLM_ENABLED=1")

4.3 Add CSV Export with Filters (T4)

Files: guarddog_nexus/routes/api_scans.py, guarddog_nexus/routes/api_packages.py Issue: CSV exports don't support all filter options available in API.

Fix: Sync filter parameters between API and CSV export endpoints.

Implementation Order

Week 1: Priority 1 fixes (race condition, LLM parsing)
Week 2: Priority 2 improvements (retries, health checks, lock cleanup)
Week 3: Priority 3 code quality (type hints, validation, logging)
Week 4: Priority 4 features (status tracking, config validation, CSV filters)

Verification Steps

After each change:

Run ruff check guarddog_nexus tests
Run python3 -m pytest -v (must pass 85 tests)
Run ruff format guarddog_nexus tests
Manual testing with Docker Compose

Risk Assessment

Change	Risk Level	Mitigation
Harvester race condition fix	MEDIUM	Thorough concurrent testing
LLM retry logic	LOW	Ensure idempotency
Health checks	LOW	Graceful degradation
Lock cleanup	LOW	Conservative cleanup intervals

5.5 KiB Raw Blame History