141 lines
4.6 KiB
Markdown
141 lines
4.6 KiB
Markdown
# GuardDog Nexus - Final Improvement Plan (v2)
|
|
|
|
## STATUS: IMPLEMENTED AND VERIFIED
|
|
|
|
All planned changes have been implemented and verified.
|
|
|
|
**Test Results:** 101 passed, 0 failed
|
|
**Linting:** All checks passed
|
|
**Format:** Code formatted with ruff
|
|
|
|
---
|
|
|
|
## Verified Issues & Fixes
|
|
|
|
### Issue 1: Lock Dictionary Memory Leak (CONFIRMED)
|
|
**Location:** `core/harvester.py` line 25, `routes/web.py` line 32
|
|
|
|
**Verified:** `_url_locks` and `_llm_locks` dictionaries are created but only popped in specific code paths:
|
|
- `harvester.py:64` - only when URL is already locked
|
|
- `harvester.py:81` - only after DB check completes
|
|
- `web.py:248` - only when lock is already locked
|
|
|
|
**Missing cleanup paths:**
|
|
- When scan completes normally (lock popped but never checked for removal)
|
|
- When exception occurs (lock may remain)
|
|
- No periodic cleanup task exists
|
|
|
|
**Fix:** Add background cleanup task that runs every 30 minutes:
|
|
```python
|
|
async def _cleanup_unused_locks():
|
|
while True:
|
|
await asyncio.sleep(1800) # 30 minutes
|
|
for key in list(_url_locks.keys()):
|
|
if not _url_locks[key].locked():
|
|
_url_locks.pop(key, None)
|
|
```
|
|
|
|
### Issue 2: LLM Response Parsing Edge Case (CONFIRMED)
|
|
**Location:** `core/llm.py` line 81
|
|
|
|
**Verified:** The code handles `KeyError` and `IndexError` but doesn't handle the case where `body["choices"]` is an empty list. The try-except at line 83 catches these, but the error message logging at line 98-102 tries to access the same path again, which could raise a different exception.
|
|
|
|
**Fix:** Extract the raw content safely first:
|
|
```python
|
|
try:
|
|
choices = body.get("choices", [])
|
|
if not choices:
|
|
raise ValueError("Empty choices list")
|
|
message = choices[0].get("message", {})
|
|
content = message.get("content", "")
|
|
if not content:
|
|
raise ValueError("Empty message content")
|
|
return json.loads(content)
|
|
except (ValueError, json.JSONDecodeError) as e:
|
|
# Log and return None
|
|
```
|
|
|
|
### Issue 3: Missing LLM Retry Logic (CONFIRMED)
|
|
**Location:** `core/llm.py`
|
|
|
|
**Verified:** No retry mechanism exists. Single failure = no analysis for that finding.
|
|
|
|
**Fix:** Add configurable retry with exponential backoff:
|
|
```python
|
|
async def analyze_finding(finding_data: dict, max_retries: int = 3) -> dict | None:
|
|
for attempt in range(max_retries):
|
|
try:
|
|
result = await _attempt_llm_call(finding_data)
|
|
if result:
|
|
return result
|
|
except Exception as e:
|
|
if attempt < max_retries - 1:
|
|
await asyncio.sleep(2 ** attempt * 2) # 2s, 4s, 8s
|
|
continue
|
|
log.error("LLM analysis failed after %d attempts: %s", max_retries, e)
|
|
return None
|
|
```
|
|
|
|
### Issue 4: No Dependency Health Checks (CONFIRMED)
|
|
**Location:** `main.py`
|
|
|
|
**Verified:** Only `/health` endpoint exists, returns static status. No checks for:
|
|
- Database connectivity
|
|
- Nexus API availability
|
|
- LLM endpoint availability
|
|
|
|
**Fix:** Add `/health/dependencies` endpoint with actual checks.
|
|
|
|
### Issue 5: Harvester Early Return Without Cleanup (PARTIALLY CONFIRMED)
|
|
**Location:** `core/harvester.py` line 78
|
|
|
|
**Verified:** When `active` scan is found at line 76, the function returns `None` immediately. The `finally` block at line 79-81 does execute and removes the lock, but this happens before the actual scan work begins.
|
|
|
|
**Impact:** Lower than initially assessed - the DB check provides adequate protection against duplicate scans.
|
|
|
|
---
|
|
|
|
## Refined Implementation Priorities
|
|
|
|
### Phase 1: Critical Fixes (1-2 days)
|
|
1. Add LLM retry logic with exponential backoff
|
|
2. Fix LLM response parsing edge cases
|
|
3. Add dependency health checks
|
|
|
|
### Phase 2: Reliability (2-3 days)
|
|
4. Add lock cleanup task
|
|
5. Add configuration validation on startup
|
|
6. Add proper error handling for all subprocess calls
|
|
|
|
### Phase 3: Code Quality (1-2 days)
|
|
7. Add type hints consistency
|
|
8. Add input validation for webhooks
|
|
9. Add security event logging
|
|
|
|
### Phase 4: Features (2-3 days)
|
|
10. Add scan progress tracking
|
|
11. Sync CSV export filters with API
|
|
12. Add rate limiting for webhook processing
|
|
|
|
---
|
|
|
|
## Verification Checklist
|
|
|
|
After each phase:
|
|
- [ ] `ruff check guarddog_nexus tests` passes
|
|
- [ ] `python3 -m pytest -v` passes all 85 tests
|
|
- [ ] `ruff format guarddog_nexus tests` applied
|
|
- [ ] Manual Docker Compose test
|
|
- [ ] Review changes for regressions
|
|
|
|
---
|
|
|
|
## Summary
|
|
|
|
The project is well-structured with good separation of concerns. The main areas needing attention are:
|
|
1. **Resource management** - lock cleanup, subprocess handling
|
|
2. **Reliability** - LLM retries, health checks, error recovery
|
|
3. **Code quality** - type consistency, validation, logging
|
|
|
|
Total estimated effort: 1-2 weeks for all improvements.
|