Files
budget/.plans/hardening_050626.md
T
Spencer Twaddle ac3dcc2f31 Security & resource hardening: eliminate CPU/disk attack surface
Addresses production CPU spike incident. Key changes:

- Guard OTel exporter behind OTEL_EXPORTER_OTLP_ENDPOINT env var; filter
  tracing to /api paths only — unconditional export was primary suspect
- Remove /healthz endpoint entirely (unauthenticated, hit DB on every call)
- Replace KnownUserMiddleware with POST /api/users/me called once on login
  from TokenSync — eliminates unconditional DB write on every request
- Add DB indexes: (BudgetId, IsDeleted) on Incomes/Outgos, OwnerUserId on
  Budgets, SharedWithUserId and (IsPending, SharedWithEmail) on BudgetShares
- Move UseRateLimiter() before UseStaticFiles() so all requests are throttled
- Replace full-array reorder with move-by-position (id + newIndex) — bounded
  input, fewer DB writes, better API design
- Lock ForwardedHeaders to 172.20.0.0/16 subnet; fixes KnownNetworks
  deprecation warning (0 warnings in build now)
- Add AsNoTracking() to all read-only queries in Summary/Incomes/OutgosController
- FrequencyCalculator returns 0 for unknown enum values instead of throwing
- Thread.Sleep → await Task.Delay in OIDC startup loop
- AllowedHosts locked to budget.stwaddle.com

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-06 22:17:18 -05:00

139 lines
8.2 KiB
Markdown

# Security & Resource Hardening Plan
**Date:** 2026-05-06
**Trigger:** Production host CPU pegged >100% for several hours; Disk I/O high; network normal.
**Root cause hypothesis:** C-1 (unconditional OTel exporter) + H-1 (unconditional DB write per request).
---
## Critical
### ~~C-1 — Guard `UseOtlpExporter()` Behind Env Var Check~~ ✅ DONE
**Resolved 2026-05-06:** Entire OTel block (logging + tracing + exporter) is now gated on `OTEL_EXPORTER_OTLP_ENDPOINT` being set. AspNetCore instrumentation also filtered to `/api` paths only so static file requests never generate spans. The duplicate `builder.Logging.AddOpenTelemetry()` is inside the same guard, eliminating the double-pipeline issue. Build verified clean.
---
### ~~C-2 — Require Auth on `/healthz` (or Remove DB Check)~~ ✅ DONE
**Resolved 2026-05-06:** Endpoint removed entirely. Not wired to any orchestrator health check; budget app is lower priority than recipes/journal. `AddHealthChecks()`, `AddDbContextCheck<AppDbContext>()`, `MapHealthChecks()`, and both health-check `using` statements removed from `Program.cs`. Build verified clean.
---
### ~~C-3 — Move `UseRateLimiter()` Before `UseStaticFiles()`~~ ✅ DONE
**File:** `src/Budget.Api/Program.cs`
**Risk:** `UseStaticFiles` short-circuits the pipeline before `UseRateLimiter`, so all SPA asset requests (JS bundles, CSS, `index.html` fallback) bypass rate limiting. Combined with C-1, every unthrottled static file request still generates a telemetry span.
**Fix:** Reorder middleware so `app.UseRateLimiter()` appears before `app.UseDefaultFiles()` and `app.UseStaticFiles()`.
---
## High
### ~~H-1 — Throttle `KnownUserMiddleware` DB Writes~~ ✅ DONE
**Resolved 2026-05-06:** `KnownUserMiddleware` removed entirely. Replaced with `POST /api/users/me` endpoint in new `UsersController` — same upsert + pending share resolution logic, but now called once on login from `TokenSync.tsx` rather than on every request. `api.post` body param made optional to support the no-body call. Backend and frontend builds verified clean.
---
### H-2 — Reduce `BudgetAuthorizationService` to a Single Query
**File:** `src/Budget.Infrastructure/Services/BudgetAuthorizationService.cs`
**Risk:** `GetAccessAsync` always issues two sequential queries (budget row, then share row). Combined with controller entity loads, a single write endpoint makes 4+ DB round-trips; `GET /summary` makes 5.
**Fix:** Combine into one projected query:
```csharp
var result = await db.Budgets
.AsNoTracking()
.Where(b => b.Id == budgetId)
.Select(b => new {
b.OwnerUserId,
Share = b.Shares.FirstOrDefault(s => s.SharedWithUserId == userId && !s.IsPending)
})
.FirstOrDefaultAsync();
```
---
### ~~H-3 — Add Missing Database Indexes~~ ✅ DONE
**Resolved 2026-05-06:** Added 5 indexes to `AppDbContext.OnModelCreating`. Migration `20260507024323_AddHardeningIndexes` generated and verified. Replaces single-column `BudgetId` indexes on Incomes/Outgos with composite `(BudgetId, IsDeleted)` versions; adds `BudgetShares.(IsPending, SharedWithEmail)`, `BudgetShares.SharedWithUserId`, and `Budgets.OwnerUserId`. Will be applied automatically at next startup via `MigrateAsync()`.
---
### ~~H-4 — Cap `Reorder` Endpoint Input Size~~ ✅ DONE
**Resolved 2026-05-06:** Replaced full-array reorder with a move-by-position design. `ReorderIncomesRequest`/`ReorderOutgosRequest` removed and replaced with `MoveIncomeRequest(Guid Id, int NewIndex)` / `MoveOutgoRequest(Guid Id, int NewIndex)`. Server loads items sorted by SortOrder, finds the item, shifts only the affected range, sets the new position. Frontend sends `{ id, newIndex }` from dnd-kit's existing `oldIdx`/`newIdx``arrayMove` still used for optimistic local state. Input is inherently bounded to one GUID and one integer.
---
### ~~H-5 — Lock Down `ForwardedHeaders` to Known Proxy IP~~ ✅ DONE
**Resolved 2026-05-06:** Replaced clear-all with explicit subnet trust. `KnownIPNetworks` (the .NET 10 replacement for deprecated `KnownNetworks`) is cleared and populated with `172.20.0.0/16` — the Docker network subnet for the nginx-proxy stack. Using the subnet rather than a specific container IP avoids breakage if nginx-proxy gets a different IP on restart. Deprecation warning eliminated; build clean with 0 warnings.
---
## Medium
### M-1 — OIDC JWKS Fetched Over Plain HTTP
**File:** `src/Budget.Api/Program.cs`
**Risk:** `MetadataAddress` points to `http://auth:8080/...`. JWKS (token signing keys) arrive over plain HTTP. On a compromised Docker network, an attacker could substitute keys and forge valid JWTs. Low severity on a single-host deployment; higher risk in any multi-host topology.
**Fix:** Enable internal TLS on the auth service, or pin the signing key in configuration instead of relying on discovery.
---
### ~~M-2 — Unknown `Frequency` Enum Crashes Entire Budget~~ ✅ DONE
**File:** `src/Budget.Core/Services/FrequencyCalculator.cs`
**Risk:** An unrecognised enum integer (data corruption or future migration) throws `ArgumentOutOfRangeException`, making all income/outgo/summary endpoints permanently broken for that budget and generating error spans that amplify OTel export volume.
**Fix:** Validate enum values at the controller layer with `[EnumDataType(typeof(Frequency))]` on write DTOs. Return a graceful default or 422 in the calculator rather than throwing.
---
### ~~M-3 — Missing `AsNoTracking()` on Read Queries~~ ✅ DONE
**Files:** `src/Budget.Api/Controllers/SummaryController.cs`, `IncomesController.cs`, `OutgosController.cs`
**Risk:** EF change tracking runs on every loaded entity even for pure-read endpoints, wasting CPU/memory. With OTel EF instrumentation active, every tracked entity event adds overhead.
**Fix:** Add `.AsNoTracking()` to all `List` and `Get` queries that do not write.
---
### ~~M-4 — `Thread.Sleep` in Async Startup Loop~~ ✅ DONE
**File:** `src/Budget.Api/Program.cs`
**Risk:** Synchronous `Thread.Sleep` inside the async OIDC discovery retry loop blocks a thread pool thread for up to 2 minutes during container startup.
**Fix:**
```csharp
await Task.Delay(TimeSpan.FromSeconds(5));
```
---
### ~~M-5 — `AllowedHosts: "*"` Disables Host Header Validation~~ ✅ DONE
**File:** `src/Budget.Api/appsettings.json`
**Risk:** Allows HTTP host header injection if the app is ever behind a misconfigured proxy.
**Fix:** Set to the actual public hostname:
```json
"AllowedHosts": "budget.stwaddle.com"
```
---
## Low / Informational
| ID | File | Issue | Fix |
|----|------|-------|-----|
| L-1 | `appsettings.json` | Docker-internal `MetadataAddress` in base config — if the Docker service name `auth` is not resolvable, OIDC discovery silently fails | Move `MetadataAddress` to a Docker Compose env override; base config default should be empty |
| L-2 | `Program.cs` | No timeout on startup `MigrateAsync()` — hangs indefinitely if Postgres is unresponsive | Wrap with a `CancellationTokenSource` with a deadline |
| L-3 | `ErrorHandlingMiddleware.cs` | Full exception stack traces exported via OTel on every unhandled error — exception storms spike collector disk I/O | Consider `LogWarning` for expected exceptions; reserve `LogError` for truly unexpected ones |
---
## Implementation Order
1. ~~**C-1** — Guard `UseOtlpExporter()` (likely root cause; deploy ASAP)~~ ✅ Done (2026-05-06)
2. ~~**H-1** — Throttle `KnownUserMiddleware` writes (amplifies Disk I/O on every request)~~ ✅ Done (2026-05-06)
3. ~~**C-2** — Auth-gate or simplify `/healthz`~~ ✅ Removed entirely (2026-05-06)
4. ~~**H-3** — Add missing indexes (single migration, low risk)~~ ✅ Done (2026-05-06)
5. ~~**C-3** — Move rate limiter before static files~~ ✅ Done (2026-05-06)
6. ~~**H-4** — Cap `Reorder` input~~ ✅ Done — redesigned as move-by-position (2026-05-06)
7. ~~**H-5** — Lock down `ForwardedHeaders` to proxy IP~~ ✅ Done (2026-05-06)
8. ~~**M-3** — Add `AsNoTracking()` across read controllers~~ ✅ Done (2026-05-06)
9. ~~**M-2** — Harden `FrequencyCalculator` enum handling~~ ✅ Done (2026-05-06)
10. ~~**M-4** — `Thread.Sleep` → `await Task.Delay` in startup loop~~ ✅ Done (2026-05-06)
11. ~~**M-5** — `AllowedHosts: "*"` → `budget.stwaddle.com`~~ ✅ Done (2026-05-06)