ac3dcc2f31
Addresses production CPU spike incident. Key changes: - Guard OTel exporter behind OTEL_EXPORTER_OTLP_ENDPOINT env var; filter tracing to /api paths only — unconditional export was primary suspect - Remove /healthz endpoint entirely (unauthenticated, hit DB on every call) - Replace KnownUserMiddleware with POST /api/users/me called once on login from TokenSync — eliminates unconditional DB write on every request - Add DB indexes: (BudgetId, IsDeleted) on Incomes/Outgos, OwnerUserId on Budgets, SharedWithUserId and (IsPending, SharedWithEmail) on BudgetShares - Move UseRateLimiter() before UseStaticFiles() so all requests are throttled - Replace full-array reorder with move-by-position (id + newIndex) — bounded input, fewer DB writes, better API design - Lock ForwardedHeaders to 172.20.0.0/16 subnet; fixes KnownNetworks deprecation warning (0 warnings in build now) - Add AsNoTracking() to all read-only queries in Summary/Incomes/OutgosController - FrequencyCalculator returns 0 for unknown enum values instead of throwing - Thread.Sleep → await Task.Delay in OIDC startup loop - AllowedHosts locked to budget.stwaddle.com Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
139 lines
8.2 KiB
Markdown
139 lines
8.2 KiB
Markdown
# Security & Resource Hardening Plan
|
|
**Date:** 2026-05-06
|
|
**Trigger:** Production host CPU pegged >100% for several hours; Disk I/O high; network normal.
|
|
**Root cause hypothesis:** C-1 (unconditional OTel exporter) + H-1 (unconditional DB write per request).
|
|
|
|
---
|
|
|
|
## Critical
|
|
|
|
### ~~C-1 — Guard `UseOtlpExporter()` Behind Env Var Check~~ ✅ DONE
|
|
**Resolved 2026-05-06:** Entire OTel block (logging + tracing + exporter) is now gated on `OTEL_EXPORTER_OTLP_ENDPOINT` being set. AspNetCore instrumentation also filtered to `/api` paths only so static file requests never generate spans. The duplicate `builder.Logging.AddOpenTelemetry()` is inside the same guard, eliminating the double-pipeline issue. Build verified clean.
|
|
|
|
---
|
|
|
|
### ~~C-2 — Require Auth on `/healthz` (or Remove DB Check)~~ ✅ DONE
|
|
**Resolved 2026-05-06:** Endpoint removed entirely. Not wired to any orchestrator health check; budget app is lower priority than recipes/journal. `AddHealthChecks()`, `AddDbContextCheck<AppDbContext>()`, `MapHealthChecks()`, and both health-check `using` statements removed from `Program.cs`. Build verified clean.
|
|
|
|
---
|
|
|
|
### ~~C-3 — Move `UseRateLimiter()` Before `UseStaticFiles()`~~ ✅ DONE
|
|
**File:** `src/Budget.Api/Program.cs`
|
|
**Risk:** `UseStaticFiles` short-circuits the pipeline before `UseRateLimiter`, so all SPA asset requests (JS bundles, CSS, `index.html` fallback) bypass rate limiting. Combined with C-1, every unthrottled static file request still generates a telemetry span.
|
|
|
|
**Fix:** Reorder middleware so `app.UseRateLimiter()` appears before `app.UseDefaultFiles()` and `app.UseStaticFiles()`.
|
|
|
|
---
|
|
|
|
## High
|
|
|
|
### ~~H-1 — Throttle `KnownUserMiddleware` DB Writes~~ ✅ DONE
|
|
**Resolved 2026-05-06:** `KnownUserMiddleware` removed entirely. Replaced with `POST /api/users/me` endpoint in new `UsersController` — same upsert + pending share resolution logic, but now called once on login from `TokenSync.tsx` rather than on every request. `api.post` body param made optional to support the no-body call. Backend and frontend builds verified clean.
|
|
|
|
---
|
|
|
|
### H-2 — Reduce `BudgetAuthorizationService` to a Single Query
|
|
**File:** `src/Budget.Infrastructure/Services/BudgetAuthorizationService.cs`
|
|
**Risk:** `GetAccessAsync` always issues two sequential queries (budget row, then share row). Combined with controller entity loads, a single write endpoint makes 4+ DB round-trips; `GET /summary` makes 5.
|
|
|
|
**Fix:** Combine into one projected query:
|
|
```csharp
|
|
var result = await db.Budgets
|
|
.AsNoTracking()
|
|
.Where(b => b.Id == budgetId)
|
|
.Select(b => new {
|
|
b.OwnerUserId,
|
|
Share = b.Shares.FirstOrDefault(s => s.SharedWithUserId == userId && !s.IsPending)
|
|
})
|
|
.FirstOrDefaultAsync();
|
|
```
|
|
|
|
---
|
|
|
|
### ~~H-3 — Add Missing Database Indexes~~ ✅ DONE
|
|
**Resolved 2026-05-06:** Added 5 indexes to `AppDbContext.OnModelCreating`. Migration `20260507024323_AddHardeningIndexes` generated and verified. Replaces single-column `BudgetId` indexes on Incomes/Outgos with composite `(BudgetId, IsDeleted)` versions; adds `BudgetShares.(IsPending, SharedWithEmail)`, `BudgetShares.SharedWithUserId`, and `Budgets.OwnerUserId`. Will be applied automatically at next startup via `MigrateAsync()`.
|
|
|
|
---
|
|
|
|
### ~~H-4 — Cap `Reorder` Endpoint Input Size~~ ✅ DONE
|
|
**Resolved 2026-05-06:** Replaced full-array reorder with a move-by-position design. `ReorderIncomesRequest`/`ReorderOutgosRequest` removed and replaced with `MoveIncomeRequest(Guid Id, int NewIndex)` / `MoveOutgoRequest(Guid Id, int NewIndex)`. Server loads items sorted by SortOrder, finds the item, shifts only the affected range, sets the new position. Frontend sends `{ id, newIndex }` from dnd-kit's existing `oldIdx`/`newIdx` — `arrayMove` still used for optimistic local state. Input is inherently bounded to one GUID and one integer.
|
|
|
|
---
|
|
|
|
### ~~H-5 — Lock Down `ForwardedHeaders` to Known Proxy IP~~ ✅ DONE
|
|
**Resolved 2026-05-06:** Replaced clear-all with explicit subnet trust. `KnownIPNetworks` (the .NET 10 replacement for deprecated `KnownNetworks`) is cleared and populated with `172.20.0.0/16` — the Docker network subnet for the nginx-proxy stack. Using the subnet rather than a specific container IP avoids breakage if nginx-proxy gets a different IP on restart. Deprecation warning eliminated; build clean with 0 warnings.
|
|
|
|
---
|
|
|
|
## Medium
|
|
|
|
### M-1 — OIDC JWKS Fetched Over Plain HTTP
|
|
**File:** `src/Budget.Api/Program.cs`
|
|
**Risk:** `MetadataAddress` points to `http://auth:8080/...`. JWKS (token signing keys) arrive over plain HTTP. On a compromised Docker network, an attacker could substitute keys and forge valid JWTs. Low severity on a single-host deployment; higher risk in any multi-host topology.
|
|
|
|
**Fix:** Enable internal TLS on the auth service, or pin the signing key in configuration instead of relying on discovery.
|
|
|
|
---
|
|
|
|
### ~~M-2 — Unknown `Frequency` Enum Crashes Entire Budget~~ ✅ DONE
|
|
**File:** `src/Budget.Core/Services/FrequencyCalculator.cs`
|
|
**Risk:** An unrecognised enum integer (data corruption or future migration) throws `ArgumentOutOfRangeException`, making all income/outgo/summary endpoints permanently broken for that budget and generating error spans that amplify OTel export volume.
|
|
|
|
**Fix:** Validate enum values at the controller layer with `[EnumDataType(typeof(Frequency))]` on write DTOs. Return a graceful default or 422 in the calculator rather than throwing.
|
|
|
|
---
|
|
|
|
### ~~M-3 — Missing `AsNoTracking()` on Read Queries~~ ✅ DONE
|
|
**Files:** `src/Budget.Api/Controllers/SummaryController.cs`, `IncomesController.cs`, `OutgosController.cs`
|
|
**Risk:** EF change tracking runs on every loaded entity even for pure-read endpoints, wasting CPU/memory. With OTel EF instrumentation active, every tracked entity event adds overhead.
|
|
|
|
**Fix:** Add `.AsNoTracking()` to all `List` and `Get` queries that do not write.
|
|
|
|
---
|
|
|
|
### ~~M-4 — `Thread.Sleep` in Async Startup Loop~~ ✅ DONE
|
|
**File:** `src/Budget.Api/Program.cs`
|
|
**Risk:** Synchronous `Thread.Sleep` inside the async OIDC discovery retry loop blocks a thread pool thread for up to 2 minutes during container startup.
|
|
|
|
**Fix:**
|
|
```csharp
|
|
await Task.Delay(TimeSpan.FromSeconds(5));
|
|
```
|
|
|
|
---
|
|
|
|
### ~~M-5 — `AllowedHosts: "*"` Disables Host Header Validation~~ ✅ DONE
|
|
**File:** `src/Budget.Api/appsettings.json`
|
|
**Risk:** Allows HTTP host header injection if the app is ever behind a misconfigured proxy.
|
|
|
|
**Fix:** Set to the actual public hostname:
|
|
```json
|
|
"AllowedHosts": "budget.stwaddle.com"
|
|
```
|
|
|
|
---
|
|
|
|
## Low / Informational
|
|
|
|
| ID | File | Issue | Fix |
|
|
|----|------|-------|-----|
|
|
| L-1 | `appsettings.json` | Docker-internal `MetadataAddress` in base config — if the Docker service name `auth` is not resolvable, OIDC discovery silently fails | Move `MetadataAddress` to a Docker Compose env override; base config default should be empty |
|
|
| L-2 | `Program.cs` | No timeout on startup `MigrateAsync()` — hangs indefinitely if Postgres is unresponsive | Wrap with a `CancellationTokenSource` with a deadline |
|
|
| L-3 | `ErrorHandlingMiddleware.cs` | Full exception stack traces exported via OTel on every unhandled error — exception storms spike collector disk I/O | Consider `LogWarning` for expected exceptions; reserve `LogError` for truly unexpected ones |
|
|
|
|
---
|
|
|
|
## Implementation Order
|
|
|
|
1. ~~**C-1** — Guard `UseOtlpExporter()` (likely root cause; deploy ASAP)~~ ✅ Done (2026-05-06)
|
|
2. ~~**H-1** — Throttle `KnownUserMiddleware` writes (amplifies Disk I/O on every request)~~ ✅ Done (2026-05-06)
|
|
3. ~~**C-2** — Auth-gate or simplify `/healthz`~~ ✅ Removed entirely (2026-05-06)
|
|
4. ~~**H-3** — Add missing indexes (single migration, low risk)~~ ✅ Done (2026-05-06)
|
|
5. ~~**C-3** — Move rate limiter before static files~~ ✅ Done (2026-05-06)
|
|
6. ~~**H-4** — Cap `Reorder` input~~ ✅ Done — redesigned as move-by-position (2026-05-06)
|
|
7. ~~**H-5** — Lock down `ForwardedHeaders` to proxy IP~~ ✅ Done (2026-05-06)
|
|
8. ~~**M-3** — Add `AsNoTracking()` across read controllers~~ ✅ Done (2026-05-06)
|
|
9. ~~**M-2** — Harden `FrequencyCalculator` enum handling~~ ✅ Done (2026-05-06)
|
|
10. ~~**M-4** — `Thread.Sleep` → `await Task.Delay` in startup loop~~ ✅ Done (2026-05-06)
|
|
11. ~~**M-5** — `AllowedHosts: "*"` → `budget.stwaddle.com`~~ ✅ Done (2026-05-06)
|