OAuth State Migration and Security Hardening (SOK-76, SOK-77)

Summary

This work migrated the Grimoire (Wise Delivery) OAuth state management from volatile in-memory storage to a persistent Redis backend and hardened the security of the authorization flow. It addressed critical vulnerabilities in token verification, implemented Proof Key for Code Exchange (PKCE), and tightened Dynamic Client Registration (DCR).

Details

This development phase focused on transitioning the Grimoire system’s OAuth implementation from a development-centric in-memory model to a production-ready infrastructure. The work was tracked under three primary Linear tickets: SOK-76 (Redis state), SOK-77 (None-safety and DCR), and SOK-81 (Test suite expansion).

Redis State Backend (SOK-76)

To support horizontal scaling and persistence across container restarts in Azure Container Apps, the system’s three primary OAuth state dictionaries—registered_clients, auth_codes, and pending_authorizations—were replaced with a Redis-backed store.

  • Infrastructure: Introduced redis[hiredis]>=5.0.0 as a dependency.
  • Implementation: Created infrastructure/redis/client.py for async Redis management and infrastructure/redis/oauth_store.py for the OAuthStateStore.
  • Key Patterns: Data is partitioned using specific key prefixes:
    • grimoire:oauth:pending:{auth_id} (10-minute TTL)
    • grimoire:oauth:auth_code:{code} (10-minute TTL)
    • grimoire:oauth:client:{client_id} (Persistent)
  • Configuration: The system reads the Redis connection string from the REDIS_URL environment variable. For local development it can be left empty (which triggers a warning). Production on Azure requires a rediss:// URL so that connections to Azure Cache for Redis use TLS.
  • Resilience: The /health endpoint was updated to monitor Redis connectivity. If Redis is unavailable, the endpoint reports a degraded status, and OAuth requests are rejected with 503 Service Unavailable by the _require_redis() guard.
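
The key layout above can be sketched as follows. This is a minimal illustration and not the actual contents of infrastructure/redis/oauth_store.py: the method names, the JSON serialization, and the single-use handling of authorization codes are assumptions. FakeRedis is a hypothetical in-memory stand-in that mimics only the set/get/delete calls of redis.asyncio.Redis so the sketch runs without a server.

```python
import asyncio
import json
import time
from typing import Any, Dict, Optional, Tuple

PENDING_TTL_SECONDS = 600    # 10-minute TTL for pending authorizations
AUTH_CODE_TTL_SECONDS = 600  # 10-minute TTL for authorization codes


class FakeRedis:
    """Hypothetical in-memory stand-in for the redis.asyncio.Redis calls used below."""

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[str, Optional[float]]] = {}

    async def set(self, name: str, value: str, ex: Optional[int] = None) -> bool:
        expires_at = time.monotonic() + ex if ex is not None else None
        self._data[name] = (value, expires_at)
        return True

    async def get(self, name: str) -> Optional[str]:
        entry = self._data.get(name)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and time.monotonic() >= expires_at:
            del self._data[name]  # lazily drop expired keys
            return None
        return value

    async def delete(self, *names: str) -> int:
        return sum(1 for n in names if self._data.pop(n, None) is not None)


class OAuthStateStore:
    """Sketch of a store using the three key prefixes listed above."""

    def __init__(self, client: Any) -> None:
        self._redis = client

    async def save_pending(self, auth_id: str, data: Dict[str, Any]) -> None:
        key = f"grimoire:oauth:pending:{auth_id}"
        await self._redis.set(key, json.dumps(data), ex=PENDING_TTL_SECONDS)

    async def save_auth_code(self, code: str, data: Dict[str, Any]) -> None:
        key = f"grimoire:oauth:auth_code:{code}"
        await self._redis.set(key, json.dumps(data), ex=AUTH_CODE_TTL_SECONDS)

    async def consume_auth_code(self, code: str) -> Optional[Dict[str, Any]]:
        # Assumption: codes are single-use, so the key is deleted once read.
        # This get-then-delete pair is not atomic; it is only a sketch.
        key = f"grimoire:oauth:auth_code:{code}"
        raw = await self._redis.get(key)
        if raw is None:
            return None
        await self._redis.delete(key)
        return json.loads(raw)

    async def save_client(self, client_id: str, data: Dict[str, Any]) -> None:
        # No TTL: registered clients are persistent.
        await self._redis.set(f"grimoire:oauth:client:{client_id}", json.dumps(data))

    async def get_client(self, client_id: str) -> Optional[Dict[str, Any]]:
        raw = await self._redis.get(f"grimoire:oauth:client:{client_id}")
        return json.loads(raw) if raw is not None else None


async def demo() -> None:
    store = OAuthStateStore(FakeRedis())
    await store.save_client("abc", {"redirect_uris": ["https://example.com/cb"]})
    await store.save_auth_code("code-1", {"client_id": "abc"})
    print(await store.consume_auth_code("code-1"))  # stored payload
    print(await store.consume_auth_code("code-1"))  # None on second use


if __name__ == "__main__":
    asyncio.run(demo())
```

With a real deployment the FakeRedis instance would presumably be replaced by an async Redis client built from REDIS_URL. The store only depends on the three calls shown, which keeps it easy to unit test.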

Security Hardening and OAuth Refactoring (SOK-77)

The OAuth flow in oauth.py underwent a full rewrite to address safety and security gaps:

  • None-Safety: Fixed issues in verify_token() where payload.get("sub") and the email lookup could return None, causing downstream crashes. Explicit isinstance checks were added for sub, email, and scopes.
  • PKCE Implementation: Added support for PKCE S256 verification. The code_challenge is stored during authorization and verified against the code_verifier during the token exchange.
  • DCR Validation: Tightened Dynamic Client Registration. The system no longer allows auto-registration of clients based on ID prefixes (e.g., client_). Clients must now be explicitly registered via the /oauth/register endpoint.
  • Validation Logic: Added redirect_uri validation against registered URIs and implemented secrets.compare_digest for secure client_secret validation.
  • Logging: Replaced f-string log messages with structured keyword arguments to improve observability in production logs.
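
Three of the checks above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in oauth.py: the function names are hypothetical, and the shape of the scopes claim (a list of strings) is assumed. The S256 transform itself follows RFC 7636: the challenge is the unpadded base64url encoding of the SHA-256 digest of the verifier.

```python
import base64
import hashlib
import secrets
from typing import Any, Dict, Mapping, Optional


def compute_s256_challenge(code_verifier: str) -> str:
    """Derive the PKCE S256 code_challenge from a code_verifier (RFC 7636)."""
    digest = hashlib.sha256(code_verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def verify_pkce_s256(code_verifier: str, stored_challenge: str) -> bool:
    """Compare the challenge stored at authorization time with the verifier
    presented at token exchange."""
    try:
        computed = compute_s256_challenge(code_verifier)
    except UnicodeEncodeError:
        return False  # verifiers must be ASCII
    return secrets.compare_digest(computed.encode(), stored_challenge.encode())


def client_secret_matches(presented: str, registered: str) -> bool:
    """Constant-time comparison, so timing does not leak how many characters match."""
    return secrets.compare_digest(presented.encode("utf-8"), registered.encode("utf-8"))


def extract_claims(payload: Mapping[str, Any]) -> Optional[Dict[str, Any]]:
    """None-safe claim extraction: return None rather than passing bad values on.

    Assumption: scopes is a list of strings. The real claim layout may differ.
    """
    sub = payload.get("sub")
    email = payload.get("email")
    scopes = payload.get("scopes")
    if not isinstance(sub, str) or not sub:
        return None
    if not isinstance(email, str) or not email:
        return None
    if not isinstance(scopes, list) or not all(isinstance(s, str) for s in scopes):
        return None
    return {"sub": sub, "email": email, "scopes": scopes}


if __name__ == "__main__":
    # Test vector from RFC 7636, Appendix B.
    verifier = "dBjftJeZ4CVP-mB92K27uhbUJU1p1r_wW1gFWFOEjXk"
    print(compute_s256_challenge(verifier))
    print(extract_claims({"sub": None, "email": "a@example.com", "scopes": []}))
```

Returning None from extract_claims is one possible design. Raising a dedicated exception would work equally well, as long as a missing sub or email can no longer reach downstream code as None.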

Verification and Testing (SOK-81)

A test suite was built to verify these changes, reaching 60% overall code coverage and 96% coverage of the OAuth-specific logic. The suite contains 227 passing tests covering:

  • OAuth E2E: Well-known discovery, DCR registration, PKCE success/failure, and refresh token flows.
  • Domain Models: Validation of MemoryType enums, Neo4j label conversions, and discriminated union dispatch.
  • Infrastructure: CRUD operations for the Redis store and Cypher query generation for the Neo4j backend.