# Connection Pool Safety & Data Preservation

**Context / Constraints**

* The feature is behind a feature flag. Backward compatibility is not required.
* The current DB connection manager lets requests fail silently when pool limits are reached → risk of permanent data loss.
* Goal: prevent any silent loss of batch completion events and ensure imports can recover if connection capacity is temporarily exhausted.

---

## Priority A — Emergency Patch (DO THIS FIRST, today)

Goal: Prevent silent data loss for `BatchJobCompleted` and other critical listeners immediately.

1. **Implement enforced connection acquisition with wait & retry**

   * Modify `DatabaseConnectionManager::getConnection()` to:

     * Read configured limits: a per-tenant cap and a global cap (use constants/config). If none are configured, default to conservative small values (e.g., 20).
     * When current used connections ≥ cap, **block and retry** with exponential backoff:

       * Wait pattern: 0.5s → 1s → 2s → 4s → 5s (capped) up to a `maxAttempts` (e.g., 8–10).
       * Between attempts call a `cleanupIdleConnections()` helper.
     * If after `maxAttempts` still no slot, **throw an explicit exception** (so the caller sees a consistent failure).
   * Add immediate logging when waiting occurs: `tenant_id`, `job_id`, `current_count`, `max`, `attempt`.
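
A minimal sketch of the wait-and-retry acquisition (Python for illustration only; the real change belongs in the PHP `DatabaseConnectionManager::getConnection()`, and `FakePool` with its `try_acquire`/`cleanup_idle` methods is a hypothetical stand-in for the real pool):

```python
import time

class ConnectionExhaustedError(Exception):
    """Raised when no slot frees up within the bounded retry window."""

def acquire_connection(pool, max_attempts=8, waits=(0.5, 1, 2, 4, 5), sleep=time.sleep):
    """Block and retry with capped backoff; raise explicitly when exhausted."""
    for attempt in range(1, max_attempts + 1):
        if pool.try_acquire():
            return attempt  # attempt count is useful for the wait logs
        # Log here: tenant_id, job_id, current_count, max, attempt.
        pool.cleanup_idle()
        delay = waits[min(attempt - 1, len(waits) - 1)]  # 0.5s, 1s, 2s, 4s, then capped at 5s
        sleep(delay)
    raise ConnectionExhaustedError(f"no connection slot after {max_attempts} attempts")

class FakePool:
    """In-memory stand-in for the real pool, used only to demo the loop."""
    def __init__(self, cap, used):
        self.cap, self.used = cap, used
    def try_acquire(self):
        if self.used < self.cap:
            self.used += 1
            return True
        return False
    def cleanup_idle(self):
        self.used = max(0, self.used - 1)  # pretend one idle connection was reclaimed
```

The injectable `sleep` keeps the loop unit-testable without real waiting, which matters for the quick tests in step 5.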

2. **Add resilient retries inside critical listeners (BatchJobCompletedListener)**

   * Wrap the listener work (setupTenantConnection, trackBatchCompletion, updateWaveCoordination) in a **bounded retry loop** (maxRetries = 3).
   * Use exponential backoff between retries (e.g., 2s, 4s, 8s).
   * On final failure, **do not let the event disappear**:

     * Push an emergency record into Redis (or a durable fallback queue). Include: `job_id`, `record_type_id`, `batch_number`, error, timestamp.
     * Log critical error with the fallback id.
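
The bounded retry loop with a fallback push can be sketched as follows (Python for illustration only; the real change belongs in the PHP `BatchJobCompletedListener`, and `persist`/`push_fallback` are hypothetical callables standing in for the listener work and the Redis fallback):

```python
import time

def handle_batch_completed(event, persist, push_fallback,
                           max_retries=3, base_delay=2.0, sleep=time.sleep):
    """Bounded retry around the listener work; never let the event vanish."""
    for attempt in range(1, max_retries + 1):
        try:
            persist(event)  # setupTenantConnection + trackBatchCompletion + updateWaveCoordination
            return "persisted"
        except Exception as err:
            if attempt == max_retries:
                fallback_id = push_fallback({**event, "error": str(err),
                                             "timestamp": time.time()})
                # Log CRITICAL with the fallback id so recovery can find it.
                return f"fallback:{fallback_id}"
            sleep(base_delay * 2 ** (attempt - 1))  # 2s, 4s, ... between retries
```

Either branch returns a definite outcome, so the caller can distinguish "persisted" from "parked in fallback" and never sees a silent drop.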

3. **Emergency fallback storage**

   * Implement an emergency Redis list or set to store failed batch completions (`emergency_batch_completions`).
   * Structure must be compact and idempotent (avoid duplicates).
   * On push, log with unique key so recovery can find it.
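
Idempotency can come from a deterministic key; a sketch (Python for illustration; a plain dict stands in for the Redis hash, where real code would rely on `HSETNX` so only the first write wins):

```python
import hashlib
import json

def fallback_key(entry):
    """Deterministic key: pushing the same completion twice dedupes."""
    raw = f"{entry['job_id']}:{entry['record_type_id']}:{entry['batch_number']}"
    return hashlib.sha1(raw.encode()).hexdigest()

def push_emergency(store, entry):
    """`store` stands in for the Redis hash `emergency_batch_completions`."""
    key = fallback_key(entry)
    if key not in store:  # HSETNX semantics: only the first write wins
        store[key] = json.dumps(entry)
        # Log: emergency.fallback.pushed key=<key> so recovery can find it.
    return key
```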

4. **Short-circuit code paths that mark batches/flows complete**

   * Ensure batch/flow completion is only marked after the listener successfully persists completion or pushes to fallback.
   * Add a guard so the import **cannot** mark overall complete while there exist pending emergency fallback items.
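
The guard itself can be a one-line predicate; a sketch (names are illustrative):

```python
def can_mark_import_complete(completed_batches, total_batches, pending_emergency_items):
    """Overall completion is only legal when every batch is accounted for
    and no emergency fallback entries remain unprocessed."""
    return completed_batches == total_batches and pending_emergency_items == 0
```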

5. **Quick tests (manual + automated)**

   * Unit test `getConnection` when pool is full (mock connection counts) → must wait, then succeed or throw after attempts.
   * Integration test for `BatchJobCompletedListener` to simulate connection exhaustion → confirm fallback entry created.

**Estimated time for Emergency Patch:** 2.5–4 hours

---

## Priority B — Stabilize & Harden (This week)

Goal: Make system robust under concurrency, avoid runaway worker exhaustion, and enable recovery.

1. **Circuit Breaker on connection pool**

   * Implement a simple circuit breaker in the connection manager:

     * Track consecutive failures; after N (e.g., 3) failures mark `circuit open`.
     * When open, immediately return a controlled failure to callers (and log). Allow recovery attempts after a cool-down window.
     * On switch to open/close, emit metrics/log events.
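
A minimal breaker along these lines (Python sketch; the threshold and cool-down mirror the suggestions above, and the injectable clock keeps it testable):

```python
import time

class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures; after `cooldown`
    seconds it lets a probe through (half-open); a success closes it again."""
    def __init__(self, failure_threshold=3, cooldown=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown:
            return True  # half-open: allow one recovery attempt
        return False  # open: return a controlled failure, no DB call

    def record_success(self):
        self.failures = 0
        self.opened_at = None  # close; emit circuitbreaker.close here

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # open; emit circuitbreaker.open here
```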

2. **Connection limits & throttling**

   * Determine `max_connections` for Cloud SQL instance(s).
   * Define:

     * `per_tenant_max` (e.g., `max_connections * 0.2 / active_tenants`) — conservative.
     * `global_import_max` (e.g., `max_connections * 0.6`), leaving headroom.
   * Enforce caps at the connection manager and at worker configuration (limits for concurrent import workers).
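
The cap formulas above can be computed once at boot; a sketch (the 0.2/0.6 ratios are the suggested starting points from this plan, not measured values):

```python
def connection_caps(max_connections, active_tenants):
    """Split the Cloud SQL connection budget conservatively."""
    per_tenant_max = max(1, int(max_connections * 0.2 / max(1, active_tenants)))
    global_import_max = int(max_connections * 0.6)  # leaves 40% headroom
    return per_tenant_max, global_import_max
```

The `max(1, ...)` floors guard against a zero cap when tenant counts are high, which would otherwise deadlock every import.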

3. **Ensure API success is durable**

   * For any API call that already returned success but still requires DB writes, buffer important data until writes succeed:

     * In critical listeners, ensure we persist completion to DB **or** fallback queue; only then consider the job “done.”
   * Add explicit instrumentation: record the lifecycle state so human operators can see “API success — waiting DB persistence.”

4. **Emergency queue processor**

   * Build a small consumer that periodically drains `emergency_batch_completions` and retries persisting the entries once connections free up.
   * Provide a manual, admin-triggerable runner to process emergency items.
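
A sketch of the drain loop (Python for illustration; a plain dict stands in for the Redis structure, and `persist` is a hypothetical callable):

```python
import json

def drain_emergency_queue(store, persist, max_items=100):
    """Retry persisting fallback entries; delete only after success, so a crash
    mid-run never loses data and replays stay idempotent."""
    drained = failed = 0
    for key in list(store.keys())[:max_items]:
        try:
            persist(json.loads(store[key]))
            del store[key]
            drained += 1
        except Exception:
            failed += 1  # leave in place; the next run (or an admin trigger) retries
    return drained, failed
```

Returning the (drained, failed) counts gives both the periodic consumer and the manual admin runner something concrete to log and alert on.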

5. **Worker configuration changes**

   * Update worker scaling/throttling to respect DB capacity:

     * Limit number of concurrent import workers per instance.
     * Use job rate limits or a token semaphore so that not all workers hit the database with heavy work at the same time.
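
The token-semaphore idea can be sketched as follows (Python for illustration; in practice the limit would be shared by the import workers on each instance):

```python
import threading

class DbWorkPermit:
    """Token semaphore: at most `limit` workers run heavy DB work at once."""
    def __init__(self, limit):
        self._sem = threading.BoundedSemaphore(limit)

    def run(self, fn, timeout=5.0):
        if not self._sem.acquire(timeout=timeout):
            raise TimeoutError("no DB work permit available")
        try:
            return fn()
        finally:
            self._sem.release()
```

The bounded acquire turns "all workers pile onto the DB" into an explicit, loggable wait-or-fail, matching the bounded-retry rule in the safeguards below.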

6. **Logging & Monitoring**

   * Emit metrics and logs for: connection wait events, fallback pushes, circuit breaker open/close, emergency queue length.
   * Add alerts for: emergency queue > threshold, connection wait rate high, circuit breaker open.

**Estimated effort:** 1–2 days (design + implementation + tests)

---

## Priority C — Recovery & Validation (After stabilization)

1. **Recovery utility**

   * Add a script/endpoint to replay emergency entries (idempotent), with a dry-run option.
2. **End-to-end integration tests**

   * Simulate multi-tenant concurrent imports to validate no silent loss and verify emergency queue processing.
3. **Load testing**

   * Run a controlled load test to measure connection usage and tune per-tenant/global caps.
4. **Documentation**

   * Update `ai_rules.md` / the ops runbook describing emergency fallback behavior and recovery steps.

**Estimated effort:** 1–2 days

---

## Operational Requirements & Safeguards

* **Do not** let `failed()` handlers mark the job as fully completed without creating fallback entries for unprocessed batch completions.
* All fallback entries **must** be durable (Redis persistence or similar) and idempotent.
* Add an admin dashboard panel showing: current emergency queue size, oldest entry age, circuit breaker status.
* Ensure any retry/backoff is bounded — avoid indefinite spinning.

---

## Suggested Code Locations & Quick Implementation Notes (for the agent)

* `src/App/Services/DatabaseConnectionManager.php` — implement waiting, cleanup, circuit breaker.
* `app/Listeners/BatchJobCompletedListener.php` — add bounded retry + fallback push.
* `src/App/Services/EmergencyFallbackService.php` — encapsulate Redis push/pop and idempotency.
* Add tests under `tests/Unit/ConnectionManagerTest.php` and `tests/Feature/BatchCompletedListenerTest.php`.
* Add metrics/log tags: `connection.wait`, `connection.exhausted`, `emergency.fallback.pushed`, `circuitbreaker.open`.

---

## Quick Acceptance Criteria (for agent / dev)

1. When DB pool is saturated, `getConnection()` waits and retries; logs appear showing wait attempts.
2. If listener cannot persist after bounded retries, a fallback record is written to Redis and logged.
3. Emergency queue entries can be replayed idempotently and are processed successfully once connections become available.
4. No data (batch completion) is lost silently — either persisted or stored in fallback.
5. Tests for these behaviors pass.

---

## Timeline Summary

* **Emergency Patch:** 2.5–4 hours (required today)
* **Harden & Throttle:** 1–2 days (this week)
* **Recovery & Validation:** 1–2 days

---

## Final Notes (must-follow for the agent)

* **Do not** skip any of the Priority A steps. They are required to stop data loss immediately.
* Implement changes behind the feature flag. Test thoroughly before enabling for all tenants.
* Provide a short report after Priority A with logs that show waiting behavior and at least one successfully stored fallback entry (simulated).

