# Wave Coordination – Duplicate Jobs Analysis

Date: 2025-09-16
Job ID observed: `import_68c9c70e90d772.06599674`

## Problem
Duplicate wave coordination attempts are occurring for the same job, leading to a unique-constraint violation on `wave_coordination` (key `unique_job_wave`). Result: the job is marked failed and no transaction lines are persisted, despite a successful COUNT.

## Evidence (from logs)

- Successful COUNT for transaction lines (4,812,213):
  - 20:24:18 and 20:25:45 – both runs logged the same total count and batch calculations (9,625 batches @ 500).
- First coordination run creates wave rows:
  - 20:24:20 – "Wave coordination record created" for wave_number 1, 2, 3
- Second coordination run tries to create again and fails:
  - 20:25:47 – ERROR: `Duplicate entry 'import_...-1' for key 'unique_job_wave'`
- Orphaned-looking batch logs:
  - 20:33:26–20:33:32 – "Record type loaded successfully" for `transactionline` batches 36–40 with generated queries, but no completion inserts recorded.

Conclusion from logs: the same master job executed coordination twice (re-entrant); the second attempt collided with wave rows already created by the first.

## Root Cause Hypotheses

1) Coordinator re-entry due to overlapping triggers
- Hypothesis: Two separate triggers (e.g., initial path + fallback timer) both invoked wave coordination for the same job_id.
- Support: We see two distinct "Dependency graph completed"/COUNT sequences followed by coordination; second attempt hits unique constraint.

2) Master job duplicate execution (dispatch/retry)
- Hypothesis: The master import job (or a wrapper) was dispatched/retried twice with the same job_id, causing two coordination attempts.
- Support: Timestamps show a second coordination cycle ~87 seconds after the first (aligned with the COUNT duration). A retry policy might have fired while the first was still running.

3) Race condition in coordination start
- Hypothesis: Two workers executed the coordination start path concurrently (e.g., Horizon worker restart or multi-dispatch), racing to create wave rows.
- Support: Unique key violation on first wave suggests a second creator; no guard/lock around coordination observed in logs.

4) Fallback logic triggered incorrectly
- Hypothesis: A watchdog/fallback meant to recover stuck jobs fired while COUNT was legitimately long-running, and re-invoked coordination.
- Support: The second cycle begins shortly after the COUNT completed; a timer-based fallback could have fired during the 87s COUNT.

5) OAuth/PHP timeout changes indirectly exposed timing window
- Hypothesis: Increasing cURL/socket timeouts extended the window so fallback considered the job stalled and re-ran coordination.
- Support: Changes themselves don’t create duplicates, but longer runtimes can make fallback thresholds more likely to overlap.

Unlikely/ruled-out:
- NetSuite response handling/memory – the duplicates were observed before any line inserts, and the unique-key violation occurs at the database coordination level, not during response processing.

## Recommended Solutions (simple-first)

S1) Idempotent wave creation (DB no-op on duplicate)
- Approach: When creating `wave_coordination` rows, use an idempotent pattern:
  - Prefer `INSERT IGNORE` / `INSERT ... ON DUPLICATE KEY UPDATE` (or `firstOrCreate()` semantics) keyed by `(job_id, wave_number)`.
  - Treat duplicate insert attempts as success (no exception), returning the existing row.
- Pros: Very simple; immediately stops hard failures; safe across restarts.
- Cons: Masks the fact a duplicate trigger occurred (diagnostics still recommended).
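A minimal sketch of the S1 pattern. This uses SQLite's `INSERT OR IGNORE` as a stand-in for MySQL's `INSERT IGNORE`; the table shape and the `create_wave` helper are illustrative assumptions, not the production schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE wave_coordination (
        id INTEGER PRIMARY KEY,
        job_id TEXT NOT NULL,
        wave_number INTEGER NOT NULL,
        UNIQUE (job_id, wave_number)  -- mirrors the `unique_job_wave` key
    )
""")

def create_wave(conn, job_id, wave_number):
    """Create the wave row, or silently return the existing one."""
    conn.execute(
        "INSERT OR IGNORE INTO wave_coordination (job_id, wave_number) VALUES (?, ?)",
        (job_id, wave_number),
    )
    return conn.execute(
        "SELECT id, job_id, wave_number FROM wave_coordination "
        "WHERE job_id = ? AND wave_number = ?",
        (job_id, wave_number),
    ).fetchone()

# First coordination run creates the row...
first = create_wave(conn, "import_68c9c70e90d772.06599674", 1)
# ...a re-entrant second run is a no-op and gets the same row back.
second = create_wave(conn, "import_68c9c70e90d772.06599674", 1)
assert first == second  # no exception, no duplicate
```

The key point is that the duplicate path raises no exception, so a re-entrant coordinator cannot fail the whole job.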

S2) Single-flight lock per job_id (Redis SETNX)
- Approach: Acquire `waves:coordination:{job_id}` lock (short TTL, e.g., 5 minutes) before coordination; release after rows are created.
- Pros: Prevents concurrent re-entry; minimal code; widely used pattern.
- Cons: Requires careful TTL & release handling; if process dies, TTL must be short to avoid deadlocks.
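A sketch of the S2 single-flight lock. `FakeRedis` is an in-memory stand-in for Redis `SET key value NX EX ttl` (in production you would use a real client, or Laravel's `Cache::lock`); the `coordinate_waves` wrapper and worker IDs are hypothetical:

```python
import time

class FakeRedis:
    """In-memory stand-in for Redis SET NX EX semantics."""
    def __init__(self):
        self._store = {}  # key -> (value, expires_at)

    def set_nx_ex(self, key, value, ttl_seconds):
        entry = self._store.get(key)
        if entry is not None and entry[1] > time.monotonic():
            return False  # lock already held and not expired
        self._store[key] = (value, time.monotonic() + ttl_seconds)
        return True

    def delete_if_owner(self, key, value):
        # Only the holder may release, so we never delete a newer lock.
        entry = self._store.get(key)
        if entry is not None and entry[0] == value:
            del self._store[key]

def coordinate_waves(redis, job_id, worker_id, do_coordination):
    lock_key = f"waves:coordination:{job_id}"
    if not redis.set_nx_ex(lock_key, worker_id, ttl_seconds=300):  # 5-minute TTL
        return "skipped"  # another worker is already coordinating this job
    try:
        do_coordination(job_id)
        return "coordinated"
    finally:
        redis.delete_if_owner(lock_key, worker_id)

fake = FakeRedis()
calls = []
# Worker A is mid-coordination and holds the lock...
assert fake.set_nx_ex("waves:coordination:job_x", "worker-A", 300)
# ...so worker B's overlapping attempt is rejected.
status = coordinate_waves(fake, "job_x", "worker-B", calls.append)
assert status == "skipped" and calls == []
```

Note the lock only prevents *concurrent* re-entry; a later re-trigger after release still reaches coordination, which is why S1's idempotent writes are recommended alongside it.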

S3) Job state guard (idempotency flag)
- Approach: Persist a boolean/state field like `coordination_status = started|completed` and check before starting coordination; skip if already `started/completed`.
- Pros: Straightforward; readable operational state.
- Cons: Requires consistent updates; still needs DB idempotency for safety.
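The S3 guard is safest as a single atomic check-and-set rather than a read-then-write (which would itself race). A sketch using an `UPDATE ... WHERE coordination_status = 'pending'` and the affected-row count; the `jobs` table and job ID here are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs ("
    "  job_id TEXT PRIMARY KEY,"
    "  coordination_status TEXT NOT NULL DEFAULT 'pending')"
)
conn.execute("INSERT INTO jobs (job_id) VALUES ('import_abc')")

def try_start_coordination(conn, job_id):
    # Atomic check-and-set: only one caller can flip pending -> started.
    cur = conn.execute(
        "UPDATE jobs SET coordination_status = 'started' "
        "WHERE job_id = ? AND coordination_status = 'pending'",
        (job_id,),
    )
    return cur.rowcount == 1  # 1 row changed = we won; 0 = someone beat us

assert try_start_coordination(conn, "import_abc") is True   # first trigger proceeds
assert try_start_coordination(conn, "import_abc") is False  # re-entry is skipped
```

The same statement shape works in MySQL; the affected-row count decides the winner without a separate SELECT.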

S4) Unify/disable overlapping triggers
- Approach: Ensure coordination is only initiated from one authoritative path (e.g., remove timer-based fallback, rely on event-driven only) or raise fallback thresholds beyond worst-case COUNT duration.
- Pros: Eliminates the root of duplicate invocations.
- Cons: If events are missed, progress could stall without fallback.

S5) Idempotent batch generation
- Approach: Ensure `wave_batches` has unique keys (e.g., `(job_id, batch_id)`), use upsert to avoid creating duplicate batch rows if a second pass occurs.
- Pros: Extra safety layer; avoids duplicate downstream work.
- Cons: Doesn’t prevent duplicate coordination attempts by itself.
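A sketch of the S5 upsert for batch rows, again with SQLite's `ON CONFLICT ... DO NOTHING` standing in for a MySQL upsert; the `wave_batches` columns shown are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE wave_batches (
        job_id TEXT NOT NULL,
        batch_id INTEGER NOT NULL,
        wave_number INTEGER NOT NULL,
        PRIMARY KEY (job_id, batch_id)
    )
""")

def upsert_batches(conn, job_id, batches):
    # ON CONFLICT DO NOTHING: a second generation pass cannot duplicate rows.
    conn.executemany(
        "INSERT INTO wave_batches (job_id, batch_id, wave_number) VALUES (?, ?, ?) "
        "ON CONFLICT(job_id, batch_id) DO NOTHING",
        [(job_id, batch_id, wave) for batch_id, wave in batches],
    )

batches = [(1, 1), (2, 1), (3, 2)]
upsert_batches(conn, "import_abc", batches)
upsert_batches(conn, "import_abc", batches)  # re-run is a no-op
count = conn.execute("SELECT COUNT(*) FROM wave_batches").fetchone()[0]
assert count == 3
```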

## Assessment of OAuth/PHP Setting Changes

- Direct contribution: Unlikely. The OAuth/cURL/socket timeout changes do not create duplicate dispatches by themselves.
- Indirect contribution: the longer COUNT wall-clock time (now ~87s) likely widened the window during which a pre-existing fallback watchdog considered the job stalled, triggering a second coordination attempt.

## Final Recommendation (prioritize simplicity)

Implement S1 + S2 together:
1. Add an idempotent DB write pattern for wave creation – treat duplicate key as success and fetch existing row.
2. Wrap coordination in a short-lived Redis lock per `job_id` to prevent concurrent starts.

Optional (nice-to-have):
3. Log the coordination "source" (initial, fallback, retry) when entering the flow to help future diagnostics.
4. Review/raise fallback thresholds to exceed worst-case COUNT duration (e.g., > 120s) to avoid unnecessary re-triggers.

Rationale: S1 defangs the failure (no hard error on duplicate), S2 prevents concurrent re-entry; both are small, reliable, and avoid over-engineering. OAuth/PHP timeout changes should remain as-is; they improved stability and simply exposed timing issues in fallback/coordination.


