## Import Hardening: Wave Coordination + Parent-First Lines

### Purpose

Establish a clear, durable execution model that guarantees at most one active wave per job at any time, coordinated exclusively by the Wave Coordinator, and that integrates the parent-first transaction line strategy without queue saturation, race conditions, or monitoring inconsistencies.

### Non-Goals

- Do not introduce feature flags for this behavior.
- Do not modify legacy ImportNetSuiteRecords.php; it must remain fully functional.

## Principles

- **Coordinator-only dispatch**: Only the Wave Coordinator dispatches jobs. Listeners/services may create waves/batches but must never queue jobs.
- **One active wave**: At most one wave per job is in processing at any time.
- **Transactional creation**: Wave and batch creation commits before any dispatch attempt.
- **Durable idempotency**: Duplicate creation/dispatch paths are safe no-ops.
- **Accurate monitoring**: UI totals reflect DB truth immediately after creation.
- **Parent-first correctness**: Lines derive only after parent completion and use authoritative parent IDs from DB.

## Coordinator Execution Model

### Wave FSM (per job_id, wave_number)

- pending → dispatching → processing → completed
- State changes must be atomic; never skip states.
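
The forward transitions listed above can be expressed as a small lookup table. This is a minimal sketch, not the coordinator's actual code: the constant `WAVE_TRANSITIONS` and the function `isValidWaveTransition` are hypothetical names, and recovery demotions (covered under the hardening updates later in this document) are deliberately left out.

```php
<?php
declare(strict_types=1);

// Hypothetical table: each state maps to its single legal successor.
const WAVE_TRANSITIONS = [
    'pending'     => 'dispatching',
    'dispatching' => 'processing',
    'processing'  => 'completed',
];

/**
 * Returns true only when $to is the immediate next state after $from,
 * which rules out skipped states and backward moves.
 */
function isValidWaveTransition(string $from, string $to): bool
{
    return (WAVE_TRANSITIONS[$from] ?? null) === $to;
}
```

A caller would check `isValidWaveTransition($current, $next)` before issuing the status update, and treat a `false` result as a programming error rather than a retryable condition.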

### Dispatch Guard (single active wave)

- Before dispatching any wave, the coordinator checks whether another wave for the same job has status=processing. If so, it logs the conflict and aborts the dispatch.
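
As an illustration of the guard, the sketch below works on an in-memory list of wave rows for one job. The function name `canDispatchWave` and the row shape (`wave_number`, `status`) are assumptions; in the real coordinator the same check would be a query against `wave_coordination`.

```php
<?php
declare(strict_types=1);

/**
 * Single-active-wave guard over the wave rows of one job.
 *
 * @param array<int, array{wave_number:int, status:string}> $waves
 */
function canDispatchWave(array $waves, int $waveNumber): bool
{
    foreach ($waves as $wave) {
        $isOtherWave = $wave['wave_number'] !== $waveNumber;
        if ($isOtherWave && $wave['status'] === 'processing') {
            // Log and abort: another wave is still active for this job.
            error_log(sprintf(
                'dispatch aborted: wave %d blocked by processing wave %d',
                $waveNumber,
                $wave['wave_number']
            ));
            return false;
        }
    }
    return true;
}
```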

### Pending Recovery

- When dispatching, select batches with status='pending'. If none are found, compute total_count. If total_count>0 and pending_count==0, reset all batches to pending, retry the selection once, and log the recovered count.
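
The recovery rule can be sketched over an in-memory batch list. This takes "set all to pending" literally; a production version might exclude batches that already completed. The function name `selectPendingWithRecovery` and the row shape are assumptions for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Selects pending batches; if none are pending but batches exist,
 * resets them to pending once and returns them with a recovered count.
 *
 * @param array<int, array{id:string, status:string}> $batches modified in place
 * @return array{batches: array, recovered: int}
 */
function selectPendingWithRecovery(array &$batches): array
{
    $pending = array_filter(
        $batches,
        fn (array $b): bool => $b['status'] === 'pending'
    );

    // Normal path: pending rows exist, or there is nothing to recover.
    if ($pending !== [] || $batches === []) {
        return ['batches' => array_values($pending), 'recovered' => 0];
    }

    // total_count > 0 and pending_count == 0: reset once, then reselect.
    foreach ($batches as &$batch) {
        $batch['status'] = 'pending';
    }
    unset($batch);

    return ['batches' => array_values($batches), 'recovered' => count($batches)];
}
```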

## Wave Creation & Awareness

### Transactional Creation

- All wave and batch insertions occur inside a DB transaction.
- Only after commit:
  - Recompute totals from wave_coordination to refresh monitoring cache/UI.
  - Trigger a coordinator progression check (not dispatch) to allow the coordinator to decide the next wave.

### Coordinator Progression Triggers

- Coordinator periodic tick (existing) continues.
- Additionally call coordinator progression immediately after:
  - Dependency waves complete (to start first main wave).
  - Main wave completion (to start next main wave).
  - Derived TransactionLine waves are created (to start first derived wave).

## Parent-First Integration (TransactionLine)

### Exclusion From Main Waves

- `TransactionLine` (record_type_id = -13) is never part of main waves. It is derived after parent completion.

### Derived Wave Creation (after parents complete)

- Gating condition: parent completion ≥ config('waves.completion_threshold') (100%).
- Idempotent guard: skip if any TransactionLine `wave_batches` rows already exist for job_id or a durable DB marker indicates initialization.
- Parent IDs source of truth: use SalesOrders table (refid) filtered by the import window for complete coverage; optionally reconcile with cached IDs.
- Chunk parent IDs by `config('waves.line_fetch.in_batch_size')` (default 500) to form batches; generate a stable hash-based `batch_id` for idempotency.
- Persist IN() ids:
  - Prefer storing chunk ids in a persistent column (e.g., `chunk_ids_json` on `wave_batches`) to avoid cache TTL dependency; cache remains a fast path.
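
The chunking and hash-based `batch_id` steps above can be sketched as follows. This assumes parent IDs are integers; the function name `buildLineBatches`, the `tl_` prefix, and the 16-character truncation are illustrative choices, not values taken from the implementation.

```php
<?php
declare(strict_types=1);

/**
 * Splits parent IDs into IN() chunks and derives a stable batch_id per chunk.
 * Sorting and de-duplicating first makes the result independent of input order,
 * so a retry with the same parents yields the same batch_ids.
 *
 * @param array<int, int|string> $parentIds
 * @return array<int, array{batch_id:string, chunk_ids_json:string}>
 */
function buildLineBatches(array $parentIds, int $chunkSize = 500, int $hashLength = 16): array
{
    $ids = array_values(array_unique(array_map('intval', $parentIds)));
    sort($ids, SORT_NUMERIC);

    $batches = [];
    foreach (array_chunk($ids, max(1, $chunkSize)) as $chunk) {
        $batches[] = [
            // sha1 of the sorted ids, truncated (prefix is hypothetical)
            'batch_id'       => 'tl_' . substr(sha1(implode(',', $chunk)), 0, $hashLength),
            'chunk_ids_json' => json_encode($chunk),
        ];
    }
    return $batches;
}
```

Because the `batch_id` depends only on chunk contents, a unique index on `(job_id, batch_id)` turns a repeated creation attempt into a rejected duplicate rather than a second set of batches.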

### Dispatch of Derived Waves

- Listener/service creates derived waves/batches, updates monitoring, then triggers coordinator progression check. Coordinator decides when to dispatch the first derived wave.
- Optional hard guard: coordinator skips TransactionLine dispatch unless a durable “derived ready” marker is set for the job.

## Monitoring & UI

- Monitoring totals (total_waves, main/derived waves) derive from DB: `wave_coordination` and `wave_batches`.
- After any creation of waves/batches, refresh monitoring cache with DB totals; never rely on stale snapshots.
- Coordinator logs compact summaries:
  - Creation: wave_number, total_batches
  - Dispatch: batches_dispatched
  - Completion: completed_batches/total_batches
  - Recovery: recovered pending batches count
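
To illustrate deriving totals from DB rows rather than cached snapshots, the sketch below aggregates a list of `wave_coordination` rows. The function name and the returned keys other than `total_waves` are assumptions; derived waves are identified by `dependency_level = -2`, as described in the hardening updates later in this document.

```php
<?php
declare(strict_types=1);

/**
 * Computes monitoring totals from wave rows fetched from the DB.
 *
 * @param array<int, array{wave_number:int, dependency_level:int, status:string}> $waveRows
 */
function computeMonitoringTotals(array $waveRows): array
{
    $totals = [
        'total_waves'     => 0,
        'derived_waves'   => 0,
        'completed_waves' => 0,
        'pending_waves'   => 0,
    ];

    foreach ($waveRows as $row) {
        $totals['total_waves']++;
        if ($row['dependency_level'] === -2) {
            $totals['derived_waves']++;
        }
        if ($row['status'] === 'completed') {
            $totals['completed_waves']++;
        }
        if ($row['status'] === 'pending') {
            $totals['pending_waves']++;
        }
    }

    // A job counts as complete only when every known wave has completed.
    $totals['job_complete'] = $totals['total_waves'] > 0
        && $totals['completed_waves'] === $totals['total_waves'];

    return $totals;
}
```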

## Idempotency & Durability

- `wave_batches` unique index on (job_id, batch_id). Hash-based `batch_id` for derived line chunks (`sha1` of sorted ids, truncated) ensures stable, safe retries.
- Durable initialization guard for derived waves (e.g., presence of -13 `wave_batches` rows) supersedes any cache flag.

## Recovery & Reconcile Tools

- Coordinator reconcile command (new):
  - Move waves stuck in dispatching to processing if dispatched_at is stale and batches exist.
  - Requeue batches stuck in pending/processing past TTL (respecting retry limits).
  - Refresh monitoring and trigger progression check.
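
The first reconcile step, finding waves stuck in `dispatching`, might look like the sketch below. It assumes `dispatched_at` is available as a Unix timestamp and that each wave row carries a `batch_count`; both field shapes and the function name are hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Returns wave numbers that are stuck in `dispatching`: they have batches
 * and their dispatched_at is older than the TTL.
 *
 * @param array<int, array{wave_number:int, status:string, dispatched_at:int, batch_count:int}> $waves
 * @return int[]
 */
function findStaleDispatchingWaves(array $waves, int $now, int $ttlSeconds): array
{
    $stale = [];
    foreach ($waves as $wave) {
        $isStale = ($now - $wave['dispatched_at']) > $ttlSeconds;
        if ($wave['status'] === 'dispatching' && $wave['batch_count'] > 0 && $isStale) {
            $stale[] = $wave['wave_number'];
        }
    }
    return $stale;
}
```

The reconcile command would then move each returned wave to `processing`, refresh monitoring, and trigger a progression check.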

## Configuration

- `config/waves.php` is the single source for:
  - `wave_size` (e.g., 50)
  - `completion_threshold` (100)
  - `line_fetch.in_batch_size` (500) and page sizes
  - timeouts, inter-wave delays, retry attempts
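
A possible shape for this file is sketched below. Only `wave_size`, `completion_threshold`, and `line_fetch.in_batch_size` (with the values shown) come from this document; every other key name and value is a placeholder chosen for illustration.

```php
<?php

// config/waves.php (sketch)
return [
    'wave_size'            => 50,
    'completion_threshold' => 100, // percent of parents complete before lines derive

    'line_fetch' => [
        'in_batch_size' => 500,  // parent IDs per TransactionLine IN() chunk
        'page_size'     => 1000, // placeholder key and value
    ],

    // Placeholder names and values for the remaining settings listed above.
    'wave_timeout_seconds'     => 900,
    'inter_wave_delay_seconds' => 5,
    'retry_attempts'           => 3,
];
```

Code would read these through `config('waves.completion_threshold')` and `config('waves.line_fetch.in_batch_size')`, matching the keys referenced elsewhere in this document.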

## Acceptance Criteria

- **Single Wave Active**: At any time, for a given job, at most one wave has status=processing; attempts to dispatch another wave log and abort.
- **Coordinator-only Dispatch**: No job is queued outside `WaveCoordinator->dispatchWave`.
- **Transactionally Created Waves**: Parent (main) waves and derived (lines) waves are created within a committed transaction; monitoring shows new totals immediately.
- **Parent-First Correctness**: Derived TransactionLine waves are created only after parent completion (threshold met) and use DB-based parent IDs. Derived waves are never included in main wave planning.
- **Monitoring Accuracy**: UI shows updated total waves and batch counts immediately after wave creation; no premature “job complete” state while pending waves exist.
- **Pending Recovery**: When total_count>0 and no `pending` rows exist, coordinator recovers to `pending` and successfully dispatches.
- **Idempotency**: Re-running initialization or dispatch cannot create duplicate waves/batches or queue duplicate jobs.

## Test Assertions (Pest Examples)

- Coordinator dispatch guard:
  - Given one wave in `processing`, attempting to dispatch another wave logs and aborts; no jobs queued; second wave remains `pending`.
- Transactional wave creation:
  - After creating main waves in a transaction, `dispatchWave` finds `pending` batches and dispatches; wave status transitions to `processing`.
- Pending recovery:
  - With total_count>0 and no `pending` batches (simulating status drift), `dispatchWave` resets batches to `pending`, retries, and dispatches successfully.
- Parent-first gating:
  - Before parent completion (threshold not met), derived waves are not created.
  - After threshold, derived waves are created once (idempotent) with expected chunk count (SalesOrder refid based), and first derived wave is dispatched by the coordinator upon progression check.
- Coordinator-only dispatch:
  - Listener creates waves but never queues jobs; Bus/Queue fakes detect no dispatch calls from the listener.
- Monitoring totals:
  - After derived creation, monitoring cache reflects increased `total_waves`; UI-facing reader sees the updates.

## Rollout & Legacy Safeguards

- Legacy ImportNetSuiteRecords.php is untouched and remains functional.
- No feature flags. Behavior is uniform across tenants.
- Apply schema updates (if used for chunk persistence and unique index) with safe migrations:
  - `wave_batches`: add `chunk_ids_json` (LONGTEXT, nullable), add unique index `(job_id, batch_id)`.

## Implementation Notes

- Ensure all coordinator DB reads/writes use the same tenant_connection and consistent transactions where applicable.
- Prefer DB as the source of truth; cache only accelerates checks and should be tolerant of misses.
- Keep logs concise and structured to aid UI monitoring and incident triage.

## Schema Additions (Migrations)

### 1) wave_batches: persist chunk ids and enforce idempotency

```php
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;

// Add the chunk-id column only where it is missing, so tenants that already have it are skipped.
if (!Schema::connection('tenant_connection')->hasColumn('wave_batches', 'chunk_ids_json')) {
    Schema::connection('tenant_connection')->table('wave_batches', function (Blueprint $table) {
        $table->longText('chunk_ids_json')->nullable()->after('total_batches');
    });
}

// Unique index for idempotency (safeguard against duplicate batch rows).
// Adding an index that already exists fails on most drivers; add a
// driver-specific existence check if this migration may run more than once.
Schema::connection('tenant_connection')->table('wave_batches', function (Blueprint $table) {
    $table->unique(['job_id', 'batch_id'], 'wave_batches_job_batch_unique');
});
```

Fields summary:
- `chunk_ids_json`: JSON array of parent IDs for the TransactionLine IN() chunk; coordinator loads from DB (cache is optional).
- Unique `(job_id, batch_id)`: prevents duplicate batch rows across retries/races.

### 2) Optional derived-lines ready marker (if using DB marker)

Either rely on presence of `-13` rows in `wave_batches` for job_id, or add a minimal table to mark initialization:

```php
Schema::connection('tenant_connection')->create('derived_lines_init', function (Blueprint $table) {
    $table->id();
    $table->string('job_id')->index();
    $table->integer('parent_record_type_id')->index(); // e.g., -19 for SalesOrder
    $table->timestamp('initialized_at');
    $table->unique(['job_id', 'parent_record_type_id']);
});
```

Coordinator/Listener usage:
- On successful derived creation, insert the row. Coordinator may check this to validate readiness before dispatching TransactionLine waves.



## Post-Implementation Hardening Updates (2025-09-29)

### Summary of Changes After Initial Rollout

- Introduced robust, DB-first progression and dispatching to eliminate cache brittleness and stale state issues.
- Separated derived TransactionLine waves into their own dependency level to prevent contention with main waves.
- Made wave dispatch atomic and idempotent to stop parallel wave races and last-wave stalls.
- Ensured derived batch payload completeness and immediate event-driven progression after creation.

### Coordinator Improvements

- Single-active-wave enforcement (enhanced):
  - Before dispatching, if a wave is marked `processing` but has zero active batches (`dispatched` or `processing`), demote it to `pending` and continue. This recovers from stale states.
  - Mark a wave `processing` only if at least one batch is actually dispatched; otherwise leave it `pending` for retry.

- Atomic dispatch (CAS):
  - Dispatch now performs a compare-and-set update: `WHERE status='pending'` → `status='dispatching'`. If zero rows are affected, another coordinator already won; skip duplicate dispatch.

- Derived wave progression (new, DB-driven):
  - Derived TransactionLine waves are now discovered and dispatched by querying `wave_batches` joined to `wave_coordination` with `dependency_level = -2` for `status = pending`.
  - The “derived-ready” guard now falls back to DB presence of `-13` batches; if found, readiness is assumed and cached.
  - Dispatch payload injection for TransactionLine uses `wave_batches.chunk_ids_json` when present, otherwise falls back to the cache `line_chunk_ids_{batchId}`.

- Payload hydration safeguard (new):
  - Immediately before dispatch, if `integration_id` or `tenant_database` is missing in the batch metadata, the coordinator hydrates them from DB/config to prevent “payload corrupted” errors.

- Progression independent of cache (new):
  - `checkAndTriggerNextWave(jobId)` no longer returns early due to missing cache hints. It reconciles using DB truth and updates caches only on a best-effort basis when monitoring is active.
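
The compare-and-set dispatch described above can be modelled in memory to show why only one coordinator wins. This is a sketch under stated assumptions: `InMemoryWaveStore` and `tryClaimWave` are hypothetical names, and in the real coordinator the check and the write would be a single `UPDATE ... WHERE status='pending'` statement whose affected-row count plays the role of the return value here.

```php
<?php
declare(strict_types=1);

/** In-memory stand-in for the wave_coordination rows of one job. */
final class InMemoryWaveStore
{
    /** @var array<int, string> wave_number => status */
    private array $status;

    public function __construct(array $status)
    {
        $this->status = $status;
    }

    /** Changes the status only if it currently equals $expected; returns rows changed (0 or 1). */
    public function compareAndSet(int $waveNumber, string $expected, string $new): int
    {
        if (($this->status[$waveNumber] ?? null) !== $expected) {
            return 0;
        }
        $this->status[$waveNumber] = $new;
        return 1;
    }

    public function statusOf(int $waveNumber): ?string
    {
        return $this->status[$waveNumber] ?? null;
    }
}

/**
 * Claims a wave for dispatch. Rough SQL equivalent:
 *   UPDATE wave_coordination SET status = 'dispatching'
 *   WHERE job_id = ? AND wave_number = ? AND status = 'pending'
 */
function tryClaimWave(InMemoryWaveStore $store, int $waveNumber): bool
{
    return $store->compareAndSet($waveNumber, 'pending', 'dispatching') === 1;
}
```

A second caller that arrives after the first claim sees zero rows affected and skips the dispatch, which is what prevents duplicate jobs from being queued for the same wave.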

### Listener Improvements (Derived Creation)

- Parent ID sourcing (resilient):
  - If cached `parent_ids_{jobId}_{recordTypeId}` is empty, read Sales Order IDs (`salesorders.refid`) from DB as a fallback to proceed with derived creation.
  - Removed early returns on missing cache to avoid silent failure.

- Derived dependency level separation (change):
  - Derived waves are now created with `dependency_level = -2` (previously reused `-1`). If a `wave_coordination` row pre-existed, it is updated to `-2` for correctness.

- `integration_id` population (fix):
  - Derived `wave_batches` now populate `integration_id` by reading an existing batch for the job. If not discoverable, default to `1` (logged) to avoid payload failures.

- Optional `chunk_ids_json` (compatibility):
  - Include `chunk_ids_json` only if the column exists in `wave_batches`. This avoids SQL errors on tenants that have not yet run the migration.

- Event-driven progression (new):
  - After committing derived batches, emit `DerivedWavesCreated(jobId, waveNumber)` event. A dedicated listener immediately invokes the coordinator to dispatch derived waves without relying on cache/tick latency.
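
The resilient parent ID sourcing can be sketched with two injected readers, one for the cache and one for the DB. The function name `resolveParentIds` and the callable-based shape are assumptions made so the example stays independent of any framework.

```php
<?php
declare(strict_types=1);

/**
 * Returns parent IDs from the cache when present, otherwise from the DB.
 * A cache miss is never treated as "nothing to do".
 *
 * @param callable(): mixed $readCache e.g. reads parent_ids_{jobId}_{recordTypeId}
 * @param callable(): array $readDb    e.g. selects salesorders.refid for the import window
 * @return array{ids: array, source: string}
 */
function resolveParentIds(callable $readCache, callable $readDb): array
{
    $ids = $readCache();
    $source = 'cache';

    if (!is_array($ids) || $ids === []) {
        // Fall back to the DB instead of returning early.
        $ids = $readDb();
        $source = 'db';
    }

    return ['ids' => array_values(array_unique($ids)), 'source' => $source];
}
```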

### Reconcile Command (Operational)

- Tenant DB awareness and direct progression:
  - Added `--tenant` and `--db` options; command sets `tenant_connection` to the provided DB, purges and reconnects.
  - After performing recovery (stale dispatching, pending resets), it directly calls the coordinator progression and reports the result (main or derived wave triggered). This removes reliance on background ticks.

### Cache Brittleness Removed / Replaced

- Derived-ready flag:
  - Replaced cache-only readiness with DB fallback (presence of `-13` batches). The cache key is still set as a hint when DB indicates ready.

- Monitoring cache dependency:
  - The coordinator no longer aborts progression when `wave_monitoring` or `wave_monitoring_active` cache entries are absent. It now reconciles from DB and updates the cache only if monitoring is active.

- IN() parent IDs:
  - Prefer DB-stored `chunk_ids_json`; cache remains a fast path fallback.
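
The DB-first lookup for IN() parent IDs might be sketched as below. The function name `resolveChunkIds` and the injected `$cacheLookup` callable are hypothetical; the cache key format `line_chunk_ids_{batchId}` is the one named in this document.

```php
<?php
declare(strict_types=1);

/**
 * Resolves the parent IDs for a TransactionLine batch:
 * DB-stored chunk_ids_json first, cache second, empty list otherwise.
 *
 * @param callable(string): mixed $cacheLookup receives the cache key
 */
function resolveChunkIds(?string $chunkIdsJson, callable $cacheLookup, string $batchId): array
{
    if ($chunkIdsJson !== null && $chunkIdsJson !== '') {
        $decoded = json_decode($chunkIdsJson, true);
        if (is_array($decoded) && $decoded !== []) {
            return $decoded;
        }
    }

    // Column missing, empty, or unparsable: try the cache.
    $cached = $cacheLookup("line_chunk_ids_{$batchId}");

    return is_array($cached) ? $cached : [];
}
```

An empty result is a signal to log and skip the batch rather than dispatch a job with no parent IDs.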

### Additions vs Removals

- Added:
  - `DerivedWavesCreated` event and `DerivedWavesCreatedListener` to immediately trigger progression after derived creation.
  - CAS dispatch in coordinator and demotion of stale `processing` waves before dispatch.
  - Payload hydration for `integration_id` and `tenant_database` at dispatch time.
  - DB fallback for parent IDs and derived-ready checks.

- Removed/Replaced:
  - Replaced direct post-commit coordinator call with the event-driven approach to decouple concerns and improve reliability.
  - Replaced strict cache dependencies in progression with DB-first reconciliation.
  - Replaced strict requirement for `chunk_ids_json` with schema-checked optional inclusion.

### Why These Changes Were Needed

- Live runs revealed stalls after the last main wave: derived `-13` batches existed but were never dispatched. Root causes included state drift (waves marked `processing` without any active batches), cache-dependent progression missing new waves, payload corruption (`integration_id` null), and level conflation between main and derived waves.
- The changes make progression deterministic, remove cache as a gate, ensure payload completeness, and separate derived waves so they are not blocked by main-wave guards.

### Operational Guidance

- On production incidents:
  - Use `php artisan waves:reconcile {jobId} --tenant=tenant_connection --db={tenant_db}` to recover stale waves, requeue pending, and immediately trigger progression.
  - Verify derived `-13` batches have `integration_id` and either `chunk_ids_json` populated or corresponding cache keys for `line_chunk_ids_{batchId}`.
  - Confirm `wave_coordination` shows derived waves with `dependency_level = -2`.

### Future Enhancements (Optional)

- Block job completion until all `dependency_level = -2` waves complete (if not already enforced by UI/state), to prevent the job from appearing complete while derived waves are still pending.
- Add a small DB poll backoff when no pending waves are found but totals exist (a rare race) to further harden last-wave dispatch.
