# iPaaS Performance Bottleneck Analysis

**Date**: February 2, 2026  
**Status**: Global Throttling Implemented (Priority 1 Complete)

## Executive Summary

Processing large datasets through partition-based pagination saturates the queue system's worker pool. A flow processing 7,000 records with 3 downstream nodes generates 21,000 jobs, overwhelming the current 10-worker configuration. This document outlines the problem, analyzes the bottlenecks, and recommends a layered solution combining throttling and infrastructure scaling.

---

## Problem Statement

### Observed Behavior

When processing a Snowflake dataset with ~11,000 rows using partition-based pagination:
- First partition: 706 records processed successfully
- Second partition: 6,793 records
- Remaining partitions: Processing stalled or severely delayed

### Root Cause

Worker saturation on the `default` queue. Jobs are dispatched faster than workers can process them, causing:
1. Queue depth explosion
2. Memory pressure on Redis
3. Cascading processing delays across all flows

---

## Technical Analysis

### Job Multiplication Factor

Each record entering the flow creates multiple jobs based on node count:

| Flow Configuration | Records | Nodes | Total Jobs |
|-------------------|---------|-------|------------|
| Current flow | 7,000 | 3 | **21,000** |
| Larger dataset | 11,000 | 3 | **33,000** |
| Complex flow | 7,000 | 5 | **35,000** |

**Current flow nodes:**
1. Transform (Heroku API call for JavaScript transformation)
2. Map (Local data mapping)
3. API Node (NetSuite POST)

### Job Cascade Pattern

```
Partition dumps 7,000 records
         │
         ▼
┌─────────────────────────────────────┐
│  7,000 Transform Jobs (Heroku)      │  ← Fast (~0.5-1s)
│  Dispatched immediately             │
└─────────────────────────────────────┘
         │ Each completion dispatches next
         ▼
┌─────────────────────────────────────┐
│  7,000 Map Jobs (Local)             │  ← Very fast (~0.1s)
│  Queue backs up here                │
└─────────────────────────────────────┘
         │ Each completion dispatches next
         ▼
┌─────────────────────────────────────┐
│  7,000 API Node Jobs (NetSuite)     │  ← Slow (1-5s + rate limits)
│  BOTTLENECK - Rate limited API      │
└─────────────────────────────────────┘
```

### Current Infrastructure

| Component | Configuration | Limitation |
|-----------|--------------|------------|
| Queue Server | Single VM | No horizontal scaling |
| Horizon Workers | 10 (default queue) | Fixed capacity |
| Redis | Single instance | Memory/connection limits |
| NetSuite API | External | Rate limited (~10 req/sec) |

### Processing Time Estimates

With 10 workers processing 21,000 jobs:

| Avg Job Duration | Effective Throughput | Total Processing Time |
|------------------|---------------------|----------------------|
| 1 second | 10 jobs/sec | **35 minutes** |
| 2 seconds | 5 jobs/sec | **70 minutes** |
| 3 seconds | 3.3 jobs/sec | **105 minutes** |
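The table values follow from simple division. As a sanity check, a minimal sketch (the function name is illustrative):

```php
<?php
// Estimated wall-clock time for a fixed worker pool:
// throughput = workers / avgJobSeconds; total = jobs / throughput.
function estimateMinutes(int $totalJobs, int $workers, float $avgJobSeconds): float
{
    $jobsPerSecond = $workers / $avgJobSeconds;
    return ($totalJobs / $jobsPerSecond) / 60;
}

echo estimateMinutes(21000, 10, 1.0); // 35
echo estimateMinutes(21000, 10, 3.0); // 105
```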

### Compound Problem: Overlapping Flows

Multiple flows can run concurrently, multiplying the problem:

| Scenario | Total Jobs on Queue |
|----------|---------------------|
| 1 flow × 7,000 records × 3 nodes | 21,000 |
| 2 overlapping flows | **42,000** |
| 3 overlapping flows | **63,000** |

---

## Bottleneck Identification

### Primary Bottleneck: External API Rate Limits

NetSuite and other external APIs have rate limits that create hard ceilings:
- More workers don't help—they just hit the rate limit faster
- Jobs back up waiting for API capacity

### Secondary Bottleneck: Worker Capacity

With only 10 workers, even local processing (Transform, Map) can back up under load.

### Tertiary Bottleneck: No Adaptive Scaling

Current infrastructure cannot scale based on demand:
- Fixed worker count regardless of queue depth
- No backpressure mechanism to slow intake when overwhelmed

---

## Recommended Solutions

### Short-Term: Dispatch Throttling (Implemented)

**Goal**: Prevent any single flow from overwhelming the queue.

**Implementation**:
- Throttle record dispatch at 5-10 records/second
- Originally proposed as per-flow configuration via `pagination_config`; implemented as a global rate instead (see Implementation Notes below)
- Adds staggered `delay()` to ProcessNode dispatches

**Benefits**:
- Works with current infrastructure
- Protects external API rate limits
- Reduces queue depth spikes
- Low risk, immediate benefit

**Recommended dispatch rates**:

| Records/Second | 7,000 Records Enter Over | Jobs/Second Into Queue |
|----------------|--------------------------|------------------------|
| 5 | ~23 minutes | 15 (manageable) |
| 10 | ~12 minutes | 30 (moderate load) |
| 20 | ~6 minutes | 60 (risky) |

**Recommendation**: Start with 5 records/second. (A per-flow override was considered but ultimately rejected; see the Design Decision section in the Implementation Notes.)

### Short-Term: Queue Depth-Aware Dispatch

**Goal**: Self-regulating system that responds to actual load.

**Implementation**:
```
Before dispatching batch:
  1. Check Redis queue depth for 'default'
  2. If depth > threshold (e.g., 5,000 jobs), pause dispatch
  3. Wait and retry with exponential backoff
  4. Resume when depth drops below threshold
```
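A framework-agnostic sketch of that loop. The `$queueDepth` and `$sleep` callables are stand-ins for a real Redis depth check (e.g. `Redis::llen('queues:default')`) and PHP's `sleep()`; all names are illustrative:

```php
<?php
// Wait until queue depth drops below a threshold, backing off exponentially.
// Returns true when dispatch may proceed, false if attempts are exhausted.
function waitForQueueCapacity(
    callable $queueDepth,   // e.g. fn () => Redis::llen('queues:default')
    callable $sleep,        // injectable for testing; normally 'sleep'
    int $threshold = 5000,
    int $maxAttempts = 6
): bool {
    $delay = 1; // seconds
    for ($attempt = 0; $attempt < $maxAttempts; $attempt++) {
        if ($queueDepth() < $threshold) {
            return true;    // capacity available, resume dispatch
        }
        $sleep($delay);
        $delay *= 2;        // exponential backoff: 1, 2, 4, 8, ...
    }
    return false;           // still saturated; caller should re-queue
}
```

In `ProcessFlowPage`, this check would run before each batch; if it returns false, the job can release itself back onto the queue with a delay rather than block a worker.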

**Benefits**:
- Adaptive to actual system capacity
- Works regardless of number of concurrent flows
- Provides backpressure to prevent runaway growth

### Medium-Term: Dedicated Queues per Connector Type

**Goal**: Isolate slow external APIs from fast local processing.

**Implementation**:
```php
// horizon.php
'netsuite' => [
    'connection' => 'redis',
    'queue' => ['netsuite'],
    'maxProcesses' => 5,  // Match API rate limit
    'tries' => 3,
],
'transform' => [
    'connection' => 'redis',
    'queue' => ['transform'],
    'maxProcesses' => 10,
],
```

**Benefits**:
- NetSuite queue limited to 5 concurrent (matches rate limit)
- Transform/Map can proceed independently
- Bottleneck isolation prevents cascade backup
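Dispatch-side, each job must then be routed to its connector's queue. A sketch, assuming a connector-type string is available on the node (the helper and type names are hypothetical; queue names must match the Horizon supervisors above):

```php
<?php
// Map a node's connector type to its dedicated queue. Names are illustrative
// and must stay in sync with the supervisor config in horizon.php.
function queueForConnector(string $connectorType): string
{
    return match ($connectorType) {
        'netsuite'  => 'netsuite',
        'transform' => 'transform',
        default     => 'default',
    };
}

// In the dispatcher (Laravel):
// ProcessNode::dispatch($record, $node)->onQueue(queueForConnector($node->type));
```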

### Long-Term: Kubernetes Auto-Scaling

**Goal**: Dynamic worker scaling based on demand.

#### What K8s Auto-Scaling Solves

| Problem | K8s Solution |
|---------|--------------|
| Fixed worker count | Scale pods based on queue depth |
| VM CPU/memory limits | More pods = more resources |
| Burst handling | Scale up for spikes, scale down after |
| Single point of failure | Pod replicas, auto-restart |
| Cost efficiency | Pay for capacity when needed |

#### What K8s Auto-Scaling Does NOT Solve

| Problem | Why K8s Doesn't Help |
|---------|---------------------|
| NetSuite rate limits | More workers = hit rate limit faster |
| Heroku API limits | Same ceiling regardless of workers |
| Database connection limits | More workers need more connections |

#### Architecture

```
┌─────────────────────────────────────────────────────┐
│                   Kubernetes Cluster                │
├─────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐ │
│  │ Queue Worker│  │ Queue Worker│  │ Queue Worker│ │
│  │   Pod (1)   │  │   Pod (2)   │  │   Pod (N)   │ │
│  └─────────────┘  └─────────────┘  └─────────────┘ │
│         ↑               ↑               ↑          │
│         └───────────────┼───────────────┘          │
│                         │                          │
│              KEDA (Auto-scaler)                    │
│         "Scale when queue depth > 1000"            │
└─────────────────────────────────────────────────────┘
                          │
                    ┌─────┴─────┐
                    │   Redis   │ (Managed, e.g., ElastiCache)
                    └───────────┘
```

**KEDA** (Kubernetes Event-Driven Autoscaling) scales based on:
- Redis queue depth
- Custom metrics (jobs pending, processing time)
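With KEDA's Redis lists scaler, the depth-based rule above might look like this (a sketch; resource names, thresholds, and replica counts are illustrative):

```yaml
# Sketch of a KEDA ScaledObject scaling worker pods on Redis queue depth.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: horizon-worker-scaler
spec:
  scaleTargetRef:
    name: queue-worker        # the worker Deployment
  minReplicaCount: 2
  maxReplicaCount: 20
  triggers:
    - type: redis
      metadata:
        address: redis:6379
        listName: queues:default
        listLength: "1000"    # target pending jobs per replica
```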

#### K8s + Throttling (Recommended Final State)

Even with K8s, throttling remains necessary:
- K8s handles burst capacity and worker scaling
- Throttling protects external API rate limits
- Queue depth-aware dispatch prevents runaway growth

---

## Implementation Priority

| Priority | Solution | Effort | Impact | Status |
|----------|----------|--------|--------|--------|
| **1** | Global throttling | Low | High | ✅ **Implemented** |
| **2** | Queue depth-aware dispatch | Medium | High | Pending |
| **3** | Dedicated queues per connector | Medium | Medium | Pending |
| **4** | Kubernetes migration | High | High | Pending |

---

## Implementation Notes (Priority 1: Global Throttling)

### What Was Implemented

**Global, non-overridable throttling** for all `ProcessNode` job dispatches. Jobs are queued immediately to Redis but with calculated delay times, spreading their release over time.

### Design Decision: Global vs Per-Flow

Throttling is **global and cannot be overridden per-flow** because:

1. **End-users lack insight**: Users don't have visibility into queue depth, worker capacity, or external API rate limits
2. **Consistent behavior**: All flows are treated equally regardless of dataset size
3. **Operational control**: Rate is managed by DevOps via environment variable, not individual flow configurations
4. **Prevents abuse**: A single misconfigured flow cannot overwhelm the queue

### Key Changes

**Files Modified:**

1. **`config/ipaas.php`** (NEW): Global iPaaS configuration file
2. **`src/App/Jobs/ProcessFlowPage.php`**: Throttle logic

**Methods:**

1. **`processRecordsInBatches()`**: Gets throttle rate and passes index to each record dispatch
2. **`processRecord()`**: Calculates delay based on index/rate and applies `->delay()` to dispatched jobs
3. **`getThrottleRate()`**: Returns rate from `config/ipaas.php` (default: 5)

### How It Works

```
ProcessFlowPage receives 7,000 records
         │
         ▼
For each record (index 0-6999):
  - Calculate delay: index / recordsPerSecond
  - Dispatch ProcessNode with ->delay(seconds)
         │
         ▼
All 7,000 jobs queued immediately (fast)
Jobs released from Redis over ~23 minutes (5/sec)
```
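The delay calculation described above can be sketched as follows (method and variable names are illustrative; the actual `ProcessFlowPage` implementation may differ):

```php
<?php
// Staggered-delay calculation: the record at $index is released roughly
// $index / $recordsPerSecond seconds after dispatch. Falls back to the
// default rate of 5 when the configured rate is zero or negative.
function throttleDelaySeconds(int $index, float $recordsPerSecond): int
{
    $rate = $recordsPerSecond > 0 ? $recordsPerSecond : 5;
    return (int) floor($index / $rate);
}

// Laravel usage (sketch):
// ProcessNode::dispatch($record)
//     ->delay(now()->addSeconds(throttleDelaySeconds($i, $rate)));
```

At the default rate of 5, the last of 7,000 records (index 6,999) gets a delay of 1,399 seconds, i.e. the ~23-minute spread shown above.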

**Key benefit**: `ProcessFlowPage` job completes in seconds, not minutes. No long-running process.

### Configuration

Throttle rate is configured **globally** via environment variable or config file.

**Environment Variable:**

```bash
IPAAS_THROTTLE_RECORDS_PER_SECOND=5
```

**Config File** (`config/ipaas.php`):

```php
return [
    'throttle' => [
        'records_per_second' => (float) env('IPAAS_THROTTLE_RECORDS_PER_SECOND', 5),
    ],
];
```

### Key Properties

| Property | Value |
|----------|-------|
| **Rate Control** | DevOps only (env var / config) |
| **Per-Flow Override** | ❌ Not allowed |
| **Default Rate** | 5 records/second |

### Tests

8 unit tests in `tests/Unit/Jobs/ProcessFlowPagePaginationTest.php` covering:
- Rate reads from global config
- Rate defaults to 5 when not configured
- Per-flow config is ignored
- Rate defaults to 5 when zero or negative
- Delay calculation verification

---

## Configuration Recommendations

### Global Throttle Settings

Set via environment variable in `.env`:

```bash
# iPaaS Throttle Rate (records per second)
# Lower = gentler on queue, higher = faster processing
# Recommended: 5-10 for flows with external API calls
IPAAS_THROTTLE_RECORDS_PER_SECOND=5
```

### Default Values

| Setting | Default | Description |
|---------|---------|-------------|
| `IPAAS_THROTTLE_RECORDS_PER_SECOND` | 5 | Records dispatched per second (global) |

**Note**: `maxQueueDepth` and `pauseOnQueueFull` are listed as Priority 2 (Queue depth-aware dispatch) and are not yet implemented.

---

## Monitoring Recommendations

### Metrics to Track

1. **Queue depth** (per queue): Alert if > 10,000
2. **Job processing time** (p50, p95, p99): Identify slow nodes
3. **Failed jobs rate**: Catch rate limit errors
4. **Worker utilization**: Identify saturation

### Alerting Thresholds

| Metric | Warning | Critical |
|--------|---------|----------|
| Queue depth | 5,000 | 10,000 |
| Job failure rate | 1% | 5% |
| Avg processing time | 5s | 10s |

---

## Related Documentation

- [iPaaS Overview](../../architecture/ipaas/ipass_overview.md)
- [iPaaS Configuration](../../../config/ipaas.php) - Global throttle settings
- [Partition-Based Pagination Handler](../../../src/App/Services/Ipaas/Pagination/Handlers/PartitionBasedHandler.php)
- [Horizon Configuration](../../../config/horizon.php)

---

## Changelog

| Date | Author | Change |
|------|--------|--------|
| 2026-02-02 | AI Analysis | Initial analysis and recommendations |
| 2026-02-02 | AI Implementation | Implemented global throttling (Priority 1) - non-overridable, DevOps controlled via env var |
