# ProcessFlow Metrics Management & Failure Classification Design

**Version**: 1.0
**Date**: September 10, 2025
**Author**: Quinn Johns (w/Cursor)

## How to Use This Document

This design document contains different types of code examples with specific implementation intents:

### 🔧 Ready-to-Use Code

- **ProcessFlow Class Methods**: Complete method implementations - integrate directly into existing ProcessFlow class
- **Redis Key Patterns**: Production-ready key structures - use these exact patterns for consistency
- **Cleanup Service Methods**: Full cleanup logic - add to CleanupFlowDataService or create new service
- **Exception Classes**: Complete exception definitions - create these classes in your exceptions directory

### 📋 Architectural Contracts

- **Metrics Management Interface**: Method signatures with detailed behavior - implement the core logic following these patterns
- **State Management Patterns**: Redis operations and key structures - follow these approaches for consistency
- **Event Dispatching**: Flow completion/failure events - integrate with existing event system
- **Job Release Logic**: Retry and backoff strategies - adapt to your queue configuration

### 🧪 Development Guidance

- **Test Examples**: Complete test structure using Pest - guide your testing approach and coverage
- **Error Classification**: Exception handling patterns - reference for robust error management
- **Monitoring Patterns**: Logging and metrics collection - follow these approaches for observability
- **Performance Optimization**: Redis operations and TTL management - reference for efficient implementation

### 🚨 Critical Implementation Notes

- **Backward Compatibility**: All changes must maintain existing ProcessFlow behavior
- **Redis Key Consistency**: Use exact key patterns to prevent conflicts with existing systems
- **Job Queue Integration**: Ensure compatibility with Laravel job release mechanisms
- **Multi-Tenant Safety**: All Redis operations must include tenant isolation

### Implementation Priority

1. **Implement core ProcessFlow methods** (initializeFlowMetrics, cleanupCurrentJobMetrics)
2. **Add exception classes** for proper error classification
3. **Create cleanup service methods** for orphaned metrics management
4. **Follow testing patterns** to ensure reliability
5. **Add monitoring and logging** following the established patterns

### Code Section Guide

- **`### Implementation Example`** = Copy and adapt these methods directly
- **`### Architecture Pattern`** = Follow this structure, implement your business logic
- **`### Testing Strategy`** = Use as template for your test cases
- **`### Monitoring Example`** = Reference for logging and metrics patterns

## Document Overview

**Purpose**: Design comprehensive metrics management for ProcessFlow jobs that handles job releases, retry scenarios, and proper failure classification.

**Target**: Developer implementing metrics cleanup and flow failure tracking

**Scope**: ProcessFlow job metrics lifecycle, cleanup strategies, and failure state management

---

## Problem Statement

### Current Issues

1. **Orphaned Metrics**: Laravel job releases generate new job IDs, leaving old metrics unreachable
2. **Inconsistent Failure Classification**: No distinction between API failures vs successful requests with no data
3. **Metric Pollution**: Accumulating unused metrics over time without cleanup
4. **Lost Context**: Retry attempts lose connection to original flow metrics
5. **Incomplete Monitoring**: Cannot track flow success/failure accurately across job retries

### Business Impact

- **Monitoring Gaps**: Cannot accurately measure flow success rates
- **Resource Waste**: Redis memory consumption from orphaned metrics
- **Debugging Difficulty**: Lost context across job retries makes troubleshooting complex
- **Operational Blindness**: No clear visibility into flow health and failure patterns

---

## Requirements

### Functional Requirements

1. **Persistent Flow Context**: Metrics survive job releases and Laravel retries
2. **Accurate Failure Classification**: Distinguish between different failure types
3. **Automatic Cleanup**: Remove orphaned and expired metrics
4. **Context Restoration**: Restore flow state when job retries
5. **Comprehensive Logging**: Track all metric operations for debugging

### Non-Functional Requirements

1. **Performance**: Metric operations should not impact job execution time
2. **Reliability**: Metric cleanup should not interfere with active flows
3. **Scalability**: Handle high-volume metric operations efficiently
4. **Maintainability**: Clear separation between job-level and flow-level metrics

---

## Architecture Overview

### Metric Key Strategy

```
Flow-Level Hash (persistent across job retries):
flow:{flowId}:run:{runId}
├── status / retry_attempt / current_job_id
├── api_attempts / api_failures / api_success
├── last_api_error_type / last_api_error_message
├── started_at / restarted_at / completed_at / failed_at
└── failure_type / failure_reason

Flow-Level Counter Key (persistent):
└── flow:{flowId}:run:{runId}:total_api_failures

Job-Level Keys (temporary, cleaned on release):
├── flow:{flowId}:run:{runId}:job:{jobId}:start_time
├── flow:{flowId}:run:{runId}:job:{jobId}:metrics
└── flow:{flowId}:run:{runId}:job:{jobId}:performance
```
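
The Multi-Tenant Safety note above requires tenant isolation in every Redis operation, but the patterns here omit a tenant segment for brevity. One way to reconcile the two is a single key-builder method that prepends the tenant identifier; this is a sketch, and the use of `$this->tenantDatabase` as the tenant discriminator is an assumption:

```php
/**
 * Sketch: tenant-scoped metric key builder (hypothetical helper).
 * Assumes $this->tenantDatabase uniquely identifies the tenant; every
 * Redis read/write should go through this builder so isolation cannot
 * be forgotten at individual call sites.
 */
private function flowMetricsKey(): string
{
    // e.g. "tenant:acme_db:flow:42:run:7f3a9c"
    return "tenant:{$this->tenantDatabase}:flow:{$this->flow_id}:run:{$this->runId}";
}
```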

### State Machine

```mermaid
stateDiagram-v2
    [*] --> Initializing
    Initializing --> Active : First attempt
    Initializing --> Restoring : Retry detected
    Restoring --> Active : Context restored
    Active --> ApiRetrying : API errors
    ApiRetrying --> Active : Retry succeeded
    Active --> CompletedNoData : Success, no records
    Active --> Processing : Success, has records
    ApiRetrying --> ReleasingToQueue : Retries exhausted
    ReleasingToQueue --> [*] : Job released
    Active --> Failed : Non-retriable error
    Processing --> Completed : Flow finished
    Failed --> [*] : Cleanup complete
    Completed --> [*] : Cleanup complete
    CompletedNoData --> [*] : Cleanup complete
```

---

## Implementation Details

### 1. ProcessFlow Class Modifications

#### Properties and Configuration

```php
class ProcessFlow implements ShouldQueue
{
    public $tries = 5;           // Maximum Laravel job retries
    public $backoff = [60, 180, 300, 600]; // Progressive delays
    public $timeout = 300;       // 5-minute job timeout

    // Add new properties for metrics management
    private string $flowMetricsKey;
    private string $currentJobId;
    private bool $isRetryAttempt = false;

    public function __construct($flow_id, $node, $tenantDatabase, $runId = null)
    {
        // Existing constructor logic...

        // Initialize metrics key
        $this->flowMetricsKey = "flow:{$this->flow_id}:run:{$this->runId}";
    }
}
```

#### Main Handle Method Enhancement

```php
public function handle()
{
    try {
        setupTenantConnection($this->tenantDatabase);

        // NEW: Initialize or restore flow metrics
        $this->initializeFlowMetrics();

        // Execute the first API node with comprehensive error handling
        $result = $this->executeFirstApiNodeWithRetry();

        // Handle different response scenarios
        $this->handleApiResponse($result);

    } catch (ApiRetryExhaustedException $e) {
        $this->handleApiRetryExhaustion($e);
    } catch (NonRetriableApiException $e) {
        $this->handleNonRetriableError($e);
    } catch (\Exception $e) {
        $this->handleUnexpectedError($e);
    }
}
```
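
`executeFirstApiNodeWithRetry()` is referenced above but its implementation belongs to the existing ApiRetryService integration. The sketch below captures only the contract the handle method relies on — increment counters, pass non-retriable errors through, and convert exhausted retries into `ApiRetryExhaustedException`. The `ApiRetryService` class and its `execute()` method are assumptions; substitute your real service:

```php
/**
 * Sketch only: delegate to the existing retry service (name assumed)
 * and translate its outcomes into the exceptions handle() expects.
 */
private function executeFirstApiNodeWithRetry(): mixed
{
    Redis::hincrby($this->flowMetricsKey, 'api_attempts', 1);

    try {
        return app(ApiRetryService::class)->execute($this->node);
    } catch (NonRetriableApiException $e) {
        throw $e; // 401/403/404 — never retry
    } catch (\Throwable $e) {
        Redis::hincrby($this->flowMetricsKey, 'api_failures', 1);
        throw new ApiRetryExhaustedException($e->getMessage(), previous: $e);
    }
}
```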

### 2. Metrics Initialization and Restoration

```php
/**
 * Initialize flow metrics for first attempt or restore context for retries
 */
private function initializeFlowMetrics(): void
{
    $this->currentJobId = $this->job->getJobId();

    // Check if this is a retry (existing flow metrics)
    $existingStatus = Redis::hget($this->flowMetricsKey, 'status');

    if ($existingStatus && !in_array($existingStatus, ['completed', 'failed', 'completed_no_data'], true)) {
        // This is a retry - restore context
        $this->restoreFlowContext($existingStatus);
    } else {
        // First attempt - initialize fresh metrics
        $this->initializeFreshMetrics();
    }
}

/**
 * Restore flow context for retry attempts
 */
private function restoreFlowContext(string $previousStatus): void
{
    $this->isRetryAttempt = true;
    $retryAttempt = (int) Redis::hget($this->flowMetricsKey, 'retry_attempt') ?: 0;
    $retryAttempt++;

    Log::info('🔄 ProcessFlow retry detected, restoring context', [
        'flow_id' => $this->flow_id,
        'run_id' => $this->runId,
        'new_job_id' => $this->currentJobId,
        'retry_attempt' => $retryAttempt,
        'previous_status' => $previousStatus
    ]);

    // Update status to active with new job ID
    Redis::hmset($this->flowMetricsKey, [
        'status' => 'active',
        'current_job_id' => $this->currentJobId,
        'retry_attempt' => $retryAttempt,
        'restarted_at' => now()->toIso8601String()
    ]);

    // Clean up any previous job metrics
    $this->cleanupPreviousJobMetrics();
}

/**
 * Initialize fresh metrics for first attempt
 */
private function initializeFreshMetrics(): void
{
    Redis::hmset($this->flowMetricsKey, [
        'status' => 'active',
        'started_at' => now()->toIso8601String(),
        'current_job_id' => $this->currentJobId,
        'retry_attempt' => 0,
        'api_attempts' => 0,
        'api_failures' => 0
    ]);

    Redis::expire($this->flowMetricsKey, 86400); // 24 hour TTL

    Log::info('🚀 ProcessFlow metrics initialized', [
        'flow_id' => $this->flow_id,
        'run_id' => $this->runId,
        'job_id' => $this->currentJobId
    ]);
}
```

### 3. API Response Handling

```php
/**
 * Handle different API response scenarios
 */
private function handleApiResponse($apiResponse): void
{
    if ($this->isErrorResponse($apiResponse)) {
        throw new ApiRetryExhaustedException('API returned error after retries');
    }

    $recordCount = $this->extractRecordCount($apiResponse);

    if ($recordCount === 0) {
        // Success with no data - valid outcome
        $this->markFlowAsCompletedWithNoData();
    } else {
        // Success with data - continue normal processing
        $this->markApiSuccess($recordCount);
        $this->processApiResponse($apiResponse);
    }
}

/**
 * Mark flow as completed successfully with no data
 */
private function markFlowAsCompletedWithNoData(): void
{
    Redis::hmset($this->flowMetricsKey, [
        'status' => 'completed_no_data',
        'completed_at' => now()->toIso8601String(),
        'records_processed' => 0,
        'api_success' => true
    ]);

    // Update global success counters
    Redis::incr("metrics:process_flow:completed:no_data");

    Log::info('✅ ProcessFlow completed successfully with no data', [
        'flow_id' => $this->flow_id,
        'run_id' => $this->runId,
        'retry_attempts' => Redis::hget($this->flowMetricsKey, 'retry_attempt')
    ]);

    // Dispatch flow completion event
    FlowCompleted::dispatch($this->flow_id, $this->runId, 'no_data');

    // Clean up all metrics
    $this->performFinalMetricCleanup();
}

/**
 * Mark successful API call with data
 */
private function markApiSuccess(int $recordCount): void
{
    Redis::hmset($this->flowMetricsKey, [
        'status' => 'processing',
        'api_success' => true,
        'records_found' => $recordCount,
        'api_success_at' => now()->toIso8601String()
    ]);

    // Update global success counters
    Redis::incr("metrics:process_flow:api_success");
    Redis::incrby("metrics:process_flow:records_found", $recordCount);
}
```

### 4. Error Handling and Job Release

```php
/**
 * Handle API retry exhaustion - release job to queue
 */
private function handleApiRetryExhaustion(ApiRetryExhaustedException $e): void
{
    // Update persistent flow-level metrics
    $this->updateFlowRetryMetrics($e);

    // Clean up current job metrics
    $this->cleanupCurrentJobMetrics();

    // Calculate appropriate retry delay
    $retryDelay = $this->calculateJobRetryDelay($e->getErrorClassification());

    Log::warning('🔄 ProcessFlow releasing to queue, metrics cleaned', [
        'flow_id' => $this->flow_id,
        'run_id' => $this->runId,
        'old_job_id' => $this->currentJobId,
        'retry_attempt' => Redis::hget($this->flowMetricsKey, 'retry_attempt'),
        'retry_delay' => $retryDelay,
        'error_type' => $e->getErrorClassification()
    ]);

    $this->release($retryDelay);
}

/**
 * Update flow metrics when API retries are exhausted
 */
private function updateFlowRetryMetrics(ApiRetryExhaustedException $e): void
{
    Redis::hmset($this->flowMetricsKey, [
        'status' => 'retrying_in_queue',
        'last_api_error_type' => $e->getErrorClassification(),
        'last_api_error_message' => $e->getLastError(),
        'api_attempts_made' => $e->getAttemptsMade(),
        'last_retry_at' => now()->toIso8601String()
    ]);

    Redis::incr("{$this->flowMetricsKey}:total_api_failures");
    Redis::expire($this->flowMetricsKey, 86400); // Extend TTL
}

/**
 * Calculate job retry delay based on error type
 */
private function calculateJobRetryDelay(string $errorType): int
{
    $delays = [
        'concurrency_errors' => 60,  // 1 minute - quick retry
        'rate_limit_errors' => 300,  // 5 minutes - respect limits
        'service_unavailable' => 900, // 15 minutes - service issues
        'network_timeout' => 180,    // 3 minutes - network issues
        'default' => 180             // 3 minutes - fallback
    ];

    return $delays[$errorType] ?? $delays['default'];
}
```
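
The exception classes used throughout this section are listed as ready-to-use in the guide above but are not defined in this document. A minimal sketch consistent with the accessors the handlers call (`getErrorClassification()`, `getLastError()`, `getAttemptsMade()`) follows; constructor shapes are assumptions, so adapt them to how your retry service raises errors:

```php
/**
 * Sketch: exception definitions matching the accessors used above.
 * Constructor parameter shapes are assumptions.
 */
class ApiRetryExhaustedException extends \RuntimeException
{
    public function __construct(
        string $message,
        private string $errorClassification = 'default',
        private string $lastError = '',
        private int $attemptsMade = 0,
        ?\Throwable $previous = null,
    ) {
        parent::__construct($message, 0, $previous);
    }

    public function getErrorClassification(): string { return $this->errorClassification; }
    public function getLastError(): string { return $this->lastError ?: $this->getMessage(); }
    public function getAttemptsMade(): int { return $this->attemptsMade; }
}

class NonRetriableApiException extends \RuntimeException
{
    public function __construct(string $message, private int $statusCode = 0)
    {
        parent::__construct($message);
    }

    public function getStatusCode(): int { return $this->statusCode; }
}
```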

### 5. Metrics Cleanup Methods

```php
/**
 * Clean up current job-specific metrics before job release
 */
private function cleanupCurrentJobMetrics(): void
{
    $jobMetricsPattern = "flow:{$this->flow_id}:run:{$this->runId}:job:{$this->currentJobId}:*";

    // NOTE: KEYS is O(n) over the whole keyspace and blocks Redis;
    // prefer a SCAN-based lookup at production scale
    $jobMetricKeys = Redis::keys($jobMetricsPattern);

    if (!empty($jobMetricKeys)) {
        Redis::del($jobMetricKeys);

        Log::debug('🧹 Cleaned up job metrics before release', [
            'flow_id' => $this->flow_id,
            'job_id' => $this->currentJobId,
            'keys_cleaned' => count($jobMetricKeys)
        ]);
    }
}

/**
 * Clean up metrics from previous job attempts
 */
private function cleanupPreviousJobMetrics(): void
{
    // Get all job metric keys for this flow/run
    $allJobMetricsPattern = "flow:{$this->flow_id}:run:{$this->runId}:job:*";
    $allJobKeys = Redis::keys($allJobMetricsPattern);

    if (!empty($allJobKeys)) {
        // Filter out current job metrics
        $currentJobPattern = "flow:{$this->flow_id}:run:{$this->runId}:job:{$this->currentJobId}:";
        $previousJobKeys = array_filter($allJobKeys, function($key) use ($currentJobPattern) {
            return !str_contains($key, $currentJobPattern);
        });

        if (!empty($previousJobKeys)) {
            Redis::del($previousJobKeys);

            Log::debug('🧹 Cleaned up previous job metrics', [
                'flow_id' => $this->flow_id,
                'current_job_id' => $this->currentJobId,
                'keys_cleaned' => count($previousJobKeys)
            ]);
        }
    }
}

/**
 * Perform final cleanup when flow completes or fails
 */
private function performFinalMetricCleanup(): void
{
    // Clean up all job-specific metrics
    $allJobMetricsPattern = "flow:{$this->flow_id}:run:{$this->runId}:job:*";
    $jobKeys = Redis::keys($allJobMetricsPattern);

    if (!empty($jobKeys)) {
        Redis::del($jobKeys);
    }

    // Set shorter TTL on flow metrics for final cleanup
    Redis::expire($this->flowMetricsKey, 3600); // 1 hour for analysis

    Log::info('🧹 Final metric cleanup completed', [
        'flow_id' => $this->flow_id,
        'run_id' => $this->runId,
        'job_keys_cleaned' => count($jobKeys)
    ]);
}
```
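
The cleanup methods above use `Redis::keys()`, which blocks the Redis server while it walks the entire keyspace. For production volumes a cursor-based SCAN is safer. This is a sketch; the `scan()` signature varies by client, and the form below assumes Laravel's phpredis connection wrapper, which returns a `[cursor, keys]` pair:

```php
/**
 * Sketch: non-blocking SCAN-based replacement for Redis::keys().
 * Signature assumption: Laravel's phpredis connection returns
 * [nextCursor, matchedKeys] per iteration.
 */
private function scanKeys(string $pattern): array
{
    $keys = [];
    $cursor = 0;

    do {
        [$cursor, $batch] = Redis::connection()->scan(
            $cursor,
            ['match' => $pattern, 'count' => 500]
        );
        $keys = array_merge($keys, $batch ?: []);
    } while ((int) $cursor !== 0); // cursor 0 signals a complete scan

    return $keys;
}
```

With this in place, each `Redis::keys($pattern)` call in the cleanup methods can be swapped for `$this->scanKeys($pattern)` without changing behavior.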

### 6. Job Failure Handling

```php
/**
 * Handle job failure after all Laravel retries exhausted
 */
public function failed(\Throwable $exception): void
{
    // Determine failure type
    $failureType = $this->classifyFlowFailure($exception);

    // Update flow metrics with final failure status
    $this->markFlowAsFailed($failureType, $exception);

    // Clean up all associated metrics
    $this->performFinalMetricCleanup();

    // Notify monitoring systems
    $this->notifyFlowFailure($failureType, $exception);
}

/**
 * Classify the type of flow failure
 */
private function classifyFlowFailure(\Throwable $exception): string
{
    if ($exception instanceof ApiRetryExhaustedException) {
        return 'api_failure'; // API requests failed after retries
    } elseif ($exception instanceof NonRetriableApiException) {
        return 'api_non_retriable'; // 401/403/404 errors
    } elseif ($exception instanceof \Illuminate\Queue\TimeoutExceededException) {
        return 'job_timeout'; // Job execution timeout
    } elseif ($exception instanceof \Illuminate\Queue\MaxAttemptsExceededException) {
        return 'max_attempts'; // Laravel retry limit reached
    } else {
        return 'system_error'; // Unexpected errors
    }
}

/**
 * Mark flow as failed with comprehensive metrics
 */
private function markFlowAsFailed(string $failureType, \Throwable $exception): void
{
    Redis::hmset($this->flowMetricsKey, [
        'status' => 'failed',
        'failure_type' => $failureType,
        'failure_reason' => $exception->getMessage(),
        'failed_at' => now()->toIso8601String(),
        'total_job_attempts' => $this->attempts(),
        'final_job_id' => $this->currentJobId
    ]);

    // Update global failure counters
    Redis::incr("metrics:process_flow:failures:total");
    Redis::incr("metrics:process_flow:failures:by_type:{$failureType}");

    Log::error('❌ ProcessFlow marked as failed', [
        'flow_id' => $this->flow_id,
        'run_id' => $this->runId,
        'failure_type' => $failureType,
        'job_attempts' => $this->attempts(),
        'retry_attempts' => Redis::hget($this->flowMetricsKey, 'retry_attempt'),
        'exception' => $exception->getMessage()
    ]);

    // Dispatch flow failure event
    FlowFailed::dispatch($this->flow_id, $this->runId, $failureType, $exception->getMessage());
}
```
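
`notifyFlowFailure()` is invoked from `failed()` but intentionally left to your alerting stack. A minimal sketch that logs at a severity your monitoring can alert on — the `FlowFailed` event dispatched in `markFlowAsFailed()` already covers in-app consumers, so this hook is for external systems only:

```php
/**
 * Sketch: external alerting hook. Replace the log call with your
 * alerting integration (Slack, PagerDuty, Sentry, etc.) — the exact
 * channel is an assumption left open here.
 */
private function notifyFlowFailure(string $failureType, \Throwable $exception): void
{
    Log::critical('🚨 ProcessFlow permanently failed', [
        'flow_id' => $this->flow_id,
        'run_id' => $this->runId,
        'failure_type' => $failureType,
        'message' => $exception->getMessage(),
    ]);
}
```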

### 7. Cleanup Service for Orphaned Metrics

```php
/**
 * Service to clean up orphaned and expired metrics
 */
class FlowMetricsCleanupService
{
    /**
     * Clean up orphaned and expired metrics
     */
    public function cleanupOrphanedMetrics(): array
    {
        $results = [
            'expired_flows' => 0,
            'orphaned_jobs' => 0,
            'total_keys_cleaned' => 0
        ];

        // Clean up expired flow metrics
        $results['expired_flows'] = $this->cleanupExpiredFlowMetrics();

        // Clean up orphaned job metrics
        $results['orphaned_jobs'] = $this->cleanupOrphanedJobMetrics();

        // Clean up global metrics older than 30 days
        $this->cleanupOldGlobalMetrics();

        $results['total_keys_cleaned'] = $results['expired_flows'] + $results['orphaned_jobs'];

        Log::info('🧹 Flow metrics cleanup completed', $results);

        return $results;
    }

    /**
     * Find and clean up expired flow metrics
     */
    private function cleanupExpiredFlowMetrics(): int
    {
        $cleanedCount = 0;
        $flowPattern = "flow:*:run:*";
        $flowKeys = Redis::keys($flowPattern);

        foreach ($flowKeys as $flowKey) {
            // Skip if key has job or counter suffix
            if (str_contains($flowKey, ':job:') || str_contains($flowKey, ':total_')) {
                continue;
            }

            $status = Redis::hget($flowKey, 'status');
            $completedAt = Redis::hget($flowKey, 'completed_at');
            $failedAt = Redis::hget($flowKey, 'failed_at');

            // Clean up terminal flows (completed/failed) older than 24 hours
            if (in_array($status, ['completed', 'failed', 'completed_no_data'], true)) {
                $endTime = $completedAt ?: $failedAt;
                if ($endTime && Carbon::parse($endTime)->addDay()->isPast()) {
                    $this->cleanupFlowMetrics($flowKey);
                    $cleanedCount++;
                }
                continue; // Terminal flows are never "abandoned"
            }

            // Clean up abandoned flows (non-terminal, no status update in 4 hours)
            $lastUpdate = Redis::hget($flowKey, 'restarted_at') ?: Redis::hget($flowKey, 'started_at');
            if ($lastUpdate && Carbon::parse($lastUpdate)->addHours(4)->isPast()) {
                $this->cleanupFlowMetrics($flowKey);
                $cleanedCount++;
            }
        }

        return $cleanedCount;
    }

    /**
     * Clean up all metrics associated with a flow
     */
    private function cleanupFlowMetrics(string $flowKey): void
    {
        // Extract flow ID and run ID from key
        if (preg_match('/flow:(.+):run:(.+)$/', $flowKey, $matches)) {
            $flowId = $matches[1];
            $runId = $matches[2];

            // Find all related keys
            $relatedPattern = "flow:{$flowId}:run:{$runId}:*";
            $relatedKeys = Redis::keys($relatedPattern);
            $relatedKeys[] = $flowKey; // Include the main flow key

            if (!empty($relatedKeys)) {
                Redis::del($relatedKeys);
            }
        }
    }

    /**
     * Clean up orphaned job metrics
     */
    private function cleanupOrphanedJobMetrics(): int
    {
        $cleanedCount = 0;
        $jobPattern = "flow:*:run:*:job:*";
        $jobKeys = Redis::keys($jobPattern);

        foreach ($jobKeys as $jobKey) {
            // Extract flow and run from job key
            if (preg_match('/flow:(.+):run:(.+):job:/', $jobKey, $matches)) {
                $flowKey = "flow:{$matches[1]}:run:{$matches[2]}";

                // Check if parent flow still exists
                if (!Redis::exists($flowKey)) {
                    Redis::del($jobKey);
                    $cleanedCount++;
                }
            }
        }

        return $cleanedCount;
    }
}
```
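
`cleanupOldGlobalMetrics()` is called above but not shown. Plain counters like `metrics:process_flow:failures:total` cannot be aged out selectively, so one approach is to also write date-bucketed counters and expire those. The sketch below assumes a `:YYYY-MM-DD` suffix convention that is not part of the patterns defined earlier — adopt it only if you add the matching writes:

```php
/**
 * Sketch: expire date-bucketed global counters after 30 days.
 * Assumes counters are additionally written with a :YYYY-MM-DD
 * suffix (e.g. metrics:process_flow:failures:total:2025-09-10).
 */
private function cleanupOldGlobalMetrics(): void
{
    $cutoff = now()->subDays(30);

    foreach (Redis::keys('metrics:process_flow:*:20*') as $key) {
        // Date suffix is the last colon-delimited segment
        $date = substr($key, strrpos($key, ':') + 1);

        if (Carbon::parse($date)->lt($cutoff)) {
            Redis::del($key);
        }
    }
}
```

Setting a 30-day TTL at write time (`Redis::expire`) on each bucket is a simpler alternative that avoids the sweep entirely.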

---

## Testing Strategy

### Unit Tests

Note: several methods exercised below (`initializeFlowMetrics()`, `cleanupCurrentJobMetrics()`, `classifyFlowFailure()`) are `private` in the sketches above; in real tests, make them `protected` and subclass, or invoke them via reflection.

```php
describe('ProcessFlow Metrics Management', function () {
    beforeEach(function () {
        Redis::spy();
        config(['database.default' => 'tenant_connection']);
    });

    it('initializes fresh metrics for first attempt', function () {
        $processFlow = new ProcessFlow($flowId, $node, $tenantDb, $runId);
        $processFlow->initializeFlowMetrics();

        Redis::shouldHaveReceived('hmset')
            ->with("flow:{$flowId}:run:{$runId}", Mockery::subset([
                'status' => 'active',
                'retry_attempt' => 0
            ]));
    });

    it('restores context for retry attempts', function () {
        // Mock existing flow metrics
        Redis::shouldReceive('hget')
            ->with("flow:{$flowId}:run:{$runId}", 'status')
            ->andReturn('retrying_in_queue');

        $processFlow->initializeFlowMetrics();

        expect($processFlow->isRetryAttempt)->toBeTrue();
    });

    it('cleans up job metrics before release', function () {
        Redis::shouldReceive('keys')
            ->with("flow:{$flowId}:run:{$runId}:job:{$jobId}:*")
            ->andReturn(['key1', 'key2']);

        Redis::shouldReceive('del')
            ->with(['key1', 'key2']);

        $processFlow->cleanupCurrentJobMetrics();
    });

    it('classifies different failure types correctly', function () {
        $apiException = new ApiRetryExhaustedException('API failed');
        expect($processFlow->classifyFlowFailure($apiException))->toBe('api_failure');

        $timeoutException = new TimeoutExceededException('Timeout');
        expect($processFlow->classifyFlowFailure($timeoutException))->toBe('job_timeout');
    });
});
```

### Integration Tests

```php
describe('ProcessFlow Metrics Integration', function () {
    beforeEach(function () {
        setupRedisForTesting();
    });

    it('maintains metrics consistency across job retries', function () {
        // First job attempt
        $processFlow1 = new ProcessFlow($flowId, $node, $tenantDb, $runId);
        $processFlow1->handle(); // Simulate API failure

        // Second job attempt (retry)
        $processFlow2 = new ProcessFlow($flowId, $node, $tenantDb, $runId);
        $processFlow2->handle(); // Should restore context

        $retryAttempt = Redis::hget("flow:{$flowId}:run:{$runId}", 'retry_attempt');
        expect((int)$retryAttempt)->toBeGreaterThan(0);
    });

    it('performs complete cleanup on flow completion', function () {
        $processFlow = new ProcessFlow($flowId, $node, $tenantDb, $runId);
        $processFlow->markFlowAsCompletedWithNoData();

        // Check that job metrics are cleaned up
        $jobKeys = Redis::keys("flow:{$flowId}:run:{$runId}:job:*");
        expect($jobKeys)->toBeEmpty();

        // Check that flow metrics have shorter TTL
        $ttl = Redis::ttl("flow:{$flowId}:run:{$runId}");
        expect($ttl)->toBeLessThanOrEqual(3600);
    });
});
```

---

## Implementation Plan

### Phase 1: Core Metrics Infrastructure
- [ ] Implement `initializeFlowMetrics()` and `restoreFlowContext()`
- [ ] Add metric key management methods
- [ ] Implement basic cleanup methods
- [ ] Update ProcessFlow constructor and handle method

### Phase 2: Error Handling Integration
- [ ] Implement `handleApiRetryExhaustion()` method
- [ ] Add failure classification logic
- [ ] Integrate with existing ApiRetryService
- [ ] Implement job release with cleanup

### Phase 3: Completion and Cleanup
- [ ] Implement `markFlowAsCompletedWithNoData()` method
- [ ] Add final cleanup methods
- [ ] Implement `FlowMetricsCleanupService`
- [ ] Add scheduled cleanup task
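
The scheduled cleanup task can be registered in the console kernel; a sketch of one reasonable cadence (hourly, with overlap protection — the frequency is a judgment call, not a requirement):

```php
// app/Console/Kernel.php — sketch of the scheduled cleanup registration
protected function schedule(Schedule $schedule): void
{
    $schedule->call(fn () => app(FlowMetricsCleanupService::class)->cleanupOrphanedMetrics())
        ->hourly()
        ->name('flow-metrics-cleanup')   // required for withoutOverlapping on closures
        ->withoutOverlapping();
}
```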

### Phase 4: Testing and Monitoring
- [ ] Write comprehensive unit tests
- [ ] Implement integration tests
- [ ] Add performance monitoring
- [ ] Create operational dashboards

---

## Success Criteria

### Functional Success
- [ ] **Metric Persistence**: Flow metrics survive job releases and retries
- [ ] **Accurate Classification**: Clear distinction between failure types
- [ ] **Automatic Cleanup**: No metric pollution or memory leaks
- [ ] **Context Restoration**: Retry attempts maintain flow context

### Performance Success
- [ ] **Response Time**: Metric operations add <50ms to job execution
- [ ] **Memory Usage**: Redis memory growth stays within 5% increase
- [ ] **Cleanup Efficiency**: Cleanup service processes 1000+ metrics/minute

### Operational Success
- [ ] **Monitoring Clarity**: Clear flow success/failure rates in dashboards
- [ ] **Debugging Capability**: Full metric history available for troubleshooting
- [ ] **Alerting Accuracy**: Alerts trigger only for actual failures

---

## Monitoring and Alerting

### Key Metrics to Track
- `metrics:process_flow:completed:total` - Total successful flows
- `metrics:process_flow:completed:no_data` - Successful flows with no data
- `metrics:process_flow:failures:total` - Total failed flows
- `metrics:process_flow:failures:by_type:{type}` - Failures by classification
- `metrics:process_flow:api_success` - API call success rate
- `metrics:process_flow:metrics_cleaned` - Cleanup operation effectiveness

### Recommended Alerts
- **High Failure Rate**: >10% failure rate in 15-minute window
- **Metric Cleanup Issues**: Cleanup service failing for >1 hour
- **Memory Growth**: Redis memory growth >20% in 24 hours
- **Long-Running Flows**: Flows active for >4 hours without completion
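
The high-failure-rate alert needs windowed counts, which the plain totals above do not provide on their own. One sketch, assuming completions and failures are additionally incremented into per-window bucket keys (the `:window:{bucket}` naming is an assumption, not part of the patterns defined earlier):

```php
/**
 * Sketch: 15-minute failure-rate check. Assumes writers also increment
 * window-bucketed counters, e.g. metrics:process_flow:failures:window:{bucket},
 * where {bucket} is the window start floored to 15 minutes.
 */
function flowFailureRateExceeded(float $threshold = 0.10): bool
{
    // Floor current time to the 15-minute window start
    $bucket = now()->startOfHour()
        ->addMinutes(intdiv(now()->minute, 15) * 15)
        ->format('YmdHi');

    $failures  = (int) Redis::get("metrics:process_flow:failures:window:{$bucket}");
    $completed = (int) Redis::get("metrics:process_flow:completed:window:{$bucket}");
    $total = $failures + $completed;

    return $total > 0 && ($failures / $total) > $threshold;
}
```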

