Runbook Library - Ops Playbook

ADF Pipeline Delayed Execution

ADF • SEV-2

Handles delays in Azure Data Factory incremental loads

Detection Steps:

1. Check ADF portal for failed/pending activities
2. Verify data source availability
3. Monitor runtime metrics in Log Analytics
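The first detection step can be partly automated once run records have been exported from the portal or Log Analytics. The sketch below is illustrative only: the record layout (`name`, `status`, `start`), the pipeline names, and the 30-minute threshold are assumptions, not values from this runbook. The status strings match the ones ADF reports for pipeline runs.

```python
from datetime import datetime, timedelta, timezone

# Statuses that mean a run has not finished yet.
PENDING_STATUSES = {"Queued", "InProgress"}


def find_delayed_runs(runs, now, max_age=timedelta(minutes=30)):
    """Return names of runs that failed or have been pending longer than max_age.

    `runs` is a hypothetical list of dicts with keys: name, status, start.
    """
    flagged = []
    for run in runs:
        if run["status"] == "Failed":
            flagged.append(run["name"])
        elif run["status"] in PENDING_STATUSES and now - run["start"] > max_age:
            flagged.append(run["name"])
    return flagged


# Hypothetical sample data for illustration.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
runs = [
    {"name": "load_claims", "status": "InProgress", "start": now - timedelta(minutes=60)},
    {"name": "load_members", "status": "Succeeded", "start": now - timedelta(minutes=90)},
    {"name": "load_providers", "status": "Failed", "start": now - timedelta(minutes=20)},
    {"name": "load_rx", "status": "Queued", "start": now - timedelta(minutes=10)},
]
print(find_delayed_runs(runs, now))  # ['load_claims', 'load_providers']
```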

Diagnostics:

1. Query source database connection status
2. Check storage account accessibility
3. Review Data Factory execution logs with Kusto (KQL) queries in Log Analytics
4. Validate copy activity parallelization settings

Resolution:

1. Increase the DIU (Data Integration Unit) allocation on the ADF IR if CPU throttled
2. Retry failed pipeline with increased timeout
3. If data source slow: optimize source queries
4. Restart Integration Runtime if unresponsive
5. Monitor next execution for stability

Rollback:

1. Revert DIU allocation to baseline
2. Document performance metrics
3. Run validation queries to ensure data consistency
4. Resume normal monitoring

Est. MTTR: 15-25 min

Success Rate: 82% (23/28 uses)

SSIS Claims Standardization Failure

SSIS • SEV-2

Resolution for SSIS package standardization errors

Detection Steps:

1. Monitor SQL Agent job history
2. Check error code in SSISDB catalog
3. Verify memory pressure on execution server

Diagnostics:

1. Query SSISDB for execution logs and warnings
2. Check SQL Server memory and CPU
3. Validate table locks in claims database
4. Review external API response logs

Resolution:

1. Clear SSIS object cache if stale
2. Optimize transformation script buffer sizes
3. If database locked: identify the blocking session and kill it if safe
4. Restart SQL Server Agent job
5. Run full validation on a sample of 10K records
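The sample validation in the last resolution step could look roughly like the following. This is a minimal sketch: the record shape and the `state_is_standardized` rule are hypothetical stand-ins for whatever the claims standardization actually checks, and the records are assumed to be already loaded into a Python list.

```python
import random


def validate_sample(records, is_valid, sample_size=10_000, seed=0):
    """Validate a random sample of records.

    Returns (number_checked, list_of_failing_records). If there are fewer
    records than sample_size, every record is checked.
    """
    rng = random.Random(seed)  # fixed seed so a rerun checks the same sample
    if len(records) <= sample_size:
        sample = list(records)
    else:
        sample = rng.sample(records, sample_size)
    failures = [record for record in sample if not is_valid(record)]
    return len(sample), failures


def state_is_standardized(record):
    """Hypothetical rule: the state field is a two-letter uppercase code."""
    state = record.get("state", "")
    return len(state) == 2 and state.isalpha() and state.isupper()


# Hypothetical records for illustration.
records = [{"claim_id": i, "state": "TX"} for i in range(50_000)]
checked, failures = validate_sample(records, state_is_standardized)
print(f"checked {checked} records, {len(failures)} failed")
```

Fixing the seed keeps the sample reproducible, which helps when the same check is repeated after a fix.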

Rollback:

1. Verify rollback integrity with spot checks
2. Run validation queries on standardized fields
3. Monitor downstream dependent jobs
4. Alert data consumers if needed

Est. MTTR: 20-30 min

Success Rate: 75% (18/24 uses)

Databricks Validation Sweep Timeout

Databricks • SEV-3

Handles Databricks cluster scaling and query timeout issues

Detection Steps:

1. Check Databricks cluster state in console
2. Monitor job run in Workflows UI
3. Check Spark driver/executor logs

Diagnostics:

1. Analyze query execution plan in Databricks SQL
2. Check cluster auto-scaling metrics
3. Review data file sizes being scanned
4. Monitor network throughput to data lake

Resolution:

1. Scale cluster from 4 to 8 workers
2. Increase Spark shuffle partitions from 200 to 500
3. Optimize SQL query with column selection
4. Add data filter predicate if applicable
5. Retry workflow job
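When changing the shuffle partition setting, it helps to record the previous value at the moment of the change so that it can be put back exactly. The sketch below shows that pattern with a plain dict standing in for the Spark session config. The helper names are illustrative. `spark.sql.shuffle.partitions` is the standard Spark SQL setting whose default is 200.

```python
# Override from the resolution steps; values are kept as strings, as Spark does.
OVERRIDES = {"spark.sql.shuffle.partitions": "500"}


def apply_overrides(conf, overrides):
    """Apply overrides to a dict-like config and return the values they replaced."""
    previous = {key: conf.get(key) for key in overrides}
    conf.update(overrides)
    return previous


def restore(conf, previous):
    """Put saved values back; keys that were unset before are removed again."""
    for key, value in previous.items():
        if value is None:
            conf.pop(key, None)
        else:
            conf[key] = value


# A plain dict stands in for the session config in this sketch.
conf = {"spark.sql.shuffle.partitions": "200"}
saved = apply_overrides(conf, OVERRIDES)
print(conf["spark.sql.shuffle.partitions"])  # 500
restore(conf, saved)
print(conf["spark.sql.shuffle.partitions"])  # 200
```

On a live cluster the same save-then-set pattern would go through `spark.conf.get` and `spark.conf.set` instead of dict operations.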

Rollback:

1. Scale cluster back to 4 workers
2. Reset Spark config to defaults
3. Verify data consistency with checksums
4. Document performance baseline

Est. MTTR: 10-15 min

Success Rate: 88% (31/35 uses)

Python Healthcheck Runner Script Error

Python • SEV-3

Debug and fix Python healthcheck script failures

Detection Steps:

1. Check scheduled task execution log
2. Review stderr output from last run
3. Check process CPU and memory usage

Diagnostics:

1. Run script in debug mode with verbose logging
2. Validate all external API endpoints
3. Check Python environment (pip list)
4. Verify database connection string validity
5. Check disk space on execution server

Resolution:

1. Update pip packages (pip install --upgrade <package>)
2. Raise timeout values for slow endpoints
3. Add retry logic with exponential backoff
4. Validate all environment variables
5. Test locally before re-enabling schedule
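The retry step above can be sketched as a small wrapper. This is one possible shape, not the healthcheck script's actual code: the function name, the default of four attempts, the one-second base delay, and the choice of which exceptions to retry are all assumptions. The `sleep` parameter is injectable so the wrapper can be exercised without real waiting.

```python
import time


def retry_with_backoff(func, attempts=4, base_delay=1.0, factor=2.0,
                       retry_on=(TimeoutError, ConnectionError), sleep=time.sleep):
    """Call func until it succeeds, waiting base_delay * factor**n between tries.

    Only exceptions listed in retry_on are retried; anything else propagates
    immediately. The last retryable error is re-raised once attempts run out.
    """
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * factor ** attempt)


# Illustration: an endpoint check that times out twice, then succeeds.
calls = {"count": 0}


def flaky_check():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("endpoint too slow")
    return "ok"


delays = []
result = retry_with_backoff(flaky_check, sleep=delays.append)
print(result, delays)  # ok [1.0, 2.0]
```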

Rollback:

1. Revert pip packages to previous versions
2. Restore previous script version from git
3. Run validation against test dataset
4. Restore monitoring alerts

Est. MTTR: 5-15 min

Success Rate: 86% (45/52 uses)

Fortra Vendor Transfer Delay

Automation • SEV-2

Handles delayed file transfers from Fortra automation

Detection Steps:

1. Check Fortra job status in the Fortra portal
2. Monitor local staging directory for files
3. Check FTP/SFTP connection logs

Diagnostics:

1. Verify Fortra API authentication and rate limits
2. Check network connectivity to Fortra servers
3. Monitor disk space on staging area
4. Review firewall rules for Fortra IP ranges
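The disk space check on the staging area can be scripted with the standard library. In this sketch the 10% free-space threshold and the use of `"."` as the path are placeholders. Substitute the real staging directory and whatever threshold the team has agreed on.

```python
import shutil


def free_space_ok(path, min_free_fraction=0.10):
    """Return (ok, free_fraction) for the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some virtual filesystems report no capacity; treat that as not ok.
        return False, 0.0
    free_fraction = usage.free / usage.total
    return free_fraction >= min_free_fraction, free_fraction


# "." stands in for the real staging path.
ok, fraction = free_space_ok(".")
print(f"free: {fraction:.1%}, ok: {ok}")
```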

Resolution:

1. Contact Fortra support to check remote job status
2. Trigger manual re-execution if safe
3. Clear stuck locks on staging files
4. Restore network connectivity if needed
5. Resume polling for file arrival

Rollback:

1. Validate file checksums if re-transferred
2. Replay missing data through transformation
3. Monitor for duplicate processing
4. Clear staging area of failed attempts
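Checksum validation of a re-transferred file might look like the sketch below. It assumes the expected digest is available as a SHA-256 hex string. The runbook does not say which algorithm the vendor uses, so adjust the hash accordingly. The function names are illustrative.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large transfers do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Compare a file's digest with the expected value, ignoring case and whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

A file that fails this comparison should stay in the staging area and be transferred again rather than replayed through the transformation.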

Est. MTTR: 25-40 min

Success Rate: 82% (19/23 uses)