Incident Case Studies - Ops Playbook

INC-2024-0847

ADF Pipeline Cascading Failure - March 2024

SEV-1

Detection Time 3 min

Resolution Time 42 min

Affected Services:

Claims Data Mart, Reporting Dashboard, Daily SLA Reports

Case Study Details

Impact Summary:

~2.3M claims records delayed 42 minutes. Downstream reports published late. Customer dashboard 1 hour stale. SLA impact: 0.8% of daily reliability window.

Root Cause:

DPU allocation exhausted due to undocumented parallel activity. Resource contention between incremental load and ad-hoc reporting queries.

Resolution Steps:

1. Scaled IR from 2 to 4 DPUs (15 min)
2. Paused non-critical reporting queries (3 min)
3. Increased ADF activity timeout from 120 to 180 min
4. Restarted pipeline with prioritization (18 min)
5. Validated data consistency post-recovery (6 min)

Lessons Learned & Follow-up:

Implement usage tracking dashboard for ADF IR resources. Document capacity planning assumptions. Add runbook validation step for competing workloads before pipeline execution. Increase standard IR size from 2 to 3 DPUs.
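The proposed runbook validation step for competing workloads could look roughly like the sketch below. It is illustrative only: the `Workload` type, the `preflight_check` function, and the per-workload DPU figures are hypothetical and are not taken from the incident record. The idea is to admit critical workloads first and defer anything that would exceed the IR capacity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    dpus_needed: int   # hypothetical estimate, not a measured value
    critical: bool


def preflight_check(ir_capacity_dpus: int, workloads: list[Workload]) -> dict:
    """Decide which workloads may run without exceeding IR capacity.

    Critical workloads are considered first; a workload is deferred
    when the remaining capacity cannot cover its DPU estimate.
    """
    admitted, deferred = [], []
    remaining = ir_capacity_dpus
    # sorted() is stable: critical first, original order otherwise.
    for w in sorted(workloads, key=lambda w: not w.critical):
        if w.dpus_needed <= remaining:
            admitted.append(w.name)
            remaining -= w.dpus_needed
        else:
            deferred.append(w.name)
    return {"admitted": admitted, "deferred": deferred, "spare_dpus": remaining}


if __name__ == "__main__":
    planned = [
        Workload("adhoc_reporting", dpus_needed=1, critical=False),
        Workload("incremental_load", dpus_needed=2, critical=True),
    ]
    print(preflight_check(2, planned))  # old standard IR size
    print(preflight_check(3, planned))  # proposed standard IR size
```

Under these assumed estimates, a 2-DPU IR defers the ad-hoc reporting workload, while the proposed 3-DPU size admits both.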

March 05, 2026 - Duration: 0.7 hours

INC-2024-0756

Database Deadlock Under Load - February 2024

SEV-2

Detection Time 2 min

Resolution Time 18 min

Affected Services:

SSIS Standardization Job, Data Lake Writes, Read Replicas

Case Study Details

Impact Summary:

~145K records briefly stuck in a pending state. Downstream processes waited 18 minutes. Recovery validated against raw data source.

Root Cause:

Long-running standardization transaction holding locks on claims and policy tables. Concurrent update process waiting for same tables. Lock wait timeout exceeded.

Resolution Steps:

1. Identified blocking session (2 min)
2. Analyzed lock hierarchy with sp_who2 and sp_lock (5 min)
3. Safely killed long transaction after documenting state (3 min)
4. Reran standardization with updated lock hints (8 min)

Lessons Learned & Follow-up:

Add index on claims.policy_id to reduce scan time. Implement connection pooling timeout settings. Create alert for lock waits > 10 sec. Document lock hierarchy in runbook.
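As a rough illustration of the proposed lock-wait alert, the sketch below evaluates a snapshot of session rows. The row shape (`session_id`, `blocking_session_id`, `wait_time_ms`) is an assumption loosely modeled on what a blocking query might return, and the function names are hypothetical. It flags sessions blocked for longer than 10 sec and identifies the head of the blocking chain, which is the session an operator would inspect first.

```python
LOCK_WAIT_ALERT_MS = 10_000  # the "> 10 sec" threshold from the follow-up item


def find_root_blockers(sessions: list[dict]) -> set[int]:
    """Return sessions that block others but are not blocked themselves."""
    blocked_by = {s["session_id"]: s["blocking_session_id"] for s in sessions}
    blockers = {b for b in blocked_by.values() if b}
    # A blocker missing from the snapshot, or with no blocker of its own,
    # is treated as the head of a chain.
    return {b for b in blockers if not blocked_by.get(b)}


def lock_wait_alerts(sessions: list[dict],
                     threshold_ms: int = LOCK_WAIT_ALERT_MS) -> list[int]:
    """Return ids of blocked sessions whose wait exceeds the threshold."""
    return sorted(
        s["session_id"]
        for s in sessions
        if s["blocking_session_id"] and s["wait_time_ms"] > threshold_ms
    )


if __name__ == "__main__":
    # Hypothetical snapshot: 51 holds the locks, 63 waits on 51, 70 waits on 63.
    snapshot = [
        {"session_id": 51, "blocking_session_id": 0, "wait_time_ms": 0},
        {"session_id": 63, "blocking_session_id": 51, "wait_time_ms": 14_000},
        {"session_id": 70, "blocking_session_id": 63, "wait_time_ms": 4_000},
    ]
    print(find_root_blockers(snapshot))
    print(lock_wait_alerts(snapshot))
```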

February 13, 2026 - Duration: 0.3 hours

INC-2024-0624

Databricks Cluster Auto-Scale Failure - January 2024

SEV-2

Detection Time 5 min

Resolution Time 32 min

Affected Services:

Data Validation Workflow, ML Feature Engineering, Ad-hoc Analytics

Case Study Details

Impact Summary:

~1.2M validation records pending 32 minutes. ML pipeline features delayed. Analytics queries timed out.

Root Cause:

Auto-scale policy hit node quota limits. Cluster could not add workers. Pending tasks queued indefinitely.

Resolution Steps:

1. Detected quota issue in Databricks API logs (4 min)
2. Contacted cloud ops for quota increase approval (8 min)
3. Manually scaled to max allowed workers as interim (5 min)
4. Once quota increased, resumed auto-scale policy (15 min)

Lessons Learned & Follow-up:

Implement threshold-based pre-scaling before peak hours. Set up Databricks quota alerts. Document cloud ops escalation process. Test multi-region failover strategy.
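The quota alert could be approximated by a check like the one below. This is a sketch under stated assumptions: the function name, the 80% warning ratio, and the example worker counts are all hypothetical, and the inputs would have to come from whatever source reports the cluster size and the node quota. It flags two conditions: an auto-scale maximum that the quota can never satisfy, and current usage approaching the quota.

```python
def quota_status(current_workers: int, autoscale_max: int,
                 node_quota: int, warn_ratio: float = 0.8) -> dict:
    """Classify how close a cluster is to its node quota."""
    if node_quota <= 0:
        raise ValueError("node_quota must be positive")
    usage = current_workers / node_quota
    return {
        # The cluster can never grow past the quota, whatever the policy says.
        "effective_max_workers": min(autoscale_max, node_quota),
        # True means the auto-scale policy asks for more than the quota allows.
        "autoscale_blocked_by_quota": autoscale_max > node_quota,
        # True once usage reaches the warning ratio.
        "alert": usage >= warn_ratio,
    }


if __name__ == "__main__":
    # Hypothetical numbers: policy allows 40 workers, quota allows only 20.
    print(quota_status(current_workers=18, autoscale_max=40, node_quota=20))
    print(quota_status(current_workers=8, autoscale_max=16, node_quota=20))
```

Checking the policy maximum against the quota ahead of peak hours would surface the mismatch before pending tasks start to queue.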

January 19, 2026 - Duration: 0.5 hours

INC-2024-0421

Vendor Transfer SFTP Timeout - December 2023

SEV-3

Detection Time 10 min

Resolution Time 67 min

Affected Services:

Vendor File Ingestion, Data Lake Daily Batch, Reporting Refresh

Case Study Details

Impact Summary:

Vendor data arrival delayed 67 minutes. Daily batch completion pushed back 1.5 hours. Report SLA impact minimal.

Root Cause:

Network latency to vendor SFTP server spiked. Default timeout of 120 sec insufficient for 1.2 GB file transfer.

Resolution Steps:

1. Observed transfer timeouts in logs (8 min)
2. Verified vendor connectivity and file size (12 min)
3. Updated timeout in automation config from 120 to 300 sec (5 min)
4. Manually resumed transfer with new timeout (42 min)
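The arithmetic behind the timeout change can be sketched as follows, assuming 1.2 GB means 1200 MB (decimal units). A 120 sec timeout only succeeds if the link sustains 10 MB/s, whereas 300 sec tolerates a drop to 4 MB/s. The helper names, the 6 MB/s worst-case throughput, and the 1.5 safety factor are illustrative assumptions, not values from the incident.

```python
import math


def required_throughput_mb_s(file_size_mb: float, timeout_s: float) -> float:
    """Minimum sustained throughput needed to finish before the timeout."""
    return file_size_mb / timeout_s


def timeout_for(file_size_mb: float, min_throughput_mb_s: float,
                safety_factor: float = 1.5) -> int:
    """Timeout in seconds sized from file size and worst expected throughput."""
    return math.ceil(file_size_mb / min_throughput_mb_s * safety_factor)


if __name__ == "__main__":
    size_mb = 1200  # 1.2 GB, assuming decimal units
    print(required_throughput_mb_s(size_mb, 120))  # needed under the old timeout
    print(required_throughput_mb_s(size_mb, 300))  # needed under the new timeout
    # Hypothetical worst-case throughput of 6 MB/s with a 1.5x margin.
    print(timeout_for(size_mb, min_throughput_mb_s=6.0))
```

Deriving the timeout from file size in this way avoids having to retune a fixed value each time the vendor file grows.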

Lessons Learned & Follow-up:

Implement parallel chunk transfer for large files. Add progress monitoring with streaming logs. Coordinate with vendor on off-peak transfer windows. Monitor MTU settings for network optimization.

November 30, 2025 - Duration: 1.1 hours

INC-2024-0298

Python Healthcheck Script Memory Leak - November 2023

SEV-3

Detection Time 8 min

Resolution Time 23 min

Affected Services:

Healthcheck Alerts, Infrastructure Monitoring, Alert Dashboard

Case Study Details

Impact Summary:

Healthcheck missed 4 cycles. Alerts delayed ~23 minutes. No customer impact (monitoring only).

Root Cause:

Long-running healthcheck accumulated connection objects without cleanup. Memory usage crept up to 2 GB and the script was OOM-killed.

Resolution Steps:

1. Identified memory growth pattern (6 min)
2. Analyzed script for unclosed connections (9 min)
3. Applied context manager fix and tested locally (5 min)
4. Deployed fixed version and restarted task (3 min)

Lessons Learned & Follow-up:

Implement memory profiling in CI/CD. Add health metrics to script (memory, open connections). Set up memory threshold alerts. Use Python context managers as default pattern.
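The context-manager pattern recommended here might look like the minimal sketch below. The actual healthcheck script and its connection type are not shown in this case study, so an in-memory `sqlite3` database stands in as an assumption, and `check_database` and `run_cycles` are hypothetical names. The point is that `contextlib.closing` releases the connection on every cycle, even if the probe raises.

```python
import sqlite3
from contextlib import closing


def check_database(dsn: str = ":memory:") -> bool:
    """Run one healthcheck probe and always release the connection."""
    # closing() guarantees close() is called even if the probe raises,
    # so connection objects cannot accumulate across cycles.
    with closing(sqlite3.connect(dsn)) as conn:
        with closing(conn.cursor()) as cur:
            cur.execute("SELECT 1")
            return cur.fetchone() == (1,)


def run_cycles(n: int) -> int:
    """Run n probes; each opens and closes its own connection."""
    return sum(check_database() for _ in range(n))


if __name__ == "__main__":
    print(run_cycles(4))  # number of successful probes
```

Note that `with sqlite3.connect(...)` on its own only manages the transaction and does not close the connection, which is why `closing()` is used here.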

November 05, 2025 - Duration: 0.4 hours