This is a live output from Pipeline Investigator, an automated root cause analysis
tool I built for diagnosing broken data pipelines. Instead of spending 30-60 minutes manually
searching through logs, staging tables, and production data, this tool runs a battery of
diagnostic checks and delivers a full diagnosis in under 10 seconds.
Important: This report was generated against a fabricated demo database
("Meridian Goods"), a simulated mid-size e-commerce company with 10,000 customers and 50,000+ orders.
Four realistic failure scenarios were injected to showcase the tool's detection capabilities:
A source system renaming a column overnight (schema drift)
A partner integration sending orders without customer IDs (null flood)
A network timeout causing triple-ingestion of the same batch (duplicate surge)
A payments provider sending dollar amounts as text like "$142.50" (silent data corruption)
The tool is built entirely in Python with zero external dependencies (stdlib only).
Everything below was generated automatically — the root cause classifications, the
diagnostic evidence, and the fix recommendations.
Summary: 2 CRITICAL · 2 HIGH · 4 issues found · ~175 min saved vs. manual debugging
This report was generated automatically by analyzing pipeline run history, staging data, and production tables.
We found 4 issues that need attention. Each issue below includes a plain-English explanation,
the technical root cause, diagnostic evidence, and specific steps to fix it.
[CRITICAL] Issue #1: Column Mapping Broke
Pipeline: daily_orders_etl · Stage: Data Transformation
Needs immediate attention — data is being lost or corrupted
What Happened
The system that sends us order data changed its format without warning. Our pipeline expected a column called "total" but it was renamed to something else. As a result, every order in this batch came in without a dollar amount.
Technical Root Cause
Source system schema changed (column renamed/added). The ETL column mapping broke, causing NULL values in the 'total' field for all ingested rows. The transform stage crashed on the missing expected column name.
What We Checked
5 automated diagnostic checks were run against the pipeline data. Click any check to see the raw evidence.
FAIL
Missing Values Check
Scanned incoming data for empty fields that should never be empty.
350 NULL values found across 2 columns. Worst: null_total (200 NULLs out of 925 rows).
Missing value breakdown: null_customer_id: 150, null_total: 200
Total incoming rows: 925
FAIL
Duplicate Records Check
Checked whether the same order appeared more than once in incoming data.
20 order IDs appear multiple times. Worst: 3 copies of a single order. 40 excess rows total.
Orders with duplicates: 20
Most copies of one order: 3
Extra rows: 40
Example order: ORD-DUP-0100
Retry timestamps: 02:00:00, 02:00:30, 02:01:05
Caused by retry: Yes
FAIL
Data Format Check
Verified that dollar amounts are stored as numbers, not formatted text.
20 rows have non-numeric "total" values that would silently cast to $0.00.
Bad rows: 20
Example values: $1,229.34 · $1,272.63 · $1,283.63 · $1,336.73 · $1,354.71
Pattern: Currency symbols, commas, or text prefixes in numeric field
FAIL
Column Mapping Check
Checked whether all expected data columns are present and populated.
200 rows have NULL totals — consistent with a column mapping failure from schema drift.
Rows missing totals: 200
Pipeline error: KeyError: 'total' — column not found in source payload. Expected: [order_id, customer_id, order_date, status, payment_method, total]. Received: [order_id, customer_id, order_date, status, payment_method, order_total, currency].
Likely cause: Source schema change — column renamed or added
FAIL
Production Impact Check
Checked if bad data made it into the live production database.
175 production orders show $0.00. Estimated revenue undercount: $25,158.
How to Fix This
Add schema validation to the ingest stage — compare incoming columns against expected schema before processing.
Implement a schema registry that tracks source system schemas and alerts on drift.
Add a pre-flight check: compare source table columns to expected column list before ingestion.
Set up source system change notifications with the POS team to catch upstream changes before they break the pipeline.
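The first two recommendations can be sketched as a single pre-flight function. This is a minimal illustration, not the tool's actual internals: `EXPECTED_COLUMNS` and the row shape (a list of dicts) are assumptions based on the column lists shown in the evidence above.

```python
# Hypothetical pre-flight schema check: compare incoming columns to the
# expected schema BEFORE processing, so drift surfaces as an alert
# instead of a mid-transform KeyError.
EXPECTED_COLUMNS = {"order_id", "customer_id", "order_date",
                    "status", "payment_method", "total"}

def check_schema(rows):
    """Return (ok, missing, unexpected) for a batch of row dicts."""
    if not rows:
        return True, set(), set()
    incoming = set(rows[0].keys())
    missing = EXPECTED_COLUMNS - incoming      # columns we need but lost
    unexpected = incoming - EXPECTED_COLUMNS   # columns we did not expect
    return not missing, missing, unexpected

# With the drifted payload from the report, the check flags the rename:
rows = [{"order_id": 1, "customer_id": 7, "order_date": "2024-01-01",
         "status": "paid", "payment_method": "card",
         "order_total": 142.50, "currency": "USD"}]
ok, missing, unexpected = check_schema(rows)
# ok is False; missing == {"total"}; unexpected == {"order_total", "currency"}
```

Running this before ingestion turns the KeyError crash into a clear "total renamed to order_total" alert.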
This automated diagnosis saves roughly 45 minutes compared to manual investigation.
[HIGH] Issue #2: Missing Customer Data
Pipeline: daily_orders_etl · Stage: Database Loading
Should be fixed soon — the pipeline is blocked
What Happened
One of our data sources started sending orders without customer information attached. Our validation checks didn't catch it because they only look at dollar amounts, not customer IDs. The pipeline failed when it tried to save these incomplete orders to the production database.
Technical Root Cause
Upstream data source (wholesale channel) began sending orders without customer_id. Ingestion and validation stages did not catch the NULLs. The load stage failed on a foreign key constraint when attempting to insert into the production orders table.
How to Fix This
Add NOT NULL validation for customer_id in the validate stage — currently only checks 'total'.
Implement a data quality gate: reject batches where >5% of critical fields are NULL.
Add source-specific validation rules — wholesale channel should map to a default wholesale customer if customer_id is missing.
Set up alerting on NULL rate spikes per column per source system.
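The >5% quality gate above can be sketched in a few lines. The field list, threshold, and row shape are illustrative assumptions, not the tool's implementation:

```python
# Minimal sketch of a NULL-rate quality gate: reject a batch if any
# critical field is NULL in more than max_rate of its rows.
CRITICAL_FIELDS = ("customer_id", "total")

def null_gate(rows, fields=CRITICAL_FIELDS, max_rate=0.05):
    """Return {field: null_rate} for fields over the limit; empty dict = pass."""
    violations = {}
    for field in fields:
        nulls = sum(1 for r in rows if r.get(field) is None)
        rate = nulls / len(rows) if rows else 0.0
        if rate > max_rate:
            violations[field] = rate
    return violations

# Mirroring this incident: 150 of 925 rows missing customer_id (~16%)
batch = ([{"customer_id": None, "total": 10.0}] * 150
         + [{"customer_id": 1, "total": 10.0}] * 775)
violations = null_gate(batch)
# customer_id exceeds 5%, so the gate rejects the batch before the load stage
```

Rejecting the batch here fails fast with a clear NULL-rate message instead of a foreign key error deep in the load stage.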
This automated diagnosis saves roughly 30 minutes compared to manual investigation.
[HIGH] Issue #3: Duplicate Orders
Pipeline: daily_orders_etl · Stage: Data Validation
Should be fixed soon — the pipeline is blocked
What Happened
A network hiccup caused the same batch of orders to be sent multiple times. Our system should have caught the duplicates, but a bug in the deduplication logic let them slip through because each copy had a slightly different timestamp.
Technical Root Cause
Network timeout triggered retry mechanism, causing the ingestion batch to be sent 3 times. The dedup logic failed because it keys on (order_id, order_date) but retried rows have different ingested_at timestamps. 40 excess rows in staging.
How to Fix This
Fix the dedup logic: key on order_id alone (not order_id + order_date), or use a hash of the full row.
Add idempotency keys to the ingestion layer — reject rows where order_id already exists in staging for the same run_date.
Implement a batch receipt log: if a batch ID was already processed, skip the retry.
Add row count validation: if today's batch is >2x the trailing 7-day average, pause and alert.
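The dedup fix in the first recommendation can be sketched as follows. The row shape is an assumption for illustration; the key point is that keying on `order_id` alone makes differing `ingested_at` timestamps irrelevant:

```python
# Sketch of the corrected dedup: key on order_id alone and keep the
# first occurrence, so retried batches collapse to one row per order
# even though each retry carries a different ingested_at timestamp.
def dedup_by_order_id(rows):
    seen = set()
    unique = []
    for row in rows:
        if row["order_id"] not in seen:
            seen.add(row["order_id"])
            unique.append(row)
    return unique

# The retry scenario from the evidence: three copies, three timestamps
retried = [
    {"order_id": "ORD-DUP-0100", "ingested_at": "02:00:00"},
    {"order_id": "ORD-DUP-0100", "ingested_at": "02:00:30"},
    {"order_id": "ORD-DUP-0100", "ingested_at": "02:01:05"},
    {"order_id": "ORD-0101", "ingested_at": "02:00:00"},
]
clean = dedup_by_order_id(retried)
# Three copies of ORD-DUP-0100 collapse to one; 2 unique rows remain
```

The old logic keyed on (order_id, order_date) and was defeated only because the retry path also varied the timestamp; keying on the business identifier alone removes that failure mode.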
This automated diagnosis saves roughly 40 minutes compared to manual investigation.
[CRITICAL] Issue #4: Data Format Corruption
Pipeline: daily_orders_etl · Stage: After Loading (Silent Failure)
Needs immediate attention — data is being lost or corrupted
What Happened
A payment provider started sending dollar amounts as text (like "$142.50" or "USD 89.00") instead of plain numbers. The database silently converted these to $0.00 without raising any errors. The pipeline looked completely healthy, but revenue data was wrong. This is the most dangerous type of failure because nothing appears broken.
Technical Root Cause
New payments provider sends totals as formatted strings with currency symbols and commas ($142.50, USD 89.00, 1,234.56). SQLite's CAST(total AS REAL) silently converts these to 0.0, causing revenue data loss. Pipeline reports success — this is a silent data corruption issue.
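The silent cast is easy to reproduce with Python's stdlib `sqlite3` module. CAST in SQLite never raises on non-numeric text; a string with no leading numeric prefix simply becomes 0.0:

```python
# Reproducing the silent failure: SQLite's CAST(... AS REAL) returns
# 0.0 for strings with no leading numeric prefix, with no error raised.
import sqlite3

conn = sqlite3.connect(":memory:")
for text in ("$142.50", "USD 89.00"):
    (value,) = conn.execute("SELECT CAST(? AS REAL)", (text,)).fetchone()
    print(f"{text!r} -> {value}")  # both print 0.0
```

This is why the pipeline reported success: from the database's point of view, every cast "worked".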
How to Fix This
Add type validation in the validate stage: attempt CAST and reject rows that produce 0.0 from non-zero input strings.
Implement a cleaning function: strip currency symbols ($, USD, €), remove commas, then cast. Apply before the transform stage.
Add a revenue sanity check post-load: compare today's loaded revenue to trailing average. Alert if deviation >30%.
Coordinate with the payments provider to enforce numeric-only format in their API contract.
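The cleaning step in the second recommendation might look like this. The regex and the set of accepted currency markers ($, €, USD, thousands commas) are assumptions based on the example values in this report:

```python
# Sketch of a cleaning function: strip currency markers and thousands
# separators, then cast. Unlike SQLite's CAST, float() raises on
# garbage, so bad rows are rejected instead of loaded as $0.00.
import re

def clean_total(raw):
    """Parse '$1,234.56', 'USD 89.00', or '142.50' into a float.

    Raises ValueError for anything that is not a recognizable amount.
    """
    cleaned = re.sub(r"[$€,]|\bUSD\b", "", str(raw)).strip()
    return float(cleaned)  # ValueError propagates for unparseable input

clean_total("$142.50")    # 142.5
clean_total("USD 89.00")  # 89.0
clean_total("1,234.56")   # 1234.56
```

Applying this before the transform stage converts the silent $0.00 corruption into a loud, per-row rejection that shows up in the validation report.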
This automated diagnosis saves roughly 60 minutes compared to manual investigation.