Portfolio Demo

About This Report

This is a live output from Pipeline Investigator, an automated root cause analysis tool I built for diagnosing broken data pipelines. Instead of spending 30-60 minutes manually searching through logs, staging tables, and production data, this tool runs a battery of diagnostic checks and delivers a full diagnosis in under 10 seconds.

Important: This report was generated against a fabricated demo database ("Meridian Goods"), a simulated mid-size e-commerce company with 10,000 customers and 50,000+ orders. Four realistic failure scenarios were injected to showcase the tool's detection capabilities: a source schema change that broke column mapping, orders arriving without customer IDs, duplicate orders from a network retry, and silent data corruption from currency-formatted strings.

The tool is built entirely in Python with zero external dependencies (stdlib only). Everything below was generated automatically — the root cause classifications, the diagnostic evidence, and the fix recommendations.

View source code on GitHub →

Pipeline Investigation Report

Pipeline Investigator v1.0 — 2026-03-14
2 CRITICAL · 2 HIGH · 4 issues found · ~175 min saved vs. manual debugging

This report was generated automatically by analyzing pipeline run history, staging data, and production tables. We found 4 issues that need attention. Each issue below includes a plain-English explanation, the technical root cause, diagnostic evidence, and specific steps to fix it.

CRITICAL Issue #1: Column Mapping Broke
Pipeline: daily_orders_etl · Stage: Data Transformation
Needs immediate attention — data is being lost or corrupted

What Happened

The system that sends us order data changed its format without warning. Our pipeline expected a column called "total", but the source renamed it to "order_total". As a result, every order in this batch came in without a dollar amount.

Technical Root Cause

Source system schema changed: the 'total' column was renamed to 'order_total' and a 'currency' column was added. The ETL column mapping broke, causing NULL values in the 'total' field for all ingested rows, and the transform stage crashed on the missing column name.

What We Checked

5 automated diagnostic checks were run against the pipeline data. The raw evidence for each check is shown below.

FAIL · Missing Values Check
Scanned incoming data for empty fields that should never be empty.
350 NULL values found across 2 columns. Worst: null_total (200 NULLs out of 925 rows).
Missing value breakdown: null_customer_id: 150, null_total: 200
Total incoming rows: 925
FAIL · Duplicate Records Check
Checked whether the same order appeared more than once in incoming data.
20 order IDs appear multiple times. Worst: 3 copies of a single order. 40 excess rows total.
Orders with duplicates: 20
Most copies of one order: 3
Extra rows: 40
Example order: ORD-DUP-0100
Retry timestamps: 02:00:00, 02:00:30, 02:01:05
Caused by retry: Yes
FAIL · Data Format Check
Verified that dollar amounts are stored as numbers, not formatted text.
20 rows have non-numeric "total" values that would silently cast to $0.00.
Bad rows: 20
Example values: $1,229.34 · $1,272.63 · $1,283.63 · $1,336.73 · $1,354.71
Pattern: Currency symbols, commas, or text prefixes in numeric field
FAIL · Column Mapping Check
Checked whether all expected data columns are present and populated.
200 rows have NULL totals — consistent with a column mapping failure from schema drift.
Rows missing totals: 200
Pipeline error: KeyError: 'total' — column not found in source payload. Expected: [order_id, customer_id, order_date, status, payment_method, total]. Received: [order_id, customer_id, order_date, status, payment_method, order_total, currency].
Likely cause: Source schema change — column renamed or added
FAIL · Production Impact Check
Checked if bad data made it into the live production database.
175 production orders show $0.00. Estimated revenue undercount: $25,158.
$0.00 orders: 175
Avg healthy order: $143.76
Revenue impact: $25,158.00 (175 × $143.76)

How to Fix This

  1. Add a pre-flight schema check to the ingest stage: compare incoming columns against the expected column list before processing (a minimal sketch follows this list).
  2. Implement a schema registry that tracks source system schemas and alerts on drift.
  3. Set up source system change notifications with the POS team to catch upstream changes before they break the pipeline.
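
A minimal sketch of the pre-flight check from step 1, assuming incoming batches arrive as lists of dicts; the expected column set is hard-coded here for illustration, where the real pipeline would load it from config or a schema registry:

    EXPECTED_COLUMNS = {
        "order_id", "customer_id", "order_date",
        "status", "payment_method", "total",
    }

    def preflight_schema_check(rows: list[dict]) -> None:
        """Fail fast if the batch's columns drift from the expected schema."""
        if not rows:
            return
        received = set(rows[0])
        missing = EXPECTED_COLUMNS - received
        unexpected = received - EXPECTED_COLUMNS
        if missing or unexpected:
            # A renamed column shows up as one missing plus one unexpected entry,
            # e.g. missing ['total'], unexpected ['currency', 'order_total'].
            raise ValueError(
                f"Schema drift: missing {sorted(missing)}, unexpected {sorted(unexpected)}"
            )

Raising at ingest turns a silent NULL cascade into a loud, attributable failure before any rows reach the transform stage.
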
This automated diagnosis saves roughly 45 minutes compared to manual investigation.

HIGH Issue #2: Missing Customer Data
Pipeline: daily_orders_etl · Stage: Database Loading
Should be fixed soon — the pipeline is blocked

What Happened

One of our data sources started sending orders without customer information attached. Our validation checks didn't catch it because they only look at dollar amounts, not customer IDs. The pipeline failed when it tried to save these incomplete orders to the production database.

Technical Root Cause

Upstream data source (wholesale channel) began sending orders without customer_id. Ingestion and validation stages did not catch the NULLs. The load stage failed on a foreign key constraint when attempting to insert into the production orders table.

How to Fix This

  1. Add NOT NULL validation for customer_id in the validate stage — currently only checks 'total'.
  2. Implement a data quality gate: reject batches where >5% of critical fields are NULL (see the sketch after this list).
  3. Add source-specific validation rules — wholesale channel should map to a default wholesale customer if customer_id is missing.
  4. Set up alerting on NULL rate spikes per column per source system.
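
A minimal sketch of the quality gate from step 2, again assuming rows arrive as dicts; the 5% threshold and critical-field list come from the recommendation above, while the function itself is illustrative:

    CRITICAL_FIELDS = ("customer_id", "total")
    MAX_NULL_RATE = 0.05

    def quality_gate(rows: list[dict]) -> None:
        """Reject the batch if any critical field exceeds the allowed NULL rate."""
        if not rows:
            return
        for field in CRITICAL_FIELDS:
            null_count = sum(1 for row in rows if row.get(field) is None)
            rate = null_count / len(rows)
            if rate > MAX_NULL_RATE:
                raise ValueError(
                    f"Batch rejected: {field} is NULL in {rate:.1%} of rows "
                    f"({null_count}/{len(rows)}), above the {MAX_NULL_RATE:.0%} limit"
                )
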
This automated diagnosis saves roughly 30 minutes compared to manual investigation.

HIGH Issue #3: Duplicate Orders
Pipeline: daily_orders_etl · Stage: Data Validation
Should be fixed soon — the pipeline is blocked

What Happened

A network hiccup caused the same batch of orders to be sent multiple times. Our system should have caught the duplicates, but a bug in the deduplication logic let them slip through because each copy had a slightly different timestamp.

Technical Root Cause

A network timeout triggered the retry mechanism, causing the ingestion batch to be sent 3 times. The dedup logic failed because it keys on (order_id, order_date) while retried rows arrive with different ingested_at timestamps, leaving 40 excess rows in staging.

How to Fix This

  1. Fix the dedup logic: key on order_id alone (not order_id + order_date), or use a hash of the full row (a sketch follows this list).
  2. Add idempotency keys to the ingestion layer — reject rows where order_id already exists in staging for the same run_date.
  3. Implement a batch receipt log: if a batch ID was already processed, skip the retry.
  4. Add row count validation: if today's batch is >2x the trailing 7-day average, pause and alert.
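
A sketch of the dedup fix from step 1, keeping the earliest-ingested copy of each order; dict-shaped staging rows and the ingested_at field name are assumptions for illustration:

    def dedup_orders(rows: list[dict]) -> list[dict]:
        """Keep one row per order_id, preferring the earliest ingested copy.

        The broken logic keyed on (order_id, order_date), so retried rows
        carrying fresh ingested_at timestamps slipped through as new orders.
        """
        kept: dict[str, dict] = {}
        for row in rows:
            order_id = row["order_id"]
            first_seen = kept.get(order_id)
            if first_seen is None or row["ingested_at"] < first_seen["ingested_at"]:
                kept[order_id] = row
        return list(kept.values())
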
This automated diagnosis saves roughly 40 minutes compared to manual investigation.

CRITICAL Issue #4: Data Format Corruption
Pipeline: daily_orders_etl · Stage: After Loading (Silent Failure)
Needs immediate attention — data is being lost or corrupted

What Happened

A payment provider started sending dollar amounts as text (like "$142.50" or "USD 89.00") instead of plain numbers. The database silently converted these to $0.00 without raising any errors. The pipeline looked completely healthy, but revenue data was wrong. This is the most dangerous type of failure because nothing appears broken.

Technical Root Cause

A new payments provider sends totals as formatted strings with currency symbols and commas ("$142.50", "USD 89.00", "1,234.56"). SQLite's CAST(total AS REAL) silently coerces these to 0.0, or to a truncated number for comma-separated values, causing revenue data loss. The pipeline reports success — this is a silent data corruption issue.
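
The failure is easy to reproduce: SQLite's CAST never raises on malformed text, it just returns whatever numeric prefix it can parse, which is 0.0 when the string starts with a non-digit:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    for value in ("$142.50", "USD 89.00", "1,234.56"):
        (cast,) = conn.execute("SELECT CAST(? AS REAL)", (value,)).fetchone()
        print(f"{value!r} -> {cast}")
    # '$142.50'   -> 0.0  (leading '$' stops parsing immediately)
    # 'USD 89.00' -> 0.0  (leading text stops parsing immediately)
    # '1,234.56'  -> 1.0  (parsing stops at the comma)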

How to Fix This

  1. Add type validation in the validate stage: attempt CAST and reject rows that produce 0.0 from non-zero input strings.
  2. Implement a cleaning function: strip currency symbols ($, USD, €), remove commas, then cast; apply it before the transform stage (see the sketch after this list).
  3. Add a revenue sanity check post-load: compare today's loaded revenue to trailing average. Alert if deviation >30%.
  4. Coordinate with the payments provider to enforce numeric-only format in their API contract.
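
A sketch of the cleaning function from step 2; the markers stripped here match the examples in this report and would need to track whatever the provider's API contract actually allows:

    import re

    # Markers observed in the bad rows: "$142.50", "USD 89.00", "1,234.56"
    _CURRENCY_MARKERS = re.compile(r"[$€]|USD|,|\s")

    def clean_total(raw: str) -> float:
        """Strip currency symbols, codes, and separators, then cast.

        Raises ValueError on anything still non-numeric, so bad formats
        fail loudly in the validate stage instead of loading as $0.00.
        """
        return float(_CURRENCY_MARKERS.sub("", str(raw)))
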
This automated diagnosis saves roughly 60 minutes compared to manual investigation.