Data entry + migration automation.
AI extracts structured fields from PDFs, scans, CSVs, and legacy exports with per-field confidence scoring. Schema map normalizes to target system. Three-way validation routes records: clean (high-confidence direct load), enrich (gap-fill from external sources), or exception (low-confidence reviewer UI with AI suggestion). Idempotent atomic writes with read-back reconciliation catch silent failures. 50-200 records/hour reviewer throughput vs 10-15 manual.
A real data migration pipeline has four jobs.
Most data migrations are a contractor team manually keying records from one system into another over six months, an Excel macro that breaks every time the source format shifts, and a cutover weekend that goes 18 hours past schedule because the target system rejected 15% of records and nobody noticed until production traffic started. The job of a real migration pipeline is to industrialize the repetitive work (extraction, schema mapping, normalization, loading) while keeping human judgment where it actually adds value (ambiguous source data, conflicting records, business-rule edge cases).
Four jobs. One: extract structured data from raw sources — AI parses fields with confidence scores from PDFs, CSVs, scanned paper, legacy exports, third-party feeds. Multi-format normalization (dates as '5/12/24' or 'May 12 2024' or '20240512' all become canonical). Two: map to target schema with field-level transformations. Phone numbers normalize to E.164, addresses verify against postal databases, currency converts, enum values translate. Mapping table version-controlled. Three: route by confidence. Clean records (60-75% typical) flow direct to load. Enrich-eligible records get gap-filled from external sources with lineage tracking. Exception records route to reviewer UI with AI suggestion — reviewer validates instead of data-entering from scratch. Four: idempotent atomic load with read-back reconciliation. Re-runs don't duplicate; partial writes never persist; source-to-target counts and aggregates match within tolerance. Rejections feed pattern analysis for upstream fixes.
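As a concrete example of job one's normalization step, here is a minimal sketch of collapsing mixed date formats into one canonical value; the format list and the ISO-8601 target are assumptions to adapt per source system, not a prescription:

```python
from datetime import datetime

# Candidate formats seen in the source exports (an assumed list; extend per source system).
DATE_FORMATS = ["%m/%d/%y", "%m/%d/%Y", "%B %d %Y", "%b %d %Y", "%Y%m%d", "%Y-%m-%d"]

def normalize_date(raw: str) -> str | None:
    """Return the canonical ISO-8601 date, or None if no known format matches."""
    cleaned = raw.strip().replace(",", "")
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # unparseable dates route to the exception lane, not to a guess

# '5/12/24', 'May 12 2024', and '20240512' all become '2024-05-12'
assert normalize_date("5/12/24") == normalize_date("May 12 2024") == normalize_date("20240512") == "2024-05-12"
```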
Done right, your migration completes 3-5x faster than manual data entry, your reviewer throughput climbs from 10-15 records/hour to 50-200, and your cutover weekend is a verification step rather than a panic. Done wrong, you ship an aggressive auto-load that silently admits AI hallucinations into the target system, your reconciliation catches the gap two weeks later, and you're chasing data quality issues across hundreds of customer records.
Contractor team + spreadsheet macros
8-person contractor team migrating 240K customer records from legacy CRM to Salesforce over 6 months. Manual data entry from PDF exports + Excel uploads + paper scans. Custom macros for the 'easy' records; manual entry for the rest. Cost: $480K contractor fees. Schedule: 6 months → 9 months as scope expanded. Quality: 12% of records had errors discovered post-migration; 4% required manual rework. Cutover weekend: 22 hours of crisis mode when target system rejected 15K records nobody anticipated. Trust in migrated data was compromised for 18 months.
AI extract + schema map + reviewer assist
Same 240K records. AI extracts with per-field confidence scoring; schema map normalizes; 70% of records flow clean to direct load (170K records, processed at 12K/day). 22% flow to enrichment with external sources filling missing fields (53K records). 8% flow to exception lane (19K records); 3-person reviewer team handles them at 80 records/hour with AI suggestion. Total time: 8 weeks. Cost: $84K all-in (build + tooling + reviewer time). Cutover weekend: 90 minutes of verification because reconciliation already showed 99.4% match. Production-quality data on day one.
Who this is for, who it isn't.
Migration automation pays back fastest for businesses with 50K+ records to migrate, mixed source formats (some clean, some messy), and a clear target system with documented schema. Below 10K records, manual is often faster than build-and-validate. Above 1M records, the math compounds dramatically.
Build this if any of these are true.
- You're migrating 50K+ records and your manual entry team would take 4+ months. That's the time being recovered.
- Your source data spans multiple formats — PDFs, CSVs, paper scans, legacy exports. Multi-format extraction is exactly where AI excels.
- You have a clear target schema and the target system supports batch APIs or bulk loads. Without target API support, the build complexity escalates.
- You have data ops or analytics engineering capacity. The schema mapping design is real work; without it, you're building data quality issues into the target.
- You have an analytics or operations leader who can lead post-migration data quality monitoring. Without ownership, post-migration drift compounds.
Skip or wait if any of these are true.
- You have under 10K records to migrate. Manual data entry by a small team is often faster than the build-and-validate cycle.
- Your source and target both support clean export/import. Target system's native import covers most cases without custom build.
- Your target system has strict business rules that change frequently. Migration automation locks in extraction logic; rapidly evolving rules invalidate it faster than the build pays back.
- Your target schema is undocumented or actively evolving. Migration into a moving target produces records that fail business rules every week as rules change.
- You're hoping AI handles ambiguous source data without human review. It can't reliably; ambiguous source records require human judgment. Plan for the exception lane explicitly.
What this saves, by the numbers.
The savings come from three sources, in order. Data ops time recovered (the largest line at scale — manual data entry is the most labor-intensive operation in any migration). Faster project completion compresses business risk and unlocks downstream value. Higher data quality reduces post-migration cleanup costs. Most teams see 1.5–2× the conservative numbers below by year two for ongoing data-entry replacement (vs one-time migration).
The architecture, end to end.
Migration architecture has a single trunk (source ingest, AI extraction, schema map) feeding three validation lanes. Clean handles high-confidence complete records (60-75% of typical volume) with daily 5% sample audit. Enrich fills gaps from external sources (Clearbit, USPS, D&B) with per-field lineage tracking and cost caps. Exception routes low-confidence and ambiguous records to reviewer UI with AI suggestion (50-200 records/hr vs 10-15 manual). All three lanes converge at idempotent atomic load. Loaded records go through read-back reconciliation; rejected records loop back through repair with upstream-fix pattern analysis.
- Source ingest: PDF, CSV, OCR, legacy export, third-party feed. Single trigger, multiple sources.
- AI extraction: per-field confidence scoring. Low-confidence fields flag for review. Multi-format normalization.
- Schema map: source-to-target mapping table. Phone → E.164. Address → verified. Currency → base.
- Clean lane: 60-75% of records when source quality is decent. Skips enrichment and exception, direct to load.
- Sample audit: 5% sample keeps the clean lane honest. Threshold calibrated against actual accuracy data, not assumed.
- Enrichment sources: Clearbit, USPS, D&B, Apollo. Opt-in per field. Source documentation for audit.
- Enrichment cost cap: per-record cap. $0.50 × 100K records = $50K. Skip enrichment when the cost-benefit fails (see the cost-cap sketch after this list).
- Exception triggers: date format ambiguity, conflicting values, duplicate detection. Specific flag per type.
- Reviewer UI: 50-200 records/hr with AI assist vs 10-15/hr manual. Validation, not data entry from scratch.
- Idempotent load: re-runs don't duplicate. Batch where supported. Atomic per record; no partial writes.
- Read-back reconciliation: read-back confirms the write. Source-to-target reconciliation catches silent failures.
- Cutover: source marked historical, target becomes source of truth, 30-day monitoring confirms no regressions.
- Rejection reasons: validation rule, uniqueness, FK constraint, business rule. Specific cause per rejection.
- Pattern analysis: 50+ records rejecting for the same reason = upstream issue. Repair the pattern, not just the records.
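The enrichment cost cap above is simple to enforce in code. A minimal sketch, assuming a flat per-record enrichment price and a project-level budget (both numbers are placeholders, not your actuals):

```python
PER_RECORD_COST = 0.50        # assumed per-lookup price for an external enrichment source
ENRICHMENT_BUDGET = 5_000.00  # assumed project budget for enrichment spend

class EnrichmentBudget:
    """Tracks spend and refuses further enrichment once the cap would be exceeded."""
    def __init__(self, budget: float, per_record: float):
        self.budget = budget
        self.per_record = per_record
        self.spent = 0.0

    def can_enrich(self) -> bool:
        # Records refused here fall through to the exception lane instead of billing past the cap.
        return self.spent + self.per_record <= self.budget

    def record_spend(self) -> None:
        self.spent += self.per_record

# 100K records at $0.50 each is $50K -- an order of magnitude past a $5K budget,
# which is exactly the overrun the cost-cap node exists to prevent.
```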
Stack combinations that actually work.
Three stack combinations cover most builds. The decision usually comes down to source data formats and target system. AWS Textract + Lambda dominates document-heavy migrations; Fivetran + dbt covers SaaS-to-SaaS migrations; custom builds offer the most flexibility for complex multi-source projects.
Tradeoff: The document-heavy stack. Textract or Google DocAI handle OCR + table extraction; Claude orchestrates field validation + schema mapping; Lambda for serverless execution; Salesforce or target system for load. About $1,000/mo all-in for a 500K-record migration. Best for paper-to-digital migrations and PDF-heavy sources.
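A minimal sketch of the document-heavy stack's extraction entry point, assuming boto3 and the synchronous single-page Textract call (multi-page PDFs need the async document-analysis flow instead):

```python
import boto3

textract = boto3.client("textract", region_name="us-east-1")  # region is an assumption

def extract_blocks(page_bytes: bytes) -> list[dict]:
    """Run OCR plus form/table detection on one page; return blocks with their confidence scores."""
    response = textract.analyze_document(
        Document={"Bytes": page_bytes},
        FeatureTypes=["FORMS", "TABLES"],
    )
    # Each block carries its own Confidence, which feeds the per-field routing downstream.
    return [
        {"type": b["BlockType"], "text": b.get("Text", ""), "confidence": b.get("Confidence")}
        for b in response["Blocks"]
    ]
```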
Tradeoff: The SaaS-to-SaaS stack. Fivetran for source extraction; dbt for transformation logic in Snowflake; Hightouch or Census for reverse ETL into the target. Best for clean SaaS migrations (HubSpot to Salesforce, etc) where source and target both have native APIs. Lower flexibility on document-heavy sources.
Tradeoff: Most flexible. Postgres for staging; Python for extraction + transformation; Claude for AI parsing; n8n for orchestration; custom reviewer UI for exception handling. Best for technical teams with engineering capacity and unusual source formats. Highest build complexity. Worth it for complex multi-source migrations or ongoing data-entry replacement at scale.
Cheapest viable. Claude API for extraction + Google Sheets for reviewer UI + manual load to target via target's native import. Skip the orchestration layer for v1. About $100/mo. Validates whether AI extraction works for your specific source formats before investing in production pipeline. Builds in 1-2 weeks for proof-of-concept.
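If you run this proof-of-concept, the go/no-go question is field-level accuracy against hand-labeled records (the 92% bar described in step two below). A minimal check, assuming extraction output flattened to plain field-to-value dicts in the same order as the labeled sample:

```python
def field_accuracy(extracted: list[dict], labeled: list[dict]) -> float:
    """Fraction of fields where AI extraction exactly matches the hand-labeled value."""
    matches = total = 0
    for ai_row, truth_row in zip(extracted, labeled):  # assumes rows are aligned by position
        for field, truth_value in truth_row.items():
            total += 1
            matches += int(ai_row.get(field) == truth_value)
    return matches / total if total else 0.0

# A proof-of-concept that can't clear roughly 92% on 100-500 labeled records is a signal
# to fix prompts or source handling before investing in the production pipeline.
```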
Production stack for 500K+ record migrations or ongoing data-entry replacement. AWS Textract ($300+/mo at scale), Claude Sonnet/Opus ($200-$500/mo), Lambda + Postgres ($100/mo), custom reviewer UI for exception handling, Salesforce or target system. About $1,200-$1,800/mo all-in. Adds the extraction reliability, exception-lane throughput, reconciliation accuracy, and post-migration monitoring rhythm the proof-of-concept stack can't provide.
How to actually build this.
Six steps from zero to a production migration pipeline. The biggest mistake teams make is shipping extraction before the target schema is locked — without a frozen target, every schema change forces re-extraction at scale.
Lock target schema + mapping spec
Document the target schema explicitly: every field, type, validation rule, business rule, foreign-key relationship. Get sign-off from target system owner that the schema is frozen for the migration window. Document the source-to-target mapping: which source field maps to which target field, what transformation applies. Include explicit rules for ambiguous mappings ('source has no industry classification — leave target field NULL or enrich from D&B?'). Without this, every downstream issue traces back to schema instability.
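One workable shape for the mapping spec is a plain, version-controlled data file. A hypothetical fragment (field names, transforms, and rules are illustrative, not your schema):

```python
# mapping_spec.py -- version-controlled; changes go through PR review (see step three).
MAPPING_SPEC = {
    "Account.Phone": {
        "source": "legacy_crm.phone_raw",
        "transform": "normalize_phone_e164",
        "on_missing": "exception",           # ambiguous -> human review
    },
    "Account.BillingStreet": {
        "source": "legacy_crm.addr_line_1",
        "transform": "verify_postal_address",
        "on_missing": "enrich:usps",         # gap-fill, with lineage recorded
    },
    "Account.Industry": {
        "source": None,                      # source has no industry classification
        "transform": None,
        "on_missing": "enrich:dnb",          # explicit rule decided up front, not at load time
    },
}
```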
Wire source ingestion + AI extraction
Ingestion pipeline handles every source format: CSV/Excel imports, PDF batches, scanned paper through OCR, legacy system exports, third-party feeds. AI extraction with per-field confidence scoring. Validate against 100-500 known-good records first; AI extraction must match expert annotation 92%+ before scaling. Source-format edge cases catalogued (some PDFs are scanned images, some are text-extractable; AI handles them differently).
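A minimal sketch of the extraction call, assuming the Anthropic Python SDK, a prompt that asks for per-field values plus a 0-1 confidence, and a model that returns clean JSON; the model id and field list are assumptions:

```python
import json
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

FIELDS = ["name", "email", "phone", "billing_address"]  # illustrative target fields

def extract_fields(document_text: str) -> dict:
    """Ask the model for each field's value and a self-reported 0-1 confidence."""
    prompt = (
        "Extract the following fields from the document as JSON, where each field maps to "
        '{"value": ..., "confidence": 0-1}. Use null when a field is absent.\n'
        f"Fields: {', '.join(FIELDS)}\n\nDocument:\n{document_text}"
    )
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed model id; pin whichever you validated against
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    # Sketch assumes the reply is bare JSON; production parsing needs to handle wrapped output.
    return json.loads(response.content[0].text)
```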
Build schema map + transformations
Implement the source-to-target mapping with transformations: phone normalization, address verification, currency conversion, enum translation, foreign-key resolution. Validate against historical records with known correct target values; transformation accuracy must be 99%+ before scaling. Mapping table is version-controlled; changes follow PR review.
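A minimal sketch of one such transformation, phone normalization to E.164, using the phonenumbers library; the default "US" region is an assumption:

```python
import phonenumbers

def normalize_phone_e164(raw: str, default_region: str = "US") -> str | None:
    """Return the E.164 form (e.g. '+14155552671'), or None so the record routes to exception."""
    try:
        parsed = phonenumbers.parse(raw, default_region)
    except phonenumbers.NumberParseException:
        return None
    if not phonenumbers.is_valid_number(parsed):
        return None
    return phonenumbers.format_number(parsed, phonenumbers.PhoneNumberFormat.E164)
```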
Build the three validation lanes
Clean: confidence threshold + completeness check + type validation. Enrich: gap-fill from external sources with lineage tracking + cost caps. Exception: reviewer UI showing source alongside AI extraction with per-field confidence + AI suggestion. Build them in volume order — clean first (highest volume, simplest), enrich second, exception third with most care because reviewer UX directly determines throughput.
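A minimal routing sketch, assuming per-field extraction output shaped like step two's example and threshold values you calibrate against the 5% sample audit rather than hard-code:

```python
CLEAN_THRESHOLD = 0.95                                # assumed starting point; calibrate from the audit
ENRICHABLE_FIELDS = {"industry", "billing_address"}   # fields an external source can fill

def route_record(record: dict) -> str:
    """Return 'clean', 'enrich', or 'exception' for one extracted record."""
    low_confidence = [f for f, v in record.items() if v["confidence"] < CLEAN_THRESHOLD]
    missing = [f for f, v in record.items() if v["value"] is None]
    if not low_confidence and not missing:
        return "clean"                                # direct to load
    if not low_confidence and set(missing) <= ENRICHABLE_FIELDS:
        return "enrich"                               # gaps an external source can fill
    return "exception"                                # ambiguity or low confidence -> human review
```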
Wire idempotent load + reconciliation
Target loading with idempotency keys (re-runs don't duplicate) and atomic per-record writes (partial writes never persist). Batch loading where target API supports it; single-record fallback otherwise. Read-back reconciliation: confirm write succeeded with right values. Source-to-target reconciliation: count-of-records and aggregate-totals match within tolerance. Drift surfaces specific records for investigation.
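A minimal sketch of the load step, assuming a target REST API that supports upsert keyed on an external id (the endpoint shape here is hypothetical; Salesforce, for instance, exposes this pattern as an upsert on an External ID field):

```python
import requests

TARGET_API = "https://target.example.com/api/records"  # hypothetical endpoint

def load_record(record: dict, source_id: str) -> bool:
    """Upsert keyed on the source id so re-runs overwrite instead of duplicating, then read back."""
    # Idempotent write: the same source_id always maps to the same target row.
    write = requests.put(f"{TARGET_API}/{source_id}", json=record, timeout=30)
    write.raise_for_status()

    # Read-back reconciliation: confirm the values actually landed.
    readback = requests.get(f"{TARGET_API}/{source_id}", timeout=30).json()
    mismatched = {k: v for k, v in record.items() if readback.get(k) != v}
    # A successful status code with wrong values is exactly the silent failure this catches.
    return not mismatched
```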
Add observability + cutover plan
Observability: extraction confidence distribution, lane-routing rates, reviewer throughput, rejection patterns by reason, reconciliation drift. Cutover plan: source system marked historical, target becomes source of truth, downstream consumers switch over, monitoring confirms no regressions for 30 days. Migration completion documentation: what migrated, what was excluded, known data quality limitations, full lineage.
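A minimal sketch of the source-to-target reconciliation check that feeds the observability dashboard; the 0.5% tolerance and the aggregate field are assumptions to tune per migration:

```python
def reconcile(source_rows: list[dict], target_rows: list[dict],
              amount_field: str = "annual_revenue", tolerance: float = 0.005) -> dict:
    """Compare record counts and one aggregate total; flag drift beyond tolerance."""
    count_drift = abs(len(source_rows) - len(target_rows)) / max(len(source_rows), 1)
    source_total = sum(r.get(amount_field) or 0 for r in source_rows)
    target_total = sum(r.get(amount_field) or 0 for r in target_rows)
    amount_drift = abs(source_total - target_total) / max(abs(source_total), 1)
    return {
        "count_drift": count_drift,
        "amount_drift": amount_drift,
        "within_tolerance": count_drift <= tolerance and amount_drift <= tolerance,
    }
```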
Where this fails in real deployments.
Five failure modes that wreck migration projects in production. Every team that's built this hits at least three of them.
AI hallucinates a customer phone number
Source PDF has an illegible scanned phone number. AI's confidence is 0.78, below threshold but not catastrophically so. AI extracts '555-123-4567' (a plausible number). Confidence flags the field for review, but the reviewer is overwhelmed and accepts AI's suggestion at speed. The number loads to target. Three weeks later, the sales team calls the number and reaches the wrong person. Multiplied across 200 such records, the contact data is unreliable.
Schema map drops critical free-text notes
Source CRM has 'Customer Notes' field with rep-entered context: 'CFO is new; old CFO was decision-maker', 'Renewal due Q3', 'Prefers Friday meetings'. Target Salesforce doesn't have a direct corresponding field. Schema map silently drops it. Six weeks post-migration, sales team complaints surge — they lost institutional knowledge that was in those notes.
Cutover weekend rejection cascade
Migration runs cleanly through pre-prod testing. Cutover weekend: production load starts. The target system rejects 8% of records due to a business rule that changed in the target two weeks earlier without notification (a foreign-key constraint became stricter). The rejection cascade affects downstream relationships; the team spends 22 hours of the cutover weekend resolving it. The migration goes 18 hours past schedule.
Enrichment cost runs over budget
Enrichment lane fills missing data via Clearbit at $0.50/record. 100K records flow through enrichment lane. $50K bill at end of month, vs $5K budgeted. Finance discovers the gap during monthly close; enrichment paused mid-migration; records that needed enrichment now flow to exception lane and overwhelm reviewers.
Exception lane backs up; reviewers fall behind
Migration starts smoothly. Week 2: source data quality is worse than expected and exception-lane volume runs 2x projected. The 3-person reviewer team falls behind; the backlog grows from 500 to 5,000 records. The project timeline slips, and the migration stays partially blocked for weeks while reviewers catch up.
Build it yourself, or get help.
This is a Tier-2 build because the schema mapping design and reviewer UI throughput optimization are the hard work, not the AI. Done well, it pays back during the migration project itself and becomes infrastructure for future migrations. Done sloppily, it ships data quality issues into the target system that compound for years.
Build it yourself
If you have data ops, target system ownership, and time to invest 4-8 weeks.
Hire a partner
If you're under cutover pressure or your team can't dedicate 8 weeks.
Want to get in touch with a partner to build this for you? Run the free audit first. It gives any partner the context they need on your business — your stack, your volume, your highest-leverage automation — so the first conversation is about scope, not discovery.
Run the free audit.
Automations that pair with this one.
The matchups that come up while building this.
Want to know if this is the highest-leverage automation for your business?
Run a free audit. We'll tell you what would save you the most money — even if it isn't this one.
No credit card. No follow-up call unless you ask.