Data preparation for machine learning: a practical guide
Data preparation routinely consumes 25-40% of ML project budget — and it's where projects most often overrun. The seven-stage process that ships, common pitfalls, and the tooling we deploy.

Data preparation is the engineering work that determines whether ML projects ship or stall. It routinely consumes 25-40% of project budget, and underestimating it remains the most common cause of cost overruns. Models trained on inadequately prepared data produce convincing predictions based on noise, which is worse than having no model at all.
This article maps the seven-stage data preparation process that consistently produces production-ready ML datasets, the common pitfalls, and the tooling we deploy. For broader ML cost framing, see "calculating machine learning costs" and "how much does AI cost in 2026".
Why data prep is the highest-leverage ML work
Three reasons it dominates project economics:
- Garbage in, garbage out. ML models learn statistical patterns from data. Bad data produces bad patterns regardless of model sophistication.
- Iteration cost compounds. Issues caught during data prep cost hours; issues caught after model training cost weeks.
- Quality gates determine production readiness. Models can't deploy without data preparation that meets quality, fairness, and compliance standards.
The teams shipping production ML invest disproportionately in data prep. The teams that don't end up writing surprised reports about why their models underperform.
Seven-stage data preparation process
Stage 1: Define data requirements
Before collecting anything, specify:
- Target variable — what is the model predicting?
- Features needed — what inputs are likely predictive?
- Volume requirements — how much data is needed for the model class?
- Coverage requirements — what time periods, geographies, customer segments?
- Quality requirements — accuracy, completeness, freshness SLAs
Without explicit requirements, data collection becomes scope creep that consumes budget without producing usable datasets.
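One way to keep requirements explicit is to write them down as code that a pipeline can check. The sketch below is illustrative only: the `DataRequirements` class, its field names, and the churn example values are all hypothetical, not a standard schema.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DataRequirements:
    """Hypothetical spec capturing the Stage 1 checklist in code."""
    target: str                 # what the model predicts
    features: tuple[str, ...]   # inputs believed to be predictive
    min_rows: int               # volume requirement
    segments: tuple[str, ...]   # coverage requirement
    max_missing_rate: float     # completeness SLA
    max_age_days: int           # freshness SLA


def check_requirements(req, row_count, segments_seen, missing_rate, age_days):
    """Return a list of violations; an empty list means all requirements are met."""
    problems = []
    if row_count < req.min_rows:
        problems.append(f"volume: {row_count} rows < {req.min_rows}")
    missing_segments = sorted(set(req.segments) - set(segments_seen))
    if missing_segments:
        problems.append(f"coverage: missing segments {missing_segments}")
    if missing_rate > req.max_missing_rate:
        problems.append(f"completeness: missing rate {missing_rate} > {req.max_missing_rate}")
    if age_days > req.max_age_days:
        problems.append(f"freshness: {age_days} days old > {req.max_age_days}")
    return problems


# Illustrative values for an assumed churn-prediction project.
churn_req = DataRequirements(
    target="churned_within_90d",
    features=("plan", "tenure_months", "support_tickets"),
    min_rows=50_000,
    segments=("EU", "US", "APAC"),
    max_missing_rate=0.05,
    max_age_days=7,
)
violations = check_requirements(churn_req, 42_000, {"EU", "US"}, 0.03, 2)
for v in violations:
    print(v)
```

Running a check like this against each candidate dataset turns "is this data good enough?" into a yes/no answer with reasons attached.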
Stage 2: Source data identification and access
Map where data lives:
- Internal systems (databases, warehouses, application logs)
- External sources (commercial datasets, public datasets, third-party APIs)
- Synthetic data generation (where real data is scarce or sensitive)
- Manually labeled data (where automation isn't viable)
Each source has different access patterns, quality characteristics, and compliance requirements. Inventory work in this stage prevents surprises later.
Stage 3: Collection and ingestion
Engineering work to actually get data into the prep pipeline:
- API integration with source systems
- Batch extraction from databases
- Streaming ingestion for real-time signals
- Scheduled refreshes for ongoing collection
- Error handling and retry logic
- Audit logging for compliance
Modern tooling (Fivetran, Airbyte, custom Python pipelines) handles much of this. The infrastructure choice determines ongoing operational cost.
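For custom Python pipelines, error handling and retry logic often come down to a small wrapper like the one below. This is a minimal sketch under stated assumptions: `fetch_with_retry`, the `flaky_source` stand-in, and the `ingest` logger name are hypothetical, and a real pipeline would likely catch the specific exceptions its client library raises.

```python
import logging
import time

log = logging.getLogger("ingest")  # hypothetical logger name


def fetch_with_retry(fetch, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the wait after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except ConnectionError as exc:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the scheduler
            delay = base_delay * 2 ** (attempt - 1)
            log.warning("attempt %d failed (%s); retrying in %.1fs", attempt, exc, delay)
            sleep(delay)


# Stand-in for a flaky source system: fails twice, then returns a batch.
calls = {"n": 0}


def flaky_source():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("source unavailable")
    return [{"id": 1}, {"id": 2}]


delays = []  # record the waits instead of actually sleeping
batch = fetch_with_retry(flaky_source, sleep=delays.append)
```

Passing `sleep` as a parameter keeps the wrapper testable without real waiting; the logged warnings double as a simple audit trail of failed attempts.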
Stage 4: Cleaning and quality validation
The work that consumes the largest share of data prep time:
- Handle missing values — imputation, flagging, or filtering
- Remove duplicates — exact duplicates, near-duplicates with similarity matching
- Fix inconsistencies — same fact represented differently across systems
- Filter outliers — bots, accidental records, sensor errors
- Validate types and ranges — numeric values within expected bounds, dates in valid ranges
- Detect and address bias — coverage gaps in protected classes
Automated profiling tools (Great Expectations, Soda, ydata-profiling, formerly Pandas Profiling) accelerate quality validation substantially. Manual cleaning at scale doesn't work.
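To make the first few cleaning steps concrete, here is a standard-library sketch that deduplicates records, treats out-of-range values as missing, and imputes with a flag. The `age` field, its 0-120 bounds, and the median strategy are assumptions chosen for illustration; the right imputation depends on the dataset.

```python
from statistics import median


def clean(records, lo=0.0, hi=120.0):
    """Deduplicate, range-check, and median-impute a hypothetical 'age' field."""
    # 1. Remove exact duplicates while preserving first-seen order.
    seen, unique = set(), []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(rec))

    # 2. Treat out-of-range values as missing rather than silently keeping them.
    for rec in unique:
        age = rec.get("age")
        if age is not None and not (lo <= age <= hi):
            age = None
        rec["age"] = age

    # 3. Impute with the median of the valid values and flag imputed rows.
    #    (median() raises StatisticsError if no valid value exists at all.)
    fill = median(r["age"] for r in unique if r["age"] is not None)
    for rec in unique:
        rec["age_imputed"] = rec["age"] is None
        if rec["age_imputed"]:
            rec["age"] = fill
    return unique


raw = [
    {"id": 1, "age": 34},
    {"id": 1, "age": 34},    # exact duplicate
    {"id": 2, "age": None},  # missing value
    {"id": 3, "age": 460},   # implausible value, e.g. an entry error
    {"id": 4, "age": 52},
]
cleaned = clean(raw)
```

Keeping the `age_imputed` flag lets the model, and later audits, distinguish observed values from filled ones.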
Stage 5: Feature engineering
Transforming raw data into model-ready features:
- Encoding categorical variables (one-hot, target encoding, embedding)
- Numerical scaling (standardization, normalization)
- Time-based features (day-of-week, hour-of-day, time-since-event)
- Aggregations (rolling averages, counts, recency)
- Domain-specific transformations (text vectorization, image augmentation)
- Feature selection (correlation analysis, importance ranking)
Feature engineering is where domain expertise meets engineering skill. Generic feature stores (Tecton, Feast, AWS SageMaker Feature Store) help with operationalization but don't replace domain-driven feature design.
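A few of these transformations can be sketched with the standard library alone. The example below one-hot encodes a category, standardizes a numeric column, and derives time-based features; the `plan`, `amount`, and `timestamp` field names are hypothetical.

```python
from datetime import datetime
from statistics import mean, pstdev


def build_features(rows):
    """Turn raw event rows into numeric feature dicts."""
    categories = sorted({r["plan"] for r in rows})
    amounts = [r["amount"] for r in rows]
    mu, sigma = mean(amounts), pstdev(amounts)

    features = []
    for r in rows:
        ts = datetime.fromisoformat(r["timestamp"])
        feat = {
            # Standardization: zero mean, unit variance (0.0 if the column is constant).
            "amount_z": (r["amount"] - mu) / sigma if sigma else 0.0,
            "day_of_week": ts.weekday(),  # Monday == 0
            "hour_of_day": ts.hour,
        }
        # One-hot encoding: one 0/1 column per observed category.
        for cat in categories:
            feat[f"plan_{cat}"] = int(r["plan"] == cat)
        features.append(feat)
    return features


rows = [
    {"plan": "basic", "amount": 10.0, "timestamp": "2024-01-01T09:30:00"},
    {"plan": "pro", "amount": 30.0, "timestamp": "2024-01-06T18:00:00"},
]
feats = build_features(rows)
```

In a real pipeline the mean, standard deviation, and category list would be computed on the training split only and then reused for validation and test data; computing them over the full dataset, as this short sketch does, leaks evaluation information into training.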
Stage 6: Splitting and validation
Preparing data for training and evaluation:
- Train/validation/test splits — chronological for time-sensitive data, stratified for classification
- Cross-validation folds — for smaller datasets where holdout reduces signal
- Group-aware splitting — preventing leakage where multiple records relate to same entity
- Augmentation — synthetic data generation, label smoothing, adversarial examples
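Splitting decisions made here affect every downstream metric. Wrong splits produce models that look great in offline evaluation and fail in production. Apply augmentation only to the training split, after the split is made, so that synthetic variants of a record never end up in the evaluation data.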
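Group-aware and chronological splits are simple to sketch. The version below hashes the entity ID so that every record for a given customer lands on the same side, and sorts by time so that evaluation data is strictly later than training data. The `customer_id` and `day` fields are hypothetical, and hash bucketing is one of several possible approaches (libraries such as scikit-learn offer their own group-aware splitters).

```python
import hashlib


def group_split(records, group_key, test_fraction=0.2):
    """Assign every record of an entity to the same side of the split."""
    train, test = [], []
    for rec in records:
        digest = hashlib.sha256(str(rec[group_key]).encode()).hexdigest()
        bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
        (test if bucket < test_fraction * 100 else train).append(rec)
    return train, test


def chronological_split(records, time_key, test_fraction=0.2):
    """Train on the earliest records and evaluate on the most recent ones."""
    ordered = sorted(records, key=lambda r: r[time_key])
    cut = len(ordered) - round(len(ordered) * test_fraction)
    return ordered[:cut], ordered[cut:]


# 200 synthetic events spread over 50 customers.
events = [{"customer_id": i % 50, "day": i} for i in range(200)]

g_train, g_test = group_split(events, "customer_id")
c_train, c_test = chronological_split(events, "day")

# No customer appears on both sides of the group-aware split.
overlap = {r["customer_id"] for r in g_train} & {r["customer_id"] for r in g_test}
```

Because the bucket depends only on the ID, the assignment is stable across refreshes: a customer placed in the test set stays there as new records arrive. The realized test fraction is only approximately `test_fraction`, especially with few groups.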
Stage 7: Documentation and versioning
Production data prep requires reproducibility:
- Pipeline versioning — code-based pipelines under version control
- Dataset versioning — DVC, LakeFS, or commercial tools
- Documentation — what each transformation does, why, by whom, when
- Lineage — source-to-feature traceability for audit and debugging
- Compliance documentation — privacy, consent, retention details
Without this discipline, model debugging takes weeks and audits become impossible.
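Tools like DVC and LakeFS handle dataset versioning at scale, but the underlying idea can be sketched in a few lines: fingerprint the data, then store the fingerprint alongside what, why, by whom, and when. The function names and record fields below are hypothetical and do not reflect any particular tool's format.

```python
import hashlib
import json
from datetime import datetime, timezone


def dataset_fingerprint(rows):
    """Order-independent SHA-256 over the canonical JSON form of each row."""
    row_hashes = sorted(
        hashlib.sha256(json.dumps(r, sort_keys=True).encode()).hexdigest()
        for r in rows
    )
    return hashlib.sha256("".join(row_hashes).encode()).hexdigest()


def lineage_record(rows, source, transform, author, pipeline_version):
    """Describe one dataset version: what it is, where it came from, who made it."""
    return {
        "fingerprint": dataset_fingerprint(rows),
        "row_count": len(rows),
        "source": source,
        "transform": transform,
        "author": author,
        "pipeline_version": pipeline_version,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }


rows = [{"id": 1, "plan": "basic"}, {"id": 2, "plan": "pro"}]
record = lineage_record(
    rows,
    source="crm.accounts",           # illustrative source name
    transform="dedupe + impute age",  # illustrative description
    author="data-eng",
    pipeline_version="1.4.0",
)
print(json.dumps(record, indent=2))
```

If a model's behavior changes between runs, comparing fingerprints shows immediately whether the training data changed or only the code did.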
Common data preparation pitfalls
Underestimating cleaning effort. Project plans that allocate 1-2 weeks for data prep on enterprise datasets routinely overrun by 4-10x.
Manual processes that don't scale. Spreadsheet-based cleaning doesn't survive the first production refresh. Automate from day one.
Bias amplification. Cleaning that removes "outliers" can systematically under-represent minority classes. Validate that cleaning preserves fairness across protected attributes.
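A quick way to validate this is to compare each group's share of the dataset before and after a cleaning step. The sketch below uses made-up data and hypothetical field names (`group`, `income`); the size of shift that is acceptable is a judgment call for the project.

```python
from collections import Counter


def group_shares(records, attr):
    """Fraction of records belonging to each value of attr."""
    counts = Counter(r[attr] for r in records)
    total = sum(counts.values())
    return {g: n / total for g, n in counts.items()} if total else {}


def representation_shift(before, after, attr):
    """Per-group change in dataset share caused by a cleaning step."""
    b, a = group_shares(before, attr), group_shares(after, attr)
    return {g: a.get(g, 0.0) - share for g, share in b.items()}


# Made-up data: group B is smaller and has systematically higher values.
before = (
    [{"group": "A", "income": 50} for _ in range(80)]
    + [{"group": "B", "income": 200} for _ in range(20)]
)
# A naive "outlier" filter that drops everything above a fixed threshold.
after = [r for r in before if r["income"] <= 150]

shift = representation_shift(before, after, "group")
for group, delta in sorted(shift.items()):
    print(f"{group}: {delta:+.2f}")
```

Here the filter removes group B entirely, which a row-count check alone would report as a routine 20% reduction.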
Train-test contamination. Information leakage between training and evaluation data produces models that look great offline and fail online. Group-aware and chronological splits prevent this.
Inadequate documentation. Teams that skip documentation find themselves unable to debug or reproduce results 6 months later.
Tooling we deploy
Profiling and quality: ydata-profiling (formerly Pandas Profiling), Great Expectations, Soda, Apache Griffin, Deequ.
ETL/ELT pipelines: dbt, Apache Airflow, Dagster, Prefect.
Feature stores: Tecton, Feast, AWS SageMaker Feature Store, Vertex Feature Store.
Versioning: DVC, LakeFS, MLflow, Weights & Biases.
Synthetic data generation: SDV, Gretel, custom GAN-based approaches. See our synthetic data article for production patterns.
Labeling: Snorkel for programmatic labeling, Labelbox, Scale AI for managed labeling services.
For most enterprise ML projects, the toolchain combines dbt for transformations, Great Expectations or Soda for quality, a feature store for operationalization, and DVC for versioning. We add other tools based on the workload.
Three preparation scenarios
Small ML project (single dataset): Manual exploration with automated quality validation, basic feature engineering. $15K-$40K, 4-8 weeks.
Mid-size project (multiple sources): Modern data stack with feature store, automated pipelines, basic synthetic data. $60K-$180K, 12-20 weeks.
Enterprise ML platform: Comprehensive pipelines with quality observability, feature stores, versioning, synthetic data generation, governance integration. $300K-$800K+, 6-12 months.
Final framing
Data preparation is where ML projects succeed or fail. The teams shipping production ML invest disproportionately in this stage and reap compounding benefits across all downstream model work. The teams that minimize prep work to focus on "the interesting modeling" consistently underperform.
There's no shortcut. Clean, well-prepared, well-documented data is the engineering foundation that makes ML investments capital-efficient.
Ready to scope an ML data preparation project? Run the Project Estimator for a deterministic ballpark, or book a 45-minute Discovery with our data engineering team — we'll review your data sources, ML use case, and quality requirements, and tell you honestly what scope of preparation your project actually needs.