Data Engineering · September 28, 2024 · 5 min read

Data preparation for machine learning: a practical guide

Data preparation routinely consumes 25-40% of ML project budget — and it's where projects most often overrun. The seven-stage process that ships, common pitfalls, and the tooling we deploy.

By JustSoftLab Team

Data preparation is the engineering work that determines whether ML projects ship or stall. It routinely consumes 25-40% of project budget, but underestimating it remains the most common cause of cost overruns. Models trained on inadequately prepared data produce convincing predictions based on noise — worse than no model.

This article maps the seven-stage data preparation process that consistently produces production-ready ML datasets, the common pitfalls, and the tooling we deploy. For broader ML cost framing, see "calculating machine learning costs" and "how much does AI cost in 2026".

Why data prep is the highest-leverage ML work

Three reasons it dominates project economics:

Garbage in, garbage out. ML models learn statistical patterns from data. Bad data produces bad patterns regardless of model sophistication.
Iteration cost compounds. Issues caught during data prep cost hours; issues caught after model training cost weeks.
Quality gates determine production readiness. Models can't deploy without data preparation that meets quality, fairness, and compliance standards.

The teams shipping production ML invest disproportionately in data prep. The teams that skip that investment end up writing surprised reports about why their models underperform.

Seven-stage data preparation process

Stage 1: Define data requirements

Before collecting anything, specify:

Target variable — what is the model predicting?
Features needed — what inputs are likely predictive?
Volume requirements — how much data is needed for the model class?
Coverage requirements — what time periods, geographies, customer segments?
Quality requirements — accuracy, completeness, freshness SLAs

Without explicit requirements, data collection becomes scope creep that consumes budget without producing usable datasets.
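One way to keep these requirements explicit is to write them down as a small spec that can be checked against each extract. The sketch below is illustrative only: the DataRequirements class, its field names, the thresholds, and the sample columns are all hypothetical and not taken from any library.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DataRequirements:
    """Hypothetical requirements spec; field names are illustrative."""
    target: str                   # column the model predicts
    features: tuple               # candidate input columns
    min_rows: int                 # volume requirement
    max_missing_fraction: float   # completeness requirement, per column
    max_age_days: int             # freshness requirement


def check_requirements(rows, req, today, date_column="event_date"):
    """Return a list of human-readable violations (empty list means pass)."""
    violations = []
    if len(rows) < req.min_rows:
        violations.append(f"volume: {len(rows)} rows < {req.min_rows}")
    for col in (req.target, *req.features):
        missing = sum(1 for r in rows if r.get(col) is None)
        fraction = missing / len(rows) if rows else 1.0
        if fraction > req.max_missing_fraction:
            violations.append(f"completeness: {col} missing {fraction:.0%}")
    if rows:
        newest = max(r[date_column] for r in rows)
        age = (today - newest).days
        if age > req.max_age_days:
            violations.append(f"freshness: newest record is {age} days old")
    return violations


req = DataRequirements(
    target="churned",
    features=("plan", "monthly_spend"),
    min_rows=3,
    max_missing_fraction=0.25,
    max_age_days=30,
)
rows = [
    {"churned": 0, "plan": "pro", "monthly_spend": 49.0, "event_date": date(2024, 9, 1)},
    {"churned": 1, "plan": "free", "monthly_spend": None, "event_date": date(2024, 9, 10)},
    {"churned": 0, "plan": None, "monthly_spend": None, "event_date": date(2024, 9, 20)},
]
violations = check_requirements(rows, req, today=date(2024, 9, 28))
```

In this toy extract the volume and freshness checks pass, while both feature columns fail the completeness threshold. Coverage requirements (time periods, geographies, segments) could be added as further fields in the same way.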

Stage 2: Source data identification and access

Map where data lives:

Internal systems (databases, warehouses, application logs)
External sources (commercial datasets, public datasets, third-party APIs)
Synthetic data generation (where real data is scarce or sensitive)
Manually labeled data (where automation isn't viable)

Each source has different access patterns, quality characteristics, and compliance requirements. Inventory work in this stage prevents surprises later.

Stage 3: Collection and ingestion

Engineering work to actually get data into the prep pipeline:

API integration with source systems
Batch extraction from databases
Streaming ingestion for real-time signals
Scheduled refreshes for ongoing collection
Error handling and retry logic
Audit logging for compliance

Modern tooling (Fivetran, Airbyte, custom Python pipelines) handles much of this. The infrastructure choice determines ongoing operational cost.
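For custom Python pipelines, the error handling, retry, and audit logging items above might look roughly like the following. This is a minimal sketch under stated assumptions: fetch_with_retry, the logger name, and the flaky_source stand-in are hypothetical, and a real pipeline would catch only the transient error types of its actual client library.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("ingestion.audit")  # hypothetical logger name


def fetch_with_retry(fetch, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, backing off exponentially.

    `fetch` is any zero-argument callable that raises on failure.
    `sleep` is injectable so tests can skip real waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            result = fetch()
            audit_log.info("fetch succeeded on attempt %d", attempt)
            return result
        except Exception as exc:  # narrow this to transient errors in practice
            audit_log.warning("attempt %d failed: %s", attempt, exc)
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...


# Usage with a stand-in source that fails twice before returning data.
calls = {"count": 0}


def flaky_source():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary outage")
    return [{"id": 1}, {"id": 2}]


delays = []  # record the requested waits instead of actually sleeping
records = fetch_with_retry(flaky_source, sleep=delays.append)
```

Every attempt is written to the audit logger, so the log doubles as a record of when each source was read and how often it failed.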

Stage 4: Cleaning and quality validation

The work that consumes the largest share of data prep time:

Handle missing values — imputation, flagging, or filtering
Remove duplicates — exact duplicates, near-duplicates with similarity matching
Fix inconsistencies — same fact represented differently across systems
Filter outliers — bots, accidental records, sensor errors
Validate types and ranges — numeric values within expected bounds, dates in valid ranges
Detect and address bias — coverage gaps in protected classes

Automated profiling and validation tools (Great Expectations, Soda, Pandas Profiling, now published as ydata-profiling) accelerate quality validation substantially. Manual cleaning at scale doesn't work.
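To make three of the steps above concrete (exact-duplicate removal, missing-value flagging, and range validation), here is a small sketch in plain Python. The clean_records function, the field names, and the 0 to 120 age range are assumptions chosen for illustration. A production pipeline would more likely express the same rules in one of the validation tools named above.

```python
def clean_records(rows, key_fields, numeric_bounds):
    """Deduplicate, flag missing values, and drop out-of-range rows.

    rows: list of dicts
    key_fields: fields that together define an exact duplicate
    numeric_bounds: {field: (low, high)} inclusive expected ranges
    Returns (cleaned_rows, report).
    """
    seen = set()
    cleaned = []
    report = {"duplicates": 0, "out_of_range": 0, "missing_flagged": 0}
    for row in rows:
        key = tuple(row.get(f) for f in key_fields)
        if key in seen:
            report["duplicates"] += 1
            continue
        seen.add(key)

        row = dict(row)  # copy so the caller's data is not mutated
        out_of_range = False
        for field, (low, high) in numeric_bounds.items():
            value = row.get(field)
            row[f"{field}_missing"] = value is None  # flag rather than impute
            if value is None:
                report["missing_flagged"] += 1
            elif not low <= value <= high:
                out_of_range = True
        if out_of_range:
            report["out_of_range"] += 1
            continue
        cleaned.append(row)
    return cleaned, report


raw = [
    {"user_id": 1, "day": "2024-09-01", "age": 34},
    {"user_id": 1, "day": "2024-09-01", "age": 34},    # exact duplicate
    {"user_id": 2, "day": "2024-09-01", "age": None},  # missing value
    {"user_id": 3, "day": "2024-09-01", "age": 412},   # implausible age
]
cleaned, report = clean_records(raw, ["user_id", "day"], {"age": (0, 120)})
```

Returning a report alongside the cleaned rows keeps the cleaning auditable: the counts show how much data each rule removed or flagged on every refresh.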

Stage 5: Feature engineering

Transforming raw data into model-ready features:

Encoding categorical variables (one-hot, target encoding, embedding)
Numerical scaling (standardization, normalization)
Time-based features (day-of-week, hour-of-day, time-since-event)
Aggregations (rolling averages, counts, recency)
Domain-specific transformations (text vectorization, image augmentation)
Feature selection (correlation analysis, importance ranking)

Feature engineering is where domain expertise meets engineering skill. Generic feature stores (Tecton, Feast, AWS SageMaker Feature Store) help with operationalization but don't replace domain-driven feature design.
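Three of the transformations listed above (one-hot encoding, standardization, and time-based features) can be sketched in a few lines. The column names plan, spend, and signup_at are hypothetical, and the helper functions are illustrative rather than any library's API.

```python
from datetime import datetime
from statistics import mean, pstdev


def one_hot(value, categories):
    """Encode value as a 0/1 list over a fixed, ordered category list."""
    return [1 if value == c else 0 for c in categories]


def fit_standardizer(values):
    """Learn mean and standard deviation; fit on training values only."""
    mu, sigma = mean(values), pstdev(values)
    return mu, (sigma if sigma > 0 else 1.0)  # guard against constant columns


def time_features(ts):
    """Day-of-week (Monday=0) and hour-of-day from a datetime."""
    return [ts.weekday(), ts.hour]


def build_features(row, categories, standardizer):
    """Concatenate encoded plan, scaled spend, and signup time features."""
    mu, sigma = standardizer
    return (
        one_hot(row["plan"], categories)
        + [(row["spend"] - mu) / sigma]
        + time_features(row["signup_at"])
    )


train = [
    {"plan": "free", "spend": 10.0, "signup_at": datetime(2024, 9, 2, 9, 30)},
    {"plan": "pro", "spend": 30.0, "signup_at": datetime(2024, 9, 7, 18, 0)},
]
categories = ["free", "pro"]
standardizer = fit_standardizer([r["spend"] for r in train])
features = [build_features(r, categories, standardizer) for r in train]
```

The scaling parameters are learned once from training data and then reused for every later row. Refitting them on validation or production data would leak information and shift the feature distribution.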

Stage 6: Splitting and validation

Preparing data for training and evaluation:

Train/validation/test splits — chronological for time-sensitive data, stratified for classification
Cross-validation folds — for smaller datasets where holdout reduces signal
Group-aware splitting — preventing leakage where multiple records relate to same entity
Augmentation — synthetic data generation, label smoothing, adversarial examples

Splitting decisions made here affect every downstream metric. Wrong splits produce models that look great in training and fail in production.
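Group-aware and chronological splits can both be written without special tooling. The sketch below is one possible approach, not the only one: it hashes the entity id so that assignment is stable across reruns, and it assumes ISO-formatted date strings so that string comparison matches date order. The function names and the customer_id and day fields are hypothetical.

```python
import hashlib


def group_split(rows, group_key, test_fraction=0.2):
    """Assign whole groups to train or test using a stable hash of the group id.

    Every record for a given entity lands on the same side, so the model is
    never evaluated on an entity it saw during training.
    """
    train, test = [], []
    for row in rows:
        digest = hashlib.sha256(str(row[group_key]).encode()).hexdigest()
        bucket = int(digest[:8], 16) / 0xFFFFFFFF  # value in [0, 1]
        (test if bucket < test_fraction else train).append(row)
    return train, test


def chronological_split(rows, time_key, cutoff):
    """Train on records before the cutoff, evaluate on the rest."""
    train = [r for r in rows if r[time_key] < cutoff]
    test = [r for r in rows if r[time_key] >= cutoff]
    return train, test


rows = [
    {"customer_id": i % 50, "day": f"2024-09-{(i % 28) + 1:02d}"}
    for i in range(500)
]
train, test = group_split(rows, "customer_id")
early, late = chronological_split(rows, "day", "2024-09-15")
```

Because the hash decides membership per group, the realized test share only approximates test_fraction, especially when there are few groups. The trade-off is that a customer never switches sides when new records arrive.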

Stage 7: Documentation and versioning

Production data prep requires reproducibility:

Pipeline versioning — code-based pipelines under version control
Dataset versioning — DVC, LakeFS, or commercial tools
Documentation — what each transformation does, why, by whom, when
Lineage — source-to-feature traceability for audit and debugging
Compliance documentation — privacy, consent, retention details

Without this discipline, model debugging takes weeks and audits become impossible.
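Dedicated tools handle dataset versioning and lineage at scale, but the underlying idea is simple enough to sketch: fingerprint the data before and after each transformation and record who ran what, and when. Everything named here (dataset_fingerprint, lineage_record, the crm.accounts source, the record layout) is a hypothetical illustration, not the format used by DVC or LakeFS.

```python
import hashlib
import json
from datetime import datetime, timezone


def dataset_fingerprint(rows):
    """Content hash that is independent of row order and dict key order."""
    canonical = sorted(json.dumps(r, sort_keys=True) for r in rows)
    return hashlib.sha256("\n".join(canonical).encode()).hexdigest()


def lineage_record(source, transformation, input_rows, output_rows, author):
    """Hypothetical lineage entry: what ran, on what, producing what."""
    return {
        "source": source,
        "transformation": transformation,
        "author": author,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "input_fingerprint": dataset_fingerprint(input_rows),
        "output_fingerprint": dataset_fingerprint(output_rows),
        "input_rows": len(input_rows),
        "output_rows": len(output_rows),
    }


raw = [{"id": 1, "v": 2}, {"id": 2, "v": None}]
cleaned = [r for r in raw if r["v"] is not None]
record = lineage_record(
    source="crm.accounts",  # hypothetical source name
    transformation="drop rows with missing v",
    input_rows=raw,
    output_rows=cleaned,
    author="data-eng",
)
```

If two training runs disagree, comparing the stored fingerprints shows immediately whether they saw the same data, which is often the first question in model debugging.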

Common data preparation pitfalls

Underestimating cleaning effort. Project plans that allocate 1-2 weeks for data prep on enterprise datasets routinely overrun by 4-10x.

Manual processes that don't scale. Spreadsheet-based cleaning doesn't survive the first production refresh. Automate from day one.

Bias amplification. Cleaning that removes "outliers" can systematically under-represent minority classes. Validate that cleaning preserves fairness across protected attributes.
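One simple way to check this is to compare each group's share of the dataset before and after cleaning. The sketch below assumes a single categorical attribute and an arbitrary tolerance of five percentage points. Both are illustrative choices, and the function names are hypothetical.

```python
from collections import Counter


def group_shares(rows, attribute):
    """Fraction of rows belonging to each value of a protected attribute."""
    counts = Counter(r[attribute] for r in rows)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def representation_shift(before, after, attribute, tolerance=0.05):
    """Return groups whose share dropped by more than `tolerance` after cleaning.

    The result maps each flagged group to (share_before, share_after).
    """
    shares_before = group_shares(before, attribute)
    shares_after = group_shares(after, attribute) if after else {}
    return {
        group: (share, shares_after.get(group, 0.0))
        for group, share in shares_before.items()
        if share - shares_after.get(group, 0.0) > tolerance
    }


# Toy data: an income "outlier" filter that mostly removes group B.
before = (
    [{"group": "A", "income": 50}] * 80
    + [{"group": "B", "income": 200}] * 15
    + [{"group": "B", "income": 60}] * 5
)
after = [r for r in before if r["income"] <= 150]
flagged = representation_shift(before, after, "group")
```

Here group B falls from 20% of the data to under 6% after the filter, so it is flagged. A check like this does not decide whether the filter is wrong, but it surfaces the shift so someone can review it.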

Train-test contamination. Information leakage between training and evaluation data produces models that look great offline and fail online. Group-aware and chronological splits prevent this.
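A quick audit of an existing split can catch both leakage paths before any model is trained. This is a minimal sketch: leakage_report, the customer_id and day fields, and the use of ISO date strings are assumptions made for the example.

```python
def leakage_report(train, test, entity_key, time_key):
    """Check two common leakage paths between a training and a test set.

    shared_entities: entity ids that appear on both sides of the split.
    temporal_overlap: True if any training record is dated at or after the
    earliest test record, meaning the model trains on the "future".
    """
    shared = {r[entity_key] for r in train} & {r[entity_key] for r in test}
    overlap = bool(train and test) and (
        max(r[time_key] for r in train) >= min(r[time_key] for r in test)
    )
    return {"shared_entities": sorted(shared), "temporal_overlap": overlap}


train = [
    {"customer_id": 1, "day": "2024-08-01"},
    {"customer_id": 2, "day": "2024-09-20"},
]
test = [
    {"customer_id": 2, "day": "2024-09-10"},
    {"customer_id": 3, "day": "2024-09-25"},
]
report = leakage_report(train, test, "customer_id", "day")
```

In this toy split, customer 2 appears on both sides and a training record postdates the earliest test record, so both checks fail. Running the audit as an automated gate on every dataset refresh is cheap compared with discovering the problem after deployment.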

Inadequate documentation. Teams that skip documentation find themselves unable to debug or reproduce results six months later.

Tooling we deploy

Profiling and quality: Pandas Profiling (now ydata-profiling), Great Expectations, Soda, Apache Griffin, Deequ.

ETL/ELT pipelines: dbt, Apache Airflow, Dagster, Prefect.

Feature stores: Tecton, Feast, AWS SageMaker Feature Store, Vertex AI Feature Store.

Versioning: DVC, LakeFS, MLflow, Weights & Biases.

Synthetic data generation: SDV, Gretel, custom GAN-based approaches. See our synthetic data article for production patterns.

Labeling: Snorkel for programmatic labeling, Labelbox, Scale AI for managed labeling services.

For most enterprise ML projects, the toolchain combines dbt for transformations, Great Expectations or Soda for quality, a feature store for operationalization, and DVC for versioning. Other tools are added based on workload.

Three preparation scenarios

Small ML project (single dataset): Manual exploration with automated quality validation, basic feature engineering. $15K-$40K, 4-8 weeks.

Mid-size project (multiple sources): Modern data stack with feature store, automated pipelines, basic synthetic data. $60K-$180K, 12-20 weeks.

Enterprise ML platform: Comprehensive pipelines with quality observability, feature stores, versioning, synthetic data generation, governance integration. $300K-$800K+, 6-12 months.

Final framing

Data preparation is where ML projects succeed or fail. The teams shipping production ML invest disproportionately in this stage and reap compounding benefits across all downstream model work. The teams that minimize prep work to focus on "the interesting modeling" consistently underperform.

There's no shortcut. Disciplined, well-documented data preparation is the engineering foundation that makes ML investments capital-efficient.

Ready to scope an ML data preparation project? Run the Project Estimator for a deterministic ballpark, or book a 45-minute Discovery with our data engineering team — we'll review your data sources, ML use case, and quality requirements, and tell you honestly what scope of preparation your project actually needs.
