Dataset Preparation

Automates dataset preparation for ML model training: data collection, cleaning, labeling, augmentation, statistical analysis, splitting, metadata documentation, and validation.

Complexity: Moderate | Duration: 15-30 min | Category: DevOps

Tags: ml data-prep preprocessing etl

Workflow Diagram

Steps

Step 1: Data collection

Agent: data-scientist - Aggregate raw data from sources (APIs, databases, files)
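As a rough illustration of this step, the sketch below merges rows from a file, an API-style JSON payload, and a database into one list. The function name `collect_records`, the `samples` table, and the `id`/`text` columns are assumptions made for the example and are not part of the workflow definition.

```python
import csv
import io
import json
import sqlite3


def collect_records(csv_text, json_text, db_conn):
    """Merge rows from three hypothetical sources into one list of dicts."""
    records = []
    # File source: CSV text with a header row.
    for row in csv.DictReader(io.StringIO(csv_text)):
        records.append({**row, "source": "file"})
    # API source: a JSON array of objects, as a response body might contain.
    for row in json.loads(json_text):
        records.append({**row, "source": "api"})
    # Database source: rows from a hypothetical `samples` table.
    cursor = db_conn.execute("SELECT id, text FROM samples")
    columns = [col[0] for col in cursor.description]
    for values in cursor:
        records.append({**dict(zip(columns, values)), "source": "database"})
    return records


# Usage with in-memory stand-ins for the three sources.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (id INTEGER, text TEXT)")
conn.execute("INSERT INTO samples VALUES (3, 'from db')")
raw = collect_records(
    "id,text\n1,from file\n",
    '[{"id": 2, "text": "from api"}]',
    conn,
)
```

Tagging each row with its `source` keeps provenance available to later steps. Note that CSV values arrive as strings, so type normalization is left to the cleaning step.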

Step 2: Data cleaning

Agent: data-scientist - Handle missing values, remove duplicates, fix inconsistencies
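The three cleaning tasks could look roughly like this in plain Python. The function `clean_records`, its parameters, and the choice to normalize by trimming and lowercasing strings are illustrative assumptions; a real pipeline would pick rules that suit its data.

```python
def clean_records(records, required=("text",), defaults=None):
    """Normalize strings, fill or drop missing values, and drop duplicates."""
    defaults = defaults or {}
    seen = set()
    cleaned = []
    for rec in records:
        # Fix inconsistencies: trim whitespace and lowercase string values.
        row = {
            key: value.strip().lower() if isinstance(value, str) else value
            for key, value in rec.items()
        }
        # Handle missing values: fill from defaults where one is provided.
        for key, default in defaults.items():
            if row.get(key) in (None, ""):
                row[key] = default
        # Drop rows that still lack a required field.
        if any(row.get(key) in (None, "") for key in required):
            continue
        # Remove duplicates: keep the first occurrence of each normalized row.
        fingerprint = tuple(sorted(row.items()))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(row)
    return cleaned


rows = [
    {"text": " Hello ", "lang": "EN"},
    {"text": "hello", "lang": "en"},   # duplicate after normalization
    {"text": "", "lang": "en"},        # missing required field
    {"text": "bye", "lang": None},     # missing optional field
]
cleaned = clean_records(rows, defaults={"lang": "unknown"})
```

Normalizing before deduplicating matters here: the first two rows only become identical once whitespace and case are fixed.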

Step 3: Data labeling

Agent: data-scientist - Apply labels (manual or semi-automated)
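One way to picture semi-automated labeling is a set of keyword rules that label the clear cases and route everything else to manual review. The sketch below assumes text records and a hypothetical `rules` mapping; neither is specified by the workflow.

```python
def label_records(records, rules):
    """Label rows matching exactly one rule; queue the rest for manual review."""
    labeled, needs_review = [], []
    for rec in records:
        matches = {
            label
            for label, keywords in rules.items()
            if any(word in rec["text"] for word in keywords)
        }
        if len(matches) == 1:
            # Unambiguous match: accept the rule's label and record its origin.
            labeled.append({**rec, "label": matches.pop(), "label_source": "rule"})
        else:
            # No match or conflicting matches: leave for a human annotator.
            needs_review.append(rec)
    return labeled, needs_review


rules = {"positive": ["great", "love"], "negative": ["awful", "hate"]}
records = [
    {"text": "love it"},
    {"text": "awful service"},
    {"text": "love the look, hate the price"},
    {"text": "arrived on tuesday"},
]
labeled, needs_review = label_records(records, rules)
```

Recording `label_source` makes it possible to audit rule-generated labels separately from manual ones later.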

Step 4: Data augmentation

Agent: ml-engineer - Generate synthetic samples if needed (SMOTE, image transforms)
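To show the idea behind SMOTE, here is a simplified, dependency-free sketch: each synthetic point is an interpolation between a minority-class sample and one of its nearest minority-class neighbours. This is not the imbalanced-learn implementation, and the function name `smote_like` and its parameters are assumptions for illustration. It expects at least two minority samples.

```python
import math
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new points by interpolating between minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point, excluding itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        # Pick a random position on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
new_points = smote_like(minority, n_new=4)
```

Because every new point lies on a segment between two real samples, the synthetic data stays inside the region the minority class already occupies.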

Step 5: Statistical analysis

Agent: data-scientist - Descriptive stats, distribution checks, outlier detection
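A minimal sketch of this step for a single numeric column, using only the standard library. The 1.5 x IQR rule for outliers is one common convention chosen here for illustration; the workflow does not say which method it uses, and `describe` is a hypothetical helper name.

```python
import statistics


def describe(values):
    """Return basic descriptive statistics and IQR-based outliers."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    # Flag values more than 1.5 * IQR outside the quartiles.
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
        "outliers": [v for v in values if v < low or v > high],
    }


summary = describe([10, 12, 11, 13, 12, 11, 95])
print(summary["outliers"])  # [95]
```

Comparing the mean against the median in the result is a quick distribution check: a large gap, as in this example, suggests skew or extreme values.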

Step 6: Dataset splitting

Agent: ml-engineer - Train/validation/test split with stratification
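Stratification means splitting each class separately so that every split keeps roughly the same class proportions. The sketch below assumes a 70/15/15 ratio and a `label` field on each record; both are illustrative choices, not values taken from the workflow.

```python
import random
from collections import defaultdict


def stratified_split(records, label_key="label", ratios=(0.7, 0.15, 0.15), seed=42):
    """Split records into train/validation/test, preserving class proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec[label_key]].append(rec)

    train, val, test = [], [], []
    for group in by_label.values():
        # Shuffle within each class, then cut the class at the ratio boundaries.
        rng.shuffle(group)
        n_train = round(len(group) * ratios[0])
        n_val = round(len(group) * ratios[1])
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]  # remainder goes to the test split
    return train, val, test


records = [{"id": i, "label": "a" if i < 80 else "b"} for i in range(100)]
train, val, test = stratified_split(records)
```

Fixing the `seed` keeps the split reproducible, which helps when the same dataset version is rebuilt later.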

Step 7: Metadata documentation

Agent: data-scientist - Document schema, statistics, and versioning
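A metadata record for a dataset might be assembled as follows. The field names and the use of a SHA-256 content hash as a version identifier are assumptions for this sketch; the workflow does not prescribe a metadata format. The schema is inferred from the first record only, so it assumes a non-empty, uniformly shaped dataset.

```python
import hashlib
import json
from collections import Counter


def build_metadata(records, label_key="label"):
    """Summarize schema, basic statistics, and a content hash for versioning."""
    # Serialize with sorted keys so the hash does not depend on key order.
    canonical = json.dumps(records, sort_keys=True).encode("utf-8")
    return {
        "schema": {key: type(value).__name__ for key, value in records[0].items()},
        "row_count": len(records),
        "label_counts": dict(Counter(r[label_key] for r in records)),
        "content_sha256": hashlib.sha256(canonical).hexdigest(),
    }


records = [
    {"text": "a", "label": "x"},
    {"text": "b", "label": "y"},
    {"text": "c", "label": "x"},
]
metadata = build_metadata(records)
print(json.dumps(metadata, indent=2))
```

Since the hash changes whenever any row changes, two metadata files with the same `content_sha256` describe the same data.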

Step 8: Validation

Agent: data-scientist - Verify split ratios, class balance, and data integrity
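The three checks named in this step can be sketched as a function that returns a list of problems, where an empty list means the splits passed. The expected ratios, the 2% tolerance, and the `id` and `label` field names are hypothetical; "data integrity" is interpreted here as no record appearing more than once across the splits.

```python
from collections import Counter


def validate_splits(train, val, test, expected=(0.7, 0.15, 0.15),
                    tolerance=0.02, id_key="id", label_key="label"):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    splits = {"train": train, "validation": val, "test": test}
    total = sum(len(rows) for rows in splits.values())
    overall = Counter(r[label_key] for rows in splits.values() for r in rows)

    for (name, rows), want in zip(splits.items(), expected):
        if not rows:
            problems.append(f"{name} split is empty")
            continue
        # Split ratio: compare the actual share of rows with the expected one.
        got = len(rows) / total
        if abs(got - want) > tolerance:
            problems.append(f"{name} ratio {got:.2f} differs from {want:.2f}")
        # Class balance: each label's share should match the overall share.
        counts = Counter(r[label_key] for r in rows)
        for label, n in overall.items():
            if abs(counts[label] / len(rows) - n / total) > tolerance:
                problems.append(f"{name} is unbalanced for label {label!r}")

    # Integrity: no record id may appear more than once across the splits.
    ids = [r[id_key] for rows in splits.values() for r in rows]
    if len(ids) != len(set(ids)):
        problems.append("duplicate ids across or within splits")
    return problems


def make_rows(start, n_a, n_b):
    """Build n_a rows labeled 'a' and n_b rows labeled 'b' with sequential ids."""
    labels = ["a"] * n_a + ["b"] * n_b
    return [{"id": start + i, "label": label} for i, label in enumerate(labels)]


train, val, test = make_rows(0, 56, 14), make_rows(70, 12, 3), make_rows(85, 12, 3)
problems = validate_splits(train, val, test)
```

Returning every problem rather than stopping at the first one gives a complete report in a single run.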

Usage

To execute this workflow:

/workflow devops/dataset-preparation.workflow

See other workflows in this category for related automation patterns.