Dataset Preparation
Automated dataset preparation for ML model training, covering data collection, cleaning, labeling, augmentation, statistical analysis, splitting, metadata documentation, and validation.
Complexity: Moderate | Duration: 15-30m | Category: DevOps
Tags: ml data-prep preprocessing etl
Steps
Step 1: Data collection
Agent: data-scientist - Aggregate raw data from sources (API, database, files)
Step 2: Data cleaning
Agent: data-scientist - Handle missing values, remove duplicates, fix inconsistencies
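A minimal sketch of the cleaning step, assuming records arrive as a list of dictionaries with hashable values. The function name clean_records and the policy of dropping incomplete rows (rather than imputing them) are illustrative assumptions, not part of the workflow definition.

```python
from typing import Any


def clean_records(records: list[dict[str, Any]], required: list[str]) -> list[dict[str, Any]]:
    """Normalize strings, drop rows missing required fields, remove exact duplicates."""
    seen: set[tuple] = set()
    cleaned: list[dict[str, Any]] = []
    for row in records:
        # Fix inconsistencies: trim whitespace and treat empty strings as missing.
        norm: dict[str, Any] = {}
        for key, value in row.items():
            if isinstance(value, str):
                value = value.strip()
                if value == "":
                    value = None
            norm[key] = value
        # Handle missing values: this sketch simply drops incomplete rows.
        if any(norm.get(field) is None for field in required):
            continue
        # Remove duplicates, keeping the first occurrence (values must be hashable).
        fingerprint = tuple(sorted(norm.items(), key=lambda kv: kv[0]))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(norm)
    return cleaned
```

Dropping rows is the simplest policy. A real pipeline might impute values or route incomplete rows to a review queue instead.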
Step 3: Data labeling
Agent: data-scientist - Apply labels (manual or semi-automated)
Step 4: Data augmentation
Agent: ml-engineer - Generate synthetic samples if needed (SMOTE, image transforms)
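To illustrate the idea behind SMOTE, the sketch below interpolates between a minority-class sample and one of its nearest minority-class neighbours. It is a simplified, standard-library approximation for numeric feature vectors, not the implementation from any particular library. The function name smote_like and its parameters are assumptions made for this example.

```python
import math
import random


def smote_like(minority: list[list[float]], n_new: int, k: int = 3, seed: int = 0) -> list[list[float]]:
    """Create n_new synthetic points by interpolating between minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic: list[list[float]] = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Find the k nearest minority neighbours of the base point (excluding itself).
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        mate = rng.choice(neighbours)
        # Pick a random position on the segment between base and mate.
        gap = rng.random()
        synthetic.append([b + gap * (m - b) for b, m in zip(base, mate)])
    return synthetic
```

Each synthetic point lies on a line segment between two existing minority samples, so it stays inside the region the minority class already occupies.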
Step 5: Statistical analysis
Agent: data-scientist - Descriptive stats, distribution checks, outlier detection
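As a sketch of this step for a single numeric column, the functions below compute basic descriptive statistics and flag outliers with the common 1.5 x IQR rule. The workflow does not say which outlier method it uses, so the IQR rule, the function names, and the factor of 1.5 are assumptions.

```python
import statistics


def describe(values: list[float]) -> dict[str, float]:
    """Return basic descriptive statistics for one numeric column."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }


def iqr_outliers(values: list[float], factor: float = 1.5) -> list[float]:
    """Return values lying more than factor * IQR outside the quartiles."""
    stats = describe(values)
    iqr = stats["q3"] - stats["q1"]
    low = stats["q1"] - factor * iqr
    high = stats["q3"] + factor * iqr
    return [v for v in values if v < low or v > high]
```

The IQR rule is robust to the outliers it is trying to find, which makes it a reasonable default before checking distribution shape in more detail.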
Step 6: Dataset splitting
Agent: ml-engineer - Train/validation/test split with stratification
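A stratified split can be sketched as follows: group sample indices by class, shuffle each group, and cut every group at the same proportions. The 80/10/10 ratios, the function name stratified_split, and the fixed seed are assumptions chosen for illustration. The workflow itself does not specify them.

```python
import random
from collections import defaultdict


def stratified_split(labels: list, ratios: tuple = (0.8, 0.1, 0.1), seed: int = 0):
    """Return (train, val, test) index lists that preserve per-class proportions."""
    by_class: dict = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train: list[int] = []
    val: list[int] = []
    test: list[int] = []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * ratios[0])
        n_val = round(len(indices) * ratios[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        # Whatever remains goes to the test split, so no index is lost.
        test += indices[n_train + n_val:]
    return train, val, test
```

Splitting per class keeps rare classes represented in every split, which a plain random split does not guarantee on small or imbalanced datasets.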
Step 7: Metadata documentation
Agent: data-scientist - Document schema, statistics, and versioning
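One possible shape for the metadata record is sketched below: an inferred schema, a record count, and a content hash that can act as a version fingerprint. The field names and the build_metadata helper are hypothetical, and the sketch assumes records are JSON-serializable dictionaries.

```python
import hashlib
import json


def build_metadata(records: list[dict], version: str) -> dict:
    """Summarize a dataset: inferred schema, size, and a content hash for versioning."""
    schema: dict[str, str] = {}
    for row in records:
        for key, value in row.items():
            # Record the type of the first value seen for each field.
            schema.setdefault(key, type(value).__name__)
    # sort_keys makes the hash independent of key order within each record.
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return {
        "version": version,
        "num_records": len(records),
        "schema": schema,
        "sha256": hashlib.sha256(payload).hexdigest(),
    }
```

Storing the hash next to the version label makes it possible to detect when a dataset changed without its version being bumped.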
Step 8: Validation
Agent: data-scientist - Verify split ratios, class balance, and data integrity
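The three checks named in this step can be sketched as a single function that returns a list of problems, with an empty list meaning the splits passed. The function name validate_splits, the dictionary-of-index-lists input format, and the 2% tolerance are assumptions for this example.

```python
from collections import Counter


def validate_splits(splits: dict[str, list[int]], labels: list,
                    expected: dict[str, float], tol: float = 0.02) -> list[str]:
    """Check integrity, split ratios, and class balance; return a list of problems."""
    problems: list[str] = []
    all_idx = [i for indices in splits.values() for i in indices]

    # Data integrity: every sample appears in exactly one split.
    if len(set(all_idx)) != len(all_idx):
        problems.append("splits overlap")
    if set(all_idx) != set(range(len(labels))):
        problems.append("splits do not cover the dataset")

    # Split ratios: compare actual shares against the expected ones.
    total = len(all_idx)
    for name, ratio in expected.items():
        actual = len(splits.get(name, [])) / total if total else 0.0
        if abs(actual - ratio) > tol:
            problems.append(f"{name}: ratio {actual:.2f}, expected {ratio:.2f}")

    # Class balance: each split should mirror the overall class shares.
    overall = Counter(labels)
    for name, indices in splits.items():
        if not indices:
            problems.append(f"{name}: empty split")
            continue
        counts = Counter(labels[i] for i in indices if 0 <= i < len(labels))
        for cls, n in overall.items():
            if abs(counts[cls] / len(indices) - n / len(labels)) > tol:
                problems.append(f"{name}: class {cls!r} share differs from overall")
    return problems
```

Returning all problems at once, rather than raising on the first failure, gives a complete report for each run of the validation step.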
Usage
To execute this workflow:
/workflow devops/dataset-preparation.workflow
Related Workflows
See other workflows in this category for related automation patterns.