Data Engineering Workflows
Version: 1.0.0 Status: Production Last Updated: December 28, 2025 Category: Data Engineering
Workflow Overview
This document provides a comprehensive library of data engineering workflows for the CODITECT platform. These workflows cover ETL/ELT pipelines, data quality management, real-time streaming, data migration, and data warehouse automation. Each workflow includes detailed phase breakdowns, inputs/outputs, and success criteria to ensure reliable data operations.
Inputs
| Input | Type | Required | Description |
|---|---|---|---|
| source_config | object | Yes | Data source connection configuration |
| schema_definition | object | Yes | Schema for source and target data |
| transformation_rules | array | Yes | Data transformation specifications |
| quality_rules | array | No | Data quality validation rules |
| schedule | string | No | Cron expression for scheduled runs |
| destination_config | object | Yes | Target destination configuration |
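To make the table concrete, here is an illustrative input payload. The field names follow the Inputs table; the connection details, rule shapes, and values are assumptions for illustration, not a fixed schema.

```python
# Illustrative pipeline input payload (field names follow the Inputs table;
# the nested shapes and values are hypothetical).
pipeline_input = {
    "source_config": {"type": "postgres", "host": "db.internal", "database": "sales"},
    "schema_definition": {
        "source": {"order_id": "string", "amount": "float"},
        "target": {"order_id": "string", "amount": "float"},
    },
    "transformation_rules": [
        {"field": "amount", "op": "round", "args": {"digits": 2}},
    ],
    "quality_rules": [
        {"field": "order_id", "check": "not_null"},
    ],
    "schedule": "0 2 * * *",  # optional: run daily at 02:00
    "destination_config": {"type": "warehouse", "table": "analytics.orders"},
}

# Basic sanity check: every required input is present.
required = {"source_config", "schema_definition", "transformation_rules", "destination_config"}
missing = required - pipeline_input.keys()
print(sorted(missing))  # an empty list means the payload is complete
```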
Outputs
| Output | Type | Description |
|---|---|---|
| job_id | string | Unique identifier for the pipeline run |
| records_processed | integer | Number of records processed |
| records_failed | integer | Number of records that failed validation |
| quality_score | float | Overall data quality score (0-100) |
| execution_metrics | object | Performance metrics (duration, throughput) |
| lineage_info | object | Data lineage and transformation history |
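A matching run summary might look as follows. This is a sketch: the job_id format and metric names are hypothetical, and the quality_score here is a simple pass-rate, whereas a real scorer may weight multiple rule types.

```python
# Illustrative run summary matching the Outputs table.
records_processed = 10_000
records_failed = 25

result = {
    "job_id": "etl-2025-12-28-001",   # hypothetical identifier format
    "records_processed": records_processed,
    "records_failed": records_failed,
    # Simple pass-rate on a 0-100 scale.
    "quality_score": round(100 * (records_processed - records_failed) / records_processed, 2),
    "execution_metrics": {"duration_s": 312.4, "throughput_rps": 32.0},
    "lineage_info": {"source": "sales.orders", "target": "analytics.orders"},
}
print(result["quality_score"])  # 99.75
```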
Phase 1: Data Ingestion
Initial phase extracts data from source systems:
- Source Connection - Establish connection to data sources
- Schema Discovery - Detect and validate source schema
- Data Extraction - Extract data in batches or streams
- Initial Validation - Validate format and basic integrity
- Raw Storage - Store raw data in landing zone
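The steps above can be sketched as a minimal batch ingestion loop: extract in batches, run a basic integrity check, and stash passing records in a landing zone. The source here is a plain list standing in for a real connection, and the landing zone is in-memory for illustration.

```python
def extract_batches(source, batch_size):
    """Yield fixed-size batches from the source (Data Extraction step)."""
    for i in range(0, len(source), batch_size):
        yield source[i:i + batch_size]

def basic_integrity_ok(record, required_fields):
    """Initial Validation: every required field is present and non-None."""
    return all(record.get(f) is not None for f in required_fields)

source_rows = [
    {"order_id": "A1", "amount": 10.0},
    {"order_id": "A2", "amount": 20.0},
    {"order_id": None, "amount": 5.0},   # fails the integrity check
]

landing_zone, rejected = [], []
for batch in extract_batches(source_rows, batch_size=2):
    for rec in batch:
        if basic_integrity_ok(rec, ["order_id", "amount"]):
            landing_zone.append(rec)   # Raw Storage
        else:
            rejected.append(rec)

print(len(landing_zone), len(rejected))  # 2 1
```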
Phase 2: Transformation & Quality
Core phase transforms data and validates quality:
- Data Cleaning - Handle nulls, duplicates, inconsistencies
- Transformation - Apply business logic transformations
- Enrichment - Join with reference data
- Quality Validation - Run quality checks
- Quarantine Handling - Route failed records
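A minimal sketch of this phase, assuming toy rules: deduplicate, apply one transformation, validate a quality rule, and route failures to quarantine. The rule shapes are illustrative.

```python
raw = [
    {"order_id": "A1", "amount": 10.004},
    {"order_id": "A1", "amount": 10.004},   # duplicate
    {"order_id": "A2", "amount": -3.0},     # fails the quality rule
]

# Data Cleaning: drop duplicate keys, preserving first occurrence.
seen, cleaned = set(), []
for rec in raw:
    if rec["order_id"] not in seen:
        seen.add(rec["order_id"])
        cleaned.append(rec)

# Transformation: round amounts to 2 decimal places.
transformed = [{**r, "amount": round(r["amount"], 2)} for r in cleaned]

# Quality Validation + Quarantine Handling: amounts must be non-negative.
passed = [r for r in transformed if r["amount"] >= 0]
quarantine = [r for r in transformed if r["amount"] < 0]

print(len(passed), len(quarantine))  # 1 1
```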
Phase 3: Loading & Cataloging
Final phase loads data to targets and updates metadata:
- Target Loading - Load transformed data to destination
- Index Updates - Update search and query indexes
- Catalog Update - Update data catalog with new datasets
- Lineage Recording - Record data lineage
- Notification - Alert stakeholders of completion
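The loading, cataloging, and lineage steps can be sketched as follows, with in-memory stand-ins for the destination, catalog, and lineage store; real implementations would write to a warehouse and a metadata service.

```python
from datetime import datetime, timezone

target_table = []   # stands in for the destination
catalog = {}        # dataset name -> metadata
lineage = []        # ordered transformation history

def load(records, table_name):
    """Target Loading, Catalog Update, and Lineage Recording in one call."""
    target_table.extend(records)
    catalog[table_name] = {
        "row_count": len(target_table),
        "updated_at": datetime.now(timezone.utc).isoformat(),
    }
    lineage.append({"step": "load", "target": table_name, "rows": len(records)})

load([{"order_id": "A1", "amount": 10.0}], "analytics.orders")
print(catalog["analytics.orders"]["row_count"], lineage[-1]["step"])  # 1 load
```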
Data Engineering Workflow Library
1. etl-pipeline-workflow
- Description: Complete ETL pipeline from source extraction to target loading
- Trigger: /run-etl or schedule
- Complexity: complex
- Duration: 15-60m
- QA Integration: validation: required, review: required
- Dependencies:
- Agents: data-engineer, data-architect
- Commands: /extract, /transform, /load
- Steps:
- Source connection - data-engineer - Connect to source systems
- Data extraction - data-engineer - Extract data in batches
- Transformation - data-engineer - Apply transformation rules
- Quality check - data-architect - Validate data quality
- Loading - data-engineer - Load to target destination
- Tags: [etl, pipeline, batch, data-engineering]
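The five steps above can be sketched as a sequential runner that records which agent handled each step. The dispatch is a stub; a real orchestrator would invoke the agents and commands listed under Dependencies.

```python
steps = [
    ("source_connection", "data-engineer"),
    ("data_extraction", "data-engineer"),
    ("transformation", "data-engineer"),
    ("quality_check", "data-architect"),
    ("loading", "data-engineer"),
]

def run_pipeline(steps):
    """Run steps in order, recording the handling agent and outcome."""
    log = []
    for name, agent in steps:
        # A real runner would dispatch to the agent; here we just record it.
        log.append({"step": name, "agent": agent, "status": "ok"})
    return log

log = run_pipeline(steps)
print(len(log), log[3]["agent"])  # 5 data-architect
```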
2. data-quality-workflow
- Description: Comprehensive data quality validation and monitoring
- Trigger: /data-quality-check or pipeline completion
- Complexity: moderate
- Duration: 5-30m
- QA Integration: validation: required, review: recommended
- Dependencies:
- Agents: data-engineer, data-analyst
- Commands: /validate-data, /quality-report
- Steps:
- Schema validation - data-engineer - Verify schema compliance
- Completeness check - data-engineer - Check for missing values
- Accuracy validation - data-analyst - Validate data accuracy
- Consistency check - data-engineer - Cross-reference validation
- Report generation - data-analyst - Generate quality report
- Tags: [data-quality, validation, monitoring]
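The schema-validation and completeness steps can be sketched as per-row checks that feed a summary report; the schema shape and scoring are illustrative, and the score is a simple row pass-rate.

```python
schema = {"order_id": str, "amount": float}
rows = [
    {"order_id": "A1", "amount": 10.0},
    {"order_id": "A2", "amount": None},   # completeness failure
    {"order_id": 3, "amount": 7.5},       # schema failure
]

def check_row(row):
    """Schema validation + completeness check for one row."""
    issues = []
    for field, ftype in schema.items():
        value = row.get(field)
        if value is None:
            issues.append(f"{field}: missing")                       # completeness
        elif not isinstance(value, ftype):
            issues.append(f"{field}: expected {ftype.__name__}")     # schema
    return issues

# Report generation: row index -> list of issues, plus a pass-rate score.
report = {}
for i, row in enumerate(rows):
    issues = check_row(row)
    if issues:
        report[i] = issues

quality_score = round(100 * (len(rows) - len(report)) / len(rows), 1)
print(quality_score, sorted(report))  # 33.3 [1, 2]
```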
3. streaming-pipeline-workflow
- Description: Real-time data streaming pipeline with Kafka/Kinesis
- Trigger: Continuous stream
- Complexity: complex
- Duration: Continuous
- QA Integration: validation: required, review: required
- Dependencies:
- Agents: data-engineer, streaming-specialist
- Commands: /stream-start, /stream-status
- Steps:
- Stream setup - streaming-specialist - Configure stream consumers
- Event processing - data-engineer - Process events in real-time
- Windowing - streaming-specialist - Apply time windows
- State management - data-engineer - Manage processing state
- Output routing - streaming-specialist - Route to sinks
- Tags: [streaming, kafka, real-time, event-driven]
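The windowing step can be sketched with a tumbling (fixed, non-overlapping) window: events are bucketed by timestamp and aggregated per window. The event shape is an assumption; real consumers would read from Kafka or Kinesis rather than a list.

```python
from collections import defaultdict

def tumbling_window(events, window_s):
    """Assign each event to the fixed window containing its timestamp."""
    windows = defaultdict(list)
    for ev in events:
        window_start = (ev["ts"] // window_s) * window_s
        windows[window_start].append(ev)
    return dict(windows)

events = [{"ts": 3, "v": 1}, {"ts": 7, "v": 2}, {"ts": 12, "v": 3}]
per_window = {w: len(evs) for w, evs in tumbling_window(events, 10).items()}
print(per_window)  # {0: 2, 10: 1}
```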
4. data-migration-workflow
- Description: Large-scale data migration with validation and reconciliation
- Trigger: /migrate-data or manual
- Complexity: complex
- Duration: 1h+
- QA Integration: validation: required, review: required
- Dependencies:
- Agents: data-engineer, data-architect, dba
- Commands: /migrate, /reconcile
- Steps:
- Source analysis - data-architect - Analyze source system
- Mapping definition - data-architect - Define source-target mapping
- Test migration - data-engineer - Run test migration
- Full migration - data-engineer - Execute full migration
- Reconciliation - dba - Verify data integrity
- Tags: [migration, data-movement, reconciliation]
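The reconciliation step can be sketched by comparing row counts and per-row checksums between source and target; the checksum scheme (sorted key=value pairs hashed with SHA-256) is one illustrative choice, not a prescribed format.

```python
import hashlib

def row_checksum(row):
    """Hash a canonical 'k=v|k=v' rendering of the row."""
    canonical = "|".join(f"{k}={row[k]}" for k in sorted(row))
    return hashlib.sha256(canonical.encode()).hexdigest()

def reconcile(source_rows, target_rows, key):
    """Compare counts, membership, and content between source and target."""
    src = {r[key]: row_checksum(r) for r in source_rows}
    tgt = {r[key]: row_checksum(r) for r in target_rows}
    return {
        "count_match": len(src) == len(tgt),
        "missing_in_target": sorted(src.keys() - tgt.keys()),
        "checksum_mismatches": sorted(k for k in src.keys() & tgt.keys()
                                      if src[k] != tgt[k]),
    }

source = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
target = [{"id": 1, "v": "a"}, {"id": 2, "v": "B"}]  # drifted value
recon = reconcile(source, target, "id")
print(recon["count_match"], recon["checksum_mismatches"])  # True [2]
```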
5. data-catalog-workflow
- Description: Automated data cataloging and metadata management
- Trigger: /catalog-update or data ingestion
- Complexity: moderate
- Duration: 5-15m
- QA Integration: validation: required, review: optional
- Dependencies:
- Agents: data-engineer, data-steward
- Commands: /catalog-scan, /tag-data
- Steps:
- Discovery - data-engineer - Scan for new datasets
- Profiling - data-engineer - Profile data characteristics
- Classification - data-steward - Classify sensitive data
- Tagging - data-steward - Apply metadata tags
- Publishing - data-engineer - Publish to catalog
- Tags: [catalog, metadata, governance]
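The profiling and classification steps can be sketched as simple column statistics plus a name-based sensitivity flag. The heuristics here are illustrative; real classification of sensitive data would involve a data steward and more than a substring match.

```python
rows = [
    {"email": "a@example.com", "amount": 10.0},
    {"email": "b@example.com", "amount": None},
]

SENSITIVE_HINTS = ("email", "ssn", "phone")  # illustrative name heuristics

def profile(rows):
    """Profiling: per-column null ratio, distinct count, sensitivity flag."""
    cols = {}
    for col in rows[0]:
        values = [r[col] for r in rows]
        non_null = [v for v in values if v is not None]
        cols[col] = {
            "null_ratio": 1 - len(non_null) / len(values),
            "distinct": len(set(non_null)),
            "sensitive": any(h in col.lower() for h in SENSITIVE_HINTS),
        }
    return cols

p = profile(rows)
print(p["email"]["sensitive"], p["amount"]["null_ratio"])  # True 0.5
```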
Success Criteria
| Criterion | Target | Measurement |
|---|---|---|
| Pipeline Success Rate | >= 99.5% | Successful runs / Total runs |
| Data Quality Score | >= 95% | Average quality score |
| Processing Latency | < 15m batch, < 1s stream | P95 latency |
| Data Freshness | < 1h batch, < 5m stream | Time since last update |
| Record Error Rate | < 0.1% | Failed records / Total records |
| Schema Evolution Success | 100% | Compatible changes / Total changes |
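The rate-based criteria follow directly from run counters, as the sketch below shows; the counter values are illustrative.

```python
# Illustrative run counters.
total_runs, successful_runs = 2000, 1992
total_records, failed_records = 5_000_000, 3_200

# Pipeline Success Rate = successful runs / total runs (as a percentage).
pipeline_success_rate = 100 * successful_runs / total_runs
# Record Error Rate = failed records / total records (as a percentage).
record_error_rate = 100 * failed_records / total_records

print(round(pipeline_success_rate, 2), round(record_error_rate, 4))  # 99.6 0.064
print(pipeline_success_rate >= 99.5, record_error_rate < 0.1)        # True True
```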
Error Handling
| Error Type | Recovery Strategy | Escalation |
|---|---|---|
| Source connection failure | Retry with backoff | Alert after 3 failures |
| Schema mismatch | Quarantine and alert | Alert data steward |
| Transformation error | Log and skip record | Alert on high error rate |
| Quality validation failure | Route to quarantine | Alert data owner |
| Target write failure | Retry with idempotency | Alert on persistent failure |
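The "retry with backoff" strategy for source connection failures can be sketched as follows: retry a flaky operation with exponentially growing delays and escalate after three consecutive failures, matching the first row above. The delays are computed but not slept to keep the sketch self-contained.

```python
def retry_with_backoff(op, max_attempts=3, base_delay_s=1.0):
    """Retry op with exponential backoff; escalate after max_attempts failures."""
    delays, last_exc = [], None
    for attempt in range(max_attempts):
        try:
            return op(), delays
        except ConnectionError as exc:
            last_exc = exc
            delays.append(base_delay_s * 2 ** attempt)  # 1s, 2s, 4s, ...
            # time.sleep(delays[-1]) in a real pipeline
    raise RuntimeError(f"escalate: {max_attempts} consecutive failures") from last_exc

calls = {"n": 0}
def flaky_connect():
    """Fails twice, then succeeds (simulating a transient source outage)."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("source unavailable")
    return "connected"

result, delays = retry_with_backoff(flaky_connect)
print(result, delays)  # connected [1.0, 2.0]
```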
Related Resources
- AI-ML-DEVELOPMENT-WORKFLOWS.md - ML workflows
- ANALYTICS-BI-WORKFLOWS.md - Analytics workflows
- WORKFLOW-LIBRARY-INDEX.md - Complete workflow catalog
Maintainer: CODITECT Core Team Standard: CODITECT-STANDARD-WORKFLOWS v1.0.0