# Foundations

> "Data problems aren't boring. They're just badly explained."

## What is Data Engineering?
Data Engineering is the discipline of designing, building, and operating systems that transform raw data into reliable, accessible, and actionable information at scale. Unlike data science (which focuses on analysis and modeling) or software engineering (which focuses on application logic), data engineering sits at the intersection of infrastructure, reliability, and data product delivery.
"Data engineering isn't plumbing. It's product design with consequences."
Modern Definition¶
At its core, data engineering is about:
- Reliability: Ensuring data arrives on time, in the right format, with the right quality
- Scale: Handling terabytes to petabytes of data across thousands of pipelines
- Velocity: Supporting both batch and real-time use cases
- Governance: Maintaining lineage, quality, and compliance
- Cost Efficiency: Delivering value without breaking the bank
### The Shift: From ETL to Platform

Traditional data engineering focused on ETL pipelines: point-to-point data movement with transformation logic embedded in each pipeline. Modern data engineering is about platforms: shared infrastructure that gives teams standardized building blocks at every layer (see the sketch after this list):

- Ingestion: Standardized patterns and contracts for pipeline creation with minimal friction
- Transformation: Managed compute environments supporting multiple tools and reusable frameworks
- Storage: Cost-appropriate tiering with automated lifecycle management and flexible formats
- Serving: Multiple access patterns (SQL, APIs, feature stores) optimized for analytics, ML, and operational use cases
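What "minimal friction" can look like in practice: the sketch below declares a pipeline across these four layers and leaves the mechanics to the platform. All names and fields are illustrative assumptions, not any particular platform's API.

```python
from dataclasses import dataclass, field

# Hypothetical declarative pipeline spec: a team states *what* it needs at
# each layer; the platform supplies the *how* (connectors, compute, tiering).
@dataclass
class PipelineSpec:
    name: str
    source: str                                   # ingestion: where data comes from
    transform: str                                # transformation: managed logic to run
    storage_tier: str                             # storage: "raw", "curated", or "archive"
    serving: list = field(default_factory=list)   # serving: access patterns to expose

spec = PipelineSpec(
    name="orders_daily",
    source="postgres://erp/orders",       # via a standardized CDC connector
    transform="sql/orders_curated.sql",   # runs on platform-managed compute
    storage_tier="curated",
    serving=["sql", "api"],
)
print(spec.name, "->", spec.serving)
```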
## Core Principles

### 1. Data as a Product
Treat data assets as first-class products, not byproducts of applications.
Implications:

- Clear ownership and accountability
- Defined SLAs (freshness, availability, quality)
- Versioned schemas and contracts
- Documentation and discoverability
- Lifecycle management
Anti-pattern: "Just dump the data somewhere and we'll figure it out later."
### 2. Separation of Concerns
Establish clear boundaries between:
- Ingestion: Getting data into the platform
- Transformation: Shaping data for consumption
- Storage: Persisting data in appropriate formats/tiers
- Serving: Delivering data to consumers
Why it matters: Each layer can evolve independently, scale independently, and fail independently.
```
┌─────────────┐
│  Ingestion  │ ← Push/pull, CDC, streaming
└──────┬──────┘
       │
┌──────▼──────┐
│   Storage   │ ← Raw, curated, archive tiers
└──────┬──────┘
       │
┌──────▼──────┐
│  Transform  │ ← ELT, streaming transforms
└──────┬──────┘
       │
┌──────▼──────┐
│   Serving   │ ← Analytics, ML, APIs
└─────────────┘
```
### 3. Platform Thinking

Build self-serve capabilities that enable teams instead of creating bottlenecks.
Platform provides:

- Standardized ingestion paths
- Managed compute (Spark, Flink, etc.)
- Storage abstractions (tables, partitions, lifecycle)
- Metadata and discovery
- Observability and alerting

Teams provide:

- Business logic
- Transformation code
- Quality checks
- Documentation
Anti-pattern: Central team manually creates every pipeline.
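The split is easiest to see in code. In the hypothetical sketch below, the platform owns the runner (where retries, metrics, and alerting would live) and teams only register business logic:

```python
# Platform side: a registry plus a runner. In a real system the runner
# would wrap team code with retries, metrics, and alerting.
TRANSFORMS = {}

def transform(name):
    """Platform-provided decorator: teams self-serve by registering logic."""
    def wrap(fn):
        TRANSFORMS[name] = fn
        return fn
    return wrap

def run(name, rows):
    return TRANSFORMS[name](rows)

# Team side: business logic and quality checks only.
@transform("orders_curated")
def clean_orders(rows):
    return [r for r in rows if r.get("amount", 0) > 0]

print(run("orders_curated", [{"amount": 10}, {"amount": -1}]))  # one row survives
```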
### 4. Cost Awareness
Every architectural decision has cost implications. Make them explicit.
Key cost drivers:

- Compute: Query execution, transformation jobs
- Storage: Hot, warm, cold tiers
- Network: Cross-region transfers, egress
- Operations: Pipeline maintenance, incident response

Principle: Start with the cheapest solution that meets requirements. Optimize once you have real usage data.
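Making storage costs explicit can be as simple as a back-of-envelope calculation. The per-GB prices below are placeholder assumptions, not any vendor's actual pricing:

```python
# Monthly storage cost across tiers, with illustrative per-GB-month prices.
TIER_PRICE_PER_GB = {"hot": 0.023, "warm": 0.0125, "cold": 0.004}

def monthly_storage_cost(gb_by_tier: dict) -> float:
    return sum(gb * TIER_PRICE_PER_GB[tier] for tier, gb in gb_by_tier.items())

# Moving 90% of 50 TB from the hot tier to cold cuts the bill by about 74%.
print(monthly_storage_cost({"hot": 50_000}))                 # 1150.0 per month
print(monthly_storage_cost({"hot": 5_000, "cold": 45_000}))  # 295.0 per month
```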
### 5. Contract-First Design
Define data contracts before ingestion begins.
Contract includes:

- Schema (with evolution rules)
- Freshness SLA
- Quality expectations
- Ownership and contact
- Cost attribution
Benefit: Prevents downstream breakage, enables automated validation.
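A minimal sketch of what that automated validation can look like, assuming a contract expressed as a plain dict (field names are illustrative):

```python
from datetime import timedelta

# Contract defined before ingestion begins; validation is derived from it.
contract = {
    "dataset": "orders",
    "owner": "commerce-data@example.com",
    "freshness_sla": timedelta(hours=1),
    "schema": {"order_id": str, "amount": float},
}

def validate(record: dict) -> list:
    """Check a record against the contract's schema."""
    errors = []
    for name, typ in contract["schema"].items():
        if name not in record:
            errors.append(f"missing field: {name}")
        elif not isinstance(record[name], typ):
            errors.append(f"{name}: expected {typ.__name__}, got {type(record[name]).__name__}")
    return errors

print(validate({"order_id": "A-1", "amount": "12.50"}))  # amount arrives as a string
```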
## Platform Maturity Model

### Level 1: Ad-Hoc
- Manual pipeline creation
- No standard patterns
- Limited observability
- High operational burden
### Level 2: Standardized
- Common ingestion patterns
- Standardized storage formats
- Basic monitoring
- Some self-serve capabilities
### Level 3: Platform
- Self-serve ingestion
- Automated quality checks
- Rich metadata and discovery
- Cost attribution and optimization
### Level 4: Product
- Data contracts enforced
- Predictive quality monitoring
- Automated optimization
- Multi-tenant isolation
## Key Concepts

### Data Freshness
Freshness = Time between when data is generated and when it's available for consumption.
Categories:

- Real-time: < 1 minute (streaming)
- Near real-time: 1-15 minutes (micro-batch)
- Batch: 15 minutes to 24 hours
- Historical: > 24 hours (backfills, archives)
Trade-off: Lower latency = higher cost.
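As a quick illustration of the categories above (boundaries as listed, treated as inclusive on the upper edge):

```python
from datetime import datetime, timedelta, timezone

def freshness_category(generated_at: datetime, available_at: datetime) -> str:
    """Map the lag between generation and availability to a category."""
    minutes = (available_at - generated_at).total_seconds() / 60
    if minutes < 1:
        return "real-time"
    if minutes <= 15:
        return "near real-time"
    if minutes <= 24 * 60:
        return "batch"
    return "historical"

now = datetime.now(timezone.utc)
print(freshness_category(now - timedelta(minutes=7), now))  # near real-time
```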
### Data Quality Dimensions
- Completeness: Are all expected records present?
- Accuracy: Does data reflect reality?
- Consistency: Is data consistent across sources?
- Timeliness: Is data fresh enough?
- Validity: Does data conform to schema?
- Uniqueness: Are there duplicates?
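Several of these dimensions reduce to one-line checks. A minimal sketch over plain dicts (in practice these run as SQL or framework assertions):

```python
rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 1, "email": "b@example.com"},  # duplicate id: uniqueness violation
    {"id": 2, "email": None},             # null email: completeness violation
]

def completeness(rows, column):
    """Share of records where the column is present and non-null."""
    return sum(r.get(column) is not None for r in rows) / len(rows)

def is_unique(rows, key):
    """True if no two records share the same key."""
    values = [r[key] for r in rows]
    return len(values) == len(set(values))

print(completeness(rows, "email"))  # ~0.67
print(is_unique(rows, "id"))        # False
```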
### Schema Evolution
Schemas change. Design for it.
Strategies:

- Backward compatible: New fields optional, old fields never removed
- Versioning: Explicit schema versions with migration paths
- Schema registry: Centralized schema management (e.g., Confluent Schema Registry)
Anti-pattern: Breaking changes without notice.
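Backward compatibility can be checked mechanically. A sketch, modeling a schema as a {field: required?} dict purely for illustration:

```python
def is_backward_compatible(old: dict, new: dict) -> bool:
    """Old fields are never removed; any new field must be optional."""
    removed = old.keys() - new.keys()
    new_required = [f for f in new.keys() - old.keys() if new[f]]
    return not removed and not new_required

v1 = {"order_id": True, "amount": True}
v2 = {"order_id": True, "amount": True, "coupon": False}  # optional addition
v3 = {"order_id": True, "coupon": True}                   # drops a field, adds a required one

print(is_backward_compatible(v1, v2))  # True
print(is_backward_compatible(v1, v3))  # False
```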
## Next Steps
- End-to-End Lifecycle - Understand the complete data journey
- Platform & Operating Model - Design your platform architecture