# End-to-End Lifecycle
> "Data freshness is just trust, measured in minutes."
The data lifecycle encompasses the complete journey from source systems to consumption. Understanding this flow is critical for designing scalable, maintainable data platforms.
> "Every broken pipeline started as 'we'll clean it later.'"
## Overview
```mermaid
graph LR
    A[Source Systems] --> B[Ingestion]
    B --> C[Storage Raw]
    C --> D[Transformation]
    D --> E[Storage Curated]
    E --> F[Serving]
    B1[Batch] --> B
    B2[Streaming] --> B
    B3[CDC] --> B
    C1[Data Lake] --> C
    C2[Object Storage] --> C
    D1[ELT] --> D
    D2[Stream Processing] --> D
    E1[Warehouse] --> E
    E2[Lakehouse] --> E
    F1[Analytics] --> F
    F2[ML Models] --> F
    F3[APIs] --> F
    style A fill:#e3f2fd
    style B fill:#80deea
    style C fill:#b2dfdb
    style D fill:#80deea
    style E fill:#b2dfdb
    style F fill:#e3f2fd
```
*End-to-end lifecycle from ingestion to trusted consumption.*
Each stage has distinct requirements, failure modes, and optimization opportunities.
## Stage 1: Ingestion

**Goal:** Get data from source systems into the platform reliably and efficiently.
### Ingestion Patterns
#### Batch Ingestion

- **When:** Historical loads, daily/hourly snapshots, large volumes
- **Tools:** Airflow, Spark, Dataflow (batch mode)
- **Characteristics:**
    - Scheduled execution
    - Full or incremental loads
    - Higher latency (minutes to hours)
    - Lower cost per GB
#### Streaming Ingestion

- **When:** Real-time analytics, event-driven systems, low-latency requirements
- **Tools:** Kafka, Pub/Sub, Kinesis, Flink, Dataflow (streaming)
- **Characteristics:**
    - Continuous processing
    - Low latency (seconds to minutes)
    - Higher cost per GB
    - More complex failure handling
#### Change Data Capture (CDC)

- **When:** Database replication, maintaining current state, audit trails
- **Tools:** Debezium, Datastream, DMS, Fivetran
- **Characteristics:**
    - Captures inserts, updates, and deletes
    - Maintains transaction consistency
    - Lower overhead than full extracts
    - Requires source database support
#### Push vs Pull

**Push (source-initiated):**

- The source system sends data to the platform
- Pros: real-time delivery; the source controls timing
- Cons: the source must handle retries; the platform must scale for bursts
- Use when: real-time requirements and the source has reliable infrastructure

**Pull (platform-initiated):**

- The platform queries the source system
- Pros: the platform controls the rate; easier backpressure
- Cons: polling overhead; may miss real-time events
- Use when: batch processing, the source can't push, or rate limiting is needed
### Ingestion Best Practices

- **Idempotency:** ingesting the same data multiple times produces the same result
- **Checkpointing:** track progress so failed runs can resume where they left off
- **Backpressure:** handle source unavailability and load spikes gracefully
- **Schema validation:** validate at the ingestion boundary, before bad data spreads downstream
- **Metadata capture:** record source, timestamp, and version with every load
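The first two practices work together: a checkpoint makes retries resumable, and resuming from the recorded offset keeps re-runs from duplicating rows. A minimal in-memory sketch (function and field names are illustrative, not a specific framework's API):

```python
# Hedged sketch: idempotent, checkpointed ingestion. The checkpoint records
# how many source records have already been written, so a retried run skips
# them instead of duplicating rows in the raw layer.

def ingest(records: list[dict], checkpoint: dict, sink: list[dict]) -> int:
    """Append records past the saved offset; re-running is a no-op."""
    start = checkpoint.get("offset", 0)
    for i, record in enumerate(records[start:], start=start):
        sink.append(record)           # write to raw storage (append-only)
        checkpoint["offset"] = i + 1  # persist progress after each record
    return len(sink)
```

A real pipeline would persist the checkpoint durably (object storage, a metadata table) and include source and timestamp metadata with each record.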
## Stage 2: Storage (Raw Layer)

**Goal:** Preserve source data exactly as received, with minimal transformation.
### Characteristics

- **Immutable:** never modify raw data (append-only)
- **Schema-on-read:** store in flexible formats (JSON, Avro, Parquet)
- **Partitioned:** by ingestion time, source, or business key
- **Retention:** long-term retention for audit and reprocessing
### Storage Formats
| Format | Use Case | Pros | Cons |
|---|---|---|---|
| JSON | Flexible schemas, nested data | Human-readable, no schema needed | Large size, slow queries |
| Avro | Schema evolution, streaming | Compact, schema embedded | Requires schema registry |
| Parquet | Analytics, columnar queries | Highly compressed, fast scans | Write overhead, less flexible |
| CSV | Simple tabular data | Universal compatibility | No schema, poor compression |
**Recommendation:** Use Parquet for analytics workloads, Avro for streaming, and JSON only when necessary.
### Partitioning Strategy

**Time-based partitioning** (most common):

Benefits:

- Query pruning (only scan relevant partitions)
- Lifecycle management (delete old partitions)
- Parallel processing

**Key-based partitioning:**

- Use when queries filter on specific keys
- Example: `user_id` hash partitioning for user-specific queries
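Query pruning is easiest to see concretely: with one partition per day, a date-bounded query only touches the partitions inside its range. A small sketch, assuming a `dt=YYYY-MM-DD` path convention (the prefix and layout are illustrative):

```python
from datetime import date, timedelta

# Illustrative time-based partition layout and pruning: enumerate only the
# partition paths a date-bounded query must scan. The dt= convention is an
# assumption, not a specific product's API.

def partitions_for_range(prefix: str, start: date, end: date) -> list[str]:
    """Return the partition paths a query over [start, end] must scan."""
    days = (end - start).days + 1
    return [f"{prefix}/dt={start + timedelta(days=i):%Y-%m-%d}" for i in range(days)]
```

A three-day query over a year of daily partitions scans 3 paths instead of 365; warehouses apply the same idea automatically when the filter column matches the partition column.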
## Stage 3: Transformation

**Goal:** Transform raw data into curated, analysis-ready datasets.
### ELT vs ETL

**ETL (Extract-Transform-Load):**

- Transform before loading
- Pros: smaller storage footprint, pre-aggregated data
- Cons: rigid; hard to reprocess; transformation logic lives in the pipeline

**ELT (Extract-Load-Transform):**

- Load raw data first, then transform in place
- Pros: flexible; easy reprocessing; separation of concerns
- Cons: larger storage footprint; compute cost for transformations
**Modern approach:** Prefer ELT. Storage is cheap; flexibility is valuable.
### Transformation Types
#### 1. Cleansing
- Remove duplicates
- Handle nulls
- Standardize formats
- Validate ranges
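All four cleansing steps fit in a short pass over the rows. A minimal sketch with hypothetical field names (`id`, `email`, `age`) and an assumed valid age range:

```python
# Minimal cleansing sketch covering the steps above: drop duplicates,
# handle nulls, standardize a string format, and reject out-of-range values.
# Field names and the 0-130 age range are illustrative assumptions.

def cleanse(rows: list[dict]) -> list[dict]:
    seen, out = set(), []
    for row in rows:
        key = row["id"]
        if key in seen:                                   # remove duplicates
            continue
        seen.add(key)
        email = (row.get("email") or "").strip().lower()  # null-safe, standardized
        if not 0 <= row.get("age", 0) <= 130:             # validate ranges
            continue
        out.append({"id": key, "email": email, "age": row.get("age", 0)})
    return out
```

In practice these rules live in SQL or a dbt model; the logic is the same.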
#### 2. Enrichment
- Join with reference data
- Add computed fields
- Geocoding, lookups
#### 3. Aggregation
- Rollups (hourly → daily)
- Summaries (counts, sums, averages)
- Window functions
#### 4. Normalization
- Flatten nested structures
- Resolve entity relationships
- Create dimensional models
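Flattening is usually the first normalization step when raw data arrives as schema-on-read JSON. A recursive sketch, assuming payloads nest objects but not lists:

```python
# Sketch of the flattening step: collapse a nested record into dot-separated
# column names, ready for a tabular model. Assumes nested dicts only (no
# lists); real payloads often need explicit handling for arrays.

def flatten(record: dict, prefix: str = "") -> dict:
    out = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            out.update(flatten(value, f"{name}."))  # recurse into nested objects
        else:
            out[name] = value
    return out
```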
### Transformation Patterns
**Incremental processing:**

```sql
-- Only process new/updated records
SELECT * FROM raw.events
WHERE ingestion_timestamp > (
    SELECT MAX(processed_timestamp) FROM curated.events
)
```
**Upsert pattern:**

```sql
-- Merge new data with existing
MERGE INTO curated.users AS target
USING transformed.users AS source
ON target.user_id = source.user_id
WHEN MATCHED THEN UPDATE ...
WHEN NOT MATCHED THEN INSERT ...
```
**Slowly Changing Dimensions (SCD):**

- Type 1: overwrite (no history)
- Type 2: add a new row with a validity period (full history)
- Type 3: add a new column (limited history)
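Type 2 is the subtle one: an update closes the current row's validity window and appends a new current row, preserving full history. A hedged sketch; the column names (`valid_from`, `valid_to`, `is_current`) are conventional but illustrative:

```python
from datetime import date

# Sketch of an SCD Type 2 merge over an in-memory dimension table: when a
# tracked attribute changes, close the current version and append a new one.
# In a warehouse this is typically a MERGE statement or a dbt snapshot.

def scd2_upsert(dim: list[dict], key: str, new_row: dict, today: date) -> None:
    current = next((r for r in dim if r[key] == new_row[key] and r["is_current"]), None)
    if current and all(current[k] == v for k, v in new_row.items()):
        return                               # no attribute changed: nothing to do
    if current:
        current["is_current"] = False        # close the old version's window
        current["valid_to"] = today
    dim.append({**new_row, "valid_from": today, "valid_to": None, "is_current": True})
```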
### Streaming Transformations

For real-time pipelines:

- **Windowing:** group events by time windows
- **Stateful processing:** maintain aggregations across events
- **Joins:** stream-to-stream or stream-to-table joins
- **Complex event processing:** pattern matching, event sequences

**Tools:** Flink, Kafka Streams, Dataflow (streaming)
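The simplest windowing pattern, the tumbling window, assigns each event to exactly one fixed-size window keyed by its event timestamp. A sketch in plain Python (real engines such as Flink and Dataflow add watermarks and late-data handling on top of this idea):

```python
from collections import Counter

# Sketch of tumbling-window counting: integer-divide the event timestamp
# (epoch seconds) by the window size to find its window start, then count
# per window. Each event lands in exactly one window.

def tumbling_counts(events: list[dict], window_secs: int) -> Counter:
    """Count events per [start, start + window_secs) window."""
    counts = Counter()
    for event in events:
        window_start = (event["ts"] // window_secs) * window_secs
        counts[window_start] += 1
    return counts
```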
## Stage 4: Storage (Curated Layer)

**Goal:** Store transformed data optimized for consumption.
### Characteristics

- **Schema-on-write:** enforced schemas (BigQuery, Snowflake, Delta Lake)
- **Partitioned and indexed:** optimized for query patterns
- **Versioned:** track changes over time
- **Documented:** clear ownership, purpose, and usage
### Storage Tiers
| Tier | Use Case | Access Pattern | Cost |
|---|---|---|---|
| Hot | Active queries, dashboards | Frequent, low latency | High |
| Warm | Ad-hoc analysis, reporting | Occasional, moderate latency | Medium |
| Cold | Compliance, historical | Rare, high latency acceptable | Low |
**Lifecycle policies:** Automatically move data between tiers based on age and access patterns.
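A lifecycle policy is often just a rule mapping data age to a tier. A sketch with assumed 30-day and 365-day cutoffs (real policies come from access statistics and cost targets, and object stores apply them automatically):

```python
from datetime import date, timedelta

# Illustrative lifecycle policy: choose a storage tier from a partition's
# age. The 30/365-day cutoffs are assumptions for the example, not a
# recommendation.

def tier_for(partition_date: date, today: date) -> str:
    age = today - partition_date
    if age <= timedelta(days=30):
        return "hot"
    if age <= timedelta(days=365):
        return "warm"
    return "cold"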
### Data Models

**Star schema (Kimball):**

- Fact tables (transactions, events)
- Dimension tables (users, products, time)
- Optimized for analytics queries

**Data Vault:**

- Hubs (business keys)
- Links (relationships)
- Satellites (attributes)
- Optimized for auditability and change tracking

**One Big Table (OBT):**

- Denormalized, wide tables
- Simple queries, no joins
- Trades higher storage cost for simpler queries

**Choose based on:** query patterns, update frequency, and storage budget.
## Stage 5: Serving

**Goal:** Deliver data to consumers in the right format, at the right time.
### Serving Patterns
#### Analytics Serving

- **Tools:** BigQuery, Snowflake, Redshift, Databricks
- **Format:** SQL queries, dashboards
- **Characteristics:** ad-hoc queries, aggregations, joins
#### ML Serving

- **Tools:** feature stores (Feast, Tecton), model serving (Seldon, BentoML)
- **Format:** feature vectors, embeddings
- **Characteristics:** low-latency lookups, point-in-time correctness
#### Operational Serving

- **Tools:** APIs, key-value stores (Redis, DynamoDB)
- **Format:** REST/GraphQL APIs, lookups
- **Characteristics:** sub-second latency, high availability
#### Data Sharing

- **Tools:** Delta Sharing, BigQuery authorized views, S3 access
- **Format:** direct table access, exports
- **Characteristics:** cross-organization, governed access
### Serving Best Practices

- **Materialized views:** pre-compute common aggregations
- **Caching:** cache frequent queries (Redis, application cache)
- **Query optimization:** partition pruning, column selection, predicate pushdown
- **Access control:** row-level security, column masking
- **Rate limiting:** prevent resource exhaustion
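The caching practice has a simple shape worth seeing once: serve repeated queries from memory and re-run them only after the entry expires. A minimal TTL cache sketch (real deployments use Redis or the BI tool's built-in cache):

```python
import time

# Minimal TTL cache: key -> (expires_at, value). A hit within the TTL skips
# the expensive query; a miss (or expired entry) re-runs it and refreshes
# the entry. Illustrative, not a production cache.

class TTLCache:
    def __init__(self, ttl_secs: float):
        self.ttl = ttl_secs
        self.store = {}

    def get_or_compute(self, key, compute):
        now = time.monotonic()
        hit = self.store.get(key)
        if hit and hit[0] > now:
            return hit[1]                      # cache hit: skip the query
        value = compute()                      # cache miss: run the query
        self.store[key] = (now + self.ttl, value)
        return value
```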
## Lifecycle Management
### Data Retention

Define retention policies for each layer:

- **Raw:** long retention (1-7 years) for audit and reprocessing
- **Curated:** medium retention (90 days to 2 years) based on business needs
- **Archive:** compressed, cold storage for compliance
### Data Deletion

- **Soft delete:** mark as deleted, retain for recovery
- **Hard delete:** permanently remove (GDPR, retention expiry)
- **Cascade:** delete dependent datasets when the source is deleted
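The soft/hard distinction is easiest to see side by side: soft delete stamps a marker and keeps the row (recoverable, auditable); hard delete physically removes it, as GDPR erasure requires. A sketch over row dicts with an illustrative `deleted_at` column:

```python
from datetime import datetime, timezone

# Soft delete marks the row and keeps the data; hard delete removes it.
# The id/deleted_at column names are illustrative assumptions.

def soft_delete(rows: list[dict], row_id: int) -> None:
    for row in rows:
        if row["id"] == row_id:
            row["deleted_at"] = datetime.now(timezone.utc)  # mark, keep data

def hard_delete(rows: list[dict], row_id: int) -> list[dict]:
    return [r for r in rows if r["id"] != row_id]           # remove permanently
```

Queries over soft-deleted tables must filter `deleted_at IS NULL`, which is the usual cost of this pattern.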
### Versioning

Track changes to:

- Schemas (evolution history)
- Transformations (code versions)
- Data (snapshots, change logs)
## Monitoring the Lifecycle

Key metrics per stage:

- **Ingestion:** volume, latency, error rate, schema drift
- **Transformation:** processing time, success rate, data quality
- **Storage:** size, growth rate, partition count, tier distribution
- **Serving:** query latency, error rate, cache hit rate, cost per query
**Alerting:** Set thresholds for SLAs, error rates, and cost anomalies.
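Threshold alerting reduces to comparing observed metrics against their SLA limits and surfacing the breaches. A sketch; the metric names and limits are example values, not a monitoring product's API:

```python
# Compare each observed metric to its threshold and return the breaches.
# Metric names and thresholds below are illustrative assumptions.

def check_slas(metrics: dict, thresholds: dict) -> list[str]:
    """Return an alert message for every metric exceeding its threshold."""
    return [
        f"{name}={metrics[name]} exceeds threshold {limit}"
        for name, limit in thresholds.items()
        if metrics.get(name, 0) > limit
    ]
```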
## Next Steps

- **Ingestion Architecture** - deep dive into ingestion patterns
- **Storage & Data Architecture** - storage design patterns