Data Ingestion¶
"Data freshness is just trust, measured in minutes."
Getting data from source systems into your platform reliably and efficiently.
Overview¶
Ingestion is the foundation of your data platform. Get it wrong, and everything downstream suffers. This section provides deep, opinionated guidance on building reliable, cost-effective ingestion systems.
Decision Framework¶
Before choosing an ingestion pattern, answer these questions:
- Freshness requirement: Real-time (< 1 min), near real-time (1-15 min), or batch (15+ min)?
- Volume: How many records/second? How many GB/day?
- Source type: Database, API, files, event stream?
- Change detection: Do you need to capture updates/deletes, or just new records?
- Cost sensitivity: What's your budget per GB ingested?
Key Topics¶
Batch vs Streaming¶
When to use batch, streaming, or CDC patterns.
Learn about:

- Batch ingestion patterns
- Streaming ingestion architecture
- Change Data Capture (CDC)
- Cost vs freshness trade-offs
- Tool selection guide
Change Data Capture (CDC)¶
Capturing database changes in real-time.
Learn about:

- Log-based CDC
- Trigger-based CDC
- Query-based CDC
- CDC tools (Debezium, Datastream)
- Current state patterns
Push vs Pull¶
Source-initiated vs platform-initiated ingestion.
Learn about:

- Push architecture (webhooks, APIs)
- Pull architecture (scheduled queries)
- When to use each
- Implementation patterns
- Error handling
Strategic Guidelines & Future Thinking¶
Strategic approaches to building ingestion systems that scale and evolve.
Learn about:

- Contracts before pipelines
- Paved paths over pipeline sprawl
- Freshness as first-class SLO
- Cost-aware ingestion design
- Lineage and observability
- Legacy migration strategies
- Domain autonomy patterns
- Future-proofing for AI-assisted ingestion
Ingestion Patterns¶
```mermaid
graph LR
    A[Source Systems] --> B[Batch<br/>Scheduled<br/>Hours/Days]
    A --> C[Streaming<br/>Continuous<br/>Seconds]
    A --> D[CDC<br/>Real-time<br/>Transaction Log]
    B --> E[Storage]
    C --> E
    D --> E
    style A fill:#e3f2fd,stroke:#1976d2,stroke-width:3px
    style B fill:#c8e6c9,stroke:#388e3c,stroke-width:3px
    style C fill:#fff3e0,stroke:#f57c00,stroke-width:3px
    style D fill:#f3e5f5,stroke:#7b1fa2,stroke-width:3px
    style E fill:#b2dfdb,stroke:#00796b,stroke-width:3px
```
Three core ingestion patterns: Batch (scheduled), Streaming (continuous), and CDC (real-time).
Batch Ingestion¶
**When to use:**

- Historical loads, backfills
- Large volumes (> 100 GB per run)
- No real-time requirement
- Source systems that don't support streaming

**Characteristics:**

- Scheduled execution (hourly, daily)
- Full or incremental extracts
- Higher latency (minutes to hours)
- Lower cost per GB
- Easier to debug and reprocess
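The incremental-extract pattern mentioned above can be sketched with a simple watermark: each run pulls only rows modified since the previous run, then advances the watermark. This is a minimal illustration that assumes a hypothetical `updated_at` column on the source table:

```python
from datetime import datetime, timezone

def incremental_extract(rows, last_watermark):
    """Select only rows modified since the previous run, based on an
    assumed 'updated_at' column (the incremental-extract watermark pattern)."""
    new_rows = [r for r in rows if r["updated_at"] > last_watermark]
    # Advance the watermark to the newest timestamp seen, so the next
    # scheduled run resumes exactly where this one left off.
    next_watermark = max(
        (r["updated_at"] for r in new_rows), default=last_watermark
    )
    return new_rows, next_watermark
```

Because the watermark is derived from the data itself, re-running the same extract is safe: already-seen rows are filtered out, which is one reason batch pipelines are easier to debug and reprocess.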
Streaming Ingestion¶
**When to use:**

- Real-time analytics requirements
- Event-driven architectures
- Low-latency use cases (fraud detection, recommendations)
- High-volume, continuous data

**Characteristics:**

- Continuous processing
- Low latency (seconds to minutes)
- Higher cost per GB (3-5x batch)
- More complex failure handling
- Requires message queue/bus
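One common way to tame the per-record overhead (and cost) of continuous processing is to buffer events and write them in small batches. A minimal sketch, not tied to any particular message bus; the `sink` callable and `max_batch` threshold are illustrative:

```python
class MicroBatcher:
    """Buffer streaming events and flush them to a sink in batches,
    trading a little latency for much lower per-record write overhead."""

    def __init__(self, sink, max_batch=100):
        self.sink = sink          # callable that persists a list of events
        self.max_batch = max_batch
        self.buffer = []

    def on_event(self, event):
        self.buffer.append(event)
        # Flush as soon as the buffer reaches the batch size; real
        # consumers usually also flush on a timer to bound latency.
        if len(self.buffer) >= self.max_batch:
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink(list(self.buffer))
            self.buffer.clear()
```

In practice a time-based flush runs alongside the size-based one, so a quiet stream still delivers events within the latency budget.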
Change Data Capture (CDC)¶
**When to use:**

- Database replication
- Maintaining current state tables
- Audit trails
- Real-time synchronization

**Characteristics:**

- Captures inserts, updates, deletes
- Maintains transaction consistency
- Lower overhead than full extracts
- Requires source database support (WAL, binlog)
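The "current state" pattern boils down to folding an ordered stream of insert/update/delete events into a keyed table. A simplified sketch (real CDC payloads, e.g. Debezium's, also carry before/after images and source metadata):

```python
def apply_cdc(state, events):
    """Fold CDC events into a current-state table held as a dict keyed
    by primary key. Events must be applied in transaction-log order to
    preserve consistency."""
    for ev in events:
        if ev["op"] in ("insert", "update"):
            state[ev["id"]] = ev["row"]
        elif ev["op"] == "delete":
            # pop with a default tolerates replayed deletes
            state.pop(ev["id"], None)
    return state
```

Applying events in log order is what preserves transaction consistency: the final state matches what the source database held after the last captured transaction.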
Cost vs Freshness Trade-offs¶
Cost Consideration

As a rough rule of thumb, every 10x reduction in latency costs 3-5x more per GB.
| Latency | Pattern | Cost per GB | Use Case |
|---|---|---|---|
| < 1 min | Streaming | $0.10-0.50 | Real-time dashboards, fraud |
| 1-15 min | Micro-batch | $0.05-0.15 | Near real-time analytics |
| 15 min - 1 hr | Batch (frequent) | $0.02-0.05 | Hourly reports |
| 1-24 hrs | Batch (daily) | $0.01-0.02 | Daily ETL, data warehouse |
| > 24 hrs | Batch (weekly) | $0.005-0.01 | Historical analysis |
Optimization strategy:

1. Start with the slowest acceptable latency
2. Measure actual requirements (not perceived)
3. Optimize only when latency becomes a bottleneck
4. Use tiered approach: streaming for critical, batch for rest
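A quick back-of-the-envelope comparison makes the trade-off concrete. A minimal sketch using the midpoints of the per-GB ranges in the table above (illustrative figures, not vendor pricing):

```python
# Midpoints of the per-GB cost ranges from the table above.
COST_PER_GB = {
    "streaming": 0.30,      # < 1 min latency
    "micro-batch": 0.10,    # 1-15 min
    "batch_hourly": 0.035,  # 15 min - 1 hr
    "batch_daily": 0.015,   # 1-24 hrs
}

def monthly_cost(gb_per_day, pattern):
    """Estimated ingestion cost over a 30-day month for a given pattern."""
    return round(gb_per_day * 30 * COST_PER_GB[pattern], 2)
```

At 100 GB/day, streaming everything costs roughly 20x what daily batch does, which is why the tiered approach (streaming only for the truly latency-critical slice) usually wins.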
Best Practices¶
Idempotency¶
Same data ingested multiple times = same result.
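The standard way to get this property is an upsert keyed on the primary key, so replaying a batch overwrites rows with identical values instead of duplicating them. A minimal in-memory sketch of the MERGE/upsert pattern most warehouses expose natively:

```python
def idempotent_load(target, batch, key="id"):
    """Upsert each row by its primary key: loading the same batch twice
    leaves the target in exactly the same state as loading it once."""
    for row in batch:
        target[row[key]] = row
    return target
```

Contrast this with a naive append-only load, where every retry after a partial failure would duplicate the rows that already landed.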
Checkpointing¶
Track progress to enable resume on failure.
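A checkpoint is only useful if it can't be corrupted mid-write. A minimal sketch using write-to-temp-then-rename, which is atomic on POSIX filesystems (the file path and JSON layout are illustrative):

```python
import json
import os
import tempfile

def save_checkpoint(path, offset):
    """Persist progress atomically: write to a temp file, then rename,
    so a crash mid-write can never leave a half-written checkpoint."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump({"offset": offset}, f)
    os.replace(tmp, path)  # atomic swap into place

def load_checkpoint(path, default=0):
    """Resume from the last saved offset, or from `default` on first run."""
    if not os.path.exists(path):
        return default
    with open(path) as f:
        return json.load(f)["offset"]
```

On restart the pipeline reads the checkpoint and resumes from that offset; combined with idempotent loads, at-least-once redelivery around the checkpoint becomes harmless.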
Backpressure¶
Handle source unavailability gracefully.
Schema Validation¶
Validate at ingestion boundary.
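Boundary validation can start very small: check that each record has the expected fields with the expected types, and quarantine anything that fails. A minimal stand-in for a contract checker such as JSON Schema (the `schema` mapping is illustrative):

```python
def validate(record, schema):
    """Check a record against a simple contract at the ingestion boundary.
    `schema` maps field name -> required Python type; returns a list of
    error strings, empty when the record conforms."""
    errors = []
    for field, ftype in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], ftype):
            errors.append(
                f"bad type for {field}: {type(record[field]).__name__}"
            )
    return errors
```

Rejecting (or routing to a dead-letter location) at the boundary keeps malformed records from silently propagating into downstream tables.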
Metadata Capture¶
Record source, timestamp, version.
Related Topics¶
- Data Architecture - How to store ingested data
- Data Quality - Ensuring data reliability
- Data Engineering - Platform fundamentals
Next: Batch vs Streaming →