Ingestion Architecture¶
"If a data problem can't be explained in one screen, the system is already broken."
Ingestion is the foundation of your data platform. Get it wrong, and everything downstream suffers. This chapter provides deep, opinionated guidance on building reliable, cost-effective ingestion systems.
"Data freshness is just trust, measured in minutes."
Decision Framework¶
Before choosing an ingestion pattern, answer these questions:
- Freshness requirement: Real-time (< 1 min), near real-time (1-15 min), or batch (15+ min)?
- Volume: How many records/second? How many GB/day?
- Source type: Database, API, files, event stream?
- Change detection: Do you need to capture updates/deletes, or just new records?
- Cost sensitivity: What's your budget per GB ingested?
Batch vs Streaming vs CDC¶
Batch Ingestion¶
When to use:
- Historical loads, backfills
- Large volumes (> 100 GB per run)
- No real-time requirement
- Source systems that don't support streaming
Characteristics:
- Scheduled execution (hourly, daily)
- Full or incremental extracts
- Higher latency (minutes to hours)
- Lower cost per GB
- Easier to debug and reprocess
Common patterns:
```sql
-- Full extract (entire table, or a single daily partition)
SELECT * FROM source_table
WHERE ingestion_date = CURRENT_DATE;

-- Incremental extract (timestamp-based)
SELECT * FROM source_table
WHERE updated_at > :last_ingestion_time;

-- Incremental extract (change log)
SELECT * FROM source_table
WHERE id IN (
    SELECT id FROM change_log
    WHERE processed = FALSE
);
```
Tools: Airflow + Spark, Dataflow (batch), dbt, Fivetran
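To make the scheduled, incremental pattern concrete, here is a minimal sketch of an hourly extract job in Airflow 2.x. The DAG id, the query, and the `load_watermark`, `save_watermark`, `run_query`, and `write_to_raw_zone` helpers are illustrative placeholders, not part of any specific setup.

```python
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_incremental():
    # Hypothetical helpers: fetch rows changed since the stored watermark
    # and land them in the raw zone (object store or staging table).
    last_ts = load_watermark("source_table")
    rows = run_query(
        "SELECT * FROM source_table WHERE updated_at > %(since)s",
        {"since": last_ts},
    )
    write_to_raw_zone(rows)
    if rows:
        # Advance the watermark only after the rows are safely landed
        save_watermark("source_table", max(r["updated_at"] for r in rows))

with DAG(
    dag_id="source_table_incremental",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
    default_args={"retries": 3, "retry_delay": timedelta(minutes=5)},
) as dag:
    PythonOperator(task_id="extract", python_callable=extract_incremental)
```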
Streaming Ingestion¶
When to use:
- Real-time analytics requirements
- Event-driven architectures
- Low-latency use cases (fraud detection, recommendations)
- High-volume, continuous data
Characteristics:
- Continuous processing
- Low latency (seconds to minutes)
- Higher cost per GB
- More complex failure handling
- Requires a message queue/bus
Architecture: source systems → message bus (Kafka, Pub/Sub, Kinesis) → stream processor (Flink, Dataflow) → storage (warehouse or lake).
Tools: Kafka, Pub/Sub, Kinesis, Flink, Dataflow (streaming), Kafka Connect
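As a sketch of the flow above, the following consumer (using the kafka-python client) reads events and commits offsets only after each micro-batch is durably written, so a crash replays records rather than dropping them. The topic name and the `write_to_storage` sink are assumptions for illustration.

```python
import json
from kafka import KafkaConsumer  # kafka-python package

consumer = KafkaConsumer(
    "orders_events",                       # hypothetical topic
    bootstrap_servers="kafka:9092",
    group_id="ingestion-orders",
    enable_auto_commit=False,              # commit only after a successful write
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

batch = []
for message in consumer:
    batch.append(message.value)
    if len(batch) >= 500:
        write_to_storage(batch)            # hypothetical sink (object store / warehouse)
        consumer.commit()                  # offsets advance only after the durable write
        batch = []
```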
Cost consideration: Streaming is 3-5x more expensive than batch for the same volume. Only use when latency justifies cost.
Change Data Capture (CDC)¶
When to use:
- Database replication
- Maintaining current-state tables
- Audit trails
- Real-time synchronization
Characteristics:
- Captures inserts, updates, deletes
- Maintains transaction consistency
- Lower overhead than full extracts
- Requires source database support (WAL, binlog)
CDC Patterns:
- Log-based CDC: Read database transaction logs
  - Tools: Debezium, Datastream, DMS
  - Pros: Low overhead, captures all changes
  - Cons: Requires database configuration
- Trigger-based CDC: Database triggers write to a change table
  - Pros: Works with any database
  - Cons: Higher overhead, may miss some changes
- Query-based CDC: Poll for changes using timestamps/version columns
  - Pros: Simple, no database changes
  - Cons: May miss deletes, higher overhead
Recommendation: Use log-based CDC when available. It's the most reliable and efficient.
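The sketch below shows what applying log-based change events to a current-state table can look like, assuming a Debezium-style envelope with `op`, `before`, and `after` fields; the `target` table abstraction and its `upsert`/`delete` methods are hypothetical.

```python
def apply_change_event(event: dict, target) -> None:
    """Apply one log-based CDC event to a current-state table.

    Assumes a Debezium-style envelope; 'target' is a hypothetical
    table abstraction with upsert/delete operations.
    """
    op = event["op"]
    if op in ("c", "r"):              # insert, or initial snapshot read
        target.upsert(event["after"])
    elif op == "u":                   # update: replace with the new row image
        target.upsert(event["after"])
    elif op == "d":                   # delete: key comes from the old row image
        target.delete(key=event["before"]["id"])
```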
Push vs Pull¶
Push (Source-Initiated)¶
Architecture: source system → webhook / HTTP POST / event publish → ingestion endpoint → queue or buffer → storage.
Pros:
- Real-time delivery
- Source controls timing
- No polling overhead
Cons:
- Source must handle retries
- Platform must scale for bursts
- Requires source system changes
When to use:
- Real-time requirements
- Source has reliable infrastructure
- You control the source system
Implementation considerations (see the sketch below):
- Idempotency keys (deduplicate retries)
- Rate limiting (prevent abuse)
- Authentication (secure endpoints)
- Backpressure (reject when overloaded)
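The sketch below illustrates two of these considerations, idempotency keys and backpressure, in a push endpoint. FastAPI, the in-memory dedupe set, and the bounded queue are stand-ins for illustration; a real endpoint would add authentication, rate limiting, and a durable dedupe store with a TTL.

```python
from queue import Queue, Full
from fastapi import FastAPI, Header, HTTPException, Response

app = FastAPI()
buffer = Queue(maxsize=10_000)   # bounded buffer: a separate worker drains it to storage
seen_keys = set()                # stand-in for a durable dedupe store

@app.post("/ingest")
def ingest(record: dict, idempotency_key: str = Header(...)):
    # Deduplicate retried deliveries from the source
    if idempotency_key in seen_keys:
        return Response(status_code=200)   # already accepted; safe to acknowledge again
    try:
        buffer.put_nowait(record)          # reject rather than queue unboundedly
    except Full:
        raise HTTPException(status_code=503, detail="overloaded, retry later")
    seen_keys.add(idempotency_key)
    return Response(status_code=202)
```

The bounded queue is what turns overload into an explicit 503 the source can retry, instead of unbounded memory growth on the platform side.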
Pull (Platform-Initiated)¶
Architecture: scheduler or worker → query / API call against the source → extract → staging → storage.
Pros:
- Platform controls the rate
- Easier backpressure
- No source system changes
Cons:
- Polling overhead
- May miss real-time events
- Higher latency
When to use:
- Batch processing
- Source can't push
- Rate limiting needed
- Legacy systems
Optimization (see the sketch below):
- Incremental queries (only fetch new data)
- Parallel pulls (multiple workers)
- Adaptive polling (increase frequency when data is available)
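Combining incremental queries with adaptive polling might look like the following sketch; `load_watermark`, `save_watermark`, `fetch_since`, and `store` are hypothetical helpers, and the interval bounds are illustrative.

```python
import time

def poll_source(fetch_since, min_interval=5, max_interval=300):
    """Pull loop with incremental fetches and adaptive polling.

    'fetch_since' is a hypothetical callable returning records newer than
    the watermark; the loop polls faster while data is flowing and backs
    off exponentially when the source is idle.
    """
    watermark = load_watermark()                    # last processed timestamp
    interval = min_interval
    while True:
        records = fetch_since(watermark)
        if records:
            store(records)                          # hypothetical sink
            watermark = max(r["updated_at"] for r in records)
            save_watermark(watermark)
            interval = min_interval                 # data available: poll again soon
        else:
            interval = min(interval * 2, max_interval)  # idle: back off
        time.sleep(interval)
```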
Tool Selection Guide¶
Ingestion Engines¶
| Tool | Type | Best For | Cost Model |
|---|---|---|---|
| Airflow + Spark | Batch | Large volumes, complex transforms | Compute + storage |
| Dataflow | Batch/Streaming | GCP-native, auto-scaling | Per vCPU-hour |
| Fivetran | SaaS | Database replication, zero maintenance | Per connector, per row |
| Stitch | SaaS | Simple extracts, cost-effective | Per row |
| Debezium | CDC | Kafka-based CDC, open source | Infrastructure only |
| Datastream | CDC | GCP-native CDC, managed | Per GB processed |
| Kafka Connect | Streaming | Kafka ecosystem, extensible | Infrastructure only |
Decision Matrix¶
High volume (> 1 TB/day), batch: → Airflow + Spark or Dataflow (batch)
Real-time, event streams: → Kafka + Flink or Dataflow (streaming)
Database replication, CDC: → Debezium, Datastream, or Fivetran
Multiple sources, zero maintenance: → Fivetran or Stitch (SaaS)
Cost-sensitive, simple extracts: → Airflow + custom scripts
Cost vs Freshness Trade-offs¶
Rule of thumb: Every 10x reduction in latency costs 3-5x more.
| Latency | Pattern | Cost per GB | Use Case |
|---|---|---|---|
| < 1 min | Streaming | $0.10-0.50 | Real-time dashboards, fraud |
| 1-15 min | Micro-batch | $0.05-0.15 | Near real-time analytics |
| 15 min - 1 hr | Batch (frequent) | $0.02-0.05 | Hourly reports |
| 1-24 hrs | Batch (daily) | $0.01-0.02 | Daily ETL, data warehouse |
| > 24 hrs | Batch (weekly) | $0.005-0.01 | Historical analysis |
Optimization strategy:
1. Start with the slowest acceptable latency
2. Measure actual requirements (not perceived ones)
3. Optimize only when latency becomes a bottleneck
4. Use a tiered approach: streaming for critical data, batch for the rest
Common Patterns¶
Pattern 1: Idempotent Ingestion¶
Problem: Source may send duplicate data (retries, failures).
Solution: Use idempotency keys.
```python
def ingest_record(record, source):
    # Build a deterministic key so retried deliveries of the same record are no-ops
    idempotency_key = f"{source}_{record.id}_{record.timestamp}"
    if exists_in_dedupe_table(idempotency_key):
        return  # Already processed
    process_record(record)
    insert_dedupe_table(idempotency_key)
```
Storage: Use an idempotency table with a TTL (e.g., 7 days).
Pattern 2: Checkpointing¶
Problem: Long-running jobs fail partway through.
Solution: Track progress, enable resume.
```python
checkpoint = get_last_checkpoint(job_id)
records = source.fetch_since(checkpoint.last_processed_id)
for record in records:
    process(record)
    checkpoint.update(record.id, record.timestamp)
    checkpoint.save()  # Save frequently so a failure loses little progress
```
Storage: Checkpoint table or file (S3, GCS).
Pattern 3: Backpressure Handling¶
Problem: Source sends data faster than platform can process.
Solution: Implement backpressure.
```python
# Option 1: Rate limiting
rate_limiter = RateLimiter(max_per_second=1000)
for record in stream:
    rate_limiter.wait()
    process(record)

# Option 2: Queue with a maximum size
queue = Queue(maxsize=10000)
if queue.full():
    reject_request()  # Return 503 so the source backs off
else:
    queue.put(record)
```
Pattern 4: Schema Evolution¶
Problem: Source schema changes break ingestion.
Solution: Schema registry + validation.
```python
schema_registry = SchemaRegistry()

def ingest(record):
    # Validate against the latest registered schema
    schema = schema_registry.get_latest('source_name')
    validated = schema.validate(record)
    # Handle backward compatibility: don't store records that fail validation
    if not validated:
        handle_schema_mismatch(record, schema)
        return
    store(validated)
```
Tools: Confluent Schema Registry, AWS Glue Schema Registry
Error Handling¶
Retry Strategy¶
Exponential backoff:
```python
max_retries = 5
base_delay = 1  # seconds

for attempt in range(max_retries):
    try:
        ingest(record)
        break
    except TransientError:
        if attempt < max_retries - 1:
            delay = base_delay * (2 ** attempt)
            sleep(delay)
        else:
            send_to_dlq(record)  # Dead letter queue
```
Retryable errors: Network timeouts, rate limits, temporary unavailability
Non-retryable errors: Authentication failures, invalid schema, business logic errors
Dead Letter Queue (DLQ)¶
Purpose: Store records that failed after all retries.
Implementation:
- Separate storage (S3, BigQuery table)
- Alert on DLQ size
- Manual review and reprocessing
Monitoring: Track DLQ size, error types, reprocessing success rate.
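One possible shape for the `send_to_dlq` helper referenced in the retry example is sketched below, writing each failed record with its error context; here it also receives the triggering exception, and the local directory stands in for an object-store prefix or a dedicated table.

```python
import json
import os
from datetime import datetime, timezone

def send_to_dlq(record, error, dlq_dir="dlq"):
    """Persist a failed record with enough context to triage and reprocess it.

    A minimal local-file sketch; in practice the destination would be an
    object-store prefix or a table, as described above.
    """
    os.makedirs(dlq_dir, exist_ok=True)
    entry = {
        "failed_at": datetime.now(timezone.utc).isoformat(),
        "error_type": type(error).__name__,
        "error_message": str(error),
        "payload": record,
    }
    name = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%f")
    with open(f"{dlq_dir}/{name}.json", "w") as f:
        json.dump(entry, f)
```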
Monitoring & Observability¶
Key Metrics¶
Volume metrics:
- Records/second
- GB/day
- Partition count
Latency metrics:
- End-to-end latency (source → storage)
- Processing time per record
- Queue depth
Quality metrics:
- Schema validation failures
- Duplicate rate
- Missing data rate
Reliability metrics:
- Success rate
- Error rate by type
- DLQ size
- Retry count
Alerting¶
Critical alerts:
- Ingestion stopped (zero records in the last N minutes)
- Error rate > threshold (e.g., 5%)
- Latency > SLA (e.g., 15 minutes for batch)
- DLQ size > threshold
Warning alerts:
- Volume drop > 20%
- Schema drift detected
- Cost spike > 20%
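As an example of the first critical alert, a freshness check might look like the sketch below; `last_record_time` would come from your metrics store and `page_oncall` is a hypothetical notification hook.

```python
from datetime import datetime, timedelta, timezone

def check_ingestion_stopped(last_record_time, window_minutes=15):
    """Critical alert: no records ingested in the last N minutes."""
    cutoff = datetime.now(timezone.utc) - timedelta(minutes=window_minutes)
    if last_record_time < cutoff:
        page_oncall(
            f"Ingestion stopped: no records since {last_record_time.isoformat()}"
        )
```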
Cost Optimization¶
Common Cost Traps¶
- Over-ingestion: Ingesting data that's never used
  - Solution: Track usage, archive unused sources
- Inefficient formats: Using JSON instead of Parquet
  - Solution: Convert to columnar formats post-ingestion
- Redundant ingestion: Multiple pipelines for the same source
  - Solution: Centralize, reuse outputs
- Streaming when batch would suffice: 3-5x cost premium
  - Solution: Measure actual latency requirements
Optimization Techniques¶
- Compression: Use Snappy or Zstd (2-5x reduction)
- Partitioning: Only process new partitions
- Incremental loads: Only fetch changed data
- Batching: Group small records into batches
- Lifecycle policies: Move old data to cheaper storage
Expected savings: 20-40% with basic optimizations.
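As an illustration of the compression and columnar-format techniques above, the following sketch rewrites raw JSON-lines output as partitioned, compressed Parquet using pandas with the pyarrow engine; the paths and the event_date column are assumptions.

```python
import pandas as pd

# Read raw JSON-lines landed by ingestion and rewrite as compressed Parquet.
df = pd.read_json("raw/events.jsonl", lines=True)
df.to_parquet(
    "curated/events/",
    engine="pyarrow",
    compression="snappy",           # Zstd is also widely supported
    partition_cols=["event_date"],  # enables partition pruning downstream
)
```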
Next Steps¶
- Storage & Data Architecture - How to store ingested data
- Cost Efficiency & Scale - Advanced cost optimization