Ingestion Architecture
Ingestion Architecture¶
How ingestion fits into your overall data architecture.
Overview¶
Ingestion architecture defines how data flows from source systems into your platform. It's the foundation that everything else builds on.
Architecture Layers¶
graph TB
A[Source Systems] --> B[Ingestion Layer]
B1[Batch] --> B
B2[Streaming] --> B
B3[CDC] --> B
B --> C[Raw Storage<br/>Data Lake]
C --> D[Transformation Layer]
D --> E[Curated Storage<br/>Lakehouse/Warehouse]
E --> F[Serving Layer]
F1[Analytics] --> F
F2[ML Models] --> F
F3[APIs] --> F
style A fill:#e3f2fd
style B fill:#80deea
style C fill:#b2dfdb
style D fill:#80deea
style E fill:#b2dfdb
style F fill:#e3f2fd
Complete data flow from source systems to consumption.
Ingestion Patterns¶
Pattern 1: Batch Ingestion¶
Architecture:
Characteristics: - Scheduled execution (hourly, daily) - Full or incremental extracts - Higher latency (minutes to hours) - Lower cost per GB
Use when: - Historical loads - Large volumes - No real-time requirement
Pattern 2: Streaming Ingestion¶
Architecture:
Characteristics: - Continuous processing - Low latency (seconds to minutes) - Higher cost per GB (3-5x batch) - More complex failure handling
Use when: - Real-time requirements - Event-driven architecture - Low-latency use cases
Pattern 3: Change Data Capture (CDC)¶
Architecture:
Characteristics: - Captures inserts, updates, deletes - Maintains transaction consistency - Lower overhead than full extracts - Real-time or near real-time
Use when: - Database replication - Maintaining current state - Audit trails
Storage Architecture¶
Raw Layer Design¶
Purpose: Preserve source data exactly as received
Design principles: - Immutable - Never modify raw data (append-only) - Schema-on-read - Store in flexible formats - Partitioned - By ingestion time, source - Long retention - 7 years for compliance
Format: Parquet (analytics), Avro (streaming), JSON (flexible)
Example structure:
Curated Layer Design¶
Purpose: Cleaned, validated, enriched data
Design principles: - Schema-on-write - Enforced schemas - Partitioned - By business keys - Optimized - For query patterns - Versioned - Track changes over time
Format: Delta Lake, Iceberg, Parquet
Example structure:
Scalability Patterns¶
Horizontal Scaling¶
Multiple ingestion workers: - Partition sources across workers - Each worker handles subset of sources - Scale workers based on load
Vertical Scaling¶
Larger instances: - More CPU/memory per worker - Handle larger sources - Better for single large sources
Hybrid Approach¶
Combine both: - Horizontal for many small sources - Vertical for large sources
Error Handling¶
Retry Strategy¶
Exponential backoff:
max_retries = 5
base_delay = 1 # seconds
for attempt in range(max_retries):
try:
ingest(record)
break
except TransientError:
if attempt < max_retries - 1:
delay = base_delay * (2 ** attempt)
sleep(delay)
else:
send_to_dlq(record) # Dead letter queue
Dead Letter Queue (DLQ)¶
Purpose: Store records that failed after all retries
Implementation: - Separate storage (S3, BigQuery table) - Alert on DLQ size - Manual review and reprocessing
Monitoring¶
Key Metrics¶
Volume: - Records/second - GB/day - Partition count
Latency: - End-to-end latency - Processing time per record - Queue depth
Quality: - Schema validation failures - Duplicate rate - Missing data rate
Reliability: - Success rate - Error rate by type - DLQ size
Cost Optimization¶
Common Cost Traps¶
- Over-ingestion - Ingesting unused data
- Inefficient formats - JSON instead of Parquet
- Redundant ingestion - Multiple pipelines for same source
- Streaming when batch would suffice - 3-5x cost premium
Optimization Techniques¶
- Compression - Use Snappy or Zstd (2-5x reduction)
- Partitioning - Only process new partitions
- Incremental loads - Only fetch changed data
- Lifecycle policies - Move old data to cheaper storage
Agentic Controls & Data Zones in Ingestion Architecture¶
Self-Serve Contracts¶
Architecture pattern:
Ingestion platforms evolve to support contract-first, self-serve pipeline creation:
graph LR
A[Contract Definition] --> B[Pipeline Generation]
B --> C[Resource Provisioning]
C --> D[Monitoring Setup]
D --> E[Pipeline Active]
F[Schema Registry] -.Validates.-> A
G[Policy Engine] -.Enforces.-> B
style A fill:#80deea
style B fill:#80deea
style C fill:#b2dfdb
style D fill:#90caf9
style E fill:#c8e6c9
Implementation: - Contract stored in schema registry - Pipeline templates for common patterns - Automated resource provisioning - Standard monitoring and alerting
Impact: - Time to value: Hours instead of weeks - Consistency: Standard patterns enforced - Quality: Contracts prevent issues
Autonomous Error Detection¶
Architecture pattern:
Ingestion systems detect and respond to errors autonomously:
Error detection: - Real-time monitoring of pipeline health - Pattern recognition for common failures - Anomaly detection for unusual behavior
Autonomous response: - Automatic retry with backoff - Root cause analysis - Preventive actions - Escalation when needed
Example flow:
Error Detected
↓
Pattern Matching (network timeout)
↓
Automatic Retry (exponential backoff)
↓
Success → Continue
Failure → Escalate
Policy-Gated Control Planes¶
Architecture pattern:
Ingestion control planes enforce policies automatically:
Policy types: - Schema policies - Enforce contract compliance - Cost policies - Prevent cost overruns - Quality policies - Enforce quality standards - Security policies - Access control, encryption
Enforcement: - Policies defined as code - Automatic validation at ingestion boundary - Rejection of non-compliant data - Alerting on policy violations
Data Zones in Ingestion Flows¶
Zone-based ingestion architecture:
graph TB
A[Source Systems] --> B[Ingestion Layer]
B --> C[Raw Zone<br/>Immutable<br/>Long Retention]
C --> D[Curated Zone<br/>Validated<br/>Enriched]
D --> E[Processed Zone<br/>Aggregated]
D --> F[Feature/AI Zone<br/>ML-Ready]
G[Contracts] -.Govern.-> B
H[Policies] -.Enforce.-> C
I[Quality Checks] -.Validate.-> D
style C fill:#b2dfdb
style D fill:#80deea
style E fill:#90caf9
style F fill:#64b5f6
Zone characteristics:
Raw Zone ingestion: - Minimal transformation - Schema-on-read - Long retention - Immutable storage
Curated Zone ingestion: - Quality validation - Schema enforcement - Enrichment - Optimized formats
Impact on pipeline design: - Clear boundaries between zones - Zone-specific transformation logic - Zone-appropriate storage formats - Zone-specific lifecycle policies
Impact on Observability¶
Zone-aware observability:
- Raw Zone: Ingestion metrics, schema validation, volume
- Curated Zone: Quality scores, freshness, completeness
- Processed Zone: Query performance, usage patterns
- Feature Zone: Serving latency, feature freshness
Lineage tracking: - Zone-to-zone data flow - Transformation lineage - Ownership tracking - Impact analysis
Impact on Lineage¶
Zone-based lineage:
Benefits: - Clear data flow visualization - Zone-specific impact analysis - Ownership clarity - Compliance documentation
Related Topics¶
- Data Ingestion - Ingestion patterns
- Storage - Storage design
- Data Processing - Processing ingested data
- Platform Strategy - Strategic direction
Next: Data Orchestration →