Cost Efficiency
Cost Efficiency & Scale¶
"Cost explosions happen when no one feels ownership."
Cost optimization isn't about cutting corners—it's about spending wisely. At scale, small inefficiencies compound into massive waste. This chapter provides practical, actionable guidance on reducing costs by 20-40% without sacrificing quality or performance.
"Every broken pipeline started as 'we'll clean it later.'"
Cost Drivers¶
Understanding Your Costs¶
pie title "Cost Distribution"
"Compute" : 50
"Storage" : 25
"Network" : 10
"Operations" : 10
"Other" : 5
Typical cost breakdown:
| Category | % of Total | Description |
|---|---|---|
| Compute | 40-60% | Query execution, transformations |
| Storage | 20-30% | Data storage across tiers |
| Network | 5-15% | Data transfer, egress |
| Operations | 5-10% | Pipeline orchestration, monitoring |
Compute costs: - Query execution (BigQuery, Snowflake slots) - Transformation jobs (Spark, Dataflow) - Streaming processing (Flink, Kafka)
Storage costs: - Hot storage (frequently accessed) - Warm storage (occasional access) - Cold storage (rarely accessed)
Network costs: - Cross-region transfers - Egress to internet - Inter-service communication
Common Cost Traps¶
Trap 1: Over-Ingestion¶
Problem: Ingesting data that's never used.
Symptoms: - Tables with zero queries in 90 days - High storage cost, low usage - "Just in case" ingestion
Solution:
-- Identify unused tables
SELECT
table_name,
storage_bytes,
last_query_time,
days_since_last_query
FROM table_usage_stats
WHERE days_since_last_query > 90
ORDER BY storage_bytes DESC
Action: Archive or delete unused data. Expected savings: 10-20%.
Trap 2: Inefficient File Formats¶
Problem: Using JSON or CSV instead of Parquet.
Cost impact: - JSON: 5-10x larger than Parquet - CSV: 3-5x larger than Parquet - Higher storage + compute costs
Solution: Convert to Parquet post-ingestion.
# Convert JSON to Parquet
df = spark.read.json("s3://raw/events/*.json")
df.write.parquet("s3://raw/events_parquet/", compression="zstd")
Expected savings: 50-70% on storage, 30-50% on compute.
Trap 3: Streaming When Batch Would Suffice¶
Problem: Using streaming for use cases that don't need real-time.
Cost impact: Streaming is 3-5x more expensive than batch.
Decision framework: - Real-time requirement (< 1 min)? → Streaming - Near real-time (1-15 min)? → Micro-batch - Batch acceptable (15+ min)? → Batch
Example:
Use case: Daily reporting dashboard
Current: Streaming (cost: $1000/month)
Better: Daily batch (cost: $200/month)
Savings: $800/month (80%)
Trap 4: No Lifecycle Policies¶
Problem: All data in expensive hot storage.
Solution: Automatically move to cheaper tiers.
# Example: S3 lifecycle policy
lifecycle:
- days: 30
move_to: STANDARD_IA # 50% cheaper
- days: 90
move_to: GLACIER # 80% cheaper
Expected savings: 50-70% on old data.
Trap 5: Small Files Problem¶
Problem: Many small files (e.g., 10,000 files of 1MB each).
Impact: - Slow queries (many file opens) - Higher compute cost (overhead) - Inefficient storage
Solution: Compact small files.
# Compact small files
df = spark.read.parquet("s3://data/partition=2024-01-15/")
df.coalesce(10).write.parquet("s3://data/partition=2024-01-15/")
Target: 100-500MB per file (for Parquet).
Trap 6: Redundant Processing¶
Problem: Multiple pipelines processing same data.
Symptoms: - Same source ingested multiple times - Same transformation computed multiple times - Duplicate storage
Solution: Centralize, reuse outputs.
-- Instead of:
SELECT * FROM raw.events WHERE date = '2024-01-15' -- Pipeline A
SELECT * FROM raw.events WHERE date = '2024-01-15' -- Pipeline B
-- Do:
CREATE TABLE shared.events_2024_01_15 AS
SELECT * FROM raw.events WHERE date = '2024-01-15'
-- Both pipelines use shared table
Optimization Strategies¶
1. Right-Size Compute¶
Problem: Over-provisioned compute (paying for unused capacity).
Solution: Match compute to workload.
Batch jobs:
# Start small, scale up if needed
spark.conf.set("spark.executor.instances", "10")
spark.conf.set("spark.executor.cores", "4")
# Monitor and adjust
# If CPU < 50%: Reduce instances
# If CPU > 80%: Increase instances
Streaming:
# Use auto-scaling
flink_config = {
"parallelism": "auto",
"min_parallelism": 2,
"max_parallelism": 20
}
Expected savings: 20-30%.
2. Query Optimization¶
Partition pruning:
-- BAD: Scans all partitions
SELECT * FROM events
WHERE user_id = '123'
-- GOOD: Only scans relevant partition
SELECT * FROM events
WHERE date = '2024-01-15' AND user_id = '123'
Column selection:
-- BAD: Selects all columns
SELECT * FROM users
-- GOOD: Only needed columns
SELECT user_id, name, email FROM users
Predicate pushdown:
-- BAD: Filter after join
SELECT * FROM orders o
JOIN users u ON o.user_id = u.user_id
WHERE o.date = '2024-01-15'
-- GOOD: Filter before join
SELECT * FROM (
SELECT * FROM orders WHERE date = '2024-01-15'
) o
JOIN users u ON o.user_id = u.user_id
Expected savings: 30-50% on query costs.
3. Caching and Materialization¶
Materialized views:
-- Pre-compute common aggregations
CREATE MATERIALIZED VIEW daily_user_stats AS
SELECT
date,
user_id,
COUNT(*) as event_count,
SUM(amount) as total_amount
FROM events
GROUP BY date, user_id
-- Refresh incrementally
REFRESH MATERIALIZED VIEW daily_user_stats;
Application caching:
# Cache frequent queries
@cache(ttl=3600) # 1 hour
def get_user_stats(user_id):
return query(f"SELECT * FROM user_stats WHERE user_id = {user_id}")
Expected savings: 40-60% on repeated queries.
4. Compression¶
Storage compression: - Parquet with Snappy: 2-3x - Parquet with Zstd: 4-6x - Parquet with Brotli: 5-7x (slower)
Recommendation: Use Zstd for best balance.
Expected savings: 50-70% on storage.
5. Incremental Processing¶
Problem: Reprocessing all data every time.
Solution: Only process new/changed data.
# Full reprocess (expensive)
df = spark.read.table("raw.events")
processed = transform(df)
processed.write.save("curated.events")
# Incremental (cheap)
last_processed = get_last_processed_timestamp()
df = spark.read.table("raw.events") \
.filter(f"ingestion_timestamp > '{last_processed}'")
processed = transform(df)
processed.write.mode("append").save("curated.events")
update_last_processed_timestamp()
Expected savings: 80-95% on transformation costs.
6. Spot Instances / Preemptible¶
Use for: - Batch jobs (can tolerate interruption) - Non-critical workloads - Cost-sensitive use cases
Avoid for: - Streaming (needs continuous availability) - Critical pipelines - Low-latency requirements
Expected savings: 60-80% on compute.
Streaming vs Micro-Batch¶
When to Use Streaming¶
Use streaming when: - Real-time requirement (< 1 minute) - Event-driven architecture - Low-latency use cases (fraud, recommendations)
Cost: 3-5x batch
When to Use Micro-Batch¶
Use micro-batch when: - Near real-time acceptable (1-15 minutes) - Cost-sensitive - Can tolerate small delays
Implementation:
# Micro-batch: Process every 5 minutes
schedule = "*/5 * * * *" # Every 5 minutes
# Instead of continuous streaming
# Process accumulated events in batches
Cost: 1.5-2x batch
Expected savings: 50-70% vs streaming.
Zombie Pipeline Detection¶
The Problem¶
Zombie pipelines: Running but producing no value.
Symptoms: - Zero downstream consumers - No queries in 90+ days - High cost, zero usage - Still running "just in case"
Detection¶
-- Find zombie pipelines
SELECT
pipeline_name,
daily_cost,
last_consumer_query,
days_since_last_use,
CASE
WHEN days_since_last_use > 90 THEN 'ZOMBIE'
ELSE 'ACTIVE'
END as status
FROM pipeline_usage_stats
WHERE status = 'ZOMBIE'
ORDER BY daily_cost DESC
Action Plan¶
- Identify: Find zombies (query above)
- Verify: Confirm no usage (check consumers)
- Notify: Alert owners
- Archive: Move to cold storage
- Delete: Remove if truly unused
Expected savings: 5-15% of total cost.
Cost Monitoring¶
Key Metrics¶
Cost per GB ingested:
SELECT
source,
SUM(ingestion_cost) / SUM(volume_gb) as cost_per_gb
FROM ingestion_costs
GROUP BY source
ORDER BY cost_per_gb DESC
Cost per query:
SELECT
query_type,
AVG(cost) as avg_cost,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY cost) as p95_cost
FROM query_costs
GROUP BY query_type
Cost trends: - Week-over-week growth - Month-over-month growth - Anomaly detection (spikes)
Cost Attribution¶
By team:
SELECT
team,
SUM(cost) as total_cost,
SUM(cost) / SUM(SUM(cost)) OVER () as pct_of_total
FROM cost_attribution
GROUP BY team
ORDER BY total_cost DESC
By source:
SELECT
source,
SUM(cost) as total_cost
FROM cost_attribution
GROUP BY source
ORDER BY total_cost DESC
By consumer:
SELECT
consumer,
SUM(query_cost) as total_cost
FROM query_costs
GROUP BY consumer
ORDER BY total_cost DESC
Alerting¶
Cost alerts: - Daily cost > threshold - Cost spike > 20% (day-over-day) - Unusual usage pattern - Budget exceeded
Practical Cost Reduction Plan¶
Phase 1: Quick Wins (Week 1-2)¶
- Identify unused tables → Archive (10-20% savings)
- Convert JSON to Parquet → Storage optimization (50-70% savings)
- Enable lifecycle policies → Tier old data (50-70% savings)
- Compact small files → Query optimization (20-30% savings)
Expected total: 20-30% reduction
Phase 2: Optimization (Week 3-4)¶
- Right-size compute → Match to workload (20-30% savings)
- Optimize queries → Partition pruning, column selection (30-50% savings)
- Incremental processing → Only process new data (80-95% savings)
- Materialize views → Pre-compute aggregations (40-60% savings)
Expected total: Additional 15-25% reduction
Phase 3: Architecture (Month 2+)¶
- Evaluate streaming vs batch → Use batch when possible (50-70% savings)
- Eliminate zombies → Remove unused pipelines (5-15% savings)
- Centralize processing → Eliminate redundancy (10-20% savings)
- Spot instances → For batch jobs (60-80% savings)
Expected total: Additional 10-20% reduction
Overall Target¶
Total expected savings: 40-60% with all optimizations.
Cost-Benefit Analysis¶
When to Optimize¶
Optimize when: - Cost > $10K/month (worth engineering time) - Cost growing > 20%/month (unsustainable) - Specific pain point (e.g., query slowness)
Don't optimize when: - Cost < $1K/month (engineering time > savings) - One-time spike (investigate, but don't over-optimize) - Premature (optimize after you have data)
ROI Calculation¶
# Example: Query optimization project
engineering_time = 40 # hours
hourly_rate = 150 # $/hour
engineering_cost = engineering_time * hourly_rate # $6,000
monthly_savings = 5000 # $/month
annual_savings = monthly_savings * 12 # $60,000
roi = (annual_savings - engineering_cost) / engineering_cost # 900%
payback_period = engineering_cost / monthly_savings # 1.2 months
Rule of thumb: If payback < 3 months, do it.
Next Steps¶
- Tooling Landscape - Tools for cost optimization
- Leadership View - Measuring and reporting costs