Skip to content

Tooling Landscape

Tooling Landscape

The data engineering tooling ecosystem is vast and rapidly evolving. This chapter provides an opinionated guide to selecting and using tools effectively, organized by category.

Tool Selection Principles

Criteria

  1. Fit for purpose: Does it solve your specific problem?
  2. Maturity: Is it production-ready? (avoid bleeding edge for critical systems)
  3. Ecosystem: Does it integrate with your stack?
  4. Cost: Total cost of ownership (license + infrastructure + operations)
  5. Vendor lock-in: Can you migrate if needed?
  6. Community: Is there support, documentation, talent?

Decision Framework

For each tool category: 1. Define requirements (must-have vs nice-to-have) 2. Evaluate 2-3 options 3. POC (proof of concept) if significant investment 4. Standardize on one (avoid tool sprawl)

Ingestion Tools

Managed SaaS (Zero Maintenance)

Fivetran - Best for: Database replication, zero maintenance - Pros: Managed, reliable, many connectors - Cons: Expensive at scale, limited customization - Cost: ~$1-2 per 10K rows/month - Use when: Small-medium volume, want zero ops

Stitch - Best for: Simple extracts, cost-effective - Pros: Cheaper than Fivetran, simple - Cons: Fewer connectors, less reliable - Cost: ~$0.5-1 per 10K rows/month - Use when: Cost-sensitive, simple use cases

Airbyte (Open Source) - Best for: Self-hosted, customizable - Pros: Free, open source, extensible - Cons: Requires operations, less mature - Cost: Infrastructure only - Use when: Large scale, need customization

Self-Managed

Debezium (CDC) - Best for: Kafka-based CDC, database replication - Pros: Open source, reliable, Kafka-native - Cons: Requires Kafka infrastructure - Cost: Infrastructure only - Use when: Already using Kafka, need CDC

Kafka Connect - Best for: Kafka ecosystem, extensible connectors - Pros: Many connectors, Kafka-native - Cons: Requires Kafka operations - Cost: Infrastructure only - Use when: Kafka-first architecture

Custom Scripts (Airflow + Spark) - Best for: Full control, custom logic - Pros: Complete flexibility - Cons: High maintenance, slower development - Cost: Infrastructure + engineering time - Use when: Unique requirements, have engineering resources

Cloud-Native

Google Datastream - Best for: GCP-native CDC, managed service - Pros: Managed, GCP-integrated, reliable - Cons: GCP-only, expensive - Cost: ~$0.05-0.10 per GB processed - Use when: GCP stack, need managed CDC

AWS DMS (Database Migration Service) - Best for: AWS-native replication, migrations - Pros: Managed, AWS-integrated - Cons: AWS-only, can be expensive - Cost: Per instance hour + data transfer - Use when: AWS stack, database replication

Orchestration Tools

Apache Airflow

Best for: Complex workflows, Python-based, open source

Pros: - Mature, widely adopted - Rich ecosystem (operators, sensors) - Flexible (Python-based) - Good UI and monitoring

Cons: - Requires operations (self-hosted) - Can be complex for simple workflows - Resource-intensive

Use when: Complex DAGs, need flexibility, have operations team

Alternatives: Prefect, Dagster (modern Python alternatives)

Managed Airflow

Google Cloud Composer - Managed Airflow on GCP - Pros: No ops, GCP-integrated - Cons: Expensive, GCP-only - Cost: ~$0.10 per vCPU-hour

AWS MWAA (Managed Workflows) - Managed Airflow on AWS - Pros: No ops, AWS-integrated - Cons: Expensive, AWS-only - Cost: ~$0.49 per hour base + usage

dbt (Data Build Tool)

Best for: SQL-based transformations, analytics engineering

Pros: - SQL-based (accessible to analysts) - Great testing framework - Documentation generation - Version control friendly

Cons: - SQL-only (limited for complex logic) - Requires orchestration (Airflow, etc.)

Use when: SQL-heavy transformations, analytics team

Prefect / Dagster

Best for: Modern Python orchestration, better DX than Airflow

Pros: - Better developer experience - Modern architecture - Good testing support

Cons: - Less mature than Airflow - Smaller ecosystem

Use when: Starting new, Python-heavy, want modern tooling

Transformation Frameworks

Spark

Best for: Large-scale batch processing, complex transformations

Pros: - Handles petabytes - Rich APIs (SQL, Python, Scala, R) - Mature ecosystem - Good performance

Cons: - Complex (steep learning curve) - Resource-intensive - Requires tuning

Use when: Large volumes, complex logic, need performance

Managed options: Databricks, EMR, Dataproc

Best for: Streaming, stateful processing, low latency

Pros: - Excellent for streaming - Stateful processing - Low latency - Good performance

Cons: - Complex - Steeper learning curve than Spark - Requires operations

Use when: Real-time streaming, stateful processing

Managed options: Ververica, Kinesis Analytics, Dataflow

dbt

Best for: SQL transformations, analytics engineering

Pros: - SQL-based (accessible) - Great testing - Documentation - Modular (macros, models)

Cons: - SQL-only - Limited for complex logic

Use when: Analytics workloads, SQL-heavy

Dataflow (Google)

Best for: GCP-native, auto-scaling, batch + streaming

Pros: - Managed, auto-scaling - Unified batch/streaming - GCP-integrated - Pay per use

Cons: - GCP-only - Can be expensive - Less flexible than Spark/Flink

Use when: GCP stack, want managed, batch + streaming

Storage & Query Engines

Data Warehouses

BigQuery - Best for: GCP-native, serverless, analytics - Pros: Serverless, auto-scaling, fast - Cons: GCP-only, can be expensive - Cost: $5/TB queried, $0.02/GB stored

Snowflake - Best for: Multi-cloud, performance, enterprise - Pros: Multi-cloud, excellent performance, enterprise features - Cons: Expensive, vendor lock-in - Cost: Per credit (compute) + storage

Redshift - Best for: AWS-native, cost-effective at scale - Pros: AWS-integrated, cost-effective - Cons: Requires management, less flexible - Cost: Per instance + storage

Databricks SQL - Best for: Lakehouse, Spark ecosystem - Pros: Lakehouse (Delta), Spark integration - Cons: Can be expensive - Cost: Per DBU (Databricks Unit)

Data Lakes

S3 + Spark / Presto - Best for: Cost-effective, flexible - Pros: Very cheap storage, flexible - Cons: Requires compute engine, more complex - Cost: $0.023/GB (S3 Standard)

GCS + BigQuery - Best for: GCP-native, lakehouse - Pros: Integrated, can query directly - Cons: GCP-only - Cost: Storage + query costs

ADLS + Databricks - Best for: Azure-native, lakehouse - Pros: Azure-integrated, Delta Lake - Cons: Azure-only - Cost: Storage + compute

Lakehouse Formats

Delta Lake - Best for: ACID transactions, time travel, upserts - Pros: ACID, time travel, schema evolution - Cons: Requires compatible engines - Use when: Need updates, time travel

Apache Iceberg - Best for: Open format, multi-engine support - Pros: Open, multi-engine, good performance - Cons: Less mature than Delta - Use when: Want open format, multi-engine

Apache Hudi - Best for: Real-time updates, incremental processing - Pros: Real-time updates, incremental - Cons: Less mature, smaller ecosystem - Use when: Real-time updates needed

Metadata & Discovery

DataHub (LinkedIn)

Best for: Open source, comprehensive metadata

Pros: - Open source - Comprehensive (lineage, ownership, usage) - Good UI - Active community

Cons: - Requires operations - Can be complex to set up

Use when: Want open source, comprehensive solution

Collibra

Best for: Enterprise governance, compliance

Pros: - Enterprise features - Strong governance - Compliance tools

Cons: - Expensive - Can be heavy/overkill

Use when: Enterprise, need strong governance

AWS Glue Catalog

Best for: AWS-native, simple metadata

Pros: - AWS-integrated - Managed - Simple

Cons: - AWS-only - Limited features

Use when: AWS stack, simple needs

Custom Solutions

Build your own: - Pros: Full control, tailored - Cons: High maintenance, reinventing wheel - Use when: Unique requirements, have resources

Observability & Monitoring

Grafana

Best for: Dashboards, visualization

Pros: - Excellent dashboards - Many data sources - Open source - Flexible

Cons: - Requires setup - Alerting can be complex

Use when: Need dashboards, have operations

Datadog

Best for: Full-stack observability, SaaS

Pros: - Comprehensive (metrics, logs, traces) - SaaS (no ops) - Good integrations - Great UI

Cons: - Expensive at scale - Vendor lock-in

Use when: Want SaaS, comprehensive solution

Cloud Native

CloudWatch (AWS) - AWS-native monitoring - Pros: AWS-integrated, managed - Cons: AWS-only, can be expensive

Cloud Monitoring (GCP) - GCP-native monitoring - Pros: GCP-integrated, managed - Cons: GCP-only

Azure Monitor - Azure-native monitoring - Pros: Azure-integrated, managed - Cons: Azure-only

Quality & Testing

Great Expectations

Best for: Data quality testing, validation

Pros: - Comprehensive testing framework - Good integrations - Open source - Active community

Cons: - Can be complex - Requires setup

Use when: Need comprehensive quality testing

dbt Tests

Best for: SQL-based quality checks

Pros: - Integrated with dbt - SQL-based (accessible) - Simple

Cons: - dbt-only - Limited compared to Great Expectations

Use when: Using dbt, simple quality checks

Custom Validators

Build your own: - Pros: Tailored to needs - Cons: Maintenance overhead - Use when: Unique requirements

Recommendation Matrix

Small Team (< 10 engineers)

Ingestion: Fivetran or Stitch (SaaS)
Orchestration: Managed Airflow (Composer/MWAA) or Prefect Cloud
Transformation: dbt + BigQuery/Snowflake
Storage: BigQuery or Snowflake
Metadata: DataHub (self-hosted) or Glue Catalog
Monitoring: Cloud native (CloudWatch/Cloud Monitoring)

Medium Team (10-50 engineers)

Ingestion: Mix (Fivetran for simple, Debezium/Kafka for complex)
Orchestration: Airflow (self-hosted or managed)
Transformation: Spark + dbt
Storage: Lakehouse (S3/GCS + Spark + warehouse)
Metadata: DataHub
Monitoring: Grafana + Prometheus or Datadog

Large Team (50+ engineers)

Ingestion: Kafka Connect + Debezium + custom
Orchestration: Airflow (self-hosted, multiple instances)
Transformation: Spark + Flink + dbt
Storage: Multi-tier (lake + warehouse)
Metadata: DataHub or Collibra
Monitoring: Full stack (Grafana + Datadog + custom)

Tool Sprawl Prevention

Principles

  1. Standardize: One tool per category (avoid 3 different orchestration tools)
  2. Justify exceptions: Require approval for new tools
  3. Regular review: Audit tools annually, deprecate unused
  4. Documentation: Clear guidance on when to use what

Governance

Tool registry: - Approved tools (standard) - Approved with justification (exceptions) - Deprecated (being phased out) - Prohibited (security/compliance issues)

Next Steps