Platform & Operating Model

"The biggest opportunity for managers isn't better data — it's making data problems understandable."

Building a data platform is not only a technology problem. It also requires an operating model that lets teams move fast while maintaining quality, cost control, and reliability. This chapter covers how to structure your platform organization and its processes.

"If Gen-Z doesn't care about your data problem, you've explained the wrong problem."

Central Platform vs Domain Ownership¶

The Spectrum¶

Fully Centralized ←──────────────────────────────→ Fully Decentralized
(Platform Team)                                    (Domain Teams)

Central Platform Model¶

Structure:
- Central platform team owns infrastructure
- Domain teams consume platform services
- Platform team builds self-serve capabilities

Pros:
- Consistency across organization
- Economies of scale
- Centralized expertise
- Easier governance

Cons:
- Can become bottleneck
- May not understand domain needs
- Slower to adapt

Best for:
- Large organizations (1000+ engineers)
- Need for strong governance
- Limited data engineering expertise in domains

Domain Ownership Model¶

Structure:
- Domain teams own their data end-to-end
- Platform provides base infrastructure only
- Teams responsible for quality, cost, SLAs

Pros:
- Faster iteration
- Domain expertise
- Ownership and accountability

Cons:
- Inconsistency
- Duplication
- Harder governance

Best for:
- Smaller organizations
- High domain expertise
- Need for speed over consistency

Hybrid Model (Recommended)¶

Structure:
- Platform team owns: Infrastructure, standards, tooling
- Domain teams own: Business logic, transformations, quality
- Shared ownership: Governance, cost optimization

Responsibilities Matrix:

| Area | Platform Team | Domain Teams | Shared |
|---|---|---|---|
| Infrastructure | ✅ | | |
| Ingestion pipelines | ✅ (self-serve) | ✅ (business logic) | |
| Transformations | | ✅ | |
| Data quality | | ✅ | ✅ (standards) |
| Cost optimization | ✅ (tools) | ✅ (usage) | ✅ |
| Governance | ✅ (framework) | ✅ (compliance) | ✅ |

Key principle: Platform enables, domains execute.

Paved Paths and Escape Hatches¶

Paved Paths¶

Definition: Standardized, supported, well-documented ways to accomplish common tasks.

Examples:
- Standard ingestion patterns (CDC, batch, streaming)
- Pre-configured compute environments (Spark, Flink)
- Standard storage formats (Parquet, Delta)
- Approved tooling (dbt, Airflow)

Benefits:
- Faster onboarding
- Consistency
- Easier maintenance
- Better observability

Implementation:

# Example: Standard ingestion template
ingestion_template:
  type: cdc
  source: postgres
  destination: gcs://raw/{source_name}
  format: parquet
  partition_by: [date]
  schema_registry: enabled
  monitoring: enabled
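
One way to picture how a platform might apply this template is to render it once per registered source. The sketch below is illustrative only: `render_ingestion_config` and the dictionary layout are assumptions made for this example, not part of any real tool. The field names simply mirror the template above.

```python
from copy import deepcopy
from typing import Optional

# Mirrors the YAML template above; a real platform would load it from a file.
INGESTION_TEMPLATE = {
    "type": "cdc",
    "source": "postgres",
    "destination": "gcs://raw/{source_name}",
    "format": "parquet",
    "partition_by": ["date"],
    "schema_registry": "enabled",
    "monitoring": "enabled",
}


def render_ingestion_config(source_name: str, overrides: Optional[dict] = None) -> dict:
    """Return a concrete config for one source (hypothetical helper)."""
    config = deepcopy(INGESTION_TEMPLATE)  # never mutate the shared template
    config.update(overrides or {})
    # Fill in the {source_name} placeholder used by the destination path.
    config["destination"] = config["destination"].format(source_name=source_name)
    return config


orders_config = render_ingestion_config("orders")
print(orders_config["destination"])  # gcs://raw/orders
```

Keeping the template immutable and applying overrides on a copy means a team can deviate in one field (for example the storage format) without affecting other sources.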

Escape Hatches¶

Definition: Approved ways to deviate from paved paths when needed.

When to use:
- Unique requirements not met by standard paths
- Performance optimization
- Experimental patterns

Process:
1. Document why standard path doesn't work
2. Get approval (platform team review)
3. Implement with monitoring
4. Evaluate for promotion to paved path

Example:

Standard: Use Dataflow for streaming
Escape hatch: Use Flink for stateful processing (approved use case)

Principle: Make paved paths easy to use; make escape hatches possible, but reviewed.

Contract-First Ingestion¶

The Problem¶

Without contracts, you get:
- Schema drift breaking downstream
- Unclear SLAs
- Ownership confusion
- Cost attribution issues

The Solution: Data Contracts¶

Contract definition:

source: user_events
owner: analytics-team@company.com
sla:
  freshness: 15 minutes
  availability: 99.9%
schema:
  version: "1.0"
  fields:
    - name: user_id
      type: string
      required: true
    - name: event_type
      type: string
      enum: [click, view, purchase]
  evolution: backward_compatible
quality:
  completeness: "> 99%"
  uniqueness: "> 99.9%"
cost_attribution: analytics-team
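
The `evolution: backward_compatible` rule can be checked mechanically before a new contract version is accepted. The following is a minimal sketch, assuming one common reading of backward compatibility: existing fields keep their name and type, and any newly added field is optional. The function name and the list-of-dicts schema shape are illustrative, chosen to mirror the `fields` block above.

```python
def is_backward_compatible(old_fields, new_fields):
    """Return True if new_fields can replace old_fields without breaking readers."""
    old = {f["name"]: f for f in old_fields}
    new = {f["name"]: f for f in new_fields}

    # Existing fields must not be removed or change type.
    for name, field in old.items():
        if name not in new or new[name]["type"] != field["type"]:
            return False

    # Newly added fields must be optional, so existing producers stay valid.
    for name, field in new.items():
        if name not in old and field.get("required", False):
            return False

    return True


v1 = [
    {"name": "user_id", "type": "string", "required": True},
    {"name": "event_type", "type": "string"},
]
v2_adds_optional = v1 + [{"name": "session_id", "type": "string"}]
v2_drops_field = [f for f in v1 if f["name"] != "event_type"]

print(is_backward_compatible(v1, v2_adds_optional))  # True
print(is_backward_compatible(v1, v2_drops_field))    # False
```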

Contract Enforcement¶

At ingestion:
1. Validate schema matches contract
2. Check quality metrics
3. Reject if contract violated

In platform:
1. Store contracts in registry
2. Version contracts
3. Notify on violations
4. Track compliance

Tools: DataHub, Great Expectations, custom validators
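
The ingestion-time steps (validate schema, check quality metrics, reject on violation) can be sketched as a custom validator. This is a simplified illustration, not how DataHub or Great Expectations work: `CONTRACT`, `ValidationResult`, and `validate_batch` are hypothetical names, and uniqueness is assumed to be measured over whole records because the example contract does not name a key.

```python
from dataclasses import dataclass, field

# Hypothetical in-memory form of the example contract.
CONTRACT = {
    "fields": {
        "user_id": {"type": str, "required": True},
        "event_type": {"type": str, "enum": {"click", "view", "purchase"}},
    },
    "min_completeness": 0.99,  # contract: completeness > 99%
    "min_uniqueness": 0.999,   # contract: uniqueness > 99.9%
}


@dataclass
class ValidationResult:
    accepted: bool
    violations: list = field(default_factory=list)


def validate_batch(records, contract=CONTRACT):
    """Check one batch against the contract; reject if anything is violated."""
    violations = []
    complete_count = 0
    for index, record in enumerate(records):
        complete = True
        for name, spec in contract["fields"].items():
            value = record.get(name)
            if value is None:
                # A missing required field makes the record incomplete.
                complete = complete and not spec.get("required", False)
            elif not isinstance(value, spec["type"]):
                violations.append(f"record {index}: {name} has the wrong type")
            elif "enum" in spec and value not in spec["enum"]:
                violations.append(f"record {index}: {name}={value!r} is not allowed")
        complete_count += complete
    total = len(records)
    completeness = complete_count / total if total else 1.0
    # Assumption: uniqueness is measured over whole records.
    distinct = len({tuple(sorted(r.items())) for r in records})
    uniqueness = distinct / total if total else 1.0
    if completeness <= contract["min_completeness"]:
        violations.append(f"completeness {completeness:.3f} is below the contract")
    if uniqueness <= contract["min_uniqueness"]:
        violations.append(f"uniqueness {uniqueness:.4f} is below the contract")
    return ValidationResult(accepted=not violations, violations=violations)


good = [{"user_id": "u1", "event_type": "click"}, {"user_id": "u2", "event_type": "view"}]
bad = [{"user_id": "u1", "event_type": "refund"}, {"event_type": "click"}]
print(validate_batch(good).accepted)  # True
print(validate_batch(bad).violations)
```

Returning the full list of violations, rather than failing on the first one, gives the "notify on violations" step something concrete to send to the owning team.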

Benefits¶

Predictability: Downstream knows what to expect
Quality: Issues caught early
Ownership: Clear accountability
Evolution: Controlled schema changes

Cost Attribution and Accountability¶

The Problem¶

Without attribution:
- "The platform is expensive" (but who's using it?)
- No incentive to optimize
- Hard to justify investments

Solution: Cost Attribution¶

Attribution dimensions:
- Team: Which team owns the data/pipeline
- Project: Which project/business unit
- Source: Which source system
- Consumer: Which downstream consumers

Implementation:

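-- Example: Cost attribution query for the last 30 days
-- (date arithmetic syntax varies by SQL dialect; adjust the WHERE clause as needed)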
SELECT
  team,
  source,
  SUM(storage_cost) as storage_cost,
  SUM(compute_cost) as compute_cost,
  SUM(total_cost) as total_cost
FROM cost_attribution
WHERE date >= CURRENT_DATE - 30
GROUP BY team, source
ORDER BY total_cost DESC

Tools:
- Cloud cost management (AWS Cost Explorer, GCP Billing)
- Custom attribution tags
- DataHub cost tracking

Showback vs Chargeback¶

Showback (recommended):
- Show costs to teams
- Create awareness
- Encourage optimization
- No actual billing

Chargeback:
- Actually bill teams
- Stronger incentive
- More complex (billing systems)
- Can create friction

Recommendation: Start with showback. Move to chargeback only if needed.
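
A showback report can be as simple as each team's total and its share of overall spend. The sketch below assumes cost rows have already been attributed to teams, as in the attribution query earlier in this section. The function name, row shape, and sample figures are hypothetical.

```python
from collections import defaultdict


def showback_report(cost_rows):
    """Summarise cost per team. cost_rows: (team, storage_cost, compute_cost) tuples."""
    totals = defaultdict(float)
    for team, storage_cost, compute_cost in cost_rows:
        totals[team] += storage_cost + compute_cost
    grand_total = sum(totals.values())
    report = [
        {
            "team": team,
            "total_cost": cost,
            # Share of overall spend; informational only, nothing is billed.
            "share": cost / grand_total if grand_total else 0.0,
        }
        for team, cost in totals.items()
    ]
    return sorted(report, key=lambda row: row["total_cost"], reverse=True)


# Hypothetical sample data.
rows = [
    ("analytics", 100.0, 300.0),
    ("growth", 50.0, 50.0),
    ("analytics", 0.0, 100.0),
]
for line in showback_report(rows):
    print(f'{line["team"]}: {line["total_cost"]:.2f} ({line["share"]:.0%})')
```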

Cost Accountability¶

Monthly reviews:
1. Top spenders by team
2. Cost trends (growth, anomalies)
3. Optimization opportunities
4. ROI of investments

Goals:
- Teams see their costs
- Teams understand cost drivers
- Teams optimize proactively
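
For the "cost trends" part of the monthly review, a simple starting point is to flag teams whose month-over-month growth exceeds a threshold. This is a sketch under stated assumptions: the 25% threshold, the function name, and the sample figures are all hypothetical, and a real review would likely use more history than two months.

```python
def flag_cost_anomalies(monthly_costs, growth_threshold=0.25):
    """Return (team, growth) pairs whose month-over-month growth exceeds the threshold.

    monthly_costs maps a team name to its monthly totals in chronological order.
    """
    flagged = []
    for team, costs in monthly_costs.items():
        if len(costs) < 2:
            continue  # not enough history to compute a trend
        previous, current = costs[-2], costs[-1]
        if previous == 0:
            # Brand-new spend has no meaningful ratio; treat it as unbounded growth.
            growth = float("inf") if current > 0 else 0.0
        else:
            growth = (current - previous) / previous
        if growth > growth_threshold:
            flagged.append((team, growth))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical sample data: the last two monthly totals per team.
costs = {"analytics": [1000.0, 1100.0], "growth": [400.0, 600.0]}
print(flag_cost_anomalies(costs))  # [('growth', 0.5)]
```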

Self-Serve Capabilities¶

Ingestion Self-Serve¶

Capabilities:
- Web UI or CLI to register new sources
- Automatic pipeline generation
- Schema discovery and validation
- Monitoring setup

Example flow:

# Developer registers new source
platform ingest register \
  --source postgres://db.example.com/users \
  --destination gcs://raw/users \
  --sla 15min \
  --owner analytics-team

# Platform automatically:
# - Creates CDC pipeline
# - Sets up monitoring
# - Creates contract
# - Provisions resources
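
Behind a command like the one above, the platform has to turn a handful of arguments into a pipeline definition, a contract stub, and monitoring settings. The sketch below shows one possible shape for that step. `register_source` and the structure of the returned plan are assumptions made for illustration; only the argument values come from the example flow.

```python
from urllib.parse import urlparse


def register_source(source: str, destination: str, sla: str, owner: str) -> dict:
    """Build a provisioning plan from registration arguments (hypothetical helper)."""
    parsed = urlparse(source)
    # Use the last path component (e.g. "users") as the logical source name.
    source_name = parsed.path.strip("/") or parsed.hostname
    return {
        "pipeline": {
            "type": "cdc",  # the example flow creates a CDC pipeline
            "source": parsed.scheme,
            "host": parsed.hostname,
            "destination": destination,
        },
        "contract": {
            "source": source_name,
            "owner": owner,
            "sla": {"freshness": sla},
        },
        "monitoring": {"enabled": True, "alert_owner": owner},
    }


plan = register_source(
    source="postgres://db.example.com/users",
    destination="gcs://raw/users",
    sla="15min",
    owner="analytics-team",
)
print(plan["contract"])
```

Generating the contract stub at registration time means every source has an owner and an SLA from its first day on the platform.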

Benefits:
- Faster time to value (hours vs weeks)
- Reduced platform team load
- Consistency (standard patterns)

Transformation Self-Serve¶

Capabilities:
- Managed compute (Spark, Flink clusters)
- Standard libraries and frameworks
- CI/CD integration
- Testing frameworks

Example:

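# Developer writes transformation
# Note: `platform` stands for an in-house platform SDK (illustrative), not Python's
# standard-library `platform` module. `df` is assumed to be a Spark DataFrame.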
@platform.transform(
    input="raw.events",
    output="curated.user_events",
    schedule="hourly"
)
def transform_events(df):
    return df.filter(df.event_type == "purchase")

Platform handles:
- Resource provisioning
- Scheduling
- Monitoring
- Error handling
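
To make the division of labor concrete, here is a minimal sketch of how such a decorator could work: it records the metadata in a registry that a scheduler can read, and leaves the function itself untouched. `REGISTRY`, `TransformSpec`, and `run` are hypothetical names, and the example uses plain lists of dicts instead of a DataFrame so that it runs without dependencies.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TransformSpec:
    func: Callable
    input: str
    output: str
    schedule: str


# The platform's view of every registered transformation.
REGISTRY = {}


def transform(input: str, output: str, schedule: str):
    """Register the decorated function together with its metadata."""
    def decorator(func):
        REGISTRY[func.__name__] = TransformSpec(func, input, output, schedule)
        return func  # the developer's function is returned unchanged
    return decorator


@transform(input="raw.events", output="curated.user_events", schedule="hourly")
def transform_events(rows):
    return [row for row in rows if row["event_type"] == "purchase"]


def run(name, rows):
    """Stand-in for the scheduler: look up a transformation and execute it."""
    spec = REGISTRY[name]
    return spec.func(rows)


events = [{"event_type": "purchase"}, {"event_type": "click"}]
print(run("transform_events", events))  # [{'event_type': 'purchase'}]
```

Because the decorator only registers metadata, provisioning, scheduling, monitoring, and error handling can all live in the platform's runner rather than in each team's code.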

Discovery Self-Serve¶

Capabilities:
- Data catalog (search, browse)
- Schema documentation
- Lineage visualization
- Usage statistics

Tools: DataHub, Collibra, custom catalogs
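
Lineage is, at its core, a directed graph from datasets to the datasets derived from them. A minimal sketch of the "what depends on this?" question follows. The `LINEAGE` mapping and the `marts.*` dataset names are made up for illustration, and a real catalog would expose this through its own API rather than a dictionary.

```python
from collections import deque

# Hypothetical lineage graph: dataset -> datasets built directly from it.
LINEAGE = {
    "raw.events": ["curated.user_events"],
    "curated.user_events": ["marts.daily_purchases", "marts.user_funnel"],
    "marts.daily_purchases": [],
    "marts.user_funnel": [],
}


def downstream(dataset, lineage=LINEAGE):
    """Return every dataset that directly or indirectly depends on `dataset`."""
    seen = []
    queue = deque(lineage.get(dataset, []))
    while queue:  # breadth-first traversal
        current = queue.popleft()
        if current in seen:
            continue
        seen.append(current)
        queue.extend(lineage.get(current, []))
    return seen


print(downstream("raw.events"))
# ['curated.user_events', 'marts.daily_purchases', 'marts.user_funnel']
```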

Platform Team Structure¶

Core Team Roles¶

Platform Engineers:
- Build and maintain infrastructure
- Develop self-serve capabilities
- Optimize platform performance

Data Engineers (Platform):
- Design ingestion patterns
- Build transformation frameworks
- Create best practices

SRE / DevOps:
- Reliability and observability
- Incident response
- Capacity planning

Product Managers:
- Platform roadmap
- User needs (domain teams)
- Success metrics

Team Size Guidelines¶

Small organization (< 100 engineers):
- 2-3 platform engineers
- Part-time SRE
- No dedicated PM

Medium organization (100-500 engineers):
- 5-10 platform engineers
- 1-2 SRE
- 1 PM

Large organization (500+ engineers):
- 15-30 platform engineers
- 3-5 SRE
- 2-3 PM
- Dedicated cost optimization team

Success Metrics¶

Platform Health¶

Adoption:
- % of data sources using platform
- % of transformations on platform
- Active users per month

Reliability:
- Platform uptime (target: 99.9%)
- Pipeline success rate (target: > 99%)
- Mean time to recovery (MTTR)
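
It helps to translate these reliability targets into concrete numbers. A 99.9% uptime target over a 30-day month allows about 43 minutes of downtime. The helpers below are a small sketch of that arithmetic, plus success rate and MTTR; the function names and the sample incidents are hypothetical.

```python
from datetime import datetime, timedelta


def downtime_budget(availability_target, period):
    """Allowed downtime within `period` for a target such as 0.999."""
    return period * (1 - availability_target)


def pipeline_success_rate(succeeded, total):
    """Fraction of pipeline runs that succeeded."""
    return succeeded / total if total else 1.0


def mean_time_to_recovery(incidents):
    """Average outage duration. incidents: (started_at, resolved_at) datetime pairs."""
    durations = [resolved - started for started, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


budget = downtime_budget(0.999, timedelta(days=30))
print(budget)  # roughly 43 minutes

# Hypothetical incidents: a 30-minute and a 90-minute outage.
incidents = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 30)),
    (datetime(2024, 1, 2, 9, 0), datetime(2024, 1, 2, 10, 30)),
]
print(mean_time_to_recovery(incidents))  # 1:00:00
print(pipeline_success_rate(995, 1000) > 0.99)  # True
```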

Performance:
- Ingestion latency (p50, p95, p99)
- Query performance (p50, p95, p99)
- Resource utilization
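
Percentiles matter here because averages hide tail latency. As a minimal sketch, the standard library's `statistics.quantiles` can compute p50, p95, and p99 from a list of samples. The helper name and the sample data are made up, and production systems usually compute percentiles in the metrics backend rather than in application code.

```python
from statistics import quantiles


def latency_percentiles(samples):
    """Return p50, p95 and p99 for a list of latency samples (at least two)."""
    # n=100 yields 99 cut points; cut point k (1-based) is the k-th percentile.
    cuts = quantiles(samples, n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Hypothetical ingestion latencies in seconds.
latencies = list(range(1, 101))
result = latency_percentiles(latencies)
print(result)
```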

Developer Experience¶

Time to value:
- Time to first ingestion (target: < 1 day)
- Time to first transformation (target: < 2 days)

Developer satisfaction:
- NPS or survey scores
- Support ticket volume
- Documentation usage

Self-serve adoption:
- % of pipelines created via self-serve
- % of transformations using standard frameworks

Cost Efficiency¶

Cost per GB ingested:
- Track over time
- Compare to industry benchmarks
- Optimize continuously

Cost per query:
- Average cost
- Cost by query type
- Optimization opportunities

Total cost of ownership:
- Platform infrastructure cost
- Operational overhead
- Developer time saved
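
Tracking cost per GB ingested over time is simple arithmetic, shown here as a short sketch. All figures are invented for illustration and do not represent benchmarks; `cost_per_gb` is a hypothetical helper.

```python
def cost_per_gb(total_cost, gb_ingested):
    """Unit cost of ingestion for one period."""
    if gb_ingested <= 0:
        raise ValueError("gb_ingested must be positive")
    return total_cost / gb_ingested


# Hypothetical monthly figures: (month, total cost, GB ingested).
MONTHLY = [
    ("2024-01", 1200.0, 40_000.0),
    ("2024-02", 1300.0, 52_000.0),
    ("2024-03", 1500.0, 75_000.0),
]

trend = [(month, cost_per_gb(cost, gb)) for month, cost, gb in MONTHLY]
# Total spend rises every month, yet the unit cost falls as volume grows.
improving = all(later[1] < earlier[1] for earlier, later in zip(trend, trend[1:]))

for month, unit_cost in trend:
    print(f"{month}: {unit_cost:.4f} per GB")
print("unit cost improving:", improving)
```

Looking at unit cost alongside total cost keeps a growing platform from being judged as "getting more expensive" when it is actually becoming more efficient.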

Operating Model Maturity¶

Level 1: Ad-Hoc¶

Manual pipeline creation
No standards
Limited self-serve
High operational burden

Level 2: Standardized¶

Common patterns documented
Some self-serve capabilities
Basic governance
Platform team bottleneck

Level 3: Self-Serve Platform¶

Most tasks self-serve
Clear contracts and SLAs
Cost attribution
Platform enables, doesn't block

Level 4: Product Platform¶

Full self-serve
Predictive quality
Automated optimization
Platform as competitive advantage

Next Steps¶

Quality, Governance & Observability - How to ensure quality and govern data
Cost Efficiency & Scale - Advanced cost optimization