Apache Spark¶

Distributed data processing at scale.

Overview¶

Apache Spark is a unified analytics engine for large-scale data processing. It is widely used for both batch and streaming workloads, and offers APIs in Scala, Java, Python, R, and SQL.

Key Concepts¶

RDDs (Resilient Distributed Datasets)¶

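An RDD is an immutable collection of records, split into partitions that are distributed across the cluster. Transformations such as map are lazy: they return a new RDD and run only when an action such as collect() is called.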

Example:

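from pyspark.sql import SparkSession
sc = SparkSession.builder.getOrCreate().sparkContext  # entry point for the RDD API
rdd = sc.parallelize([1, 2, 3, 4, 5])  # distribute a local list
rdd.map(lambda x: x * 2).collect()  # map is lazy; collect() triggers execution
# [2, 4, 6, 8, 10]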
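To make "immutable" and "lazy" concrete, here is a minimal plain-Python sketch of the idea. It is not Spark code: `LazyCollection` and its methods are hypothetical names invented for illustration, and the contiguous partition split is an assumption chosen to keep the output order simple.

```python
from typing import Callable, Iterable, List, Tuple


class LazyCollection:
    """Toy stand-in for an RDD: immutable, partitioned, lazily evaluated."""

    def __init__(self, partitions: List[List[int]],
                 ops: Tuple[Callable[[int], int], ...] = ()):
        self._partitions = partitions  # never mutated after construction
        self._ops = ops                # recorded transformations (the lineage)

    @classmethod
    def parallelize(cls, data: Iterable[int], num_partitions: int = 2) -> "LazyCollection":
        items = list(data)
        # Contiguous slices of roughly equal size (ceiling division).
        size = max(1, -(-len(items) // num_partitions))
        return cls([items[i:i + size] for i in range(0, len(items), size)])

    def map(self, fn: Callable[[int], int]) -> "LazyCollection":
        # Returns a NEW collection; nothing is computed here.
        return LazyCollection(self._partitions, self._ops + (fn,))

    def collect(self) -> List[int]:
        # The "action": replay the recorded transformations per partition.
        out = []
        for part in self._partitions:
            for x in part:
                for fn in self._ops:
                    x = fn(x)
                out.append(x)
        return out


calls = []

def double(x: int) -> int:
    calls.append(x)  # record that the function actually ran
    return x * 2

base = LazyCollection.parallelize([1, 2, 3, 4, 5])
doubled = base.map(double)
assert calls == []          # map alone has not executed anything
result = doubled.collect()  # [2, 4, 6, 8, 10]
```

Because each `map` only appends to the recorded lineage, the original collection is untouched and a lost partition could be rebuilt by replaying the same steps, which is the property the "resilient" in RDD refers to.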

DataFrames¶

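A DataFrame is a distributed table of rows with named, typed columns (a schema). Because Spark knows the schema, it can optimize queries before running them.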

Example:

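from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3://data/events/")
# Use F.sum (Spark's aggregate); Python's built-in sum() fails on a column name
df.filter(df.date == "2024-01-15") \
  .groupBy("user_id") \
  .agg(F.sum("amount").alias("total")) \
  .show()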
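For readers new to the DataFrame API, the query above can be read as "keep one day's rows, then sum amount per user". The plain-Python sketch below computes the same thing on a few in-memory rows. The sample rows and the `daily_totals` helper are hypothetical and exist only to show the shape of the result; they are not part of Spark.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for the Parquet data.
events = [
    {"user_id": "u1", "date": "2024-01-15", "amount": 10.0},
    {"user_id": "u2", "date": "2024-01-15", "amount": 5.0},
    {"user_id": "u1", "date": "2024-01-15", "amount": 2.5},
    {"user_id": "u1", "date": "2024-01-16", "amount": 99.0},
]

def daily_totals(rows, date):
    """Single-machine equivalent of filter -> groupBy -> sum."""
    totals = defaultdict(float)
    for row in rows:
        if row["date"] == date:                      # filter
            totals[row["user_id"]] += row["amount"]  # groupBy + sum
    return dict(totals)

totals = daily_totals(events, "2024-01-15")
# {'u1': 12.5, 'u2': 5.0}
```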

Datasets¶

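A Dataset is a strongly typed collection whose element type is checked at compile time. It is available only in the Scala and Java APIs; in Scala, DataFrame is an alias for Dataset[Row].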

Best Practices¶

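Partitioning - Size partitions so tasks are neither tiny nor huge, and partition stored data on columns that queries filter on (for example, date)
Caching - Cache or persist datasets that are reused across several actions, and unpersist them when they are no longer needed
Broadcast joins - When one side of a join fits comfortably in executor memory, broadcast it so the large side is not shuffled
Avoid shuffles - Wide operations such as groupBy, join, and repartition move data across the network; filter and select early to shrink what is shuffled
Resource tuning - Tune executor memory and cores (spark.executor.memory, spark.executor.cores) to the workload and cluster size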
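The broadcast and shuffle advice can be illustrated with a small plain-Python model. This is a sketch under stated assumptions, not Spark's implementation: `partition_of`, `shuffle_join`, and `broadcast_join` are hypothetical helpers, `zlib.crc32` stands in for the partitioner's hash, and "rows moved" is a rough proxy for network traffic.

```python
import zlib
from collections import defaultdict

def partition_of(key: str, num_partitions: int) -> int:
    # crc32 is a stable stand-in for the hash a real partitioner would use.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def shuffle_join(large, small, num_partitions=4):
    """Both sides are redistributed by key, so every row is moved."""
    left, right = defaultdict(list), defaultdict(list)
    for key, value in large:
        left[partition_of(key, num_partitions)].append((key, value))
    for key, value in small:
        right[partition_of(key, num_partitions)].append((key, value))
    joined = []
    for p in range(num_partitions):
        lookup = defaultdict(list)
        for key, value in right[p]:
            lookup[key].append(value)
        for key, value in left[p]:
            for other in lookup.get(key, []):
                joined.append((key, value, other))
    return joined, len(large) + len(small)

def broadcast_join(large_partitions, small):
    """The small side is copied to every partition; large rows stay put."""
    lookup = defaultdict(list)
    for key, value in small:
        lookup[key].append(value)
    joined = []
    for part in large_partitions:
        for key, value in part:
            for other in lookup.get(key, []):
                joined.append((key, value, other))
    return joined, len(small) * len(large_partitions)

# Hypothetical data: a larger fact table and a small lookup table.
orders = [("u1", 10), ("u2", 5), ("u1", 7), ("u3", 1), ("u2", 3), ("u4", 8)]
users = [("u1", "DE"), ("u2", "US"), ("u3", "FR")]
order_partitions = [orders[:3], orders[3:]]

shuffled, shuffle_cost = shuffle_join(orders, users)
broadcasted, broadcast_cost = broadcast_join(order_partitions, users)
```

Both functions return the same joined rows, but the broadcast version's cost depends only on the small table and the number of partitions, so the gap widens as the large table grows. In PySpark the corresponding hint is `pyspark.sql.functions.broadcast(small_df)` inside a join, and Spark may also broadcast automatically for tables below `spark.sql.autoBroadcastJoinThreshold`.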
