Apache Airflow¶
Workflow orchestration for data pipelines.
Overview¶
Apache Airflow is an open-source platform for programmatically authoring, scheduling, and monitoring workflows. It's the de facto standard for data pipeline orchestration.
Key Concepts¶
DAGs (Directed Acyclic Graphs)¶
A DAG is the workflow definition: a collection of tasks plus the dependencies between them, with no cycles.
Example:
```python
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

dag = DAG(
    'ingest_data',
    start_date=datetime(2024, 1, 1),
    schedule_interval='@daily',
)

extract = BashOperator(
    task_id='extract',
    bash_command='python extract.py',
    dag=dag,
)

transform = BashOperator(
    task_id='transform',
    bash_command='python transform.py',
    dag=dag,
)

load = BashOperator(
    task_id='load',
    bash_command='python load.py',
    dag=dag,
)

# extract runs first, then transform, then load
extract >> transform >> load
```
Tasks¶
A task is a single unit of work within a DAG.

Related building blocks:

- Operators - run a predefined action (Bash, Python, SQL)
- Sensors - wait for a condition to be met before downstream tasks run
- Hooks - reusable connections to external systems (used inside operators rather than scheduled as tasks themselves)
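Conceptually, a sensor is a task whose only job is to wait: it repeatedly "pokes" a condition until it becomes true or a timeout expires. A minimal plain-Python sketch of that polling pattern (not the actual Airflow sensor API; `poke` and `poke_interval` mirror the real parameter names, everything else here is illustrative):

```python
import time

def wait_for(poke, poke_interval=1.0, timeout=10.0):
    """Poll poke() until it returns True or `timeout` seconds elapse.

    This mirrors the idea behind Airflow sensors: the task does no real
    work itself, it just blocks downstream tasks until a condition holds.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if poke():
            return True
        time.sleep(poke_interval)
    raise TimeoutError("condition not met before timeout")

# Example: a condition that becomes true on the 3rd check,
# standing in for "a file has landed" or "a partition exists".
state = {"calls": 0}

def file_has_landed():
    state["calls"] += 1
    return state["calls"] >= 3

wait_for(file_has_landed, poke_interval=0.01, timeout=5.0)
```

Airflow's real sensors add reschedule modes and exponential backoff on top of this loop, so a waiting task doesn't hold a worker slot the whole time.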
Operators¶
- `BashOperator` - run shell commands
- `PythonOperator` - call a Python function
- SQL operators (e.g. `SQLExecuteQueryOperator` from the common SQL provider) - run SQL queries
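All operators share the same shape: a class holding task configuration, with an `execute` method the worker calls when the task runs. A plain-Python sketch of that pattern (the real `BaseOperator` has a much richer interface; `MiniOperator` and everything below are illustrative stand-ins):

```python
class MiniOperator:
    """Toy stand-in for Airflow's BaseOperator: holds config, runs on execute()."""

    def __init__(self, task_id):
        self.task_id = task_id

    def execute(self, context):
        raise NotImplementedError


class MiniPythonOperator(MiniOperator):
    """Like PythonOperator: wraps an arbitrary callable."""

    def __init__(self, task_id, python_callable):
        super().__init__(task_id)
        self.python_callable = python_callable

    def execute(self, context):
        # The worker passes runtime context (logical date, etc.) at execution time.
        return self.python_callable(**context)


greet = MiniPythonOperator(
    task_id="greet",
    python_callable=lambda name: f"hello, {name}",
)
result = greet.execute({"name": "airflow"})  # → "hello, airflow"
```

This separation of configuration (constructor, evaluated at DAG-parse time) from work (`execute`, run by a worker) is why DAG files must stay cheap to import.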
Best Practices¶
- Idempotency - rerunning a task for the same logical date should produce the same result
- Atomicity - a task should fully succeed or fully fail, leaving no partial output
- Dependencies - declare explicit task dependencies rather than relying on timing
- Error handling - use retries and failure callbacks to handle failures gracefully
- Monitoring - set up alerts for failed runs
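Idempotency in practice usually means keying each run's output to its logical date and overwriting on rerun, rather than appending. A minimal sketch (plain Python; `write_partition` and the in-memory "warehouse" dict are illustrative, standing in for a date-partitioned table):

```python
# In-memory stand-in for a date-partitioned warehouse table.
warehouse = {}

def write_partition(table, ds, rows):
    """Idempotent load: overwrite the partition for logical date `ds`.

    Rerunning the task for the same `ds` replaces the partition instead
    of appending, so retries and backfills cannot duplicate data.
    """
    warehouse[(table, ds)] = list(rows)

# First run for 2024-01-01.
write_partition("events", "2024-01-01", [{"id": 1}, {"id": 2}])

# Rerun of the same logical date (e.g. after a retry or backfill):
# the partition is replaced, not doubled.
write_partition("events", "2024-01-01", [{"id": 1}, {"id": 2}])
```

An append-based load (`INSERT INTO ... VALUES ...` with no partition overwrite) would fail this test: every retry would duplicate rows.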
Related Topics¶
- dbt - SQL-based transformations
- Data Orchestration - Orchestration overview