Apache Airflow in a Dev Container

Introduction to Apache Airflow

Apache Airflow has revolutionized the way organizations orchestrate complex data workflows. Created by Airbnb in 2014 and later donated to the Apache Software Foundation, Airflow provides a programmatic approach to authoring, scheduling, and monitoring workflows.

At its core, Airflow allows you to define your data pipelines as code using Python, making them maintainable, versionable, and testable. These workflows are represented as Directed Acyclic Graphs (DAGs), which consist of individual tasks and their dependencies. This structure ensures that tasks are executed in the correct order, with proper handling of dependencies, retries, and failures.
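To make the ordering guarantee concrete, here is a minimal sketch that uses Python’s standard-library graphlib module rather than Airflow itself. The task names (extract, transform, validate, load) and the helper execution_order are hypothetical and only illustrate how a DAG’s dependencies determine a valid run order.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each key is a task, each value is the set of tasks
# it depends on. This mirrors how a DAG encodes "run B only after A".
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "validate": {"extract"},
    "load": {"transform", "validate"},
}


def execution_order(deps):
    """Return one valid order in which every task runs after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Because transform and validate depend only on extract, either may come first in the result. An orchestrator is free to run such independent tasks side by side, which is the kind of parallelism discussed later in this post.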

Organizations across industries rely on Airflow for:

What sets Airflow apart is its flexibility and extensibility. With hundreds of pre-built operators, hooks, and sensors, it easily integrates with cloud services, databases, messaging systems, and more. Its web UI provides real-time monitoring and troubleshooting capabilities, while its scheduler ensures reliable execution based on defined schedules or external triggers.

Development with Dev Containers

When developing Airflow DAGs, having a consistent environment that mirrors production is crucial. This is where Dev Containers come in. Dev Containers allow developers to use a Docker container as a full-featured development environment, providing:

Visual Studio Code’s Dev Container extension makes this process seamless, automatically connecting to the container and providing a fully-featured development experience within it.

Let’s explore how we can leverage Dev Containers for Airflow development, starting with a simple setup and then enhancing it for parallel task execution.

Prerequisites

Before we dive into setting up Airflow with Dev Containers, make sure you have the following installed on your system:

  1. Docker
  2. Visual Studio Code
  3. The Dev Containers extension for Visual Studio Code

You don’t need to install Python or Airflow directly on your machine, as we’ll be running everything inside containers. This is one of the key benefits of the Dev Container approach – it isolates dependencies from your local system.

Simple Airflow Dev Container

Our first Dev Container provides a minimal setup to get started with Airflow. The configuration consists of three key files in the .devcontainer directory:

  1. Dockerfile – Builds a Python image with Airflow installed
  2. devcontainer.json – Configures VS Code integration and startup commands
  3. .env – Sets Airflow environment variables

Here’s how our project structure looks for this simple setup:

my-airflow-project/
├── .devcontainer/
│   ├── Dockerfile
│   ├── devcontainer.json
│   ├── .env
│   └── requirements.txt
└── dags/ 
    ├── hello_world.py 
    └── parallel_tasks.py

 

Let’s look at how these files are configured:

.devcontainer/Dockerfile

FROM mcr.microsoft.com/devcontainers/python:3.10

RUN pip install "apache-airflow==2.10.5" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.10.5/constraints-3.10.txt"

WORKDIR /app
COPY requirements.txt requirements.txt  
RUN pip install --no-cache-dir -r requirements.txt  --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.10.5/constraints-3.10.txt"

 

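This Dockerfile starts with Microsoft’s Python 3.10 dev container image and installs Apache Airflow 2.10.5, using the matching constraints file to pin compatible dependency versions. It then installs any extra packages from requirements.txt under the same constraints.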
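The constraints URL in the Dockerfile follows a fixed pattern: the Airflow version and the Python minor version both appear in the path, so the Python version has to match the base image. The following sketch builds that URL; the helper name constraints_url is hypothetical and not part of Airflow.

```python
def constraints_url(airflow_version: str, python_version: str) -> str:
    """Build the URL of the constraints file for an Airflow/Python pair."""
    return (
        "https://raw.githubusercontent.com/apache/airflow/"
        f"constraints-{airflow_version}/constraints-{python_version}.txt"
    )


# Same versions as in the Dockerfile above
url = constraints_url("2.10.5", "3.10")
print(url)
```

If you change the base image to a different Python version, the second argument needs to change with it, otherwise pip resolves against pins that were tested for another interpreter.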

.devcontainer/devcontainer.json

{
    "name": "Airflow devcontainer",
    "dockerFile": "Dockerfile",
    "runArgs": [
        "--env-file",
        ".devcontainer/.env"
    ],
    "postStartCommand": {
        "start airflow": "nohup bash -c 'airflow standalone &'",
        "create airflow user": "airflow users create --username airflow --password airflow --role Admin --email '' --firstname '' --lastname ''"
    },
    "forwardPorts": [
        8080
    ]
}

 

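As soon as the Dev Container starts, it will:

  1. Load the environment variables from .devcontainer/.env
  2. Start Airflow in the background with airflow standalone
  3. Create an admin user named airflow (password: airflow)
  4. Forward port 8080 so the Airflow UI is reachable from the host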

.devcontainer/.env

AIRFLOW__CORE__LOAD_EXAMPLES=false
AIRFLOW__CORE__DAGS_FOLDER=dags

 

These environment variables configure Airflow to:

  1. Skip loading the bundled example DAGs
  2. Look for DAG files in the dags folder
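Airflow reads configuration overrides from environment variables named AIRFLOW__{SECTION}__{KEY}, with double underscores as separators. As a rough illustration, the sketch below splits the lines of our .env file into the config section and key they target. The function parse_airflow_env is a hypothetical helper for this post, not an Airflow API.

```python
def parse_airflow_env(text: str) -> dict:
    """Map AIRFLOW__SECTION__KEY=value lines to {(section, key): value}."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, _, value = line.partition("=")
        parts = name.split("__", 2)
        if len(parts) == 3 and parts[0] == "AIRFLOW":
            _, section, key = parts
            settings[(section.lower(), key.lower())] = value
    return settings


env_text = """
AIRFLOW__CORE__LOAD_EXAMPLES=false
AIRFLOW__CORE__DAGS_FOLDER=dags
"""
settings = parse_airflow_env(env_text)
print(settings)
```

Both variables land in the core section, which corresponds to the [core] block you would otherwise edit in airflow.cfg.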

.devcontainer/requirements.txt

This file can contain additional Python dependencies for your Airflow tasks. For now, we can leave it empty (but still create it).

Testing with a Simple DAG

Let’s create a simple DAG to test our setup:

dags/hello_world.py

from airflow import DAG
from airflow.operators.python import PythonOperator

def my_task():
    print("Hello, Airflow!")

# Define the DAG
with DAG(
    dag_id="my_dag",
    schedule=None,  # Run manually
) as dag:
    task = PythonOperator(
        task_id="print_hello",
        python_callable=my_task,
    )

 

This DAG contains a single task that prints "Hello, Airflow!" when executed.

After starting our Dev Container (CTRL+SHIFT+P -> Dev Containers: Rebuild and Reopen in Container) and accessing the Airflow UI at http://localhost:8080 (username: airflow, password: airflow), we can trigger our DAG manually and see it run successfully.
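If the UI does not load right away, Airflow may still be starting. The Airflow 2 webserver exposes a /health endpoint that reports the status of its components as JSON. The sketch below polls it from Python; the helper names are hypothetical, and the check assumes the payload contains metadatabase and scheduler entries with a status field.

```python
import json
from urllib.request import urlopen


def fetch_health(base_url: str = "http://localhost:8080") -> dict:
    """Fetch the webserver's /health endpoint and decode the JSON payload."""
    with urlopen(f"{base_url}/health", timeout=5) as response:
        return json.load(response)


def is_healthy(payload: dict) -> bool:
    """True if both the metadata database and the scheduler report healthy."""
    return all(
        payload.get(component, {}).get("status") == "healthy"
        for component in ("metadatabase", "scheduler")
    )


# With the Dev Container running, you could call:
# print(is_healthy(fetch_health()))
```

The two functions are kept separate so that the status check can be exercised without a running Airflow instance.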

The Limitation: Parallelism with SQLite

While our simple setup works for basic DAGs, it has a significant limitation: it can’t execute tasks in parallel. By default, Airflow uses SQLite as its metadata database. SQLite does not cope well with concurrent writes, so Airflow pairs it with the SequentialExecutor, which runs one task at a time. To demonstrate this limitation, let’s create a DAG with multiple independent tasks:

dags/parallel_tasks.py

import time
from airflow import DAG
from airflow.operators.python import PythonOperator

def my_task():
    time.sleep(3)
    print("Hello, Airflow!")

with DAG(
    dag_id="my_parallel_dag",
    schedule=None,  # Run manually
) as dag:
    for i in range(10):
        task = PythonOperator(
            task_id=f"print_hello_{i}",
            python_callable=my_task,
        )

 

When running this DAG in our simple setup (it might take a few minutes until Airflow picks up the new DAG), the tasks will execute sequentially, one after another, rather than in parallel. For workflows with many tasks, this can significantly increase execution time.
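To get a feel for the difference, here is a small stand-alone sketch that runs ten sleeping tasks first one after another and then concurrently. It uses a thread pool from the standard library as a stand-in and does not reproduce how Airflow’s executors work internally; the delay is shortened from 3 seconds to 0.05 seconds so the script finishes quickly.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def my_task(delay: float) -> None:
    time.sleep(delay)  # stand-in for the sleep in the DAG's task


def run_sequential(n: int, delay: float) -> float:
    """Run n tasks one after another and return the elapsed seconds."""
    start = time.perf_counter()
    for _ in range(n):
        my_task(delay)
    return time.perf_counter() - start


def run_parallel(n: int, delay: float) -> float:
    """Run n tasks concurrently in a thread pool and return the elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n) as pool:
        list(pool.map(my_task, [delay] * n))
    return time.perf_counter() - start


sequential = run_sequential(10, 0.05)
parallel = run_parallel(10, 0.05)
print(f"sequential: {sequential:.2f}s, parallel: {parallel:.2f}s")
```

The sequential run takes roughly the sum of all delays, while the concurrent run takes roughly one delay. With the 3-second tasks in our DAG, that is the difference between about half a minute and a few seconds.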

Enhanced Airflow Dev Container with PostgreSQL

To overcome this limitation, we need to replace SQLite with a database that supports concurrent connections, like PostgreSQL. Our enhanced Dev Container adds PostgreSQL as a sidecar service using Docker Compose.

Here’s the updated project structure for our enhanced setup:

my-airflow-project/
├── .devcontainer/
│   ├── Dockerfile
│   ├── devcontainer.json
│   ├── .env
│   ├── docker-compose.yaml
│   ├── init.sql
│   └── requirements.txt
└── dags/
    ├── hello_world.py
    └── parallel_tasks.py

 

.devcontainer/docker-compose.yaml

name: airflow
services:
  postgres:
    image: postgres
    environment:
      - POSTGRES_PASSWORD=root
      - POSTGRES_USER=postgres
    volumes:
      - ./init.sql:/docker-entrypoint-initdb.d/init.sql

 

.devcontainer/init.sql

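-- Create a dedicated database and user for the Airflow metadata
CREATE DATABASE airflow_db;
CREATE USER airflow WITH PASSWORD 'airflow';
ALTER DATABASE airflow_db OWNER TO airflow;
GRANT ALL PRIVILEGES ON DATABASE airflow_db TO airflow;
-- Switch to airflow_db so the schema grant applies to its public schema
\c airflow_db
GRANT ALL ON SCHEMA public TO airflow;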

 

This SQL script creates an Airflow-specific database and user with appropriate permissions.

Updated .devcontainer/.env

AIRFLOW__CORE__LOAD_EXAMPLES=false
AIRFLOW__CORE__DAGS_FOLDER=dags
AIRFLOW__CORE__EXECUTOR=LocalExecutor
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres/airflow_db

 

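The key changes here:

  1. AIRFLOW__CORE__EXECUTOR switches Airflow to the LocalExecutor, which can run several tasks at once
  2. AIRFLOW__DATABASE__SQL_ALCHEMY_CONN points the metadata database at the PostgreSQL service instead of SQLite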
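The connection string packs several settings into one URL. As a quick illustration, the sketch below takes it apart with Python’s standard urllib.parse module, which is used here only for demonstration (Airflow itself hands the string to SQLAlchemy).

```python
from urllib.parse import urlsplit

conn = "postgresql+psycopg2://airflow:airflow@postgres/airflow_db"
parts = urlsplit(conn)

# The scheme has the form "<dialect>+<driver>"
dialect, _, driver = parts.scheme.partition("+")
database = parts.path.lstrip("/")

print("dialect: ", dialect)         # database type
print("driver:  ", driver)          # Python client library
print("user:    ", parts.username)  # user created in init.sql
print("host:    ", parts.hostname)  # service name from docker-compose.yaml
print("database:", database)        # database created in init.sql
```

The host is simply postgres because the Dev Container and the database share a Docker network, where the Compose service name resolves as a hostname.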

Updated .devcontainer/devcontainer.json

{
    "name": "Airflow devcontainer",
    "dockerFile": "Dockerfile",
    "runArgs": [
        "--env-file",
        ".devcontainer/.env",
        "--network=airflow_default"
    ],
    "initializeCommand": {
        "start db": "docker compose -f ${localWorkspaceFolder}/.devcontainer/docker-compose.yaml up -d --remove-orphans"
    },
    "postStartCommand": {
        "start airflow": "nohup bash -c 'airflow standalone &'",
        "create airflow user": "airflow users create --username airflow --password airflow --role Admin --email '' --firstname '' --lastname ''"
    },
    "forwardPorts": [
        8080
    ]
}

 

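Key enhancements:

  1. --network=airflow_default attaches the Dev Container to the network that Docker Compose creates for the airflow project, so the hostname postgres resolves
  2. initializeCommand starts the PostgreSQL service on the host before the Dev Container is created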
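One thing to keep in mind: docker compose up -d returns once the container has been started, which is not necessarily the moment PostgreSQL accepts connections. If Airflow fails on first start because the database is not reachable yet, a small wait loop can help. The sketch below is a hypothetical helper, not part of the setup above, and it assumes PostgreSQL listens on its default port 5432.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.2)  # not accepting connections yet; retry shortly
    return False


# Inside the Dev Container, the sidecar would be reachable as:
# wait_for_port("postgres", 5432)
```

An open port only tells us that something is listening. It does not prove that the airflow_db database is fully initialized, so treat this as a coarse readiness check.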

.devcontainer/requirements.txt

We need to add the PostgreSQL Python client to our requirements.txt:

psycopg2-binary

 

Testing Parallelism

After rebuilding our Dev Container (CTRL+SHIFT+P -> Dev Containers: Rebuild Container), we can run the same multi-task DAG from earlier, and this time the tasks execute in parallel. When triggered in the Airflow UI, we’ll see multiple tasks running simultaneously, significantly reducing the overall execution time.

You can verify this by checking the "Grid" view in the Airflow UI for your parallel DAG, where you’ll notice multiple tasks running concurrently rather than sequentially.

Conclusion

Dev Containers provide an excellent way to standardize development environments for Apache Airflow. We’ve seen how to:

  1. Create a basic Airflow Dev Container for simple workflows
  2. Identify the parallelism limitations of the default SQLite backend
  3. Enhance our setup with PostgreSQL to enable parallel task execution

This approach brings several benefits:

With this setup, you can develop and test even complex Airflow DAGs with parallel tasks in an environment that closely resembles production, all within the comfort of your favorite IDE.

For more advanced scenarios, you might consider extending this setup to include additional services like Redis for the Celery executor, or even integrating with Kubernetes for dynamic task allocation.

Copyright © 2025 Noser Engineering AG – Alle Rechte vorbehalten.
