In this second part of our blog, we’ll walk through how we automated the migration process using Apache Airflow. We’ll cover everything from unloading data from Amazon Redshift to S3, transferring it to Google Cloud Storage (GCS), and finally loading it into Google BigQuery. We orchestrated the entire process with Airflow so that every step ran automatically, in the right order, and could be monitored from one place.
Step 1: Setting Up Apache Airflow
Before diving into the migration tasks, we first need to ensure that Apache Airflow is properly set up. Here’s how I set it up for my project.
1. Install Apache Airflow: If you don’t have Airflow installed, use the following command to install it via pip:
pip install apache-airflow
2. Initialize the Airflow Database: Airflow requires a backend database to track tasks and maintain state. To initialize it, run:
airflow db init
3. Start the Web Server and Scheduler: Once Airflow is initialized, start the web server (for monitoring) and scheduler (to run tasks):
airflow webserver --port 8080
airflow scheduler

4. Set Up Connections in Airflow (a scripted alternative to the UI steps below is sketched after this list):
- Redshift: Create a connection in the Airflow UI under Admin → Connections for Redshift (with your host, database, username, and password).
- AWS: Set up an AWS connection with your AWS Access Key and Secret Access Key.
- Google Cloud: Set up a connection for Google Cloud using a service account (we’ll go into this below).
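If you’d rather script this than click through the UI, connections can also be registered against the Airflow metadata database from Python. The sketch below assumes Airflow 2.x; every host, credential, path, and connection ID is a placeholder, and the exact extras field names for the Google Cloud connection can vary between provider versions.

import json

from airflow.models import Connection
from airflow.utils.session import create_session

# Placeholder values; replace with your own hosts, credentials, and key path.
connections = [
    Connection(
        conn_id="redshift_migration",
        conn_type="postgres",
        host="my-cluster.xxxx.us-east-1.redshift.amazonaws.com",
        schema="dev",
        login="awsuser",
        password="REPLACE_ME",
        port=5439,
    ),
    Connection(
        conn_id="aws_migration",
        conn_type="aws",
        login="YOUR_AWS_ACCESS_KEY_ID",
        password="YOUR_AWS_SECRET_ACCESS_KEY",
    ),
    Connection(
        conn_id="gcp_migration",
        conn_type="google_cloud_platform",
        # Older Google provider versions expect the prefixed key
        # "extra__google_cloud_platform__key_path" instead.
        extra=json.dumps({"key_path": "/path/to/service-account-key.json"}),
    ),
]

with create_session() as session:
    for conn in connections:
        session.add(conn)

The connection IDs used here (redshift_migration, aws_migration, gcp_migration) are the ones the DAG sketches later in this post refer to; if you create the connections in the UI instead, just reuse whatever IDs you pick there.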
[Also Read Part 1: How to Optimize Amazon Redshift for Faster and Seamless Data Migration]
Step 2: Setting Up Google Cloud Resources
Before we can transfer data from S3 to Google Cloud Storage (GCS) and eventually to BigQuery, we need to configure the necessary resources in Google Cloud.
Create a Google Cloud Project
- Go to the Google Cloud Console.
- Create a new project (e.g., migration-project).
Enable the Necessary APIs
To interact with Google Cloud services, you need to enable the following APIs:
- Google Cloud Storage API
- BigQuery API
Go to APIs & Services → Library, search for these two APIs, and enable them.
Create a Service Account
Next, you’ll need to create a service account with the appropriate permissions:
- Go to the IAM & Admin → Service Accounts section in Google Cloud.
- Click Create Service Account, give it a name, and grant it the following roles:
- Storage Admin (for access to Google Cloud Storage)
- BigQuery Data Editor (for BigQuery access)


After creating the service account, you’ll be prompted to generate a key for it:
- Select Create Key and choose the JSON format.
- Download the key file and save it securely.
This key file will be used to authenticate Airflow with Google Cloud.
Set Up IAM Roles
Double-check that both roles (Storage Admin and BigQuery Data Editor) are attached to the service account, so that Airflow can write objects to GCS and create and load tables in BigQuery.
Step 3: Unloading Data from Redshift to S3 with Airflow
Once the resources in Google Cloud were ready, the first step was to unload data from Amazon Redshift into Amazon S3. I used the PostgresOperator in Airflow to execute a Redshift UNLOAD command.
Here’s the Airflow DAG (dag1.py) for unloading data from Redshift:
https://github.com/ramneek2109/DataMigration/blob/main/dag1.py
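To give a feel for the structure without opening the repo, here’s a minimal sketch of this kind of unload DAG (assuming Airflow 2.x with the Postgres provider installed). It is not the exact contents of dag1.py: the bucket path, IAM role ARN, unload format, and connection ID are placeholders to swap for your own values.

from datetime import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

# Placeholder bucket, IAM role, and file format; adjust to your environment.
UNLOAD_SQL = """
UNLOAD ('SELECT * FROM users')
TO 's3://my-migration-bucket/users/users_'
CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/RedshiftUnloadRole'
FORMAT AS CSV
HEADER
ALLOWOVERWRITE;
"""

with DAG(
    dag_id="redshift_unload_to_s3",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    unload_users = PostgresOperator(
        task_id="unload_users_to_s3",
        postgres_conn_id="redshift_migration",  # the Redshift connection from Step 1
        sql=UNLOAD_SQL,
    )

Unloading to a key prefix (users_) lets Redshift write multiple part files in parallel, which also sets up the parallel, per-file load into BigQuery later in the pipeline.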


How it works:
- The PostgresOperator runs a Redshift UNLOAD query that transfers data from the users table to Amazon S3.
- The CREDENTIALS part references the IAM role for accessing S3.
Step 4: Transferring Data from S3 to Google Cloud Storage (GCS)
After the data was unloaded to S3, the next step was to move it to Google Cloud Storage (GCS). This was done using the S3ToGCSOperator in Airflow.
Here’s the Airflow DAG (dag2.py) for transferring data from S3 to GCS:
https://github.com/ramneek2109/DataMigration/blob/main/dag2.py
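As before, here’s a minimal sketch of the shape of this DAG (assuming Airflow 2.x with recent Amazon and Google provider packages). The bucket names, prefix, and connection IDs are placeholders, not the exact values in dag2.py.

from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.s3 import S3ListOperator
from airflow.providers.amazon.aws.transfers.s3_to_gcs import S3ToGCSOperator

with DAG(
    dag_id="s3_to_gcs_transfer",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # List the objects under the export prefix so the logs show what is being moved.
    list_s3_files = S3ListOperator(
        task_id="list_s3_files",
        bucket="my-migration-bucket",
        prefix="users/",
        aws_conn_id="aws_migration",
    )

    # Copy everything under the prefix into the GCS bucket.
    transfer_to_gcs = S3ToGCSOperator(
        task_id="s3_to_gcs",
        bucket="my-migration-bucket",
        prefix="users/",
        dest_gcs="gs://my-migration-gcs-bucket/users/",
        aws_conn_id="aws_migration",
        gcp_conn_id="gcp_migration",
        replace=True,
    )

    list_s3_files >> transfer_to_gcs

Because the transfer works on a prefix, it picks up however many part files the UNLOAD step produced without naming them one by one.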

How it works:
- S3ToGCSOperator transfers files from Amazon S3 to Google Cloud Storage (GCS).
- The list_s3_files task lists the objects under the S3 prefix first, so the logs show exactly which files are being transferred.
Step 5: Loading Data from GCS to BigQuery
After transferring the data to Google Cloud Storage (GCS), the final step in the pipeline was to load the data into BigQuery. Instead of loading the entire dataset sequentially, we used parallel task execution in Airflow to improve performance and reduce load time. Here’s how we achieved this:
Dynamic Task Creation for Parallelism
Instead of writing a static Airflow task for loading data from GCS to BigQuery, we leveraged Airflow’s ability to dynamically create tasks for each file in the GCS bucket. By doing so:
- Each file was processed as an independent task.
- These tasks ran in parallel, making full use of the available Airflow workers to maximize throughput.
This approach works exceptionally well for large datasets split into multiple files because it allows BigQuery to ingest data efficiently across multiple streams.
Here’s the Airflow DAG (dag3.py) for loading data into BigQuery:
https://github.com/ramneek2109/DataMigration/blob/main/dag3.py
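To show how the dynamic, per-file tasks described above fit together, here’s a minimal sketch of this kind of load DAG (assuming Airflow 2.x with the Google provider). The dataset and table names, the file list, and the abbreviated schema are placeholders; only the partitioning column (created_at) and clustering columns (user_id, last_login) come from the setup described below.

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

# In the real pipeline the file list can be pulled from the GCS bucket (or passed
# along from the previous DAG); it is hard-coded here to keep the sketch short.
FILES = ["users/users_000.csv", "users/users_001.csv", "users/users_002.csv"]

# Abbreviated, illustrative schema.
SCHEMA = [
    {"name": "user_id", "type": "INTEGER"},
    {"name": "created_at", "type": "TIMESTAMP"},
    {"name": "last_login", "type": "TIMESTAMP"},
    # ...remaining columns...
]

with DAG(
    dag_id="gcs_to_bigquery_load",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # One load task per file; with no dependencies between them, they can run in parallel.
    for i, source_file in enumerate(FILES):
        GCSToBigQueryOperator(
            task_id=f"load_file_{i}",
            bucket="my-migration-gcs-bucket",
            source_objects=[source_file],
            destination_project_dataset_table="migration-project.users_dataset.users",
            schema_fields=SCHEMA,
            source_format="CSV",
            skip_leading_rows=1,
            write_disposition="WRITE_APPEND",
            time_partitioning={"type": "DAY", "field": "created_at"},
            cluster_fields=["user_id", "last_login"],
            gcp_conn_id="gcp_migration",
        )

Because no dependencies are declared between the load tasks, the scheduler is free to run them concurrently, up to the parallelism limits configured for the environment.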


How it works:
- GCSToBigQueryOperator loads the files from Google Cloud Storage into BigQuery, using the schema I defined.
- The table is partitioned by created_at and clustered by user_id and last_login for fast and efficient querying.
Wrapping Up
With the combination of Redshift, Airflow, S3, Google Cloud Storage, and BigQuery, we successfully automated our data migration pipeline. Orchestrating the entire process with Airflow meant the data moved reliably from end to end and landed in BigQuery in a partitioned, clustered layout that queries efficiently.
I hope this guide provides valuable insights for your next data migration project. Feel free to reach out with any questions or feedback!