In this second part of our blog, we’ll walk through how we automated the migration process using Apache Airflow. We’ll cover everything from unloading data from Amazon Redshift to S3, transferring it to Google Cloud Storage (GCS), and finally loading it into Google BigQuery. We orchestrated the entire process with Airflow so that every step ran automatically, in the right order, and could be monitored from one place.
Step 1: Setting Up Apache Airflow
Before diving into the migration tasks, we first need to ensure that Apache Airflow is properly set up. Here’s how I set it up for my project.
1. Install Apache Airflow: If you don’t have Airflow installed, use the following command to install it via pip:
pip install apache-airflow
2. Initialize the Airflow Database: Airflow requires a backend database to track tasks and maintain state. To initialize it, run:
airflow db init
3. Start the Web Server and Scheduler: Once Airflow is initialized, start the web server (for monitoring) and scheduler (to run tasks):
airflow webserver --port 8080
airflow scheduler

4. Set Up Connections in Airflow (a scripted alternative to the UI steps below is sketched after this list):
- Redshift: Create a connection in the Airflow UI under Admin → Connections for Redshift (with your host, database, username, and password).
- AWS: Set up an AWS connection with your AWS Access Key and Secret Access Key.
- Google Cloud: Set up a connection for Google Cloud using a service account (we’ll go into this below).
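If you’d rather script this than click through the UI, connections can also be registered against the Airflow metadata database from Python. The sketch below assumes Airflow 2.x; every host, credential, path, and connection ID is a placeholder, and the exact extras field names for the Google Cloud connection can vary between provider versions.

import json

from airflow.models import Connection
from airflow.utils.session import create_session

# Placeholder values; replace with your own hosts, credentials, and key path.
connections = [
    Connection(
        conn_id="redshift_migration",
        conn_type="postgres",
        host="my-cluster.xxxx.us-east-1.redshift.amazonaws.com",
        schema="dev",
        login="awsuser",
        password="REPLACE_ME",
        port=5439,
    ),
    Connection(
        conn_id="aws_migration",
        conn_type="aws",
        login="YOUR_AWS_ACCESS_KEY_ID",
        password="YOUR_AWS_SECRET_ACCESS_KEY",
    ),
    Connection(
        conn_id="gcp_migration",
        conn_type="google_cloud_platform",
        # Older Google provider versions expect the prefixed key
        # "extra__google_cloud_platform__key_path" instead.
        extra=json.dumps({"key_path": "/path/to/service-account-key.json"}),
    ),
]

with create_session() as session:
    for conn in connections:
        session.add(conn)

The connection IDs used here (redshift_migration, aws_migration, gcp_migration) are the ones the DAG sketches later in this post refer to; if you create the connections in the UI instead, just reuse whatever IDs you pick there.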
[Also Read Part 1: How to Optimize Amazon Redshift for Faster and Seamless Data Migration]
Step 2: Setting Up Google Cloud Resources
Before we can transfer data from S3 to Google Cloud Storage (GCS) and eventually to BigQuery, we need to configure the necessary resources in Google Cloud.
Create a Google Cloud Project
- Go to the Google Cloud Console.
- Create a new project (e.g., migration-project).
Enable the Necessary APIs
To interact with Google Cloud services, you need to enable the following APIs:
- Google Cloud Storage API
- BigQuery API
Go to APIs & Services → Library, search for these two APIs, and enable them.
Create a Service Account
Next, you’ll need to create a service account with the appropriate permissions:
- Go to the IAM & Admin → Service Accounts section in Google Cloud.
- Click Create Service Account, give it a name, and grant it the following roles:
- Storage Admin (for access to Google Cloud Storage)
- BigQuery Data Editor (for BigQuery access)


After creating the service account, you’ll be prompted to generate a key for it:
- Select Create Key and choose the JSON format.
- Download the key file and save it securely.
This key file will be used to authenticate Airflow with Google Cloud.
Set Up IAM Roles
Double-check that both roles (Storage Admin and BigQuery Data Editor) are attached to the service account, so that Airflow can write objects to GCS and create and load tables in BigQuery.
Step 3: Unloading Data from Redshift to S3 with Airflow
Once the resources in Google Cloud were ready, the first step was to unload data from Amazon Redshift into Amazon S3. I used the PostgresOperator in Airflow to execute a Redshift UNLOAD command.
Here’s the Airflow DAG (dag1.py) for unloading data from Redshift:
https://github.com/ramneek2109/DataMigration/blob/main/dag1.py
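To give a feel for the structure without opening the repo, here’s a minimal sketch of this kind of unload DAG (assuming Airflow 2.x with the Postgres provider installed). It is not the exact contents of dag1.py: the bucket path, IAM role ARN, unload format, and connection ID are placeholders to swap for your own values.

from datetime import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

# Placeholder bucket, IAM role, and file format; adjust to your environment.
UNLOAD_SQL = """
UNLOAD ('SELECT * FROM users')
TO 's3://my-migration-bucket/users/users_'
CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/RedshiftUnloadRole'
FORMAT AS CSV
HEADER
ALLOWOVERWRITE;
"""

with DAG(
    dag_id="redshift_unload_to_s3",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    unload_users = PostgresOperator(
        task_id="unload_users_to_s3",
        postgres_conn_id="redshift_migration",  # the Redshift connection from Step 1
        sql=UNLOAD_SQL,
    )

Unloading to a key prefix (users_) lets Redshift write multiple part files in parallel, which also sets up the parallel, per-file load into BigQuery later in the pipeline.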


How it works:
- The PostgresOperator runs a Redshift UNLOAD query that transfers data from the users table to Amazon S3.
- The CREDENTIALS part references the IAM role for accessing S3.
Step 4: Transferring Data from S3 to Google Cloud Storage (GCS)
After the data was unloaded to S3, the next step was to move it to Google Cloud Storage (GCS). This was done using the S3ToGCSOperator in Airflow.
Here’s the Airflow DAG (dag2.py) for transferring data from S3 to GCS:
https://github.com/ramneek2109/DataMigration/blob/main/dag2.py
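As before, here’s a minimal sketch of the shape of this DAG (assuming Airflow 2.x with recent Amazon and Google provider packages). The bucket names, prefix, and connection IDs are placeholders, not the exact values in dag2.py.

from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.s3 import S3ListOperator
from airflow.providers.amazon.aws.transfers.s3_to_gcs import S3ToGCSOperator

with DAG(
    dag_id="s3_to_gcs_transfer",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # List the objects under the export prefix so the logs show what is being moved.
    list_s3_files = S3ListOperator(
        task_id="list_s3_files",
        bucket="my-migration-bucket",
        prefix="users/",
        aws_conn_id="aws_migration",
    )

    # Copy everything under the prefix into the GCS bucket.
    transfer_to_gcs = S3ToGCSOperator(
        task_id="s3_to_gcs",
        bucket="my-migration-bucket",
        prefix="users/",
        dest_gcs="gs://my-migration-gcs-bucket/users/",
        aws_conn_id="aws_migration",
        gcp_conn_id="gcp_migration",
        replace=True,
    )

    list_s3_files >> transfer_to_gcs

Because the transfer works on a prefix, it picks up however many part files the UNLOAD step produced without naming them one by one.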

How it works:
- S3ToGCSOperator transfers files from Amazon S3 to Google Cloud Storage (GCS).
- The list_s3_files task lists the objects under the S3 prefix first, so the logs show exactly which files are being transferred.
Step 5: Loading Data from GCS to BigQuery
After transferring the data to Google Cloud Storage (GCS), the final step in the pipeline was to load the data into BigQuery. Instead of loading the entire dataset sequentially, we used parallel task execution in Airflow to improve performance and reduce load time. Here’s how we achieved this:
Dynamic Task Creation for Parallelism
Instead of writing a static Airflow task for loading data from GCS to BigQuery, we leveraged Airflow’s ability to dynamically create tasks for each file in the GCS bucket. By doing so:
- Each file was processed as an independent task.
- These tasks ran in parallel, making full use of the available Airflow workers to maximize throughput.
This approach works exceptionally well for large datasets split into multiple files because it allows BigQuery to ingest data efficiently across multiple streams.
Here’s the Airflow DAG (dag3.py) for loading data into BigQuery:
https://github.com/ramneek2109/DataMigration/blob/main/dag3.py
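To show how the dynamic, per-file tasks described above fit together, here’s a minimal sketch of this kind of load DAG (assuming Airflow 2.x with the Google provider). The dataset and table names, the file list, and the abbreviated schema are placeholders; only the partitioning column (created_at) and clustering columns (user_id, last_login) come from the setup described below.

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

# In the real pipeline the file list can be pulled from the GCS bucket (or passed
# along from the previous DAG); it is hard-coded here to keep the sketch short.
FILES = ["users/users_000.csv", "users/users_001.csv", "users/users_002.csv"]

# Abbreviated, illustrative schema.
SCHEMA = [
    {"name": "user_id", "type": "INTEGER"},
    {"name": "created_at", "type": "TIMESTAMP"},
    {"name": "last_login", "type": "TIMESTAMP"},
    # ...remaining columns...
]

with DAG(
    dag_id="gcs_to_bigquery_load",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # One load task per file; with no dependencies between them, they can run in parallel.
    for i, source_file in enumerate(FILES):
        GCSToBigQueryOperator(
            task_id=f"load_file_{i}",
            bucket="my-migration-gcs-bucket",
            source_objects=[source_file],
            destination_project_dataset_table="migration-project.users_dataset.users",
            schema_fields=SCHEMA,
            source_format="CSV",
            skip_leading_rows=1,
            write_disposition="WRITE_APPEND",
            time_partitioning={"type": "DAY", "field": "created_at"},
            cluster_fields=["user_id", "last_login"],
            gcp_conn_id="gcp_migration",
        )

Because no dependencies are declared between the load tasks, the scheduler is free to run them concurrently, up to the parallelism limits configured for the environment.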


How it works:
- GCSToBigQueryOperator loads the files from Google Cloud Storage into BigQuery, using the schema I defined.
- The table is partitioned by created_at and clustered by user_id and last_login for fast and efficient querying.
Wrapping Up
With the combination of Redshift, Airflow, S3, Google Cloud Storage, and BigQuery, we successfully automated our data migration pipeline. Orchestrating the entire process with Airflow meant the data moved reliably from end to end and landed in BigQuery in a partitioned, clustered layout that queries efficiently.
I hope this guide provides valuable insights for your next data migration project. Feel free to reach out with any questions or feedback!