Deploy Apache Airflow on AWS Elastic Kubernetes Service (EKS)
Deploy Apache Airflow on AWS EKS for scalable data pipelines. Step-by-step guide to setup, deploy, and optimize for performance and security.
Saksham Awasthi
August 22, 2024
12 min read
Running data pipelines smoothly is not trivial. Apache Airflow is an excellent option thanks to its many features and integrations, but it requires a lot of heavy lifting to make its infrastructure scalable. That’s where deploying Apache Airflow on Kubernetes comes in: it enables you to orchestrate multiple DAGs in parallel on different types of machines, leverage Kubernetes to autoscale nodes, monitor the pipelines, and distribute the processing. This guide will help you prepare your EKS environment, deploy Airflow, and integrate it with essential add-ons. You will also find a few suggestions for making your Airflow deployment production grade.
Prerequisites
Before you deploy Apache Airflow, ensure you have all the prerequisites in place: eksctl, kubectl, Helm, and an EKS cluster.
We’ll be using eksctl to create the EKS Cluster, but feel free to skip it if you already have one.
Set up the AWS & eksctl CLIs
1. Install the AWS CLI: (skip to step 2 if you already have the CLI installed)
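The commands for this step and the eksctl/cluster-creation steps that follow are not reproduced in this article; a typical Linux setup looks like the sketch below (the cluster name, region, and node settings are placeholders, so adjust them to your environment).

# Install the AWS CLI v2 (commands from the AWS documentation) and configure credentials
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
aws configure

# Install eksctl and create an example EKS cluster
curl -sL "https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_Linux_amd64.tar.gz" | tar xz -C /tmp
sudo mv /tmp/eksctl /usr/local/bin
eksctl create cluster --name airflow-eks --region us-east-1 --nodes 3 --node-type t3.large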
What is Apache Airflow?
Apache Airflow is an open-source platform for scheduling and orchestrating data pipelines and workflows. In simple terms, Apache Airflow is an ETL/ELT tool: you can create, schedule, and monitor complex workflows with it. You can connect multiple data sources to Airflow and send pipeline success or failure alerts to Slack or email. In Airflow, you define workflows in Python, represented as Directed Acyclic Graphs (DAGs). Airflow can be deployed anywhere, and once deployed, you can access the Airflow UI and set up workflows.
Use cases of Airflow:
Data ETL Automation: Streamline the extraction, transformation, and loading of data from various sources into storage systems.
Data Processing: Schedule and oversee tasks like data cleansing, aggregation, and enrichment.
Data Migration: Manage data transfer between different systems or cloud platforms.
Model Training: Automate the training of machine learning models on large datasets.
Reporting: Generate and distribute reports and analytics dashboards automatically.
Workflow Automation: Coordinate complex processes with multiple dependencies.
IoT Data: Analyze and process data from IoT devices.
Workflow Monitoring: Track workflow progress and receive alerts for issues.
Benefits of using Airflow in Kubernetes
Deploying Apache Airflow on a Kubernetes cluster offers several advantages over deploying it on a virtual machine:
Scalability: Kubernetes allows you to scale your Airflow deployment horizontally by adding more pods to handle increased workloads automatically.
Isolation: Enables running different tasks of the same pipeline on various cluster nodes by deploying each task as an isolated pod.
Automation: Kubernetes native capabilities, like auto-scaling, self-healing, and rolling updates, reduce manual intervention, improving operational efficiency.
Portability: Deploying on Kubernetes makes your Airflow setup more portable across different environments, whether on-premise or cloud.
Integration: Kubernetes integrates seamlessly with various tools for monitoring, logging, and security, enhancing the overall management of your Airflow deployment.
Airflow Architecture Diagram
The core Airflow components are the Executor, the Scheduler, the Web Server, and the Airflow metadata database. Airflow workers and the Triggerer are also involved.
As you can see in the above diagram, the Data Engineer writes Airflow DAGs. Airflow DAGs are collections of tasks that specify the dependencies between them and the order in which they are executed. A DAG is a file that contains your Python code.
The Scheduler picks up these DAGs and triggers the tasks they define according to their schedule.
In the above diagram, the Scheduler runs tasks using Kubernetes Executor and creates a separate pod for every task, which provides isolation.
Airflow also stores pipeline metadata in an external database. The main configuration file used by the Web server, Scheduler, and workers is airflow.cfg.
The Data Engineer can view the entire flow through the Airflow UI. Users can also check the logs, monitor the pipelines, and set alerts.
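For reference, here is a minimal sketch of the settings described above as they would appear in airflow.cfg (the values are examples; in a Helm-based deployment these are usually set through the chart's values rather than edited by hand):

[core]
# Run each task in its own Kubernetes pod
executor = KubernetesExecutor

[database]
# Connection string for the external metadata database (placeholder values)
sql_alchemy_conn = postgresql+psycopg2://airflow:<password>@<db-host>:5432/airflow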
Airflow Deployment Options
When deploying Apache Airflow, there are multiple approaches to consider, each with unique advantages and challenges. Let us look at the main options:
Amazon Managed Workflows for Apache Airflow (MWAA):
Configure the service through the AWS Management Console, where you can define your environment, set up the necessary permissions, and integrate with other AWS services.
Google Cloud Composer:
Create an environment using the Google Cloud Console and integrate with Google Cloud services like BigQuery and Google Cloud Storage.
Azure Data Factory with Airflow Integration:
Configure Airflow through the Azure Portal and integrate it with other Azure services for efficient workflow automation.
Self-hosted on AWS EC2:
Launch and configure EC2 instances, then install Airflow, set up the environment, configure the database, and run the scheduler.
Running on Kubernetes (e.g., AWS EKS):
Create a Kubernetes cluster, deploy Airflow using Helm charts or custom YAML manifests, and let Kubernetes handle container orchestration and scaling.
These are the main ways to deploy Airflow. In this guide, we focus on Amazon EKS, covered in the next section.
Deploy Airflow on AWS EKS
Let us install Apache Airflow in the EKS cluster using the official Helm chart.
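The individual installation steps are not reproduced above; with the official Apache Airflow Helm chart, a typical sequence looks like the following sketch (the release name and namespace airflow are examples):

# 1. Add the official chart repository
helm repo add apache-airflow https://airflow.apache.org
# 2. Refresh the local chart index
helm repo update
# 3. Create a namespace for Airflow
kubectl create namespace airflow
# 4. Install the chart into that namespace
helm install airflow apache-airflow/airflow --namespace airflow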
The output includes the default Airflow webserver login and Postgres connection credentials. Copy them and save them somewhere safe.
5. Examine the deployments by getting the Pods
kubectl get pods -n airflow
The Airflow instance is now set up in EKS, and all the Airflow pods should be in the Running state.
Let’s prepare Airflow to run our first DAG
At this point, Airflow is deployed with the default configuration. Let's see how to fetch the chart's default values to our local machine, modify them, and upgrade the release.
1. Save the configuration values from the helm chart by running the below command.
helm show values apache-airflow/airflow > values.yaml
This command generates a file named values.yaml in your current directory, which you can modify and save as needed.
2. Check the release version of the helm chart by running the following command.
helm ls -n airflow
3. Let us add the ingress configuration to access the Airflow instance over the internet. We first need to deploy an ingress controller in the EKS cluster; the commands below install the NGINX ingress controller from its Helm repository.
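A sketch of the install, assuming the Bitnami chart and the release and namespace names used below (adjust them if your setup differs):

helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
helm install nginx-ingress-controller bitnami/nginx-ingress-controller \
  --namespace airflow-ingress --create-namespace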
kubectl get service nginx-ingress-controller --namespace airflow-ingress
Look for the external IP in the output of the get service command.
After installing the ingress controller, add the required settings to the values.yaml file and save it; the chart has a section dedicated to ingress configuration.
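A minimal sketch of that section, assuming chart version 1.6 or later and a placeholder hostname (the exact keys can differ between chart versions):

ingress:
  web:
    enabled: true
    ingressClassName: nginx
    hosts:
      - name: airflow.example.com

Apply the change by upgrading the release with the modified values:

helm upgrade airflow apache-airflow/airflow --namespace airflow -f values.yaml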
Below is a “sample_dag.py” that demonstrates a simple workflow.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator  # Airflow 2.x import path

# Default arguments applied to every task in the DAG
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2024, 8, 8),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

# A DAG that runs once a day
dag = DAG('hello_world', default_args=default_args, schedule_interval=timedelta(days=1))

# A single task that prints a greeting
t1 = BashOperator(
    task_id='say_hello',
    bash_command='echo "Hello World from Airflow!"',
    dag=dag,
)
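This article does not show the step used to ship sample_dag.py into the cluster; one common approach with the official chart is Git-sync, sketched below with placeholder repository details (add this to values.yaml and upgrade the release as shown earlier):

dags:
  gitSync:
    enabled: true
    repo: https://github.com/<your-org>/<your-dags-repo>.git
    branch: main
    subPath: dags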
Upon completion, you can see the DAG in the Airflow UI. Airflow automatically detects new DAGs, but you can also refresh the DAGs list manually by clicking the "Refresh" button on the DAGs page.
The UI has many views to explore, such as the Code, Graph, and Audit Log tabs.
You can also check the EKS cluster’s activity and DAG dashboard from the Activity tab.
Run the Airflow job
DAGs can run on a schedule or be triggered manually from the Airflow UI. There is a trigger (play) button on the rightmost side of the DAGs table.
A DAG can also be triggered from its detail view.
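If you prefer the command line, a DAG can also be triggered with the Airflow CLI from inside the cluster; the sketch below assumes the release name airflow and namespace airflow used earlier:

kubectl exec -n airflow deploy/airflow-webserver -- airflow dags trigger hello_world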
Make your Airflow on Kubernetes Production-Grade
Apache Airflow is a powerful tool for orchestrating workflows, but making it production-ready requires careful attention to several key areas. Below, we explore strategies to enhance security, performance, and monitoring, and to ensure high availability in your Airflow deployment.
1. Improved Security
a. Role-Based Access Control (RBAC)
Implementation: Enable RBAC in Airflow to ensure only authorized users can access specific features and data.
Benefits: Limits access to critical areas and reduces the risk of unauthorized changes or data breaches.
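Airflow 2 ships with the RBAC-enabled UI by default; the sketch below creates an additional read-only user with the built-in Viewer role (the deployment and namespace names assume the release used earlier, and the user details are placeholders):

kubectl exec -n airflow deploy/airflow-webserver -- \
  airflow users create --username viewer --firstname Jane --lastname Doe \
  --role Viewer --email viewer@example.com --password '<choose-a-password>'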
2. Performance Optimization
a. Resource Optimization
Implementation: Right-size your Kubernetes pods and nodes based on workload demand. Use the Kubernetes Horizontal Pod Autoscaler (HPA) to scale Airflow components dynamically and the Cluster Autoscaler to scale nodes.
Benefits: Ensures efficient use of resources, reduces costs, and prevents bottlenecks during peak loads.
Refer to this generic guide on Implementing HPA and Cluster Autoscaler in EKS. HPA autoscales the individual Airflow components, while the Cluster Autoscaler ensures there are enough nodes to satisfy the resulting resource requests.
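A generic sketch of an HPA manifest targeting the Airflow webserver Deployment (the names and thresholds are examples; with the KubernetesExecutor, task pods are created per task, so HPA mainly helps components such as the webserver or Celery workers):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: airflow-webserver
  namespace: airflow
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: airflow-webserver
  minReplicas: 1
  maxReplicas: 3
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70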
b. Task Parallelism
Implementation: Configure Airflow to handle parallel task execution by optimizing the number of worker pods and setting appropriate concurrency limits.
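A sketch of concurrency-related settings via the chart's config section, which is rendered into airflow.cfg (the numbers are illustrative, not recommendations):

config:
  core:
    parallelism: 32              # maximum tasks running at once across the whole deployment
    max_active_tasks_per_dag: 16 # maximum concurrently running tasks per DAG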
3. Secure Access over HTTPS
Implementation: Serve the Airflow URL over HTTPS by configuring TLS/SSL certificates on the ingress controller in Kubernetes.
Benefits: HTTPS encrypts data to enhance the security of information being transferred. This is especially crucial when handling sensitive data, as encryption helps protect it from unauthorized access during transmission.
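Building on the earlier ingress example, here is a sketch of TLS termination in values.yaml (the hostname and secret name are placeholders, and the TLS secret must already exist in the cluster):

ingress:
  web:
    enabled: true
    ingressClassName: nginx
    hosts:
      - name: airflow.example.com
        tls:
          enabled: true
          secretName: airflow-tls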
4. Monitoring and Alerting
Implementation: Integrate Airflow with Prometheus to collect metrics on task performance, resource usage, and system health. Tools like Grafana or Prometheus Alertmanager can then be used to set up alerts based on critical metrics and log events.
Benefits: Provides visibility into Airflow’s performance, lets you identify and address issues proactively, and enables a quick response to problems, reducing downtime and maintaining workflow reliability.
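A sketch of exposing Airflow metrics for Prometheus through the chart's bundled StatsD exporter (how Prometheus scrapes the exporter, for example via a ServiceMonitor, depends on your monitoring stack):

statsd:
  enabled: true
# The exporter service typically serves Prometheus-format metrics on port 9102.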
5. Backup and Disaster Recovery
Implementation: Regularly back up Airflow’s metadata database and configuration files, and implement a disaster recovery plan that includes rapid failover procedures.
Benefits: Protects against data loss and enables quick recovery in case of catastrophic failures.
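A minimal sketch of a metadata-database dump (the connection details are placeholders; on Amazon RDS, automated snapshots are usually the simpler option):

pg_dump "postgresql://airflow:<password>@<db-host>:5432/airflow" > airflow-metadata-$(date +%F).sql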
Setting up Apache Airflow on Amazon EKS is a powerful way to manage your workflows at scale, but it requires careful planning and configuration to be production-ready. By following this guide, you have deployed Airflow on EKS, created a simple DAG, connected Airflow to a private Git repository, and learned about different ways to improve security, performance, high availability, monitoring, and logging. With these optimizations, your Airflow deployment is more efficient, cost-effective, and ready to handle the demands of real-world data orchestration.
Frequently Asked Questions
1. What is Apache Airflow?
Apache Airflow is an open-source tool that helps in orchestrating and managing workflows through Directed Acyclic Graphs (DAGs). It automates complex processes like ETL (Extract, Transform, Load) jobs, machine learning pipelines, and more.
2. Why deploy Airflow on Amazon EKS?
Deploying Airflow on Amazon EKS offers scalability, flexibility, and robust workflow management. EKS simplifies Kubernetes management, allowing you to focus on scaling and securing your Airflow environment.
3. What are the prerequisites for deploying Airflow on EKS?
You need an AWS account, an EKS cluster, kubectl configured on your local environment, a dynamic storage class using EBS volumes, and Helm for package management.
4. How do I monitor Airflow on EKS?
You can integrate Prometheus and Grafana for monitoring. Using Loki for log aggregation can also help in centralized log management and troubleshooting.
5. What Kubernetes add-ons are recommended for a production-grade Airflow setup?
Essential add-ons include External Secret Operator for secure secrets management, Prometheus and Grafana for monitoring, and possibly Loki for logging.
6. Can Airflow be integrated with external databases like RDS?
Yes, it’s common to configure Airflow to use an external PostgreSQL database hosted on Amazon RDS for production environments, providing reliability and scalability for your metadata storage.
7. How can I access the Airflow UI on EKS?
You can access the Airflow UI by setting up a LoadBalancer service or using an Ingress Controller with a DNS pointing to your load balancer for easy access.
8. How do I manage DAGs in a production environment?
For production, it’s advisable to store your DAGs in a private Git repository and integrate Airflow with this repo using GitSync to pull the latest DAG configurations automatically.