In this article, we’re going to deploy an Airflow application in a Conda environment, secure the application with Nginx, and request an SSL certificate from Let’s Encrypt.

Airflow is a popular tool for defining, scheduling, and monitoring complex workflows. We can create Directed Acyclic Graphs (DAGs) to automate tasks across our work platforms, and because Airflow is open source, it has a community that provides support and improves the project continuously.

This is a sponsored article by Vultr. Vultr is the world’s largest privately-held cloud computing platform. A favorite with developers, Vultr has served over 1.5 million customers across 185 countries with flexible, scalable, global Cloud Compute, Cloud GPU, Bare Metal, and Cloud Storage solutions. Learn more about Vultr.
Deploying a Server on Vultr
Let’s start by deploying a Vultr server with the Anaconda marketplace application.

1. Sign up and log in to the Vultr Customer Portal.
2. Navigate to the Products page.
3. Select Compute from the side menu.
4. Click Deploy Server.
5. Select Cloud Compute as the server type.
6. Choose a Location.
7. Select Anaconda amongst the marketplace applications.
8. Choose a Plan.
9. Select any more features as required in the “Additional Features” section.
10. Click the Deploy Now button.
Creating a Vultr Managed Database
After deploying a Vultr server, we’ll next deploy a Vultr-managed PostgreSQL database. We’ll also create two new databases in our database instance that will be used to connect with our Airflow application later in the article.

1. Open the Vultr Customer Portal.
2. Click the Products menu group and navigate to Databases to create a PostgreSQL managed database.
3. Click Add Managed Databases.
4. Select PostgreSQL with the latest version as the database engine.
5. Select Server Configuration and Server Location.
6. Write a Label for the service.
7. Click Deploy Now.
8. After the database is deployed, select Users & Databases.
9. Click Add New Database.
10. Type in a name, click Add Database, and name it airflow-pgsql.
11. Repeat steps 9 and 10 to add another database in the same managed database and name it airflow-celery.
Getting Started with Conda and Airflow
Now that we’ve created a Vultr-managed PostgreSQL instance, we’ll use the Vultr server to create a Conda environment and install the required dependencies.
Check for the Conda version:
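```bash
$ conda --version
```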
Create a Conda environment:
```bash
$ conda create -n airflow python=3.8
```
Activate the environment:
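```bash
$ conda activate airflow
```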
Install Redis server:
```bash
(airflow) $ sudo apt install -y redis-server
```
Enable the Redis server:
```bash
(airflow) $ sudo systemctl enable redis-server
```
Check the status:
```bash
(airflow) $ sudo systemctl status redis-server
```
Install the Python package manager:
```bash
(airflow) $ conda install pip
```
Install the required dependencies:
```bash
(airflow) $ pip install psycopg2-binary virtualenv redis
```
Install Airflow in the Conda environment:
```bash
(airflow) $ pip install "apache-airflow[celery]==2.8.1" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.8.1/constraints-3.8.txt"
```
Connecting Airflow with Vultr Managed Database
Now that the environment is prepared, let’s connect our Airflow application to the two databases we created earlier in our database instance, and make the necessary changes to the Airflow configuration to get our application production-ready.
Set the environment variable for the database connection:
```bash
(airflow) $ export AIRFLOW__DATABASE__SQL_ALCHEMY_CONN="postgresql://user:password@hostname:port/db_name"
```
Make sure to replace user, password, hostname, and port with the actual values shown in the Connection Details section when the airflow-pgsql database is selected, and replace db_name with airflow-pgsql.
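For illustration only, with hypothetical credentials and host values, the exported variable would look something like this:

```bash
# Hypothetical values. Copy the real ones from your managed database's Connection Details panel.
(airflow) $ export AIRFLOW__DATABASE__SQL_ALCHEMY_CONN="postgresql://vultradmin:examplepassword@vultr-prod-example.vultrdb.com:16751/airflow-pgsql"
```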
Initialize the metadata database. Airflow needs a metadata database so it can create the tables and schema that store information about our DAGs and workflow runs:
```bash
(airflow) $ airflow db init
```
Open the Airflow configuration file:
```bash
(airflow) $ sudo nano ~/airflow/airflow.cfg
```
Scroll down and change the executor:
```ini
executor = CeleryExecutor
```
Link the Vultr-managed PostgreSQL database, and change the value of sql_alchemy_conn:
```ini
sql_alchemy_conn = postgresql://user:password@hostname:port/db_name
```
Make sure to replace user, password, hostname, and port with the actual values shown in the Connection Details section when the airflow-pgsql database is selected, and replace db_name with airflow-pgsql.
Scroll down and change the worker and trigger log ports:
```ini
worker_log_server_port = 8794
trigger_log_server_port = 8795
```
Change the broker_url:
```ini
broker_url = redis://localhost:6379/0
```
Remove the # and change the result_backend:
```ini
result_backend = db+postgresql://user:password@hostname:port/db_name
```
Make sure to replace user, password, hostname, and port with the actual values shown in the Connection Details section when the airflow-celery database is selected, and replace db_name with airflow-celery.
Save and exit the file.
Create an Airflow user:
```bash
(airflow) $ airflow users create \
    --username admin \
    --firstname Peter \
    --lastname Parker \
    --role Admin \
    --email spiderman@superhero.org
```
Make sure to replace all the variable values with your actual values. When prompted, enter a password for the user; it will be used to access the dashboard.
Daemonizing the Airflow Application
Now let’s daemonize our Airflow application so that it runs in the background and continues to run independently even when we close the terminal and log out. These steps will also help us create persistent services for the Airflow webserver, scheduler, and Celery workers.
View the airflow path:
```bash
(airflow) $ which airflow
```
Copy the path to the clipboard; we’ll paste it into the service files in the next steps.
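As a rough guide, in a Conda environment the path usually points inside the environment’s own bin directory. The exact location depends on your user and where Anaconda is installed; the output below is only a hypothetical example:

```bash
# Hypothetical output; your path will differ.
/root/anaconda3/envs/airflow/bin/airflow
```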
Create an Airflow webserver service file:
```bash
(airflow) $ sudo nano /etc/systemd/system/airflow-webserver.service
```
Paste the following service configuration into the file. airflow webserver provides the web-based user interface that lets us interact with and manage our workflows. This configuration creates a background service for our Airflow webserver:
```ini
[Unit]
Description="Airflow Webserver"
After=network.target

[Service]
User=example_user
Group=example_user
ExecStart=/home/example_user/.local/bin/airflow webserver

[Install]
WantedBy=multi-user.target
```
Make sure to replace User and Group with your actual non-root sudo user account details, and replace the ExecStart path with the actual Airflow path (including the executable) that we copied earlier. Save and close the file.
Enable the airflow-webserver service, so that the webserver automatically starts up during the system boot process:
```bash
(airflow) $ sudo systemctl enable airflow-webserver
```
Start the service:
```bash
(airflow) $ sudo systemctl start airflow-webserver
```
Make sure that the service is up and running:
```bash
(airflow) $ sudo systemctl status airflow-webserver
```
The output should report the service as active (running).
Create an Airflow Celery service file:
```bash
(airflow) $ sudo nano /etc/systemd/system/airflow-celery.service
```
Paste the following service configuration into the file. airflow celery worker starts a Celery worker. Celery is a distributed task queue that allows us to distribute and execute tasks across multiple workers. The workers connect to our Redis server to receive and execute tasks:
```ini
[Unit]
Description="Airflow Celery"
After=network.target

[Service]
User=example_user
Group=example_user
ExecStart=/home/example_user/.local/bin/airflow celery worker

[Install]
WantedBy=multi-user.target
```
Make sure to replace User and Group with your actual non-root sudo user account details, and replace the ExecStart path with the actual Airflow path (including the executable) that we copied earlier. Save and close the file.
Enable the airflow-celery service:
```bash
(airflow) $ sudo systemctl enable airflow-celery
```
Start the service:
```bash
(airflow) $ sudo systemctl start airflow-celery
```
Make sure that the service is up and running:
```bash
(airflow) $ sudo systemctl status airflow-celery
```
Create an Airflow scheduler service file:
```bash
(airflow) $ sudo nano /etc/systemd/system/airflow-scheduler.service
```
Paste the following service configuration into the file. airflow scheduler is responsible for scheduling and triggering the DAGs and the tasks defined in them. It also periodically checks the status of DAGs and tasks:
```ini
[Unit]
Description="Airflow Scheduler"
After=network.target

[Service]
User=example_user
Group=example_user
ExecStart=/home/example_user/.local/bin/airflow scheduler

[Install]
WantedBy=multi-user.target
```
Make sure to replace User and Group with your actual non-root sudo user account details, and replace the ExecStart path with the actual Airflow path (including the executable) that we copied earlier. Save and close the file.
Enable the airflow-scheduler service:
```bash
(airflow) $ sudo systemctl enable airflow-scheduler
```
Start the service:
```bash
(airflow) $ sudo systemctl start airflow-scheduler
```
Make sure that the service is up and running:
```bash
(airflow) $ sudo systemctl status airflow-scheduler
```
The output should report the service as active (running).
Setting up Nginx as a Reverse Proxy
We’ve created persistent services for the Airflow application, so now we’ll set up Nginx as a reverse proxy to enhance our application’s security and scalability, following the steps outlined below.

1. Log in to the Vultr Customer Portal.
2. Navigate to the Products page.
3. From the side menu, expand the Network drop-down and select DNS.
4. Click the Add Domain button in the center.
5. Follow the setup procedure to add your domain name by selecting the IP address of your server.
6. Set the following hostnames as your domain’s primary and secondary nameservers with your domain registrar:
   - ns1.vultr.com
   - ns2.vultr.com
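Once the nameservers are set at your registrar, you can check that the delegation has propagated with a quick DNS query. The domain below is a hypothetical placeholder:

```bash
$ dig +short NS example.com
```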
Install Nginx:
```bash
(airflow) $ sudo apt install nginx
```
Check that the Nginx server is up and running:
```bash
(airflow) $ sudo systemctl status nginx
```
Create a new Nginx virtual host configuration file in the sites-available directory:
```bash
(airflow) $ sudo nano /etc/nginx/sites-available/airflow.conf
```
Add the configuration to the file. This configuration will direct traffic for our application from the actual domain to the backend…
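The full configuration isn’t reproduced here, but a minimal reverse-proxy server block would look something like the sketch below. It assumes the Airflow webserver is listening on its default port 8080 and uses airflow.example.com as a placeholder domain:

```nginx
server {
    listen 80;
    listen [::]:80;

    # Replace with your actual domain name.
    server_name airflow.example.com;

    location / {
        # Forward incoming requests to the Airflow webserver (default port 8080).
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
```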