How to use several dag_folders? Airflow DAGBags.

Iuliia Volkova
5 min read · Nov 14, 2018

This article is about one specific case in Apache Airflow, so if you don't know what Airflow is, please visit https://airflow.apache.org and read other articles about this cool tool first.

In my work, I often run into the question (or complaint) that it is very hard to live with a single DAG folder in a big project. You have many DAGs, they serve different purposes, different people work on them, and so on. And it is very hard to manage without some way to separate all those DAGs into different folders, with different access rights, deployment pipelines, etc.

Now we have Apache Airflow 1.10 and we still can set only one dags_folder in the config. But… there has always been a way to hack around this and split DAGs across as many folders as you want. And that way still works today.

The key to this is the DagBag.

Usually, at a high level, when we meet the term 'DagBag' we take it to mean the folder where you need to put your DAG so that Airflow can find it, i.e. the dags_folder from airflow.cfg. But in reality that is not quite correct.

A DagBag is a collection of dags, parsed out of a folder tree and has high-level configuration settings. This makes it easier to run distinct environments for say production and development, tests, or for different teams or security profiles. What would have been system level settings are now dagbag level so that one system can run multiple, independent settings sets.

This is a quote from the DagBag class docstring. I think it describes the purpose of this class pretty well. In airflow.cfg you can define only one path to the DAGs folder, in the 'dags_folder =' param. So, how can you use the DagBag to add other directories to load DAGs from?

You need to put into the main DAG folder a file that will add new DAG bags to your Airflow.

Let's see an example. We will start with an empty Airflow server that loads the standard example DAGs. You can set it up using the steps from the official quick start: https://airflow.readthedocs.io/en/stable/start.html .
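For reference, a rough sketch of those quick start steps, adapted to the $AIRFLOW_HOME used later in this article (assuming Airflow 1.10 and pip):

export AIRFLOW_HOME=~/airflow_tutorial
pip install apache-airflow

airflow initdb            # initialize the metadata database
airflow webserver -p 8080 # in one terminal
airflow scheduler         # in another terminal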

After the webserver and scheduler start, go to the UI and check what we see:

Great! Lovely.

During the install I set $AIRFLOW_HOME to '~/airflow_tutorial' ('~' is my user home directory on macOS), so my config and my dags_folder look like this:
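As a sketch, the relevant part of airflow.cfg looks roughly like this (the exact path is whatever your dags_folder points to; the one shown here is illustrative):

# ~/airflow_tutorial/airflow.cfg
[core]
dags_folder = /Users/your_user/airflow_tutorial/dags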

Now let's check that everything works fine. Prepare a simple DAG with a say_hello task to verify that everything works correctly.

Our DAG file will be very simple:

from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag_id = "A_first_dag_from_config_dag_folder"

with DAG(dag_id=dag_id, start_date=datetime(2018, 11, 14),
         schedule_interval=None) as dag:

    def say_hello():
        print("Hello, guys! You are the best!")

    PythonOperator(task_id="say_hello", python_callable=say_hello)

Great, go to the UI to see that our DAG exists and that we can trigger it:

Trigger it and check that everything works correctly:

Great.

So now let's return to DagBags. As mentioned before, to add DAGs from different folders you need to use the DagBag, but how?

You need to prompt Airflow to pick up the new folders as DAG bags. For that, we need to put a special tiny Python script into your standard dag_folder.

Call this file something like 'add_dag_bags.py' and put very simple code inside. To show how it works, we will create two separate folders: '~/new_dag_bag1' and '~/work/new_dag_bag2'. It does not matter how long the paths are or where the folders are placed; Airflow just needs the rights to access them.
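One way to create them, with the same paths as in this walkthrough:

mkdir -p ~/new_dag_bag1 ~/work/new_dag_bag2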

The code in add_dag_bags.py will be:

""" add additional DAGs folders """
import os
from airflow.models import DagBag
dags_dirs = ['~/new_dag_bag1', '~/work/new_dag_bag2']

for dir in dags_dirs:
dag_bag = DagBag(os.path.expanduser(dir))

if dag_bag:
for dag_id, dag in dag_bag.dags.items():
globals()[dag_id] = dag

And now we have this structure with the new folders:
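Roughly, the layout looks like this (the file names inside the extra folders are illustrative):

~/airflow_tutorial/dags/
    add_dag_bags.py
    first_dag.py          # the test DAG from above
~/new_dag_bag1/
    ...                   # DAG files for the first extra bag
~/work/new_dag_bag2/
    ...                   # DAG files for the second extra bag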

As soon as you put add_dag_bags.py into your main dag_folder from the config, you will see output like this in the Airflow webserver log:

Remember that you need to have the words 'DAG' and 'airflow' somewhere in the code to trigger Airflow into parsing the file. Check the 'notes' under the DAG description in the official doc:
https://airflow.incubator.apache.org/concepts.html#dags

Pay attention: if you load DAGs this way from additional DAG bags and those DAGs are broken, you will not see the traceback in the UI. You can find it only in the Airflow webserver log. Also, you cannot catch such DAGs with 'airflow list_dags', because all such commands and features work only with the default DAG folder from the config.
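If you don't want to dig through the webserver log, a small sketch like this (run from a Python shell on the Airflow host; the paths are the ones from this walkthrough) prints any parse errors, since DagBag collects them in its import_errors dict:

# check the extra DAG bags for parse errors by hand
import os
from airflow.models import DagBag

for dags_dir in ['~/new_dag_bag1', '~/work/new_dag_bag2']:
    bag = DagBag(os.path.expanduser(dags_dir))
    # import_errors maps a broken file path to its traceback text
    print(dags_dir, bag.import_errors)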

Let's create two DAGs to check the behavior in the UI. One will be broken (with a wrong import), and the second will be correct.

We prepare two DAGs (just copy the test DAG that we used before and modify the dag_id):

The broken DAG from bag1 (we added a line importing a module that does not exist):
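A minimal sketch of what that file can look like (the dag_id and the fake module name are illustrative, not the exact ones from my screenshots):

from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
import module_that_does_not_exist  # this import fails, so the whole file fails to parse

dag_id = "B_broken_dag_from_bag1"

with DAG(dag_id=dag_id, start_date=datetime(2018, 11, 14),
         schedule_interval=None) as dag:

    def say_hello():
        print("Hello, guys! You are the best!")

    PythonOperator(task_id="say_hello", python_callable=say_hello)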

And the correct DAG from bag2:
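Again a sketch with an illustrative dag_id; it is the same test DAG, just renamed:

from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag_id = "C_correct_dag_from_bag2"

with DAG(dag_id=dag_id, start_date=datetime(2018, 11, 14),
         schedule_interval=None) as dag:

    def say_hello():
        print("Hello, guys! You are the best!")

    PythonOperator(task_id="say_hello", python_callable=say_hello)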

Let’s check UI.

Oh, we see that one of the DAGs is loaded correctly, but the second one is not. And we don't see any traceback in the UI. But let's check the webserver log:

Great, we see the error.

Now let's trigger the correct DAG to check that everything works well:

And we got a successful result. So, with this little script you can add any source folders with your DAGs, as many as you need.

P.S.: Right now the Airflow community is working on a feature called DagFetcher, which will cover the problem described in this article (and not only this problem!). You can learn more about it from the proposal: https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-5+DagFetcher.

The article is inspired by Tamara Mendt's talk at PyCon.DE 2017, "Modern ETL-ing with Python and Airflow (and Spark)": https://www.youtube.com/watch?v=tcJhSaowzUI.
