Install Python dependencies in a docker-compose cluster without rebuilding images

Iuliia Volkova
Dec 17, 2019

UPD from 20 July 2020: this article is pretty old; when I wrote it, there was no official Docker image, so I used puckel/docker-airflow. But now Apache Airflow has created and supports a production-ready official Docker image. You can read more about it here: https://github.com/apache/airflow/blob/master/IMAGES.rst#airflow-docker-images.

I prepared a docker-compose file based on the official Docker image (with Apache Airflow version 1.10.11); you can find it here: https://github.com/xnuinside/airflow_in_docker_compose/blob/master/docker-compose-with-celery-executor.yml, and the .env file for it here: https://github.com/xnuinside/airflow_in_docker_compose/blob/master/.env . I hope I will have time soon to write an article with an explanation.

In this article I will show how easily you can install Python dependencies without rebuilding images each time. There will be no new magic here for you if you are an experienced Python dev, but maybe this will be helpful for DevOps engineers and engineers who work with Python as a second language.

Okay. When is this needed? When you work with a Python server like, for example, Apache Airflow, and you don’t want to stop it and re-build containers every time just to install new Python packages. This one cluster can be used by a lot of pipelines at the same time, and you don’t want to stop them just because you need to deploy one more pipeline with new dependencies.

What ways exist to solve the issue? You can enter each container and run pip install in each of them, but in the case of Apache Airflow you have, for example, 1 container for the scheduler, 1 for the webserver, and from 1 to ‘a lot of’ containers for workers.

I will show just another approach.

What we will use (I will describe each step):

  1. a Python .pth file
  2. a Docker volume
  3. and pip with the flag:
pip install --target=

Sources are, as usual, at the end of the page.

Test DAG

Let’s start by defining our test DAG for Apache Airflow. It will be pretty simple; we just need a DAG that gets an import error. You can define your own.

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

from slack import *

with DAG(dag_id='dependency_dag',
         start_date=datetime(2019, 12, 17),
         schedule_interval=None) as dag:
    DummyOperator(task_id='dummy')

So, my test DAG’s dependency will be slackclient.

Check the Airflow UI (as a base I will use the docker-compose file from this article, and the prerequisite is a cluster from there that is already ‘up’: https://medium.com/@xnuinside/quick-guide-how-to-run-apache-airflow-cluster-in-docker-compose-615eb8abd67a):

import error in Airflow UI

Python .pth file

By default, all third-party libraries live in the ‘site-packages’ folder of your interpreter. And in this folder you can put a ‘.pth’ file (https://docs.python.org/3/library/site.html) that is used for adding paths to PYTHONPATH.

So we need to:

  1. understand where the site-packages directory used by our Python interpreter (the one that runs our Apache Airflow (or your server) in the Docker container) lives
  2. decide in which folder we will install our packages and define it as a volume in the docker-compose .yml
  3. put a .pth file with the path to our packages folder inside the container: add 2 lines to the Dockerfile, the 1st to create the dir for packages inside the container, the 2nd to copy the .pth file

Let’s create a file with any name and the ‘.pth’ extension. I will use ‘packages.pth’ in my example. And as the folder for packages I will use the directory ‘/usr/local/airflow/packages’ in the Docker container.

So, my packages.pth file will contain only one line:

/usr/local/airflow/packages

That’s it.

Important: this path must exist at the moment the interpreter starts. So, if you do not create this folder before the server runs, your path will not be added to PYTHONPATH.
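To see this behavior in code, here is a small sketch you can run inside the container (the site-packages path used here is the one we find in the next section): site.addsitedir re-scans a directory for ‘.pth’ files the same way the interpreter does at startup, and lines pointing to directories that do not exist are silently skipped.

import site
import sys

# re-scan site-packages for .pth files, which is what the interpreter does at startup;
# a line in packages.pth is only appended to sys.path if that directory really exists
site.addsitedir('/usr/local/lib/python3.7/site-packages')

print('/usr/local/airflow/packages' in sys.path)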

site-packages folder

To find out where the site-packages folder is located, you can do one of 2 things:

  1. define simple Python code in a DAG, like:
import sys
print(sys.path)

and take a look at the logs: the site-packages directory will always be among those paths (a complete throwaway DAG for this check is sketched at the end of this section)

site-packages in Apache Airflow container

2. enter the container with

docker exec -it <container_id> /bin/bash

and use the python3 command to run the same code as above in the REPL.

Be aware that sometimes there can be several interpreters in a container (as in the Apache Airflow puckel Docker image), so it makes sense to check with the one that actually runs your server, for example by executing the code inside a DAG (the 1st variant) or any other way that you prefer.

In our case the site-packages path will be:

/usr/local/lib/python3.7/site-packages
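For the 1st variant, a minimal throwaway DAG could look like the sketch below (the dag_id and task_id are arbitrary names for illustration, and site.getsitepackages() may be unavailable in some virtualenv setups):

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def print_paths():
    # both lines end up in the task log, visible in the Airflow UI
    import site
    import sys
    print(sys.path)
    print(site.getsitepackages())


with DAG(dag_id='debug_sys_path',
         start_date=datetime(2019, 12, 17),
         schedule_interval=None) as dag:
    PythonOperator(task_id='print_paths', python_callable=print_paths)

Trigger it manually from the UI and read the task log.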

Docker volume

A Docker volume is a way to share a local directory/folder with your Docker containers. It means that they can see and modify files (of course, you can set up permissions to limit this) in some folder on your VM or local machine.

We decided that in the Docker container we will use the /usr/local/airflow/packages folder, and I will map the local folder ./airflow_files/packages to it.

That means that everything that is installed or put locally inside ./airflow_files/packages will be available to the Airflow containers.

I will use the docker-compose file from this article: https://medium.com/@xnuinside/quick-guide-how-to-run-apache-airflow-cluster-in-docker-compose-615eb8abd67a

We need to do only one thing: add a new line to the volumes of the webserver and scheduler services:

- ./airflow_files/packages:/usr/local/airflow/packages

So our volumes will now look like this:

volumes:
  - ./airflow_files/dags:/usr/local/airflow/dags
  - ./airflow_files/logs:/usr/local/airflow/logs
  - ./airflow_files/packages:/usr/local/airflow/packages

Dockerfile

Now we also need to add our ‘.pth’ file to the images and create the folder for packages.

So, we will add 2 lines:

RUN mkdir /usr/local/airflow/packages
COPY ./packages.pth /usr/local/lib/python3.7/site-packages

And our Dockerfile now looks like:

FROM puckel/docker-airflow:1.10.6
RUN pip install --user psycopg2-binary
ENV AIRFLOW_HOME=/usr/local/airflow
RUN mkdir /usr/local/airflow/packages
COPY ./packages.pth /usr/local/lib/python3.7/site-packages
COPY ./airflow.cfg /usr/local/airflow/airflow.cfg

.dockerignore

If you have a lot of files in your build directory, as I do, you should use a .dockerignore file to avoid having Docker index a lot of unnecessary files.

At the start of the build, Docker runs recursively through the whole folder and creates a tree of the files it might possibly need for the build. So if some files, folders, etc. are not needed for the Docker build, add them to the ‘.dockerignore’ file. Without it the build process becomes really slow; indexing can take a very long time.

For my docker-compose project the .dockerignore file looks like this:

airflow_files/
data/
docs/
LICENSE
README.md

Now you can re-build the images and run the docker-compose cluster.

After that, go and check that the path from our ‘.pth’ file now shows up in PYTHONPATH.

airflow/packages added to site-packages

Great, we can see it right after the ‘site-packages’ path.

pip install with ‘target’ flag

What does this flag do? It simply uses the folder that you provide instead of the default ‘site-packages’. So it just installs the distribution packages, with their dist-info, into the folder that you want.

Let’s install our slackclient dependency.

pip install --target=./airflow_files/packages slackclient
pip install with target flag

Take a look at your ./airflow_files/packages; you will see the packages that were installed (the target package and its dependencies).

packages installed in target folder
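Before checking the UI, you can also verify from inside any running Airflow container (for example via docker exec and python3, as earlier) that the new package is importable without a restart. A quick sketch; slack is the module provided by the slackclient distribution:

import slack

# the module should resolve to the mounted packages folder, not to site-packages
print(slack.__file__)  # expected: a path under /usr/local/airflow/packages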

Now check the Airflow UI.

no import error UI

Airflow successfully sees all the dependencies. Great! That is exactly what we need.

Sources (pay attention to the comments in the Dockerfile, docker-compose-volume-packages.yml, .dockerignore, and packages.pth): https://github.com/xnuinside/airflow_in_docker_compose
