Docker for Data Scientists

Why Docker? (Or Why Not Anaconda?)

Before starting with Docker, you should know that there are other options for building a machine learning (ML) development environment, such as Anaconda or pyenv. Anaconda is a better choice if you’re totally new to ML because it’s easier to start with. However, a few problems bothered me enough to make me switch to Docker:

Anaconda didn’t have any mirror site in China, so installing new packages could be painful because of the download speed.
Although Anaconda can in theory share an environment with others, in practice it’s quite difficult to do.
Anaconda can take up a lot of disk space if you use multiple virtual environments, even if they differ only slightly.

Docker solves all of these problems. Docker has multiple mirror sites in China. It’s easy to share an environment through Docker Hub or with a Dockerfile. Docker also uses much less disk space thanks to its layer mechanism, since images built on the same base share their common layers.

Using Docker just like Anaconda

Note that the following tutorial might not follow Docker best practices, but it should be the easiest way to use Docker to build your own ML environment and develop ML models. A Dockerfile is not required in this tutorial. Before we start, I assume you have already installed Docker, NVIDIA Docker (the NVIDIA Container Toolkit), and VSCode. Make sure the platform requirements are satisfied before you install.

Platform Requirements

The prerequisites for running the NVIDIA Container Toolkit are listed below:

GNU/Linux x86_64 with kernel version > 3.10

Docker >= 19.03 (recommended, but some distributions may include older versions of Docker. The minimum supported version is 1.12)

NVIDIA GPU with Architecture >= Kepler (or compute capability 3.0)

NVIDIA Linux drivers >= 418.81.07 (Note that older driver releases or branches are unsupported.)

Run this simple test to verify NVIDIA Docker is correctly installed.

$ sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi

This should produce console output like the following:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   34C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Some terminology should be clarified. An image is like a class in Python, and a container is an instance of that image. We don’t use an image directly. When we use the docker run <image> command, Docker creates a new container from the image, and every change we make is made to the container, not to the image.
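To make the image/container distinction concrete, here is a small sketch. The helper name show_container_isolation, the container name isolation-demo, and the default ubuntu:18.04 image are assumptions chosen for illustration and are not used elsewhere in this tutorial. The first container creates a file; a second container started from the same image does not see it, because every docker run starts from the unchanged image.

```shell
# Hypothetical helper: shows that changes live in a container, not in the image.
show_container_isolation() {
    image="${1:-ubuntu:18.04}"   # assumed example image

    # Container 1: create a file inside the container, then exit.
    docker run --name isolation-demo "$image" touch /made-in-container

    # Container 2: a fresh container from the same image. The file is absent,
    # so ls fails and the message after || is printed instead.
    docker run --rm "$image" ls /made-in-container \
        || echo "not found: the image itself was never modified"

    # Clean up the first (stopped) container.
    docker rm isolation-demo
}

# Usage (requires Docker):
#   show_container_isolation
```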

Let’s start now.

Step 1: Pull an Image

Choose a base image to start with and use the following command to download it.

$ docker pull <image>:<tag>

There are many choices on Docker Hub. You can download a basic Linux image with CUDA and cuDNN installed, such as nvidia/cuda, and install your own Python packages. Official deep learning images such as pytorch/pytorch and tensorflow/tensorflow are also provided. It’s also a good idea to start with a Jupyter Notebook image such as jupyter/scipy-notebook or jupyter/tensorflow-notebook.

Here, I would like to start from the base.

$ docker pull nvidia/cuda:11.1.1-cudnn8-devel-ubuntu18.04

Note that docker pull is not strictly necessary, since docker run will pull the image from the remote repository if it doesn’t exist on the local machine.
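If you prefer to make the download step explicit, for example in a setup script, the following sketch pulls an image only when it is missing locally. The helper name ensure_image is an assumption for illustration. It relies on docker image inspect returning a non-zero exit status for an image that is not present.

```shell
# Hypothetical helper: pull an image only when it is missing locally.
ensure_image() {
    image="$1"
    # `docker image inspect` exits non-zero when the image is not present locally.
    if docker image inspect "$image" >/dev/null 2>&1; then
        echo "already present: $image"
    else
        docker pull "$image"
    fi
}

# Usage (requires Docker):
#   ensure_image nvidia/cuda:11.1.1-cudnn8-devel-ubuntu18.04
```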

Step 2: Run a Container

Interactively

Usage

$ docker run --name <container name> -it <Image>:<tag> <command>

Example

$ docker run --name cuda-container -it nvidia/cuda:11.1.1-cudnn8-devel-ubuntu18.04 /bin/bash

--name specifies the container’s name; Docker assigns a random name if it’s not specified
--interactive , -i keeps STDIN open even if not attached
--tty , -t allocates a pseudo-TTY
/bin/bash after <Image>:<tag> is the command executed when the container starts; I use /bin/bash here to specify which shell I want to use in the pseudo-TTY.

Run in Background and Expose a Port to Use Jupyter Notebook

Let’s publish port 8888 for Jupyter Notebook.

$ docker run -p 8888:8888 -it nvidia/cuda:11.1.1-cudnn8-devel-ubuntu18.04

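--detach , -d runs the container in the background and prints the container ID (not used in the command above, because we want an interactive shell in which to install packages)
--publish , -p publishes a container’s port(s) to the host, in the form <host_port>:<container_port>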

Install python3 and Jupyter Lab inside the pseudo-TTY

$ apt-get update \
&& apt-get install -y python3 python3-pip \
&& pip3 install jupyterlab

Run Jupyter Lab

$ jupyter lab --port 8888 --ip 0.0.0.0 --allow-root

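--ip 0.0.0.0 makes Jupyter Lab listen on all network interfaces; by default it listens only on localhost inside the container and cannot be reached through the published port
--allow-root is needed because we are the root user in the container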

Now you can access Jupyter Lab in your browser through port 8888. Remember to forward the port if you’re using Docker on a remote server.
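One common way to forward the port from a remote server is SSH local port forwarding. The sketch below is an assumption about your setup: the helper name forward_port and the user@my-server address are placeholders, and it assumes you can already log in to the server over SSH.

```shell
# Hypothetical helper: forward a local port to the same port on a remote Docker host.
forward_port() {
    port="$1"      # e.g. 8888 for Jupyter Lab
    remote="$2"    # e.g. user@my-server (placeholder address)
    # -N: do not run a remote command, only forward ports.
    # -L: listen on localhost:$port locally and forward to localhost:$port on the server.
    ssh -N -L "${port}:localhost:${port}" "$remote"
}

# Usage (run on your local machine; keeps running until you press Ctrl+C):
#   forward_port 8888 user@my-server
# Then open http://localhost:8888 in your local browser.
```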

Where do I put my data and outputs?

Storing data inside a container is not recommended, because once the container is removed, the data is removed with it. Docker can mount a directory on the host into the container with the -v option. See Docker Docs: Volumes for more details.

Usage

-v <host_path>:<container_path>

--volume , -v Bind mount a volume

Example

Let’s create an example data directory

$ mkdir ~/data_example && echo 'hello' > ~/data_example/data1

Now mount ~/data_example to /data on the container

$ docker run -v ~/data_example/:/data -it nvidia/cuda:11.1.1-cudnn8-devel-ubuntu18.04

Check /data in the pseudo-TTY; you should see hello printed on the terminal

$ cat /data/data1

Now write new data to /data and exit the container

$ echo 'world' > /data/data2 && exit

Check whether the new data was written to the local directory; you should see world printed on the terminal.

$ cat ~/data_example/data2
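The same -v mechanism can mount whatever project directory you are currently in. The sketch below assumes a hypothetical helper name, run_with_workspace, and an arbitrary container path, /workspace. It also uses -w to start the shell in the mounted directory.

```shell
# Hypothetical helper: start an interactive container with the current directory mounted.
run_with_workspace() {
    image="$1"
    # -v "$PWD":/workspace  bind-mounts the current host directory into the container.
    # -w /workspace         makes that directory the working directory of the shell.
    docker run -it -v "$PWD":/workspace -w /workspace "$image" /bin/bash
}

# Usage (requires Docker; run from your project directory):
#   run_with_workspace nvidia/cuda:11.1.1-cudnn8-devel-ubuntu18.04
```

Files written under /workspace in the container appear in the host directory, so they survive even if the container is removed.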

What if I exit the container accidentally?

Do I need to keep my docker container running? The answer is NO!

First, use the following command to list all containers, including exited ones

$ docker container ls -a

Without -a, only running containers are shown.

You will see something like this

CONTAINER ID   IMAGE                                         COMMAND   CREATED          STATUS                      PORTS     NAMES
474a90545899   nvidia/cuda:11.1.1-cudnn8-devel-ubuntu18.04   "bash"    15 minutes ago   Exited (0) 15 minutes ago             suspicious_jemison

Use the CONTAINER ID to specify which container to start

$ docker start -i 474a90545899

--interactive , -i attach container’s STDIN
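If you gave the container a name with --name, you can pass the name instead of the ID, for example docker start -i cuda-container. When you just want to get back into whatever container you created last, a sketch like the following may help. The helper name resume_last_container is an assumption; it uses docker ps -l -q, which prints only the ID of the most recently created container.

```shell
# Hypothetical helper: restart the most recently created container interactively.
resume_last_container() {
    # `docker ps -l -q` prints only the ID of the latest created container
    # (running or exited).
    id="$(docker ps -l -q)"
    if [ -z "$id" ]; then
        echo "no containers found" >&2
        return 1
    fi
    docker start -i "$id"
}

# Usage (requires Docker):
#   resume_last_container
```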

Open Multiple Terminals in Docker

Now that I have a running Docker container, can I execute multiple tasks in it? Of course! The docker exec command lets you run a command in a running container. Let’s say I want to open a new terminal in the running container. The command is as simple as

$ docker exec -it <CONTAINER> bash

Step 3: Commit a Container to a New Image

Let’s say we pulled a basic nvidia/cuda image and installed python3 and Jupyter Lab in the container. How can we save these changes so that next time we can just run an image with all these packages installed? The answer is docker commit.

Usage

$ docker commit [OPTIONS] CONTAINER [REPOSITORY[:TAG]]

Example

As before, we use the CONTAINER ID to specify the container

$ docker commit 474a90545899 yuanpinz/mymlenv:python3-jupyterlab

474a90545899 is my CONTAINER ID

Now use the docker image ls command to check that the new image was created

REPOSITORY              TAG                                     IMAGE ID       CREATED         SIZE
yuanpinz/mymlenv        python3-jupyterlab                      d0666881ff25   6 seconds ago   9.76GB
nvidia/cuda             11.1.1-cudnn8-devel-ubuntu18.04         3e0ba3c9a1db   5 weeks ago     9.35GB
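Because a committed image carries no record of what you installed by hand, it can help to attach a message when committing. The sketch below is one possible convention, not part of the original workflow: the helper name snapshot_container and the date-based tag are assumptions. It uses the -m option of docker commit to store a commit message with the image.

```shell
# Hypothetical helper: commit a container to <repository>:<YYYYMMDD> with a message.
snapshot_container() {
    container="$1"     # container ID or name
    repository="$2"    # e.g. yuanpinz/mymlenv
    message="$3"       # what changed since the base image
    tag="$(date +%Y%m%d)"
    # -m stores the commit message in the image metadata.
    docker commit -m "$message" "$container" "${repository}:${tag}"
}

# Usage (requires Docker):
#   snapshot_container 474a90545899 yuanpinz/mymlenv "add python3 and jupyterlab"
# The message can be inspected later with:
#   docker history yuanpinz/mymlenv:<tag>
```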

Step 4: What if I Want to Expose/Publish New Ports or Attach Volumes to an Existing Container?

In the previous examples, we exposed/published ports and attached volumes by using the docker run command with the -p and -v parameters. What if I want to do this for an existing container, instead of planning at the very beginning which ports to publish and which volumes to attach? The answer is simple but not beautiful: you CAN’T expose/publish ports or attach volumes other than through the docker run command. However, you can always commit an existing container to an image and use the docker run command again.

For example, we published port 8888 for Jupyter Lab previously.

$ docker run -p 8888:8888 -it nvidia/cuda:11.1.1-cudnn8-devel-ubuntu18.04

Let’s say the CONTAINER ID of this container is 474a90545899.

We commit this container to a new image

$ docker commit 474a90545899 yuanpinz/cuda:port8888

Now we can publish a new port, 6006, for potential use with TensorBoard by running

$ docker run -p 6006:6006 -it yuanpinz/cuda:port8888
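Keep in mind that published ports belong to a container, not to the image, so the committed image does not remember -p 8888:8888. If you still need Jupyter Lab in the new container, pass both ports. The following sketch, with the assumed helper name run_with_ports, builds one -p option for each port given and maps each host port to the same port in the container.

```shell
# Hypothetical helper: run an image interactively, publishing every port given.
run_with_ports() {
    image="$1"
    shift
    publish_args=""
    for port in "$@"; do
        # Map each host port to the same port inside the container.
        publish_args="$publish_args -p ${port}:${port}"
    done
    # $publish_args is left unquoted on purpose so it splits into separate arguments.
    docker run $publish_args -it "$image"
}

# Usage (requires Docker):
#   run_with_ports yuanpinz/cuda:port8888 8888 6006
```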

Step 5: Push Your Image to Docker Hub

You can use Docker Hub just like GitHub to share or synchronize your images in the cloud. Note that only one private repository is allowed on the free plan. The Docker Hub Quickstart will help you publish your own image on Docker Hub.

Example

First, log in to your Docker Hub account.

$ docker login

Second, push your image to Docker Hub.

$ docker push yuanpinz/cuda:port8888

Now that your image is uploaded to Docker Hub, you can pull it on any other machine.

$ docker pull yuanpinz/cuda:port8888

Step 6: Use VSCode to Edit Files Inside Containers

Check the official VSCode post about Working with containers. If your Docker is installed on a remote server, also read the post Connect to remote Docker over SSH.

In short, all you need to do is install the Docker extension for VSCode. If you’re using Docker on a remote server, make sure the extension is also installed on the server. To do this, connect to your server through SSH in VSCode and install the extension through the Marketplace.

Example

Step 1. Since I installed Docker on a remote server, I connected to my server through SSH in VSCode and installed the Docker extension (you may need to restart VSCode after the installation). If Docker is installed on your local machine, skip the SSH step and install the extension locally. Open a workspace. (I don’t know why, but this is important; otherwise you will get the “There are no running containers to attach to” error when you try to attach to a container later.) Now you can check your containers and images through VSCode.

Step 2. Run a Container in the background through command line.

$ docker run -t -d yuanpinz/dlenv:11.1.1-cudnn8-pytorch1.9.0-pyg2.0.0-devkit

-t is necessary; otherwise the container will stop immediately.

Note that if you use “Run” or “Run interactive” in the Docker extension, it will run the container with the --rm flag, which removes the container once it exits.

Step 3. Attach VSCode to the Container.

VSCode will install its server inside the container automatically.

Install the Python and Pylance extensions for programming in Python if they are not installed automatically. A reload might be required.

Step 4. Now everything is ready. The VSCode workspace is set to your WORKDIR in the container. Let’s write a “hello world” script. Autocomplete is enabled in VSCode.

Run and Debug work just like on a local machine.

A Simple Dockerfile Template for Data Scientists

FROM <basic Image>
RUN <install command to install any packages>
COPY <path in the host> <path in the container>
WORKDIR <workdir>
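As an illustration, here is one way the template could be filled in to reproduce the environment built by hand in the earlier steps. This is a sketch under assumptions, not a tested production image: the requirements.txt file, the /workspace directory, and the mymlenv:dockerfile tag are placeholders. The script only writes the files; the build command is shown as a comment.

```shell
# Write an example build context into a fresh temporary directory.
build_dir="$(mktemp -d)"

# Python packages to install (assumed file name and contents).
printf 'jupyterlab\n' > "$build_dir/requirements.txt"

cat > "$build_dir/Dockerfile" <<'EOF'
# Base image used throughout this tutorial
FROM nvidia/cuda:11.1.1-cudnn8-devel-ubuntu18.04
# Same system packages that were installed by hand in Step 2
RUN apt-get update \
    && apt-get install -y python3 python3-pip
# Copy the package list from the host into the image and install it
COPY requirements.txt /tmp/requirements.txt
RUN pip3 install -r /tmp/requirements.txt
# Assumed working directory inside the container
WORKDIR /workspace
EOF

echo "Dockerfile written to $build_dir"

# Build the image (requires Docker); the tag is a placeholder:
#   docker build -t mymlenv:dockerfile "$build_dir"
```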
