Nvidia made a great technical decision when it added GPU support to Docker images. In practice, though, it creates a whole new set of problems. In this post I explain my configuration, the issues I ran into, and a possible solution.

Docker vs Nvidia-Docker

Most of you use Docker, but probably don’t know much about docker images with GPU support. These are very useful for Machine Learning stacks (e.g. TensorFlow, Torch, etc.), which can be deployed on CPU, but in most cases evaluation will take much longer there. For a decent mid-sized model, the GPU is roughly 20x faster. Nobody trains large ML models on CPU; it is simply not feasible, and the same holds for evaluation. Therefore you need GPU support.

One option is to run on bare metal. If that is not good enough for you, you can use Nvidia-Docker. Now, in theory, this is just as good as Docker, plus GPU support. So it’s better … right?

The Nvidia Docker repository has a list of the benefits of GPU containerization; you can read more about it there.

Nvidia-Docker

Nvidia-Docker ships as a separate package, with its own client and a docker plugin. These can be easily installed from the packages Nvidia provides. Now, instead of running a container with:

  docker run <image> <command>

you use:

  nvidia-docker run <image> <command>

Easy.
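
As a quick sanity check (assuming the host drivers are set up and you use the public nvidia/cuda image), you can ask a container to list the GPUs:

  nvidia-docker run --rm nvidia/cuda nvidia-smi

If everything is wired up correctly, nvidia-smi prints the same GPU table inside the container as it does on the host.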

So far, nvidia-docker seems like a better version of docker: it can do everything docker can do, plus GPU support. Why shouldn’t we use it for everything from now on?

… because it’s not really compliant with the standard.

My deployment pipeline

My deployment pipeline uses Jenkins (running within a docker image) to build other docker images. So far, I hope, I haven’t shocked anybody.

From inside a docker image you can also build and deploy docker images that end up outside of it, on the host. This is possible due to the docker client-server architecture. If you skipped this part of the documentation, I really recommend you have a look at the Docker overview page.

What everybody does when building docker images this way is re-use the docker unix socket created by the daemon. That is the weird /var/run/docker.sock file, which is not a regular file but a unix socket. Sharing it allows communication between the docker client (within a docker image) and the docker daemon (outside, on the host).
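
A minimal sketch of what that mount looks like for a Jenkins master (the jenkins/jenkins image and the port mapping are just an example):

  docker run -d \
    -p 8080:8080 \
    -v /var/run/docker.sock:/var/run/docker.sock \
    jenkins/jenkins:lts

Note that the container still needs a docker client binary installed inside it; the mounted socket only gives that client a direct line to the host daemon.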

Now, for nvidia-docker, you’ll just have to connect to the nvidia-docker daemon … right? There is a very subtle problem with this. Nvidia-docker does not have a separate daemon that speaks the docker protocol (it does have one, in fact, but it does not expose the docker protocol). To build your stuff, you will still have to build/run against the docker daemon, which delegates to the nvidia-docker daemon via a docker plugin.

In practice, to make things easier, you will have to share the same /var/run/docker.sock socket between your docker image and your host.

Conclusion: You can build, but you can’t run

Within a regular docker image, with the nvidia-docker package installed, you can build nvidia-docker images. Because of the delegation process within the host docker daemon, your image is able to build correctly.
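
For example, from inside the Jenkins container a build like this goes through the shared socket and succeeds, because no GPU is touched at build time (my-gpu-image is a placeholder name; assume a Dockerfile based on nvidia/cuda in the current directory):

  docker build -t my-gpu-image .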

But …

You would not be able to run this. The nvidia-docker client (within a regular docker image) raises an error that it does not have access to GPU resources. Which, I think, is fair …

Now you have two options:

  1. Either build everything with nvidia-docker (Jenkins and everything else), or
  2. SSH into the host machine (under a deployment username) and run your nvidia-docker run script there.

The first idea was not an option for me because I really want to control which image has access to the GPU. This should prevent weird errors and configuration issues. I went with option 2 because it lets me control what is being used at all times. Plus, I think many deployment engines prefer to ssh into the target machines rather than share the docker unix socket anyway.
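
In my setup, option 2 boils down to something like this (the deploy user, host name and image name are placeholders):

  ssh deploy@gpu-host "nvidia-docker run -d --name my-service my-gpu-image"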

The promise: nvidia-docker 2.0

I recently found out that Nvidia is working on a new version of their docker integration which plugs in as a docker runtime. You can use the same docker command, but delegate the execution to the nvidia runtime:

  docker run --runtime=nvidia <image> <command>

This seems promising, but the build process still has to be hacked. In order to build properly, you will have to set the default docker runtime to the nvidia runtime, which (I think) still breaks the control ideas I was talking about earlier.
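
For reference, setting the default runtime is done in /etc/docker/daemon.json, roughly like this (based on the nvidia-docker 2.0 documentation; double-check against your version):

  {
    "default-runtime": "nvidia",
    "runtimes": {
      "nvidia": {
        "path": "nvidia-container-runtime",
        "runtimeArgs": []
      }
    }
  }

Once nvidia is the default runtime, every container on that host goes through it, including the ones that have no business touching the GPU, which is exactly the loss of control I was complaining about.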