Friday, 16 November 2018

Building Tensorflow GPU on Fedora Linux

First off, let's say that there are easy ways of configuring Tensorflow for GPU usage such as using one of the docker images.  However, I'm a bit old school for some things and having always done so I've recently got Tensorflow going on my machine using my GPU.  Tensorflow CPU support is quite easy to do and generally works quite nicely using the pip install method.  GPU support, I've always found, is quite a bit more difficult as there are a whole bunch of things that need to be at just the right level for everything to work i.e. it's quite brittle!

What follows are my notes (it's in the name of the blog) for how to build Tensorflow from scratch to enable GPU support and I do this on Fedora Linux.  If you want to know why it's worth bothering going to this effort, I've tested the Keras MNIST CNN example as a bench mark.  It takes:
  • 11 minutes 7 seconds on my CPU
  • 2 minutes 55 seconds on my GPU
That's just over 3.8 as fast on my GPU as per my CPU so for large jobs this will be huge!

Some info on my machine and config:
  • Lenovo P50 Laptop
  • Intel Core i7-6820HQ CPU @ 2.70GHz (4 core with hyper threading)
  • 32GB RAM
  • Nvidia Quadro M1000M (CUDA compute capability 5.0)
  • Fedora 28 running kernel 4.18.18-200.fc28.x86_64
Install Required Nvidia RPMs
You need to get everything Nvidia and CUDA installed on your machine first.   I quite like the Negativo17 repository for Nvidia on Fedora Linux and so I use this but you could also go with RPM Fusion or even download everything directly from Nvidia.  For me, right now, I have this little lot installed:

You might wonder about some of the above, particularly why you might need a back level version of GCC.  When Fedora 28 has a quite capable GCC version 8 why on earth would you want version 7?  The answer lies in my comment about things being difficult or brittle, it's quite simply that CUDA doesn't yet support GCC 8 so you do need a back level compiler for this

Install NVidia NCCL
This library isn't available through an RPM installation or the Negativo17 repository and so you must:
  • Go to the Nvidia NCCL home page 
  • Click the link to download NCCL (requires an Nvidia developer login account)
  • Agree to the Terms and Conditions
  • Download the NCCL zipped tar file that matches your CUDA version (9.2 for this blog post)

At the time of writing the file required is nccl_2.3.7-1+cuda9.2_x86_64.txz

I simply untar this file into /usr/local and create a symbolic link as follows:
  • cd /usr/local
  • sudo tar -xf /path/to/file/nccl_2.3.7-1+cuda9.2_x86_64.txz
  • sudo ln -s nccl_2.3.7-1+cuda9.2_x86_64.txz nccl

Install the Bazel Build Tool
You're going to need a build tool called Bazel which isn't directly available in the Fedora repositories (that I know of at least) but fortunately there's a version in a copr repository you can use as documented run the following commands:
  •  dnf copr enable vbatts/bazel
  •  dnf install bazel
Get a Copy of Tensorflow Source
For this it's just as easy to use git as it is anything else.  You can directly clone the 1.12 release of Tensorflow into a new directory by running:
  • git clone --single-branch -b r1.12 tensorflow-r1.12
  • cd tensorflow-r1.12
Simply replace r1.12 in the above commands if you want to use a different Tensorflow release.

Run the Tensorflow Configure Script
This step is actually quite simple but you'll need the answers to some questions to hand, simply run:
  • ./configure
I accept all the default options with the exception of:
  • "location of python" set to /usr/bin/python3 since Fedora still uses Python 2.7 as the default version at /usr/bin/python
  • "build TensorFlow with CUDA support" set to Yes
  • "CUDA SDK version" set to 9.2 (this value should match the cuda version you have installed and at the time of writing 9.2 is the current version from the Negativo17 repository)
  • "location where CUDA 9.2 toolkit is installed" set to /usr
  • "cuDNN version" set to 7.2 (similar to the cuda version above, this value should match the cuda-cudnn package version and 7.2 is the current version from the Negativo17 repository)
  • "NCCL version" set to 2.3
  • "location where NCCL 2 library is installed" set to /usr/local/nccl
  • "Cuda compute capabilities you want to build with" set to 5.0 (but this value should match the CUDA compute capability of the GPU in the machine you're building for)
  • "which gcc" set to /usr/bin/cuda-gcc (to use the back level GCC version 7)

Fix Bazel Config
The above config command writes a file but the location isn't compatible with the latest version of Bazel.  Presumably this issue will be fixed at some point in the future, it's not an issue with Bazel 0.18 and below as far as I'm aware, but has just become an issue on 0.19.  Simply copy the config to the correct place:
  • cat tools/bazel.rc >> .tf_configure.bazelrc
Build Tensorflow with GPU Support
This took around an hour to complete on my machine:
  • bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
  • bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
The first step is the long one for the build, the second simply builds the python wheel file.

Install Tensorflow with GPU Support
You've got your wheel file so simply install and enjoy:
  • pip3 install tensorflow-1.12.0-cp36-cp36m-linux_x86_64.whl 
Run Some Code
The first time I attempted to run some code to test I got an error:
  • failed call to cuInit: CUDA_ERROR_UNKNOWN
This seems to be some sort of permissions issue and running the following simple script to output the GPUs available on my machine but as root seems to have fixed the above issue i.e. put the following into a script, run that script as root, then any time you want to run code as an unprivileged user the above issue is fixed and the code will work:
from keras import backend as K

If the above works then you can try out the Keras MNIST CNN example code.

No comments: