Friday, 16 November 2018

Building Tensorflow GPU on Fedora Linux

First off, there are easier ways of configuring Tensorflow for GPU usage, such as using one of the docker images.  However, I'm a bit old school about some things and have always installed directly on my own machine, so that's how I've recently got Tensorflow going with my GPU.  Tensorflow CPU support is quite easy and generally works nicely using the pip install method.  GPU support, I've always found, is quite a bit more difficult as there are a whole bunch of things that need to be at just the right level for everything to work, i.e. it's quite brittle!

What follows are my notes (it's in the name of the blog) on how to build Tensorflow from scratch with GPU support enabled, which I do on Fedora Linux.  If you want to know why it's worth going to this effort, I've tested the Keras MNIST CNN example as a benchmark.  It takes:
  • 11 minutes 7 seconds on my CPU
  • 2 minutes 55 seconds on my GPU
That's just over 3.8 times as fast on my GPU as on my CPU, so for large jobs the saving will be huge!
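
For anyone wanting to check the arithmetic, here is a minimal sketch that turns the two timings quoted above into a speed-up factor. The variable names are just for illustration.

```python
# Timings quoted above for the Keras MNIST CNN example.
cpu_seconds = 11 * 60 + 7   # 11 minutes 7 seconds
gpu_seconds = 2 * 60 + 55   # 2 minutes 55 seconds

# Speed-up is simply CPU time divided by GPU time.
speedup = cpu_seconds / gpu_seconds
print(f"CPU: {cpu_seconds}s, GPU: {gpu_seconds}s, speed-up: {speedup:.2f}x")
# prints "CPU: 667s, GPU: 175s, speed-up: 3.81x"
```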

Some info on my machine and config:
  • Lenovo P50 Laptop
  • Intel Core i7-6820HQ CPU @ 2.70GHz (4 core with hyper threading)
  • 32GB RAM
  • Nvidia Quadro M1000M (CUDA compute capability 5.0)
  • Fedora 28 running kernel 4.18.18-200.fc28.x86_64

Install Required Nvidia RPMs
You need to get everything Nvidia and CUDA installed on your machine first.   I quite like the Negativo17 repository for Nvidia on Fedora Linux and so I use this but you could also go with RPM Fusion or even download everything directly from Nvidia.  For me, right now, I have this little lot installed:
cuda-9.2.148.1-2.fc28.x86_64
cuda-cli-tools-9.2.148.1-2.fc28.x86_64
cuda-cublas-9.2.148.1-2.fc28.x86_64
cuda-cublas-devel-9.2.148.1-2.fc28.x86_64
cuda-cudart-9.2.148.1-2.fc28.x86_64
cuda-cudart-devel-9.2.148.1-2.fc28.x86_64
cuda-cudnn-7.2.1.38-1.fc28.x86_64
cuda-cudnn-devel-7.2.1.38-1.fc28.x86_64
cuda-cufft-9.2.148.1-2.fc28.x86_64
cuda-cufft-devel-9.2.148.1-2.fc28.x86_64
cuda-cupti-9.2.148.1-2.fc28.x86_64
cuda-cupti-devel-9.2.148.1-2.fc28.x86_64
cuda-curand-9.2.148.1-2.fc28.x86_64
cuda-curand-devel-9.2.148.1-2.fc28.x86_64
cuda-cusolver-9.2.148.1-2.fc28.x86_64
cuda-cusolver-devel-9.2.148.1-2.fc28.x86_64
cuda-cusparse-9.2.148.1-2.fc28.x86_64
cuda-cusparse-devel-9.2.148.1-2.fc28.x86_64
cuda-devel-9.2.148.1-2.fc28.x86_64
cuda-docs-9.2.148.1-2.fc28.noarch
cuda-gcc-7.3.0-1.fc28.x86_64
cuda-gcc-c++-7.3.0-1.fc28.x86_64
cuda-gcc-gfortran-7.3.0-1.fc28.x86_64
cuda-libs-9.2.148.1-2.fc28.x86_64
cuda-npp-9.2.148.1-2.fc28.x86_64
cuda-npp-devel-9.2.148.1-2.fc28.x86_64
cuda-nvgraph-9.2.148.1-2.fc28.x86_64
cuda-nvgraph-devel-9.2.148.1-2.fc28.x86_64
cuda-nvml-devel-9.2.148.1-2.fc28.x86_64
cuda-nvrtc-9.2.148.1-2.fc28.x86_64
cuda-nvrtc-devel-9.2.148.1-2.fc28.x86_64
cuda-nvtx-9.2.148.1-2.fc28.x86_64
cuda-nvtx-devel-9.2.148.1-2.fc28.x86_64
nvidia-driver-cuda-libs-410.73-4.fc28.x86_64

You might wonder about some of the above, particularly why you might need a back-level version of GCC.  When Fedora 28 has a quite capable GCC version 8, why on earth would you want version 7?  The answer lies in my comment about things being difficult or brittle: quite simply, CUDA doesn't yet support GCC 8, so you need a back-level compiler for this.

Install NVidia NCCL
This library isn't available through an RPM installation or the Negativo17 repository and so you must:
  • Go to the Nvidia NCCL home page 
  • Click the link to download NCCL (requires an Nvidia developer login account)
  • Agree to the Terms and Conditions
  • Download the NCCL zipped tar file that matches your CUDA version (9.2 for this blog post)

At the time of writing the file required is nccl_2.3.7-1+cuda9.2_x86_64.txz

I simply untar this file into /usr/local and create a symbolic link as follows:
  • cd /usr/local
  • sudo tar -xf /path/to/file/nccl_2.3.7-1+cuda9.2_x86_64.txz
  • sudo ln -s nccl_2.3.7-1+cuda9.2_x86_64 nccl
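
If you'd rather script the untar-and-link step, the following is a rough Python sketch of the same idea. It assumes the archive unpacks into a single top-level directory (named after the archive without the .txz suffix), and the symbolic link points at that directory rather than at the archive file. The function name install_nccl and its parameters are my own invention for illustration.

```python
import os
import tarfile


def install_nccl(archive_path, prefix="/usr/local", link_name="nccl"):
    """Unpack an NCCL .txz archive under prefix and symlink link_name to it."""
    with tarfile.open(archive_path, "r:xz") as archive:
        # Assumption: everything in the archive sits under one top-level directory.
        top_levels = set()
        for member in archive.getmembers():
            parts = [p for p in member.name.split("/") if p not in ("", ".")]
            if parts:
                top_levels.add(parts[0])
        if len(top_levels) != 1:
            raise ValueError(f"expected one top-level directory, found {sorted(top_levels)}")
        archive.extractall(prefix)

    top_level = top_levels.pop()
    link_path = os.path.join(prefix, link_name)
    if os.path.islink(link_path):
        os.remove(link_path)  # replace a link left behind by an older NCCL version
    # Relative link to the extracted directory (not to the .txz file itself).
    os.symlink(top_level, link_path)
    return link_path
```

Calling install_nccl("/path/to/file/nccl_2.3.7-1+cuda9.2_x86_64.txz") with the default prefix needs to be done as root since it writes to /usr/local.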


Install the Bazel Build Tool
You're going to need a build tool called Bazel, which isn't directly available in the Fedora repositories (that I know of at least).  Fortunately there's a version in a copr repository you can use; run the following commands as root:
  • sudo dnf copr enable vbatts/bazel
  • sudo dnf install bazel

Get a Copy of Tensorflow Source
For this it's just as easy to use git as it is anything else.  You can directly clone the 1.12 release of Tensorflow into a new directory by running:
  • git clone --single-branch -b r1.12 https://github.com/tensorflow/tensorflow tensorflow-r1.12
  • cd tensorflow-r1.12
Simply replace r1.12 in the above commands if you want to use a different Tensorflow release.

Run the Tensorflow Configure Script
This step is actually quite simple but you'll need the answers to some questions to hand.  Simply run:
  • ./configure
I accept all the default options with the exception of:
  • "location of python" set to /usr/bin/python3 since Fedora still uses Python 2.7 as the default version at /usr/bin/python
  • "build TensorFlow with CUDA support" set to Yes
  • "CUDA SDK version" set to 9.2 (this value should match the cuda version you have installed and at the time of writing 9.2 is the current version from the Negativo17 repository)
  • "location where CUDA 9.2 toolkit is installed" set to /usr
  • "cuDNN version" set to 7.2 (similar to the cuda version above, this value should match the cuda-cudnn package version and 7.2 is the current version from the Negativo17 repository)
  • "NCCL version" set to 2.3
  • "location where NCCL 2 library is installed" set to /usr/local/nccl
  • "Cuda compute capabilities you want to build with" set to 5.0 (but this value should match the CUDA compute capability of the GPU in the machine you're building for)
  • "which gcc" set to /usr/bin/cuda-gcc (to use the back level GCC version 7)
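
The CUDA and cuDNN answers above are just the major.minor part of the installed package versions. As a small sketch of that mapping, the snippet below pulls "9.2" and "7.2" out of package strings copied from the list earlier in this post. The helper name major_minor is hypothetical, and the regular expression assumes the usual name-version-release.arch layout of RPM package names.

```python
import re


def major_minor(package, name):
    """Return 'X.Y' from an RPM string such as 'cuda-9.2.148.1-2.fc28.x86_64'.

    `name` must match the package name exactly, so 'cuda' does not
    also match 'cuda-cudnn'. Returns None if the package doesn't match.
    """
    match = re.fullmatch(re.escape(name) + r"-(\d+)\.(\d+)[.\d]*-.+", package)
    if match is None:
        return None
    return f"{match.group(1)}.{match.group(2)}"


# A few entries copied from the installed package list above.
installed = [
    "cuda-9.2.148.1-2.fc28.x86_64",
    "cuda-cudnn-7.2.1.38-1.fc28.x86_64",
    "cuda-gcc-7.3.0-1.fc28.x86_64",
]

for name, question in [("cuda", "CUDA SDK version"), ("cuda-cudnn", "cuDNN version")]:
    versions = [v for v in (major_minor(p, name) for p in installed) if v]
    print(f"{question}: {versions[0]}")
# prints:
# CUDA SDK version: 9.2
# cuDNN version: 7.2
```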


Fix Bazel Config
The configure command above writes a config file, but the latest version of Bazel no longer reads config from all the locations Tensorflow expects.  Presumably this will be fixed at some point in the future; as far as I'm aware it's not a problem with Bazel 0.18 and below, and has only just become one with 0.19.  Simply append the contents of tools/bazel.rc to the file the configure script generated:
  • cat tools/bazel.rc >> .tf_configure.bazelrc

Build Tensorflow with GPU Support
This took around an hour to complete on my machine:
  • bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
  • bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
The first command is the long-running build; the second simply packages the result as a Python wheel file in /tmp/tensorflow_pkg.

Install Tensorflow with GPU Support
You've got your wheel file, so simply install it and enjoy:
  • pip3 install /tmp/tensorflow_pkg/tensorflow-1.12.0-cp36-cp36m-linux_x86_64.whl

Run Some Code
The first time I attempted to run some code to test I got an error:
  • failed call to cuInit: CUDA_ERROR_UNKNOWN
This can be solved by making sure you have the nvidia-modprobe package installed.  Alternatively, you can run the little script below as root.

This seems to be some sort of permissions issue.  Running the following simple script, which outputs the GPUs available on my machine, as root seems to have fixed it.  In other words, put the following into a script and run that script as root, after which code run as an unprivileged user will work:
from keras import backend as K
K.tensorflow_backend._get_available_gpus()

If the above works then you can try out the Keras MNIST CNN example code.
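
As an alternative to the private Keras helper used above, here is a hedged sketch that asks Tensorflow directly for its devices. It assumes a Tensorflow 1.x install, where tensorflow.python.client.device_lib.list_local_devices() returns entries with name and device_type attributes. The function names gpu_device_names and list_gpus are my own.

```python
def gpu_device_names(devices):
    """Return the names of the entries whose device_type is 'GPU'."""
    return [d.name for d in devices if d.device_type == "GPU"]


def list_gpus():
    """Ask Tensorflow for its local devices and keep only the GPUs."""
    # Imported here so gpu_device_names() can be used without Tensorflow installed.
    from tensorflow.python.client import device_lib
    return gpu_device_names(device_lib.list_local_devices())
```

Calling list_gpus() on a working build should return a non-empty list; an empty list means Tensorflow can't see the GPU.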

6 comments:

Unknown said...

Hi,

I'm happy to find a fellow TF-on-Fedora nerd ... May I ask if building TF on Fedora still works for you?
I switched from manual driver + cuda installation to negativo17 about 2 months ago (on a fresh install of F30), and at first I was able to build successfully (there were a few quirks though, see https://github.com/tensorflow/tensorflow/issues/29797).

Then a few weeks ago I got a new error - see https://groups.google.com/a/tensorflow.org/forum/#!topic/build/AB_nEXhUF0E - and I've not been able to fix that one. By now I think the current TF build does require a CUDA installation in /usr/local/cuda, because they are doing lots of stuff in /usr/local/cuda/bin which if one has /usr as the CUDA root, just ends up to be /usr/bin which is pretty horrible ;-))

I was hoping for someone to react to my above mail to TF, but there was no answer... I do understand they have higher priorities, but I'd still like to be able to build from source, and I'd rather stay with negativo17 which overall works a lot better than the manual method (no dkms failures, etc.).

Just asking for your current experience (if you're still interested)? Thanks!

Graham White said...

Hi Sigrid,

Thanks for letting me know about your comment. I found a problem where Google wasn't notifying me of comments to the blog so I hope I've fixed that now - many thanks!

Building TF on Fedora does still work for me, yes. It's very similar to the process I've documented in the blog post except one or two bugs have been fixed but one or two other problems have been introduced as well. However, I'm now at the stage where the hardware on my work laptop is too old (CUDA compute capability is too low) for the supported versions of drivers that are easily installed. For example, Negativo17 tends to support the latest CUDA version whereas Tensorflow doesn't. This isn't necessarily a major issue as you can downgrade to an earlier CUDA from Negativo17 and still get that side of things going. Where I struggle now is that the holy grail of matching requirements just doesn't work on my laptop: I can't line up the CUDA version with the Tensorflow version and the compute capability I have.

So yes, I'm still very much interested and I really wish they'd make Fedora as much of a first-class citizen as they do Ubuntu but that seems unlikely as they favour using Docker images for GPU acceleration these days anyway.

Unknown said...

Hi Graham,

thanks for responding, and I'm glad to hear you're still interested in this! :-)
I'm pretty sure they must have changed something in the last few weeks, as I could still successfully build in June (with CUDA 10.1 from negativo17).

But already then I found it weird to see that they were doing stuff below /usr that couldn't really be intended to be done there (like, copying stuff like sudo, sudoedit etc. some place else).

And now the failure evidently is that they're trying to create a tempfile in /usr (which fails because of missing permissions).

So this is why I think the current build needs a separate CUDA in /usr/local.

Right now I've really run out of ideas what to do (I'm also not a build specialist so I can't just quickly tweak the cuda part of the build ...).

If you ever try again it would be great if you could let me know how it works for you now :-)
BTW CUDA 10.1 works fine when you build from source, it's just the PyPI wheels that need CUDA 10.0 ...

One question, as you mention that, would you happen to know how I can downgrade to CUDA 10.0 from negativo17, to at least run a PyPi wheel? dnf downgrade just gives me a prior version of 10.1.
Thanks!

Graham White said...

I've just checked and it looks like CUDA 10.0 has been removed from Negativo17 so it won't be possible to downgrade. You could check to see if RPM Fusion offers a 10.0 install? They have guides for the Nvidia driver as well as CUDA.

Unknown said...

Thanks! I had been on that page but now looking again I see

Tweak the /usr/local/cuda-9.2/targets/x86_64-linux/include/host_defines.h to accept the Fedora default compiler. (Not recommended).

That sounds like they install into /usr/local? In that case I really might try that (hoping it helps with the source build)...

Graham White said...

Best of luck. I'll update the post (or perhaps write another one) should I have another go at this at some point. This will involve me doing it on a different machine at work or buying a new GPU for my box at home though so it might be a while. For reference, it should be possible not to get stuck with the requirement on building in /usr but I've not tried with the very latest stack so can't offer much more sensible advice than the usual suggestions of checking you've got the various build prefixes set in the appropriate place(s).