Showing posts with label gpu. Show all posts
Showing posts with label gpu. Show all posts

Friday 13 September 2019

Installing Tensorflow GPU on Fedora Linux

Following on from my previous notes on building Tensorflow for a GPU on Fedora, I find myself back at it again.  I recently upgraded my GPU at home and time has moved on too so this is my current set of notes for what I'm doing with Tensorflow on Fedora.  This method, however, differs from my previous notes in as much as I'm using the pre-built Tensorflow rather than building my own.  I've found that Tensorflow is so brittle during the build process it's much easier to work with pre-built binaries and set up my system to match their build.

In my previous blog post I benchmarked the CPU versus GPU using the Keras MNIST CNN example and so I thought it would be interesting to offer the same for this new install on my home machine.  The results are  :
  • 12 minutes and 14 seconds on my CPU
  • 1 minutes and 14 seconds on my GPU
That's just over 9.9 as fast on my GPU as my CPU!

Some info on my machine and config:
  • Custom Built Home PC
  • Intel Core i5-3570K CPU @ 3.40GHz (4 cores)
  • 16GB RAM
  • NVidia GeForce GTX 1660 (CUDA Compute Capability 7.5)
  • Fedora 30 Workstation running kernel 5.2.9-200.fc30.x86_64
Background Information for NVidia Drivers
Previously, I've always used the Negativo17 repository for all my NVidia driver and CUDA needs.  However, the software versions available there are too up-to-date to allow Tensorflow GPU to be installed in a way that works.  This repository provides CUDA 10.1 where as Tensorflow, currently at version 1.14, only supports CUDA 10.0.  So we must use another source for the NVidia software that provides back-level versions.  Fortunately, there is an official NVidia repository providing drivers and CUDA for Linux, so let's use that since it also works quite nicely with the RPM Fusion repositories as well.  Hence, this method relies purely on RPM Fusion and the official NVidia repository and does not require or use the Negativo17 repository (although it would be possible to do so).

Install Required NVidia Driver
The RPM Fusion NVidia instructions can be used here for more detail, but in brief simply install the display drivers:
  • dnf install xorg-x11-drv-nvidia akmod-nvidia xorg-x11-drv-nvidia-cuda
There are some other bits you might want from this repository as well such as:
    • dnf install vdpauinfo libva-vdpau-driver libva-utils nvidia-modprobe
    Wait for the driver to build and reboot to get things up and running.

    Install Required NVidia CUDA and Machine Learning Libraries
    This step relies on using the official nvidia repositories with a little more information available in the RPM Fusion CUDA instructions.

    First of all, add a new yum configuration file.  Copy the following to /etc/yum.repos.d/nvidia.repo:

    [nvidia-cuda]
    name=nvidia-cuda
    enabled=1
    gpgcheck=1
    gpgkey=http://developer.download.nvidia.com/compute/cuda/repos/fedora27/x86_64/7fa2af80.pub
    exclude=akmod-nvidia*,kmod-nvidia*,*nvidia*,nvidia-*,cuda-nvidia-kmod-common,dkms-nvidia,nvidia-libXNVCtrl

    [nvidia-machine-learning]

    name=nvidia-machine-learning
    baseurl=http://developer.download.nvidia.com/compute/machine-learning/repos/rhel7/x86_64/
    enabled=1
    gpgcheck=1
    gpgkey=http://developer.download.nvidia.com/compute/machine-learning/repos/rhel7/x86_64/7fa2af80.pub
    exclude=libcudnn7*.cuda10.1,libnccl*.cuda10.1



    Note that the configuration above deliberately targets the fedora27 repository from NVidia.  This is because it is the location at which we can find CUDA 10.0 compatible libraries rather than CUDA 10.1 libraries that will be found in later repositories.  So the configuration above is likely to need to change over time but essentially the message here is that we can match the version of CUDA required by targeting the appropriate repository from NVidia.  These libraries will be binary compatible with future versions of Fedora so this action should be safe to do for some time yet.



    With the following configuration in place we can now install CUDA 10.0 and the machine learning libraries required for Tensorflow GPU support and all of the libraries get installed in the correct places that Tensorflow expects.

    To install, run:
    • dnf install cuda libcudnn7 libnccl

    Install Tensorflow GPU
    The final piece of the puzzle is to install Tensorflow GPU which is now as easy as:
    • pip3 install tensorflow-gpu

    Friday 16 November 2018

    Building Tensorflow GPU on Fedora Linux

    <update on Sept 13th 2019>
    I have written another post on how to install (rather than build) Tensorflow GPU for Fedora that uses a different and much simpler method.  See Installing Tensorflow GPU on Fedora Linux.
    </update>

    First off, let's say that there are easy ways of configuring Tensorflow for GPU usage such as using one of the docker images.  However, I'm a bit old school for some things and having always done so I've recently got Tensorflow going on my machine using my GPU.  Tensorflow CPU support is quite easy to do and generally works quite nicely using the pip install method.  GPU support, I've always found, is quite a bit more difficult as there are a whole bunch of things that need to be at just the right level for everything to work i.e. it's quite brittle!

    What follows are my notes (it's in the name of the blog) for how to build Tensorflow from scratch to enable GPU support and I do this on Fedora Linux.  If you want to know why it's worth bothering going to this effort, I've tested the Keras MNIST CNN example as a bench mark.  It takes:
    • 11 minutes 7 seconds on my CPU
    • 2 minutes 55 seconds on my GPU
    That's just over 3.8 as fast on my GPU as per my CPU so for large jobs this will be huge!

    Some info on my machine and config:
    • Lenovo P50 Laptop
    • Intel Core i7-6820HQ CPU @ 2.70GHz (4 core with hyper threading)
    • 32GB RAM
    • Nvidia Quadro M1000M (CUDA compute capability 5.0)
    • Fedora 28 running kernel 4.18.18-200.fc28.x86_64
    Install Required Nvidia RPMs
    You need to get everything Nvidia and CUDA installed on your machine first.   I quite like the Negativo17 repository for Nvidia on Fedora Linux and so I use this but you could also go with RPM Fusion or even download everything directly from Nvidia.  For me, right now, I have this little lot installed:
    cuda-9.2.148.1-2.fc28.x86_64
    cuda-cli-tools-9.2.148.1-2.fc28.x86_64
    cuda-cublas-9.2.148.1-2.fc28.x86_64
    cuda-cublas-devel-9.2.148.1-2.fc28.x86_64
    cuda-cudart-9.2.148.1-2.fc28.x86_64
    cuda-cudart-devel-9.2.148.1-2.fc28.x86_64
    cuda-cudnn-7.2.1.38-1.fc28.x86_64
    cuda-cudnn-devel-7.2.1.38-1.fc28.x86_64
    cuda-cufft-9.2.148.1-2.fc28.x86_64
    cuda-cufft-devel-9.2.148.1-2.fc28.x86_64
    cuda-cupti-9.2.148.1-2.fc28.x86_64
    cuda-cupti-devel-9.2.148.1-2.fc28.x86_64
    cuda-curand-9.2.148.1-2.fc28.x86_64
    cuda-curand-devel-9.2.148.1-2.fc28.x86_64
    cuda-cusolver-9.2.148.1-2.fc28.x86_64
    cuda-cusolver-devel-9.2.148.1-2.fc28.x86_64
    cuda-cusparse-9.2.148.1-2.fc28.x86_64
    cuda-cusparse-devel-9.2.148.1-2.fc28.x86_64
    cuda-devel-9.2.148.1-2.fc28.x86_64
    cuda-docs-9.2.148.1-2.fc28.noarch
    cuda-gcc-7.3.0-1.fc28.x86_64
    cuda-gcc-c++-7.3.0-1.fc28.x86_64
    cuda-gcc-gfortran-7.3.0-1.fc28.x86_64
    cuda-libs-9.2.148.1-2.fc28.x86_64
    cuda-npp-9.2.148.1-2.fc28.x86_64
    cuda-npp-devel-9.2.148.1-2.fc28.x86_64
    cuda-nvgraph-9.2.148.1-2.fc28.x86_64
    cuda-nvgraph-devel-9.2.148.1-2.fc28.x86_64
    cuda-nvml-devel-9.2.148.1-2.fc28.x86_64
    cuda-nvrtc-9.2.148.1-2.fc28.x86_64
    cuda-nvrtc-devel-9.2.148.1-2.fc28.x86_64
    cuda-nvtx-9.2.148.1-2.fc28.x86_64
    cuda-nvtx-devel-9.2.148.1-2.fc28.x86_64
    nvidia-driver-cuda-libs-410.73-4.fc28.x86_64

    You might wonder about some of the above, particularly why you might need a back level version of GCC.  When Fedora 28 has a quite capable GCC version 8 why on earth would you want version 7?  The answer lies in my comment about things being difficult or brittle, it's quite simply that CUDA doesn't yet support GCC 8 so you do need a back level compiler for this

    Install NVidia NCCL
    This library isn't available through an RPM installation or the Negativo17 repository and so you must:
    • Go to the Nvidia NCCL home page 
    • Click the link to download NCCL (requires an Nvidia developer login account)
    • Agree to the Terms and Conditions
    • Download the NCCL zipped tar file that matches your CUDA version (9.2 for this blog post)

    At the time of writing the file required is nccl_2.3.7-1+cuda9.2_x86_64.txz

    I simply untar this file into /usr/local and create a symbolic link as follows:
    • cd /usr/local
    • sudo tar -xf /path/to/file/nccl_2.3.7-1+cuda9.2_x86_64.txz
    • sudo ln -s nccl_2.3.7-1+cuda9.2_x86_64.txz nccl


    Install the Bazel Build Tool
    You're going to need a build tool called Bazel which isn't directly available in the Fedora repositories (that I know of at least) but fortunately there's a version in a copr repository you can use as documented run the following commands:
    •  dnf copr enable vbatts/bazel
    •  dnf install bazel
    Get a Copy of Tensorflow Source
    For this it's just as easy to use git as it is anything else.  You can directly clone the 1.12 release of Tensorflow into a new directory by running:
    • git clone --single-branch -b r1.12 https://github.com/tensorflow/tensorflow tensorflow-r1.12
    • cd tensorflow-r1.12
    Simply replace r1.12 in the above commands if you want to use a different Tensorflow release.

    Run the Tensorflow Configure Script
    This step is actually quite simple but you'll need the answers to some questions to hand, simply run:
    • ./configure
    I accept all the default options with the exception of:
    • "location of python" set to /usr/bin/python3 since Fedora still uses Python 2.7 as the default version at /usr/bin/python
    • "build TensorFlow with CUDA support" set to Yes
    • "CUDA SDK version" set to 9.2 (this value should match the cuda version you have installed and at the time of writing 9.2 is the current version from the Negativo17 repository)
    • "location where CUDA 9.2 toolkit is installed" set to /usr
    • "cuDNN version" set to 7.2 (similar to the cuda version above, this value should match the cuda-cudnn package version and 7.2 is the current version from the Negativo17 repository)
    • "NCCL version" set to 2.3
    • "location where NCCL 2 library is installed" set to /usr/local/nccl
    • "Cuda compute capabilities you want to build with" set to 5.0 (but this value should match the CUDA compute capability of the GPU in the machine you're building for)
    • "which gcc" set to /usr/bin/cuda-gcc (to use the back level GCC version 7)


    Fix Bazel Config
    The above config command writes a file but the location isn't compatible with the latest version of Bazel.  Presumably this issue will be fixed at some point in the future, it's not an issue with Bazel 0.18 and below as far as I'm aware, but has just become an issue on 0.19.  Simply copy the config to the correct place:
    • cat tools/bazel.rc >> .tf_configure.bazelrc
    Build Tensorflow with GPU Support
    This took around an hour to complete on my machine:
    • bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
    • bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
    The first step is the long one for the build, the second simply builds the python wheel file.

    Install Tensorflow with GPU Support
    You've got your wheel file so simply install and enjoy:
    • pip3 install tensorflow-1.12.0-cp36-cp36m-linux_x86_64.whl 
    Run Some Code
    The first time I attempted to run some code to test I got an error:
    • failed call to cuInit: CUDA_ERROR_UNKNOWN
    This can be solved by making sure you have the nvidia-modprobe package installed.  Alternatively, you can run the little script below the following explanation.

    This seems to be some sort of permissions issue and running the following simple script to output the GPUs available on my machine but as root seems to have fixed the above issue i.e. put the following into a script, run that script as root, then any time you want to run code as an unprivileged user the above issue is fixed and the code will work:
    from keras import backend as K
    K.tensorflow_backend._get_available_gpus()

    If the above works then you can try out the Keras MNIST CNN example code.