
Friday 13 September 2019

Installing Tensorflow GPU on Fedora Linux

Following on from my previous notes on building Tensorflow for a GPU on Fedora, I find myself back at it again.  I recently upgraded my GPU at home and time has moved on too, so this is my current set of notes for what I'm doing with Tensorflow on Fedora.  This method differs from my previous notes in that I'm using the pre-built Tensorflow rather than building my own.  I've found Tensorflow so brittle during the build process that it's much easier to work with pre-built binaries and set up my system to match their build.

In my previous blog post I benchmarked the CPU versus GPU using the Keras MNIST CNN example, so I thought it would be interesting to offer the same for this new install on my home machine.  The results are:
  • 12 minutes and 14 seconds on my CPU
  • 1 minute and 14 seconds on my GPU
That's just over 9.9 times as fast on my GPU as on my CPU!
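
For reference, those timings come from running the stock Keras MNIST CNN example end to end.  A minimal timing harness along the following lines will do the job (a sketch only; it assumes the example script has been saved locally as mnist_cnn.py, a file name used here purely for illustration):

# Time a full training run of the Keras MNIST CNN example script,
# assumed to be saved locally as mnist_cnn.py (illustrative name).
import subprocess
import time

start = time.time()
subprocess.run(["python3", "mnist_cnn.py"], check=True)
print("Elapsed: %.1f seconds" % (time.time() - start))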

Some info on my machine and config:
  • Custom Built Home PC
  • Intel Core i5-3570K CPU @ 3.40GHz (4 cores)
  • 16GB RAM
  • NVidia GeForce GTX 1660 (CUDA Compute Capability 7.5)
  • Fedora 30 Workstation running kernel 5.2.9-200.fc30.x86_64
Background Information for NVidia Drivers
Previously, I've always used the Negativo17 repository for all my NVidia driver and CUDA needs.  However, the software versions available there are now too up-to-date to allow Tensorflow GPU to be installed in a way that works: that repository provides CUDA 10.1, whereas Tensorflow, currently at version 1.14, only supports CUDA 10.0.  So we must use another source for the NVidia software that provides back-level versions.  Fortunately, there is an official NVidia repository providing drivers and CUDA for Linux, and it works quite nicely alongside the RPM Fusion repositories.  Hence, this method relies purely on RPM Fusion and the official NVidia repository and does not require the Negativo17 repository (although it would be possible to use it).

Install Required NVidia Driver
The RPM Fusion NVidia instructions can be used here for more detail, but in brief simply install the display drivers:
  • dnf install xorg-x11-drv-nvidia akmod-nvidia xorg-x11-drv-nvidia-cuda
There are some other bits you might want from this repository as well, such as:
  • dnf install vdpauinfo libva-vdpau-driver libva-utils nvidia-modprobe
Wait for the driver to build, then reboot to get things up and running.

Install Required NVidia CUDA and Machine Learning Libraries
This step relies on using the official NVidia repositories, with a little more information available in the RPM Fusion CUDA instructions.

First of all, add a new yum configuration file.  Copy the following to /etc/yum.repos.d/nvidia.repo:

[nvidia-cuda]
name=nvidia-cuda
baseurl=http://developer.download.nvidia.com/compute/cuda/repos/fedora27/x86_64/
enabled=1
gpgcheck=1
gpgkey=http://developer.download.nvidia.com/compute/cuda/repos/fedora27/x86_64/7fa2af80.pub
exclude=akmod-nvidia*,kmod-nvidia*,*nvidia*,nvidia-*,cuda-nvidia-kmod-common,dkms-nvidia,nvidia-libXNVCtrl

[nvidia-machine-learning]
name=nvidia-machine-learning
baseurl=http://developer.download.nvidia.com/compute/machine-learning/repos/rhel7/x86_64/
enabled=1
gpgcheck=1
gpgkey=http://developer.download.nvidia.com/compute/machine-learning/repos/rhel7/x86_64/7fa2af80.pub
exclude=libcudnn7*.cuda10.1,libnccl*.cuda10.1

Note that the configuration above deliberately targets the fedora27 repository from NVidia.  This is where CUDA 10.0 compatible libraries can be found, rather than the CUDA 10.1 libraries found in later repositories.  The configuration is therefore likely to need updating over time, but the essential point is that we can match the version of CUDA that Tensorflow requires by targeting the appropriate NVidia repository.  These libraries will be binary compatible with future versions of Fedora, so this approach should be safe for some time yet.



With the above configuration in place we can now install CUDA 10.0 and the machine learning libraries required for Tensorflow GPU support, and all of the libraries are installed in the locations that Tensorflow expects.

To install, run:
  • dnf install cuda libcudnn7 libnccl

Install Tensorflow GPU
The final piece of the puzzle is to install Tensorflow GPU, which is now as easy as:
  • pip3 install tensorflow-gpu
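
With everything installed, a quick sanity check is worthwhile before running anything substantial.  A minimal sketch using the TensorFlow 1.x test helpers:

# Confirm Tensorflow can see the GPU: this should print True followed
# by a device name such as /device:GPU:0 if the driver, CUDA 10.0 and
# cuDNN libraries are all correctly in place.
import tensorflow as tf

print(tf.test.is_gpu_available())
print(tf.test.gpu_device_name())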

    Friday 16 November 2018

    Building Tensorflow GPU on Fedora Linux

    <update on Sept 13th 2019>
    I have written another post on how to install (rather than build) Tensorflow GPU for Fedora that uses a different and much simpler method.  See Installing Tensorflow GPU on Fedora Linux.
    </update>

First off, let's say that there are easy ways of configuring Tensorflow for GPU usage, such as using one of the docker images.  However, I'm a bit old school for some things and, having always built it myself, I've recently got Tensorflow going on my machine using my GPU.  Tensorflow CPU support is quite easy to set up and generally works nicely using the pip install method.  GPU support, I've always found, is quite a bit more difficult as there is a whole bunch of things that need to be at just the right level for everything to work, i.e. it's quite brittle!

What follows are my notes (it's in the name of the blog) for how to build Tensorflow from scratch to enable GPU support, and I do this on Fedora Linux.  If you want to know why it's worth bothering to go to this effort, I've tested the Keras MNIST CNN example as a benchmark.  It takes:
  • 11 minutes 7 seconds on my CPU
  • 2 minutes 55 seconds on my GPU
That's just over 3.8 times as fast on my GPU as on my CPU, so for large jobs this will be huge!

    Some info on my machine and config:
    • Lenovo P50 Laptop
  • Intel Core i7-6820HQ CPU @ 2.70GHz (4 cores with hyper-threading)
    • 32GB RAM
    • Nvidia Quadro M1000M (CUDA compute capability 5.0)
    • Fedora 28 running kernel 4.18.18-200.fc28.x86_64
    Install Required Nvidia RPMs
You need to get everything Nvidia and CUDA installed on your machine first.  I quite like the Negativo17 repository for Nvidia on Fedora Linux and so I use this, but you could also go with RPM Fusion or even download everything directly from Nvidia.  For me, right now, I have this little lot installed:
    cuda-9.2.148.1-2.fc28.x86_64
    cuda-cli-tools-9.2.148.1-2.fc28.x86_64
    cuda-cublas-9.2.148.1-2.fc28.x86_64
    cuda-cublas-devel-9.2.148.1-2.fc28.x86_64
    cuda-cudart-9.2.148.1-2.fc28.x86_64
    cuda-cudart-devel-9.2.148.1-2.fc28.x86_64
    cuda-cudnn-7.2.1.38-1.fc28.x86_64
    cuda-cudnn-devel-7.2.1.38-1.fc28.x86_64
    cuda-cufft-9.2.148.1-2.fc28.x86_64
    cuda-cufft-devel-9.2.148.1-2.fc28.x86_64
    cuda-cupti-9.2.148.1-2.fc28.x86_64
    cuda-cupti-devel-9.2.148.1-2.fc28.x86_64
    cuda-curand-9.2.148.1-2.fc28.x86_64
    cuda-curand-devel-9.2.148.1-2.fc28.x86_64
    cuda-cusolver-9.2.148.1-2.fc28.x86_64
    cuda-cusolver-devel-9.2.148.1-2.fc28.x86_64
    cuda-cusparse-9.2.148.1-2.fc28.x86_64
    cuda-cusparse-devel-9.2.148.1-2.fc28.x86_64
    cuda-devel-9.2.148.1-2.fc28.x86_64
    cuda-docs-9.2.148.1-2.fc28.noarch
    cuda-gcc-7.3.0-1.fc28.x86_64
    cuda-gcc-c++-7.3.0-1.fc28.x86_64
    cuda-gcc-gfortran-7.3.0-1.fc28.x86_64
    cuda-libs-9.2.148.1-2.fc28.x86_64
    cuda-npp-9.2.148.1-2.fc28.x86_64
    cuda-npp-devel-9.2.148.1-2.fc28.x86_64
    cuda-nvgraph-9.2.148.1-2.fc28.x86_64
    cuda-nvgraph-devel-9.2.148.1-2.fc28.x86_64
    cuda-nvml-devel-9.2.148.1-2.fc28.x86_64
    cuda-nvrtc-9.2.148.1-2.fc28.x86_64
    cuda-nvrtc-devel-9.2.148.1-2.fc28.x86_64
    cuda-nvtx-9.2.148.1-2.fc28.x86_64
    cuda-nvtx-devel-9.2.148.1-2.fc28.x86_64
    nvidia-driver-cuda-libs-410.73-4.fc28.x86_64

You might wonder about some of the above, particularly why you might need a back-level version of GCC.  When Fedora 28 has a quite capable GCC version 8, why on earth would you want version 7?  The answer lies in my comment about things being difficult or brittle: quite simply, CUDA doesn't yet support GCC 8, so you do need a back-level compiler for this.

    Install NVidia NCCL
    This library isn't available through an RPM installation or the Negativo17 repository and so you must:
    • Go to the Nvidia NCCL home page 
    • Click the link to download NCCL (requires an Nvidia developer login account)
    • Agree to the Terms and Conditions
    • Download the NCCL zipped tar file that matches your CUDA version (9.2 for this blog post)

    At the time of writing the file required is nccl_2.3.7-1+cuda9.2_x86_64.txz

I simply untar this file into /usr/local and create a symbolic link to the extracted directory as follows:
  • cd /usr/local
  • sudo tar -xf /path/to/file/nccl_2.3.7-1+cuda9.2_x86_64.txz
  • sudo ln -s nccl_2.3.7-1+cuda9.2_x86_64 nccl


    Install the Bazel Build Tool
You're going to need a build tool called Bazel, which isn't directly available in the Fedora repositories (that I know of, at least), but fortunately there's a version in a copr repository you can use.  As documented there, run the following commands:
    •  dnf copr enable vbatts/bazel
    •  dnf install bazel
    Get a Copy of Tensorflow Source
    For this it's just as easy to use git as it is anything else.  You can directly clone the 1.12 release of Tensorflow into a new directory by running:
    • git clone --single-branch -b r1.12 https://github.com/tensorflow/tensorflow tensorflow-r1.12
    • cd tensorflow-r1.12
    Simply replace r1.12 in the above commands if you want to use a different Tensorflow release.

    Run the Tensorflow Configure Script
This step is actually quite simple, but you'll need the answers to some questions to hand.  Simply run:
    • ./configure
    I accept all the default options with the exception of:
    • "location of python" set to /usr/bin/python3 since Fedora still uses Python 2.7 as the default version at /usr/bin/python
    • "build TensorFlow with CUDA support" set to Yes
    • "CUDA SDK version" set to 9.2 (this value should match the cuda version you have installed and at the time of writing 9.2 is the current version from the Negativo17 repository)
    • "location where CUDA 9.2 toolkit is installed" set to /usr
    • "cuDNN version" set to 7.2 (similar to the cuda version above, this value should match the cuda-cudnn package version and 7.2 is the current version from the Negativo17 repository)
    • "NCCL version" set to 2.3
    • "location where NCCL 2 library is installed" set to /usr/local/nccl
    • "Cuda compute capabilities you want to build with" set to 5.0 (but this value should match the CUDA compute capability of the GPU in the machine you're building for)
    • "which gcc" set to /usr/bin/cuda-gcc (to use the back level GCC version 7)


    Fix Bazel Config
The above configure command writes a file, but the location isn't compatible with the latest version of Bazel.  Presumably this issue will be fixed at some point in the future; it's not an issue with Bazel 0.18 and below as far as I'm aware, but has just become one on 0.19.  Simply copy the config to the correct place:
    • cat tools/bazel.rc >> .tf_configure.bazelrc
    Build Tensorflow with GPU Support
    This took around an hour to complete on my machine:
    • bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
    • bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
    The first step is the long one for the build, the second simply builds the python wheel file.

    Install Tensorflow with GPU Support
    You've got your wheel file so simply install and enjoy:
    • pip3 install tensorflow-1.12.0-cp36-cp36m-linux_x86_64.whl 
    Run Some Code
    The first time I attempted to run some code to test I got an error:
    • failed call to cuInit: CUDA_ERROR_UNKNOWN
This can be solved by making sure you have the nvidia-modprobe package installed.  Alternatively, you can run the little script below.

This seems to be some sort of permissions issue: running the following simple script to output the GPUs available on my machine, but as root, fixed the issue for me.  That is, put the following into a script and run that script once as root; from then on the code will work for an unprivileged user too:
# List the GPUs that the Tensorflow backend can see; run this once
# as root to initialise the NVidia devices for unprivileged users.
from keras import backend as K
K.tensorflow_backend._get_available_gpus()

    If the above works then you can try out the Keras MNIST CNN example code.

    Friday 14 November 2014

    Tackling Cancer with Machine Learning

    For a recent Hack Day at work I spent some time working with one of my colleagues, Adrian Lee, on a little side project to see if we could detect cancer cells in a biopsy image.  We've only spent a couple of days on this so far but already the results are looking very promising with each of us working on a distinctly different part of the overall idea.

We held an open day in our department at work last month and I gave a lightning talk on the subject, which you can see on YouTube:


    There were a whole load of other talks given on the day that can be seen in the summary blog post over on the ETS (Emerging Technology Services) site.



    Tuesday 25 June 2013

    Machine Learning Course

Enough time has passed since I undertook the Stanford University Natural Language Processing Course for me to forget just how much hard work it was, and to start all over again.  This year I decided to have a go at the coursera Machine Learning Course.

    Unlike the 12 week NLP course last year which estimated 10 hours a week and turned out to be more like 15-20 hours a week, this course was much more realistic in estimation at 10 weeks of 8 hours.  I think I more or less hit the mark on that point spending about 1 day every week for the past 10 weeks studying machine learning - so around half the time required for the NLP course.

The course was written and presented by Andrew Ng, who seems to be rather prolific and somewhat of an academic star in his fields of machine learning and artificial intelligence.  He is one of the co-founders of the coursera site which, along with its main rival Udacity, has brought about the popular rise of Massive Open Online Courses.

The Machine Learning Course followed the same format as the NLP course from last year, which I can only assume is the standard coursera format, at least for technical courses anyway.  Each week there were one or two main topic areas to study, presented in a series of videos featuring Andrew talking through a set of slides on which he's able to handwrite notes for demonstration purposes, just as if you're sitting in a real lecture hall at university.  To check your understanding of the content of the videos there are questions which must be answered on each topic, against which you're graded.  The second main component each week is a programming exercise, which for the Machine Learning Course must be completed in Octave - so yet another programming language to add to your list.  Achieving a mark of 80% or above across all the questions and programming exercises results in a course pass.  I appear to have done that with relative ease for this course.

    The 18 topics covered were:

    • Introduction
    • Linear Regression with One Variable
    • Linear Algebra Review
    • Linear Regression with Multiple Variables
    • Octave Tutorial
    • Logistic Regression
    • Regularisation
    • Neural Networks Representation
    • Neural Networks Learning
    • Advice for Applying Machine Learning
    • Machine Learning System Design
    • Support Vector Machines
    • Clustering
    • Dimensionality Reduction
    • Anomaly Detection
    • Recommender Systems
    • Large Scale Machine Learning
    • Application Example Photo OCR
The course served as a good revision of some maths I haven't used in quite some time: lots of linear algebra, for which you need a pretty good understanding, and lots of calculus, which you don't really need to understand if all you care about is implementing the algorithms rather than working out how they're derived or proven.  Being quite maths based, the course used matrices and vectorisation very heavily rather than the loop structures that most of us would use as a go-to framework for writing complex algorithms (see the sketch below).  Again, this was some good revision as I've not programmed in this fashion for quite some time.  You're definitely reminded of just how efficient you can make complex tasks on modern processors if you stand back from your algorithm for a bit and work out how best to utilise the hardware (via the appropriately optimised libraries) you have.
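
To illustrate the point, here's a sketch in Python/NumPy (rather than the course's Octave) of a linear regression hypothesis h = X·theta computed both ways; the vectorised form hands the whole computation to an optimised linear algebra library instead of looping in the interpreter:

import numpy as np

X = np.random.rand(5000, 400)    # 5,000 training examples, 400 features
theta = np.random.rand(400)      # model parameters

# Loop version: one explicit dot product per training example.
h_loop = np.array([sum(X[i, j] * theta[j] for j in range(400))
                   for i in range(5000)])

# Vectorised version: a single matrix-vector product.
h_vec = X @ theta

assert np.allclose(h_loop, h_vec)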

The major thought behind the course seems to be to teach as many different algorithms as possible, and there really is a great range: starting off simply with linear algorithms and progressing right up to the current state-of-the-art neural networks and the ever fashionable map-reduce stuff.

I didn't find the course terribly difficult.  I'm no expert in any of the topics but have studied enough maths not to struggle with that side of things, and I don't struggle with programming either.  I didn't need to use the forums or any of the other social elements offered during the course, so I don't really have a feel for how others found it.  I can certainly imagine someone finding it a real struggle without a reasonably deep background in either maths or programming.

There was, as far as I can think right now, one omission from the course (or maybe two, depending on how you count).  Most of the programming exercises were heavily frameworked for you in advance; you just have to fill in the gaps.  This is great for learning the various algorithms presented during the course, but it does leave a couple of areas at the end that you're not so confident with (aside from not really having a wide grasp of the Octave programming language).  The omission of which I speak is the storing and bootstrapping of the models you've trained.  All the exercises concentrated on training a model, storing it in memory and using it, so as the program terminates your model disappears.  It would have been great to have another module on the best ways to persist models between program runs, and on how to continue training (bootstrap) a model that you have already persisted, along the lines of the sketch below.  I'll feed that thought back to Andrew when the opportunity arises over the next couple of weeks.
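
Even something as simple as the following sketch covers the basic idea: the trained parameters are just an array, so they can be written to disk and reloaded as the starting point for further training (the file name and shapes here are purely illustrative):

import numpy as np

theta = np.random.rand(400)      # stand-in for trained model parameters

np.save("model.npy", theta)      # persist the model to disk

theta = np.load("model.npy")     # reload it in a later run and continue
                                 # (bootstrap) training from this point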

The problem going forward won't so much be applying what has been offered here, but working out what to apply it to.  The range of problems that can be tackled with these techniques is mind-blowing; just look at the rise of analytics we're seeing in all areas of business and technology.

    Overall then, a really nice introduction into the world of machine learning.  Recommended!