Graham White: My Notes: eightbar

Showing posts with label eightbar. Show all posts

Friday 14 November 2014

Tackling Cancer with Machine Learning

For a recent Hack Day at work I spent some time working with one of my colleagues, Adrian Lee, on a little side project to see if we could detect cancer cells in a biopsy image. We've only spent a couple of days on this so far but already the results are looking very promising with each of us working on a distinctly different part of the overall idea.

We held an open day in our department at work last month and I gave a lightening talk on the subject which you can see on YouTube:

There were a whole load of other talks given on the day that can be seen in the summary blog post over on the ETS (Emerging Technology Services) site.

Tuesday 25 June 2013

Machine Learning Course

Enough time has passed since I undertook the Stanford University Natural Language Processing Course for me to forget just how much hard work it was for me to start all over again. This year I decided to have a go at the coursera Machine Learning Course.

Unlike the 12 week NLP course last year which estimated 10 hours a week and turned out to be more like 15-20 hours a week, this course was much more realistic in estimation at 10 weeks of 8 hours. I think I more or less hit the mark on that point spending about 1 day every week for the past 10 weeks studying machine learning - so around half the time required for the NLP course.

The course was written and presented by Andrew Ng who seems to be rather prolific and somewhat of an academic star in his fields of machine learning and artificial intelligence. He is one of the co-founders of the coursera site which along with their main rival, Udacity, have brought about the popular rise of Massive Open Online Learning.

The Machine Learning Course followed the same format as the NLP course from last year which I can only assume is the standard coursera format, at least for technical courses anyway. Each week there were 1 or two main topic areas to study which were presented in a series of videos featuring Andrew talking through a set of slides on which he's able to hand write notes for demonstration purposes, just as if you're sitting in a real lecture hall at university. To check your understanding of the content of the videos there are questions which must be answered on each topic against which you're graded. The second main component each week is a programming exercise which for the Machine Learning Course must be completed in Octave - so yet another programming language to add to your list. Achieving a mark of 80% or above across all the questions and programming exercises results in a course pass. I appear to have done that with relative ease for this course.

The 18 topics covered were:

Introduction
Linear Regression with One Variable
Linear Algebra Review
Linear Regression with Multiple Variables
Octave Tutorial
Logistic Regression
Regularisation
Neural Networks Representation
Neural Networks Learning
Advice for Applying Machine Learning
Machine Learning System Design
Support Vector Machines
Clustering
Dimensionality Reduction
Anomaly Detection
Recommender Systems
Large Scale Machine Learning
Application Example Photo OCR

The course served as a good revision of some maths I haven't used in quite some time, lots of Linear Algebra for which you need a pretty good understanding and lots of calculus which you didn't really need to understand if all you care about is implementing the algorithms rather than working out how they're derived or proven. Being quite maths based, the course used matrices and vectorisation very heavily rather than using the loop structures that most of us would use as a go-to framework for writing complex algorithms. Again, this was some good revision as I've not programmed in this fashion for quite some time. You're definitely reminded of just how efficient you can make complex tasks on modern processors if you stand back from your algorithm for a bit and work out how best to utilise the hardware (via the appropriately optimised libraries) you have.

The major thought behind the course seems to be to teach as many different algorithms as possible. There really is a great range. Starting of simply with linear algorithms and progressing right up to the current state-of-the-art Neural Networks and the ever fashionable map-reduce stuff.

I didn't find the course terribly difficult, I'm no expert in any of the topics but have studied enough maths not to struggle with that side of things and don't struggle with programming either. I didn't need to use the forums or any of the other social elements offered during the course so I don't really have a feel for how others found the course. I can certainly imagine someone finding it a real struggle if they don't have a particularly deep background in either maths or programming.

There was, as far as I can think right now, one (or maybe two depending on how you count) omission from the course. Most of the programming exercises were heavily frameworked for you in advance, you just have to fill in the gaps. This is great for learning the various different algorithms presented during the course but does leave a couple of areas at the end of the course you're not so confident with (aside from not really having a wide grasp of the Octave programming language). The omission of which I speak is that of storing and bootstrapping the models you've trained with the algorithm. All the exercises concentrated on training a model, storing it in memory, using it and as the program terminates then so your model disappears. It would have been great to have another module on the best ways to persist models between program runs, and how to continue training (bootstrap) a model that you have already persisted. I'll feed that thought back to Andrew when the opportunity arises over the next couple of weeks.

The problem going forward wont so much be applying what has been offered here but working out what to apply it to. The range of problems that can be tackled with these techniques is mind-blowing, just look at the rise of analytics we're seeing in all areas of business and technology.

Overall then, a really nice introduction into the world of machine learning. Recommended!

Monday 22 April 2013

Speech to Text

Apologies to the tl;dr brigade, this is going to be a long one...

For a number of years I've been quietly working away with IBM research on our speech to text programme. That is, working with a set of algorithms that ultimately produce a system capable of listening to human speech and transcribing it into text. The concept is simple, train a system for speech to text - speech goes in, text comes out. However, the process and algorithms to do this are extremely complicated from just about every way you look at it – computationally, mathematically, operationally, evaluationally, time and cost. This is a completely separate topic and area of research from the similar sounding text to speech systems that take text (such as this blog) and read it aloud in a computerised voice.

Whenever I talk to people about it they always appear fascinated and want to know more. The same questions often come up. I'm going to address some of these here in a generic way and leaving out those that I'm unable to talk about here. I should also point out that I'm by no means a speech expert or linguist but have developed enough of an understanding to be dangerous in the subject matter and that (I hope) allows me to explain things in a way that others not familiar with the field are able to understand. I'm deliberately not linking out to the various research topics that come into play during this post as the list would become lengthy very quickly and this isn't a formal paper after all, Internet searches are your friend if you want to know more.

I didn't know IBM did that?

OK so not strictly a question but the answer is yes, we do. We happen to be pretty good at it as well. However, we typically use a company called Nuance as our preferred partner.

People have often heard of IBM's former product in this area called Via Voice for their desktop PCs which was available until the early 2000's. This sort of technology allowed a single user to speak to their computer for various different purposes and required the user to spend some time training the software before it would understand their particular voice. Today's speech software has progressed beyond this to systems that don't require any training by the user before they use it. Current systems are trained in advance in order to attempt to understand any voice.

What's required?

Assuming you have the appropriate software and the hardware required to run it on then you need three more things to build a speech to text system: audio, transcripts and a phonetic dictionary of pronunciations. This sounds quite simple but when you dig under the covers a little you realise it's much more complicated (not to mention expensive) and the devil is very much in the detail.

On the audial side you'll need a set of speech recordings. If you want to evaluate your system after it has been trained then a small sample of these should be kept to one side and not used during the training process. This set of audio used for evaluation is usually termed the held out set. It's considered cheating if you later evaluate the system using audio that was included in the training process – since the system has already “heard” this audio before it would have a higher chance of accurately reproducing it later. The creation of the held out set leads to two sets of audio files, the held out set and the majority of the audio that remains which is called the training set.

The audio can be in any format your training software is compatible with but wave files are commonly used. The quality of the audio both in terms of the digital quality (e.g. sample rate) as well as the quality of the speaker(s) and the equipment used for the recordings will have a direct bearing on the resulting accuracy of the system being trained. Simply put, the better quality you can make the input, the more accurate the output will be. This leads to another bunch of questions such as but not limited to “What quality is optimal?”, “What should I get the speakers to say?”, “How should I capture the recordings?” - all of which are research topics in their own right and for which there is no one-size-fits-all answer.

Capturing the audio is one half of the battle. The next piece in the puzzle is obtaining well transcribed textual copies of that audio. The transcripts should consist of a set of text representing what was said in the audio as well as some sort of indication of when during the audio a speaker starts speaking and when they stop. This is usually done on a sentence by sentence basis, or for each utterance as they are known. These transcripts may have a certain amount of subjectivity associated with them in terms of where the sentence boundaries are and potentially exactly what was said if the audio wasn't clear or slang terms were used. They can be formatted in a variety of different ways and there are various standard formats for this purpose from an XML DTD through to CSV.

If it has not already become clear, creating the transcription files can be quite a skilled and time consuming job. A typical industry expectation is that it takes approximately 10 man-hours for a skilled transcriber to produce 1 hour of well formatted audio transcription. This time plus the cost of collecting the audio in the first place is one of the factors making speech to text a long, hard and expensive process. This is particularly the case when put into context that most current commercial speech systems are trained on at least 2000+ hours of audio with the minimum recommended amount being somewhere in the region of 500+ hours.

Finally, a phonetic dictionary must either be obtained or produced that contains at least one pronunciation variant for each word said across the entire corpus of audio input. Even for a minimal system this will run into tens of thousands of words. There are of course, already phonetic dictionaries available such as the Oxford English Dictionary that contains a pronunciation for each word it contains. However, this would only be appropriate for one regional accent or dialect without variation. Hence, producing the dictionary can also be a long and skilled manual task.

What does the software do?

The simple answer is that it takes audio and transcript files and passes them through a set of really rather complicated mathematical algorithms to produce a model that is particular to the input received. This is the training process. Once system has been trained the model it generates can be used to take speech input and produce text output. This is the decoding process. The training process requires lots of data and is computationally expensive but the model it produces is very small and computationally much less expensive to run. Today's models are typically able to perform real-time (or faster) speech to text conversion on a single core of a modern CPU. It is the model and software surrounding the model that is the piece exposed to users of the system.

Various different steps are used during the training process to iterate through the different modelling techniques across the entire set of training audio provided to the trainer. When the process first starts the software knows nothing of the audio, there are no clever boot strapping techniques used to kick-start the system in a certain direction or pre-load it in any way. This allows the software to be entirely generic and work for all sorts of different languages and quality of material. Starting in this way is known as a flat start or context independent training. The software simply chops up the audio into regular segments to start with and then performs several iterations where these boundaries are shifted slightly to match the boundaries of the speech in the audio more closely.

The next phase is context dependent training. This phase starts to make the model a little more specific and tailored to the input being given to the trainer. The pronunciation dictionary is used to refine the model to produce an initial system that could be used to decode speech into text in its own right at this early stage. Typically, context dependent training, while an iterative process in itself, can also be run multiple times in order to hone the model still further.

Another optimisation that can be made to the model after context dependent training is to apply vocal tract length normalisation. This works on the theory that the audibility of human speech correlates to the pitch of the voice, and the pitch of the voice correlates to the vocal tract length of the speaker. Put simply, it's a theory that says men have low voices and women have high voices and if we normalise the wave form for all voices in the training material to have the same pitch (i.e. same vocal tract length) then audibility improves. To do this an estimation of the vocal tract length must first be made for each speaker in the training data such that a normalisation factor can be applied to that material and the model updated to reflect the change.

The model can be thought of as a tree although it's actually a large multi-dimensional matrix. By reducing the number of dimensions in the matrix and applying various other mathematical operations to reduce the search space the model can be further improved upon both in terms of accuracy, speed and size. This is generally done after vocal tract length normalisation has taken place.

Another tweak that can be made to improve the model is to apply what we call discriminative training. For this step the theory goes along the lines that all of the training material is decoded using the current best model produced from the previous step. This produces a set of text files. These text files can be compared with those produced by the human transcribers and given to the system as training material. The comparison can be used to inform where the model can be improved and these improvements applied to the model. It's a step that can probably be best summarised by learning from its mistakes, clever!

Finally, once the model has been completed it can be used with a decoder that knows how to understand that model to produce text given an audio input. In reality, the decoders tend to operate on two different models. The audio model for which the process of creation has just been roughly explained; and a language model. The language model is simply a description of how language is used in the specific context of the training material. It would, for example, attempt to provide insight into which words typically follow which other words via the use of what natural language processing experts call n-grams. Obtaining information to produce the language model is much easier and does not necessarily have to come entirely from the transcripts used during the training process. Any text data that is considered representative of the speech being decoded could be useful. For example, in an application targeted at decoding BBC News readers then articles from the BBC news web site would likely prove a useful addition to the language model.

How accurate is it?

This is probably the most common question about these systems and one of the most complex to answer. As with most things in the world of high technology it's not simple, so the answer is the infamous “it depends”. The short answer is that in ideal circumstances the software can perform at near human levels of accuracy which equates to in excess of 90% accuracy levels. Pretty good you'd think. It has been shown that human performance is somewhere in excess of 90% and is almost never 100% accuracy. The test for this is quite simple, you get two (or more) people to independently transcribe some speech and compare the results from each speaker, almost always there will be a disagreement about some part of the speech (if there's enough speech that is).

It's not often that ideal circumstances are present or can even realistically be achieved. Ideal would be transcribing a speaker with a similar voice and accent to those which have been trained into the model and they would speak at the right speed (not too fast and not too slowly) and they would use a directional microphone that didn't do any fancy noise cancellation, etc. What people are generally interested in is the real-world situation, something along the lines of “if I speak to my phone, will it understand me?”. This sort of real-world environment often includes background noise and a very wide variety of speakers potentially speaking into a non-optimal recording device. Even this can be a complicated answer for the purposes of accuracy. We're talking about free, conversational style, speech in this blog post and there's a huge different in recognising any and all words versus recognising a small set of command and control words for if you wanted your phone to perform a specific action. In conclusion then, we can only really speak about the art of the possible and what has been achieved before. If you want to know about accuracy for your particular situation and your particular voice on your particular device then you'd have to test it!

What words can it understand? What about slang?

The range of understanding of a speech to text system is dependent on the training material. At present, the state of the art systems are based on dictionaries of words and don't generally attempt to recognise new words for which an entry in the dictionary has not been found (although these types of systems are available separately and could be combined into a speech to text solution if necessary). So the number and range of words understood by a speech to text system is currently (and I'm generalising here) a function of the number and range of words used in the training material. It doesn't really matter what these words are, whether they're conversational and slang terms or proper dictionary terms, so long as the system was trained on those then it should be able to recognise them again during a decode.

Updates and Maintenance

For the more discerning reader, you'll have realised by now a fundamental flaw in the plan laid out thus far. Language changes over time, people use new words and the meaning of words changes within the language we use. Text-speak is one of the new kids on the block in this area. It would be extremely cumbersome to need to train an entire new model each time you wished to update your previous one in order to include some set of new language capability. The models produced are able to be modified and updated with these changes without the need to go back to a full standing start and training from scratch all over again. It's possible to take your existing model built from the set of data you had available at a particular point in time and use this to bootstrap the creation of a new model which will be enhanced with the new materials that you've gathered since training the first model. Of course, you'll want to test and compare both models to check that you have in fact enhanced performance as you were expecting. This type of maintenance and update to the model will be required to any and all of these types of systems as they're currently designed as the structure and usage of our languages evolve.

Conclusion

OK, so not necessarily a blog post that was ever designed to draw a conclusion but I wanted to wrap up by saying that this is an area of technology that is still very much in active research and development, and has been so for at least 40-50 years or more! There's a really interesting statistic I've seen in the field that says if you ask a range of people involved in this topic the answer to the question “when will speech to text become a reality” then the answer generally comes out at “in ten years time”. This question has been asked consistently over time and the answer has remained the same. It seems then, that either this is a really hard nut to crack or that our expectations of such a system move on over time. Either way, it seems there will always be something new just around the corner to advance us to the next stage of speech technologies.

Tuesday 19 March 2013

Going Back to University

A couple of weeks ago I had the enormous pleasure of returning to Exeter University where I studied for my degree more years ago than seems possible. Getting involved with the uni again has been something I've long since wanted to do in an attempt to give back something to the institution to which I owe so much having been there to get good qualifications and not least met my wife there too! I think early on in a career it's not necessarily something I would have been particularly useful for since I was closer to the university than my working life in age, mentality and a bunch of other factors I'm sure. However, getting a bit older makes me feel readier to provide something tangibly useful in terms of giving something back both to the university and to the current students. I hope that having been there recently with work it's a relationship I can start to build up.

I should probably steer clear of saying exactly why we were there but there was a small team from work some of which I knew well such as @madieq and @andysc and one or two I hadn't come across before. Our job was to work with some academic staff for a couple of days and so it was a bit of a departure from my normal work with corporate customers. It's fantastic to see the university from the other side of the fence (i.e. not being a student) and hearing about some of the things going on there and seeing a university every bit as vibrant and ambitious as the one I left in 2000. Of course, there was the obligatory wining and dining in the evening which just went to make the experience all the more pleasurable.

I really hope to be able to talk a lot more about things we're doing with the university in the future. Until then, I'm looking forward to going back a little more often and potentially imparting some words (of wisdom?) to some students too.

Sunday 27 May 2012

Natural Language Processing Course

Over the first few months of this year I have been taking part in a mass online learning course in Natural Language Processing (NLP) run by Stanford University. They publicised a group of eight courses at the end of last year and I didn't hesitate to sign up to the Natural Language Processing course knowing it would fit very well with things I'm working on in my professional role where I'm doing more and more with text analytics and continuing my work in speech to text. There were others I could easily have signed up for too, things like security or machine learning, more or less all of them are relevant for something I'm doing. However, given the time commitment required I decided to fully commit to one course and the NLP one was to be it.

I passed the course with a grade of 85% which was well above the required 70% pass mark. However, the effort and time required to get there was way more than I was expecting and quite a lot more than the expected time the lecturers (Chris Manning and Dan Jurafsky) had said. From memory it was an 8 week course with 10 hours a week required effort to complete the work. As it went on the amount of time required went up significantly, so rather than the 80 hours total I think I spent more like 1½ times that at over 120 hours!

There were four of us at work (that I know of) who embarked on the course but due to the commitment of time I've mentioned above only myself and Dale finished. By the way, Dale has written an excellent post on the structure and content of the course so I'd suggest reading his blog for more details on that stuff, there's little point in me re-posting it as he's written such a good summary.

In terms of the participants on the course, it seems to have been quite a success for Stanford University - this is the first time they have run courses in this way it seems. The lecturers gave us some statistics at a couple of strategic points throughout the course and it seems there were around 40,000 people registering an interest, of which around 5000 were watching the lecture material and around 2000 completed the course having taken part in the homework assignments.

I'm glad I committed as much as I did. If I were one of the 5000 just watching the lectures and not doing the homework material I don't think I would have got as much out of it, but the added time required to complete the homework was significant so perhaps there's a trade-off here? It's certainly the first time I've committed this much of my own personal time (it took over the lives of myself and Dale for quite a few weeks) as I was too busy at work to spend many business hours working on the course so it was all done in evenings and weekends. That's certainly one piece of feedback I gave at the end of the course, Stanford could make the course timing more flexible but also allow more time for the course to be completed.

My experience with the way the assignments were marked was a little different to the way Dale has described in his post. I was already very familiar with the concepts of test, development and held-out sets (three different sets of data used when training NLP systems) so wasn't surprised to see that the modules in the course didn't necessarily have an exact answer to them or more precisely that the code your wrote to perfectly analyse some data on your local system may not get full marks as it was marked against a different data set. This may seem unfair but is common practice in all NLP system training that I know of.

All in all, an excellent course that I'm glad I did. From what I hear of the other courses, they're not as deeply involved as the NLP course so I may well give another one a go in the future but for now I need to get a little of my life back and have a well earned rest from education.

Thursday 1 March 2012

Failing to Invent

We IBM employees are encouraged, indeed incented, to be innovative and to invent. This is particularly poignant for people like myself working on the leading edge of the latest technologies. I work in IBM emerging technologies which is all about taking the latest available technology to our customers. We do this in a number of different ways but that's a blog post in itself. Innovation is often confused for or used interchangeably with invention but they are different, invention for IBM means patents, patenting and the patent process. That is, if I come up with something inventive I'm very much encouraged to protect that idea using patents and there are processes and help available to allow me to do that.

This comic strip really sums up what can often happen when you investigate protecting one of your ideas with a patent. It struck me recently while out to dinner with friends that there's nothing wrong with failing to invent as the cartoon above says Leibniz did. It's the innovation that's important here and unlucky for Leibniz that he wasn't seen to be inventing. It can be quite difficult to think of something sufficiently new that it is patent-worthy and this often happens to me and those I work with while trying to protect our own ideas.

The example I was drawing upon on this occasion was an idea I was discussing at work with some colleagues about a certain usage of your mobile phone [I'm being intentionally vague here]. After thinking it all through we came to the realisation that while the idea was good and the solution innovative, all the technology was already known available and assembled in the way we were proposing, but used somewhere completely different.

So, failing to invent is no bad thing. We tried and on this particular occasion decided we could innovate but not invent. Next time things could be the other way around but according to these definitions we shouldn't be afraid to innovate at the price of invention anyway.

Tuesday 3 June 2008

Showing Off Linux

Thanks to Ian Hughes for the picture on his flickr. Yesterday, at work, the Hursley Linux Special Interest Group ran a little trade show type event for a couple of hours after lunch. The idea was to provide a bit of away from your desk time for folks around the lab to see what we Linux geeks have been getting up to. Various people interested in using Linux inside and outside work came along to demo their gadgets.

The picture shows me showing off my old Linux audio centre. But, also at the event were the main organiser of the day Jon Levell (showing Fedora 9 and an eeepc), and Nick O'Leary (showing his N800 and various arduino gadgets), Gareth Jones (showing his accelerometer based USB rocket launcher and bluetooth tweetjects), Andy Stanford-Clark (showing his NSLU2 driven house, and an OLPC), Laura Cowen (showing an OLPC), Steve Godwin (showing MythTV), and Chris law (showing Amora).

I thought it was quite a nice little selection of Linux related stuff to look through for the masses of people turning up, plenty of other things we could have shown too of course. The afternoon seemed very much a success, generating some real interest in the various demo items and lots of interesting questions too. Thanks to everyone for taking part!