Monday, 22 April 2013

Speech to Text

Apologies to the tl;dr brigade, this is going to be a long one... 

For a number of years I've been quietly working away with IBM Research on our speech to text programme. That is, working with a set of algorithms that ultimately produce a system capable of listening to human speech and transcribing it into text. The concept is simple: train a system for speech to text, so speech goes in and text comes out. However, the process and algorithms to do this are extremely complicated from just about every angle you look at it – computationally, mathematically, operationally, in evaluation, and in time and cost. This is a completely separate topic and area of research from the similar sounding text to speech systems that take text (such as this blog) and read it aloud in a computerised voice.

Whenever I talk to people about it they always appear fascinated and want to know more. The same questions often come up. I'm going to address some of these here in a generic way and leave out those that I'm unable to talk about. I should also point out that I'm by no means a speech expert or linguist but have developed enough of an understanding to be dangerous in the subject matter and that (I hope) allows me to explain things in a way that others not familiar with the field are able to understand. I'm deliberately not linking out to the various research topics that come into play during this post as the list would become lengthy very quickly and this isn't a formal paper after all; Internet searches are your friend if you want to know more.

I didn't know IBM did that?
OK so not strictly a question but the answer is yes, we do. We happen to be pretty good at it as well. However, we typically use a company called Nuance as our preferred partner.

People have often heard of IBM's former product in this area, ViaVoice, for desktop PCs, which was available until the early 2000s. This sort of technology allowed a single user to speak to their computer for various different purposes and required the user to spend some time training the software before it would understand their particular voice. Today's speech software has progressed beyond this to systems that don't require any training by the user before they use it. Current systems are trained in advance in order to attempt to understand any voice.

What's required?
Assuming you have the appropriate software and the hardware required to run it on then you need three more things to build a speech to text system: audio, transcripts and a phonetic dictionary of pronunciations. This sounds quite simple but when you dig under the covers a little you realise it's much more complicated (not to mention expensive) and the devil is very much in the detail.

On the audio side you'll need a set of speech recordings. If you want to evaluate your system after it has been trained then a small sample of these should be kept to one side and not used during the training process. This set of audio used for evaluation is usually termed the held out set. It's considered cheating if you later evaluate the system using audio that was included in the training process – since the system has already “heard” this audio before it would have a higher chance of accurately reproducing it later. The creation of the held out set leads to two sets of audio files: the held out set, and the majority of the audio that remains, which is called the training set.
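To make the held out idea a bit more concrete, here's a minimal sketch of splitting a pile of recordings in two. The file names, the 10% fraction and the function name are all made up for illustration; real toolkits have their own ways of doing this and often take care to keep a given speaker in only one of the two sets.

```python
import random


def split_corpus(recordings, held_out_fraction=0.1, seed=0):
    """Return (training_set, held_out_set) from a list of recording names.

    The fraction and seed are illustrative assumptions, not recommendations.
    """
    if not 0 < held_out_fraction < 1:
        raise ValueError("held_out_fraction must be between 0 and 1")
    shuffled = list(recordings)
    # A fixed seed keeps the split repeatable between runs.
    random.Random(seed).shuffle(shuffled)
    n_held_out = max(1, int(len(shuffled) * held_out_fraction))
    # The first slice is kept aside for evaluation, the rest is for training.
    return shuffled[n_held_out:], shuffled[:n_held_out]


# Hypothetical file names standing in for real recordings.
recordings = [f"utt_{i:03d}.wav" for i in range(20)]
training_set, held_out_set = split_corpus(recordings)
print(len(training_set), "for training,", len(held_out_set), "held out")
```

The important property is simply that no recording appears in both sets.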

The audio can be in any format your training software is compatible with but wave files are commonly used. The quality of the audio both in terms of the digital quality (e.g. sample rate) as well as the quality of the speaker(s) and the equipment used for the recordings will have a direct bearing on the resulting accuracy of the system being trained. Simply put, the better quality you can make the input, the more accurate the output will be. This leads to another bunch of questions such as but not limited to “What quality is optimal?”, “What should I get the speakers to say?”, “How should I capture the recordings?” - all of which are research topics in their own right and for which there is no one-size-fits-all answer.

Capturing the audio is one half of the battle. The next piece in the puzzle is obtaining well transcribed textual copies of that audio. The transcripts should consist of a set of text representing what was said in the audio as well as some sort of indication of when during the audio a speaker starts speaking and when they stop. This is usually done on a sentence by sentence basis, or for each utterance as they are known. These transcripts may have a certain amount of subjectivity associated with them in terms of where the sentence boundaries are and potentially exactly what was said if the audio wasn't clear or slang terms were used. They can be formatted in a variety of different ways and there are various standard formats for this purpose from an XML DTD through to CSV.

If it has not already become clear, creating the transcription files can be quite a skilled and time consuming job. A typical industry expectation is that it takes approximately 10 man-hours for a skilled transcriber to produce 1 hour of well formatted audio transcription. This time plus the cost of collecting the audio in the first place is one of the factors making speech to text a long, hard and expensive process. This is particularly the case when put into context: most current commercial speech systems are trained on 2000+ hours of audio, with the minimum recommended amount being somewhere in the region of 500+ hours.
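A quick back-of-the-envelope calculation shows why this gets expensive. It only uses the rough figures quoted above (10 hours of effort per hour of audio, 500 and 2000 hours of audio); the function name is my own.

```python
def transcription_effort(audio_hours, hours_per_audio_hour=10):
    """Estimate transcriber hours needed for a given amount of audio."""
    return audio_hours * hours_per_audio_hour


for audio_hours in (500, 2000):
    effort = transcription_effort(audio_hours)
    print(f"{audio_hours} hours of audio -> roughly {effort} hours of transcription")
```

So even the minimum recommended corpus works out at around 5,000 hours of skilled transcription, and a commercial-sized one at around 20,000.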

Finally, a phonetic dictionary must either be obtained or produced that contains at least one pronunciation variant for each word said across the entire corpus of audio input. Even for a minimal system this will run into tens of thousands of words. There are, of course, already phonetic dictionaries available, such as the Oxford English Dictionary, which gives a pronunciation for each word it contains. However, this would only be appropriate for one regional accent or dialect without variation. Hence, producing the dictionary can also be a long and skilled manual task.

What does the software do?
The simple answer is that it takes audio and transcript files and passes them through a set of really rather complicated mathematical algorithms to produce a model that is particular to the input received. This is the training process. Once the system has been trained the model it generates can be used to take speech input and produce text output. This is the decoding process. The training process requires lots of data and is computationally expensive but the model it produces is very small and computationally much less expensive to run. Today's models are typically able to perform real-time (or faster) speech to text conversion on a single core of a modern CPU. It is the model and software surrounding the model that is the piece exposed to users of the system.

Various different steps are used during the training process to iterate through the different modelling techniques across the entire set of training audio provided to the trainer. When the process first starts the software knows nothing of the audio; there are no clever bootstrapping techniques used to kick-start the system in a certain direction or pre-load it in any way. This allows the software to be entirely generic and work for all sorts of different languages and quality of material. Starting in this way is known as a flat start or context independent training. The software simply chops up the audio into regular segments to start with and then performs several iterations where these boundaries are shifted slightly to match the boundaries of the speech in the audio more closely.
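As a rough illustration of the "chop it up into regular segments" starting point, here's a toy sketch. It assumes we already have a sequence of sound labels for an utterance (the labels below are invented, not from any real dictionary) and simply gives each one an equal share of the audio. The later iterations that nudge the boundaries are the clever part and aren't shown here.

```python
def uniform_segments(duration_seconds, labels):
    """Give each label an equal slice of the utterance as (label, start, end)."""
    if not labels:
        raise ValueError("need at least one label to segment")
    step = duration_seconds / len(labels)
    return [(label, i * step, (i + 1) * step) for i, label in enumerate(labels)]


# A hypothetical 2 second utterance with four made-up sound labels.
segments = uniform_segments(2.0, ["h", "e", "l", "o"])
for label, start, end in segments:
    print(f"{label}: {start:.2f}s to {end:.2f}s")
```

Every segment starts out the same length regardless of what was actually said, which is exactly why the boundaries need refining afterwards.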

The next phase is context dependent training. This phase starts to make the model a little more specific and tailored to the input being given to the trainer. The pronunciation dictionary is used to refine the model to produce an initial system that could be used to decode speech into text in its own right at this early stage. Typically, context dependent training, while an iterative process in itself, can also be run multiple times in order to hone the model still further.

Another optimisation that can be made to the model after context dependent training is to apply vocal tract length normalisation. This works on the theory that the audibility of human speech correlates to the pitch of the voice, and the pitch of the voice correlates to the vocal tract length of the speaker. Put simply, it's a theory that says men have low voices and women have high voices and if we normalise the wave form for all voices in the training material to have the same pitch (i.e. same vocal tract length) then audibility improves. To do this an estimation of the vocal tract length must first be made for each speaker in the training data such that a normalisation factor can be applied to that material and the model updated to reflect the change.

The model can be thought of as a tree although it's actually a large multi-dimensional matrix. By reducing the number of dimensions in the matrix and applying various other mathematical operations to reduce the search space the model can be further improved upon in terms of accuracy, speed and size. This is generally done after vocal tract length normalisation has taken place.

Another tweak that can be made to improve the model is to apply what we call discriminative training. For this step the theory goes along the lines that all of the training material is decoded using the current best model produced from the previous step. This produces a set of text files. These text files can be compared with the transcripts that the human transcribers produced and that were given to the system as training material. The comparison can be used to inform where the model can be improved and these improvements applied to the model. It's a step that can probably be best summarised as the system learning from its mistakes. Clever!

Finally, once the model has been completed it can be used with a decoder that knows how to understand that model to produce text given an audio input. In reality, the decoders tend to operate on two different models: the audio model, for which the process of creation has just been roughly explained, and a language model. The language model is simply a description of how language is used in the specific context of the training material. It would, for example, attempt to provide insight into which words typically follow which other words via the use of what natural language processing experts call n-grams. Obtaining information to produce the language model is much easier and does not necessarily have to come entirely from the transcripts used during the training process. Any text data that is considered representative of the speech being decoded could be useful. For example, in an application targeted at decoding BBC News readers, articles from the BBC News web site would likely prove a useful addition to the language model.
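For the curious, the simplest kind of n-gram is a bigram: count how often each word follows each other word. The sketch below does just that over three invented sentences. It's a toy under my own assumptions (lower-casing, whitespace splitting, no smoothing) and not how a production language model is built.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count, for each word, which words follow it and how often."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # <s> and </s> mark the start and end of each sentence.
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        for previous_word, next_word in zip(words, words[1:]):
            counts[previous_word][next_word] += 1
    return counts


def next_word_probability(counts, previous_word, next_word):
    """Estimate P(next_word | previous_word) from raw counts (no smoothing)."""
    followers = counts.get(previous_word, Counter())
    total = sum(followers.values())
    return followers[next_word] / total if total else 0.0


# Invented text standing in for something like news articles.
corpus = ["the news at ten", "the weather at ten", "the news at six"]
model = train_bigrams(corpus)
print(next_word_probability(model, "the", "news"))
print(next_word_probability(model, "at", "ten"))
```

In this tiny corpus "news" follows "the" two times out of three, so a decoder that is unsure between "news" and something similar sounding would lean towards "news".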

How accurate is it?
This is probably the most common question about these systems and one of the most complex to answer. As with most things in the world of high technology it's not simple, so the answer is the infamous “it depends”. The short answer is that in ideal circumstances the software can perform at near human levels of accuracy, which equates to accuracy in excess of 90%. Pretty good you'd think. It has been shown that human performance is somewhere in excess of 90% and is almost never 100%. The test for this is quite simple: you get two (or more) people to independently transcribe some speech and compare the results from each transcriber. Almost always there will be a disagreement about some part of the speech (if there's enough speech, that is).
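One common way of putting a number on this sort of comparison is to count how many word insertions, deletions and substitutions it takes to turn one transcript into the other, then divide by the length of the reference. That figure is usually called the word error rate, and "accuracy" is often quoted as one minus it. I'm assuming that definition here; the sentences are invented.

```python
def word_edit_distance(reference, hypothesis):
    """Minimum word insertions, deletions and substitutions between two texts."""
    ref = reference.split()
    hyp = hypothesis.split()
    # previous[j] is the distance between the reference words seen so far
    # and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def word_error_rate(reference, hypothesis):
    """Edit distance divided by the number of words in the reference."""
    n_reference_words = len(reference.split())
    if n_reference_words == 0:
        raise ValueError("reference transcript is empty")
    return word_edit_distance(reference, hypothesis) / n_reference_words


reference = "the cat sat on the mat"
hypothesis = "the cat sat on a mat"
wer = word_error_rate(reference, hypothesis)
print(f"word error rate: {wer:.3f}, accuracy: {1 - wer:.3f}")
```

The same function works for comparing two human transcribers with each other, you just have to pick one of them to be the "reference".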

It's not often that ideal circumstances are present or can even realistically be achieved. Ideal would be transcribing a speaker with a similar voice and accent to those which have been trained into the model and they would speak at the right speed (not too fast and not too slowly) and they would use a directional microphone that didn't do any fancy noise cancellation, etc. What people are generally interested in is the real-world situation, something along the lines of “if I speak to my phone, will it understand me?”. This sort of real-world environment often includes background noise and a very wide variety of speakers potentially speaking into a non-optimal recording device. Even this can be a complicated answer for the purposes of accuracy. We're talking about free, conversational style, speech in this blog post and there's a huge difference between recognising any and all words and recognising a small set of command and control words for when you want your phone to perform a specific action. In conclusion then, we can only really speak about the art of the possible and what has been achieved before. If you want to know about accuracy for your particular situation and your particular voice on your particular device then you'd have to test it!

What words can it understand? What about slang?
The range of understanding of a speech to text system is dependent on the training material. At present, the state of the art systems are based on dictionaries of words and don't generally attempt to recognise new words for which an entry in the dictionary has not been found (although these types of systems are available separately and could be combined into a speech to text solution if necessary). So the number and range of words understood by a speech to text system is currently (and I'm generalising here) a function of the number and range of words used in the training material. It doesn't really matter what these words are, whether they're conversational and slang terms or proper dictionary terms, so long as the system was trained on those then it should be able to recognise them again during a decode.
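Here's a small sketch of what "not in the dictionary" means in practice. The dictionary maps each word to one or more pronunciations; the sound symbols are made up for illustration and don't follow any particular standard. Any word in the text without an entry simply can't be recognised by a dictionary-based system.

```python
def out_of_vocabulary(text, dictionary):
    """Return the sorted, de-duplicated words in text with no dictionary entry."""
    return sorted({word for word in text.lower().split() if word not in dictionary})


# A tiny hypothetical pronunciation dictionary: word -> list of variants.
dictionary = {
    "speech": ["s p ee ch"],
    "to": ["t oo", "t uh"],
    "text": ["t e k s t"],
}

print(out_of_vocabulary("speech to text lol", dictionary))
```

Here "lol" has no entry, so it would need adding (with a pronunciation) and training on before the system could be expected to produce it.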

Updates and Maintenance
For the more discerning reader, you'll have realised by now a fundamental flaw in the plan laid out thus far. Language changes over time, people use new words and the meaning of words changes within the language we use. Text-speak is one of the new kids on the block in this area. It would be extremely cumbersome to need to train an entire new model each time you wished to update your previous one in order to include some set of new language capability. The models produced are able to be modified and updated with these changes without the need to go back to a full standing start and train from scratch all over again. It's possible to take your existing model built from the set of data you had available at a particular point in time and use this to bootstrap the creation of a new model which will be enhanced with the new materials that you've gathered since training the first model. Of course, you'll want to test and compare both models to check that you have in fact enhanced performance as you were expecting. This type of maintenance and update to the model will be required for any and all of these types of systems as they're currently designed, as the structure and usage of our languages evolve.

OK, so not necessarily a blog post that was ever designed to draw a conclusion but I wanted to wrap up by saying that this is an area of technology that is still very much in active research and development, and has been so for at least 40-50 years! There's a really interesting statistic I've seen in the field that says if you ask a range of people involved in this topic the question “when will speech to text become a reality” then the answer generally comes out at “in ten years' time”. This question has been asked consistently over time and the answer has remained the same. It seems then, that either this is a really hard nut to crack or that our expectations of such a system move on over time. Either way, it seems there will always be something new just around the corner to advance us to the next stage of speech technologies.

Tuesday, 16 April 2013

Killer Android Apps

This is my second blog post with this title with the first one having appeared in December 2010.  I thought it would be good to look over which apps I was using back then and of those which I'm still using now but also what new apps I'm using.  It feels like Android and the apps available for it advance at quite a slow pace so it'll be interesting to see the differences between the two blog posts and see what's changed in the last 28 months or so.  I've also updated from Android 2.2 to Android 4.2.2 in that time.

Back then I was using the following list of apps regularly:
  • Angry Birds (Game)
  • Barcode Scanner (Bar Code Scanner)
  • eBuddy (Instant Messenger)
  • ES File Explorer (File Manager)
  • Google Reader (Feed Reader)
  • Maps (Navigation)
  • RAC Traffic (Traffic/Navigation)
  • Scrobble Droid (Social Music)
  • Skype (Instant Messenger and VOIP)
  • Todo List Widget (Productivity)
  • TweetDeck (Social Client)
  • YouTube (Video)

AccuWeather (Weather)
Back in the old days of Android 2.2 I was running an HTC Desire with Sense and there was a really good weather widget in Sense which meant I didn't bother using a weather app.  Now I'm using Android 4.2.2 on a Nexus 4 and running the stock Google image, so that widget is no longer available.  I find the AccuWeather app to be a good alternative.  I also keep a mobile bookmark handy for the BBC weather page for my local area.

Amazon (Shopping)
I'm quite surprised that I've started using more dedicated apps.  I suppose they've got better over the years and there is a wider selection available.  If I used Amazon on my HTC Desire I would have done so via the mobile web browser interface but these days I tend to use the app instead.  That's probably got a lot to do with the fact I've got loads more storage space for apps on the new phone though.

Angry Birds (and Bad Piggies)
So I still play Angry Birds which is either a testament to the game or a worrying sign I'm slightly addicted.  A bit of both perhaps.  Now though, there's a whole series of these games to wade through as well as the spin-off Bad Piggies game which I also quite like.

BBC iPlayer and Media Player (TV)
It's not very often I use it but it's still handy to have installed.  It might get even more useful if I ever get round to buying one of the over-priced slimport adapters so I'd be able to pump out HD content from my phone over HDMI.

Bubble UPnP (DLNA Client)
This can come in really handy from time to time, either to stream content from my NAS onto the phone or, more often, to use it as a DLNA controller for other devices in the house.

Business Calendar Free (Calendaring)
The standard Google calendaring app and widget leaves quite a lot of room for improvement in my opinion, hence I find this little free app to be a better alternative.

Chrome (Web Browser)
A default app on the Google Android install but I thought I'd mention it here anyway since it's a huge improvement over any other browser I've used on a mobile device (I've got Firefox installed as well but it's not as good in my opinion) and much better than the previous Android browser I was using.

Chrome to Phone (Link Sharing)
This is a great little app that allows you to ping links from your Chrome web browser (with an installed extension) straight over to your mobile phone.  Sadly, it only works in one direction though so you can't go from phone to desktop although there are third-party scripts that allow you to do that as well.

Dropbox (Cloud Storage)
Another app I don't use a huge amount but it's useful to be able to access my Dropbox files on my phone if I need to.

ES File Explorer (File Management)
An app I've been using more or less since day 1 when I got my first Android device.  I think it's still the best file manager on the app store, look no further.

Facebook (Social Networking)
I've started to use the dedicated Facebook app a little more recently than I ever did before.  That's partly because it offers more functionality than I had available before (via either the web interface or Tweetdeck) and partly because Tweetdeck is going away soon and will no longer be available on Android.

Feedly (RSS Reader)
An app that Google forced me to discover recently because they appear to have got fed up with not making any money from Google Reader.  This appears right now to be the best alternative solution available on both Android and the web.  The Feedly producers currently use the Google API and back end for RSS reading but they are in the process of writing their own and are promising a seamless move when Google finally pull Reader in a few months time.  I'm sure more of these apps will pop up in the future now the market has opened up so it'll be interesting to see what I'm using next time I get around to writing one of these blog posts.

Flickr (Photography)
I use this app to upload the occasional photo I take on my mobile.  However, it's also one of those go to apps that I tend to read through more or less every day to see what's new on there.  Much like the Flickr web site itself, it's not exactly radical these days and passes for just about usable but if you're a Flickr user it's the best option I've found on Android if you want an app.

Hotmail (Email)
Previously I always used the in-built email app from HTC Sense which was compatible with Hotmail.  However, that's no longer available to me these days so I had to look around for an alternative when I got my new phone.  The official Hotmail app seems to do the job quite nicely.

LoveFilm (Movie Rental)
Again, another app that I don't use often and would previously have used the mobile web version instead.  However, this is handy to have sitting around for those times when someone recommends you a film and you want to shove it straight onto your rental list.

National Rail (Railways)
This can be incredibly handy for me in two situations, there's the obvious one where I'm using the rail network and want to plan routes, work out costs, see when trains are running and get live departure updates.  However, as the hubby of someone who uses the train network every day it can be really quite useful to spot when late arrivals home might occur or be armed with the latest information if the phone rings.

RAC (Traffic)
I'm still using this and finding it useful to check for traffic jams when I go on a long trip.  It's still not that great an app though so if there are better alternatives out there I'd love to hear about them.

Reader (RSS Reading from Google)
An app that I use every day, quite probably multiple times a day along with the Reader interface in my web browser.  Really very unfortunately indeed, Google have decided to think better of the Reader app and it's going away this year.

Rocket Player (Music)
I don't use Spotify, Google Play Music or other streaming services as I have a preference to copy my own music onto my device and listen to it locally.  I'd be doing this anyway if I were on a plane or something so I don't really see the point in streaming.  Rocket Player seems to be quite a nice app to play your music through.  It comes in a few different forms from a free version (which I use) to unlocking more features with paid versions.  It's simple yet functional and with their equaliser turned on it makes your music sound pretty decent (as far as phone + headphones go) too.

Scrobble Droid (Social Music)
Another app I'm still using from the old days and more or less since day 1 on my first Android phone.  This simple little app works with music players (Rocket Player is compatible with it) and allows you to scrobble the tracks you're playing on your phone.  If you're not connected to a network then it'll save them up and scrobble when you make a connection instead.

Squeezer (Logitech Squeezebox Controller)
Unfortunately Logitech have neglected the Squeezebox brand since buying them a few years ago and that seems to be pervasive throughout everything Squeezebox including the Android app.  I'm unable to install the official app on the recent version of Android I have but fortunately someone has written an alternative in Squeezer.  I'm glad they did, I think they've made a better job of it than the official app.  Sure, it doesn't have half of the whizzy fancy features you get on the official app but given I don't use the vast majority of those I really don't care and would recommend anyone with a Squeezebox to give Squeezer a try instead.

Skype (Instant Messaging and VOIP)
Another app I've been using for ages on my phone.  It's great to finally be able to do video messaging now I have a front-facing camera too.  Also, the introduction of logging into Skype via your MSN account and compatibility with Messenger means I've stopped using the eBuddy app.

Stickman Games (Gaming)
Most of these are quite fun and generally absolutely hilarious.  I'd recommend anyone with a bit of time to kill to check out Play for the various stickman games, they range from golf to cliff diving with things like skiing, wingsuits and base jumping included in there too.

Google Talk (Instant Messaging)
As a result of ceasing use of eBuddy I've started using the official Google Talk app instead which seems to be much better all-round than eBuddy for things like battery usage when staying logged into the app permanently.

Todo List Widget (Productivity)
This widget (or set of widgets) is still perfect for a really simple to do list on your phone.  It allows you to add different sized widgets to the desktop (which is a bit superfluous these days since you can resize widgets) and then simply add an item, remove an item and give the list a title.  All very easy.  All very simple.  Job done, don't need any more than that. 

Tweetdeck (Social Client)
Much like Google Reader this is an app I use every day or multiple times a day for reading and managing Twitter and to a lesser extent Facebook.  Another similarity is the announcement of the impending death of Tweetdeck on Android.  A couple of months ago I was looking around for Twitter clients as I know there's a huge range out there and was wondering if there was anything better than Tweetdeck.  I ended up installing 10 different apps to check them all out individually.  However, there wasn't anything I liked nearly as much as Tweetdeck (although one or two came close) because they all seem to rely so much on a huge amount of navigation and taps on the screen to see any content vs Tweetdeck's simple swipe-column interface.  I'm really apprehensive about what's next for Twitter on my phone.

YouTube (Video)
Another standard app, but it's still there on my phone.

So it seems in general I'm using a mixture of the same old apps I've always used, mostly because nothing better has come along rather than not trying to find new apps but I guess I'm probably a little stuck in my ways with the ones I'm using too.  There's definitely a theme of using more apps rather than mobile versions of web sites on my phone now which I put down to having loads more storage space available for installing them.  The list of apps is clearly quite a lot longer though so I'm using a lot more apps than I ever did as I've become a little more comfortable with using a touch screen mobile device and dipping in and out for content more regularly than I used to before my first smart phone.

Tuesday, 9 April 2013

Nexus 4

This post could as easily have been titled "buying my first mobile phone".  Yes, it's 2013 and I've just shelled out for my first ever new mobile phone, buying the Google/LG Nexus 4 so I thought I'd put something down about switching phones.  Prior to this as regular readers will know, I had an HTC Desire (which was actually a work phone), before that a Nokia N73 (which I bought second hand from eBay).  So I don't have a particularly prolific phone history, I tend to wait for what I want after researching long and hard and buying something that will last me several years (I'd say I change every 3 years or so).  With the price and feature set of the Nexus 4 as they are, it was a no-brainer next move for me so after the early rush on stock saw them getting sold out everywhere I waited and ordered when they came back in stock (on Feb 4th) and took delivery the very next day in spite of the 2 week wait Google were advertising for delivery.  Having had the phone a couple of months now, I'd say I've got used to it so now seems like the right time to talk about it.

So why the move to a new phone this time?
Basically, it wasn't at all driven by the Nexus 4, it was all about the HTC Desire.  I had been running non-standard firmware versions on it for quite a while and got fed up of the various instabilities in them, or out of date Android versions (for the more stable versions) but the real sticking point of the HTC Desire is that 150MB user space for apps.  That was OK back in the day, when apps were small and relatively few and far between in Android.  However, it quickly became far too little and I was either constantly battling to have the minimal set of apps I needed installed or copying them out to SD card in a custom firmware and suffering the slow-down consequences of doing that.

OK, so I decided I needed a new phone but why the Nexus 4?
Price.  Simple.  End of.  Having been lucky enough to have had the HTC Desire since it first came out, I've been used to a high end smart phone for quite a while.  A lot of the cheaper phones on the market today aren't really much better than the HTC Desire even now, so I needed to look high end.  Obvious choices were other top notch phones from HTC or perhaps a move to Samsung for the Galaxy S3.  However, looking at the high end market as it is right now, you just can't beat the Nexus 4 for "bang for your buck".  It's a high end phone offering all the features of the S3 and other similar phones but at a whisker over half of the price, so no contest.

What do I like about the Nexus 4?
I'm going to split this into two parts.  Software and then Hardware.

Well it's Android, and being a Google phone it's a bang up to date version too.  I can still use all the same apps I know and love from the HTC Desire (with the odd change of widget here and there as I've moved away from HTC Sense, obviously) but now I can have them all installed at the same time and running at comparative speed.  There are three things that come to mind when talking about the software differences.  First, the gesture typing keyboard seems really quite adequate to me; I was a Swype user before now but I've not even been remotely tempted to install Swype on the Nexus 4 as the gesture typing from Google more or less exactly replicates the experience with a few subtle differences.  Second, Google Now really does seem very clever indeed; within a couple of days it had worked out where I live and where I work so I get traffic updates before I make the journey; I get similar information if I've travelled somewhere I don't usually go; recently on holiday in Spain, Now welcomed me with local weather information, an estimate of the time back to the airport, a list of local restaurants and attractions, etc.  Third, the camera software completely blows me away; yes there's the photosphere camera which is a nice toy but the in-built ability to do time lapse videos, HDR pictures, panoramic pictures (which get stitched automatically) as well as all the various scene modes, editing facilities, location and social functions they've put in the app make it absolutely first class - my two cameras (a compact that is a few years old and an SLR) get absolutely nowhere near this level of functionality.

Hardware wise, the tech specs speak for themselves being quad core, 2GB RAM, nice screen, camera, etc.  Speaking of camera, I've been particularly impressed by the quality of the pictures and video from it against what I had before with the HTC Desire in addition to the camera software I've already talked about.  I could go on and on about what I like, the list is really long since it's more or less bigger, better and more powerful than the HTC Desire as you'd expect being released 3 years later.

So what's not to like?
If I was an Apple fan-boy I'd say "it's not an iPhone" but more seriously there are probably two downsides that I can think of.  You know it's coming... battery life.  Until humankind invents a more compact way of storing more electricity, battery life is always going to suck.  I can get 2 days from the phone with light usage but under heavier load I only get 1 day out of it.  The other thing is the size.  Preferably, I'd like to be able to break the laws of physics somehow and have a phone with a gorgeous massive screen that is absolutely tiny, which is unrealistic at the moment.  It is bigger than the HTC Desire and I would say, while thin, is still quite a large phone.