Apologies to the
tl;dr brigade, this is going to be a long one...
For a number of years I've
been quietly working away with IBM Research on our speech to text
programme. That is, working with a set of algorithms that ultimately
produce a system capable of listening to human speech and
transcribing it into text. The concept is simple: train a system for
speech to text - speech goes in, text comes out. However, the
process and algorithms to do this are extremely complicated in just
about every way you look at it – computationally, mathematically,
operationally, in evaluation, in time and in cost. This is a completely
separate topic and area of research from the similar sounding text to
speech systems that take text (such as this blog) and read it aloud
in a computerised voice.
Whenever I talk to people
about it they always appear fascinated and want to know more. The
same questions often come up, so I'm going to address some of them
here in a generic way, leaving out those that I'm unable to talk
about. I should also point out that I'm by no means a speech
expert or linguist, but I have developed enough of an understanding to
be dangerous in the subject matter, and that (I hope) allows me to
explain things in a way that others not familiar with the field are
able to understand. I'm deliberately not linking out to the various
research topics that come into play during this post, as the list
would become lengthy very quickly and this isn't a formal paper after
all; Internet searches are your friend if you want to know more.
I didn't know IBM did
that?
OK so not strictly a
question but the answer is yes, we do. We happen to be pretty good
at it as well. However, we typically use a company called Nuance as
our preferred partner.
People have often heard of
IBM's former product in this area, ViaVoice, for desktop
PCs, which was available until the early 2000s. This sort of
technology allowed a single user to speak to their computer for
various different purposes and required the user to spend some time
training the software before it would understand their particular
voice. Today's speech software has progressed beyond this to systems
that don't require any training by the user before they use it.
Current systems are trained in advance in order to attempt to
understand any voice.
What's required?
Assuming you have the
appropriate software and the hardware required to run it, you then
need three more things to build a speech to text system: audio,
transcripts and a phonetic dictionary of pronunciations. This sounds
quite simple but when you dig under the covers a little you realise
it's much more complicated (not to mention expensive) and the devil
is very much in the detail.
On the audio side you'll
need a set of speech recordings. If you want to evaluate your system
after it has been trained then a small sample of these should be kept
to one side and not used during the training process. This set of
audio used for evaluation is usually termed the held out set. It's
considered cheating if you later evaluate the system using audio that
was included in the training process – since the system has already
“heard” this audio before it would have a higher chance of
accurately reproducing it later. This split leaves you with two sets
of audio files: the held out set, and the majority of the audio that
remains, which is called the training set.
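As a trivial illustration of that split, here's a minimal sketch in
Python; the file names and the 10% proportion are just placeholder
assumptions on my part, not recommendations.

```python
import random

# Minimal sketch of carving a held out set from a corpus of recordings.
# The file names and the 10% proportion are illustrative assumptions.
audio_files = [f"recording_{i:04d}.wav" for i in range(2000)]

random.seed(42)                 # make the split repeatable
random.shuffle(audio_files)

held_out_size = len(audio_files) // 10      # keep ~10% aside for evaluation
held_out_set = audio_files[:held_out_size]
training_set = audio_files[held_out_size:]

# The held out files must never be seen by the trainer; they are only
# used afterwards to measure how well the trained model performs.
```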
The audio can be in any
format your training software is compatible with but wave files are
commonly used. The quality of the audio, both in terms of the digital
quality (e.g. sample rate) and in terms of the speaker(s) and the
equipment used for the recordings, will have a direct bearing
on the resulting accuracy of the system being trained. Simply put,
the better quality you can make the input, the more accurate the
output will be. This leads to another bunch of questions such as, but
not limited to, “What quality is optimal?”, “What should I get
the speakers to say?” and “How should I capture the recordings?” -
all of which are research topics in their own right and for which
there is no one-size-fits-all answer.
Capturing the audio is one
half of the battle. The next piece in the puzzle is obtaining well
transcribed textual copies of that audio. The transcripts should
consist of a set of text representing what was said in the audio as
well as some sort of indication of when during the audio a speaker
starts speaking and when they stop. This is usually done on a
sentence-by-sentence basis, or for each utterance, as these chunks are known.
These transcripts may have a certain amount of subjectivity
associated with them in terms of where the sentence boundaries are
and potentially exactly what was said if the audio wasn't clear or
slang terms were used. They can be formatted in a variety of
different ways and there are various standard formats for this
purpose from an XML DTD through to CSV.
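To make that a little more concrete, here's what an utterance-level
transcript might look like and how it could be read; the CSV layout,
file names and timings below are purely illustrative, not any
particular standard.

```python
import csv
import io

# Hypothetical utterance-level transcript: one row per utterance with
# start/end times in seconds and the text that was spoken.
transcript_csv = """audio_file,start,end,text
interview_01.wav,0.00,3.42,good morning and welcome to the programme
interview_01.wav,3.80,7.15,thanks it's great to be here
"""

for utterance in csv.DictReader(io.StringIO(transcript_csv)):
    duration = float(utterance["end"]) - float(utterance["start"])
    print(f'{utterance["audio_file"]} [{duration:.2f}s]: {utterance["text"]}')
```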
If it has not already
become clear, creating the transcription files can be quite a skilled
and time-consuming job. A typical industry expectation is that it
takes approximately 10 man-hours for a skilled transcriber to produce
1 hour of well-formatted audio transcription. This time, plus the
cost of collecting the audio in the first place, is one of the factors
making speech to text a long, hard and expensive process. This is
particularly the case when you consider that most current
commercial speech systems are trained on at least 2000 hours of
audio, with the minimum recommended amount being somewhere in the
region of 500 hours.
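Putting those rough figures together gives a feel for the scale of
the transcription effort alone (the working-day assumptions at the
end are mine, purely for illustration).

```python
# Back-of-the-envelope arithmetic using the figures quoted above.
hours_of_audio = 2000      # a typical commercial training corpus
hours_per_hour = 10        # ~10 man-hours to transcribe 1 hour of audio

transcription_effort = hours_of_audio * hours_per_hour
print(transcription_effort)                # 20000 man-hours
print(transcription_effort / (8 * 225))    # roughly 11 person-years at 8h days, ~225 working days/year
```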
Finally, a phonetic
dictionary must either be obtained or produced that contains at least
one pronunciation variant for each word said across the entire corpus
of audio input. Even for a minimal system this will run into tens of
thousands of words. There are, of course, phonetic
dictionaries already available, such as the Oxford English Dictionary,
which gives a pronunciation for each word it lists. However, this
would only be appropriate for one regional accent or dialect without
variation. Hence, producing the dictionary can also be a long and
skilled manual task.
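A tiny, made-up slice of such a dictionary might look like the
following; the phone symbols are illustrative ARPAbet-style labels
and the variants are examples only, not taken from any real
dictionary.

```python
# Hypothetical fragment of a phonetic dictionary: each word maps to one or
# more pronunciation variants, each variant being a sequence of phones.
phonetic_dictionary = {
    "tomato": [
        ["T", "AH", "M", "AA", "T", "OW"],   # "tom-ah-to"
        ["T", "AH", "M", "EY", "T", "OW"],   # "tom-ay-to"
    ],
    "speech": [
        ["S", "P", "IY", "CH"],
    ],
    "the": [
        ["DH", "AH"],                        # "thuh"
        ["DH", "IY"],                        # "thee"
    ],
}

for word, variants in phonetic_dictionary.items():
    print(word, "->", " | ".join(" ".join(v) for v in variants))
```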
What does the software
do?
The simple answer is that
it takes audio and transcript files and passes them through a set of
really rather complicated mathematical algorithms to produce a model
that is particular to the input received. This is the training
process. Once the system has been trained, the model it generates can be
used to take speech input and produce text output. This is the
decoding process. The training process requires lots of data and is
computationally expensive but the model it produces is very small and
computationally much less expensive to run. Today's models are
typically able to perform real-time (or faster) speech to text
conversion on a single core of a modern CPU. It is the model, and the
software surrounding it, that is the piece exposed to users of
the system.
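In interface terms you can picture it as two functions along these
lines; this is purely a shape-of-the-thing sketch with made-up names,
not any particular product's API.

```python
from dataclasses import dataclass

@dataclass
class Model:
    """Stand-in for the compact artefact the training phase produces."""
    parameters: dict

def train(audio_files: list[str], transcripts: list[str]) -> Model:
    """Expensive, data-hungry, run once offline over the whole corpus."""
    # ...many iterations over all of the training audio happen here...
    return Model(parameters={})

def decode(model: Model, audio_file: str) -> str:
    """Cheap enough to run in real time on a single CPU core."""
    # ...search the model for the most likely word sequence...
    return "recognised text for " + audio_file
```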
Various steps
are used during the training process to iterate through the different
modelling techniques across the entire set of training audio provided
to the trainer. When the process first starts the software knows
nothing of the audio; there are no clever bootstrapping techniques
used to kick-start the system in a certain direction or pre-load it
in any way. This allows the software to be entirely generic and work
for all sorts of different languages and qualities of material.
Starting in this way is known as a flat start or context-independent
training. The software simply chops up the audio into regular
segments to start with and then performs several iterations where
these boundaries are shifted slightly to match the boundaries of the
speech in the audio more closely.
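A crude way to picture that flat start: given an utterance of known
length and the units it contains, just divide the time evenly between
them and let later iterations nudge the boundaries. The sketch below
does exactly that even split; the units and timings are illustrative.

```python
def flat_start_segments(duration_s: float, units: list[str]) -> list[tuple[str, float, float]]:
    """First guess: divide the utterance evenly between its units.

    Later training iterations shift these boundaries so that they line up
    more closely with where the speech actually changes in the audio.
    """
    step = duration_s / len(units)
    return [(unit, i * step, (i + 1) * step) for i, unit in enumerate(units)]

# Illustrative example: a 2.4 second utterance containing four words.
for unit, start, end in flat_start_segments(2.4, ["speech", "goes", "in", "here"]):
    print(f"{unit:8s} {start:.2f}s - {end:.2f}s")
```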
The next phase is context-dependent
training. This phase starts to make the model a little
more specific and tailored to the input being given to the trainer.
The pronunciation dictionary is used to refine the model to produce
an initial system that could be used to decode speech into text in
its own right at this early stage. Typically, context-dependent
training, while an iterative process in itself, can also be run
multiple times in order to hone the model still further.
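One common way of making a model context dependent (and I'm
describing the general technique here, not the specifics of any
particular trainer) is to model each phone in the context of its
neighbours, so-called triphones. The sketch below shows the idea with
made-up phone labels.

```python
def to_triphones(phones: list[str]) -> list[str]:
    """Turn a phone sequence into context-dependent left-centre+right labels.

    The same phone gets a different label (and hence a different model)
    depending on which phones surround it.
    """
    padded = ["sil"] + phones + ["sil"]   # treat silence as the context at the edges
    return [f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
            for i in range(1, len(padded) - 1)]

# "speech" with illustrative ARPAbet-style phones from the dictionary.
print(to_triphones(["S", "P", "IY", "CH"]))
# ['sil-S+P', 'S-P+IY', 'P-IY+CH', 'IY-CH+sil']
```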
Another optimisation that
can be made to the model after context-dependent training is to apply
vocal tract length normalisation. This works on the theory that much
of the variation between speakers' voices correlates to the pitch of
the voice, and the pitch of the voice correlates to the vocal tract
length of the speaker. Put simply, it's a theory that says men have
low voices and women have high voices, and if we normalise the wave
forms for all voices in the training material to have the same pitch
(i.e. the same vocal tract length) then much of that speaker-to-speaker
variation is removed and recognition improves. To do this an
estimation of the vocal tract length must first be made for each
speaker in the training data such that a normalisation factor can be
applied to that material and the model updated to reflect the change.
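A heavily simplified sketch of the estimation idea: try a small grid
of warp factors for each speaker, keep the one that makes their audio
best fit some reference, and use that factor to stretch or squash the
frequency axis of their features. The grid, the warping function and
the toy scoring below are all illustrative assumptions; a real
trainer scores candidate factors against its current model rather
than a single reference spectrum.

```python
import numpy as np

WARP_FACTORS = np.arange(0.88, 1.13, 0.02)   # illustrative grid of candidate factors

def warp_spectrum(spectrum: np.ndarray, alpha: float) -> np.ndarray:
    """Stretch/squash the frequency axis of a magnitude spectrum by alpha."""
    bins = np.arange(len(spectrum))
    return np.interp(bins, bins * alpha, spectrum)

def choose_warp(speaker_spectra: list[np.ndarray], reference: np.ndarray) -> float:
    """Pick the warp factor that makes this speaker's spectra best match a reference."""
    def score(alpha: float) -> float:
        return -sum(np.sum((warp_spectrum(s, alpha) - reference) ** 2)
                    for s in speaker_spectra)
    return float(max(WARP_FACTORS, key=score))

# Toy usage: a speaker whose spectra already match the reference needs no warping.
reference = np.random.default_rng(0).random(64)
print(choose_warp([reference.copy(), reference.copy()], reference))   # ~1.0
```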
The model can be thought
of as a tree although it's actually a large multi-dimensional matrix.
By reducing the number of dimensions in the matrix and applying
various other mathematical operations to reduce the search space, the
model can be further improved upon in terms of accuracy, speed
and size. This is generally done after vocal tract length
normalisation has taken place.
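For a flavour of what "reducing the number of dimensions" can mean,
here's a generic PCA-style projection; I'm not claiming this is the
specific operation used, just illustrating the general kind of
mathematics involved.

```python
import numpy as np

def reduce_dimensions(data: np.ndarray, k: int) -> np.ndarray:
    """Project (n_samples, n_dims) data down to its k highest-variance directions."""
    centred = data - data.mean(axis=0)
    # Singular value decomposition gives the principal directions.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:k].T

# Illustrative: squash 39-dimensional vectors down to 13 dimensions.
high_dim = np.random.default_rng(1).normal(size=(1000, 39))
low_dim = reduce_dimensions(high_dim, k=13)
print(high_dim.shape, "->", low_dim.shape)   # (1000, 39) -> (1000, 13)
```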
Another tweak that can be
made to improve the model is to apply what we call discriminative
training. For this step the theory goes along these lines: all of
the training material is decoded using the current best model
produced from the previous step. This produces a set of text files.
These text files can be compared with the transcripts produced by the
human transcribers and given to the system as training material. The
comparison can be used to inform where the model can be improved, and
these improvements are then applied to the model. It's a step that can
probably best be summarised as the system learning from its mistakes. Clever!
Finally, once the model
has been completed it can be used with a decoder that knows how to
understand that model to produce text given an audio input. In
reality, the decoders tend to operate on two different models: the
audio model, whose process of creation has just been roughly
explained, and a language model. The language model is simply a
description of how language is used in the specific context of the
training material. It would, for example, attempt to provide insight
into which words typically follow which other words via the use of
what natural language processing experts call n-grams. Obtaining
information to produce the language model is much easier and does not
necessarily have to come entirely from the transcripts used during
the training process. Any text data that is considered
representative of the speech being decoded could be useful. For
example, in an application targeted at decoding BBC News readers,
articles from the BBC news web site would likely prove a useful
addition to the language model.
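For a flavour of the n-gram idea, here's a tiny bigram counter; a
real language model would use vastly more text plus smoothing for
word pairs it has never seen, but the core idea is just counting.

```python
from collections import Counter, defaultdict

# Tiny bigram language model sketch: count which word follows which, then
# turn the counts into probabilities. The corpus here is obviously made up.
corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
]

bigram_counts = defaultdict(Counter)
for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]
    for w1, w2 in zip(words, words[1:]):
        bigram_counts[w1][w2] += 1

def p_next(w1: str, w2: str) -> float:
    """Probability that w2 follows w1, based purely on the counts."""
    total = sum(bigram_counts[w1].values())
    return bigram_counts[w1][w2] / total if total else 0.0

print(p_next("the", "cat"))   # 0.5: "the" is followed by "cat" in two of its four occurrences
```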
How accurate is it?
This
is probably the most common question about these systems and one of
the most complex to answer. As with most things in the world of high
technology it's not simple, so the answer is the infamous “it
depends”. The short answer is that in ideal circumstances the
software can perform at near human levels of accuracy, which equates
to somewhere in excess of 90%. Pretty good, you'd think. It
has been shown that human performance is somewhere in excess of 90%
and is almost never 100% accurate. The test for this is quite
simple: you get two (or more) people to independently transcribe some
speech and compare the results from each transcriber; almost always
there will be a disagreement about some part of the speech (if there's
enough speech, that is).
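In practice accuracy in this field is usually quoted via word error
rate: align the two transcripts, count substitutions, insertions and
deletions, and divide by the number of words in the reference. The
sketch below does that comparison, whether between two human
transcribers or between a decoder and a reference transcript; the
example sentences are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits needed to turn the first i reference words into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# Two people transcribing the same (made-up) audio rarely agree exactly.
print(word_error_rate("the cat sat on the mat today",
                      "the cat sat on a mat today"))   # ~0.14, i.e. roughly 86% agreement
```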
It's
not often that ideal circumstances are present or can even
realistically be achieved. Ideal would be transcribing a speaker
with a similar voice and accent to those which have been trained into
the model, speaking at the right speed (not too fast and
not too slowly), into a directional microphone that
didn't do any fancy noise cancellation, etc. What people are
generally interested in is the real-world situation, something along
the lines of “if I speak to my phone, will it understand me?”.
This sort of real-world environment often includes background noise
and a very wide variety of speakers potentially speaking into a
non-optimal recording device. Even this can be a complicated question
when it comes to accuracy. We're talking about free,
conversational-style speech in this blog post, and there's a huge
difference in recognising any and all words versus recognising a small
set of command-and-control words used when you want your phone to
perform a specific action. In conclusion then, we can only really
speak about the art of the possible and what has been achieved
before. If you want to know about accuracy for your particular
situation and your particular voice on your particular device then
you'd have to test it!
What words can it
understand? What about slang?
The range of understanding
of a speech to text system is dependent on the training material. At
present, state-of-the-art systems are based on dictionaries of
words and don't generally attempt to recognise new words for which an
entry in the dictionary has not been found (although these types of
systems are available separately and could be combined into a speech
to text solution if necessary). So the number and range of words
understood by a speech to text system is currently (and I'm
generalising here) a function of the number and range of words used
in the training material. It doesn't really matter what these words
are, whether they're conversational and slang terms or proper
dictionary terms; so long as the system was trained on them, it
should be able to recognise them again during a decode.
Updates and Maintenance
The
more discerning reader will have realised by now a fundamental
flaw in the plan laid out thus far: language changes over time,
people use new words and the meaning of words changes within the
language we use. Text-speak is one of the new kids on the block in
this area. It would be extremely cumbersome to need to train an
entire new model each time you wished to update your previous one in
order to include some set of new language capability. Fortunately,
the models produced can be modified and updated with these changes
without the need to go back to a standing start and train
from scratch all over again. It's possible to take your existing
model built from the set of data you had available at a particular
point in time and use this to bootstrap the creation of a new model
which will be enhanced with the new materials that you've gathered
since training the first model. Of course, you'll want to test and
compare both models to check that you have in fact enhanced
performance as you were expecting. This type of maintenance and
update to the model will be required for any and all of these types of
systems as they're currently designed, as the structure and usage of
our languages evolve.
Conclusion
OK,
so not necessarily a blog post that was ever designed to draw a
conclusion but I wanted to wrap up by saying that this is an area of
technology that is still very much in active research and
development, and has been so for at least 40-50 years!
There's a really interesting observation I've seen in the field: if
you ask a range of people involved in this topic “when will speech
to text become a reality?”, the answer generally comes out
at “in ten years' time”. This
question has been asked consistently over time and the answer has
remained the same. It seems, then, that either this is a really hard
nut to crack or that our expectations of such a system move on over
time. Either way, it seems there will always be something new just
around the corner to advance us to the next stage of speech
technologies.