N-Gram language models

A language model (LM) assigns probabilities to word sequences, so it tells your app which sequences are most likely. Ideally, a good LM should assign high probability to correct phrases and low probability to incorrect ones.

This way, if the acoustic model of your speech recognizer assigns similar probabilities to two phrases that sound pretty much the same, for example:

Speech technology rules

Speech enology rules

the LM can help select the more plausible one.

An N-gram model estimates the probability of each word given the N-1 words that precede it. For example, a 2-gram (bigram) model conditions each word on the single previous word, so it would compute:

P(speech technology rules) = P(speech) * P(technology | speech) * P(rules | technology)

P(speech enology rules) = P(speech) * P(enology | speech) * P(rules | enology)

In this case, P(speech technology rules) > P(speech enology rules), because a good LM will assign a higher value to P(technology | speech) than to P(enology | speech).

How is it possible to calculate all these probabilities? By counting how often words follow each other in a large collection of phrases, known as a linguistic corpus.
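
As a rough sketch of how such probabilities can be estimated from a corpus, the Python snippet below counts word pairs in a tiny made-up corpus and conditions each word on the single word before it. The corpus, the <s> and </s> boundary markers, the function names, and the add-one smoothing (used here so that unseen pairs such as "speech enology" do not get zero probability) are all assumptions made for illustration; they are not taken from the book or from any particular toolkit.

```python
from collections import Counter

# Toy corpus, invented for illustration only; a real LM needs far more text.
CORPUS = [
    "speech technology rules",
    "speech technology is fun",
    "speech recognition rules",
    "enology is the study of wine",
]

START, END = "<s>", "</s>"  # hypothetical sentence boundary markers

context_counts = Counter()  # how often each word appears as "previous word"
pair_counts = Counter()     # how often each (previous word, word) pair appears
vocab = set()               # every word that can be predicted, including END

for sentence in CORPUS:
    words = [START] + sentence.split() + [END]
    context_counts.update(words[:-1])
    pair_counts.update(zip(words, words[1:]))
    vocab.update(words[1:])


def bigram_prob(prev, word):
    """Add-one smoothed estimate of P(word | prev)."""
    return (pair_counts[(prev, word)] + 1) / (context_counts[prev] + len(vocab))


def phrase_prob(phrase):
    """Multiply the conditional probabilities of each word in the phrase."""
    words = [START] + phrase.split() + [END]
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        prob *= bigram_prob(prev, word)
    return prob


if __name__ == "__main__":
    for phrase in ("speech technology rules", "speech enology rules"):
        print(phrase, phrase_prob(phrase))
```

In this sketch the first word is scored as P(speech | <s>) rather than a plain P(speech), which is one common way to handle the start of a phrase. With this toy corpus, "speech technology rules" scores higher than "speech enology rules", simply because the pair "speech technology" was seen and "speech enology" was not.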

Building your own LM

In the book we use the speech recognizer provided by Google. Google's LM is built from an enormous number of phrases (imagine all the information they have just from web searches). To build your own LM you would also need a large amount of data. Fortunately, you can obtain a lot of free text from ebooks and newspapers that are available on the Internet.

An interesting initiative has also appeared recently: the 1 Billion Word Language Modeling Benchmark, which is freely available. Check it out here: https://code.google.com/p/1-billion-word-language-modeling-benchmark/

Integrating your LM in a speech recognizer

To build a speech recognizer for your apps that uses your brand new LM, check out PocketSphinx, a great open-source tool for developers.

Find out more

If you want to learn more about the statistical foundations of language modelling, you will find this book very interesting:

Book: Foundations of Statistical Natural Language Processing (Christopher D. Manning and Hinrich Schütze)