Wednesday 3 December 2008

N-Gram M-Adness

One of the basic concepts of Natural Language Processing is the n-gram model. It splits a sequence into its contiguous subsequences of length n, e.g.

(1) Now the lord once decided to set off for the mountain where the man lives

For n = 1 (unigram) the sentence is split into:

(n = 1) [ [Now] [the] [lord] [once] [decided] [to] [set] [off] [for] [the] [mountain] [where] [the] [man] [lives] ]

For n = 2 (bigram) the sentence is split into:

(n = 2) [ [Now the] [the lord] [lord once] [once decided] [decided to] [to set] [set off] [off for] [for the] [the mountain] [mountain where] [where the] [the man] [man lives] ]

For n = 3 (trigram) the sentence is split into:

(n = 3) [ [Now the lord] [the lord once] [lord once decided] [once decided to] [decided to set] [to set off] [set off for] [off for the] [for the mountain] [the mountain where] [mountain where the] [where the man] [the man lives] ]

Okay, you get the idea. Given a sequence of words w1, ..., wk, each n-gram pairs a word wk with its predecessors: wk-1 for bigrams, wk-2 and wk-1 for trigrams, and wk-n+1, ..., wk-1 in general.

Here's the relevant Python code for making n-grams:

def makeNGrams(inpStr, n):
    # split the input string into tokens on whitespace
    tokens = inpStr.split()
    nGrams = []
    for i in range(len(tokens)):
        # stop once fewer than n tokens remain
        if i + n > len(tokens):
            break
        # take n consecutive tokens starting at position i
        nGrams.append(tokens[i:i+n])
    return nGrams


Or a bit more condensed:

def makeNGrams(inpStr, n):
    tokens = inpStr.split()
    # one slice per starting position that still has n tokens left
    return [tokens[i:i+n] for i in range(len(tokens) - n + 1)]
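For example, applying the function to sentence (1) with n = 3 reproduces the trigram list from above (each n-gram comes back as a list of tokens):

sentence = "Now the lord once decided to set off for the mountain where the man lives"
for nGram in makeNGrams(sentence, 3):
    print(" ".join(nGram))   # "Now the lord", "the lord once", ...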


Why do you need this?

1. Machine Learning uses n-gram models to learn and induce rules from strings.
2. Probabilistic models use n-grams for spell checking and correcting misspelled words.
3. Compression of data.
4. Optical character recognition (OCR), Machine Translation (MT) and Intelligent Character Recognition (ICR) use n-grams to compute the probability of a word sequence or generally a pattern sequence.
5. Identify the language of a text (demo here)
6. Identify the species given a DNA sample.

For example, you can compute the probability of a word sequence with the chain rule, multiplying the conditional probabilities P(wk|w1, ..., wk-1) of each word given all of its predecessors. But if one of these conditional probabilities is zero, the whole product is zero too. This is a huge problem, since such long histories are hardly ever seen in corpora, even if you take the whole internet, e.g. "The world, as we know it, will be changed by the pollution of the environment". Therefore we only condition on the direct predecessors with an n-gram model and can still estimate the probability. Another application of n-grams can be found in Part-of-Speech tagging and probabilistic disambiguation of tags, e.g. the probability of "book/NN the/DT flight/NN" versus the probability of "book/VB the/DT flight/NN".
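To make the bigram case concrete, here is a minimal sketch (the name bigramProb and the toy corpus are mine, not from the programs mentioned below) that estimates P(wk|wk-1) by simple relative frequencies, reusing makeNGrams from above:

from collections import defaultdict

def bigramProb(corpus, prev, word):
    bigramCounts = defaultdict(int)
    contextCounts = defaultdict(int)
    for sentence in corpus:
        for w1, w2 in makeNGrams(sentence, 2):
            bigramCounts[(w1, w2)] += 1
            contextCounts[w1] += 1
    if contextCounts[prev] == 0:
        return 0.0
    # relative frequency estimate: count(prev word) / count(prev)
    return bigramCounts[(prev, word)] / float(contextCounts[prev])

corpus = ["I will eat fish", "I will eat bread", "you will eat fish"]
print(bigramProb(corpus, "eat", "fish"))   # 2/3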

I wrote a very simple program that predicts the next word given a sequence of words in a corpus, e.g. input: "I will eat"; output: "fish". You can find it here.
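The program itself is linked above; the basic idea, stripped down to a few lines (a simplification, not the linked code), is to count which words follow the given context in the corpus and return the most frequent one:

def predictNextWord(corpus, context):
    # context is a list of words, e.g. ["I", "will", "eat"]
    n = len(context) + 1
    followers = {}
    for sentence in corpus:
        for nGram in makeNGrams(sentence, n):
            if nGram[:-1] == context:
                followers[nGram[-1]] = followers.get(nGram[-1], 0) + 1
    if not followers:
        return None
    # pick the most frequent continuation
    return max(followers, key=followers.get)

corpus = ["I will eat fish", "I will eat bread", "you will eat fish"]
print(predictNextWord(corpus, ["will", "eat"]))   # fish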

Another program concerning n-grams, which I wrote, is available here. It extracts proper nouns such as "New York City" from English texts.
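The linked extractor is more involved; a naive sketch of the underlying idea (my own simplification, not the linked code) is to keep only those n-grams in which every token starts with a capital letter:

def properNounNGrams(text, n):
    # rough heuristic: an n-gram of capitalized tokens, e.g. "New York City"
    return [" ".join(g) for g in makeNGrams(text, n)
            if all(w[0].isupper() for w in g)]

print(properNounNGrams("He moved from New York City to Boston", 3))
# ['New York City'] -- in real text this also picks up sentence-initial words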
