Monday 22 December 2008

Python Madness

My night at 3 AM: hacking Python code to learn the language. What a life! Here's what I've done so far:

1. Decimal to binary conversion:

import sys

def bd(x):
    # Collect the binary digits of x, most significant digit first.
    n = []
    if x < 0:
        return "Non-negative integer required"
    elif x == 0:
        return [0]
    else:
        while x > 0:
            n.insert(0, x % 2)
            x = x // 2
        return n

if __name__ == "__main__":
    try:
        number = int(raw_input("Number: "))
        print bd(number)
    except ValueError:
        sys.stderr.write("Integer required\n")


2. Basic truth tables:

def logicalAnd():
    for valueOne in range(2):
        for valueTwo in range(2):
            print "%d %d %d" % (valueOne, valueTwo, valueOne and valueTwo)

def logicalOr():
    for valueOne in range(2):
        for valueTwo in range(2):
            print "%d %d %d" % (valueOne, valueTwo, valueOne or valueTwo)

def logicalConditional():
    for valueOne in range(2):
        for valueTwo in range(2):
            print "%d %d %d" % (valueOne, valueTwo, not valueOne or valueTwo)

def logicalBiconditional():
    for valueOne in range(2):
        for valueTwo in range(2):
            # Compare by value; "is" only tests object identity.
            print "%d %d %d" % (valueOne, valueTwo, valueOne == valueTwo)

if __name__ == "__main__":
    op = raw_input("Connective: ")
    if op == "and":
        logicalAnd()
    elif op == "or":
        logicalOr()
    elif op == "conditional":
        logicalConditional()
    elif op == "biconditional":
        logicalBiconditional()
    else:
        print "Connective not known"
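The four functions differ only in the expression they evaluate, so here is a more compact sketch of the same idea: keep the connectives in a dictionary and generate the rows with itertools.product. The names CONNECTIVES and truth_table are made up for this example, and the code is written to run on current Python as well.

```python
from itertools import product

# Hypothetical lookup table: connective name -> function of two truth values.
CONNECTIVES = {
    "and": lambda p, q: p and q,
    "or": lambda p, q: p or q,
    "conditional": lambda p, q: (not p) or q,
    "biconditional": lambda p, q: p == q,
}

def truth_table(name):
    """Return the rows (p, q, result) of the truth table for a connective."""
    connective = CONNECTIVES[name]  # raises KeyError for an unknown name
    return [(p, q, int(bool(connective(p, q))))
            for p, q in product((0, 1), repeat=2)]

if __name__ == "__main__":
    for row in truth_table("conditional"):
        print("%d %d %d" % row)
```

Returning the rows instead of printing them directly makes the table easy to check or reuse.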


3. ASCII table. The first column is the ASCII value, the second column is the character as your local terminal renders it, the third column is the escaped form of the character as Python displays it, and the fourth column is the hexadecimal value:

for element in xrange(256):
    char = chr(element)
    print "%s \t %s \t %s \t %s" % (element, char,
        str(tuple(char)).strip("()'',"), char.encode("hex"))


4. A perhaps overly complicated binary to decimal program:

def reverseRange(input):
    n = []
    for i in range(len(input)-1,-1,-1):
        n.append(i)
    return n

def singleValues(input):
    m = []
    for i in input:
        m.append(i)
    return m

if __name__ == "__main__":
    input = raw_input("Number: ")
    rR = reverseRange(input)
    sV = singleValues(input)
    dN = 0
    for i in range(len(sV)):
        dN += int(sV[i])*2**int(rR[i])
    print dN
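
For comparison, here is a shorter sketch of the same conversion using Horner's scheme: walk through the digits from left to right, multiply the running value by 2 and add the next digit. The function name binary_to_decimal is made up for this example; the built-in int(text, 2) does the same job and serves as a cross-check.

```python
def binary_to_decimal(bits):
    """Convert a string of binary digits to an integer (Horner's scheme)."""
    value = 0
    for digit in bits:
        if digit not in ("0", "1"):
            raise ValueError("not a binary digit: %r" % digit)
        # Shift the value so far one binary place to the left, then add the digit.
        value = value * 2 + int(digit)
    return value

if __name__ == "__main__":
    print(binary_to_decimal("1011"))
    print(int("1011", 2))  # built-in conversion, should agree
```

This avoids building the two helper lists and computing powers of 2 explicitly.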


This code works in Python 2.6.1.

Sunday 21 December 2008

Ubuntu On Samsung NC10

Last month I bought Samsung's NC10 netbook and I'm astonished how cool it is. It's really handy if you travel a lot (I do!) and have a lot to code (I do!). Unfortunately, there are some issues which have to be solved first.

1. Touchpad problem: It totally sucks when you're typing and your fat, oversized, nerdy hand (or as I like to call it: The Hand Of Code) gets even slightly over the touchpad. So you really need to disable the touchpad for a short time while you type. Fortunately, the gods of Ubuntu created a program called 'syndaemon' which does exactly that. It's useful to add it to your startup programs via System > Preferences > Sessions.

2. Excessive load cycles: They slowly kill your hard drive, so you'd better follow these instructions to set the correct values. It seems that there are still issues even after you've changed the options.


For further information you should check out:

Ubuntu on the Samsung NC10
Linux on the Samsung NC10
The Ubuntu NC10 Community Documentation

Friday 19 December 2008

Natural Language Processing Online Applications

Here is a list of links to free, interactive online applications related to NLP:

1. XLE Web Interface allows you to parse sentences of German, English, Norwegian, Welsh, Malagasy and Arabic. You'll get a very detailed parse tree and the functional structure of the sentence. For "This is madness!" you'd get:

[Screenshot: XLE parse tree and functional structure for "This is madness!"]


2. Wortschatz Leipzig is a German application that crawls the web for a word and returns a detailed analysis of the word's frequency, collocations and semantic relations. The word graphs are the most interesting part, e.g. the graph for "Humbug" (German for "rubbish"):

[Screenshot: Wortschatz Leipzig word graph for "Humbug"]


3. WordNet is a large lexical database for English, e.g. "house" shows the following senses:

[Screenshot: WordNet senses of "house"]


4. Answerbus is a search engine like Google or Yahoo but with semantics! You can ask natural-language questions like "Who killed JFK?" and will (perhaps) get the answer "Oswald killed JFK". Perhaps... because the system actually sucks and you can easily outmaneuver it. Another such search engine is START, which sucks too.

5. Wordfall is an awesome linguistic game! It's like Tetris but instead of blocks you have to match words to their constituents. Look:

[Screenshot: Wordfall]


6. Wortwarte is a German site that collects and sorts neologisms found in the media.

7. A cool German chatbot called ELBOT. It would definitely pass my Turing Test.

8. Think of a thing and 20Q will read your mind by asking 20 questions.

9. Machine Translation is one of the core disciplines of NLP. Everyone knows Babelfish. It's not only a translator in the Hitchhiker's Guide but also an online translator like Google Translate.

10. TextCat is a language guesser based on an n-gram Perl script. A better one is the XRCE language guesser.

Wednesday 3 December 2008

N-Gram M-Adness

One of the basic concepts of Natural Language Processing is the n-gram model. It splits a sequence into overlapping subsequences of length n, e.g.

(1) Now the lord once decided to set off for the mountain where the man lives

For n = 1 (unigram) the sentence is split into:

(n = 1) [ [Now] [the] [lord] [once] [decided] [to] [set] [off] [for] [the] [mountain] [where] [the] [man] [lives] ]

For n = 2 (bigram) the sentence is split into:

(n = 2) [ [Now the] [the lord] [lord once] [once decided] [decided to] [to set] [set off] [off for] [for the] [the mountain] [mountain where] [where the] [the man] [man lives] ]

For n = 3 (trigram) the sentence is split into:

(n = 3) [ [Now the lord] [the lord once] [lord once decided] [once decided to] [decided to set] [to set off] [set off for] [off for the] [for the mountain] [the mountain where] [mountain where the] [where the man] [the man lives] ]

Okay, you get the idea. In a sequence of words "w1, ..., wk", each word wk is grouped with its predecessor "wk-1" for bigrams, with "wk-2, wk-1" for trigrams and with "wk-n+1, ..., wk-1" in general. A sentence of k words therefore yields k-n+1 n-grams.

Here's the relevant Python code for making n-grams:

def makeNGrams(inpStr, n):
    token = inpStr.split()
    nGram = []
    for i in range(len(token)):
        if i+n > len(token):
            break
        nGram.append(token[i:n+i])
    return nGram


Or a bit more condensed:

def makeNGrams(inpStr, n):
    inpStr = inpStr.split()
    return [inpStr[i:n+i] for i in range(len(inpStr)) if len(inpStr)>=i+n]
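
As a quick check against the example sentence, here is a sketch of a third variant that builds the n-grams by zipping shifted copies of the token list. The name ngrams is made up for this example, and it returns tuples instead of lists so the n-grams can be counted or used as dictionary keys.

```python
def ngrams(tokens, n):
    """Return all n-grams of a token sequence as a list of tuples."""
    # tokens[0:], tokens[1:], ..., tokens[n-1:] are zipped position by position;
    # zip stops at the shortest slice, which gives len(tokens) - n + 1 n-grams.
    return list(zip(*[tokens[i:] for i in range(n)]))

if __name__ == "__main__":
    sentence = ("Now the lord once decided to set off for the mountain "
                "where the man lives")
    tokens = sentence.split()
    for n in (1, 2, 3):
        print("n = %d: %d n-grams" % (n, len(ngrams(tokens, n))))
```

For the 15 words of sentence (1) this gives 15 unigrams, 14 bigrams and 13 trigrams, matching the lists above.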


Why do you need this?

1. Machine Learning uses n-gram models to learn and induce rules from strings.
2. Probabilistic models use n-grams for spell checking and correcting misspelled words.
3. Compression of data.
4. Optical character recognition (OCR), Machine Translation (MT) and Intelligent Character Recognition (ICR) use n-grams to compute the probability of a word sequence or generally a pattern sequence.
5. Identify the language of a text (demo here)
6. Identify the species given a DNA sample.

For example you can compute the probability of a sequence by multiplying the conditional probabilities of each word given all its predecessors, P(wk|w1, ..., wk-1), but if one of these factors is zero, the whole product will be zero too. This is a huge problem, since such long sequences are hardly ever seen in corpora, even if you take the internet, e.g. "The world, as we know it, will be changed by the pollution of the environment". Therefore an n-gram model only looks at the last n-1 words (for a bigram model just the direct predecessor), which makes it possible to estimate the probability from counts. Another application for n-grams can be found in Part of Speech tagging and probabilistic disambiguation of tags, e.g. the probability of "book/NN the/DT flight/NN" versus the probability of "book/VB the/DT flight/NN".
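
To make the bigram case concrete, here is a minimal sketch that estimates P(wk|wk-1) as count(wk-1 wk) / count(wk-1) from a toy corpus and multiplies these estimates over a sentence. The corpus, the "<s>" start marker and the function names are assumptions made up for this example; there is no smoothing, so a single unseen bigram still makes the whole product zero. It uses collections.Counter, which needs Python 2.7 or newer.

```python
from collections import Counter

START = "<s>"  # hypothetical sentence-start marker

def train_bigram_model(sentences):
    """Count bigrams and how often each word occurs as a left context."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        tokens = [START] + sentence.split()
        for previous, current in zip(tokens, tokens[1:]):
            context_counts[previous] += 1
            bigram_counts[(previous, current)] += 1
    return context_counts, bigram_counts

def sentence_probability(sentence, context_counts, bigram_counts):
    """Multiply the unsmoothed bigram estimates P(current | previous)."""
    tokens = [START] + sentence.split()
    probability = 1.0
    for previous, current in zip(tokens, tokens[1:]):
        if context_counts[previous] == 0:
            return 0.0  # context never seen in the corpus
        probability *= (bigram_counts[(previous, current)]
                        / float(context_counts[previous]))
    return probability

if __name__ == "__main__":
    corpus = ["I will eat fish", "I will eat rice", "I will sleep"]
    contexts, bigrams = train_bigram_model(corpus)
    print(sentence_probability("I will eat fish", contexts, bigrams))
    print(sentence_probability("I will eat sleep", contexts, bigrams))
```

With this toy corpus the first sentence gets 1 * 1 * 2/3 * 1/2 = 1/3, while the second gets 0 because "eat sleep" never occurs.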

I wrote a very simple program to predict the next word given a sequence of words in a corpus, e.g. input: "I will eat"; output: "fish". You can find it here.

Another program concerning n-grams, which I wrote, is available here. It extracts proper nouns, e.g. "New York City", from English texts.
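
The basic idea behind such a predictor can be sketched in a few lines. This is not the linked program, just a minimal illustration under my own assumptions: count which word follows each context of n-1 words and return the most frequent follower. The names build_successor_table and predict_next and the toy corpus are made up for this example.

```python
from collections import Counter, defaultdict

def build_successor_table(corpus, n=2):
    """Map each context of n-1 words to a Counter of the words following it."""
    if n < 2:
        raise ValueError("n must be at least 2")
    tokens = corpus.split()
    table = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        table[context][tokens[i + n - 1]] += 1
    return table

def predict_next(table, words, n=2):
    """Return the most frequent follower of the last n-1 words, or None."""
    context = tuple(words.split()[-(n - 1):])
    followers = table.get(context)
    if not followers:
        return None  # context never seen in the corpus
    return followers.most_common(1)[0][0]

if __name__ == "__main__":
    corpus = "I will eat fish . You will eat fish . I will eat rice ."
    table = build_successor_table(corpus, n=3)
    print(predict_next(table, "I will eat", n=3))
```

Here the context "will eat" is followed by "fish" twice and by "rice" once, so the sketch predicts "fish".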