
Sunday, 13 December 2009

Zipf's law

Last Friday I gave a presentation on Zipf's law, primarily concerned with processing frequency lists and spectra in R and zipfR. The scripts include two Python programs for extracting frequencies from NLTK's internal Gutenberg selection corpus and from section J of the ACL Anthology corpus. If you don't have access to the ACL, I provide the processed TFL and SPC files for both corpora in the ZIP file.
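Just to illustrate the extraction step (not the actual scripts in the ZIP file), here is a minimal sketch that counts word frequencies in NLTK's Gutenberg selection and writes a simple frequency list; the exact TFL layout that zipfR expects is an assumption here, so check the zipfR documentation for the real format:

# Minimal sketch: count word frequencies in NLTK's Gutenberg selection and
# write a simple "frequency TAB type" list (the precise zipfR TFL layout is
# an assumption -- see the zipfR documentation for the real format).
from collections import Counter
from nltk.corpus import gutenberg

freqs = Counter(w.lower() for w in gutenberg.words() if w.isalpha())

with open("gutenberg.tfl", "w") as out:
    for word, freq in freqs.most_common():
        out.write("%d\t%s\n" % (freq, word))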

Download:
[Slides]
[Scripts]

Sunday, 11 October 2009

Some NLP-related Python code

1. A program that computes the Flesch reading-ease score of a text. Code here. I'm not sure whether the syllables are counted correctly. (A sketch of the formula follows this list.)

2. A program that searches a wordlist for minimal pairs. Code here. Example here. The format of the wordlist is restrictive and each minimal pair is printed twice!

3. A program that obfuscates its input: the first and last letter of each word stay in place and everything in between is shuffled. Code here.

4. A program that builds a tree from a file and finds the lowest common ancestor of two nodes. Code here. Example here.
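For reference, the standard Flesch reading-ease formula for English is 206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words). A rough sketch of it (not the linked program), with a deliberately naive syllable counter, could look like this:

# Rough sketch of the Flesch reading-ease score (not the linked program).
# Counting syllables as vowel groups is a crude assumption.
import re

def count_syllables(word):
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_score(text):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = text.split()
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * len(words) / sentences - 84.6 * syllables / len(words)

print(flesch_score("The cat sat on the mat. It was happy."))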

Saturday, 15 August 2009

Poor Networks, Neurons and Lookaheads

Syntactic networks bear similarities to biological networks: they are scale-free, i.e. their degree distribution follows a power law, and small-world, i.e. most nodes can be reached from any other in a relatively small number of steps (social networks are a typical example of both properties):


From Wikipedia [EN] [ES]


A group of researchers at the Institute of Applied Linguistics in Beijing, China tried to find similarities between semantic and syntactic networks using a statistical approach and a treebank annotated with semantic roles. Both networks turn out to be small-world and scale-free, but they differ in hierarchical structure and k-nearest-neighbour correlation, and semantic networks tend to have longer paths, which makes their hierarchy poorer than that of syntactic networks: Statistical properties of Chinese semantic networks
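If you want to play with these two properties yourself, here is a small, hypothetical sketch using networkx on a synthetic scale-free graph (a stand-in, not the treebank data from the paper):

# Hypothetical sketch: inspect scale-free and small-world properties on a
# synthetic Barabasi-Albert graph (a stand-in for a real syntactic network).
import networkx as nx

G = nx.barabasi_albert_graph(1000, 2)

degree_counts = nx.degree_histogram(G)     # index = degree, value = number of nodes
print(degree_counts[:10])                  # heavy head, long tail -> power law
print(nx.average_shortest_path_length(G))  # small value -> small-world
print(nx.average_clustering(G))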


Temporal fluctuations in speech are easily corrected by our brain. For decades this mechanism was a mystery. Two researchers at the Hebrew University of Jerusalem, Israel, describe how neurons adjust to decode distorted sound perfectly. Although I don't understand this very technical paper, it may provide new algorithms for speech processing: Time-Warp-Invariant Neuronal Processing

Another improvement for speech recognition and production was achieved by the Max Planck Society, which developed a new mathematical model. It's based on a look-ahead assumption, i.e. our brain tries to estimate the most probable sound sequence based on previous information, e.g. given 'hot su...', 'sun' is more likely than 'supper': Recognizing Sequences of Sequences

Wednesday, 20 May 2009

Language Guesser and OpenNLP Pipe

1. The first program I wrote estimates the language of a document, based on a simple statistical bigram model. It takes five command-line arguments with the following syntax:

java Main trainfile_lang_1 trainfile_lang_2 trainfile_lang_3 trainfile_lang_4 testfile_lang

e.g.

java Main ./europarl/train/de/ep-01-de.train ./europarl/train/en/ep-01-en.train ./europarl/train/es/ep-01-es.train ./europarl/train/fr/ep-01-fr.train ./europarl/test/de/ep-01-de.test

and writes to standard output: "Die Sprache ist wahrscheinlich Deutsch." (The language is probably German.)

As you can see, this is very static, and perhaps I'll make it more dynamic in the future. It was originally created to estimate the language of proceedings from the European Parliament. You can download it here, but please be aware that the comments are in German.
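The program itself is in Java; purely to illustrate the underlying idea, a minimal character-bigram guesser in Python could look roughly like this (file handling, the Europarl data and add-one smoothing are simplifying assumptions, not how the Java program actually works):

# Minimal sketch of character-bigram language guessing (not the Java program).
# train_texts maps a language label to raw training text; add-one smoothing is
# a simplifying assumption.
import math
from collections import Counter

def char_bigrams(text):
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def guess_language(train_texts, test_text):
    scores = {}
    for lang, text in train_texts.items():
        counts = char_bigrams(text)
        total = sum(counts.values())
        vocab = len(counts) + 1
        score = 0.0
        for i in range(len(test_text) - 1):
            bigram = test_text[i:i + 2]
            score += math.log((counts[bigram] + 1.0) / (total + vocab))
        scores[lang] = score
    return max(scores, key=scores.get)

For example, guess_language({"de": german_training_text, "en": english_training_text}, unknown_text) would return the best-scoring language label (the variable names here are hypothetical).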


2. The second program is more interesting. It takes an XML file, tags and all, and writes the sentences to a file in the following format:

token TAB tag TAB chunk

e.g. for the sentence "Is this love?"

Is VBZ O
this DT B-NP
love NN O
? . O

It just takes the path to the XML file as a command-line argument. To run the program you'll need two things: Ant and the OpenNLP models. You can download the program here.

/*
* To ensure the functionality of the program, the models EnglishChunk.bin.gz,
* EnglishSD.bin.gz, EnglishTok.bin.gz, tag.bin.gz and tagdict.htm have to be
* in the models directory.
*
* IMPORTANT:
* The models have to be downloaded and placed in the models directory
* otherwise the program won't work. The download links can be found in
* ./model/readme.txt.
*
* Install:
*
* 1. Download the models at OpenNLP: http://opennlp.sourceforge.net/
* 2. Run Ant
* 3. Start the program with java -jar NLPipe.jar XMLfile
* 4. Optionally you can run the program with: ant -Dargs="XMLfile" run
* 5. Optionally you can archive the basedir with: ant tar
*/
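For comparison only, not as a replacement for the OpenNLP pipe above, a rough NLTK-based sketch in Python that produces the same token TAB tag TAB chunk lines for a single sentence might look like this (the NP chunk grammar is a toy assumption):

# Rough NLTK-based sketch of the same token TAB tag TAB chunk output
# (not the OpenNLP-based Java program described above).
import nltk
from nltk.chunk import RegexpParser, tree2conlltags

chunker = RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")   # toy NP grammar (assumption)

def to_conll(sentence):
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    tree = chunker.parse(tagged)
    return "\n".join("%s\t%s\t%s" % triple for triple in tree2conlltags(tree))

print(to_conll("Is this love?"))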

Wednesday, 18 February 2009

Crying

Vacation! Last semester was very intense, no time for anything. Gosh, and now I'm trying to learn Java... It's far more complicated than I imagined. Python was neat and easy, in comparison, but Java is far more powerful, so they say. I don't quite understand why I have to write code that seems redundant, at least to a beginner like me, e.g.

public class HelloWorld
{
    public static void main(String argv[])
    {
        System.out.println("Hello World!");
    }
}

is equivalent to

print "Hello World!"

in Python.

And next semester I'm going to die for sure... Here are my courses:

Formal Syntax: I'm more the morphology, phonology, semantics type. Actually, syntax is the only thing that gives me problems.
Java: Well, it's going to be hard, I'd say.
Logic: I like formal logic, really.
Artificial Intelligence: Very, very interesting. Inference, neural networks, genetic algorithms and that stuff, you know...
Acoustic Phonetics: This sounds good: reading spectrograms and getting in touch with VoiceXML.

Friday, 19 December 2008

Natural Language Processing Online Applications

I want to present an interesting list of links to free, interactive, NLP-related online applications:

1. XLE Web Interface allows you to parse sentences in German, English, Norwegian, Welsh, Malagasy and Arabic. You get a very detailed parse tree and the functional structure of the sentence, e.g. for "This is madness!".



2. Wortschatz Leipzig is a German application that searches the web for a word and returns a detailed analysis of its frequency, collocations and semantic relations. The word graphs are the most interesting part, e.g. the graph for "Humbug" (German for "rubbish").



3. WordNet is a large lexical database for English; looking up "house", for example, shows all its senses (a small code sketch follows this list).



4. Answerbus is a search engine like Google or Yahoo but with semantics! You can ask natural questions like "Who killed JFK?" and will (perhaps) get the answer "Oswald killed JFK". Perhaps... because the system actually sucks and you can easily outmaneuver it. Another search engine is START, which sucks too.

5. Wordfall is an awesome linguistic game! It's like Tetris, but instead of blocks you have to match falling words to their constituents.



6. Wortwarte is a German site about neologisms in the media. They are collected and sorted.

7. A cool German chatbot called ELBOT. It would definitely pass my Turing Test.

8. Think of a thing and 20Q will read your mind by asking 20 questions.

9. Machine Translation is one of the prime disciplines of NLP. Everyone knows Babelfish: it's not only the translating fish in the Hitchhiker's Guide but also an online translator like Google Translate.

10. TextCat is a language guesser implemented as an n-gram-based Perl script. Another, better one is the XRCE language guesser.
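As mentioned in item 3, WordNet can also be queried offline. Here is a small sketch using NLTK's WordNet corpus reader (assuming NLTK 3 and the downloaded wordnet data):

# Small sketch: list the WordNet senses of "house" via NLTK
# (assumes the wordnet corpus has been fetched with nltk.download).
from nltk.corpus import wordnet as wn

for synset in wn.synsets("house"):
    print(synset.name(), "-", synset.definition())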

Wednesday, 3 December 2008

N-Gram M-Adness

One of the basic concepts of Natural Language Processing is the n-gram model. It splits a sequence into subsequences of length n, e.g.

(1) Now the lord once decided to set off for the mountain where the man lives

For n = 1 (unigram) the sentence is split into:

(n = 1) [ [Now] [the] [lord] [once] [decided] [to] [set] [off] [for] [the] [mountain] [where] [the] [man] [lives] ]

For n = 2 (bigram) the sentence is split into:

(n = 2) [ [Now the] [the lord] [lord once] [once decided] [decided to] [to set] [set off] [off for] [for the] [the mountain] [mountain where] [where the] [the man] [man lives] ]

For n = 3 (trigram) the sentence is split into:

(n = 3) [ [Now the lord] [the lord once] [lord once decided] [once decided to] [decided to set] [to set off] [set off for] [off for the] [for the mountain] [the mountain where] [mountain where the] [where the man] [the man lives] ]

Okay, you get the idea. For a sequence of words w1, ..., wk, each word wk is paired with its predecessor wk-1 for bigrams, with wk-2, wk-1 for trigrams, and with wk-n+1, ..., wk-1 in general.

Here's the relevant Python code for making n-grams:

def makeNGrams(inpStr, n):
    # split the input on whitespace and collect every window of n consecutive tokens
    token = inpStr.split()
    nGram = []
    for i in range(len(token)):
        if i+n > len(token):
            break
        nGram.append(token[i:n+i])
    return nGram


Or a bit more condensed:

def makeNGrams(inpStr, n):
    inpStr = inpStr.split()
    return [inpStr[i:n+i] for i in range(len(inpStr)) if len(inpStr)>=i+n]


Why do you need this?

1. Machine Learning uses n-gram models to learn and induce rules from strings.
2. Probabilistic models use n-grams for spell checking and correcting misspelled words.
3. Compression of data.
4. Optical character recognition (OCR), Machine Translation (MT) and Intelligent Character Recognition (ICR) use n-grams to compute the probability of a word sequence or generally a pattern sequence.
5. Identify the language of a text (demo here)
6. Identify the species given a DNA sample.

For example, you can compute the probability of a sequence with the chain rule, multiplying the conditional probabilities P(wk|w1, ..., wk-1) of each word given its whole history. But if one of these conditional probabilities is zero, the whole product is zero too. This is a huge problem, since long histories are hardly ever seen in corpora, even if you take the whole internet, e.g. "The world, as we know it, will be changed by the pollution of the environment". Therefore we condition only on the direct predecessor(s) by using an n-gram model and can still estimate the probability. Another application of n-grams is part-of-speech tagging and probabilistic disambiguation of tags, e.g. the probability of "book/NN the/DT flight/NN" versus the probability of "book/VB the/DT flight/NN".
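As a small illustration of the bigram approximation, the following sketch estimates P(w1) * P(w2|w1) * ... * P(wk|wk-1) from raw counts over a toy corpus (sentence (1) above); there is no smoothing, so an unseen bigram still makes the whole product zero:

# Sketch of the unsmoothed bigram approximation over a toy one-sentence corpus.
from collections import Counter

corpus = ("now the lord once decided to set off for the mountain "
          "where the man lives").split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_probability(sentence):
    words = sentence.lower().split()
    prob = 1.0 * unigrams[words[0]] / len(corpus)        # P(w1)
    for w1, w2 in zip(words, words[1:]):
        prob *= 1.0 * bigrams[(w1, w2)] / unigrams[w1]   # P(w2 | w1)
    return prob

print(bigram_probability("the lord once decided"))   # greater than zero
print(bigram_probability("the man once decided"))    # zero, bigram never seen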

I wrote a very simple program to predict the next word given a sequence of words in a corpus, e.g. input: "I will eat", output: "fish". You can find it here.
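A hypothetical sketch of that idea (not the linked program): pick the word that most often follows the last word of the input, using the same kind of bigram counts as above.

# Hypothetical sketch of next-word prediction (not the linked program):
# choose the word that most frequently follows the last word of the input.
from collections import Counter

corpus = ("now the lord once decided to set off for the mountain "
          "where the man lives").split()
bigrams = Counter(zip(corpus, corpus[1:]))

def predict_next(prefix):
    last = prefix.lower().split()[-1]
    followers = {w2: c for (w1, w2), c in bigrams.items() if w1 == last}
    return max(followers, key=followers.get) if followers else None

print(predict_next("off for the"))   # "lord", "mountain" or "man" (ties broken arbitrarily)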

Another program concerning n-grams, which I wrote, is available here. It extracts proper nouns, e.g. "New York City", from English texts.

Saturday, 31 May 2008

Semantic Web

Today an interesting article was posted on New Scientist about the Semantic Web - unfortunately, not available to me yet. Everybody's talking about Web 2.0 and nobody knows what it's all about. The real thing that's going on at the moment is the so-called Semantic Web project.
Tim Berners-Lee, so to speak the founder of the World Wide Web and head of the W3C, suggested a new dimension of the Internet. (Please note that the Internet and the WWW are not the same thing, but the terms are often used interchangeably, and I do so here as well.) He put it in the following words:

I have a dream for the Web [in which computers] become capable of analyzing all the data on the Web – the content, links, and transactions between people and computers. A 'Semantic Web', which should make this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines. The 'intelligent agents' people have touted for ages will finally materialize. (Wikipedia)

This is, by the way, one topic of my, hopefully, new course of study: computational linguistics. The idea is to categorise and connect information in a way that a machine can understand and interact with humans in a (semi-)natural manner, e.g. via meta-information:

Non-semantic web (Web 1.0 and Web 2.0):

<item>Cat</item>

Semantic web (part of Web 3.0):

<animal Kingdom="Animalia" Phylum="Chordata" Class="Mammalia" Order="Carnivora" Family="Felidae" Genus="Felis">Cat</animal> (Wikipedia)

This should create a serendipity effect, as many users already experience when they go 'wikihopping'. A very advanced usage would be that a user can ask the web:

Person: "Oh, I have a lovely cat at home and could you please tell me the class of my cat?"
Semantic WWW: "Of course sir/madam. You're from Europe, Republic of Ireland so the most likely class would be Mammalia. May I show you some images for verification or do you want to know something more?"

or:

Person: "When did the Berlin wall go down?"
Semantic WWW: "In 1989. Do you want some videos related to the event?"

Because search engines are dumb and can't understand meaning, you'll often receive inaccurate answers:

1. BBC ON THIS DAY | 9 | 1989: The night the Wall came down
2. When did the Berlin Wall go down? - Blurtit
3. Why did the berlin wall come down? - Yahoo! Answers (Google)


The idea of metadata is, in fact, not new. Via HTML, especially Dublin Core, you can provide metadata (via <meta> tags), but almost nobody uses it: it's extra work, search engines no longer use meta tags to index websites, and many browsers don't care at all. This will also be one of the main problems for a global Semantic Web. Many coders/users just don't care about meta-information, so you'll have to make them care, either with prescriptions or by making meta tags attractive to use.