Wednesday, 20 May 2009

Language Guesser and OpenNLP Pipe

1. The first program I wrote estimates the language of a document, based on a simple statistical bigram model. It takes 5 arguments from commandline based on following syntax:

java Main trainfile_lang_1 trainfile_lang_2 trainfile_lang_3 trainfile_lang_4 testfile_lang


java Main ./europarl/train/de/ep-01-de.train ./europarl/train/en/ep-01-en.train ./europarl/train/es/ep-01-es.train ./europarl/train/fr/ep-01-fr.train ./europarl/test/de/ep-01-de.test

and writes to the system standard output: "Die Sprache ist wahrscheinlich Deutsch." (The language is probably German.)

You see, this is very static and perhaps I'll make it more dynamic in the future. It was originally created to estimate the language of a protocol from the European parliament. You can download it here, but please be aware that the comments are in German.

2. The second program is more interesting. It takes a XML file with tags and everything and writes the sentences to a file in following format:

token TAB tag TAB chunk

e.g. for the sentence "Is this love?"

this DT B-NP
love NN O
? . O

It just takes the path to the XML file as commandline argument. To run the program you'll need 2 things. Ant and the OpenNLP models. You can download the program here

* To ensure the functionality of the program, the models EnglishChunk.bin.gz,
* EnglishSD.bin.gz, EnglishTok.bin.gz, tag.bin.gz and tagdict.htm have to be
* in the models directory.
* The models have to be downloaded and placed in the models directory
* otherwise the program won't work. The download links can be found in
* ./model/readme.txt.
* Install:
* 1. Download the models at OpenNLP:
* 2. Run Ant
* 3. Start the program with java -jar NLPipe.jar XMLfile
* 4. Optionally you can run the program with: ant -Dargs="XMLfile" run
* 5. Optionally you can archive the basedir with: ant tar