Sunday 13 December 2009

Zipf's law

Last Friday I gave a presentation on Zipf's law, primarily concerned with processing frequency lists and spectra in R with zipfR. The scripts contain two Python programs for extracting frequencies from NLTK's built-in Gutenberg selection corpus and from section J of the ACL Anthology corpus. If you don't have access to the ACL Anthology, I provide the processed TFL (type frequency list) and SPC (frequency spectrum) files for both corpora in the ZIP file.

Download:
[Slides]
[Scripts]
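
For readers without the scripts at hand, here is a minimal Python 3 sketch of what a type frequency list and a frequency spectrum are. It is not the code from the ZIP file: the tokenizer, the sample sentence and the function names are made up for illustration, and the column names (f/type and m/Vm) follow my understanding of the zipfR file formats, so check them against the zipfR documentation before feeding such output to R.

```python
from collections import Counter
import re


def type_frequency_list(text):
    """Count how often each lower-cased word type occurs in the text."""
    # Crude tokenizer (an assumption): runs of ASCII letters only.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


def frequency_spectrum(tfl):
    """Map each frequency m to V_m, the number of types occurring exactly m times."""
    return Counter(tfl.values())


sample = "the cat and the dog and the bird"
tfl = type_frequency_list(sample)
spc = frequency_spectrum(tfl)

# Tab-separated type frequency list, most frequent types first.
print("f\ttype")
for word, f in sorted(tfl.items(), key=lambda item: (-item[1], item[0])):
    print(f"{f}\t{word}")

# Tab-separated frequency spectrum, ordered by frequency class m.
print("m\tVm")
for m, vm in sorted(spc.items()):
    print(f"{m}\t{vm}")
```

In the sample, "the" occurs three times, "and" twice and the other three words once, so the spectrum has three types in frequency class 1 and one type each in classes 2 and 3.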

Sunday 11 October 2009

Some NLP-related Python code

1. A program that computes the Flesch score of a text. Code here. I don't know whether the syllables are counted correctly.

2. A program that searches a wordlist for minimal pairs. Code here. Example here. The format of the wordlist is restrictive and the minimal pairs are printed twice!

3. A program that obfuscates the input: the first and last letter of each word stay in place, but everything in between is shuffled. Code here.

4. A program that constructs a tree from a file and searches for the lowest common ancestor of two nodes. Code here. Example here.
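
Regarding the minimal pair search in item 2, one way to avoid printing every pair twice is to iterate over unordered combinations instead of comparing every word with every other word in both directions. The following is a Python 3 sketch under a simplifying assumption, not the linked program: a "minimal pair" here means two spellings of equal length that differ in exactly one position, which only approximates the phonological definition. The function names and the wordlist are invented for illustration.

```python
from itertools import combinations


def is_minimal_pair(a, b):
    """True if a and b have equal length and differ in exactly one position."""
    if len(a) != len(b):
        return False
    return sum(1 for x, y in zip(a, b) if x != y) == 1


def minimal_pairs(words):
    """Return each minimal pair exactly once, in alphabetical order."""
    unique = sorted(set(words))
    # combinations() yields (a, b) but never (b, a), so nothing is doubled.
    return [(a, b) for a, b in combinations(unique, 2) if is_minimal_pair(a, b)]


wordlist = ["pin", "bin", "pen", "ban", "spin"]
for a, b in minimal_pairs(wordlist):
    print(a, b)
```

The comparison is quadratic in the size of the wordlist, which is fine for small lists but would need an index for a full dictionary.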

Monday 28 September 2009

Python-based RPN Evaluator

This program evaluates logical expressions written in Reverse Polish Notation (RPN). The truth values of the variables are read from a text file (the "world" file).

Example world file:
wind
/sun
/rain
red

wind and red have the value 1; sun and rain have the value 0, since they are prefixed by "/".
Here's the syntax to run the program: "python log.py myworld.world".
It quits when an empty expression is entered.

Example usage:
C:\Python26>python log.py myworld.world
Logical Expression: rain sun &
0
Logical Expression: sun red |
1
Logical Expression: sun wind ^
True
Logical Expression: winter sun &
*** Error while evaluating: Bad name: 'winter'.
Logical Expression: sun red
*** Error while evaluating: Unbalanced expression: 'sun red'.
Logical Expression: sun red red |
*** Error while evaluating: Unbalanced expression: 'sun red red |'.
Logical Expression:

C:\Python26>



Find the source code here.
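
For the curious, the core of such an evaluator fits in a few lines. This is a Python 3 sketch of the general stack technique, not the linked source: the function names are invented, I assume "^" means exclusive or, and the error messages are only modelled on the session above. The sketch returns 1 for the XOR case where the session above prints True.

```python
def load_world(lines):
    """Read variable names; a leading '/' marks the variable as false (0)."""
    world = {}
    for line in lines:
        name = line.strip()
        if not name:
            continue
        if name.startswith("/"):
            world[name[1:]] = 0
        else:
            world[name] = 1
    return world


# Binary operators on 0/1 values; '^' as XOR is an assumption.
OPERATORS = {
    "&": lambda a, b: a & b,
    "|": lambda a, b: a | b,
    "^": lambda a, b: a ^ b,
}


def evaluate(expression, world):
    """Evaluate an RPN expression with a stack; raise ValueError on bad input."""
    stack = []
    for token in expression.split():
        if token in OPERATORS:
            if len(stack) < 2:
                raise ValueError(f"Unbalanced expression: {expression!r}")
            right = stack.pop()
            left = stack.pop()
            stack.append(OPERATORS[token](left, right))
        elif token in world:
            stack.append(world[token])
        else:
            raise ValueError(f"Bad name: {token!r}")
    # A well-formed expression leaves exactly one value on the stack.
    if len(stack) != 1:
        raise ValueError(f"Unbalanced expression: {expression!r}")
    return stack[0]


world = load_world(["wind", "/sun", "/rain", "red"])
print(evaluate("rain sun &", world))
print(evaluate("sun red |", world))
print(evaluate("sun wind ^", world))
```

Operands are pushed onto the stack and every operator pops two values and pushes one result, so an expression is balanced exactly when one value remains at the end.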

Friday 25 September 2009

Download all SMBC Comics

A simple regex-based brute-force program that saves all comics from http://www.smbc-comics.com/. You'll need http://commons.apache.org/io/ and my source code.

Tuesday 15 September 2009

Google Cheat Sheet 0.11

Wrote a Google Cheat Sheet: http://rapidshare.com/files/280485137/gcs.pdf

It's simple and contains every working function in Google Search, Groups, News, Calculator. What's missing? Query suggestions...

Saturday 15 August 2009

Poor Networks, Neurons and Lookaheads

Syntactic networks bear similarities to biological networks since they are scale-free, i.e. the number of edges per node follows a power law (e.g. social networks), and small-world, i.e. most nodes can be reached from any other node in a relatively small number of steps (e.g. social networks):


From Wikipedia [EN] [ES]


A group of researchers at the Institute of Applied Linguistics in Beijing, China tried to find similarities between semantic and syntactic networks via a statistical approach and a treebank with semantic roles. Both networks are small-world and scale-free graphs, but they differ in hierarchical structure and k-nearest-neighbour correlation. Semantic networks also tend to create longer paths, which makes them a poorer hierarchy in comparison to syntactic networks: Statistical properties of Chinese semantic networks


Temporal fluctuations in speech are easily corrected by our brain. For decades this mechanism was a mystery. Two researchers at the Hebrew University of Jerusalem, Israel, described how neurons adjust to decode distorted sound perfectly. Although I don't understand this very technical paper, it will perhaps provide new algorithms for speech processing: Time-Warp-Invariant Neuronal Processing

Another improvement for speech recognition and production was achieved by the Max Planck Society, which developed a new mathematical model. It's based on the look-ahead assumption, i.e. our brain tries to estimate the most probable sound sequence based on previous information, e.g. 'hot su...' = 'sun' > 'supper': Recognizing Sequences of Sequences

Tuesday 28 July 2009

These little annoying and surprising things...

...concerning Python as a language of "very clear syntax which emphasizes code readability" (Wikipedia):

A = Annoying == Anti-Python

1. The naming conventions of...:

a) ...package-files/magic members: __init__.py, def __del__, __name__
b) ...visibility modifiers: _protected and __private

So _ and __ in general. Really, why did Guido do this? Is there an explanation? Perhaps it's inherited from another language?

2. The verbose...:

a) ...object inheritance declaration of each class: class Standard(object)
b) ..."self"-reference of each class-value/-constructor/-function just to indicate that it's non-static: def compute(self, number), self.Radius, def __init__(self)


(3. Multiple inheritance: It's no coincidence that most languages don't support multiple inheritance. Normally, you don't need it and it is a trap which makes debugging almost impossible. It is definitely not a feature for a language which emphasizes code readability and clear syntax.)




S = Surprising


(Powerful ability to handle and process strings in general.)

1. Lambda/Anonymous functions: (lambda x, y : x + y)
2. Managed Attributes: property([fget[, fset[, fdel[, doc]]]])
3. Great modification abilities due to magic members/methods and type-emulation.
4. List comprehension, generator expressions and yield.
5. Function decorators. Java has this one too.
6. Localization module. It's neat and easy to localize your programs.
7. Parallel computing module!
8. Awesome network protocol capabilities.
9. Unit tests.
10. The best documentation I've ever seen.
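
To make items 1, 2, 4 and 5 of the list above a bit more concrete, here is a small Python 3 sketch. All names in it (add, Circle, logged, square, countdown) are invented for illustration; only lambda, property, functools.wraps, list comprehensions, generator expressions and yield are the actual language and library features being shown.

```python
import functools

# 1. Lambda/anonymous function bound to a name.
add = lambda x, y: x + y


# 2. Managed attribute: reads and writes go through property methods.
class Circle(object):
    def __init__(self, radius):
        self._radius = radius

    @property
    def radius(self):
        return self._radius

    @radius.setter
    def radius(self, value):
        if value < 0:
            raise ValueError("radius must not be negative")
        self._radius = value


# 5. Function decorator that records the arguments of every call.
def logged(func):
    calls = []

    @functools.wraps(func)
    def wrapper(*args):
        calls.append(args)
        return func(*args)

    wrapper.calls = calls
    return wrapper


@logged
def square(n):
    return n * n


# 4. Generator function using yield.
def countdown(n):
    while n > 0:
        yield n
        n -= 1


squares = [square(n) for n in range(4)]   # list comprehension
total = sum(n for n in countdown(3))      # generator expression
print(add(2, 3), squares, total, len(square.calls))
```

The decorator keeps the wrapped function's name thanks to functools.wraps, which is handy when debugging decorated code.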

Sunday 26 July 2009

Blagh

1. Blagh:
  • I read the post "Crying" again and must say that Java is definitely my favourite programming language.
  • I've completely forgotten my Python, which is a bad thing since I'll have to use it in an exam in a few months.
  • I got attached to Twitter. It's so convenient and you can watch it on the panel of my blog.
  • I read Russell's Introduction to Mathematical Philosophy. Very interesting.

2. Pathfinder:

I had to write a small Java program that implements Dijkstra's shortest path algorithm in order to find the shortest path between two cities. You can download the source code here. You can add nodes and edges to the data.txt and the user interface is console-based.
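
For comparison, here is how the same idea looks as a short Python 3 sketch using a priority queue. It is not a port of my Java program: the graph format, the city names A to D and the function name are made up for illustration, and it does not read data.txt.

```python
import heapq


def shortest_path(graph, start, goal):
    """Return (distance, path) from start to goal, or (None, []) if unreachable."""
    # Queue entries: (distance so far, current node, path taken to reach it).
    queue = [(0, start, [start])]
    visited = set()
    while queue:
        dist, node, path = heapq.heappop(queue)
        if node == goal:
            return dist, path
        if node in visited:
            continue  # a shorter route to this node was already processed
        visited.add(node)
        for neighbour, weight in graph.get(node, {}).items():
            if neighbour not in visited:
                heapq.heappush(queue, (dist + weight, neighbour, path + [neighbour]))
    return None, []


# Hypothetical road map: each city maps to its neighbours and distances.
roads = {
    "A": {"B": 4, "C": 1},
    "B": {"A": 4, "C": 2, "D": 5},
    "C": {"A": 1, "B": 2, "D": 8},
    "D": {"B": 5, "C": 8},
}
print(shortest_path(roads, "A", "D"))
```

The algorithm assumes non-negative edge weights, which holds for distances between cities.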


3. Imageboardsave:

Currently, I'm writing a program which downloads every image and Rapidshare URL in a thread of an imageboard. The program works for AnonIB at the moment, and the user can specify how many pages to search. The program scans the pages for threads and downloads the content to:

/threadID/picturefilename.something


and the Rapidshare-URLs to:

/threadID/rapidshare.txt


Subthreads, i.e. threads that span more than one page, are handled as well. It even has a graphical user interface programmed in Java Swing. Actually, it runs pretty well right now, but I don't want to release it yet.

Wednesday 20 May 2009

Language Guesser and OpenNLP Pipe

1. The first program I wrote estimates the language of a document, based on a simple statistical bigram model. It takes five command-line arguments with the following syntax:

java Main trainfile_lang_1 trainfile_lang_2 trainfile_lang_3 trainfile_lang_4 testfile_lang

e.g.

java Main ./europarl/train/de/ep-01-de.train ./europarl/train/en/ep-01-en.train ./europarl/train/es/ep-01-es.train ./europarl/train/fr/ep-01-fr.train ./europarl/test/de/ep-01-de.test

and writes to the system standard output: "Die Sprache ist wahrscheinlich Deutsch." (The language is probably German.)

You see, this is very static and perhaps I'll make it more dynamic in the future. It was originally created to estimate the language of a protocol from the European parliament. You can download it here, but please be aware that the comments are in German.
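
The idea behind such a guesser can be sketched in a few lines of Python 3. This is not a port of the Java program: I assume character bigrams with add-one smoothing here (the original may well work differently), the tiny training strings stand in for the Europarl files, and all function names are invented.

```python
import math
from collections import Counter


def bigrams(text):
    """Split a text into overlapping, lower-cased character bigrams."""
    text = text.lower()
    return [text[i:i + 2] for i in range(len(text) - 1)]


def train(text):
    """Return the bigram counts and the total number of bigrams."""
    counts = Counter(bigrams(text))
    return counts, sum(counts.values())


def log_likelihood(text, model, vocabulary_size):
    """Sum of log probabilities of the text's bigrams, with add-one smoothing."""
    counts, total = model
    score = 0.0
    for bg in bigrams(text):
        score += math.log((counts[bg] + 1) / (total + vocabulary_size))
    return score


def guess_language(text, models):
    """Pick the language whose model gives the text the highest likelihood."""
    vocabulary = set(bigrams(text))
    for counts, _ in models.values():
        vocabulary.update(counts)
    size = len(vocabulary)
    return max(models, key=lambda lang: log_likelihood(text, models[lang], size))


# Toy training data (an assumption; real models need far more text).
training = {
    "de": "die sprache ist wahrscheinlich deutsch und das ist gut",
    "en": "the language is probably english and that is nice",
}
models = {lang: train(text) for lang, text in training.items()}
print(guess_language("das ist die sprache", models))
print(guess_language("this is the language", models))
```

Because the models live in a dictionary, adding a fifth language only means adding another training text, which would remove the static four-language limit mentioned above.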


2. The second program is more interesting. It takes an XML file with tags and everything and writes the sentences to a file in the following format:

token TAB tag TAB chunk

e.g. for the sentence "Is this love?"

Is VBZ O
this DT B-NP
love NN O
? . O

It just takes the path to the XML file as a command-line argument. To run the program you'll need two things: Ant and the OpenNLP models. You can download the program here.

/*
* To ensure the functionality of the program, the models EnglishChunk.bin.gz,
* EnglishSD.bin.gz, EnglishTok.bin.gz, tag.bin.gz and tagdict.htm have to be
* in the models directory.
*
* IMPORTANT:
* The models have to be downloaded and placed in the models directory
* otherwise the program won't work. The download links can be found in
* ./model/readme.txt.
*
* Install:
*
* 1. Download the models at OpenNLP: http://opennlp.sourceforge.net/
* 2. Run Ant
* 3. Start the program with java -jar NLPipe.jar XMLfile
* 4. Optionally you can run the program with: ant -Dargs="XMLfile" run
* 5. Optionally you can archive the basedir with: ant tar
*/

Friday 10 April 2009

Recap

This is going to be a very boring semester. Java is really interesting but also complicated for a beginner. Compared to Python, it feels like a mature language and you have to think (more) about what you're doing. Formal Syntax is complex and time-consuming, especially for me, since I'm not into syntax. I'm more interested in semantics, and therefore Logic is quite fancy - although sometimes it's as boring as math. Acoustic Phonetics hasn't started yet, and Artificial Intelligence is fun but often too superficial - we aren't concerned with Natural Language Processing. Hence I plan to do a small pragmatic series on AI to recap what I learn in the course. I also took two courses in English: Translation into German (boring as hell) and English to the 1700 century (boring lecturer), but they aren't worth mentioning.

Finally, there's no post in my blogroll that could spark my interest, except - well, in a not so positive way - for a news item on EurekAlert. This is one of the cases which I'd entitle 'the most unspectacular findings - that are no findings because everybody already knows them - in science'. The study says that music is culturally independent when it comes to conveying (certain) emotions. Well, everyone who's listened to a song in another language knows this. The study goes one step further and investigates the influence of music from fundamentally different cultures: Language of music really is universal, study finds

Monday 16 March 2009

Points of Interest 16. March, 2009

1. Joint attention and the difference between we and I might be the key in order to understand language evolution.
My opinion is that language evolved to set borders. The border of I; my person, my belongings, my needs and thoughts, my standing within society - in opposition to others.
The border of we; our tribe, our race, our territory, our hunting grounds, our ideals - in opposition to others.
Language is a cultural and psychological necessity to govern, identify, interact and define oneself as things get more complex. A biological necessity to define entities in time and space and to interact with them in a complex way on different planes. A way to optimize doing things, like hunting strategies, complex hierarchies, standings to other tribes: A Tale Without Episodes

2. A very strange article. As a computational linguist I'm very interested in how they compute the evolution of a language. I can't imagine how this could work. There are several factors which seem to be just too random to be calculated. How did they trace the words? Did they use OCR to scan documents and match words? There is this particular paragraph, which I don't understand at all:

"Looking to the future, the less frequently certain words are used, the more likely they are to be replaced. Other simple rules have been uncovered - numerals evolve the slowest, then nouns, then verbs, then adjectives. Conjunctions and prepositions such as 'and', 'or', 'but' , 'on', 'over' and 'against' evolve the fastest, some as much as 100 times faster than numerals."

What does evolve mean here? Change? How can prepositions change? These things are bound by perception and are, in my opinion, as static as numerals: Scientists discover oldest words in the English language and predict which ones are likely to disappear in the future

3. I'm quite into morbid things, like mummies and anatomical stuff. Lately, Pink Tentacle wrote a post about monster mummies and living mummies - formerly known as living Buddhist monks. They killed themselves (slowly) over a span of 3000 days and are now relics of a sort: Monster mummies of Japan

4. Does your language influence your preference in music? Language and music are associated: Why music sounds right - the hidden tones in our own speech

5. A very basic lecture about language held at Yale University. It covers fundamentals of linguistics - Pidgin & Creoles, language as human trait, language universals, (innate) language capacity, phonetics, morphology, syntax, semantics (in this order) and so on: How Do We Communicate?: Language in the Brain, Mouth and the Hands

Wednesday 18 February 2009

Crying

Vacation! Last semester was very intense, no time for anything. Gosh, and now I'm trying to learn Java... It's far more complicated than I imagined. Python was neat and easy - in comparison - but Java is far more potent, so they say. I don't quite understand why I have to write code which seems to be superfluous - at least for a beginner like me, e.g.

public class HelloWorld
{
    public static void main(String argv[])
    {
        System.out.println("Hello World!");
    }
}

is equivalent to

print "Hello World!"

in Python.

And next semester I'm going to die for sure... Here're my courses:

Formal Syntax: I'm more the Morphology, Phonology, Semantics type. Actually Syntax is the only thing which gives me problems.
Java: Well, it's going to be hard I'd say.
Logic: I like formal logic, really.
Artificial Intelligence: Very very interesting. Inference, neural networks, genetic algorithms and that stuff - you know...
Acoustic Phonetics: This sounds good: reading spectrograms and getting in touch with VoiceXML.

Tuesday 17 February 2009

Points of Interest 17. February, 2009

1. 20 years in prison or a death sentence for translating the Qur'an? Welcome to our illustrious humanistic and democratic circle, dear Afghanistan: The dangers of translation

2. Help to classify galaxies and work for science, now! You'll be shown pictures of galaxies and asked questions about their spiral arms and other features. A task which you can do better than a computer and will help astronomy: Galaxy Zoo 2

3. The 2009 Jōyō Kanji update is coming up and the two favourites are 俺 (ore, which means informal "I") and 誰 (dare, question pronoun "who?"). I think "dare" is quite common these days, even in formal context. Of course one can say donata which is more formal: Jōyō list to level up

4. Speech perception is much more than hearing sounds. Several senses other than hearing are involved: Read my lips: Using multiple senses in speech perception

5. Steven Pinker explains the critical points of his book "The Blank Slate":