Last Friday I gave a presentation on Zipf's law, primarily concerned with processing frequency lists and spectra in R and zipfR. The scripts contain two Python programs for extracting frequencies from NLTK's internal Gutenberg selection corpus and from section J of the ACL Anthology corpus. If you don't have access to the ACL corpus, I provide the processed TFL and SPC files for both corpora in the ZIP file.
Sunday, 13 December 2009
Sunday, 11 October 2009
Some NLP-related Python code
1. A program that computes the Flesch score of a text. Code here. I don't know whether the syllables are counted correctly.
2. A program that searches a wordlist for minimal pairs. Code here. Example here. The format of the wordlist is restrictive and the minimal pairs are printed twice!
3. A program that obfuscates the input: the first and last letters of each word stay in place, but everything in between is shuffled. Code here.
4. A program that constructs a tree from a file and searches for the lowest common ancestor of two nodes. Code here. Example here.
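The minimal-pair search from item 2 could be sketched roughly as follows. This is not the linked program, just a small illustration under two assumptions: every character of a word stands for one segment, and a minimal pair is two words of equal length that differ in exactly one position. The function names are made up for this sketch. Using itertools.combinations is one way to keep each pair from being printed twice.

```python
from itertools import combinations


def is_minimal_pair(a, b):
    """True if a and b have the same length and differ in exactly one position."""
    if len(a) != len(b):
        return False
    return sum(1 for x, y in zip(a, b) if x != y) == 1


def minimal_pairs(words):
    """Return each minimal pair once, as a list of (word, word) tuples."""
    unique = sorted(set(words))
    # combinations() yields every unordered pair exactly once, so (a, b) and
    # (b, a) cannot both end up in the result.
    return [(a, b) for a, b in combinations(unique, 2) if is_minimal_pair(a, b)]


if __name__ == "__main__":
    # Hypothetical wordlist, only for demonstration.
    for a, b in minimal_pairs(["pat", "bat", "pit", "pin", "spin"]):
        print(a, b)
```

With real phonetic transcriptions the one-character-per-segment assumption breaks down as soon as a segment is written with more than one symbol, so the words would have to be split into segments first.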
Labels:
Computational Linguistics,
Linguistics,
Programming,
Python
Monday, 28 September 2009
Python-based RPN Evaluator
This program evaluates logical expressions in Reverse Polish Notation (RPN) syntax, using the truth values defined in a text file (the world file).
Example world file:
wind
/sun
/rain
red
wind and red have the value of 1, sun and rain 0 since they are prefixed by "/".
Here's the syntax to run the program: "python log.py myworld.world".
It quits when an empty expression occurs.
Example usage:
C:\Python26>python log.py myworld.world
Logical Expression: rain sun &
0
Logical Expression: sun red |
1
Logical Expression: sun wind ^
True
Logical Expression: winter sun &
*** Error while evaluating: Bad name: 'winter'.
Logical Expression: sun red
*** Error while evaluating: Unbalanced expression: 'sun red'.
Logical Expression: sun red red |
*** Error while evaluating: Unbalanced expression: 'sun red red |'.
Logical Expression:
C:\Python26>
Find the source code here.
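For readers who just want the idea, here is a minimal stack-based sketch of such an evaluator. It is not the linked source code. I assume that & is AND, | is OR and ^ is exclusive OR, and the function names load_world and evaluate are invented for this sketch. It returns plain 0/1 values, so its output formatting will not match the session above in every case.

```python
def load_world(lines):
    """Map each name to 1, or to 0 if the line is prefixed with '/'."""
    world = {}
    for line in lines:
        name = line.strip()
        if not name:
            continue
        if name.startswith("/"):
            world[name[1:]] = 0
        else:
            world[name] = 1
    return world


# Assumed meaning of the operators: AND, OR, exclusive OR.
OPERATORS = {
    "&": lambda a, b: a & b,
    "|": lambda a, b: a | b,
    "^": lambda a, b: a ^ b,
}


def evaluate(expression, world):
    """Evaluate an RPN expression; raise ValueError on bad input."""
    stack = []
    for token in expression.split():
        if token in OPERATORS:
            if len(stack) < 2:
                raise ValueError("Unbalanced expression: %r" % expression)
            right = stack.pop()
            left = stack.pop()
            stack.append(OPERATORS[token](left, right))
        elif token in world:
            stack.append(world[token])
        else:
            raise ValueError("Bad name: %r" % token)
    # A well-formed expression leaves exactly one value on the stack.
    if len(stack) != 1:
        raise ValueError("Unbalanced expression: %r" % expression)
    return stack[0]


if __name__ == "__main__":
    world = load_world(["wind", "/sun", "/rain", "red"])
    print(evaluate("sun red |", world))
```

Operands are pushed onto the stack, and every operator pops two values and pushes the result. An expression is balanced only if exactly one value is left at the end, which is why both 'sun red' and 'sun red red |' are rejected.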
Friday, 25 September 2009
Download all SMBC Comics
Simple regex-based bruteforce program to save all comics from http://www.smbc-comics.com/. You'll need http://commons.apache.org/io/ and my source code.
Tuesday, 15 September 2009
Google Cheat Sheet 0.11
Wrote a Google Cheat Sheet: http://rapidshare.com/files/280485137/gcs.pdf
It's simple and contains every working function in Google Search, Groups, News, Calculator. What's missing? Query suggestions...
Saturday, 15 August 2009
Poor Networks, Neurons and Lookaheads
Syntactic networks bear similarities to biological networks since they are scale-free, i.e. the number of edges per node follows a power law (e.g. social networks), and small-world, i.e. most nodes can be reached from any other node in a relatively small number of steps (e.g. social networks):
A group of researchers at the Institute of Applied Linguistics in Beijing, China tried to find similarities between semantic and syntactic networks via a statistical approach and a treebank with semantic roles. Both networks are small-world and scale-free graphs, but they differ in hierarchical structure and k-nearest-neighbour correlation. Semantic networks also tend to create longer paths, which makes them a poorer hierarchy in comparison to syntactic networks: Statistical properties of Chinese semantic networks
Temporal fluctuations in speech are easily corrected by our brain. For decades this mechanism was a mystery. Two researchers at the Hebrew University of Jerusalem, Israel described how neurons adjust to decode distorted sound perfectly. Although I don't understand this very technical paper, it may provide new algorithms for speech processing: Time-Warp-Invariant Neuronal Processing
Another improvement for speech recognition and production was achieved by the Max Planck Society, which developed a new mathematical model. It's based on the look-ahead assumption, i.e. our brain tries to estimate the most probable sound sequence based on previous information, e.g. 'hot su...' = 'sun' > 'supper': Recognizing Sequences of Sequences
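The look-ahead idea from the last item can be illustrated with a toy example. This is emphatically not the model from the paper, only a sketch of "pick the most probable continuation given what came before". The counts are invented so that the example runs, and the function name is hypothetical.

```python
def most_probable_completion(previous_word, prefix, bigram_counts):
    """Return the word starting with `prefix` seen most often after `previous_word`."""
    followers = bigram_counts.get(previous_word, {})
    candidates = {w: c for w, c in followers.items() if w.startswith(prefix)}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


# Invented toy counts (word -> following word -> frequency), not real data.
TOY_COUNTS = {
    "hot": {"sun": 12, "supper": 3, "water": 9},
    "late": {"supper": 7, "sun": 1},
}

if __name__ == "__main__":
    print(most_probable_completion("hot", "su", TOY_COUNTS))
```

With these made-up counts, the prefix 'su' is completed to 'sun' after 'hot' but to 'supper' after 'late', which is the kind of context dependence the look-ahead assumption is about.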
Tuesday, 28 July 2009
These little annoying and surprising things...
...concerning Python as language of "very clear syntax which emphasizes code readability" (Wikipedia):
A = Annoying == Anti-Python
1. The naming conventions of...:
a) ...package-files/magic members: __init__.py, def __del__, __name__
b) ...visibility modifiers: _protected and __private
So _ and __ in general. Really, why did Guido do this? Is there an explanation? Perhaps it's inherited from another language?
2. The verbose...:
a) ...object inheritance declaration of each class: class Standard(object)
b) ..."self"-reference of each class-value/-constructor/-function just to indicate that it's non-static: def compute(self, number), self.Radius, def __init__(self)
(3. Multiple inheritance: It's no coincidence that most languages don't support multiple inheritance. Normally, you don't need it and it is a trap which makes debugging almost impossible. It is definitely not a feature for a language which emphasizes code readability and clear syntax.)
S = Surprising
(Powerful ability to handle and process strings in general.)
1. Lambda/Anonymous functions: (lambda x, y : x + y)
2. Managed Attributes: property([fget[, fset[, fdel[, doc]]]])
3. Great modification abilities due to magic members/methods and type-emulation.
4. List comprehension, generator expressions and yield.
5. Function decorators. Java has this one too.
6. Localization module. It's neat and easy to localize your programs.
7. Parallel computing module!
8. Awesome network protocol capabilities.
9. Unit tests.
10. The best documentation I've ever seen.
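A few of the surprising features from the list (lambda, property(), list comprehensions, generator expressions and yield) fit into one short sketch. The Circle class and all other names are made up for illustration.

```python
import math


class Circle(object):
    """Uses property() so that the radius is validated on every assignment."""

    def __init__(self, radius):
        self.radius = radius  # goes through _set_radius below

    def _get_radius(self):
        return self._radius

    def _set_radius(self, value):
        if value < 0:
            raise ValueError("radius must not be negative")
        self._radius = value

    radius = property(_get_radius, _set_radius, doc="Radius of the circle.")

    def area(self):
        return math.pi * self._radius ** 2


add = lambda x, y: x + y                           # anonymous function
squares = [n * n for n in range(5)]                # list comprehension
even_squares = (s for s in squares if s % 2 == 0)  # generator expression


def countdown(start):
    """Generator function: yields start, start - 1, ..., 1."""
    while start > 0:
        yield start
        start -= 1
```

The managed attribute is the interesting part: callers write circle.radius = 3 as if it were a plain field, and the check in _set_radius still runs.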
Sunday, 26 July 2009
Blagh
1. Blagh:
- I read the Post "Crying" again and must say that Java is definitely my favourite programming language.
- I've completely lost all my knowledge of Python, which is a bad thing since I'll have to use it in an exam in a few months.
- I got attached to Twitter. It's so convenient and you can watch it on the panel of my blog.
- I read Russell's Introduction to Mathematical Philosophy. Very interesting.
2. Pathfinder:
I had to write a small Java program that implements Dijkstra's shortest path algorithm in order to find the shortest path between two cities. You can download the source code here. You can add nodes and edges to the data.txt and the user interface is console-based.
3. Imageboardsave:
Currently, I'm writing a program which downloads every image and Rapidshare URL in a thread of an imageboard. The program works for AnonIB at the moment and the user can specify how many pages to search. The program scans each page for threads and downloads the content to:
/threadID/picturefilename.something
and the Rapidshare-URLs to:
/threadID/rapidshare.txt
Subthreads, i.e. threads that span more than one page, are considered as well. It even has a graphical user interface programmed in Java Swing. Actually, it runs pretty well right now, but I don't want to release it yet.
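As a side note to the Pathfinder item above: the core of Dijkstra's algorithm is short. The following is a Python sketch using a priority queue, not my Java program. The graph format (a dictionary of neighbours with distances) and the city names are assumptions made up for this example and have nothing to do with the format of data.txt.

```python
import heapq


def shortest_path(graph, start, goal):
    """Return (distance, path) from start to goal, or (None, []) if unreachable."""
    queue = [(0, start, [start])]
    visited = set()
    while queue:
        # Always expand the node with the smallest known distance first.
        distance, node, path = heapq.heappop(queue)
        if node == goal:
            return distance, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, weight in graph.get(node, {}).items():
            if neighbour not in visited:
                heapq.heappush(queue, (distance + weight, neighbour, path + [neighbour]))
    return None, []


# Hypothetical cities and distances, invented for illustration.
ROADS = {
    "A": {"B": 4, "C": 1},
    "B": {"A": 4, "C": 2, "D": 5},
    "C": {"A": 1, "B": 2, "D": 8},
    "D": {"B": 5, "C": 8},
}

if __name__ == "__main__":
    print(shortest_path(ROADS, "A", "D"))
```

Carrying the whole path in every queue entry is wasteful for large graphs (a predecessor map would be leaner), but it keeps the sketch short. Note that the algorithm only works for non-negative edge weights.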
Wednesday, 20 May 2009
Language Guesser and OpenNLP Pipe
1. The first program I wrote estimates the language of a document, based on a simple statistical bigram model. It takes five command-line arguments in the following syntax:
java Main trainfile_lang_1 trainfile_lang_2 trainfile_lang_3 trainfile_lang_4 testfile_lang
e.g.
java Main ./europarl/train/de/ep-01-de.train ./europarl/train/en/ep-01-en.train ./europarl/train/es/ep-01-es.train ./europarl/train/fr/ep-01-fr.train ./europarl/test/de/ep-01-de.test
and writes to the system standard output: "Die Sprache ist wahrscheinlich Deutsch." (The language is probably German.)
You see, this is very static and perhaps I'll make it more dynamic in the future. It was originally created to estimate the language of proceedings of the European Parliament. You can download it here, but please be aware that the comments are in German.
2. The second program is more interesting. It takes an XML file with tags and everything and writes the sentences to a file in the following format:
token TAB tag TAB chunk
e.g. for the sentence "Is this love?"
Is VBZ O
this DT B-NP
love NN O
? . O
It just takes the path to the XML file as a command-line argument. To run the program you'll need two things: Ant and the OpenNLP models. You can download the program here.
/*
* To ensure the functionality of the program, the models EnglishChunk.bin.gz,
* EnglishSD.bin.gz, EnglishTok.bin.gz, tag.bin.gz and tagdict.htm have to be
* in the models directory.
*
* IMPORTANT:
* The models have to be downloaded and placed in the models directory
* otherwise the program won't work. The download links can be found in
* ./model/readme.txt.
*
* Install:
*
* 1. Download the models at OpenNLP: http://opennlp.sourceforge.net/
* 2. Run Ant
* 3. Start the program with java -jar NLPipe.jar XMLfile
* 4. Optionally you can run the program with: ant -Dargs="XMLfile" run
* 5. Optionally you can archive the basedir with: ant tar
*/
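To give an idea of how the language guesser from item 1 works, here is a small Python sketch (the original is written in Java and is not reproduced here). It assumes character bigrams and add-one smoothing; both are my choices for this sketch, as are all function names. Each language is scored by the smoothed log-probability of the test text's bigrams, and the best-scoring language wins.

```python
import math
from collections import Counter


def bigrams(text):
    """Split a text into overlapping, lower-cased character bigrams."""
    text = text.lower()
    return [text[i:i + 2] for i in range(len(text) - 1)]


def train(text):
    """Count the character bigrams of one training text."""
    return Counter(bigrams(text))


def log_likelihood(text, counts, vocabulary_size):
    """Add-one smoothed log-probability of the text's bigrams under one model."""
    total = sum(counts.values())
    return sum(
        math.log((counts[bg] + 1.0) / (total + vocabulary_size))
        for bg in bigrams(text)
    )


def guess_language(text, models):
    """models maps a language name to its bigram Counter."""
    vocabulary = set(bigrams(text))
    for counts in models.values():
        vocabulary.update(counts)
    size = len(vocabulary)
    return max(models, key=lambda lang: log_likelihood(text, models[lang], size))


if __name__ == "__main__":
    # Tiny made-up training texts; real use would need far more data.
    models = {
        "de": train("die sprache ist wahrscheinlich deutsch und das ist gut"),
        "en": train("the language is probably english and that is good"),
    }
    print(guess_language("das ist die sprache", models))
```

The smoothing matters: without it, a single bigram that never occurred in a training file would give that language a probability of zero, no matter how well the rest of the text fits.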
Labels:
Academic,
Computational Linguistics,
Java,
Programming
Friday, 10 April 2009
Recap
This is going to be a very boring semester. Java is really interesting but also complicated for a beginner. Compared to Python, it feels like a mature language and you have to think (more) about what you're doing. Formal Syntax is complex and time-consuming, especially for me, since I'm not into syntax. I'm more interested in semantics and therefore Logic is quite fancy - although sometimes it's as boring as Math. Acoustic Phonetics hasn't started yet and Artificial Intelligence is fun but often too superficial - we aren't concerned with Natural Language Processing. Hence I plan to do a small pragmatic series on AI to recap what I learn in the course. I also took two courses in English: Translation into German (boring as hell) and English to 1700 (boring lecturer), but they aren't worth mentioning.
Finally, there's no post in my blogroll that could spark my interest. Except - well, in a not so positive way - for a news item on EurekAlert. This is one of the cases which I'd entitle 'the most unspectacular findings - that are no findings because everybody already knows - in science'. The study says that music is culturally independent when it comes to conveying (certain) emotions. Well, everyone who's listened to a song in another language knows this. The study goes one step further and investigates the influence of music from fundamentally different cultures: Language of music really is universal, study finds
Monday, 16 March 2009
Points of Interest 16. March, 2009
1. Joint attention and the difference between we and I might be the key in order to understand language evolution.
My opinion is that language evolved to set borders. The border of I; my person, my belongings, my needs and thoughts, my standing within society - in opposition to others.
The border of we; our tribe, our race, our territory, our hunting grounds, our ideals - in opposition to others.
Language is a cultural and psychological necessity to govern, identify, interact and define oneself as things get more complex. A biological necessity to define entities in time and space and to interact with them in a complex way on different planes. A way to optimize doing things, like hunting strategies, complex hierarchies, standings to other tribes: A Tale Without Episodes
2. A very strange article. As a computational linguist I'm very interested in how they compute the evolution of a language. I can't imagine how this could work. There are several factors which seem to be just too random to be calculated. How did they trace the words? Did they use OCR to scan documents and match words? There is one particular paragraph which I don't understand at all:
"Looking to the future, the less frequently certain words are used, the more likely they are to be replaced. Other simple rules have been uncovered - numerals evolve the slowest, then nouns, then verbs, then adjectives. Conjunctions and prepositions such as 'and', 'or', 'but' , 'on', 'over' and 'against' evolve the fastest, some as much as 100 times faster than numerals."
What does evolve mean here? Change? How can prepositions change? These things are bound by perception and are, in my opinion, as static as numerals: Scientists discover oldest words in the English language and predict which ones are likely to disappear in the future
3. I'm quite into morbid things, like mummies and anatomical stuff. Lately, Pink Tentacle wrote a post about monster mummies and living mummies - formerly known as living Buddhist monks. They killed themselves (slowly) over a period of 3000 days and are now relics of a sort: Monster mummies of Japan
4. Does your language influence your preference in music? Language and music are associated: Why music sounds right - the hidden tones in our own speech
5. A very basic lecture about language held at Yale University. It covers fundamentals of linguistics - Pidgin & Creoles, language as human trait, language universals, (innate) language capacity, phonetics, morphology, syntax, semantics (in this order) and so on: How Do We Communicate?: Language in the Brain, Mouth and the Hands
Wednesday, 18 February 2009
Crying
Vacation! Last semester was very intense, no time for anything. Gosh, and now I'm trying to learn Java... It's far more complicated than I imagined. Python was neat and easy - in comparison - but Java is far more potent, so they say. I don't quite understand why I have to write code which seems to be superfluous - at least for a beginner like me, e.g.
public class HelloWorld
{
    public static void main(String argv[])
    {
        System.out.println("Hello World!");
    }
}
is equivalent to
print "Hello World!"
in Python.
And next semester I'm going to die for sure... Here're my courses:
Formal Syntax: I'm more the Morphology, Phonology, Semantics type. Actually Syntax is the only thing which gives me problems.
Java: Well, it's going to be hard I'd say.
Logic: I like formal logic, really.
Artificial Intelligence: Very, very interesting. Inference, neural networks, genetic algorithms and that stuff - you know...
Acoustic Phonetics: This sounds good: reading spectrograms and getting in touch with VoiceXML.
Labels:
About,
Academic,
Computational Linguistics,
Programming,
Python
Tuesday, 17 February 2009
Points of Interest 17. February, 2009
1. 20 years in prison or a death sentence for translating the Qur'an? Welcome to our illustrious humanistic and democratic circle, dear Afghanistan: The dangers of translation
2. Help to classify galaxies and work for science, now! You'll be shown pictures of galaxies and asked questions about their spiral arms and other features. A task which you can do better than a computer and will help astronomy: Galaxy Zoo 2
3. The 2009 Jōyō Kanji update is coming up and the two favourites are 俺 (ore, which means informal "I") and 誰 (dare, question pronoun "who?"). I think "dare" is quite common these days, even in formal context. Of course one can say donata which is more formal: Jōyō list to level up
4. Speech perception is much more than hearing sounds. Several senses other than hearing are involved: Read my lips: Using multiple senses in speech perception
5. Steven Pinker explains the critical points of his book "The Blank Slate":