
Media Cloud multi-language support via plug-in infrastructure, plus an automatic stemmer and a stopword list generator

(Accepted and successfully completed project; original proposal.)

I propose adding multi-language support to Media Cloud in such a way that the new languages (French, German, …) could be added as plug-ins (adding a new language would not involve further modifications to the core code). Additionally, the process of adding a language would also be made simpler by providing automatic (experimental) stopword list and stemmer generators.

About Me

Description

My name is Linas [pronounced lee-nahs], and I’m from Vilnius, Lithuania. Currently, I’m an undergraduate journalism student at Vilnius University, Faculty of Communication.

I started programming in Perl when I was fourteen years old. Strangely enough, Perl was the first programming language that I learned; I was introduced to it by a “hacker” magazine that I had bought (those were very popular at the time). The magazine claimed that Perl was “cool” and would, in fact, make me a “hacker”.

It wasn’t long before I found out that Perl is, indeed, pretty cool. This landed me my first-ever job while I was still fourteen. I’m not too proud of the nature of the job (so to speak), but here it is: I had to create a bot that would automatically “vote” for a certain music band on a radio station’s online “Top 40” listings. The bot worked by downloading a list of HTTP / SOCKS 5 proxies from a certain website and then submitting a vote through each of them.

Again, that wasn’t a flagship project of mine (and it happened ten years ago anyway), but I hope it shows that I had some early experience developing server-side applications in Perl. Thankfully, the “client” for this particular software got busted (my guess is that the radio station managers decided it was unlikely that so many fans of band X lived in Zimbabwe), and I moved on to web and mobile development (as shown in my CV).

Later, while studying journalism at Vilnius University, I gained experience developing public transport software (figuring out how to calculate shortest paths in huge, country-wide graphs), online media websites (learning the “do’s” and “don’ts” of a site serving 1,000 requests per second), English-Lithuanian dictionaries (thanks to that, I now know what a stemmer is) and iPhone applications (ranging from a simple news reader to a public transport timetable viewer able to calculate those shortest paths client-side).

Last year, I was lucky enough to contribute to the wxWidgets project by bridging Cocoa Touch (the Objective-C API used for developing iOS applications) and the wxWidgets UI library. The wxWidgets project (which was a part of GSoC 2011) taught me how to get into and understand a huge, previously unknown code base.

Right now, I would like to try to find my way back to crawler programming, extend my knowledge of natural language processing and have a chance to play with a huge amount of data. Thus, I’m writing this proposal to the Media Cloud project.

Personal Gains

By participating in the Media Cloud project, I hope to refresh my Perl skills, gain some more NLP experience, help open up an interesting and useful piece of software to a wider (international) audience, and further improve my ability to read other people’s code and research papers.

Why I’m interested

I got interested in the Media Cloud project because, as a journalism student, I see great potential in applying such software to online media. Media Cloud could help answer many important questions and catch various ongoing trends in real time (as they happen); its adaptability could be improved further by implementing certain features (e.g. multi-language support, which I propose in this document).

Frankly, getting to work with someone from Harvard is a motivator too.

Dates and times

I have reviewed the Important dates and times for GSoC 2012. I have no conflicts with the schedule.

Although I live in the UTC+3 time zone and Boston, MA is in UTC-4, work hours in Boston translate to “afternoon to midnight” in Lithuania, so I have no problem with that either.

I do understand that Google Summer of Code is a serious commitment, equivalent to a full-time job.

Code samples

Proposal

Overview

I propose adding multi-language support to Media Cloud in such a way that the new languages (French, German, …) could be added as plug-ins (adding a new language would not involve further modifications to the core code). Additionally, the process of adding a language would also be made simpler by providing automatic (experimental) stopword list and stemmer generators.

Objectives:

Languages and technology: Perl, PostgreSQL, Snowball stemmer, Subversion (Git + SVN).

Detailed

At the time of writing, Media Cloud supports a limited set of natural languages (English, Russian, Chinese), and the support for those languages is implemented right in the core of the application in an if (isEnglish()) { ... } elsif (isChinese()) { ... } elsif (isRussian()) { ... } manner.

I propose adding multi-language support to Media Cloud in such a way that the new languages (French, German, …) could be added as plug-ins. With this approach, adding support for a new language would not require one to modify the core code, and the whole process would be easier to accomplish.
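
A minimal sketch of the plug-in idea follows. It is written in Python purely for illustration (Media Cloud itself is Perl), and every name in it (LanguagePlugin, register, get_language, stems_for, the toy English stemmer and stopword set) is a hypothetical placeholder rather than existing Media Cloud code. The point it tries to show is that the core looks a language up in a registry instead of branching on it, so a new language only has to register itself.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List


@dataclass(frozen=True)
class LanguagePlugin:
    """Language-specific data and behaviour (hypothetical shape)."""
    code: str
    stopwords: FrozenSet[str]
    stem: Callable[[str], str]
    split_sentences: Callable[[str], List[str]]
    split_words: Callable[[str], List[str]]


# Registry that replaces the if / elsif chain in the core.
_REGISTRY: Dict[str, LanguagePlugin] = {}


def register(plugin: LanguagePlugin) -> None:
    _REGISTRY[plugin.code] = plugin


def get_language(code: str) -> LanguagePlugin:
    try:
        return _REGISTRY[code]
    except KeyError:
        raise ValueError(f"no plug-in registered for language {code!r}") from None


def default_split_sentences(text: str) -> List[str]:
    # Naive default: break after '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def default_split_words(sentence: str) -> List[str]:
    # Naive default: lower-cased runs of word characters.
    return re.findall(r"\w+", sentence.lower())


# A toy English plug-in; a real one would wrap a proper stemmer
# and a full stopword list.
register(LanguagePlugin(
    code="en",
    stopwords=frozenset({"the", "a", "is", "of"}),
    stem=lambda w: w[:-1] if w.endswith("s") and len(w) > 3 else w,
    split_sentences=default_split_sentences,
    split_words=default_split_words,
))


def stems_for(text: str, code: str) -> List[List[str]]:
    """Core-side code: no language-specific branches, only plug-in calls."""
    lang = get_language(code)
    return [
        [lang.stem(w) for w in lang.split_words(s) if w not in lang.stopwords]
        for s in lang.split_sentences(text)
    ]


print(stems_for("The cats sleep. A dog barks!", "en"))
# [['cat', 'sleep'], ['dog', 'bark']]
```

Under this assumption, adding French or German would mean shipping one more register(...) call in a separate module, while stems_for and the rest of the core stay untouched.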

Benefits:

Objectives in detail:

  1. Move the language-dependent code and assets into separate plug-ins.
    • This would be done for each of the currently supported languages (English, Russian, Chinese).
      • Later, for the Lithuanian language too (as a test case).
    • The plug-in for each language would allow one to configure:
      • a stopword list
      • a word stemmer (either a language code passed to the Lingua::Stem::Snowball CPAN module or a custom subroutine altogether)
      • a way to detect a link to the next page of an article (either via keyword(s) or a subroutine)
      • a way to detect the boundaries of a sentence (either by looking for a full stop symbol ‘.’ or with a subroutine)
      • a way to extract separate words from a sentence (either by looking for a whitespace symbol ‘ ‘ or with a subroutine)
      • other means of language-specific setup
  2. Modify the core code so that it would use a particular configured language plug-in for the inner processes.
    • The main language would have to be set in a main configuration file, mediawords.yml, or some other place.
    • Then, the tagger (sentence / word extractor) would initialize a plug-in for that particular language and use the language-specific data provided by the plug-in to split texts into sentences, sentences into words, words into their stems, detect “next pages”, etc.
    • Unit tests will have to be adapted accordingly so that they cover the new language support and still pass.
  3. Create an experimental stopword list generator.
    • The experimental stopword list generator would make it easier for the end-user to provide a stopword list if they don’t have one for their language.
    • The stopword list generator would employ one, several or a mix of the methods described in:
  4. Create an experimental stemmer generator.
    • The experimental stemmer generator would make it easier for the end-user to provide a language stemmer if they don’t have one for their language.
    • The stemmer generator would produce stemmer source code in the Snowball language (or some other, possibly Perl, data structure that would fit the nature of the stemmer) which could be used by Lingua::Stem or Lingua::Stem::Snowball.
    • The stemmer generator would employ one, several or a mix of the methods described in:
  5. Add a new language (Lithuanian) in order to test how the newly created plug-in architecture and experimental generators work.
    • Lithuanian is not a major world language, but it would serve as a good test case for the newly created plug-in architecture.
    • A corpus for Lithuanian should be easy to find (I know some places to ask for it, and lt.wikipedia.org might be a last resort).
  6. Document the process of adding a new language in a README.
    • The README would be a step-by-step guide describing a way to add a new language to Media Cloud.
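
To give a rough feel for objective 3, here is a minimal sketch of one common baseline for stopword candidates: words that occur in a large share of the documents in a corpus. This is an assumption chosen for illustration, not necessarily one of the methods the generator would end up using, and the function name, its parameters and the four-sentence Lithuanian toy corpus are all hypothetical. The sketch is in Python for brevity; the actual generator would be written in Perl.

```python
import re
from collections import Counter
from typing import Iterable, List


def candidate_stopwords(documents: Iterable[str],
                        min_doc_share: float = 0.8,
                        limit: int = 100) -> List[str]:
    """Return words that occur in at least min_doc_share of the documents.

    Document frequency is only a baseline heuristic (an assumption made
    for this sketch); the output is meant to be reviewed by a human
    before it is used as a stopword list.
    """
    doc_freq: Counter = Counter()
    n_docs = 0
    for doc in documents:
        n_docs += 1
        # Count each word once per document.
        doc_freq.update(set(re.findall(r"\w+", doc.lower())))
    if n_docs == 0:
        return []
    frequent = [(w, c) for w, c in doc_freq.items()
                if c / n_docs >= min_doc_share]
    # Most widespread words first; ties broken alphabetically.
    frequent.sort(key=lambda wc: (-wc[1], wc[0]))
    return [w for w, _ in frequent[:limit]]


# Toy corpus (hypothetical); "ir" means "and" in Lithuanian.
corpus = [
    "ir katė miega",
    "šuo ir katė",
    "ir paukštis skrenda",
    "žuvis ir vanduo",
]
print(candidate_stopwords(corpus, min_doc_share=0.75))
# ['ir']
```

A real corpus would need a far larger sample and a tuned threshold, which is one reason the generator is described as experimental.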

Project Plan

The project plan distinguishes several levels of objective completeness. The levels are loosely defined as follows:

Plan: