Getting JSON out of Tweepy

Tweepy is a really nice, easy-to-use Python module for accessing the Twitter API. It wraps results in its own model instances for specific types, e.g. user and status. This is fine until you just want the raw JSON as returned by Twitter. The following monkeypatch solves this by setting a json property containing the JSON response data.
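As a rough sketch of the general idea (not the original snippet), a monkeypatch like this wraps a model's parse classmethod and hangs the raw data on the returned instance. The Status class below is a hypothetical stand-in so the example runs without Tweepy, and the parse(api, raw) signature and the add_json_property helper are assumptions made for illustration.

```python
import json


class Status(object):
    """Hypothetical stand-in for a Tweepy-style model class."""

    @classmethod
    def parse(cls, api, raw):
        # Copy each key of the decoded response onto a new instance.
        status = cls()
        for key, value in raw.items():
            setattr(status, key, value)
        return status


def add_json_property(model_cls):
    """Wrap model_cls.parse so parsed instances also carry the raw JSON."""
    original_parse = model_cls.parse  # classmethod bound to model_cls

    def parse(cls, api, raw):
        instance = original_parse(api, raw)
        # Keep the response data as a JSON string on the instance.
        instance.json = json.dumps(raw)
        return instance

    model_cls.parse = classmethod(parse)


# Apply the patch once, before any results are parsed.
add_json_property(Status)

status = Status.parse(None, {"id": 1, "text": "hello"})
print(status.text)
print(status.json)
```

With Tweepy itself the same wrapper would presumably be applied to its own model classes, but check the parse signature of the version you have installed first.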

 

Language detection with python nltk

There are quite a few Python libraries out there for determining the language a text was written in. Most of them work by splitting the text into character sequences of length n (n-grams) and counting how often each sequence occurs. These counts are then compared to a trained corpus of sequences and their frequencies for various languages, and every match adds weight to that language’s score. The language with the highest score is taken to be the text’s language. It’s a simple method and surprisingly effective.
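To make the method concrete, here is a minimal, self-contained sketch using trigrams (n = 3) and only the standard library. The two "profiles" are toy ones built from a single sentence each, purely for illustration; a real detector would use frequencies trained on a large corpus. The function names are made up for this example.

```python
from collections import Counter


def char_trigrams(text):
    """Return the three-letter sequences of every word in text."""
    grams = []
    for word in text.lower().split():
        grams.extend(word[i:i + 3] for i in range(len(word) - 2))
    return grams


def build_profile(sample):
    """Count how often each trigram occurs in a sample text."""
    return Counter(char_trigrams(sample))


def score(text, profile):
    """Sum over the text's trigrams of
    (relative frequency in the profile) * (relative frequency in the text)."""
    counts = Counter(char_trigrams(text))
    total = sum(counts.values())
    profile_total = sum(profile.values())
    if total == 0 or profile_total == 0:
        return 0.0
    return sum((profile[gram] / profile_total) * (n / total)
               for gram, n in counts.items())


# Toy profiles; far too small for real use.
profiles = {
    "en": build_profile("the quick brown fox jumps over the lazy dog and then the other dog"),
    "nl": build_profile("de snelle bruine vos springt over de luie hond en dan de andere hond"),
}


def guess(text):
    """Return the language whose profile scores highest for text."""
    return max(profiles, key=lambda lang: score(text, profiles[lang]))


print(guess("the dog jumps"))
print(guess("de hond springt"))
```

Normalising both counts keeps long texts and large profiles from dominating the score just because they contain more trigrams.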

Not too long ago a language corpus with trigram counts for 451 languages was added to the excellent Python NLTK module. Other than that, there isn’t anything ready-made in NLTK for language detection, so you’ll need to roll your own, which actually isn’t that hard. The first thing you’ll need is a corpus reader to deal with the corpus files; the second is a way to create trigrams (three-letter sequences) from your text and match them against the corpus to create a score per language.

from nltk.util import trigrams as nltk_trigrams
from nltk.tokenize import word_tokenize as nltk_word_tokenize
from nltk.probability import FreqDist
from nltk.corpus.util import LazyCorpusLoader
from nltk.corpus.reader.api import CorpusReader
from nltk.corpus.reader.util import StreamBackedCorpusView, concat

class LangIdCorpusReader(CorpusReader):
    '''
    LangID corpus reader
    '''
    CorpusView = StreamBackedCorpusView

    def _get_trigram_weight(self, line):
        '''
        Split a line in a trigram and its frequency count
        '''
        data = line.strip().split(' ')
        if len(data) == 2:
            return (data[1], int(data[0]))

    def _read_trigram_block(self, stream):
        '''
        Read a block of trigram frequencies
        '''
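        # Read 20 lines at a time and drop blank or malformed lines.
        # Return a real list: on Python 3, filter() would give a lazy iterator.
        freqs = (self._get_trigram_weight(stream.readline()) for i in range(20))
        return [freq for freq in freqs if freq is not None]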

    def freqs(self, fileids=None):
        '''
        Return trigram frequencies for a language from the corpus        
        '''
        return concat([self.CorpusView(path, self._read_trigram_block) 
                       for path in self.abspaths(fileids=fileids)])

class LangDetect(object):
    langid = LazyCorpusLoader('langid', LangIdCorpusReader, r'(?!\.).*\.txt')

    def __init__(self, languages=('nl', 'en', 'fr', 'de', 'es')):
        # Per-instance mapping, so separate detectors don't share state.
        self.language_trigrams = {}
        for lang in languages:
            self.language_trigrams[lang] = FreqDist()
            for f in self.langid.freqs(fileids=lang+"-3grams.txt"):
                # FreqDist.inc() was removed in NLTK 3; add to the count directly.
                self.language_trigrams[lang][f[0]] += f[1]

    def detect(self, text):
        '''
        Detect the text's language
        '''
        words    = nltk_word_tokenize(text.lower())
        trigrams = {}
        scores   = dict([(lang, 0) for lang in self.language_trigrams.keys()])

        for match in words:
            for trigram in self.get_word_trigrams(match):
                if not trigram in trigrams.keys():
                    trigrams[trigram] = 0
                trigrams[trigram] += 1

        total = sum(trigrams.values())

        for trigram, count in trigrams.items():
            for lang, frequencies in self.language_trigrams.items():
                # normalize and add to the total score
                scores[lang] += (float(frequencies[trigram]) / float(frequencies.N())) * (float(count) / float(total))

        return sorted(scores.items(), key=lambda x: x[1], reverse=True)[0][0]

    def get_word_trigrams(self, match):
        return [''.join(trigram) for trigram in nltk_trigrams(match) if trigram != None]

To see if it actually works:

texts = [
     "De snelle bruine vos springt over de luie hond",
     "The quick brown fox jumps over the lazy dog",
     "Le renard brun rapide saute par-dessus le chien paresseux",
     "Der schnelle braune Fuchs springt über den faulen Hund",
     "El rápido zorro marrón salta sobre el perro perezoso"
]

ld = LangDetect()

for text in texts:
    print(text, "=>", ld.detect(text))

Which correctly yields:

De snelle bruine vos springt over de luie hond => nl
The quick brown fox jumps over the lazy dog => en
Le renard brun rapide saute par-dessus le chien paresseux => fr
Der schnelle braune Fuchs springt über den faulen Hund => de
El rápido zorro marrón salta sobre el perro perezoso => es

And that’s it. If you’re looking for a PHP implementation, see this post, which served as an example for this implementation.

Playing with Lithium

I finally found some time to play a little with Lithium, a brand new PHP 5.3-based framework. I had a quick look when the first versions were released and already liked it back then: it's fast, clean and makes full use of the latest PHP features. Another reason for jumping in was its support for MongoDB, a fast schema-free database. While working my way through the obligatory blog sample, I ended up with the following for reading a MongoDB entry by id in the controller's view action:



$post = Post::find('first', array('conditions' => array('_id' => new MongoId($id))));

This works perfectly well and is explicit, which I like. Still, I'd like to abstract away the MongoDB specifics and have a shorthand call like Post::read($id) available for reading a single item. Lithium doesn't offer this, but it has a powerful filtering system that lets you pass closures to modify behaviour at runtime. It's very flexible and, as I see it, a good example of aspect-oriented programming. To create the shorthand I had to register two filters: one that converts the string id into a MongoId instance, and a 'finder' filter that enables the read() call.

<?php
namespace uw_posts\models;

use \MongoId;

class Post extends \lithium\data\Model
{
    /**
     * Set up default connection options and connect default finders.
     *
     * Parent override which registers:
     *
     * <ul>
     *     <li>a find filter for converting a string id to a MongoId instance</li>
     *     <li>a 'read' finder, which enables <code>Model::read($id)</code></li>
     * </ul>
     *
     * @see lithium\data\Model
     * @param array $config
     * @return void
     */
    public static function __init($config = array())
    {
        parent::__init($config);

        // filter for converting a string id into a MongoId instance
        static::applyFilter('find', function($self, $params, $chain){

            $conditions = $params['options']['conditions'];

            if (isset($conditions['id']) and preg_match('/^[0-9a-f]{24}$/', $conditions['id']))
            {
                $params['options']['conditions']['_id'] = new MongoId($conditions['id']);

                unset($params['options']['conditions']['id']);
            }

            return $chain->next($self, $params, $chain);
        });

        // read finder
        static::finder('read', function($self, $params, $chain) {

            $conditions = $params['options']['conditions'];

            if (isset($conditions['_id']))
            {
                return $self::find('first', array('conditions' => array('id' => $conditions['_id'])));
            }

            return $chain->next($self, $params, $chain);
        });
    }
}

It's quite compact and efficient, and it keeps all functionality nicely isolated. One drawback of the 'read' finder filter as quickly implemented here is that it doesn't allow other filters to be passed when the 'read' finder gets called. Any function can be made filterable though, so a way around this is to create an explicit read() method in the model and make it filterable; \lithium\data\Model contains examples of how to do this.

First post

Trying out posterous!



<?php
   echo "Hello world";