Software
Evolution's old odd mail folders to mbox
Something wrong happened in my dad's Evolution. It just would get stuck checking mail forever, with no useful diagnostic that I could find. Fun. Not.
Anyway, I solved it by resetting everything to factory defaults, moving away all gconf entries and .evolution/ files. Then it started to work again, but of course I then needed to reconfigure it from scratch.
It turned out however that some old mail was only archived locally, and in a kind of weird format that looks like this:
$ ls -la Enrico/
total 336
drwx------ 2 enrico enrico 4096 Jul 23 03:05 .
drwxr-xr-x 7 enrico enrico 4096 Jul 23 03:12 ..
-rw------- 1 enrico enrico 3230 Dec 4 2010 113.HEADER
-rw------- 1 enrico enrico 14521 Dec 4 2010 113.TEXT
-rw------- 1 enrico enrico 3209 Oct 22 2010 134.HEADER
-rw------- 1 enrico enrico 2937 Oct 22 2010 134.TEXT
-rw------- 1 enrico enrico 3116 Jun 27 2011 15.
-rw------- 1 enrico enrico 3678 Jun 27 2011 168.
-rw------- 1 enrico enrico 73 Apr 27 2009 22.1.MIME
-rw------- 1 enrico enrico 3199 Apr 27 2009 22.2
-rw------- 1 enrico enrico 88 Apr 27 2009 22.2.MIME
[...]
I couldn't even find the name of that mail folder layout, let alone conversion tools. So I had to sit down and waste my Sunday break writing software to convert that to an mbox file. Here's the tool, may it save you the awful time I had today: http://anonscm.debian.org/gitweb/?p=users/enrico/evo2mbox.git
Note: feel free to fork it, or send patches, but don't bother with feature requests. Evolution isn't and won't be a personal interest of mine. Anything that makes an afternoon at my parents more tiresome than a whole busy month of paid work, doesn't deserve to be.
Luckily they now seem to have changed the local folder format to Maildir.
Giving away distromatch
At last year's Fosdem I tried to inject a lot of energy into distromatch, but shortly afterwards I had to urgently rewrite the nm.debian.org website.
After Lars Wirzenius' GTDFH talks in Bologna and Varese I wrote a tool which, among other things, is able to scan my home dir and list how many projects I'm working on.
The output was scary. Like, there are too many. Like, I couldn't even recite the list from memory. And since I couldn't do that, I had no idea there were so many. And I kept being stressed because I couldn't manage to take care of them all properly.
Now that I became conscious of the situation, it's time to deal with it like a grown up, and politely back off from some of my irresponsible responsibilities.
Distromatch is one of them. It had just started as a proof of concept prototype, and I had the vision that it could be the basis for a fantastic culture of sharing and exchange of information across distributions.
I need to distinguish the vision from the responsibility. I still have that vision for distromatch, but I cannot take responsibility for making it happen.
So I am giving it up to anyone who has the time and resources to pick up that responsibility.
Current status
It works well enough as a prototype. I believe it can successfully map a large enough slice of packages, that one can prototype stuff based on it.
I have for example used it to export the Debtags categories for other distros, and the resulting file looked big enough to be used for prototyping category-based features on distributions that don't have them yet.
I think it also works well enough to support a few common use cases, like sharing screenshots, or doing most of the work of converting dependency lists from a distro to another.
And finally, anyone can deploy it, and work on it.
Existing data sources
Everything I index in the Debian distromatch deployment is available at http://dde.debian.net/exports/distromatch/. The rpm-based data in there comes from an export script I wrote that runs on Sophie, but which I cannot maintain properly.
This is an experimental export of Fedora and OpenSUSE data: http://tmp.vuntz.net/misc/distromatch/distromatch-opensuse-fedora.tar
All existing export scripts are found in the distromatch git repo on gitorious.
Contacts I gathered at Fosdem
At Fosdem I devoted quite some work to get contacts from all possible distributions and software repositories, so that distromatch could be hooked into them. Here is a dump of what I have collected:
- Debian: me
- OpenSuse: Vincent Untz and Adrian Schröter
- Fedora: Tom "Spot" Callaway
- Arch: Tasser on IRC
- CPAN: contact the people of https://metacpan.org/, on irc.perl.org:#metacpan or make an issue on github
- NetBSD: ask on #netbsd on Freenode
- FreeBSD: Baptiste Daroussin (bapt)
- Mageia: Olivier Thauvin
Some of those contacts may have "expired" in the meantime: I wouldn't assume all of them still remember talking with me, although most probably still do.
My commitment for the time being
I am happy to commit, at the moment, to maintaining a working data export for Debian data. I can take responsibility for making it so that the Debian data for it stays up to date, and to fix it asap if it isn't the case.
I hope that now someone can take distromatch over from me, and make it grow to achieve its great potential.
Debtags for derivative distributions
Sometimes I do cool stuff and I forget to announce it.
Ok, so I recently announced a new Debtags website.
I forgot to say in the announcement that the new website does not only know of Debian packages: see for example this page, at the very bottom it says: "Distributions: oneiric, precise, sid, testing".
This means that already, here and now, debtags.debian.net can be used to tag packages from both Debian and Ubuntu, and can easily be extended to cover the entire Debian ecosystem.
If you are a package maintainer, you will notice that your maintainer page shows your packages from everywhere. If you want to filter things a bit, for example hide obsolete packages from an old Debian Stable or Ubuntu LTS, just click on the "Settings" link on the top right to configure the page.
How it works
The magic is in this mergepackages script, which is run daily, and exports merged Packages files at dde.debian.net. What debtags.debian.net treats as its Packages and Sources files are just those all-merged.gz and all-merged-sources.gz.
The merging is simple: that rebuild script processes files in order, and the first version of a package that is found is chosen as the base for the one that will go in the merged Packages file. Some fields like "Description" are just taken from this pivot package, others like Architecture or dependencies are merged into it. It's arbitrary, but works for me: the result has all the packages with all their possible architectures and dependencies, and is ready to be indexed with apt-xapian-index.
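To make the pivot idea concrete, here is a minimal Python 3 sketch of that kind of merge. It is not the real mergepackages script: the function names, the choice of which fields get merged, and the representation of stanzas as plain dicts are all assumptions made for illustration.

```python
def _union(pivot, other, field, split_on, join_with):
    """Append to pivot[field] the items of other[field] that it lacks."""
    items = [x.strip() for x in pivot.get(field, "").split(split_on) if x.strip()]
    for item in other.get(field, "").split(split_on):
        item = item.strip()
        if item and item not in items:
            items.append(item)
    if items:
        pivot[field] = join_with.join(items)


def merge_packages(stanzas):
    """Merge Packages stanzas (dicts) coming from several archives, in order.

    The first stanza seen for a package name becomes the pivot; later
    stanzas for the same name only contribute to the merged fields.
    """
    # Hypothetical field lists: the real script may merge different fields
    space_separated = ("Architecture",)
    comma_separated = ("Depends", "Recommends", "Suggests")

    merged = {}
    for stanza in stanzas:
        name = stanza["Package"]
        if name not in merged:
            # First occurrence: keep everything, Description included
            merged[name] = dict(stanza)
            continue
        pivot = merged[name]
        for field in space_separated:
            _union(pivot, stanza, field, None, " ")
        for field in comma_separated:
            _union(pivot, stanza, field, ",", ", ")
    return merged
```

Feeding it one stanza for a package from sid followed by one from precise keeps the sid Description and produces the union of architectures and dependencies, which is the "arbitrary, but works for me" behaviour described above.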
At the moment I pull data from Debian and Ubuntu, but you can see that the script can easily be extended to pull data from any Debian-style ftp archive, so any Debian derivative can go in. I've already started negotiations with the Derivatives Census on how to add any Debian derivative and keep the list up to date.
How to export tags for your own distribution
I'll use Ubuntu as an example since the data is already available.
The way you add Debtags to the Ubuntu packages file is just this one:
- Get the full reviewed tag database
- Optionally filter out those packages that you are not interested in
- Tweak this script to build an overrides file.
- Give the overrides file to your favourite ftp archive building tool.
The make-overrides is a bit rusty: if you improve it, please send me your
changes.
That is it, nothing else required, no excuses, it's ready, here, now!
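For a rough idea of what building an overrides file involves, here is a small Python 3 sketch. It is not the make-overrides script mentioned above. It assumes the tag database has one "package: tag1, tag2" line per package, and that the archive tool accepts an "extra override" style file with three whitespace-separated columns (package, field name, value); check both assumptions against your own tools before relying on it. The function name and the `wanted` filter are made up for the example.

```python
def tags_to_overrides(lines, wanted=None):
    """Turn "package: tag1, tag2" lines into "package Tag tag1, tag2" lines.

    wanted, if given, is a set of package names to keep (the optional
    filtering step); every other package is skipped.
    """
    for line in lines:
        line = line.strip()
        if ": " not in line:
            # Blank or malformed line
            continue
        # Split on the first ": " only, since tags themselves contain colons
        pkg, tags = line.split(": ", 1)
        if wanted is not None and pkg not in wanted:
            continue
        yield "%s Tag %s" % (pkg, tags)
```

Usage would be along the lines of reading the tag database from a file, passing the set of package names present in your distribution as `wanted`, and writing the resulting lines to the overrides file.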
Hitches and gotchas
This merged Packages file is a bit of a hack, and suffers from name conflicts across distributions, where two different pieces of software are packaged in two different distributions with the same name.
Ideally, name conflicts should not happen: if a derivative decided to package
kate and call it gedit, they deserve to have it tagged uitoolkit::gtk.
I think it's rather important that the whole Debian ecosystem works as much as
possible with a single package namespace.
However, that reasoning fails if you take time into account: packages get
renamed, like git and chromium, and may mean completely different things,
for example, if you compare Debian Stable with Debian Sid.
This last is a problem caused by debtags only working with package names but not package versions. I have a strategy in mind based on being able to override the stable tag database using headers in debian/control; it still needs some details sorted out, but I'm confident we will be able to address these issues properly soon enough.
Why stop at the Debian ecosystem?
Why indeed. I'm clearly trying to use FOSDEM, and the CrossDistribution devroom as the venue to discuss just that.
Deploying distromatch
I have been working on allowing anyone to set up their own distromatch instance.
For Debian and Ubuntu, I can easily generate the distromatch input using UDD and the Contents files found in any mirrors.
For the whole RPM world, thanks to Olivier Thauvin I have been able to set up regular exports from the vast Sophie database.
I have set up distromatch access on DDE, which can also serve as a list of all working distributions so far. If you have access to the full dataset of package names and package contents for a distribution not in that list, please get in touch and we can add it.
I'm also exporting the full raw dataset which enables anyone to set up the same distromatch environment on their own machines.
Here is how:
# Get distromatch
git clone git://gitorious.org/appstream/distromatch.git
cd distromatch
# Fetch distribution information (updated every 2 days)
wget http://dde.debian.net/exports/distromatch-all.tar.gz
# Unpack it
mkdir data
tar -C data -zxf distromatch-all.tar.gz
# Reindex it (use --verbose if you are curious)
./distromatch --datadir=data --reindex --verbose
# Run it
./distromatch --datadir=data debian gedit
What does this mean? For example it means that if another distribution has some data (categories, screenshots...) that your distribution doesn't have, you can use distromatch to translate package names, then go and get it!
My next step is going to be to improve the distromatch functionality in DDE and possibly build a simple user friendly web interface to it. If you have some JQuery experience and would like to help, don't wait to get in touch.
Released cfget 0.18
I have released version 0.18 of cfget.
Changes:
- Allow empty comment lines
- Added Cfget.load_from_env to easily load a working Cfget object from other python code
- Fixed some exception handling and error reporting during parsing of expressions
update-apt-xapian-index on other distros
I've drafted a little HOWTO on using apt-xapian-index on non-Debian distributions.
The procedure has been tried on Mageia with some success, and there's no reason it wouldn't work everywhere else: the index itself does not depend on anything distro-specific.
A prototype webby markety appy thing
What better way to introduce my work at an Application Installer meeting than to come with a prototype package browser modeled after shopping sites developed in just a few hours?
It's a little Flask webapp that just works on any Debian system, using the local apt-xapian-index as a backend. It has fast keyword search, faceted navigation and screenshots, and it runs on your system showing the packages that you have available.
To try it:
git clone git://git.debian.org/users/enrico/pkgshelf.git
cd pkgshelf
./web-server.py
Then visit http://localhost:5000
It didn't have much interface polishing, as it's just a quick technology demo. However you can see that:
- keyword search is fast (fast enough that it could be made to search as you type);
- relevant tags appear on the left, grouped by facets;
- the most relevant tags are highlighted;
- the less relevant tags could be hidden behind a [more] expander;
- you can choose several strategies to hide packages you may find irrelevant.
Things that need doing:
- hiding uninteresting facets;
- making it pretty.
It's essentially JavaScript and CSS work. Anyone wants to play?
Match package names across distributions
What would happen if we had a quick and reliable way to match package names across distributions?
These ideas came up at the appinstaller2011 meeting:
- it would be easy to lookup screenshots in the local distro, and if there are none then fall back on other distributions;
- it would be easy to port Debtags to other distributions, and possibly get changes back;
- it would be trivial to add a [patches in $DISTRO] link to the PTS
- it would be easy to point to other BTSes
We thought they were good ideas, so we started hacking.
To try it, you need to get the code and build the index first:
git clone git://git.debian.org/users/enrico/distromatch.git
cd distromatch
# Careful: 90Mb
wget http://people.debian.org/~enrico/dist-info.tar.gz
tar zxf dist-info.tar.gz
# Takes a long time to do the indexing
./distromatch --reindex --verbose
Then you can query it this way:
./distromatch $DISTRO $PKGNAME [$PKGNAME1 ...]
This would give you, for the package $PKGNAME in $DISTRO, the corresponding package names in all other distros for which we have data. If you do not provide package names, it automatically shows output for all packages in $DISTRO.
For example:
$ time ./distromatch debian libdigest-sha1-perl
debian:libdigest-sha1-perl fedora:perl-Digest-SHA1
debian:libdigest-sha1-perl mandriva:perl-Digest-SHA1
debian:libdigest-sha1-perl suse:perl-Digest-SHA1
real 0m0.073s
user 0m0.056s
sys 0m0.016s
Yes it's quick. It builds a Xapian index with the information it needs, and then it reuses it. As soon as I find a moment, I intend to deploy an instance of it on DDE.
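If you want to consume that output from another program, a small parser is enough. This Python 3 sketch assumes the output format is exactly what the example shows, one "distro:package distro:package" pair per line; the function name is made up, and anything that does not look like such a pair (for instance the timing lines) is ignored.

```python
def parse_matches(output):
    """Map each source "distro:package" to {other_distro: [package, ...]}."""
    result = {}
    for line in output.splitlines():
        fields = line.split()
        # Keep only lines made of exactly two distro:package fields
        if len(fields) != 2 or not all(":" in f for f in fields):
            continue
        src, dst = fields
        distro, pkg = dst.split(":", 1)
        result.setdefault(src, {}).setdefault(distro, []).append(pkg)
    return result
```

The nested dict makes it easy to ask "what is this Debian package called on Fedora?" without re-running the tool for every lookup.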
It is using a range of different heuristics:
- match packages by name;
- match packages by desktop files contained within;
- match packages by pkg-config metadata files contained within;
- match packages by [/usr]/bin/* files contained within;
- match packages by shared library files contained within;
- match packages by devel library files contained within;
- match packages by man pages contained within;
- match stemmed form of development library package names;
- match stemmed form of shared library package names;
- match stemmed form of perl library package names;
- match stemmed form of python library package names.
This list may get obsolete soon as more heuristics get implemented.
Heuristics will never cover all the corner cases we surely have, but the idea is that if we can match a sizable amount of packages, the rest can be somehow fixed by hand as needed.
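To give a flavour of what a "stemmed form" heuristic can look like, here is a Python 3 sketch. These are not distromatch's actual rules: the regular expressions and function names are assumptions, chosen only so that the Debian and RPM spellings of the same library collapse to one key.

```python
import re

# Debian style: libdigest-sha1-perl; RPM style: perl-Digest-SHA1
RE_PERL_DEB = re.compile(r"^lib(.+)-perl$")
RE_PERL_RPM = re.compile(r"^perl-(.+)$")
# Development packages: libfoo-dev (Debian), foo-devel or libfoo-devel (RPM)
RE_DEVEL = re.compile(r"^(?:lib)?(.+?)-(?:dev|devel)$")


def stem_perl(name):
    """Return a distro-neutral key for a Perl library package, or None."""
    name = name.lower()
    for regex in (RE_PERL_DEB, RE_PERL_RPM):
        mo = regex.match(name)
        if mo:
            return mo.group(1)
    return None


def stem_devel(name):
    """Return a distro-neutral key for a development package, or None."""
    mo = RE_DEVEL.match(name.lower())
    return mo.group(1) if mo else None
```

Two packages from different distributions become match candidates when they stem to the same key; a rule this crude will produce false positives, which is why it is only one heuristic among many.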
The data it requires for a distribution should be rather straightforward to generate:
- a file which maps binary package names to source package names
- a file with the list of files in all the packages
For example:
$ ls -l dist-debian/
total 39688
-rw-r--r-- 1 enrico enrico 1688249 Jan 20 17:37 binsrc
drwxr-xr-x 2 enrico enrico 4096 Jan 21 19:12 db
-rw-r--r-- 1 enrico enrico 29960406 Jan 21 10:02 files.gz
-rw-r--r-- 1 enrico enrico 8914771 Jan 21 18:39 interesting-files
$ head dist-debian/binsrc
openoffice.org-dev openoffice.org
ext4-modules-2.6.32-5-4kc-malta-di linux-kernel-di-mipsel-2.6
linux-headers-2.6.30-2-common linux-2.6
libnspr4 nspr
ipfm ipfm
libforks-perl libforks-perl
med-physics debian-med
libntfs-3g-dev ntfs-3g
libguppi16 guppi
selinux selinux
$ zcat dist-debian/files.gz | head
memstat etc/memstat.conf
memstat usr/bin/memstat
memstat usr/share/doc/memstat/changelog.gz
memstat usr/share/doc/memstat/copyright
memstat usr/share/doc/memstat/memstat-tutorial.txt.gz
memstat usr/share/man/man1/memstat.1.gz
libdirectfb-dev usr/bin/directfb-config
libdirectfb-dev usr/bin/directfb-csource
libdirectfb-dev usr/include/directfb-internal/core/clipboard.h
libdirectfb-dev usr/include/directfb-internal/core/colorhash.h
interesting-files and db are generated when indexing.
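As an illustration of how a file list like the one above can feed the content-based heuristics, this Python 3 sketch picks out the paths that look useful for matching. It is not the code that generates interesting-files: the path patterns, the kind names and the function name are assumptions, and it expects one "package path" pair per line as in the files.gz sample.

```python
import re

# (kind, pattern) pairs; group 1 is the part worth comparing across distros
KINDS = [
    ("desktop", re.compile(r"^usr/share/applications/(.+\.desktop)$")),
    ("pkgconfig", re.compile(r"^usr/(?:lib|share)/(?:[^/]+/)?pkgconfig/(.+)\.pc$")),
    ("bin", re.compile(r"^(?:usr/)?bin/([^/]+)$")),
    ("man", re.compile(r"^usr/share/man/man\d/([^/]+?)\.\d\w*(?:\.gz)?$")),
]


def interesting(lines):
    """Yield (package, kind, key) for every path matching a known kind."""
    for line in lines:
        fields = line.strip().split(None, 1)
        if len(fields) != 2:
            continue
        pkg, path = fields
        for kind, regex in KINDS:
            mo = regex.match(path)
            if mo:
                yield pkg, kind, mo.group(1)
                break
```

Packages from two distributions that ship the same binary name, desktop file or pkg-config file are then likely to be the same software, whatever the package is called.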
To prove the usefulness of the idea (but does it need proving?), you can find in the same git repo a little example app (it took me 10 minutes to write it), that uses the distromatch engine to export Debtags tags to other distributions:
$ ./exportdebtags fedora | head
memstat: admin::benchmarking, interface::commandline, role::program, use::monitor
libdirectfb-dev: devel::lang:c, devel::library, implemented-in::c, interface::framebuffer, role::devel-lib
libkonqsidebarplugin4a: implemented-in::c++, role::shared-lib, suite::kde, uitoolkit::qt
libemail-simple-perl: devel::lang:perl, devel::library, implemented-in::perl, role::devel-lib, role::shared-lib, works-with::mail
libpoe-component-pluggable-perl: devel::lang:perl, devel::library, implemented-in::perl, role::shared-lib
manpages-ja: culture::japanese, made-of::man, role::documentation
libhippocanvas-dev: devel::library, qa::low-popcon, role::devel-lib
libexpat-ocaml-dev: devel::lang:ocaml, devel::library, implemented-in::c, implemented-in::ocaml, role::devel-lib, works-with-format::xml
libgnutls-dev: devel::library, role::devel-lib, suite::gnu
Just in case this made you itch to play with Debtags in a non-Debian distribution, I've generated the full datasets for Fedora, Mandriva and OpenSUSE.
Others have been working on the same matching problem. After we started writing code we started to become aware of existing work:
- whohas
- PackageMap
- Equivalent-Packages, statistically generated from package contents, more info in this post
I'd like to make use of those efforts, maybe to cross-validate results, maybe even better as yet another heuristic.
Update:
I built a simple distromatch query system into DDE!
Award winning code
Me and Yuwei had a fun day at hhhmcr (#hhhmcr) and even managed to put together a prototype that won the first prize \o/
We played with the gmp24 dataset kindly extracted from Twitter by Michael Brunton-Spall of the Guardian into a convenient JSON dataset. The idea was to find ways of making it easier to look at the data and making sense of it.
This is the story of what we did, including the code we wrote.
The original dataset has several JSON files, so the first task was to put them all together:
#!/usr/bin/python
# Merge the JSON data
# (C) 2010 Enrico Zini <enrico@enricozini.org>
# License: WTFPL version 2 (http://sam.zoy.org/wtfpl/)

import simplejson
import os

res = []
for f in os.listdir("."):
    if not f.startswith("gmp24"): continue
    data = open(f).read().strip()
    if data == "[]": continue
    parsed = simplejson.loads(data)
    res.extend(parsed)

print simplejson.dumps(res)
The results however were not ordered by date, as GMP had to use several accounts to twit because Twitter was putting Greater Manchester Police into jail for generating too much traffic. There would be quite a bit to write about that, but let's stick to our work.
Here is code to sort the JSON data by time:
#!/usr/bin/python
# Sort the JSON data
# (C) 2010 Enrico Zini <enrico@enricozini.org>
# License: WTFPL version 2 (http://sam.zoy.org/wtfpl/)

import simplejson
import sys
import datetime as dt

all_recs = simplejson.load(sys.stdin)
all_recs.sort(key=lambda x: dt.datetime.strptime(x["created_at"], "%a %b %d %H:%M:%S +0000 %Y"))
simplejson.dump(all_recs, sys.stdout)
I then wanted to play with Tf-idf for extracting the most important words of every tweet:
#!/usr/bin/python
# tfifd - Annotate JSON elements with Tf-idf extracted keywords
#
# Copyright (C) 2010 Enrico Zini <enrico@enricozini.org>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.

import sys, math
import simplejson
import re

# Read all the twits
records = simplejson.load(sys.stdin)

# All the twits by ID
byid = dict(((x["id"], x) for x in records))

# Stopwords we ignore
stopwords = set(["by", "it", "and", "of", "in", "a", "to"])

# Tokenising engine
re_num = re.compile(r"^\d+$")
re_word = re.compile(r"(\w+)")
def tokenise(tweet):
    "Extract tokens from a tweet"
    for tok in tweet["text"].split():
        tok = tok.strip().lower()
        if re_num.match(tok): continue
        mo = re_word.match(tok)
        if not mo: continue
        if mo.group(1) in stopwords: continue
        yield mo.group(1)

# Extract tokens from tweets
tokenised = dict(((x["id"], list(tokenise(x))) for x in records))

# Aggregate token counts
aggregated = {}
for d in byid.iterkeys():
    for t in tokenised[d]:
        if t in aggregated:
            aggregated[t] += 1
        else:
            aggregated[t] = 1

def tfidf(doc, tok):
    "Compute TFIDF score of a token in a document"
    return doc.count(tok) * math.log(float(len(byid)) / aggregated[tok])

# Annotate tweets with keywords
res = []
for name, tweet in byid.iteritems():
    doc = tokenised[name]
    keywords = sorted(set(doc), key=lambda tok: tfidf(doc, tok), reverse=True)[:5]
    tweet["keywords"] = keywords
    res.append(tweet)

simplejson.dump(res, sys.stdout)
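For readers who want to experiment with the scoring on its own, here is a compact Python 3 sketch of tf-idf keyword extraction. Note the swap: it uses the textbook document frequency (in how many documents a token appears), whereas the script above divides by the total number of occurrences of the token across all tweets. The function name and the tie-breaking rule are choices made for this example only.

```python
import math
from collections import Counter


def top_keywords(docs, limit=5):
    """Return {doc_id: keywords}, best tf-idf score first.

    docs maps a document id to its list of tokens.
    """
    # Document frequency: in how many documents each token appears
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    result = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        scores = {tok: tf[tok] * math.log(len(docs) / df[tok]) for tok in tf}
        # Highest score first; ties are broken alphabetically for stable output
        result[doc_id] = sorted(scores, key=lambda t: (-scores[t], t))[:limit]
    return result
```

A word that shows up in every document gets a score of zero, because log(1) is 0, which is why a term such as "police" would sink to the bottom of every keyword list in a dataset like this one.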
I thought this was producing a nice summary of every tweet but nobody was particularly interested, so we moved on to adding categories to tweets.
Thanks to Yuwei who put together some useful keyword sets, we managed to annotate each tweet with a place name (e.g. "Stockport"), a social place name (e.g. "pub", "bank") and a social category (e.g. "man", "woman", "landlord"...)
The code is simple; the biggest work in it was the dictionary of keywords:
#!/usr/bin/python
# categorise - Annotate JSON elements with categories
#
# Copyright (C) 2010 Enrico Zini <enrico@enricozini.org>
# Copyright (C) 2010 Yuwei Lin <yuwei@ylin.org>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.

import sys, math
import simplejson
import re

# Electoral wards from http://en.wikipedia.org/wiki/List_of_electoral_wards_in_Greater_Manchester
placenames = ["Altrincham", "Sale West", "Altrincham", "Ashton upon Mersey", "Bowdon",
    "Broadheath", "Hale Barns", "Hale Central", "St Mary", "Timperley", "Village",
    "Ashton-under-Lyne", "Ashton Hurst", "Ashton St Michael", "Ashton Waterloo",
    "Droylsden East", "Droylsden West", "Failsworth East", "Failsworth West", "St Peter",
    "Blackley", "Broughton", "Broughton", "Charlestown", "Cheetham", "Crumpsall",
    "Harpurhey", "Higher Blackley", "Kersal", "Bolton North East", "Astley Bridge",
    "Bradshaw", "Breightmet", "Bromley Cross", "Crompton", "Halliwell",
    "Tonge with the Haulgh", "Bolton South East", "Farnworth", "Great Lever",
    "Harper Green", "Hulton", "Kearsley", "Little Lever", "Darcy Lever", "Rumworth",
    "Bolton West", "Atherton", "Heaton", "Lostock", "Horwich", "Blackrod",
    "Horwich North East", "Smithills", "Westhoughton North", "Chew Moor",
    "Westhoughton South", "Bury North", "Church", "East", "Elton", "Moorside",
    "North Manor", "Ramsbottom", "Redvales", "Tottington", "Bury South", "Besses",
    "Holyrood", "Pilkington Park", "Radcliffe East", "Radcliffe North", "Radcliffe West",
    "St Mary", "Sedgley", "Unsworth", "Cheadle", "Bramhall North", "Bramhall South",
    "Cheadle", "Gatley", "Cheadle Hulme North", "Cheadle Hulme South", "Heald Green",
    "Stepping Hill", "Denton", "Reddish", "Audenshaw", "Denton North East",
    "Denton South", "Denton West", "Dukinfield", "Reddish North", "Reddish South",
    "Hazel Grove", "Bredbury", "Woodley", "Bredbury Green", "Romiley", "Hazel Grove",
    "Marple North", "Marple South", "Offerton", "Heywood", "Middleton", "Bamford",
    "Castleton", "East Middleton", "Hopwood Hall", "Norden", "North Heywood",
    "North Middleton", "South Middleton", "West Heywood", "West Middleton", "Leigh",
    "Astley Mosley Common", "Atherleigh", "Golborne", "Lowton West", "Leigh East",
    "Leigh South", "Leigh West", "Lowton East", "Tyldesley", "Makerfield", "Abram",
    "Ashton", "Bryn", "Hindley", "Hindley Green", "Orrell", "Winstanley",
    "Worsley Mesnes", "Manchester Central", "Ancoats", "Clayton", "Ardwick", "Bradford",
    "City Centre", "Hulme", "Miles Platting", "Newton Heath", "Moss Side", "Moston",
    "Manchester", "Gorton", "Fallowfield", "Gorton North", "Gorton South",
    "Levenshulme", "Longsight", "Rusholme", "Whalley Range", "Manchester", "Withington",
    "Burnage", "Chorlton", "Chorlton Park", "Didsbury East", "Didsbury West", "Old Moat",
    "Withington", "Oldham East", "Saddleworth", "Alexandra", "Crompton",
    "Saddleworth North", "Saddleworth South", "Saddleworth West", "Lees", "St James",
    "St Mary", "Shaw", "Waterhead", "Oldham West", "Royton", "Chadderton Central",
    "Chadderton North", "Chadderton South", "Coldhurst", "Hollinwood", "Medlock Vale",
    "Royton North", "Royton South", "Werneth", "Rochdale", "Balderstone", "Kirkholt",
    "Central Rochdale", "Healey", "Kingsway", "Littleborough Lakeside", "Milkstone",
    "Deeplish", "Milnrow", "Newhey", "Smallbridge", "Firgrove", "Spotland", "Falinge",
    "Wardle", "West Littleborough", "Salford", "Eccles", "Claremont", "Eccles",
    "Irwell Riverside", "Langworthy", "Ordsall", "Pendlebury", "Swinton North",
    "Swinton South", "Weaste", "Seedley", "Stalybridge", "Hyde",
    "Dukinfield Stalybridge", "Hyde Godley", "Hyde Newton", "Hyde Werneth",
    "Longdendale", "Mossley", "Stalybridge North", "Stalybridge South", "Stockport",
    "Brinnington", "Central", "Davenport", "Cale Green", "Edgeley", "Cheadle Heath",
    "Heatons North", "Heatons South", "Manor", "Stretford", "Urmston",
    "Bucklow-St Martins", "Clifford", "Davyhulme East", "Davyhulme West", "Flixton",
    "Gorse Hill", "Longford", "Stretford", "Urmston", "Wigan",
    "Aspull New Springs Whelley", "Douglas", "Ince", "Pemberton",
    "Shevington with Lower Ground", "Standish with Langtree", "Wigan Central",
    "Wigan West", "Worsley", "Eccles South", "Barton", "Boothstown", "Ellenbrook",
    "Cadishead", "Irlam", "Little Hulton", "Walkden North", "Walkden South", "Winton",
    "Worsley", "Wythenshawe", "Sale East", "Baguley", "Brooklands", "Northenden",
    "Priory", "Sale Moor", "Sharston", "Woodhouse Park"]

# Manual coding from Yuwei
placenames.extend(["City centre", "Tameside", "Oldham", "Bury", "Bolton", "Trafford",
    "Pendleton", "New Moston", "Denton", "Eccles", "Leigh", "Benchill", "Prestwich",
    "Sale", "Kearsley", ])
placenames.extend(["Trafford", "Bolton", "Stockport", "Levenshulme", "Gorton",
    "Tameside", "Blackley", "City centre", "Airport", "South Manchester", "Rochdale",
    "Chorlton", "Uppermill", "Castleton", "Stalybridge", "Ashton", "Chadderton", "Bury",
    "Ancoats", "Whalley Range", "West Yorkshire", "Fallowfield", "New Moston", "Denton",
    "Stretford", "Eccles", "Pendleton", "Leigh", "Altrincham", "Sale", "Prestwich",
    "Kearsley", "Hulme", "Withington", "Moss Side", "Milnrow",
    "outskirt of Manchester City Centre", "Newton Heath", "Wythenshawe",
    "Mancunian Way", "M60", "A6", "Droylesden", "M56", "Timperley", "Higher Ince",
    "Clayton", "Higher Blackley", "Lowton", "Droylsden", "Partington", "Cheetham Hill",
    "Benchill", "Longsight", "Didsbury", "Westhoughton"])

# Social categories from Yuwei
soccat = ["man", "woman", "men", "women", "youth", "teenager", "elderly", "patient",
    "taxi driver", "neighbour", "male", "tenant", "landlord", "child", "children",
    "immigrant", "female", "workmen", "boy", "girl", "foster parents", "next of kin"]
for i in range(100):
    soccat.append("%d-year-old" % i)
    soccat.append("%d-years-old" % i)

# Types of social locations from Yuwei
socloc = ["car park", "park", "pub", "club", "shop", "premises", "bus stop",
    "property", "credit card", "supermarket", "garden", "phone box", "theatre",
    "toilet", "building site", "Crown court", "hard shoulder", "telephone kiosk",
    "hotel", "restaurant", "cafe", "petrol station", "bank", "school", "university"]

extras = {
    "placename": placenames,
    "soccat": soccat,
    "socloc": socloc
}

# Normalise keyword lists
for k in extras.keys():
    # Remove duplicates and sort by length, longest first.
    # The result is stored back into extras, otherwise add_categories
    # would keep using the original, unsorted lists.
    extras[k] = sorted(set(extras[k]), key=len, reverse=True)

# Add keywords
def add_categories(tweet):
    text = tweet["text"].lower()
    for field, categories in extras.iteritems():
        for cat in categories:
            if cat.lower() in text:
                tweet[field] = cat
                break
    return tweet

# Read all the twits
records = (add_categories(x) for x in simplejson.load(sys.stdin))

simplejson.dump(list(records), sys.stdout)
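The reason for sorting the keyword lists by length is that the first match wins, so longer and more specific keywords should be tried first. A tiny Python 3 sketch of that idea, with a made-up function name and sample data, reduced to a single keyword list:

```python
def categorise(text, keywords):
    """Return the longest keyword contained in text, or None."""
    text = text.lower()
    # Longest first, so that "car park" is tried before "park"
    for keyword in sorted(set(keywords), key=len, reverse=True):
        if keyword.lower() in text:
            return keyword
    return None
```

Keep in mind that this is plain substring matching, as in the script: a short keyword such as "pub" will also match inside "public", which is acceptable for a hack day prototype but worth tightening with word boundaries for anything more serious.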
All these scripts form a nice processing chain: each script takes a list of JSON records, adds some bit and passes it on.
In order to see what we have so far, here is a simple script to convert the JSON twits to CSV so they can be viewed in a spreadsheet:
#!/usr/bin/python
# Convert the JSON twits to CSV
# (C) 2010 Enrico Zini <enrico@enricozini.org>
# License: WTFPL version 2 (http://sam.zoy.org/wtfpl/)

import simplejson
import sys
import csv

rows = ["id", "created_at", "text", "keywords", "placename"]

writer = csv.writer(sys.stdout)
for rec in simplejson.load(sys.stdin):
    rec["keywords"] = " ".join(rec["keywords"])
    rec["placename"] = rec.get("placename", "")
    writer.writerow([rec[row] for row in rows])
At this point we were coming up with lots of questions: "were there more reports on women or men?", "which place had most incidents?", "what were the incidents involving animals?"... Time to bring Xapian into play.
This script reads all the JSON tweets and builds a Xapian index with them:
#!/usr/bin/python # toxapian - Index JSON tweets in Xapian # # Copyright (C) 2010 Enrico Zini <enrico@enricozini.org> # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program. If not, see <http://www.gnu.org/licenses/>. import simplejson import sys import os, os.path import xapian DBNAME = sys.argv[1] db = xapian.WritableDatabase(DBNAME, xapian.DB_CREATE_OR_OPEN) stemmer = xapian.Stem("english") indexer = xapian.TermGenerator() indexer.set_stemmer(stemmer) indexer.set_database(db) data = simplejson.load(sys.stdin) for rec in data: doc = xapian.Document() doc.set_data(str(rec["id"])) indexer.set_document(doc) indexer.index_text_without_positions(rec["text"]) # Index categories as categories if "placename" in rec: doc.add_boolean_term("XP" + rec["placename"].lower()) if "soccat" in rec: doc.add_boolean_term("XS" + rec["soccat"].lower()) if "socloc" in rec: doc.add_boolean_term("XL" + rec["socloc"].lower()) db.add_document(doc) db.flush() # Also save the whole dataset so we know where to find it later if we want to # show the details of an entry simplejson.dump(data, open(os.path.join(DBNAME, "all.json"), "w"))
And this is a simple command line tool to query to the database:
#!/usr/bin/python
# xgrep - Command line tool to query the GMP24 tweet Xapian database
#
# Copyright (C) 2010 Enrico Zini <enrico@enricozini.org>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.

import simplejson
import sys
import os, os.path
import xapian

DBNAME = sys.argv[1]

db = xapian.Database(DBNAME)

stem = xapian.Stem("english")

qp = xapian.QueryParser()
qp.set_default_op(xapian.Query.OP_AND)
qp.set_database(db)
qp.set_stemmer(stem)
qp.set_stemming_strategy(xapian.QueryParser.STEM_SOME)
qp.add_boolean_prefix("place", "XP")
qp.add_boolean_prefix("soc", "XS")
qp.add_boolean_prefix("loc", "XL")

query = qp.parse_query(sys.argv[2],
    xapian.QueryParser.FLAG_BOOLEAN |
    xapian.QueryParser.FLAG_LOVEHATE |
    xapian.QueryParser.FLAG_BOOLEAN_ANY_CASE |
    xapian.QueryParser.FLAG_WILDCARD |
    xapian.QueryParser.FLAG_PURE_NOT |
    xapian.QueryParser.FLAG_SPELLING_CORRECTION |
    xapian.QueryParser.FLAG_AUTO_SYNONYMS)

enquire = xapian.Enquire(db)
enquire.set_query(query)

count = 40
matches = enquire.get_mset(0, count)
estimated = matches.get_matches_estimated()
print "%d/%d results" % (matches.size(), estimated)

data = dict((str(x["id"]), x) for x in simplejson.load(open(os.path.join(DBNAME, "all.json"))))

for m in matches:
    rec = data[m.document.get_data()]
    print rec["text"]

print "%d/%d results" % (matches.size(), matches.get_matches_estimated())

total = db.get_doccount()
estimated = matches.get_matches_estimated()
print "%d results over %d documents, %d%%" % (estimated, total, estimated * 100 / total)
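The add_boolean_prefix calls are what make a query like "car place:wigan" work: they map a field name typed by the user to the term prefix used in the index. As a rough illustration only, here is a toy version of that mapping for simple lowercase tokens. This is not Xapian's parser (which also handles quoting, operators, stemming and more), and split_query is a hypothetical helper that exists only for this sketch:

```python
# Toy model of the field-name -> term-prefix mapping set up with
# add_boolean_prefix. NOT Xapian's parser: it only handles plain
# whitespace-separated, lowercase tokens.
PREFIXES = {"place": "XP", "soc": "XS", "loc": "XL"}


def split_query(qstring, prefixes=PREFIXES):
    """Split a query into free-text words and prefixed boolean terms.

    split_query is a hypothetical helper written for this illustration.
    """
    words, filters = [], []
    for token in qstring.split():
        field, sep, value = token.partition(":")
        if sep and value and field in prefixes:
            # "place:wigan" becomes the index term "XPwigan"
            filters.append(prefixes[field] + value)
        else:
            # Anything else is treated as a normal search word
            words.append(token)
    return words, filters


print(split_query("car place:wigan"))
```

With the real QueryParser the free-text words are stemmed and combined with OP_AND, while the prefixed terms act as boolean filters.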
Neat! Now that we have a proper index that supports all sorts of cool things, like stemming, tag clouds, full text search with complex queries, lookup of similar documents, keyword suggestions and so on, it was only fair to put together a web service to share it with other people at the event.
It helped that I had already written similar code for apt-xapian-index and dde before.
Here is the server, quickly built on bottle. The very last line starts the server and it is where you can configure the listening interface and port.
#!/usr/bin/python
# xserve - Make the GMP24 tweet Xapian database available on the web
#
# Copyright (C) 2010 Enrico Zini <enrico@enricozini.org>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.

import bottle
from bottle import route, post
from cStringIO import StringIO
import cPickle as pickle
import simplejson
import sys
import os, os.path
import xapian
import urllib
import math

bottle.debug(True)

DBNAME = sys.argv[1]
QUERYLOG = os.path.join(DBNAME, "queries.txt")

data = dict((str(x["id"]), x) for x in simplejson.load(open(os.path.join(DBNAME, "all.json"))))

prefixes = {
    "place": "XP",
    "soc": "XS",
    "loc": "XL"
}
prefix_desc = {
    "place": "Place name",
    "soc": "Social category",
    "loc": "Social location"
}

db = xapian.Database(DBNAME)

stem = xapian.Stem("english")

qp = xapian.QueryParser()
qp.set_default_op(xapian.Query.OP_AND)
qp.set_database(db)
qp.set_stemmer(stem)
qp.set_stemming_strategy(xapian.QueryParser.STEM_SOME)
for k, v in prefixes.iteritems():
    qp.add_boolean_prefix(k, v)

def make_query(qstring):
    return qp.parse_query(qstring,
        xapian.QueryParser.FLAG_BOOLEAN |
        xapian.QueryParser.FLAG_LOVEHATE |
        xapian.QueryParser.FLAG_BOOLEAN_ANY_CASE |
        xapian.QueryParser.FLAG_WILDCARD |
        xapian.QueryParser.FLAG_PURE_NOT |
        xapian.QueryParser.FLAG_SPELLING_CORRECTION |
        xapian.QueryParser.FLAG_AUTO_SYNONYMS)

@route("/")
def index():
    query = urllib.unquote_plus(bottle.request.GET.get("q", ""))
    out = StringIO()
    print >>out, '''
<html>
<head>
<title>Query</title>
<script src="http://ajax.googleapis.com/ajax/libs/jquery/1.4.2/jquery.min.js"></script>
<script type="text/javascript">
$(function(){
    $("#queryfield")[0].focus()
})
</script>
</head>
<body>
<h1>Search</h1>
<form method="POST" action="/query">
Keywords: <input type="text" name="query" value="%s" id="queryfield">
<input type="submit">
<a href="http://xapian.org/docs/queryparser.html">Help</a>
</form>''' % query
    print >>out, '''
<p>Example: "car place:wigan"</p>

<p>Available prefixes:</p>

<ul>
'''
    for pfx in prefixes.keys():
        print >>out, "<li><a href='/catinfo/%s'>%s - %s</a></li>" % (pfx, pfx, prefix_desc[pfx])
    print >>out, '''
</ul>
'''

    oldqueries = []
    if os.path.exists(QUERYLOG):
        total = db.get_doccount()
        fd = open(QUERYLOG, "r")
        while True:
            try:
                q = pickle.load(fd)
            except EOFError:
                break
            oldqueries.append(q)
        fd.close()

    def print_query(q):
        count = q["count"]
        print >>out, "<li><a href='/query?query=%s'>%s (%d/%d %.2f%%)</a></li>" % (urllib.quote_plus(q["q"]), q["q"], count, total, count * 100.0 / total)

    print >>out, "<p>Last 10 queries:</p><ul>"
    for q in oldqueries[:-10:-1]:
        print_query(q)
    print >>out, "</ul>"

    # Remove duplicates
    oldqueries = dict(((x["q"], x) for x in oldqueries)).values()

    print >>out, "<table>"
    print >>out, "<tr><th>10 queries with most results</th><th>10 queries with least results</th></tr>"
    print >>out, "<tr><td>"

    print >>out, "<ul>"
    oldqueries.sort(key=lambda x:x["count"], reverse=True)
    for q in oldqueries[:10]:
        print_query(q)
    print >>out, "</ul>"

    print >>out, "</td><td>"

    print >>out, "<ul>"
    nonempty = [x for x in oldqueries if x["count"] > 0]
    nonempty.sort(key=lambda x:x["count"])
    for q in nonempty[:10]:
        print_query(q)
    print >>out, "</ul>"

    print >>out, "</td></tr>"
    print >>out, "</table>"

    print >>out, '''
</body>
</html>'''
    return out.getvalue()

@route("/query")
@route("/query/")
@post("/query")
@post("/query/")
def query():
    query = bottle.request.POST.get("query", bottle.request.GET.get("query", ""))
    enquire = xapian.Enquire(db)
    enquire.set_query(make_query(query))
    count = 40
    matches = enquire.get_mset(0, count)
    estimated = matches.get_matches_estimated()
    total = db.get_doccount()
    out = StringIO()
    print >>out, '''
<html>
<head><title>Results</title></head>
<body>
<h1>Results for "<b>%s</b>"</h1>
''' % query
    if estimated == 0:
        print >>out, "No results found."
    else:
        # Give as results the first 30 documents; also use them as the key
        # ones to use to compute relevant terms
        rset = xapian.RSet()
        for m in enquire.get_mset(0, 30):
            rset.add_document(m.document.get_docid())

        # Compute the tag cloud
        class NonTagFilter(xapian.ExpandDecider):
            def __call__(self, term):
                return not term[0].isupper() and not term[0].isdigit()

        cloud = []
        maxscore = None
        for res in enquire.get_eset(40, rset, NonTagFilter()):
            # Normalise the score in the interval [0, 1]
            weight = math.log(res.weight)
            if maxscore == None:
                maxscore = weight
            tag = res.term
            cloud.append([tag, float(weight) / maxscore])
        max_weight = cloud[0][1]
        min_weight = cloud[-1][1]
        cloud.sort(key=lambda x:x[0])

        def mklink(query, term):
            return "/query?query=%s" % urllib.quote_plus(query + " and " + term)

        print >>out, "<h2>Tag cloud</h2>"
        print >>out, "<blockquote>"
        for term, weight in cloud:
            size = 100 + 100.0 * (weight - min_weight) / (max_weight - min_weight)
            print >>out, "<a href='%s' style='font-size:%d%%; color:brown;'>%s</a>" % (mklink(query, term), size, term)
        print >>out, "</blockquote>"

        print >>out, "<h2>Results</h2>"
        print >>out, "<p><a href='/'>Search again</a></p>"

        print >>out, "<p>%d results over %d documents, %.2f%%</p>" % (estimated, total, estimated * 100.0 / total)
        print >>out, "<p>%d/%d results</p>" % (matches.size(), estimated)

        print >>out, "<ul>"
        for m in matches:
            rec = data[m.document.get_data()]
            print >>out, "<li><a href='/item/%s'>%s</a></li>" % (rec["id"], rec["text"])
        print >>out, "</ul>"

    fd = open(QUERYLOG, "a")
    qinfo = dict(q=query, count=estimated)
    pickle.dump(qinfo, fd)
    fd.close()
    print >>out, '''
<a href="/">Search again</a>
</body>
</html>'''
    return out.getvalue()

@route("/item/:id")
@route("/item/:id/")
def show(id):
    rec = data[id]
    out = StringIO()
    print >>out, '''
<html>
<head><title>Result %s</title></head>
<body>
<h1>Raw JSON record for twit %s</h1>
<pre>''' % (rec["id"], rec["id"])
    print >>out, simplejson.dumps(rec, indent=" ")
    print >>out, '''
</pre>
</body>
</html>'''
    return out.getvalue()

@route("/catinfo/:name")
@route("/catinfo/:name/")
def catinfo(name):
    prefix = prefixes[name]
    out = StringIO()
    print >>out, '''
<html>
<head><title>Values for %s</title></head>
<body>
''' % name

    terms = [(x.term[len(prefix):], db.get_termfreq(x.term)) for x in db.allterms(prefix)]
    terms.sort(key=lambda x:x[1], reverse=True)
    freq_min = terms[0][1]
    freq_max = terms[-1][1]

    def mklink(name, term):
        return "/query?query=%s" % urllib.quote_plus(name + ":" + term)

    # Build tag cloud
    print >>out, "<h1>Tag cloud</h1>"
    print >>out, "<blockquote>"
    for term, freq in sorted(terms[:20], key=lambda x:x[0]):
        size = 100 + 100.0 * (freq - freq_min) / (freq_max - freq_min)
        print >>out, "<a href='%s' style='font-size:%d%%; color:brown;'>%s</a>" % (mklink(name, term), size, term)
    print >>out, "</blockquote>"

    print >>out, "<h1>All terms</h1>"
    print >>out, "<table>"
    print >>out, "<tr><th>Occurrences</th><th>Name</th></tr>"
    for term, freq in terms:
        print >>out, "<tr><td>%d</td><td><a href='/query?query=%s'>%s</a></td></tr>" % (freq, urllib.quote_plus(name + ":" + term), term)
    print >>out, "</table>"
    print >>out, '''
</body>
</html>'''
    return out.getvalue()

# Change here for bind host and port
bottle.run(host="0.0.0.0", port=8024)
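The tag clouds in the server scale each term's font size linearly between 100% and 200% according to its weight. Here is a minimal standalone sketch of that scaling, written so that it runs without Xapian. The font_sizes helper is hypothetical, and the guards for an empty cloud and for the case where all weights are equal are additions that are not in the server code:

```python
def font_sizes(weights, base=100, spread=100):
    """Map each term's weight to a CSS font-size percentage.

    Hypothetical helper for illustration: the smallest weight gets `base`,
    the largest gets `base + spread`, everything else is interpolated
    linearly in between.
    """
    if not weights:
        return {}
    lo = min(weights.values())
    hi = max(weights.values())
    span = hi - lo
    sizes = {}
    for term, weight in weights.items():
        if span == 0:
            # All weights are equal: avoid dividing by zero
            sizes[term] = base
        else:
            sizes[term] = int(base + float(spread) * (weight - lo) / span)
    return sizes


# Example with made-up weights
sizes = font_sizes({"car": 1.0, "bus": 0.5, "tram": 0.75})
for term in sorted(sizes):
    print("<a style='font-size:%d%%; color:brown;'>%s</a>" % (sizes[term], term))
```

Sorting the terms alphabetically before printing, as the server does, keeps the cloud easy to scan while the font size still carries the relevance information.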
...and then we presented our work and ended up winning the contest.
This was the story of how we wrote this set of award-winning code.
Released cfget 0.17
I have released version 0.17 of cfget.
Changes:
- Fixed a DeprecationWarning with Python 2.6
- Allow empty values in configuration files
- The round() function now returns an int, not a float
