Check out http://people.debian.org/~enrico/popsuggest.php.
You can upload your system's popcon results there and have them compared with all the other submissions.
It will show the results grouped by the most relevant tags found on your system, and you can filter the results by various package types that map to debtags expressions.
Credits also go to Eric Evans, for the consistently useful and supportive discussions throughout the development.
As I said, there is so much really cool stuff to be written, just within reach.
Some random notes taken during the development:
- The popcon dataset is on gluck in /org/popcon.debian.org/popcon-mail/popcon-entries
- Local popcon scan results are in /var/log/popularity-contest
- I now generate and publish aggregate frequency information
- Here are two links about TFIDF
- Here is some more literature about computing similarities; the Jaccard Index seems to be made exactly for Debtags tag sets.
- Xapian has Python bindings, documented in /usr/share/doc/python-xapian/bindings.html; the full C++ API documentation is also useful
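
To make the Jaccard Index note a bit more concrete, here is a minimal sketch in plain Python. It is not the popsuggest code: the function name `jaccard` and the sample tag sets are made up for illustration, and the choice to return 0.0 for two empty sets is an assumption. The index is simply the size of the intersection of two sets divided by the size of their union, which is why it fits tag sets so naturally.

```python
def jaccard(tags_a, tags_b):
    """Jaccard index of two tag collections: |A & B| / |A | B|."""
    a, b = set(tags_a), set(tags_b)
    union = a | b
    if not union:
        # Assumption: two empty tag sets are treated as having no similarity.
        return 0.0
    return len(a & b) / len(union)


# Illustrative tag sets only; not taken from any real package.
editor_tags = {"role::program", "use::editing", "interface::commandline"}
viewer_tags = {"role::program", "use::viewing", "interface::commandline"}

similarity = jaccard(editor_tags, viewer_tags)
print(similarity)
```

Two of the four distinct tags are shared here, so the two sets score 2/4. A score of 1.0 means identical tag sets, and 0.0 means nothing in common.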