Representative keywords of DPL platforms
Are the DPL platforms too long, and could you use a very, very short executive summary? No problem, I have the technology for it.
After the results you can find the kit to build yourself an extractor in the comfort of your home.
The results
- 93sam: jobs, deal, dds, nms, asking
- aigarius: applications, aigarius, choose, trademarks, apps
- ajt: humbug, effective, neat, hoping, success
- hertzog: broadly, wouter, serve, represent, pushed
- sho: deadline, helps, excellence, freaks, tasks
- sjr: qb, published, xxxx, r, yet
- stratus: stable, websites, feature, submitter, involving
- svenl: unfair, protest, ban, publish, banning
- wouter: controversy, seem, background, delegates, therefore
Acquiring the data
for i in 93sam aigarius ajt hertzog sho sjr stratus svenl wouter
do
    wget "http://www.debian.org/vote/2007/platforms/$i"
done
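If you would rather stay in Python for the whole kit, the download step could be sketched roughly as follows. The helper names platform_url and fetch_all are made up for this illustration; only the URL pattern and the list of candidates come from the loop above.

```python
import urllib.request

# Candidate names and URL pattern, copied from the wget loop
CANDIDATES = ["93sam", "aigarius", "ajt", "hertzog", "sho",
              "sjr", "stratus", "svenl", "wouter"]
BASE_URL = "http://www.debian.org/vote/2007/platforms/"

def platform_url(name):
    """Build the URL of one candidate's platform (hypothetical helper)."""
    return BASE_URL + name

def fetch_all(names=CANDIDATES):
    """Save every platform into a file named after the candidate."""
    for name in names:
        urllib.request.urlretrieve(platform_url(name), name)
```

Calling fetch_all() leaves one file per candidate in the current directory, just like wget does.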
Tokenizing
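#!/bin/sh
# Turn each HTML file into one lowercase word per line, saved as FILE.tok
for file in "$@"
do
    # No brackets around the tr ranges: tr would treat them as literal set members
    lynx -dump -stdin < "$file" | tr -c 'a-zA-Z' ' ' | tr 'A-Z' 'a-z' | sed -e 's/ /\n/g' | sed -e '/^$/d' > "$file.tok"
done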
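The tr and sed part of that pipeline boils down to "keep runs of ASCII letters, lowercase them". A minimal Python sketch of the same idea, assuming the text has already been converted from HTML (for instance by lynx -dump), could look like this. The function name tokenize is my own choice for the example.

```python
import re

def tokenize(text):
    """Return the lowercase runs of ASCII letters found in text.

    Everything that is not a-z or A-Z acts as a separator, which is
    what replacing those characters with spaces achieves in the shell.
    """
    return [word.lower() for word in re.findall(r"[A-Za-z]+", text)]
```

For example, tokenize("Hello, World! It's 2007.") gives ['hello', 'world', 'it', 's']: punctuation and digits vanish, and the apostrophe splits "It's" in two.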
Extracting the most representative keywords
#!/usr/bin/env python3
import sys, math

def read_tokens(file):
    "Read all the tokens from one file"
    with open(file) as f:
        return [line.rstrip("\n") for line in f]

# Read all the "documents"
docs = [read_tokens(file) for file in sys.argv[1:]]

# Aggregate token counts over all documents
# (total occurrences, not the number of documents containing the token)
aggregated = {}
for d in docs:
    for t in d:
        if t in aggregated:
            aggregated[t] += 1
        else:
            aggregated[t] = 1

def tfidf(doc, tok):
    "Compute TFIDF score of a token in a document"
    return doc.count(tok) * math.log(float(len(docs)) / aggregated[tok])

# Output the top 5 tokens by TFIDF for every document
for name, doc in zip(sys.argv[1:], docs):
    print(name, sorted(set(doc), key=lambda tok: tfidf(doc, tok), reverse=True)[:5])
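Note that the script above divides by the total number of times a token occurs across all platforms, not by the number of platforms that contain it. The textbook TF-IDF uses the latter (the document frequency). Here is a sketch of that variant for comparison. It is not what produced the results listed earlier, and the function name top_keywords and the alphabetical tie-breaking are my own assumptions.

```python
import math
from collections import Counter

def top_keywords(docs, n=5):
    """Return the n best tokens per document by textbook TF-IDF.

    docs maps a document name to its list of tokens.
    """
    # Document frequency: in how many documents each token appears
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    result = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)  # term frequency within this document
        scores = {tok: count * math.log(len(docs) / df[tok])
                  for tok, count in tf.items()}
        # Highest score first, ties broken alphabetically
        result[name] = sorted(scores, key=lambda tok: (-scores[tok], tok))[:n]
    return result
```

With this definition a token that appears in every document always scores zero, however often it is repeated, and scores can never go negative.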
Errata
Jacobo suggests using lynx -dump -nolist or w3m -dump for a more tokenizer-friendly text expansion.