Debtags algorithms

Relevance scoring

Suppose you have a filtering operation O (such as a full-text search) that operates a selection on packages, and returns you a subset P1 of the total set of packages P.

It is possible to compare P and P1 and score each tag by how much it is related to O.

Define the cardinality of a tag T as the number of packages that have the tag.

The relevance scoring function for a tag T after a filtering operation is:

                      2
cardinality of T in P1
----------------------
cardinality of T in P

that is, the square of the cardinality of T in P1 divided by the cardinality of T in P.

This simple scoring function allows to correlate tags to any other search operation on packages, allowing to fit a context sensitive tag search in almost any kind of package interface.

You can see this function in action at:

The Smart Package Search page
The smartsearch, wxssearch and tagsByRelevance examples of the python-debian package
The "Add rel" button in debtags-edit
The pmnew and pmnew-gui prototypes
The debtags smartsearch function.

Tag search

The relevance scoring can also be used to implement a smart tag search function that gives far better results than a full text search on tag names and descriptions:

Use the keywords to perform a full text search on the package descriptions
Use the results to score tags by relevance, and show the most relevant tags

Besides using this technique to search tags, it can be used to turn a keyword search (which is easy to input) into a tag search (which easy to navigate).

Discriminance scoring

Another useful scoring function is by "discriminance".

Suppose you have an interface that shows a list of tags and allows users to perform the two actions:

"I want only packages that have this tag"
"I want only packages that do not have this tag"

Then it becomes important to have a scoring function that tells you what are those packages that, when chosen in either of these actions, would give the shortest resulting package list.

The function is simple to compute:

let P be the number of available packages
let T be the cardinality of a tag
then the discriminance index is simply min(T, P-T)

This scoring function can be recomputed after every selection operated by the user. The resulting interface allows a user to quickly get to the packages they need without having a clear idea of what they actually need.