Debtags algorithms
Relevance scoring
Suppose you have a filtering operation O (such as a full-text search) that operates a selection on packages, and returns you a subset P1 of the total set of packages P.
It is possible to compare P and P1 and score each tag by how much it is related to O.
Define the cardinality of a tag T as the number of packages that have the tag.
The relevance scoring function for a tag T after a filtering operation is:
2 cardinality of T in P1 ---------------------- cardinality of T in P
that is, the square of the cardinality of T in P1 divided by the cardinality of T in P.
This simple scoring function allows to correlate tags to any other search operation on packages, allowing to fit a context sensitive tag search in almost any kind of package interface.
You can see this function in action at:
- The Smart Package Search page
- The
smartsearch
,wxssearch
andtagsByRelevance
examples of thepython-debian
package - The "Add rel" button in
debtags-edit
- The pmnew and pmnew-gui prototypes
- The
debtags smartsearch
function.
Tag search
The relevance scoring can also be used to implement a smart tag search function that gives far better results than a full text search on tag names and descriptions:
- Use the keywords to perform a full text search on the package descriptions
- Use the results to score tags by relevance, and show the most relevant tags
Besides using this technique to search tags, it can be used to turn a keyword search (which is easy to input) into a tag search (which easy to navigate).
Discriminance scoring
Another useful scoring function is by "discriminance".
Suppose you have an interface that shows a list of tags and allows users to perform the two actions:
- "I want only packages that have this tag"
- "I want only packages that do not have this tag"
Then it becomes important to have a scoring function that tells you what are those packages that, when chosen in either of these actions, would give the shortest resulting package list.
The function is simple to compute:
- let P be the number of available packages
- let T be the cardinality of a tag
- then the discriminance index is simply min(T, P-T)
This scoring function can be recomputed after every selection operated by the user. The resulting interface allows a user to quickly get to the packages they need without having a clear idea of what they actually need.