Data formats
This is a reference for the various file and data formats one may encounter when working with debtags. A list of data sources is found in the chapter Available data sources.
Tag name
The full tag name is composed of a facet name followed by double colon, followed by the tag name.
The facet name can technically contain anything except for a double colon or a comma followed by a space; in practice however, only alphanumeric characters and dashes are currently used.
The tag name can technically contain anything except for a comma followed by a space; in practice, however, only alphanumeric characters, pluses, dashes and colons are used.
Colons in the tag name are sometimes used to organise the tags of the same facet in a hierarchical way.
Facet and tag names are case sensitive.
Examples of tag names
devel::bugtracker
devel::code-generator
devel::i18n
devel::lang:c
devel::lang:c++
implemented-in::c
implemented-in::c++
interface::3d
interface::commandline
interface::text-mode
works-with-format::xml
works-with-format::xml:rss
works-with-format::xml:xslt
Tags in the Packages files
The tags of a package in the Debian Packages file are stored in the Tag:
header.
The format is a simple list of tags, sorted in alphabetical order, and
separated by comma and one or more spaces. The separator is matched by the
regular expression /,\s+/
.
In order to cope with maximum line length limitations in the Packages file format, multiple tags belonging to the same facet are grouped using curly braces.
Examples of tag fields in the Packages file
# Empty field
Tag:
# Field with only one tag
Tag: web::server
# Field with many tags
Tag: made-of::data:font, role::app-data, use::printing, works-with::text, works-with-format::postscript
# Field with grouped tags
Tag: made-of::{data:font,data:html,data:pdf,data:tex}, role::app-data, role::documentation, suite::TODO, use::typesetting, works-with::{font,image,image:vector,text}, works-with-format::{pdf,postscript,tex}, x11::font
Tag database
Many debtags and debtags-related tools can store an entire tag database in a single file.
Example uses are:
- The available tag information stored by
debtags
in/var/lib/debtags/package-tags
- The tag database downloadable from
http://debtags.alioth.debian.org/tags/tags-current.gz
- The reviewed tag database in
svn://svn.debian.org/debtags/tagdb/tags
- The input and output of the
tagcoll
utility
Every line in the file is able to associate one or more packages to zero or more tags.
A line is composed by a list of package names and a list of tag names,
separated by a colon and one or more spaces. This separator can be matched by
the regular expression /:\s+/
.
The package names and tag names are instead separated by a comma and one or
more spaces, as represented by the regular expression /,\s+/
.
A line of the tag database can therefore be parsed using the following perl code:
my ($packages, $tags) = split(/:\s+/, $line);
my @packages = split(/,\s+/, $packages);
my @tags = split(/,\s+/, $tags);
A package name can appear in multiple lines of the tag database, but not on the same line: this allows to easily merge multiple tag database by just concatenating the files. For convenience, however, most debtags tools create databases where package names appear only once.
No curly-brace grouping is allowed in a tag database.
The tagcoll
tool can be used to work with information in this format. These
are some example operations:
# Merge tag informations so that package names only appears once in the
# file
tagcoll -g copy my-tagdb
# Convert a packages -> tags mapping to a tags -> packages mapping
tagcoll reverse my-tagdb
# Compute the differences between two tag databases
tagcoll diff my-tagdb1 my-tagdb2 > tag-patch
# Apply a tag patch
tagcoll -p tag-patch copy my-tagdb1 > same-as-my-tagdb2
# Filter a tag database, keeping only the lines matching a given tag
# expression
tagcoll grep 'role::program && interface::x11' my-tagdb
Empty lists of tags can be used to record that a package is known to have no tags, but they are usually discarded.
Example lines of a tag database
xmame-x: hardware::emulation
abakus: field::mathematics, interface::x11, role::program, scope::utility, suite::kde, uitoolkit::qt, x11::application
gnomermind, gnotepad+, gnotime: interface::x11, role::program, scope::application, suite::gnome, uitoolkit::gtk, x11::application
Vocabulary
The tag vocabulary, that can be found in /var/lib/debtags/vocabulary
, in
http://debtags.alioth.debian.org/tags/vocabulary.gz
and in
svn://svn.debian.org/debtags/vocabulary/trunk/debian-packages
, has the same
RFC822-style format as the Debian Packages file.
The debtags vocabulary contains two kinds of records: a facet description and a tag description. A facet description starts with the "Facet:" header, while the tag description starts with the "Tag:" header.
Facet and tag records usually contain the "Description:" field that has the
exact same behaviour as the "Description:" field in the debian/control
file,
including the distinction between short and long description, the dot to
represent empty lines and the other formatting rules.
There can be other fields besides "Tag:", "Facet:" and "Description:", but they are currently ignored.
Example record of a tag vocabulary
Facet: protocol
Description: Network Protocol
Status: draft
Tag: protocol::TODO
Description: Need an extra tag
The package can be categorised along this facet, but the right tag for it is
missing.
.
Mark a package with this tag to signal the vocabulary maintainers of cases
where the current tag set is lacking.
Tag: protocol::atm
Description: ATM
Asynchronous Transfer Mode, a high speed protocol for communication between
computers in a network.
.
While ATM is used to implement *DSL networks, it has never gained widespread
use as a technology for building local area networks (LANs), for which it was
originally intended.
.
Link: http://en.wikipedia.org/wiki/Asynchronous_Transfer_Mode
Tag: protocol::bittorrent
Description: BitTorrent
BitTorrent is a protocol for peer-to-peer based file distribution over
network.
.
Although the actual data transport happens between BitTorrent clients, one
central node, the so-called trackers, is needed to keep a list of all clients
that download or provide the same file.
.
Link: http://www.bittorrent.com/
Link: http://en.wikipedia.org/wiki/BitTorrent
Tag sources.list
The debtags
tool uses a sources.list
file similar to the one used by APT,
to configure a number of tag data sources. The format is the same as APT,
except that only the sources starting with tags
are considered by debtags
.
All sources not starting with tags
are ignored, and this could allow sharing
a sources.list between APT and debtags. Unfortunately however, APT however
does not currently ignore the tags
sources, and refuses to work if it finds
one.
Example sources.list file
# Source locations to get tag database info
#
# You can specify more than one; during update, debtags will merge the contents
# of all the sources. The merged data can be found in
# /var/lib/debtags/package-tags and /var/lib/debtags/vocabulary.
# Reviewed tag data from the APT Packages file
tags apt://
# Regularly updated, but unchecked, database on the web
# This tries to access the files:
# http://debtags.alioth.debian.org/tags/tags-current.gz
# http://debtags.alioth.debian.org/tags/vocabulary.gz
#
# The vocabulary, if existing, is used to filter the tag database: all
# tags in tags-current.gz are discarded if they are not found in
# the vocabulary. After filtering, the vocabulary is merged with the
# other available vocabulary information.
#
# If the vocabulary is not found in the tag source, the tag data is
# filtered using the system vocabulary.
#
#tags http://debtags.alioth.debian.org/tags/
# Package rating from http://www.iterating.org
#tags http://www.iterating.org/tags
# Example of locally maintained custom tag data
#
# You can provide your own tag data. To do so, place a file called
# vocabulary.gz and a file called tags-current.gz in a directory, and use the
# directory as a tag source.
#
# In the example given below, you would place the tag data in
# /etc/debtags/mytags/tags-current.gz, and the vocabulary data in
# /etc/debtags/mytags/vocabulary.gz
#
# The format of tags-current.gz is described in the tagcoll(1) manpage and the
# file is similar to /var/lib/debtags/package-tags.
#
# The format of vocabulary.gz is like the Debian Packages file, and the file is
# similar to /var/lib/debtags/vocabulary.
#
#tags file:/etc/debtags/mytags
Binary index
debtags update
create an indexed version of the tag database in
/var/lib/debtags/package-tags.idx
, an index for the tag vocabulary in
/var/lib/debtags/vocabulary.idx
and an index for package names in
/var/lib/debtags/pkgidx.idx
.
All these indexes indexes are built to be quickly accessed using mmap
.
All these .idx
files contain a chain of indexes, each prefixed by an int
value stating its length. The structure of an idx
file is therefore:
[size of index 1][index1][size of index 2][index2]...
Code for accessing the indexes is found in the package libtagcoll-dev
, in
diskindex/mmap.h
.
The vocabulary index
The vocabulary index contains a facet index and a tag index.
To build the facet index, facet names are sorted alphabetically and then assigned a number starting from 0.
The facet index begins with one table with one int
value per facet,
containing the offset of the facet data that follows the table. Just after the
table begins the facet data, as a sequence of structures:
struct Item {
// Offset of the facet record in the text vocabulary file
int offset;
// Size of the facet record in the text vocabulary file
int size;
// Tag index ID of the first tag in this facet
int firsttag;
// Tag index ID of the last tag in this facet
int lasttag;
// Facet name (0-terminated)
const char name[];
};
The facet index allows O(1)
retrieval of a facet record by facet ID and
O(log(number of facets))
retrieval of a facet ID from a facet name, as the
facets are sorted alphabetically and this allows to use binary search on the
facet offset table.
The facet index also allows direct access to the list of tags for a facet, providing a range of tag index IDs.
The tag index is built similarly to the facet index: the tags are sorted alphabetically and then assigned a number starting from 0.
The tag index begins with one table with one int
value per tag, containing
the offset of the tag data that follows the table. Just after the table begins
the tag data, as a sequence of structures:
struct Item {
// Offset of the tag record in the text vocabulary file
int offset;
// Size of the tag record in the text vocabulary file
int size;
// Facet index ID of the facet that contains this tag
int facet;
// Tag name (0-terminated, includes the facet name)
const char name[];
};
The tag index allows O(1)
retrieval of a tag record by tag ID and
O(log(number of tags))
retrieval of a tag ID from a tag name, as the tags are
sorted alphabetically and this allows to use binary search on the tag offset
table.
The tag index also allows O(1)
access to the facet information for a tag,
using the facet index ID provided.
Tag index entries and facet index entries can be padded after the tag or facet name to guarantee alignment of structures to machine word boundaries.
The code to access the vocabulary index can be found in libept-dev
in
ept/cache/debtags/vocabulary.h
and ept/cache/debtags/vocabularymerger.h
:
// Build the facet index
void VocabularyMerger::FacetIndexer::encode(char* buf) const
{
int pos = vm.facets.size() * sizeof(int);
for (std::map<std::string, FacetData>::const_iterator f = vm.facets.begin(); f != vm.facets.end(); f++)
{
((int*)buf)[f->second.id] = pos;
// offset of record in vocabulary
*(int*)(buf+pos) = f->second.ofs;
pos += sizeof(int);
// size of record in vocabulary
*(int*)(buf+pos) = f->second.len;
pos += sizeof(int);
// id of first tag
*(int*)(buf+pos) = f->second.tags.begin()->second.id;
pos += sizeof(int);
// id of last tag
*(int*)(buf+pos) = f->second.tags.rbegin()->second.id;
pos += sizeof(int);
// name (0-terminated)
memcpy(buf + pos, f->first.c_str(), f->first.size() + 1);
pos += f->first.size() + 1;
// Align to int boundaries
if ((pos % sizeof(int)) != 0)
pos = (pos + sizeof(int)) / sizeof(int) * sizeof(int);
}
}
// Build the tag index
void VocabularyMerger::TagIndexer::encode(char* buf) const
{
int pos = vm.tagCount * sizeof(int);
for (std::map<std::string, FacetData>::const_iterator f = vm.facets.begin(); f != vm.facets.end(); f++)
{
for (std::map<std::string, TagData>::const_iterator t = f->second.tags.begin();
t != f->second.tags.end(); t++)
{
((int*)buf)[t->second.id] = pos;
// offset of record in vocabulary
*(int*)(buf+pos) = t->second.ofs;
pos += sizeof(int);
// size of record in vocabulary
*(int*)(buf+pos) = t->second.len;
pos += sizeof(int);
// id of facet
*(int*)(buf+pos) = f->second.id;
pos += sizeof(int);
// name (0-terminated)
string name = f->first + "::" + t->first;
memcpy(buf + pos, name.c_str(), name.size() + 1);
pos += name.size() + 1;
// Align to int boundaries
if ((pos % sizeof(int)) != 0)
pos = (pos + sizeof(int)) / sizeof(int) * sizeof(int);
}
}
}
The package name index
The package name index allows to quickly retrieve a package name using the package ID that the APT index assigns to it.
This is needed to support creating a tag index that uses only integer values, as the APT library does not seem to provide a quick way to go from package ID to package name.
The index contains an int table and a string table. The int table contains an int per package ID and points to the package name in the string table.
Since the APT IDs are contiguous, the int table can be indexed directly using
the package ID, and this allows very fast and O(1)
retrieval of the package
name.
The code to access the package name index can be found in libept-dev
in
ept/cache/debtags/pkgidx.h
, and the code to generate the index is found in
the Generator
class in ept/cache/debtags/update.tcc
:
// Get the package name from the package ID
const char* name(int id) const
{
if (id >= 0 || static_cast<unsigned>(id) < size())
return m_buf + ((int*)m_buf)[id];
return NULL;
}
// Build the package name index
void encode(char* buf) const
{
typedef typename ept::t::cache::Package<C> Package;
// First sort the packages by ID
std::vector<Package> pkgs;
pkgs.resize(m_pkgs.index().packageCount() + 1);
for (typename Index::iterator i = m_pkgs.index().begin(); i != m_pkgs.index().end(); ++i)
pkgs[i->ondiskId()] = *i;
// Then write them out
int pos = pkgs.size() * sizeof(int);
int idx = 0;
for (typename std::vector<Package>::const_iterator i = pkgs.begin(); i != pkgs.end(); ++i)
{
((int*)buf)[idx++] = pos;
memcpy(buf + pos, i->name( std::string( "invalid" ) ).c_str(),
i->name( std::string( "invalid" ) ).size() + 1);
pos += i->name( std::string( "invalid" ) ).size() + 1;
}
}
The tag index
Using the vocabulary tag index we can have fast bidirectional mapping of tag names to integers; using the APT library and the package name index we can have fast bidirectional mapping of package names to integers. This means that we can encode the association of packages and tags using only fixed-length integer values.
The tag index contains two symmetrical indexes with the same format: the first one allows retrieval of an array of tag IDs from a package ID, and the second one allows retrieval of an array of package IDs given a tag ID.
The format of one int mapping is as follows:
[offset of mapping for item 0, offset of mapping for item 1...]
[size of array][sorted array of ints pointed by index 0]
[size of array][sorted array of ints pointed by index 1]
[size of array][sorted array of ints pointed by index 2]
[...]
[number of items in the mapping]
To get the list of tag IDs associated to a package ID, we use the package ID as the index of the offset table, and we obtain the offset of the tag IDs.
Using the tag IDs offset, we can read the size of the tag ID array and the array just afterwards.
This allows to perform various fast operations:
- Lookup of a tag list from a package ID in
O(1)
- Fast computation of merged or intersected tag lists, by exploiting the fact that the arrays of tag IDs are sorted.
Since two symmetrical mappings are stored, this is all valid also from all operations that require getting the list of packages associated to a tag, or the merged or intersected lists of packages associated to a number of tags.
The code to access an int mapping is found in libtagcoll-dev
in
tagcoll/diskindex/int.h
:
// Get the array of tag IDs associated to a package ID
// (or the array of package IDs associated to a tag ID, depending on
// what index we are accessing)
const int* data(int val) const
{
return (val >= 0 && (unsigned)val < size()) ? buf() + ofs(val) + 1 : 0;
}
// Get the size of an ID array
size_t size(int val) const
{
return (val >= 0 && (unsigned)val < size()) ? buf()[ofs(val)] : 0;
}
// Get the number of mappings in the index
size_t size() const { return ofs(0); }
/**
* Creates an on-disk index to use for IntIndex
*/
class IntIndexer : public MMapIndexer, public std::vector<std::set<int> >
{
public:
/// Store the key->val mapping into the indexer
void map(unsigned int key, int val)
{
if (size() <= key)
resize(key + 1);
(*this)[key].insert(val);
}
/// Return the size of the encoded index data
virtual int encodedSize() const;
/// Write the index data in the given buffer, which should be at least
/// encodedSize bytes
virtual void encode(char* buf) const
{
int* buf = (int*)cbuf;
int pos = size();
for (size_t i = 0; i < size(); i++)
{
buf[i] = pos;
buf[pos++] = (*this)[i].size();
for (set<int>::const_iterator j = (*this)[i].begin(); j != (*this)[i].end(); j++)
buf[pos++] = *j;
}
}
};
The code to implement fast tag operations using the two symmetrical indexes is
found in libtagcoll-dev
in tagcoll/coll/intdiskindex.h
and in libept-dev
in ept/cache/debtags/tagmap.h
.