Data formats

This is a reference for the various file and data formats one may encounter when working with debtags. A list of data sources is found in the chapter Available data sources.

Tag name

The full tag name is composed of a facet name followed by double colon, followed by the tag name.

The facet name can technically contain anything except for a double colon or a comma followed by a space; in practice however, only alphanumeric characters and dashes are currently used.

The tag name can technically contain anything except for a comma followed by a space; in practice, however, only alphanumeric characters, pluses, dashes and colons are used.

Colons in the tag name are sometimes used to organise the tags of the same facet in a hierarchical way.

Facet and tag names are case sensitive.

Examples of tag names

devel::bugtracker
devel::code-generator
devel::i18n
devel::lang:c
devel::lang:c++
implemented-in::c
implemented-in::c++
interface::3d
interface::commandline
interface::text-mode
works-with-format::xml
works-with-format::xml:rss
works-with-format::xml:xslt

Tags in the Packages files

The tags of a package in the Debian Packages file are stored in the Tag: header.

The format is a simple list of tags, sorted in alphabetical order, and separated by comma and one or more spaces. The separator is matched by the regular expression /,\s+/.

In order to cope with maximum line length limitations in the Packages file format, multiple tags belonging to the same facet are grouped using curly braces.

Examples of tag fields in the Packages file

# Empty field
Tag:
# Field with only one tag
Tag: web::server
# Field with many tags
Tag: made-of::data:font, role::app-data, use::printing, works-with::text, works-with-format::postscript
# Field with grouped tags
Tag: made-of::{data:font,data:html,data:pdf,data:tex}, role::app-data, role::documentation, suite::TODO, use::typesetting, works-with::{font,image,image:vector,text}, works-with-format::{pdf,postscript,tex}, x11::font

Tag database

Many debtags and debtags-related tools can store an entire tag database in a single file.

Example uses are:

The available tag information stored by debtags in /var/lib/debtags/package-tags
The tag database downloadable from http://debtags.alioth.debian.org/tags/tags-current.gz
The reviewed tag database in svn://svn.debian.org/debtags/tagdb/tags
The input and output of the tagcoll utility

Every line in the file is able to associate one or more packages to zero or more tags.

A line is composed by a list of package names and a list of tag names, separated by a colon and one or more spaces. This separator can be matched by the regular expression /:\s+/.

The package names and tag names are instead separated by a comma and one or more spaces, as represented by the regular expression /,\s+/.

A line of the tag database can therefore be parsed using the following perl code:

my ($packages, $tags) = split(/:\s+/, $line);
my @packages = split(/,\s+/, $packages);
my @tags = split(/,\s+/, $tags);

A package name can appear in multiple lines of the tag database, but not on the same line: this allows to easily merge multiple tag database by just concatenating the files. For convenience, however, most debtags tools create databases where package names appear only once.

No curly-brace grouping is allowed in a tag database.

The tagcoll tool can be used to work with information in this format. These are some example operations:

# Merge tag informations so that package names only appears once in the
# file
tagcoll -g copy my-tagdb

# Convert a packages -> tags mapping to a tags -> packages mapping
tagcoll reverse my-tagdb

# Compute the differences between two tag databases
tagcoll diff my-tagdb1 my-tagdb2 > tag-patch

# Apply a tag patch
tagcoll -p tag-patch copy my-tagdb1 > same-as-my-tagdb2

# Filter a tag database, keeping only the lines matching a given tag
# expression
tagcoll grep 'role::program && interface::x11' my-tagdb

Empty lists of tags can be used to record that a package is known to have no tags, but they are usually discarded.

Example lines of a tag database

xmame-x: hardware::emulation
abakus: field::mathematics, interface::x11, role::program, scope::utility, suite::kde, uitoolkit::qt, x11::application
gnomermind, gnotepad+, gnotime: interface::x11, role::program, scope::application, suite::gnome, uitoolkit::gtk, x11::application

Vocabulary

The tag vocabulary, that can be found in /var/lib/debtags/vocabulary, in http://debtags.alioth.debian.org/tags/vocabulary.gz and in svn://svn.debian.org/debtags/vocabulary/trunk/debian-packages, has the same RFC822-style format as the Debian Packages file.

The debtags vocabulary contains two kinds of records: a facet description and a tag description. A facet description starts with the "Facet:" header, while the tag description starts with the "Tag:" header.

Facet and tag records usually contain the "Description:" field that has the exact same behaviour as the "Description:" field in the debian/control file, including the distinction between short and long description, the dot to represent empty lines and the other formatting rules.

There can be other fields besides "Tag:", "Facet:" and "Description:", but they are currently ignored.

Example record of a tag vocabulary

Facet: protocol
Description: Network Protocol
Status: draft

Tag: protocol::TODO
Description: Need an extra tag
 The package can be categorised along this facet, but the right tag for it is
 missing.
 .
 Mark a package with this tag to signal the vocabulary maintainers of cases
 where the current tag set is lacking.

Tag: protocol::atm
Description: ATM
 Asynchronous Transfer Mode, a high speed protocol for communication between
 computers in a network.
 .
 While ATM is used to implement *DSL networks, it has never gained widespread
 use as a technology for building local area networks (LANs), for which it was
 originally intended.
 .
 Link: http://en.wikipedia.org/wiki/Asynchronous_Transfer_Mode

Tag: protocol::bittorrent
Description: BitTorrent
 BitTorrent is a protocol for peer-to-peer based file distribution over
 network.
 .
 Although the actual data transport happens between BitTorrent clients, one
 central node, the so-called trackers, is needed to keep a list of all clients
 that download or provide the same file.
 .
 Link: http://www.bittorrent.com/
 Link: http://en.wikipedia.org/wiki/BitTorrent

Tag sources.list

The debtags tool uses a sources.list file similar to the one used by APT, to configure a number of tag data sources. The format is the same as APT, except that only the sources starting with tags are considered by debtags.

All sources not starting with tags are ignored, and this could allow sharing a sources.list between APT and debtags. Unfortunately however, APT however does not currently ignore the tags sources, and refuses to work if it finds one.

Example sources.list file

# Source locations to get tag database info
#
# You can specify more than one; during update, debtags will merge the contents
# of all the sources.  The merged data can be found in
# /var/lib/debtags/package-tags and /var/lib/debtags/vocabulary.

# Reviewed tag data from the APT Packages file
tags apt://

# Regularly updated, but unchecked, database on the web
# This tries to access the files:
#   http://debtags.alioth.debian.org/tags/tags-current.gz
#   http://debtags.alioth.debian.org/tags/vocabulary.gz
#
# The vocabulary, if existing, is used to filter the tag database: all
# tags in tags-current.gz are discarded if they are not found in
# the vocabulary.  After filtering, the vocabulary is merged with the
# other available vocabulary information.
#
# If the vocabulary is not found in the tag source, the tag data is
# filtered using the system vocabulary.
#
#tags http://debtags.alioth.debian.org/tags/

# Package rating from http://www.iterating.org
#tags http://www.iterating.org/tags

# Example of locally maintained custom tag data
#
# You can provide your own tag data.  To do so, place a file called
# vocabulary.gz and a file called tags-current.gz in a directory, and use the
# directory as a tag source.
#
# In the example given below, you would place the tag data in
# /etc/debtags/mytags/tags-current.gz, and the vocabulary data in
# /etc/debtags/mytags/vocabulary.gz
#
# The format of tags-current.gz is described in the tagcoll(1) manpage and the
# file is similar to /var/lib/debtags/package-tags.
#
# The format of vocabulary.gz is like the Debian Packages file, and the file is
# similar to /var/lib/debtags/vocabulary.
#
#tags file:/etc/debtags/mytags

Binary index

debtags update create an indexed version of the tag database in /var/lib/debtags/package-tags.idx, an index for the tag vocabulary in /var/lib/debtags/vocabulary.idx and an index for package names in /var/lib/debtags/pkgidx.idx.

All these indexes indexes are built to be quickly accessed using mmap.

All these .idx files contain a chain of indexes, each prefixed by an int value stating its length. The structure of an idx file is therefore:

[size of index 1][index1][size of index 2][index2]...

Code for accessing the indexes is found in the package libtagcoll-dev, in diskindex/mmap.h.

The vocabulary index

The vocabulary index contains a facet index and a tag index.

To build the facet index, facet names are sorted alphabetically and then assigned a number starting from 0.

The facet index begins with one table with one int value per facet, containing the offset of the facet data that follows the table. Just after the table begins the facet data, as a sequence of structures:

struct Item {
    // Offset of the facet record in the text vocabulary file
    int offset;
    // Size of the facet record in the text vocabulary file
    int size;
    // Tag index ID of the first tag in this facet
    int firsttag;
    // Tag index ID of the last tag in this facet
    int lasttag;
    // Facet name (0-terminated)
    const char name[];
};

The facet index allows O(1) retrieval of a facet record by facet ID and O(log(number of facets)) retrieval of a facet ID from a facet name, as the facets are sorted alphabetically and this allows to use binary search on the facet offset table.

The facet index also allows direct access to the list of tags for a facet, providing a range of tag index IDs.

The tag index is built similarly to the facet index: the tags are sorted alphabetically and then assigned a number starting from 0.

The tag index begins with one table with one int value per tag, containing the offset of the tag data that follows the table. Just after the table begins the tag data, as a sequence of structures:

struct Item {
    // Offset of the tag record in the text vocabulary file
    int offset;
    // Size of the tag record in the text vocabulary file
    int size;
    // Facet index ID of the facet that contains this tag
    int facet;
    // Tag name (0-terminated, includes the facet name)
    const char name[];
};

The tag index allows O(1) retrieval of a tag record by tag ID and O(log(number of tags)) retrieval of a tag ID from a tag name, as the tags are sorted alphabetically and this allows to use binary search on the tag offset table.

The tag index also allows O(1) access to the facet information for a tag, using the facet index ID provided.

Tag index entries and facet index entries can be padded after the tag or facet name to guarantee alignment of structures to machine word boundaries.

The code to access the vocabulary index can be found in libept-dev in ept/cache/debtags/vocabulary.h and ept/cache/debtags/vocabularymerger.h:

// Build the facet index
void VocabularyMerger::FacetIndexer::encode(char* buf) const
{
    int pos = vm.facets.size() * sizeof(int);

    for (std::map<std::string, FacetData>::const_iterator f = vm.facets.begin(); f != vm.facets.end(); f++)
    {
        ((int*)buf)[f->second.id] = pos;

        // offset of record in vocabulary
        *(int*)(buf+pos) = f->second.ofs;
        pos += sizeof(int);

        // size of record in vocabulary
        *(int*)(buf+pos) = f->second.len;
        pos += sizeof(int);

        // id of first tag
        *(int*)(buf+pos) = f->second.tags.begin()->second.id;
        pos += sizeof(int);

        // id of last tag
        *(int*)(buf+pos) = f->second.tags.rbegin()->second.id;
        pos += sizeof(int);

        // name (0-terminated)
        memcpy(buf + pos, f->first.c_str(), f->first.size() + 1);
        pos += f->first.size() + 1;

        // Align to int boundaries
        if ((pos % sizeof(int)) != 0)
            pos = (pos + sizeof(int)) / sizeof(int) * sizeof(int);
    }
}

// Build the tag index
void VocabularyMerger::TagIndexer::encode(char* buf) const
{
    int pos = vm.tagCount * sizeof(int);

    for (std::map<std::string, FacetData>::const_iterator f = vm.facets.begin(); f != vm.facets.end(); f++)
    {
        for (std::map<std::string, TagData>::const_iterator t = f->second.tags.begin();
                t != f->second.tags.end(); t++)
        {
            ((int*)buf)[t->second.id] = pos;

            // offset of record in vocabulary
            *(int*)(buf+pos) = t->second.ofs;
            pos += sizeof(int);

            // size of record in vocabulary
            *(int*)(buf+pos) = t->second.len;
            pos += sizeof(int);

            // id of facet
            *(int*)(buf+pos) = f->second.id;
            pos += sizeof(int);

            // name (0-terminated)
            string name = f->first + "::" + t->first;
            memcpy(buf + pos, name.c_str(), name.size() + 1);
            pos += name.size() + 1;

            // Align to int boundaries
            if ((pos % sizeof(int)) != 0)
                pos = (pos + sizeof(int)) / sizeof(int) * sizeof(int);
        }
    }
}

The package name index

The package name index allows to quickly retrieve a package name using the package ID that the APT index assigns to it.

This is needed to support creating a tag index that uses only integer values, as the APT library does not seem to provide a quick way to go from package ID to package name.

The index contains an int table and a string table. The int table contains an int per package ID and points to the package name in the string table.

Since the APT IDs are contiguous, the int table can be indexed directly using the package ID, and this allows very fast and O(1) retrieval of the package name.

The code to access the package name index can be found in libept-dev in ept/cache/debtags/pkgidx.h, and the code to generate the index is found in the Generator class in ept/cache/debtags/update.tcc:

// Get the package name from the package ID
const char* name(int id) const
{
    if (id >= 0 || static_cast<unsigned>(id) < size())
        return m_buf + ((int*)m_buf)[id];
    return NULL;
}

// Build the package name index
void encode(char* buf) const
{
    typedef typename ept::t::cache::Package<C> Package;

    // First sort the packages by ID
    std::vector<Package> pkgs;
    pkgs.resize(m_pkgs.index().packageCount() + 1);
    for (typename Index::iterator i = m_pkgs.index().begin(); i != m_pkgs.index().end(); ++i)
        pkgs[i->ondiskId()] = *i;

    // Then write them out
    int pos = pkgs.size() * sizeof(int);
    int idx = 0;
    for (typename std::vector<Package>::const_iterator i = pkgs.begin(); i != pkgs.end(); ++i)
    {
        ((int*)buf)[idx++] = pos;
        memcpy(buf + pos, i->name( std::string( "invalid" ) ).c_str(),
           i->name( std::string( "invalid" ) ).size() + 1);
        pos += i->name( std::string( "invalid" ) ).size() + 1;
    }
}

The tag index

Using the vocabulary tag index we can have fast bidirectional mapping of tag names to integers; using the APT library and the package name index we can have fast bidirectional mapping of package names to integers. This means that we can encode the association of packages and tags using only fixed-length integer values.

The tag index contains two symmetrical indexes with the same format: the first one allows retrieval of an array of tag IDs from a package ID, and the second one allows retrieval of an array of package IDs given a tag ID.

The format of one int mapping is as follows:

[offset of mapping for item 0, offset of mapping for item 1...]
[size of array][sorted array of ints pointed by index 0]
[size of array][sorted array of ints pointed by index 1]
[size of array][sorted array of ints pointed by index 2]
[...]
[number of items in the mapping]

To get the list of tag IDs associated to a package ID, we use the package ID as the index of the offset table, and we obtain the offset of the tag IDs.

Using the tag IDs offset, we can read the size of the tag ID array and the array just afterwards.

This allows to perform various fast operations:

Lookup of a tag list from a package ID in O(1)
Fast computation of merged or intersected tag lists, by exploiting the fact that the arrays of tag IDs are sorted.

Since two symmetrical mappings are stored, this is all valid also from all operations that require getting the list of packages associated to a tag, or the merged or intersected lists of packages associated to a number of tags.

The code to access an int mapping is found in libtagcoll-dev in tagcoll/diskindex/int.h:

// Get the array of tag IDs associated to a package ID
// (or the array of package IDs associated to a tag ID, depending on
//  what index we are accessing)
const int* data(int val) const
{
    return (val >= 0 && (unsigned)val < size()) ? buf() + ofs(val) + 1 : 0;
}
// Get the size of an ID array
size_t size(int val) const
{
    return (val >= 0 && (unsigned)val < size()) ? buf()[ofs(val)] : 0;
}
// Get the number of mappings in the index
size_t size() const { return ofs(0); }

/**
 * Creates an on-disk index to use for IntIndex
 */
class IntIndexer : public MMapIndexer, public std::vector<std::set<int> >
{
public:
    /// Store the key->val mapping into the indexer
    void map(unsigned int key, int val)
    {
        if (size() <= key)
            resize(key + 1);
        (*this)[key].insert(val);
    }

    /// Return the size of the encoded index data
    virtual int encodedSize() const;

    /// Write the index data in the given buffer, which should be at least
    /// encodedSize bytes
    virtual void encode(char* buf) const
    {
        int* buf = (int*)cbuf;
        int pos = size();
        for (size_t i = 0; i < size(); i++)
        {
            buf[i] = pos;
            buf[pos++] = (*this)[i].size();
            for (set<int>::const_iterator j = (*this)[i].begin(); j != (*this)[i].end(); j++)
                buf[pos++] = *j;
        }
    }
};

The code to implement fast tag operations using the two symmetrical indexes is found in libtagcoll-dev in tagcoll/coll/intdiskindex.h and in libept-dev in ept/cache/debtags/tagmap.h.