Gregory Todd Williams: Thoughts on HDT

Thoughts on HDT

November 29^th, 2018 8:58 AM

I’ve recently been implementing an HDT parser in Swift and had some thoughts on the process and on the HDT format more generally. Briefly, I think having a standardized binary format for RDF triples (and quads) is important and HDT satisfies this need. However, I found the HDT documentation and tooling to be lacking in many ways, and think there’s lots of room for improvement.

Benefits

HDT’s single binary file format has benefits for network and disk IO when loading and transferring graphs. That’s its main selling point, and it does a reasonably good job at that. HDT’s use of a RDF term dictionary with pre-assigned numeric IDs means importing into some native triple stores can be optimized. And being able to store RDF metadata about the RDF graph inside the HDT file is a nice feature, though one that requires publishers to make use of it.

Problems and Challenges

I ran into a number of outright problems when trying to implement HDT from scratch:

The HDT documentation is incomplete/incorrect in places, and required reverse engineering the existing implementations to determine critical format details; questions remain about specifics (e.g. canonical dictionary escaping):

Here are some of the issues I found during implementation:
- DictionarySection says the “section starts with an unsigned 32bit value preamble denoting the type of dictionary implementation,” but the implementation actually uses an unsigned 8 bit value for this purpose
- FourSectionDictionary conflicts with the previous section on the format URI (http://purl.org/HDT/hdt#dictionaryPlain vs. http://purl.org/HDT/hdt#dictionaryFour)
- The paper cited for “VByte” encoding claims that value data is stored in “the seven most significant bits in each byte”, but the HDT implementation uses the seven least significant bits
- “Log64” referenced in BitmapTriples does not seem to be defined anywhere
- There doesn’t seem to be documentation on exactly how RDF term data (“strings”) is encoded in the dictionary. Example datasets are enough to intuit the format, but it’s not clear why \u and \U escapes are supported, as this adds complexity and inefficiency. Moreover, without a canonical format (including when/if escapes must be used), it is impossible to efficiently implement dictionary lookup
The W3C submission seems to differ dramatically from the current format. I understood this to mean that the W3C document was very much outdated compared to the documentation at rdfhdt.org, and the available implementations seem to agree with this understanding
There doesn’t seem to be any shared test suite between implementations, and existing tooling makes producing HDT files with non-default configurations difficult/impossible
The secondary index format seems to be entirely undocumented

In addition, there are issues that make the format unnecessarily complex, inefficient, non-portable, etc.:

The default dictionary encoding format (plain front coding) is inefficient for datatyped literals and unnecessarily allows escaped content, resulting in inefficient parsing
Distinct value space for predicate and subject/object dictionary IDs is at odds with many triple stores, and makes interoperability difficult (e.g. dictionary lookup is not just dict[id] -> term, but dict[id, pos] -> term; a single term might have two IDs if it is used as both predicate and subject/object)
The use of 3 different checksum algorithms seems unnecessarily complex with unclear benefit
A long-standing GitHub issue seems to indicate that there may be licensing issues with the C++ implementation, precluding it from being distributed in Debian systems (and more generally, there seems to be a general lack of responsiveness to GitHub issues, many of which have been open for more than a year without response)
The example HDT datasets on rdfhdt.org are of varying quality; e.g. the SWDF dataset was clearly compiled from multiple source documents, but did not ensure unique blank nodes before merging

Open Questions

Instead of an (undocumented) secondary index file, why does HDT not allow multiple triples sections, allowing multiple triple orderings? A secondary index file might still be useful in some cases, but there would be obvious benefits to being able to store and access multiple triple orderings without the extra implementation burden of an entirely separate file format.

Future Directions

In his recent DeSemWeb talk, Axel Polleres suggested that widespread HDT adoption could help to address several challenges faced when publishing and querying linked data. I tend to agree, but think that if we as a community want to choose HDT, we need to put some serious work into improving the documentation, tooling, and portable implementations.

Beyond improvements to existing HDT resources, I think it’s also important to think about use cases that aren’t fully addressed by HDT yet. The HDTQ extension to to support quads is a good example here; allowing a single HDT file to capture multiple named graphs would support many more use cases, especially those relating to graph stores. I’d also like to see a format that supported both triples and quads, allowing the encoding of things like SPARQL RDF Datasets (with a default graph) and TRiG files.

#Semantic Web

Next: SPARQL* for Wikidata
Previous: Property Path use in Wikidata Queries