Thoughts on HDT

November 29th, 2018 8:58 AM

I’ve recently been implementing an HDT parser in Swift and had some thoughts on the process and on the HDT format more generally. Briefly, I think having a standardized binary format for RDF triples (and quads) is important and HDT satisfies this need. However, I found the HDT documentation and tooling to be lacking in many ways, and think there’s lots of room for improvement.

Benefits

HDT’s single binary file format has benefits for network and disk IO when loading and transferring graphs. That’s its main selling point, and it does a reasonably good job at that. HDT’s use of a RDF term dictionary with pre-assigned numeric IDs means importing into some native triple stores can be optimized. And being able to store RDF metadata about the RDF graph inside the HDT file is a nice feature, though one that requires publishers to make use of it.

Problems and Challenges

I ran into a number of outright problems when trying to implement HDT from scratch:

  • The HDT documentation is incomplete/incorrect in places, and required reverse engineering the existing implementations to determine critical format details; questions remain about specifics (e.g. canonical dictionary escaping):

    Here are some of the issues I found during implementation:

    • �DictionarySection says the “section starts with an unsigned 32bit value preamble denoting the type of dictionary implementation,” but the implementation actually uses an unsigned 8 bit value for this purpose

    • �FourSectionDictionary conflicts with the previous section on the format URI (http://purl.org/HDT/hdt#dictionaryPlain vs. http://purl.org/HDT/hdt#dictionaryFour)

    • The paper cited for “VByte” encoding claims that value data is stored in “the seven most significant bits in each byte”, but the HDT implementation uses the seven least significant bits

    • “Log64” referenced in �BitmapTriples does not seem to be defined anywhere

    • There doesn’t seem to be documentation on exactly how RDF term data (“strings”) is encoded in the dictionary. Example datasets are enough to intuit the format, but it’s not clear why \u and \U escapes are supported, as this adds complexity and inefficiency. Moreover, without a canonical format (including when/if escapes must be used), it is impossible to efficiently implement dictionary lookup

  • The W3C submission seems to differ dramatically from the current format. I understood this to mean that the W3C document was very much outdated compared to the documentation at rdfhdt.org, and the available implementations seem to agree with this understanding

  • There doesn’t seem to be any shared test suite between implementations, and existing tooling makes producing HDT files with non-default configurations difficult/impossible

  • The secondary index format seems to be entirely undocumented

In addition, there are issues that make the format unnecessarily complex, inefficient, non-portable, etc.:

  • The default dictionary encoding format (plain front coding) is inefficient for datatyped literals and unnecessarily allows escaped content, resulting in inefficient parsing

  • Distinct value space for predicate and subject/object dictionary IDs is at odds with many triple stores, and makes interoperability difficult (e.g. dictionary lookup is not just dict[id] -> term, but dict[id, pos] -> term; a single term might have two IDs if it is used as both predicate and subject/object)

  • The use of 3 different checksum algorithms seems unnecessarily complex with unclear benefit

  • A long-standing GitHub issue seems to indicate that there may be licensing issues with the C++ implementation, precluding it from being distributed in Debian systems (and more generally, there seems to be a general lack of responsiveness to GitHub issues, many of which have been open for more than a year without response)

  • The example HDT datasets on rdfhdt.org are of varying quality; e.g. the SWDF dataset was clearly compiled from multiple source documents, but did not ensure unique blank nodes before merging

Open Questions

Instead of an (undocumented) secondary index file, why does HDT not allow multiple triples sections, allowing multiple triple orderings? A secondary index file might still be useful in some cases, but there would be obvious benefits to being able to store and access multiple triple orderings without the extra implementation burden of an entirely separate file format.

Future Directions

In his recent DeSemWeb talk, Axel Polleres suggested that widespread HDT adoption could help to address several challenges faced when publishing and querying linked data. I tend to agree, but think that if we as a community want to choose HDT, we need to put some serious work into improving the documentation, tooling, and portable implementations.

Beyond improvements to existing HDT resources, I think it’s also important to think about use cases that aren’t fully addressed by HDT yet. The HDTQ extension to to support quads is a good example here; allowing a single HDT file to capture multiple named graphs would support many more use cases, especially those relating to graph stores. I’d also like to see a format that supported both triples and quads, allowing the encoding of things like SPARQL RDF Datasets (with a default graph) and TRiG files.

Property Path use in Wikidata Queries

September 28th, 2018 9:06 AM

I recently began taking a look at the Wikidata query logs that were published a couple of months ago and wanted to look into how some features of SPARQL were being used on Wikidata. The first thing I’ve looked at is the use of property paths: how often paths are used, what path operators are used, and with what frequency.

Using the “interval 3” logs (2017-08-07–2017-09-03 representing ~78M successful queries1), I found that ~25% of queries used property paths. The vast majority of these use just a single property path, but there are queries that use as many as 19 property paths:

Pct. Count Number of Paths
74.3048% 58161337 0 paths used in query
24.7023% 19335490 1 paths used in query
0.6729% 526673 2 paths used in query
0.2787% 218186 4 paths used in query
0.0255% 19965 3 paths used in query
0.0056% 4387 7 paths used in query
0.0037% 2865 8 paths used in query
0.0030% 2327 9 paths used in query
0.0011% 865 6 paths used in query
0.0008% 604 11 paths used in query
0.0006% 434 5 paths used in query
0.0005% 398 10 paths used in query
0.0002% 156 12 paths used in query
0.0001% 110 15 paths used in query
0.0001% 101 19 paths used in query
0.0001% 56 13 paths used in query
0.0000% 12 14 paths used in query

I normalized IRIs and variable names used in the paths so that I could look at just the path operators and the structure of the paths. The type of path operators used skews heavily towards * (ZeroOrMore) as well as sequence and inverse paths that can be rewritten as simple BGPs. Here are the structures representing at least 0.1% of the paths in the dataset:

Pct. Count Path Structure
49.3632% 10573772 ?s <iri1> * ?o .
39.8349% 8532772 ?s <iri1> / <iri2> ?o .
4.6857% 1003694 ?s <iri1> / ( <iri2> * ) ?o .
1.8983% 406616 ?s ( <iri1> + ) / ( <iri2> * ) ?o .
1.4626% 313290 ?s ( <iri1> * ) / <iri2> ?o .
1.1970% 256401 ?s ( ^ <iri1> ) / ( <iri2> * ) ?o .
0.7339% 157212 ?s <iri1> + ?o .
0.1919% 41110 ?s ( <iri1> / ( <iri2> * ) ) / ( ^ <iri3> ) ?o .
0.1658% 35525 ?s <iri1> / <iri2> / <iri3> ?o .
0.1496% 32035 ?s <iri1> / ( <iri1> * ) ?o .
0.1124% 11889 ?s ( <iri1> / <iri2> ) / ( <iri3> * ) ?o .

There are also some rare but interesting uses of property paths in these logs:

Pct. Count Path Structure
0.0499% 5274 ?s ( ( <iri1> / ( <iri2> * ) ) / ( <iri3> / ( <iri2> * ) ) ) / ( <iri4> / ( <iri2> * ) ) ?o .
0.0015% 157 ?s ( <iri1> / <iri2> / <iri3> / <iri4> / <iri5> / <iri6> / <iri7> / <iri8> / <iri9> ) * ?o .
0.0003% 28 ?s ( ( ( ( <iri1> / <iri2> / <iri3> ) ? ) / ( <iri4> ? ) ) / ( <iri5> * ) ) / ( <iri6> / ( <iri7> ? ) ) ?o .

Without further investigation it’s hard to say if these represent meaningful queries or are just someone playing with SPARQL and/or Wikidata, but I found them curious.

  1. These numbers don’t align exactly with the Wikidata query dumps as there were some that I couldn’t parse with my tools. ↩︎

SPARQL Blank Nodes

August 5th, 2017 11:40 AM

I recently ran across what I believe to be a mistake in the SPARQL 1.1 Query Language, and thought I’d add some detail here.

Section 5.1.1 of the SPARQL 1.1 specification says:

When using blank nodes of the form _:abc, labels for blank nodes are scoped to the basic graph pattern. A label can be used in only a single basic graph pattern in any query.

However, during the parsing of a SPARQL property paths into the SPARQL algebra, many property path expressions do not result in basic graph patterns. Only those expressions that result in “adjacent triple patterns” produce a basic graph pattern. That means that a graph pattern such as:

_:s <p> ?x ;
    <q>* ?y .

does not result in a BGP. Intuitively, I think this should be allowed. It’s intention seems clear. However, it results in two primary algebraic components: a basic graph pattern with the triple pattern :s <p> ?x, and a property path :s ZeroOrMorePath(<q>) ?y. This certainly breaks the rule about only using blank nodes in a single basic graph pattern.

The language in section 5.1.1 originated in SPARQL 1.0, and I believe was just overlooked during the update to the language that added property paths.

When handling blank node labels, instead of following the exact language of the specification, I believe SPARQL implementations should instead allow blank node labels that appear in any adjacent set of basic graph patterns and property paths.

Andy Seaborne helpfully added this issue to the SPARQL 1.1 Errata.

Feature for SPARQL 1.2

July 6th, 2017 6:15 PM

Jindřich Mynarz recently posted a good list of “What I would like to see in SPARQL 1.2” and I thought I’d add a few comments as well as some of my own wished-for features.

Explicit ordering in GROUP_CONCAT, and quads support for both the HTTP Graph Store Protocol and CONSTRUCT queries (items 2, 5, and 8 in Jindřich’s list) seem like obvious improvements to SPARQL with a clear path forward for semantics and implementation.

Here are the some of the other wished-for features:

  • Explicitly specify the REDUCED modifier (#1)

    As an implementor, I quite like the fact that REDUCED is “underspecified.” It allows optimization opportunities that are much cheaper than a full DISTINCT would be, while still reducing result cardinality. I think it’s unfortunate that REDUCED hasn’t seen much use over the years, but I’m not sure what a better-specified REDUCED operator would do different from DISTINCT.

  • Property path quantifiers (#3)

    The challenge of supporting path quantifiers like elt{n,m} is figuring out what the result cardinality should be. The syntax for this was standardized during the development of SPARQL 1.1, but we couldn’t find consensus on whether elt{n,m} should act like a translation to an equivalent BGP/UNION pattern or like the arbitrary length paths (which do not introduce duplicate results). For small values of n and m, the translation approach seems natural, but as they grow, it’s not obvious that use cases would only want the translation semantics and not the non-duplicating semantics.

    Perhaps a new syntax could be developed which would allow the query author to indicate the desired cardinality semantics.

  • Date time/duration arithmetic functions (#6)

    This seems like a good idea, and very useful to some users, though it would substantially increase the size and number of the built-in functions and operators.

  • Support for non-scalar-producing aggregates (#9)

    I’m interested to see how this plays out as a SPARQL extension in systems like Stardog. It likely has a lot of interesting uses, but I worry that it would greatly complicate the query and data models, leading to calls to extend the semantics of RDF, and add new query forms, operators, and functions.

  • Structured serialization format for SPARQL queries (#10)

    I’m indifferent to this. I suspect some people would benefit from such a format, but I don’t think I’ve ever had need for one (where I couldn’t just parse a query myself and use the resulting AST) and it would be another format to support for implementors.

Beyond that, here are some other things I’d like to see worked on (either standardization, or cross-implementation support):

  • Support for window functions

  • Explicit support for named graphs in SERVICE blocks

    This can be partially accomplished right now for hard-coded graphs by using an endpoint url with the default-graph-uri query parameter, but I’d like more general support that could work dynamically with the active graph when the SERVICE block is evaluated.

  • Structured errors for use in the SPARQL Protocol

    My preference for this would be using the RFC7807 “Problem Details” JSON format, with a curated list of IRIs and associated metadata representing common error types (syntax errors, query-to-complex or too-many-requests refusals, etc.). There’s a lot of potential for smarter clients if errors contain structured data (e.g. SPARQL editors can highlight/fix syntax issues; clients could choose alternate data sources such as triple pattern fragments when the endpoint is overwhelmed).

SPARQL Limit by Resource

March 20th, 2017 7:17 PM

As part of work on the Attean Semantic Web toolkit, I found some time to work through limit-by-resource, an oft-requested SPARQL feature and one that my friend Kjetil lobbied for during the SPARQL 1.1 design phase. As I recall, the biggest obstacle to pursuing limit-by-resource in SPARQL 1.1 was that nobody had a clear idea of how to fit it nicely into the existing SPARQL syntax and semantics. With hindsight, and some time spent working on a prototype, I now suspect that this was because we first needed to nail down the design of aggregates and let aggregation become a first-class feature of the language.

Now, with a standardized syntax and semantics for aggregation in SPARQL, limit-by-resource seems like just a small enhancement to the existing language and implementations by the addition of window functions. I implemented a RANK operator in Attean, used in conjunction with the already-existing GROUP BY. RANK works on groups just like aggregates, but instead of producing a single row for each group, the rows of the group are sorted, and given an integer rank which is bound to a new variable. The groups are then “un-grouped,” yielding a single result set. Limit-by-resource, then, is a specific use-case for ranking, where groups are established by the resource in question, ranking is either arbitrary or user-defined, and a filter is added to only keep rows with a rank less than a given threshold.

I think the algebraic form of these operations should be relatively intuitive and straightforward. New Window and Ungroup algebra expressions are introduced akin to Aggregation and AggregateJoin, respectively. Window(G, var, WindowFunc, args, order comparators) operates over a set of grouped results (either the output of Group or another Window), and Ungroup(G) flattens out a set of grouped results into a multiset.

If we wanted to use limit-by-resource to select the two eldest students per school, we might end up with something like this:

Project(
    Filter(
        ?rank <= 2,
        Ungroup(
            Window(
                Group((?school), BGP(?p :name ?name . ?p :school ?school . ?p :age ?age .)),
                ?rank,
                Rank,
                (),
                (DESC(?age)),
            )
        )
    ),
    {?age, ?name, ?school}
)

Students with their ages and schools are matched with a BGP. Grouping is applied based on the school. Rank with ordering by age is applied so that, for example, the result for the eldest student in each school is given ?rank=1, the second eldest ?rank=2, and so on. Finally, we apply a filter so that we keep only results where ?rank is 1 or 2.

The syntax I prototyped in Attean allows a single window function application applied after a GROUP BY clause:

PREFIX : <http://example.org/>
SELECT ?age ?name ?school WHERE {
    ?p :name ?name ;
       :school ?school ;
       :age ?age .
}
GROUP BY ?school
RANK(DESC(?age)) AS ?rank
HAVING (?rank <= 2)

However, a more complete solution might align more closely with existing SQL window function syntaxes, allowing multiple functions to be used at the same time (appearing syntactically in the same place as aggregation functions).

PREFIX : <http://example.org/>
SELECT ?age ?name ?school WHERE {
    ?p :name ?name ;
       :school ?school ;
       :age ?age .
}
GROUP BY ?school
HAVING (RANK(ORDER BY DESC(?age)) <= 2)

or:

PREFIX : <http://example.org/>
SELECT ?age ?name ?school WHERE {
    {
        SELECT ?age ?name ?school (RANK(GROUP BY ?school ORDER BY DESC(?age)) AS ?rank) WHERE {
            ?p :name ?name ;
               :school ?school ;
               :age ?age .
        }
    }
    FILTER(?rank <= 2)
}

Swifty Serd

February 17th, 2017 10:04 AM

I recently found myself wanting a simple and efficient way to parse RDF files in Swift, and ended up writing a small library that uses David Robillard’s excellent Serd library for RDF syntax. The resulting projects (available on GitHub) are swift-serd, a low-level wrapper around the C API, and SerdParser, a more Swift-y higher-level API that allows parsing of RDF content and handling the resulting triples.

The resulting code benefits from serd’s performance and standards-compliance, and allows very simple parsing of N-Triples and Turtle files:

import SerdParser

// extract all foaf:name triples
let parser = SerdParser()
let count = try parser.parse(file: filename) { (s, p, o) in
    if case .iri("http://xmlns.com/foaf/0.1/name") = p {
        print("\(s) has name \(o) .")
    }
}
print("\(count) triples processed")

Worried

November 8th, 2016 11:47 PM

I’m devastated and heartbroken. I’m worried about the people in my life who would not have health insurance if not for Obamacare. I’m worried for Muslims, immigrants, the disabled, women, and anyone else who will be on the receiving end of the harassment and intolerance that has been normalized by our President-elect. I’m worried that we will dangerously backtrack on progress made in fighting climate change, and waste time obstinately defending fossil fuels and opposing green alternatives. I’m worried about living in a world where truth and facts don’t seem to carry any weight, where science is viewed with skepticism, and experts are dismissed out of hand.

I’m worried, but resolved. Turns out, there’s a lot of work to do.

Election Day

November 8th, 2016 6:46 AM

President Obama:

We are not as divided as our politics suggests

I’m drawing inspiration from this. Regardless of what happens today, there will be a lot of work to do. I hope we can come together, with the understanding that we all want to make this country better. That will require listening to others, and trying to understand their concerns. And it will require being open to facts and evidence, even when they contradict beliefs. It’s election day. Let’s do this.

2015

December 31st, 2015 7:56 PM

At the airport for the last flight of 2015.

2015 Travel Map

Happy New Year!

ISWC 2014

October 28th, 2014 6:44 PM

Last week I attended ISWC 2014 in Riva del Garda and had a very nice time.

Riva del Garda

I presented a paper at the Developers Workshop about designing the new PerlRDF and SPARQLKit systems using trait-based programming. Covering more PerlRDF work, Kjetil presented his work on RDF::LinkedData.

Answering SPARQL Queries over Databases under OWL 2 QL Entailment Regime by Kontchakov, et al. caught my attention during the main conference. The authors show a technique to map query answering under OWL QL entailment to pure SPARQL 1.1 queries under simple entailment, allowing OWL QL reasoning to be used with SPARQL 1.1 systems that do not otherwise support OWL. An important challenge highlighted by the work (and echoed by the audience after the presentation) was the limited performance of existing property path implementations. It seems as if there might be a subset of SPARQL 1.1 property paths (e.g. with limited nesting of path operators) that could be implemented more efficiently while providing enough expressivity to allow things like OWL reasoning.

There seemed to be a lot of interest in more flexible solutions to SPARQL query answering. Ruben’s Linked Data Fragments work addresses one part of this issue (though I remain somewhat skeptical about it as a general-purpose solution). That work was received very well at the conference, winning the Best Demo award. In conversations I had, there was also a lot of interest in the development of SPARQL systems that could handle capacity and scalability issues more gracefully (being smart enough to use appropriate HTTP error codes when a query cannot be answered due to system load or the complexity of the query, allowing a fallback to linked data fragments, for example).