Twenty Years

May 22nd, 2023 12:48 PM

I just got home from a weekend spent in Massachussetts at my twentieth college reunion.

Class of 2003

It was wonderful to be back at Wheaton, see old friends (some of whom actually live nearby these days), and make some new ones.

Chase at Dusk

Some things have changed. There are new buildings and new academic programs. But it was really great to see so much that hasn’t changed. The faculty, staff, and students; walks in the woods; campus on a spring afternoon.

Candles on Peacock Pond

The graduating seniors were friendly and welcoming, and spent a surprising amount of time with us. I hope they can look back on their time at Wheaton with as much fondness as I do.

Moving On…

December 23rd, 2022 2:31 PM

Clouds

A brief post to note that I’ve stopped checking or posting to Twitter. It was a huge part of my life for almost 16 years, but it seems clear that Twitter is headed in a direction that I no longer wish to support.

I’ve recently started following people from my long-dormant Mastodon account at @kasei@w3c.social, and can still be found elsewhere on the web and irc.

LRC on Trains

April 2nd, 2021 10:37 PM

I was listening to this afternoon’s Left Right and Center where the hosts discussed the proposed infrastructure plan, and specifically about Amtrak and investments in American rail infrastructure. Two arguments stuck out to me as way off base.

The first was Lanhee Chen’s claim about the relative environmental impact of rail transport. While seeming to be optimistic about investments into airports, he said:

I think rail is marginally maybe more environmentally friendly than flying by plane. If you look at certain measures of energy consumption it’s only marginally better.

Directing the argument to energy consumption feels like a weird direction to go when talking about environmental concerns. I would think the more direct comparison would be the environmental impact of the carbon emissions involved. On this, it’s impossible to take this argument seriously, as trains are simply better than flying. Even our current (terribly slow and inefficient) trains in the US are twice as efficient as flying economy. And more modern, efficient trains (which one would hope an infrastructural investment would aim for) are vastly more efficient. An article in the National Observer suggests that the “kilometers per tonne of climate pollution” for various methods of transport are: 5,000 for flying economy class, 10,000 for a North American train, and 81,000 for the Eurostar train. A 2019 BBC article came to much the same conclusion.

The second point was Josh Barro’s suggestion that the reason rail doesn’t make sense in America is due to our geography:

There shouldn’t be cross country passenger rail. The reason that we don’t have a national high speed rail network has to do with our geography. First of all people overestimate the extent to which these network exist in Europe. They do, but in France it’s basically all a radial network from Paris. There’s no line from Lyon to Bordeaux. The country is too large for it to make sense to run these large operations.

There’s some truth to this, but:

  • focusing on a single European country feels misleading when European rail networks are connected
  • focusing purely on high-speed rail ignores the many regional rail networks (even in France!) that complement the often “radial” high speed networks

And even putting high speed upgrades aside, data on Wikipedia suggests that the European Union has roughly the same sized rail network as the US (~200,000km), but with ~117,000 of that being electrified versus the US’s ~2,000. There are lots of opportunities for improvement here that don’t necessarily result in a “cross country â ¦ high speed rail network.”

And then there’s China. There may be other differences and issues with the Chinese rail network (certainly the political system that got it built), but in terms of “geography”, in under 15 years the Chinese have built more miles of high speed rail than the Amtrak network has in total. Maps of the Chinese HSR network look to me like they cover roughly a third of the country’s area. While that’s a lot smaller area than the area of the contiguous United States, it seems much larger than the combined area of megaregions of the US. Again, lots of opportunity for improvements.

So mostly it feels, as always, like this country lacks a viable and environmentally friendly high speed rail network due to a lack of will.

SPARQL* for Wikidata

August 12th, 2019 8:51 AM

I recently asked Olaf Hartig on twitter if he was aware of anyone using RDF* or SPARQL* for modeling qualified statements in Wikidata. These qualified statements are a feature of Wikidata that allow a statement such as “the speed limit in Germany is 100 km/h” to be qualified as applying only to “paved road outside of settlements.” Getting the Most out of Wikidata: Semantic Technology Usage in Wikipedia’s Knowledge Graph by Malyshev, et al. published last year at ISWC 2018 helps to visualize this data:

Visualization of Wikidata qualified statements

Although Olaf wasn’t aware of any work in this direction, I decided to look a bit into what the SPARQL* syntax might look like for Wikidata queries. Continuing with the speed limit example, we can query for German speed limits, and their qualifications:

SELECT ?speed ?qualifierLabel WHERE {
    wd:Q183
        wdt:P3086 ?speed ;
        p:P3086 [
            ps:P3086 ?speed ;
            pq:P3005 ?qualifier ;
        ] .
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}

This acts much like an RDF reification query. Using SPARQL* syntax to represent the same query, I ended up with:

SELECT ?speed ?qualifierLabel WHERE {
    << wd:Q183 wdt:P3086 ?speed >>
        pq:P3005 ?qualifier ;
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}

This strikes me as a more appealing syntax for querying qualification statements, without requiring the repetition and understanding of the connection between wdt:P3086 and p:P3086. However, that repetition of “P3086” would still be required to access the quantityUnit and normalized values via the psi: and psn: predicate namespaces. I’m not familiar enough with the history of Wikidata to know why RDF reification wasn’t used in the modeling, but I think this shows that there are opportunities for improving the UX of the query interface (and possibly the RDF data model, especially if RDF* sees more widespread adoption in the future).

With minimal changes to my swift SPARQL parser, I made a proof-of-concept translator from Wikidata queries using SPARQL* syntax to standard SPARQL. It’s available in the sparql-star-wikidata branch, and as a docker image (kasei/swift-sparql-syntax:sparql-star-wikidata):

$ docker pull kasei/swift-sparql-syntax:sparql-star-wikidata
$ docker run -t kasei/swift-sparql-syntax:sparql-star-wikidata sparql-parser wikidata -S 'SELECT ?speed ?qualifierLabel WHERE { << wd:Q183 wdt:P3086 ?speed >> pq:P3005 ?qualifier ; SERVICE wikibase:label { bd:serviceParam wikibase:language "en". } }'
SELECT ?speed ?qualifierLabel WHERE {
    _:_blank.b1 <http://www.wikidata.org/prop/statement/P3086> ?speed .
    <http://www.wikidata.org/entity/Q183> <http://www.wikidata.org/prop/P3086> _:_blank.b1 .
    <http://www.wikidata.org/entity/Q183> <http://www.wikidata.org/prop/direct/P3086> ?speed .
    _:_blank.b1 <http://www.wikidata.org/prop/qualifier/P3005> ?qualifier .
    SERVICE <http://wikiba.se/ontology#label>
    {
        <http://www.bigdata.com/rdf#serviceParam> <http://wikiba.se/ontology#language> "en" .
    }
}

Thoughts on HDT

November 29th, 2018 8:58 AM

I’ve recently been implementing an HDT parser in Swift and had some thoughts on the process and on the HDT format more generally. Briefly, I think having a standardized binary format for RDF triples (and quads) is important and HDT satisfies this need. However, I found the HDT documentation and tooling to be lacking in many ways, and think there’s lots of room for improvement.

Benefits

HDT’s single binary file format has benefits for network and disk IO when loading and transferring graphs. That’s its main selling point, and it does a reasonably good job at that. HDT’s use of a RDF term dictionary with pre-assigned numeric IDs means importing into some native triple stores can be optimized. And being able to store RDF metadata about the RDF graph inside the HDT file is a nice feature, though one that requires publishers to make use of it.

Problems and Challenges

I ran into a number of outright problems when trying to implement HDT from scratch:

  • The HDT documentation is incomplete/incorrect in places, and required reverse engineering the existing implementations to determine critical format details; questions remain about specifics (e.g. canonical dictionary escaping):

    Here are some of the issues I found during implementation:

    • DictionarySection says the “section starts with an unsigned 32bit value preamble denoting the type of dictionary implementation,” but the implementation actually uses an unsigned 8 bit value for this purpose

    • FourSectionDictionary conflicts with the previous section on the format URI (http://purl.org/HDT/hdt#dictionaryPlain vs. http://purl.org/HDT/hdt#dictionaryFour)

    • The paper cited for “VByte” encoding claims that value data is stored in “the seven most significant bits in each byte”, but the HDT implementation uses the seven least significant bits

    • “Log64” referenced in BitmapTriples does not seem to be defined anywhere

    • There doesn’t seem to be documentation on exactly how RDF term data (“strings”) is encoded in the dictionary. Example datasets are enough to intuit the format, but it’s not clear why \u and \U escapes are supported, as this adds complexity and inefficiency. Moreover, without a canonical format (including when/if escapes must be used), it is impossible to efficiently implement dictionary lookup

  • The W3C submission seems to differ dramatically from the current format. I understood this to mean that the W3C document was very much outdated compared to the documentation at rdfhdt.org, and the available implementations seem to agree with this understanding

  • There doesn’t seem to be any shared test suite between implementations, and existing tooling makes producing HDT files with non-default configurations difficult/impossible

  • The secondary index format seems to be entirely undocumented

In addition, there are issues that make the format unnecessarily complex, inefficient, non-portable, etc.:

  • The default dictionary encoding format (plain front coding) is inefficient for datatyped literals and unnecessarily allows escaped content, resulting in inefficient parsing

  • Distinct value space for predicate and subject/object dictionary IDs is at odds with many triple stores, and makes interoperability difficult (e.g. dictionary lookup is not just dict[id] -> term, but dict[id, pos] -> term; a single term might have two IDs if it is used as both predicate and subject/object)

  • The use of 3 different checksum algorithms seems unnecessarily complex with unclear benefit

  • A long-standing GitHub issue seems to indicate that there may be licensing issues with the C++ implementation, precluding it from being distributed in Debian systems (and more generally, there seems to be a general lack of responsiveness to GitHub issues, many of which have been open for more than a year without response)

  • The example HDT datasets on rdfhdt.org are of varying quality; e.g. the SWDF dataset was clearly compiled from multiple source documents, but did not ensure unique blank nodes before merging

Open Questions

Instead of an (undocumented) secondary index file, why does HDT not allow multiple triples sections, allowing multiple triple orderings? A secondary index file might still be useful in some cases, but there would be obvious benefits to being able to store and access multiple triple orderings without the extra implementation burden of an entirely separate file format.

Future Directions

In his recent DeSemWeb talk, Axel Polleres suggested that widespread HDT adoption could help to address several challenges faced when publishing and querying linked data. I tend to agree, but think that if we as a community want to choose HDT, we need to put some serious work into improving the documentation, tooling, and portable implementations.

Beyond improvements to existing HDT resources, I think it’s also important to think about use cases that aren’t fully addressed by HDT yet. The HDTQ extension to to support quads is a good example here; allowing a single HDT file to capture multiple named graphs would support many more use cases, especially those relating to graph stores. I’d also like to see a format that supported both triples and quads, allowing the encoding of things like SPARQL RDF Datasets (with a default graph) and TRiG files.

Property Path use in Wikidata Queries

September 28th, 2018 9:06 AM

I recently began taking a look at the Wikidata query logs that were published a couple of months ago and wanted to look into how some features of SPARQL were being used on Wikidata. The first thing I’ve looked at is the use of property paths: how often paths are used, what path operators are used, and with what frequency.

Using the “interval 3” logs (2017-08-07–2017-09-03 representing ~78M successful queries1), I found that ~25% of queries used property paths. The vast majority of these use just a single property path, but there are queries that use as many as 19 property paths:

Pct. Count Number of Paths
74.3048% 58161337 0 paths used in query
24.7023% 19335490 1 paths used in query
0.6729% 526673 2 paths used in query
0.2787% 218186 4 paths used in query
0.0255% 19965 3 paths used in query
0.0056% 4387 7 paths used in query
0.0037% 2865 8 paths used in query
0.0030% 2327 9 paths used in query
0.0011% 865 6 paths used in query
0.0008% 604 11 paths used in query
0.0006% 434 5 paths used in query
0.0005% 398 10 paths used in query
0.0002% 156 12 paths used in query
0.0001% 110 15 paths used in query
0.0001% 101 19 paths used in query
0.0001% 56 13 paths used in query
0.0000% 12 14 paths used in query

I normalized IRIs and variable names used in the paths so that I could look at just the path operators and the structure of the paths. The type of path operators used skews heavily towards * (ZeroOrMore) as well as sequence and inverse paths that can be rewritten as simple BGPs. Here are the structures representing at least 0.1% of the paths in the dataset:

Pct. Count Path Structure
49.3632% 10573772 ?s <iri1> * ?o .
39.8349% 8532772 ?s <iri1> / <iri2> ?o .
4.6857% 1003694 ?s <iri1> / ( <iri2> * ) ?o .
1.8983% 406616 ?s ( <iri1> + ) / ( <iri2> * ) ?o .
1.4626% 313290 ?s ( <iri1> * ) / <iri2> ?o .
1.1970% 256401 ?s ( ^ <iri1> ) / ( <iri2> * ) ?o .
0.7339% 157212 ?s <iri1> + ?o .
0.1919% 41110 ?s ( <iri1> / ( <iri2> * ) ) / ( ^ <iri3> ) ?o .
0.1658% 35525 ?s <iri1> / <iri2> / <iri3> ?o .
0.1496% 32035 ?s <iri1> / ( <iri1> * ) ?o .
0.1124% 11889 ?s ( <iri1> / <iri2> ) / ( <iri3> * ) ?o .

There are also some rare but interesting uses of property paths in these logs:

Pct. Count Path Structure
0.0499% 5274 ?s ( ( <iri1> / ( <iri2> * ) ) / ( <iri3> / ( <iri2> * ) ) ) / ( <iri4> / ( <iri2> * ) ) ?o .
0.0015% 157 ?s ( <iri1> / <iri2> / <iri3> / <iri4> / <iri5> / <iri6> / <iri7> / <iri8> / <iri9> ) * ?o .
0.0003% 28 ?s ( ( ( ( <iri1> / <iri2> / <iri3> ) ? ) / ( <iri4> ? ) ) / ( <iri5> * ) ) / ( <iri6> / ( <iri7> ? ) ) ?o .

Without further investigation it’s hard to say if these represent meaningful queries or are just someone playing with SPARQL and/or Wikidata, but I found them curious.

  1. These numbers don’t align exactly with the Wikidata query dumps as there were some that I couldn’t parse with my tools. ↩︎

SPARQL Blank Nodes

August 5th, 2017 11:40 AM

I recently ran across what I believe to be a mistake in the SPARQL 1.1 Query Language, and thought I’d add some detail here.

Section 5.1.1 of the SPARQL 1.1 specification says:

When using blank nodes of the form _:abc, labels for blank nodes are scoped to the basic graph pattern. A label can be used in only a single basic graph pattern in any query.

However, during the parsing of a SPARQL property paths into the SPARQL algebra, many property path expressions do not result in basic graph patterns. Only those expressions that result in “adjacent triple patterns” produce a basic graph pattern. That means that a graph pattern such as:

_:s <p> ?x ;
    <q>* ?y .

does not result in a BGP. Intuitively, I think this should be allowed. It’s intention seems clear. However, it results in two primary algebraic components: a basic graph pattern with the triple pattern :s <p> ?x, and a property path :s ZeroOrMorePath(<q>) ?y. This certainly breaks the rule about only using blank nodes in a single basic graph pattern.

The language in section 5.1.1 originated in SPARQL 1.0, and I believe was just overlooked during the update to the language that added property paths.

When handling blank node labels, instead of following the exact language of the specification, I believe SPARQL implementations should instead allow blank node labels that appear in any adjacent set of basic graph patterns and property paths.

Andy Seaborne helpfully added this issue to the SPARQL 1.1 Errata.

Feature for SPARQL 1.2

July 6th, 2017 6:15 PM

Jindřich Mynarz recently posted a good list of “What I would like to see in SPARQL 1.2” and I thought I’d add a few comments as well as some of my own wished-for features.

Explicit ordering in GROUP_CONCAT, and quads support for both the HTTP Graph Store Protocol and CONSTRUCT queries (items 2, 5, and 8 in Jindřich’s list) seem like obvious improvements to SPARQL with a clear path forward for semantics and implementation.

Here are the some of the other wished-for features:

  • Explicitly specify the REDUCED modifier (#1)

    As an implementor, I quite like the fact that REDUCED is “underspecified.” It allows optimization opportunities that are much cheaper than a full DISTINCT would be, while still reducing result cardinality. I think it’s unfortunate that REDUCED hasn’t seen much use over the years, but I’m not sure what a better-specified REDUCED operator would do different from DISTINCT.

  • Property path quantifiers (#3)

    The challenge of supporting path quantifiers like elt{n,m} is figuring out what the result cardinality should be. The syntax for this was standardized during the development of SPARQL 1.1, but we couldn’t find consensus on whether elt{n,m} should act like a translation to an equivalent BGP/UNION pattern or like the arbitrary length paths (which do not introduce duplicate results). For small values of n and m, the translation approach seems natural, but as they grow, it’s not obvious that use cases would only want the translation semantics and not the non-duplicating semantics.

    Perhaps a new syntax could be developed which would allow the query author to indicate the desired cardinality semantics.

  • Date time/duration arithmetic functions (#6)

    This seems like a good idea, and very useful to some users, though it would substantially increase the size and number of the built-in functions and operators.

  • Support for non-scalar-producing aggregates (#9)

    I’m interested to see how this plays out as a SPARQL extension in systems like Stardog. It likely has a lot of interesting uses, but I worry that it would greatly complicate the query and data models, leading to calls to extend the semantics of RDF, and add new query forms, operators, and functions.

  • Structured serialization format for SPARQL queries (#10)

    I’m indifferent to this. I suspect some people would benefit from such a format, but I don’t think I’ve ever had need for one (where I couldn’t just parse a query myself and use the resulting AST) and it would be another format to support for implementors.

Beyond that, here are some other things I’d like to see worked on (either standardization, or cross-implementation support):

  • Support for window functions

  • Explicit support for named graphs in SERVICE blocks

    This can be partially accomplished right now for hard-coded graphs by using an endpoint url with the default-graph-uri query parameter, but I’d like more general support that could work dynamically with the active graph when the SERVICE block is evaluated.

  • Structured errors for use in the SPARQL Protocol

    My preference for this would be using the RFC7807 “Problem Details” JSON format, with a curated list of IRIs and associated metadata representing common error types (syntax errors, query-to-complex or too-many-requests refusals, etc.). There’s a lot of potential for smarter clients if errors contain structured data (e.g. SPARQL editors can highlight/fix syntax issues; clients could choose alternate data sources such as triple pattern fragments when the endpoint is overwhelmed).

SPARQL Limit by Resource

March 20th, 2017 7:17 PM

As part of work on the Attean Semantic Web toolkit, I found some time to work through limit-by-resource, an oft-requested SPARQL feature and one that my friend Kjetil lobbied for during the SPARQL 1.1 design phase. As I recall, the biggest obstacle to pursuing limit-by-resource in SPARQL 1.1 was that nobody had a clear idea of how to fit it nicely into the existing SPARQL syntax and semantics. With hindsight, and some time spent working on a prototype, I now suspect that this was because we first needed to nail down the design of aggregates and let aggregation become a first-class feature of the language.

Now, with a standardized syntax and semantics for aggregation in SPARQL, limit-by-resource seems like just a small enhancement to the existing language and implementations by the addition of window functions. I implemented a RANK operator in Attean, used in conjunction with the already-existing GROUP BY. RANK works on groups just like aggregates, but instead of producing a single row for each group, the rows of the group are sorted, and given an integer rank which is bound to a new variable. The groups are then “un-grouped,” yielding a single result set. Limit-by-resource, then, is a specific use-case for ranking, where groups are established by the resource in question, ranking is either arbitrary or user-defined, and a filter is added to only keep rows with a rank less than a given threshold.

I think the algebraic form of these operations should be relatively intuitive and straightforward. New Window and Ungroup algebra expressions are introduced akin to Aggregation and AggregateJoin, respectively. Window(G, var, WindowFunc, args, order comparators) operates over a set of grouped results (either the output of Group or another Window), and Ungroup(G) flattens out a set of grouped results into a multiset.

If we wanted to use limit-by-resource to select the two eldest students per school, we might end up with something like this:

Project(
    Filter(
        ?rank <= 2,
        Ungroup(
            Window(
                Group((?school), BGP(?p :name ?name . ?p :school ?school . ?p :age ?age .)),
                ?rank,
                Rank,
                (),
                (DESC(?age)),
            )
        )
    ),
    {?age, ?name, ?school}
)

Students with their ages and schools are matched with a BGP. Grouping is applied based on the school. Rank with ordering by age is applied so that, for example, the result for the eldest student in each school is given ?rank=1, the second eldest ?rank=2, and so on. Finally, we apply a filter so that we keep only results where ?rank is 1 or 2.

The syntax I prototyped in Attean allows a single window function application applied after a GROUP BY clause:

PREFIX : <http://example.org/>
SELECT ?age ?name ?school WHERE {
    ?p :name ?name ;
       :school ?school ;
       :age ?age .
}
GROUP BY ?school
RANK(DESC(?age)) AS ?rank
HAVING (?rank <= 2)

However, a more complete solution might align more closely with existing SQL window function syntaxes, allowing multiple functions to be used at the same time (appearing syntactically in the same place as aggregation functions).

PREFIX : <http://example.org/>
SELECT ?age ?name ?school WHERE {
    ?p :name ?name ;
       :school ?school ;
       :age ?age .
}
GROUP BY ?school
HAVING (RANK(ORDER BY DESC(?age)) <= 2)

or:

PREFIX : <http://example.org/>
SELECT ?age ?name ?school WHERE {
    {
        SELECT ?age ?name ?school (RANK(GROUP BY ?school ORDER BY DESC(?age)) AS ?rank) WHERE {
            ?p :name ?name ;
               :school ?school ;
               :age ?age .
        }
    }
    FILTER(?rank <= 2)
}

Swifty Serd

February 17th, 2017 10:04 AM

I recently found myself wanting a simple and efficient way to parse RDF files in Swift, and ended up writing a small library that uses David Robillard’s excellent Serd library for RDF syntax. The resulting projects (available on GitHub) are swift-serd, a low-level wrapper around the C API, and SerdParser, a more Swift-y higher-level API that allows parsing of RDF content and handling the resulting triples.

The resulting code benefits from serd’s performance and standards-compliance, and allows very simple parsing of N-Triples and Turtle files:

import SerdParser

// extract all foaf:name triples
let parser = SerdParser()
let count = try parser.parse(file: filename) { (s, p, o) in
    if case .iri("http://xmlns.com/foaf/0.1/name") = p {
        print("\(s) has name \(o) .")
    }
}
print("\(count) triples processed")