Features for SPARQL 1.2

July 6th, 2017 6:15 PM

Jindřich Mynarz recently posted a good list of “What I would like to see in SPARQL 1.2” and I thought I’d add a few comments as well as some of my own wished-for features.

Explicit ordering in GROUP_CONCAT, and quads support for both the HTTP Graph Store Protocol and CONSTRUCT queries (items 2, 5, and 8 in Jindřich’s list) seem like obvious improvements to SPARQL with a clear path forward for semantics and implementation.
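
As a small illustration of the first of those, explicit GROUP_CONCAT ordering might look something like the query below. The ORDER BY inside the aggregate call is purely hypothetical syntax (SPARQL 1.1 leaves the concatenation order up to the implementation), and the foaf:name data is just for illustration:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
# NOTE: the ORDER BY inside GROUP_CONCAT is hypothetical SPARQL 1.2 syntax,
# not something SPARQL 1.1 supports.
SELECT ?person (GROUP_CONCAT(?name; SEPARATOR=", " ORDER BY ?name) AS ?names)
WHERE {
    ?person foaf:name ?name .
}
GROUP BY ?person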

Here are some of the other wished-for features:

  • Explicitly specify the REDUCED modifier (#1)

    As an implementor, I quite like the fact that REDUCED is “underspecified.” It allows optimization opportunities that are much cheaper than a full DISTINCT would be, while still reducing result cardinality. I think it’s unfortunate that REDUCED hasn’t seen much use over the years, but I’m not sure what a better-specified REDUCED operator would do differently from DISTINCT.

  • Property path quantifiers (#3)

    The challenge of supporting path quantifiers like elt{n,m} is figuring out what the result cardinality should be. The syntax for this was proposed during the development of SPARQL 1.1, but we couldn’t find consensus on whether elt{n,m} should act like a translation to an equivalent BGP/UNION pattern or like the arbitrary-length paths (which do not introduce duplicate results). For small values of n and m, the translation approach seems natural, but as they grow, it’s not obvious that use cases would only want the translation semantics and not the non-duplicating semantics (a sketch of both readings appears after this list).

    Perhaps a new syntax could be developed which would allow the query author to indicate the desired cardinality semantics.

  • Date time/duration arithmetic functions (#6)

    This seems like a good idea, and very useful to some users, though it would substantially increase the size and number of the built-in functions and operators.

  • Support for non-scalar-producing aggregates (#9)

    I’m interested to see how this plays out as a SPARQL extension in systems like Stardog. It likely has a lot of interesting uses, but I worry that it would greatly complicate the query and data models, leading to calls to extend the semantics of RDF and to add new query forms, operators, and functions.

  • Structured serialization format for SPARQL queries (#10)

    I’m indifferent to this. I suspect some people would benefit from such a format, but I don’t think I’ve ever had need for one (where I couldn’t just parse a query myself and use the resulting AST) and it would be another format to support for implementors.
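
Returning to the property path quantifiers item above, here is a rough sketch of the two candidate semantics for a quantified path, using a made-up foaf:knows example and the elt{n,m} syntax that was ultimately dropped from SPARQL 1.1:

PREFIX : <http://example.org/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

# Hypothetical quantified path (the {n,m} syntax is not part of SPARQL 1.1):
SELECT ?person WHERE {
    :alice foaf:knows{1,2} ?person .
}

# Under the BGP/UNION translation, it would behave like:
SELECT ?person WHERE {
    { :alice foaf:knows ?person . }
    UNION
    { :alice foaf:knows ?mid . ?mid foaf:knows ?person . }
}

The UNION form can return duplicate rows when someone is reachable both directly and through an intermediary, while the non-duplicating semantics of arbitrary-length paths (as with foaf:knows+) would return each such ?person only once.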

Beyond that, here are some other things I’d like to see worked on (either standardization, or cross-implementation support):

  • Support for window functions

  • Explicit support for named graphs in SERVICE blocks

    This can be partially accomplished right now for hard-coded graphs by using an endpoint URL with the default-graph-uri query parameter (see the sketch after this list), but I’d like more general support that could work dynamically with the active graph when the SERVICE block is evaluated.

  • Structured errors for use in the SPARQL Protocol

    My preference for this would be using the RFC 7807 “Problem Details” JSON format, with a curated list of IRIs and associated metadata representing common error types (syntax errors, query-too-complex or too-many-requests refusals, etc.). There’s a lot of potential for smarter clients if errors contain structured data (e.g. SPARQL editors can highlight/fix syntax issues; clients could choose alternate data sources such as triple pattern fragments when the endpoint is overwhelmed).
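
As an illustration of the SERVICE workaround mentioned above, a hard-coded graph can be selected today by embedding the default-graph-uri parameter in the SERVICE IRI. This is only a sketch; the endpoint and graph IRIs are made up:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name WHERE {
    # The query string selects http://example.org/graphs/people as the
    # default graph for the remote query (endpoint and graph are hypothetical).
    SERVICE <http://example.org/sparql?default-graph-uri=http%3A%2F%2Fexample.org%2Fgraphs%2Fpeople> {
        ?person foaf:name ?name .
    }
}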

SPARQL Limit by Resource

March 20th, 2017 7:17 PM

As part of work on the Attean Semantic Web toolkit, I found some time to work through limit-by-resource, an oft-requested SPARQL feature and one that my friend Kjetil lobbied for during the SPARQL 1.1 design phase. As I recall, the biggest obstacle to pursuing limit-by-resource in SPARQL 1.1 was that nobody had a clear idea of how to fit it nicely into the existing SPARQL syntax and semantics. With hindsight, and some time spent working on a prototype, I now suspect that this was because we first needed to nail down the design of aggregates and let aggregation become a first-class feature of the language.

Now, with a standardized syntax and semantics for aggregation in SPARQL, limit-by-resource seems like just a small enhancement to the existing language and implementations: the addition of window functions. I implemented a RANK operator in Attean, used in conjunction with the already-existing GROUP BY. RANK works on groups just like aggregates, but instead of producing a single row for each group, the rows of the group are sorted and given an integer rank, which is bound to a new variable. The groups are then “un-grouped,” yielding a single result set. Limit-by-resource, then, is a specific use case for ranking, where groups are established by the resource in question, ranking is either arbitrary or user-defined, and a filter is added to keep only rows with a rank less than a given threshold.

I think the algebraic form of these operations should be relatively intuitive and straightforward. New Window and Ungroup algebra expressions are introduced akin to Aggregation and AggregateJoin, respectively. Window(G, var, WindowFunc, args, order comparators) operates over a set of grouped results (either the output of Group or another Window), and Ungroup(G) flattens out a set of grouped results into a multiset.

If we wanted to use limit-by-resource to select the two eldest students per school, we might end up with something like this:

Project(
    Filter(
        ?rank <= 2,
        Ungroup(
            Window(
                Group((?school), BGP(?p :name ?name . ?p :school ?school . ?p :age ?age .)),
                ?rank,
                Rank,
                (),
                (DESC(?age))
            )
        )
    ),
    {?age, ?name, ?school}
)

Students with their ages and schools are matched with a BGP. Grouping is applied based on the school. Rank with ordering by age is applied so that, for example, the result for the eldest student in each school is given ?rank=1, the second eldest ?rank=2, and so on. Finally, we apply a filter so that we keep only results where ?rank is 1 or 2.

The syntax I prototyped in Attean allows a single window function application applied after a GROUP BY clause:

PREFIX : <http://example.org/>
SELECT ?age ?name ?school WHERE {
    ?p :name ?name ;
       :school ?school ;
       :age ?age .
}
GROUP BY ?school
RANK(DESC(?age)) AS ?rank
HAVING (?rank <= 2)

However, a more complete solution might align more closely with existing SQL window function syntaxes, allowing multiple functions to be used at the same time (appearing syntactically in the same place as aggregation functions).

PREFIX : <http://example.org/>
SELECT ?age ?name ?school WHERE {
    ?p :name ?name ;
       :school ?school ;
       :age ?age .
}
GROUP BY ?school
HAVING (RANK(ORDER BY DESC(?age)) <= 2)

or:

PREFIX : <http://example.org/>
SELECT ?age ?name ?school WHERE {
    {
        SELECT ?age ?name ?school (RANK(GROUP BY ?school ORDER BY DESC(?age)) AS ?rank) WHERE {
            ?p :name ?name ;
               :school ?school ;
               :age ?age .
        }
    }
    FILTER(?rank <= 2)
}

Swifty Serd

February 17th, 2017 10:04 AM

I recently found myself wanting a simple and efficient way to parse RDF files in Swift, and ended up writing a small library that uses David Robillard’s excellent Serd library for RDF syntax. The resulting projects (available on GitHub) are swift-serd, a low-level wrapper around the C API, and SerdParser, a more Swift-y higher-level API that allows parsing of RDF content and handling the resulting triples.

The resulting code benefits from serd’s performance and standards-compliance, and allows very simple parsing of N-Triples and Turtle files:

import SerdParser

// extract all foaf:name triples (filename is assumed to hold the path to a Turtle or N-Triples file)
let parser = SerdParser()
let count = try parser.parse(file: filename) { (s, p, o) in
    if case .iri("http://xmlns.com/foaf/0.1/name") = p {
        print("\(s) has name \(o) .")
    }
}
print("\(count) triples processed")

Worried

November 8th, 2016 11:47 PM

I’m devastated and heartbroken. I’m worried about the people in my life who would not have health insurance if not for Obamacare. I’m worried for Muslims, immigrants, the disabled, women, and anyone else who will be on the receiving end of the harassment and intolerance that has been normalized by our President-elect. I’m worried that we will dangerously backtrack on progress made in fighting climate change, and waste time obstinately defending fossil fuels and opposing green alternatives. I’m worried about living in a world where truth and facts don’t seem to carry any weight, where science is viewed with skepticism, and experts are dismissed out of hand.

I’m worried, but resolved. Turns out, there’s a lot of work to do.

Election Day

November 8th, 2016 6:46 AM

President Obama:

We are not as divided as our politics suggests

I’m drawing inspiration from this. Regardless of what happens today, there will be a lot of work to do. I hope we can come together, with the understanding that we all want to make this country better. That will require listening to others, and trying to understand their concerns. And it will require being open to facts and evidence, even when they contradict beliefs. It’s election day. Let’s do this.

2015

December 31st, 2015 7:56 PM

At the airport for the last flight of 2015.

2015 Travel Map

Happy New Year!

ISWC 2014

October 28th, 2014 6:44 PM

Last week I attended ISWC 2014 in Riva del Garda and had a very nice time.

Riva del Garda

I presented a paper at the Developers Workshop about designing the new PerlRDF and SPARQLKit systems using trait-based programming. Continuing the PerlRDF theme, Kjetil presented his work on RDF::LinkedData.

Answering SPARQL Queries over Databases under OWL 2 QL Entailment Regime by Kontchakov, et al. caught my attention during the main conference. The authors show a technique to map query answering under OWL QL entailment to pure SPARQL 1.1 queries under simple entailment, allowing OWL QL reasoning to be used with SPARQL 1.1 systems that do not otherwise support OWL. An important challenge highlighted by the work (and echoed by the audience after the presentation) was the limited performance of existing property path implementations. It seems as if there might be a subset of SPARQL 1.1 property paths (e.g. with limited nesting of path operators) that could be implemented more efficiently while providing enough expressivity to allow things like OWL reasoning.
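
As a rough illustration of the flavor of rewriting involved (this is just a simple-entailment property path sketch, not the authors’ actual translation, and the :Person class is made up), instance retrieval under subclass reasoning can be expressed as:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX : <http://example.org/>

# Find instances of :Person, including members of transitively declared subclasses.
SELECT ?x WHERE {
    ?x rdf:type/rdfs:subClassOf* :Person .
}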

There seemed to be a lot of interest in more flexible solutions to SPARQL query answering. Ruben’s Linked Data Fragments work addresses one part of this issue (though I remain somewhat skeptical about it as a general-purpose solution). That work was received very well at the conference, winning the Best Demo award. In conversations I had, there was also a lot of interest in the development of SPARQL systems that could handle capacity and scalability issues more gracefully (being smart enough to use appropriate HTTP error codes when a query cannot be answered due to system load or the complexity of the query, allowing a fallback to linked data fragments, for example).

A Year of Travel

December 19th, 2013 8:57 PM

I flew 52 flights in 2013, totaling 68,266 miles.

Air Tahiti Nui Wing

45 of those flights, and 60,817 of those miles, were part of our around-the-world trip.

2013 Travel Map

Upon returning to the US we moved to California, driving across the country in October.

You're not in Kansas anymore! foursquare badge

I expect there’ll be quite a bit less traveling in 2014, but hopefully just as much adventure.

IRI Encoding Troubles

July 10th, 2013 1:57 PM

Recently, I was tracking down a bug in RDF::Trine, submitted by Jonas Smedegaard, relating to the handling of Unicode in IRIs. The bug involved the decoding of a punycode URI (an inadvertent “feature” that snuck into RDF::Trine due to a feature in URI):

http://www.xn--hestebedgrd-58a.dk/

Strangely, this URI was being treated differently than other punycode URIs (e.g. http://xn--df-oiy.ws/), causing encoding errors in a completely different part of the code.

Upon investigating the problem, I determined that the punycode decoding code in the URI package treats punycode URIs differently depending on whether they contain any Unicode codepoints that cannot be represented in the Latin-1 encoding. For IRIs with all low-codepoint characters, such as http://www.hestebedgård.dk, the Perl value returned by the URI->as_iri method is a Latin-1-encoded string. High-codepoint IRIs such as http://✪df.ws/ are returned as UTF-8-encoded values. This difference caused problems when the values were subsequently passed into Redland’s C code, which expected UTF-8 values.

As Gisle Aas, the author of URI, noted in the bug report I submitted on this issue, the simple fix is to call utf8::upgrade on values returned by URI->as_iri. While I believe this should be done in the as_iri method (as it is documented to return “a Unicode string”), until that is the case I hope this post might save some time and trouble for the next person to come across this issue.