Property Path use in Wikidata Queries
I recently started digging into the Wikidata query logs that were published a couple of months ago, to see how some features of SPARQL are being used on Wikidata. The first thing I’ve looked at is property paths: how often they are used, which path operators appear, and with what frequency.
Using the “interval 3” logs (2017-08-07 to 2017-09-03, representing ~78M successful queries [1]), I found that about a quarter of queries (~25.7%) used property paths. The vast majority of these use just a single property path, but there are queries that use as many as 19 property paths:
Pct. | Count | Number of Paths |
---|---|---|
74.3048% | 58161337 | 0 paths used in query |
24.7023% | 19335490 | 1 path used in query |
0.6729% | 526673 | 2 paths used in query |
0.2787% | 218186 | 4 paths used in query |
0.0255% | 19965 | 3 paths used in query |
0.0056% | 4387 | 7 paths used in query |
0.0037% | 2865 | 8 paths used in query |
0.0030% | 2327 | 9 paths used in query |
0.0011% | 865 | 6 paths used in query |
0.0008% | 604 | 11 paths used in query |
0.0006% | 434 | 5 paths used in query |
0.0005% | 398 | 10 paths used in query |
0.0002% | 156 | 12 paths used in query |
0.0001% | 110 | 15 paths used in query |
0.0001% | 101 | 19 paths used in query |
0.0001% | 56 | 13 paths used in query |
0.0000% | 12 | 14 paths used in query |
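
As a quick sanity check on the shares quoted above, the counts in the table can be re-aggregated directly. This is only a small sketch in Python: the dictionary is copied by hand from the table, and the variable names are my own choices for illustration.

```python
# Counts copied from the table above: number of paths -> number of queries.
path_counts = {
    0: 58161337, 1: 19335490, 2: 526673, 3: 19965, 4: 218186,
    5: 434, 6: 865, 7: 4387, 8: 2865, 9: 2327, 10: 398,
    11: 604, 12: 156, 13: 56, 14: 12, 15: 110, 19: 101,
}

total = sum(path_counts.values())

# Queries that use at least one property path.
with_paths = total - path_counts[0]
share_with_paths = 100 * with_paths / total

# Among queries with paths, how many use exactly one.
single_path_share = 100 * path_counts[1] / with_paths

print(f"{total} parsed queries, {share_with_paths:.1f}% use at least one path")
print(f"{single_path_share:.1f}% of those use exactly one path")
```
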
I normalized IRIs and variable names used in the paths so that I could look at just the path operators and the structure of the paths.
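
To give a rough idea of what this kind of normalization can look like, here is a minimal sketch in Python. It is not the tooling used for the analysis: `normalize_path` is a hypothetical helper that assumes the path expression has already been extracted from the query and that prefixed names have been expanded to full `<...>` IRIs. It numbers IRIs in order of first appearance, so a repeated IRI keeps the same placeholder.

```python
import re

# Matches a full IRI reference such as <http://www.wikidata.org/prop/direct/P31>.
IRI_PATTERN = re.compile(r"<[^<>\s]*>")


def normalize_path(path_expr: str) -> str:
    """Replace each distinct IRI with <iri1>, <iri2>, ... in order of first use."""
    placeholders = {}

    def replace(match):
        iri = match.group(0)
        if iri not in placeholders:
            placeholders[iri] = f"<iri{len(placeholders) + 1}>"
        return placeholders[iri]

    return IRI_PATTERN.sub(replace, path_expr)


def normalize_pattern(path_expr: str) -> str:
    """Wrap a normalized path in fixed ?s / ?o endpoints (an assumed convention)."""
    return f"?s {normalize_path(path_expr)} ?o ."


example = (
    "<http://www.wikidata.org/prop/direct/P31> / "
    "( <http://www.wikidata.org/prop/direct/P279> * )"
)
print(normalize_pattern(example))
```

With this, two queries that differ only in which properties they traverse collapse to the same structure, which is what makes the frequency tables possible.
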
The types of path operators used skew heavily towards * (ZeroOrMore), as well as sequence and inverse paths that can be rewritten as simple BGPs.
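
The rewrite of sequence and inverse paths into basic graph patterns is mechanical: a sequence introduces a fresh intermediate variable between consecutive steps, and an inverse step swaps subject and object. The sketch below illustrates that idea in Python. The function name `sequence_to_bgp`, the `(iri, inverse)` step encoding, and the `?_v1`-style variable names are all assumptions made for this example, and it only handles paths built from plain and inverted IRIs joined by `/` (no `*`, `+`, or `?`).

```python
def sequence_to_bgp(subject, steps, obj):
    """Expand a sequence path into triple patterns.

    steps is a list of (iri, inverse) tuples; inverse=True stands for ^iri.
    Intermediate nodes get fresh variables ?_v1, ?_v2, ... which are assumed
    not to clash with variables already used in the query.
    """
    triples = []
    current = subject
    for index, (iri, inverse) in enumerate(steps):
        is_last = index == len(steps) - 1
        nxt = obj if is_last else f"?_v{index + 1}"
        # An inverse step is the same triple with subject and object swapped.
        s, o = (nxt, current) if inverse else (current, nxt)
        triples.append(f"{s} {iri} {o} .")
        current = nxt
    return triples


# ?s <iri1> / <iri2> ?o .
for triple in sequence_to_bgp("?s", [("<iri1>", False), ("<iri2>", False)], "?o"):
    print(triple)

# ?s ( ^ <iri1> ) / <iri2> ?o .
for triple in sequence_to_bgp("?s", [("<iri1>", True), ("<iri2>", False)], "?o"):
    print(triple)
```

Paths containing `*` or `+` cannot be expanded this way, since they match chains of arbitrary length.
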
Here are the structures representing at least 0.1% of the paths in the dataset:
Pct. | Count | Path Structure |
---|---|---|
49.3632% | 10573772 | ?s <iri1> * ?o . |
39.8349% | 8532772 | ?s <iri1> / <iri2> ?o . |
4.6857% | 1003694 | ?s <iri1> / ( <iri2> * ) ?o . |
1.8983% | 406616 | ?s ( <iri1> + ) / ( <iri2> * ) ?o . |
1.4626% | 313290 | ?s ( <iri1> * ) / <iri2> ?o . |
1.1970% | 256401 | ?s ( ^ <iri1> ) / ( <iri2> * ) ?o . |
0.7339% | 157212 | ?s <iri1> + ?o . |
0.1919% | 41110 | ?s ( <iri1> / ( <iri2> * ) ) / ( ^ <iri3> ) ?o . |
0.1658% | 35525 | ?s <iri1> / <iri2> / <iri3> ?o . |
0.1496% | 32035 | ?s <iri1> / ( <iri1> * ) ?o . |
0.1124% | 11889 | ?s ( <iri1> / <iri2> ) / ( <iri3> * ) ?o . |
There are also some rare but interesting uses of property paths in these logs:
Pct. | Count | Path Structure |
---|---|---|
0.0499% | 5274 | ?s ( ( <iri1> / ( <iri2> * ) ) / ( <iri3> / ( <iri2> * ) ) ) / ( <iri4> / ( <iri2> * ) ) ?o . |
0.0015% | 157 | ?s ( <iri1> / <iri2> / <iri3> / <iri4> / <iri5> / <iri6> / <iri7> / <iri8> / <iri9> ) * ?o . |
0.0003% | 28 | ?s ( ( ( ( <iri1> / <iri2> / <iri3> ) ? ) / ( <iri4> ? ) ) / ( <iri5> * ) ) / ( <iri6> / ( <iri7> ? ) ) ?o . |
Without further investigation it’s hard to say if these represent meaningful queries or are just someone playing with SPARQL and/or Wikidata, but I found them curious.
[1] These numbers don’t align exactly with the Wikidata query dumps, as there were some queries that I couldn’t parse with my tools.