Moving Past PostgreSQL
Well, PostgreSQL didn’t scale well, either. It hit a wall above 3 million rows (data from New England, California and Alaska). I’m not well versed in how to tune PostgreSQL, even for something simple like “use all the memory you can”, but I’m not optimistic it would have mattered. At some point I’d like to go back and look into PostGIS, but I’ve ventured away from the RDBMS model for the moment, and I won’t return until I explore more fully the current implementation using an R-Tree [Guttman84].
Perhaps most exciting in developing an R-Tree for my project with the Tiger/Line data comes from the paper Nearest Neighbor Queries:
A frequently encountered type of query in Geographic Information Systems is to find the k nearest neighbor objects to a given point in space. Processing such queries requires substantially different search algorithms than those for location or range queries. In this paper we present an efficient branch-and-bound R-tree traversal algorithm to find the nearest neighbor object to a point, and then generalize it to finding the k nearest neighbors.
As it turns out, the work described in the paper uses Tiger/Line data for their research.
I’m still working on implementing the nearest-neighbor search algorithm, but I’ve got a working prototype implementation of the R-Tree structure in Perl with intersection queries (which end up working well for densely populated areas of the Tiger/Line data). I’ve got a packed serialization format for the tree which I can currently freeze and thaw, and I’m hoping at some point soon to be able to avoid any object instantiation and simply mmap
the entire tree structure. I’ve already moved some of the implementation over to C for speed reasons, and the mmap
idea would work well with an entirely C-based implementation.