I’ve been analyzing crime data and one of the standard tasks is to aggregate statistics by region. There are many different ways to define regions based on the user or use case. They can be things like police beats, city wards or census blocks. For all of these, aggregation requires us to be able to compute which polygons enclose a given point. It is relatively easy to do in PostGIS, but I needed the same behavior in a live streaming environment. I’ll show how easy this is in SQL and then how to achieve the same thing in Java using the Java Topology Suite.
SQL Version
With a crimes
table having a coordinates
(point) column and a census_blocks
table having a region
(multipolygon) column, the mapping is simple in SQL:
1
|
|
The ST_Covers
function will return true if the point is on the boundary or in the interior of the region (unlike ST_Contains
which does not include the boundary). This means we can have multiple regions cover a point if the point happens to lie right on their boundary. We arbitrarily choose just one of them in this example.
Now you are all set to do whatever statistics and aggregation you want based on the block_id
column.
JTS Version
For Java, we are going to set up a region index that behaves a little differently from the SQL case. We’ll use the index to store the regions identified by a string key. These regions could represent a single layer (such as census blocks), or multiple layers (blocks and police beats), it just depends on the key naming convention you choose.
The core parts of the index class are shown below.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 |
|
The STRtree
is one of two different spatial index classes provided in JTS. It provides minimum memory utilization but does not support dynamic updates. That is perfect for our case since our regions are largely pre-defined and hardly change. Also, note the spatial index classes in JTS do a first-level test based on the bounding box of the region. From the list of possible covers, we still need to determine those that really do cover the point.
The final wrinkle is how to handle the coordinate system (SRID) for the regions and points to make sure we are comparing apples and apples. Our index has a fixed coordinate system (specified via factory
). In this case, we are using Well Known Binary (WKB) to represent the region polygons. The good thing about that format is that it is compact and widely supported. We can easily extract our regions from the database in WKB if we want. The downside is that the coordinate system is not encoded in WKB, just the geometry coordinates. Consequently, the caller has the responsibility of providing region WKB representations to addRegion
which match the coordinate system provided in the factory. In regionKeysCovering
, we can compare the point’s reference system with that of the factory to make sure they are the same.
As mentioned above, STRtree
will only do the first level test. The results may not actually lie within the region. For the authoritative cover determination, we define an IndexedRegion
whose instances we store in the STRtree
.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
|
The important point about this implementation is that it uses a PreparedGeometry
. “Plain” geometries in JTS support the same covers
method, but aren’t optimized for our use case. We will be using the same geometry over and over again to perform the cover test with many different points. Plain geometries are implemented to support easy modification. PreparedGeometry
, in contrast, is designed for the case where the geometry doesn’t change and will be used over and over again, just as we are. It uses extra time up-front to generate supplementary data enabling optimized spatial operator computations. In our case, using plain Geometry
takes over six times longer to test a few million points than PreparedGeometry
takes.
In the end, it is very simple to create an efficient streamed enclosing region detector with JTS. We just need to pay attention to some details based on the characteristics of our use case.
The full gist is available here.