behind the gist

thoughts on code, analysis and craft

Fast enclosing region lookup with JTS

I’ve been analyzing crime data and one of the standard tasks is to aggregate statistics by region. There are many different ways to define regions based on the user or use case. They can be things like police beats, city wards or census blocks. For all of these, aggregation requires us to be able to compute which polygons enclose a given point. It is relatively easy to do in PostGIS, but I needed the same behavior in a live streaming environment. I’ll show how easy this is in SQL and then how to achieve the same thing in Java using the Java Topology Suite.

SQL Version

With a crimes table having a coordinates (point) column and a census_blocks table having a region (multipolygon) column, the mapping is simple in SQL:

1
update crimes set block_id = (select id from census_blocks where ST_Covers(region, coordinates) limit 1);

The ST_Covers function will return true if the point is on the boundary or in the interior of the region (unlike ST_Contains which does not include the boundary). This means we can have multiple regions cover a point if the point happens to lie right on their boundary. We arbitrarily choose just one of them in this example.

Now you are all set to do whatever statistics and aggregation you want based on the block_id column.

JTS Version

For Java, we are going to set up a region index that behaves a little differently from the SQL case. We’ll use the index to store the regions identified by a string key. These regions could represent a single layer (such as census blocks), or multiple layers (blocks and police beats), it just depends on the key naming convention you choose.

The core parts of the index class are shown below.

core indexing functionality
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
private final GeometryFactory factory;
private final WKBReader reader;
private final STRtree regionIndex;

public void addRegion(String key, byte[] wkb) throws ParseException {
  Geometry geom = reader.read(wkb);
  IndexedRegion region = new IndexedRegion(key, geom);
  regionIndex.insert(geom.getEnvelopeInternal(), region);
}

public List<String> regionKeysCovering(Point point) {
  if (point.getSRID() != factory.getSRID()) {
    throw new IllegalArgumentException("SRID of query point must match SRID of index");
  }

  List<?> possibles = regionIndex.query(point.getEnvelopeInternal());
  List<String> regions = new ArrayList<>(possibles.size());
  for (Object o : possibles) {
    IndexedRegion region = (IndexedRegion) o;
    if (region.covers(point)) {
      regions.add(region.getKey());
    }
  }
  return regions;
}

The STRtree is one of two different spatial index classes provided in JTS. It provides minimum memory utilization but does not support dynamic updates. That is perfect for our case since our regions are largely pre-defined and hardly change. Also, note the spatial index classes in JTS do a first-level test based on the bounding box of the region. From the list of possible covers, we still need to determine those that really do cover the point.

The final wrinkle is how to handle the coordinate system (SRID) for the regions and points to make sure we are comparing apples and apples. Our index has a fixed coordinate system (specified via factory). In this case, we are using Well Known Binary (WKB) to represent the region polygons. The good thing about that format is that it is compact and widely supported. We can easily extract our regions from the database in WKB if we want. The downside is that the coordinate system is not encoded in WKB, just the geometry coordinates. Consequently, the caller has the responsibility of providing region WKB representations to addRegion which match the coordinate system provided in the factory. In regionKeysCovering, we can compare the point’s reference system with that of the factory to make sure they are the same.

As mentioned above, STRtree will only do the first level test. The results may not actually lie within the region. For the authoritative cover determination, we define an IndexedRegion whose instances we store in the STRtree.

IndexedRegion
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
private static class IndexedRegion {
  private final String key;
  private final PreparedGeometry poly;

  public IndexedRegion(String key, Geometry poly) {
    this.key = key;
    this.poly = PreparedGeometryFactory.prepare(poly);
  }

  public String getKey() {
    return key;
  }

  public boolean covers(Point pt) {
    return poly.covers(pt);
  }
}

The important point about this implementation is that it uses a PreparedGeometry. “Plain” geometries in JTS support the same covers method, but aren’t optimized for our use case. We will be using the same geometry over and over again to perform the cover test with many different points. Plain geometries are implemented to support easy modification. PreparedGeometry, in contrast, is designed for the case where the geometry doesn’t change and will be used over and over again, just as we are. It uses extra time up-front to generate supplementary data enabling optimized spatial operator computations. In our case, using plain Geometry takes over six times longer to test a few million points than PreparedGeometry takes.

In the end, it is very simple to create an efficient streamed enclosing region detector with JTS. We just need to pay attention to some details based on the characteristics of our use case.

The full gist is available here.