behind the gist

thoughts on code, analysis and craft

Fast enclosing region lookup with JTS

I’ve been analyzing crime data and one of the standard tasks is to aggregate statistics by region. There are many different ways to define regions based on the user or use case. They can be things like police beats, city wards or census blocks. For all of these, aggregation requires us to be able to compute which polygons enclose a given point. It is relatively easy to do in PostGIS, but I needed the same behavior in a live streaming environment. I’ll show how easy this is in SQL and then how to achieve the same thing in Java using the Java Topology Suite.
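
As a rough sketch of what an enclosing-region test boils down to, here is a minimal ray-casting point-in-polygon check (written in plain Python rather than SQL or Java; the region name and coordinates are invented for the example, and libraries like PostGIS and JTS add robustness, holes, and spatial indexing on top of this basic idea):

```python
def point_in_polygon(px, py, polygon):
    # Ray-casting test: count how many polygon edges a horizontal ray
    # from (px, py) crosses; an odd count means the point is inside.
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > py) != (y2 > py):  # edge straddles the ray's y level
            # x coordinate where the edge crosses the line y = py
            x_cross = x1 + (x2 - x1) * (py - y1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside

# Aggregate a point into whichever (hypothetical) regions contain it
regions = {"beat_0421": [(0, 0), (4, 0), (4, 4), (0, 4)]}
enclosing = [name for name, poly in regions.items()
             if point_in_polygon(1.5, 2.0, poly)]
```

In practice you would never hand-roll this for real data; the point of the post is that PostGIS and JTS give you a well-tested version of exactly this operation.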

Viewing Chicago crime in R

I’ve been using R to analyze Chicago crime data. It took quite a bit of tweaking to get the view I wanted, but I learned some things about plotting spatial data using ggplot2 that I wanted to share.

If you want to follow along, you will need to download the Chicago city boundary shapefile as well as the 2011 crime data. You’ll also need the maptools, rgeos and plyr packages, in addition to ggplot2.

Cooking classes, music lessons and Code Academy

Code Academy and Codecademy are getting a lot of attention these days. New York City Mayor Bloomberg recently jumped on the bandwagon as well.

I see the interest level as a great thing. In some ways, having basic coding skills is like having basic plumbing or electrical skills. There is a minimal competency level that will serve you well in life, especially as more of your life is captured digitally and made available electronically.

There is also a nice hobbyist vibe that I think is a logical evolution of things like cooking classes and music lessons. Today’s programming languages and development environments can make the learning process easy, not to mention support sites like GitHub and Stack Overflow that provide a rich source of recipes and resources for working through any failed experiments. You can just as easily share the results of your learning with others. Whipping up an animated birthday card using JavaScript and an HTML5 canvas element can be just as nice a gesture as cooking up paella for some friends.

But I’ve seen a number of posts from non-technical founders taking classes like these to try to bootstrap their ideas on their own. Expanding your vocabulary and knowledge never hurts; just don’t mistake superficial knowledge for talent.

Technical employees need knowledgeable and talented technical leaders. They need masters to help them learn their craft, not other novices. I can’t remember how many times Kitchen Nightmares showed a restaurant on the verge of failing because the owner had no clue how bad their own food was. Get a technical co-founder who not only has a discerning palate but also knows how to put ingredients together creatively and successfully. I know, the 37signals guys tell you to “Do it yourself first”, but that won’t get a non-technical founder very far.

By all means, take a class if you are in it for the fun of learning a new skill. Maybe you are contemplating a career change and think it can be more than a hobby - good for you! Even if you want to do it just to meet new people, great (although there are cheaper ways to do it). Maybe you hope to meet a co-founder that way. Just understand the other people taking the class generally aren’t going to fit the bill.

Software is a communication business. If you are non-technical, you still have a large responsibility in that communication process: making sure the vision and customer needs are known within the organization. But when that communication needs to wind all the way down to instructions on a processor, find someone skilled in doing that.

Correct autocorrelation in Objective-C

After some algorithm design in R, I was ready to implement the result on iOS using the Accelerate framework. The R version already gave me some performance challenges and the Accelerate version had its own twists.

The equivalent R code that I wanted to implement was

fast autocorrelation
acf.II <- function(x)
{
  n <- length(x)
  x <- c(x, rep.int(0, n))
  transform <- fft(x)
  x <- fft(Conj(transform) * transform, inverse = TRUE)
  Re(x)[1:n]/(2*n)
}
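
To make the identity in that snippet concrete, here is the same computation spelled out in plain Python. A naive O(n²) DFT stands in for a real FFT routine purely to keep the sketch self-contained, and the function names are my own:

```python
import cmath

def dft(x, inverse=False):
    # Naive O(n^2) DFT; stands in for R's fft() so the sketch needs no
    # libraries. Like R's fft(), the inverse is left unnormalized.
    n = len(x)
    sign = 1 if inverse else -1
    return [sum(x[j] * cmath.exp(sign * 2j * cmath.pi * j * k / n)
                for j in range(n))
            for k in range(n)]

def acf_II(x):
    # Mirror the R function: pad with n zeros, forward transform,
    # multiply by the conjugate, inverse transform, keep Re(...)[1:n]/(2n).
    n = len(x)
    padded = list(x) + [0.0] * n
    transform = dft(padded)
    power = [t.conjugate() * t for t in transform]
    inv = dft(power, inverse=True)
    return [v.real / (2 * n) for v in inv[:n]]
```

For x = (1, 2, 3) this gives 14, 8, 3 (up to floating-point error), matching the brute-force definition of the Type-II autocorrelation.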

At a high level, the Accelerate functions needed to reproduce this are roughly the following, but unit testing showed that this first version wasn’t generating the correct result.

simplified Accelerate code
float dsp_acf_II_fft_unscaled(const FFTSetup setup, float const* signal, DSPSplitComplex* temp, float* acf, vDSP_Length n)
{
  ...
  vDSP_vclr(...); // pad with 0
  vDSP_ctoz(...); // convert to correct representation for fft
  vDSP_fft_zrip(..., FFT_FORWARD); // compute transform
  vDSP_zvcmul(...); // compute Conj(transform) * transform
  vDSP_fft_zrip(..., FFT_INVERSE); // inverse transform
  vDSP_ztoc(...); // convert back to the desired output representation
  ...
}

Fast autocorrelation in R

I was doing some signal processing in R a few weeks ago and was somewhat disappointed in how long it was taking to run my experiments. Digging around a little, it seemed like the vast majority of time was being taken up in autocorrelation and cross-correlation computations.

In particular, I was computing a “Type-II” autocorrelation, defined as

r(lag) = Σ x[i] · x[i + lag], summing over i = 1, …, n − lag, for lag = 0, …, n − 1

The inefficient way to compute this directly in R is

brute force autocorrelation
acf.II <- function(signal)
{
  n <- length(signal)
  sapply(0:(n-1), function(lag) as.vector(signal[1:(n-lag)] %*% signal[(1+lag):n]))
}
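
As a point of comparison outside R, the same brute-force loop translates directly into plain Python (the function name just mirrors the R one):

```python
def acf_II(signal):
    # r[lag] = sum over i of signal[i] * signal[i + lag], for lag = 0..n-1
    n = len(signal)
    return [sum(signal[i] * signal[i + lag] for i in range(n - lag))
            for lag in range(n)]

acf_II([1, 2, 3])  # → [14, 8, 3]
```

Either way, this is O(n²) work, which is exactly why it was dominating my run times.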

But autocorrelation can be computed efficiently via FFT. You can get this efficient computation by calling the convolve function the right way.

autocorrelation using fft
acf.II <- function(signal)
{
  result <- convolve(signal,signal,type='open')
  result[-(1:(length(result) %/% 2))]
}

But this version was still taking a long time and I had to poke into the convolve implementation to finally figure out why.