behind the gist

thoughts on code, analysis and craft

Viewing Chicago crime in R

I’ve been using R to analyze Chicago crime data. It took me quite a while of tweaking to get the view I wanted, but I learned some things about plotting spatial data using ggplot2 that I wanted to share.

If you want to follow along, you will need to download the Chicago city boundary shapefile as well as the 2011 crime data. You’ll also need the maptools, rgeos and plyr packages, in addition to ggplot2.

Shapefiles

First, there are lots of R packages for handling spatial data and many different ways to read Shapefiles and plot them. In the end, I figured out how to get maptools and ggplot2 working together.

read shapefile
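  # readShapePoly (maptools) reads the boundary polygons
  city <- readShapePoly('chicago/City_Boundary.shp')
  # gSimplify (rgeos) thins the outline; fortify (ggplot2) turns it into a data frame
  city <- fortify(gSimplify(city, tol=100), region='OBJECTID')
  # rename (plyr) relabels long/lat as x/y
  city <- rename(city, c(long="x", lat="y"))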

We’re doing a few things to the data we read in. First, we simplify it using gSimplify. The original Shapefile from the city is very detailed and has over 12,000 points in it. For our simple plotting purposes, that is overkill. The coordinates are in feet, so we are giving a tolerance of 100 feet.
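To make the tolerance concrete, here is a minimal base R sketch of Douglas-Peucker simplification, the algorithm the rgeos documentation names for gSimplify. It is an illustration on a made-up open line, not the GEOS implementation, and the helper names point_segment_dist and simplify_line are invented for this example.

```r
# Distance from point p to the segment a-b (all numeric vectors of length 2).
point_segment_dist <- function(p, a, b) {
  ab <- b - a
  len2 <- sum(ab^2)
  if (len2 == 0) return(sqrt(sum((p - a)^2)))  # a and b coincide
  # Project p onto the segment and clamp the projection to its endpoints.
  t <- max(0, min(1, sum((p - a) * ab) / len2))
  sqrt(sum((p - (a + t * ab))^2))
}

# Recursive Douglas-Peucker: always keep the endpoints, and keep the farthest
# interior point only if it lies more than `tol` away from the chord.
simplify_line <- function(xy, tol) {
  n <- nrow(xy)
  if (n < 3) return(xy)
  d <- vapply(2:(n - 1),
              function(i) point_segment_dist(xy[i, ], xy[1, ], xy[n, ]),
              numeric(1))
  far <- which.max(d) + 1
  if (d[far - 1] <= tol) return(xy[c(1, n), , drop = FALSE])
  left  <- simplify_line(xy[1:far, , drop = FALSE], tol)
  right <- simplify_line(xy[far:n, , drop = FALSE], tol)
  rbind(left, right[-1, , drop = FALSE])  # drop the duplicated split point
}

# Toy line in feet: wiggles smaller than 100 ft around one real corner.
line <- cbind(x = c(0, 500, 1000, 1500, 2000, 2000, 2000),
              y = c(0,  40,    0,  -30,    0, 1000, 2000))
simplified <- simplify_line(line, tol = 100)
```

With a 100 foot tolerance the small wiggles disappear and only the two endpoints and the corner at (2000, 0) remain, which is the kind of reduction we want for a city outline drawn a few inches wide.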

Second, we are fortifying the polygon data so ggplot2 knows how to group the spatial data. The Chicago boundary actually has holes in it - areas enclosed by the city that are not part of the city proper. The Shapefile therefore has multiple paths in it. If you try to plot the data without grouping it, ggplot2 will draw ugly lines connecting all those paths. You also need fortify when you have multiple regions in your data and want to plot them with different aesthetics.
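Here is a small sketch of why the grouping matters, using two made-up squares rather than the real boundary. The rings data frame and its ring column are invented for the example; for real polygons, fortify supplies a comparable group column.

```r
library(ggplot2)

# Two separate closed squares stacked in one data frame, the way fortified
# polygon rings are. `ring` plays the role of fortify's `group` column.
square <- function(x0, y0, id) {
  data.frame(x = x0 + c(0, 1, 1, 0, 0),
             y = y0 + c(0, 0, 1, 1, 0),
             ring = id)
}
rings <- rbind(square(0, 0, "outer"), square(3, 3, "hole"))

# Without a group, geom_path joins the last point of one ring to the first
# point of the next, which draws a stray diagonal line between them.
ungrouped <- ggplot(rings, aes(x, y)) + geom_path() + coord_equal()

# With group = ring, each ring is drawn as its own path.
grouped <- ggplot(rings, aes(x, y, group = ring)) + geom_path() + coord_equal()
```

Printing ungrouped and then grouped side by side shows the connecting line appear and disappear.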

Finally, the Shapefile is in X/Y, not lat/lon, but fortify creates the names long and lat, so we change them to avoid confusion.

Data Prep

The data itself needs a little bit of cleaning. It would need quite a bit more for different types of plots, but we are only interested in plotting by crime type so that is what we’ll focus on.

read crime data
  crimes <- read.csv('chicago/Crimes_-_2011.csv', as.is = TRUE)
  # Normalize variant names of the same primary crime type
  crimes$Primary.Type[crimes$Primary.Type == 'INTERFERE WITH PUBLIC OFFICER'] <- 'INTERFERENCE WITH PUBLIC OFFICER'
  crimes$Primary.Type[crimes$Primary.Type == 'OFFENSES INVOLVING CHILDREN'] <- 'OFFENSE INVOLVING CHILDREN'

  # Order the types from most to least frequent and use that order as the factor levels
  primary_factors <- rev(arrange(count(crimes, 'Primary.Type'), freq)[,1])
  crimes$Primary.Type <- factor(crimes$Primary.Type, primary_factors, ordered=TRUE)

The first thing we do is normalize the names of the primary crime type category. Then we use some manipulation tools from plyr to order those by descending frequency of occurrence in the data. Creating the factor this way helps order our plots when we facet them later.
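The same frequency ordering can be sketched in base R on a toy vector, which makes it easy to see what the factor ends up holding. The types values below are made up, and table and sort stand in for plyr's count and arrange.

```r
# Toy stand-in for crimes$Primary.Type (values are made up).
types <- c("THEFT", "BATTERY", "THEFT", "NARCOTICS", "THEFT", "BATTERY")

# Count each type and sort with the most frequent first.
freq <- sort(table(types), decreasing = TRUE)

# Use that order as the factor levels, so level 1 is the most common type.
ranked <- factor(types, levels = names(freq), ordered = TRUE)

levels(ranked)      # "THEFT" "BATTERY" "NARCOTICS"
as.integer(ranked)  # 1 2 1 3 1 2

# Keeping only the top two types is then a comparison on the integer codes.
top_two <- types[as.integer(ranked) <= 2]
```

Because the integer code of each value is its frequency rank, selecting a range of ranks needs no lookup table, only a comparison.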

Plotting

Now we are ready to plot our data.

plot crime
plot_crime <- function(crimes, city, min_rank=1, max_rank=30, ncol=6)
{
  # Keep only crime types whose frequency rank falls between min_rank and max_rank
  ranks <- subset(crimes, as.integer(Primary.Type) >= min_rank & as.integer(Primary.Type) <= max_rank)
  ggplot(data=ranks, aes(X.Coordinate, Y.Coordinate)) +
    # City outline, one path per ring
    geom_path(data=city, aes(x, y, group=group), size=.2, colour='grey80') +
    geom_point(size=.2) +
    # One panel per crime type, in rank order
    facet_wrap(~ Primary.Type, ncol=ncol) +
    coord_equal(ratio=1) +
    no_grid_opts()
}

A few things to point out. First, the ordering we did of the Primary.Type factor now lets us look at a desired range of crime types by rank with subset and facet_wrap. Second, the crime data has both X/Y coordinates and lat/lon, but we are using X/Y to match the Shapefile coordinates. Finally, no_grid_opts isn’t shown, but makes our facet grid more readable.
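Since no_grid_opts isn't shown, the following is a hypothetical stand-in, not the original helper. It assumes a ggplot2 version that provides theme() and element_blank() (the opts() interface that the name hints at was deprecated in later releases), and the choice of which elements to blank is only a guess at what keeps small facets readable.

```r
library(ggplot2)

# Hypothetical replacement for the post's no_grid_opts(): hide grid lines,
# axis text, ticks and titles so each small facet shows only the map.
no_grid_opts <- function() {
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank())
}

# Quick check on a built-in data set rather than the crime data.
demo <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  facet_wrap(~ cyl) +
  no_grid_opts()
```

Returning the theme from a function keeps plot_crime unchanged: the result is added to the plot with + like any other layer.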

Now we can generate plots like the ones below. The spatial distributions are fascinating. Some types clearly have greater prevalence at intersections, bus or train stops, or along streets - prostitution being the most striking of these. Others, like narcotics and gambling, seem to form regional clusters. You can see that theft is very concentrated in the downtown area, but narcotics and burglary are not.

This just scratches the surface of what you can do with this data set, but I hope it is enough background for you to dig into the other parts as well.

The full gist is available here.