Choropleths in R (yes, "choropleths")

November 12, 2009 in Data,Math

This morning, I was excited to see two of my interests collide as Nathan from FlowingData posted a tutorial for creating a choropleth: a map that uses color to convey values (I didn't know that's what they're called either). He used county-level unemployment statistics to generate the following image:

However, the process appears quite intense, involving some python scripts and mucking around inside an SVG file. I half-heartedly wondered if there wasn't a simpler way to create the image. And just then, along came David from Revolutions to throw down the gauntlet: could anyone come up with a way to replicate Nathan's map in R?

David's post pointed me toward R's maps package, and off I went to start downloading the tools...

It took some time to coerce the BLS data into a compatible form; R don't understand the FIPS county identifiers, so I had to jump through some hoops to get the strings to match (BLS uses state abbreviations; R wants full names. BLS puts in the words "county", "parish" or "borough", R doesn't expect those to be passed. The BLS has a "Miami-Dade" county in Florida; R recognizes only "Dade". Etc.) Ultimately, I used the following code to format the strings:

#load data
stateAbbr rownames(stateAbbr)unemp_data

#get county names in correct format
countyNames counties statesstates

#concatenate states and counties
unemp_data$counties

#parse out county titles & specifics
unemp_data$counties unemp_data$counties unemp_data$countiesunemp_data$counties

#define color buckets
colors = c("#F1EEF6", "#D4B9DA", "#C994C7", "#DF65B0", "#DD1C77", "#980043")
unemp_data$colorBuckets 

With the data in the correct format, I aligned a color vector with R's list of counties and plotted the result:

library(maps)

#align data with map definitions
mapnamescolorsmatched

#draw map
map("county",col = colors[unemp_data$colorBuckets[match(mapnames ,unemp_data$counties)]],
    fill = TRUE,resolution = 0,lty = 0,projection = "polyconic")
map("state",col = "white",fill=FALSE,add=TRUE,lty=1,lwd=1,projection="polyconic")

It came out like this:

maps package result

Not too bad, I think. It's a little rough around the edges and a couple of counties are missing - I assume they are the ones with odd naming conventions (you'll notice I manually adjusted Miami-Dade in my code). Also, I'm not sure how to bring Hawaii and Alaska into the picture. Moreover, the image doesn't look too good in R itself. For example, I had given up on getting the county borders to show up as faint lines (I could only get them to be completely opaque) - imagine my surprise when I exported the chart and could see the borders just fine!

In any case, I wasn't satisfied with this result. I've been experimenting with ggplot2 and remembered it had some mapping functions, so off I went to recreate the image with yet another library. Ggplot2 is an excellent general-purpose graphics library; the maps package feels positively last-gen after playing with ggplot2. It's much more extensible and has many more parameters to experiment with - hard to believe it's not the standard graphics package that ships with R (which itself is another last-gen experience).

Anyway, I kept the data formatted as above - which may have added an extra line or two to the ggplot2 code, but makes it simpler to jump back and forth - and used the following script to draw a new version of the map:

library(ggplot2)

#extract reference data
mapcounties <- map_data("county")
mapstates <- map_data("state")

#merge data with ggplot county coordinates
mapcounties$county <- with(mapcounties , paste(region, subregion, sep = ","))
mergedata <- merge(mapcounties, unemp_data, by.x = "county", by.y = "counties")
mergedata <- mergedata[order(mergedata$order),]

#draw map
map <- ggplot(mergedata, aes(long,lat,group=group)) + geom_polygon(aes(fill=colorBuckets))
map <- map + scale_fill_brewer(palette="PuRd") +
    coord_map(project="globular") +
    opts(legend.position = "none")

#add state borders
map <- map + geom_path(data = mapstates, colour = "white", size = .75)

#add county borders
map <- map + geom_path(data = mapcounties, colour = "white", size = .5, alpha = .1)
map

And the resulting image:


ggplot2 package result

Again, a couple drawbacks: Alaska and Hawaii are nowhere to be seen and the borders are slightly aliased. The aliasing does make a difference, especially when compared to the maps output, but the ease with which I put together the latter graph and the frustration I experienced with the maps package, in my mind, more than erase that perceived shortcoming.

On the whole, I'd still take Nathan's map over these as a finished product. However, I don't think R can be beat for ease of use and all-in-one packageability - if I wanted, I could run regressions on the data, overlay my chart with more colors or new metrics, explode out certain counties or states... the possibilities are endless. With just a couple lines of code, I could overlay states the voted for Obama in blue, or highlight counties starting with the letter "C". The static SVG method doesn't allow any of that flexibility. Also, I'm completely confident that if I had any experience with these mapping packages - rather than using them for the first time tonight - I could mimic Nathan's image perfectly.

The ggplot2 package, in particular, is fantastically powerful. I really wish I had discovered it sooner. As a matter of fact, Josh Reich runs a monthly R meetup for R users in the New York area and the next topic happens to be ggplot2 - it'll be my first time attending, so I can't really say what to expect, but I'm definitely looking forward to it.

{ 17 comments… read them below or add one }

Hadley November 12, 2009 at 11:21 pm

I think the maps package does some thinning (generalisation) of the data which results in a slightly nicer appearance. I think if you look at the tip of Michigan, you can see a small difference between the state and county boundaries when you do this. I have some draft code to do this for general polygons, but it’s very hard to do it in a topologically consistent manner.

Nice plots :)

Reply

J November 13, 2009 at 12:23 am

Hadley – I was about to say thanks, because through you I discovered the embeddable code templates that I used here, but I just realized that I owe you much more because ggplot2 is actually yours! Knowing that, I’m not sure a simple thanks would do – it really is a phenomenal package (and I think I’m going to become its newest evangelist).

Reply

Joe November 14, 2009 at 8:20 am

I tried doing this a few months ago in python, got lazy and found this
http://www.sacmeq.org/statplanet/
Its all done in excel which makes it easy. Gives you a time scale on the bottom too.

Reply

Clem June 8, 2010 at 8:23 am

Hi,
I am trying to use your code to do a similar map of the US by county of the mortality rate caused by colon cancer. I am not a pro at R therefore i have a few question about your code.
how or where did you get the file “state abbr.csv” ?
And could briefly explain this line of code not sure i understand what “unemp” is

unemp_data$colorBuckets <- as.factor(as.numeric(cut(unemp_data$unemp,c(0,2,4,6,8,10,100))))

Thank you and great post extremely useful

Reply

J June 9, 2010 at 9:53 am

Hi Clem –
Sorry for the confusion. I guess my code isn’t quite as self-contained as I had thought.

“state abbr.csv” is a simple file I created to provide a mapping between state names and state abbreviations — as I recall, the data format and the necessary format in R were not the same. The file looked like this: New Jersey | NJ || New Mexico | NM || New York | NY || … etc. A very simple translation from one set of data into another.

“unemp_data” is my [unemp]loyment [data]set. To have R do any useful plotting, the plotted object needs to contain all plotted attributes — it’s very hard to overlay multiple objects intelligently. So, my main dataset needs to contain not just the data to be plotted, but extraneous items like the colors I want to plot in as well (note there are ways around this if you are willing to let R pick colors on your behalf). In this specific line, I am using the cut function to define “breaks” in the unemployment data of 2% width, capped at 100% to fill out the last bucket. So a level of 3% would be in one bucket and a level of 6% would be in another bucket. I convert those buckets to numbers (again, I don’t remember exactly why but most likely just as a data integrity step) and then to factors, which tells R to consider each value a distinct “level”. In other words, it won’t think of 1,2,3,4 as a continuous spectrum in which “2″ is near “3″ but far from “4″; rather, it will think of each number as a “level” and there is no relationship between “2″ and “3″ that implies greater similarity than between “2″ and “4″. Finally, I store my color factors in the dataset’s colorBuckets column. Later, I tell the plotting functions to give each of those buckets a different color, which I also specify.

Reply

Tim November 24, 2013 at 2:59 pm

Where are you seeing that line of code? I’m trying adapt this project however it seems like only excerpts of the code show up in the post. Example where does “mapnames” come from, what names are you assigning the variables in the unemployment dataset, thanks

Reply

Matthew July 16, 2010 at 3:20 am

Hi,
With respect to adding Hawaii and Alaska: Why not transform the latittude and and longitude values describing their boundaries so they are situated south of the mainland states as per Nathan’s map?
Nice work by the way.
Cheers

Matt Legge

Reply

jonmcrawford October 2, 2010 at 8:29 am

I was wondering if you could start with this type of data, then transform the plots into the persp() package to create a prism map? (intensity would not be just by color, but then elevation of the county in the image, with a consistent z-value for the entire county)

Not sure if that would work, but I was going to play around with that.

Reply

J October 3, 2010 at 11:56 am

Hi Jon — I’m pretty sure there is a way to export the contours of the map as a series of lat/long coordinates (though I can’t say off the bat how). You could pair that data with a z-variable to create the 3D graph. I would be remiss if I didn’t caution that 3D charts are frequently more about eye-candy and less about information; make sure that the move to 3D actually imparts greater meaning in addition to looking cool.

Please let us know if it works!

Reply

May February 15, 2013 at 12:55 pm

Just wondering if you ever found a way to create a prism map in R using persp()? I’ve been trying to do something similar, i.e. create a world map in which total mortality is represented as the elevation of a country, and a risk factor level (such as total cholesterol) is shown by intensity of color. It’s true that prism maps have their shortcomings and sometimes are just eye candy, but I’m interested in seeing how I could use this highlight some of the interesting relationships in my dataset.

Thanks for any suggestions!

Reply

thomas February 18, 2011 at 8:16 pm

hi,
i tried your code with my own data, but only on state level. Code works well and is truly easy but for whatever reason, some white area is visible in certain states. if you look at california …. http://img39.imageshack.us/i/65881540.jpg/
any idea what causes that? appreciated ;)

Reply

thomas February 19, 2011 at 9:01 am

found it. came from me forgetting to order the code, this line:
mergedata <- mergedata[order(mergedata$order),]

although i don't understand why not running it would lead to white gaps?

Reply

Miriam February 23, 2011 at 12:45 am

Hello!

I am trying to replicate your awesome graph, and I think I have almost everything working except this one line:

‘states states[1]
Autauga County, AL
“AL”
)

Thank You!

Reply

Miriam February 23, 2011 at 12:50 am

Sorry! For some reason, part of my post was chopped.
The line I am stumbling over is:

“states <- as.character(stateAbbr[states,2])"

from the previous line I obtain for "states" something like this (this is the first entry):

'states[1]
Autauga County, AL
“AL”'

What does "stateAbbr[states,2]" do, then?

Thank you so much!

Reply

Damian March 10, 2011 at 8:28 pm

I’m having trouble with this line, as well. It returns the following error:
Error in `[.default`(xj, i) : invalid subscript type ‘list’

Any tips?

Reply

Markus Jenning July 20, 2013 at 4:09 pm

Hey J,

first of all, thanks for all the effort you put into this.
I planned to work through your code and then give it a try using it my own data, but I happen not to understand the syntax in the first lines, e.g.:
stateAbbr rownames(stateAbbr)unemp_data
What is it that you’re trying to do there and why is there a space between stateAbbr and rownames(stateAbbr)unemp_data?
Sorry if this is too much of a newbie question, but I’d appreciate any support that would give me a better understanding of the R language.
Also, in the time that has passed since the challenge, have you developped your project any further? I’m planning to work with maps in a project soon, but I have not yet decided which package to use.

Cheers,
Markus

Reply

Lanora August 28, 2014 at 12:14 am

Greetings! I know this is kind of off topic
but I was wondering if you knew where I could locate a captcha plugin for my comment form?
I’m using the same blog platform as yours and I’m having trouble finding one?
Thanks a lot!

Reply

Leave a Comment

{ 7 trackbacks }

Previous post:

Next post: