A beautiful article in the NYTimes contrasts abstract mathematics with the chilling reality of the Mexican drug cartel wars:
I was born in Mexico City, in a world that seems less and less familiar to me. I live now in the opposite corner of the continent. I am training to be a political scientist at Harvard. My passion has remained the afflictions of my homeland, but at Harvard I have found new ways to address them, to use mathematical models — matrices, vectors, equations, regressions — to understand the Mexican drug crisis.
The cartel wars are extremely violent, and the gangs are responsible for reprehensible kidnappings and deaths. The wars rank among the deadliest periods of organized crime in human history. The author’s goal isn’t to explain how she can analyze the wars from up in an ivory tower; it’s to describe how her mindset and toolkit inform her understanding of the world in any situation.
The article captured me because it never mentions what the author actually models. Instead, it presents her frightened thoughts and her efforts to calm herself by looking at the world through a mathematical lens. But it’s not what you think; there are no emotionally-distant mathematicians here. The author communicates her fascination with tying reality to abstract models, expecting and preempting the protest that reality is too complex and math too simple:
In this violent world, with the man in the blue Chevy whispering at me behind the window, math is my shield. Speaking up about drugs is in these parts a dangerous game. But not if you speak in the language of sigma and conditional expectations. Math protects me from the immediacy of the violence, and it protects me from them.
The beauty of my method lies in its simplicity. With mathematics I’m able to codify and simplify reality to make it manageable and, more important, malleable. I represent each possible individual as an equation in which each term symbolizes tastes, goals, profession and abilities. All people get portrayed: Policemen, politicians, citizens and drug cartels start living in this mathematical world as planes and hyperplanes and, as in real life, they interact and affect one another, sometimes colluding, sometimes colliding, sometimes neither.
I then use optimization to predict the form of interaction that will be the most probable to emerge and remain over time. Math starts speaking. It tells me, for example, under what conditions the outcome would be a drug war; when would the government prefer to cooperate with cartels; or when cruel intra-cartel purges will become the norm.
There is a part of every modeler’s mind which is constantly teasing out variables from constants. The statisticians among us may take a frequentist view, and wonder what would happen if a scene played itself out a million times; the programmers will deduce the underlying algorithms from the fuzzy result; the pure mathematicians will see manifolds everywhere:
In this abstract microcosmos, reality can be frozen or just slightly changed. I move and look at my hyperplanes from different angles. Let’s change the penalty code. No, let’s increase patrolling. Or reduce wages. Allow less contact between policemen and dealers. Assume the police force is corrupt. Assume it is not. I solve the equations and there it is. My answers come as Greek letters and probabilities.
But we all admit:
I know, I know, this is weird.
Ultimately, “free will” becomes the clarion of the independent. At least, it’s the best response to this explanation:
It may seem strange to examine this shadowy world with equations. But mathematics is transforming the social sciences. In the same way that physicists can predict the movement of atoms in space, we can use mathematics to model how individuals and groups will make decisions and interact in a society.
But free will has a (somewhat tentative) analogue in Heisenberg’s uncertainty principle, and with that philosophy and math (or theology and physics) are combined — but there’s been plenty of pop-sci written on that topic.
I found this brief article remarkable in how it was able to demonstrate the overlay of mathematical thought on an extremely “human” subject without ever needing to explain either one.
(Via Drew Conway)
December 23, 2009 in Data
Walmart is running ads right now which claim that shoppers who spend more than $100 per week at the supermarket would save $650 a year by purchasing their groceries at the giant retailer instead.
That’s quite a jumble of conditionals and varying metrics: you have to first meet the requirements of shopping at a supermarket and spending over $100 per week; the savings are then presented in a completely different timeframe of one year. That works out to $12.50 a week, or a still sizable 12.5% discount.
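The conversion is easy to check. Here is a minimal Python sketch of the arithmetic; the $100 and $650 figures come from the ad, while the 52-week year is my assumption.

```python
# Figures quoted in the ad; the 52-week year is an assumption.
WEEKLY_SPEND = 100.00     # dollars per week (the ad's stated minimum)
ANNUAL_SAVINGS = 650.00   # dollars per year (the ad's claim)
WEEKS_PER_YEAR = 52

# Put both numbers on the same timeframe before comparing them.
weekly_savings = ANNUAL_SAVINGS / WEEKS_PER_YEAR
annual_spend = WEEKLY_SPEND * WEEKS_PER_YEAR
discount_rate = ANNUAL_SAVINGS / annual_spend

print(f"Weekly savings: ${weekly_savings:.2f}")
print(f"Annual spend:   ${annual_spend:.2f}")
print(f"Discount rate:  {discount_rate:.1%}")
```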
Why not present it as 12.5%? The simple answer is that “$650” is a substantial figure, and the marketing folks wanted to make people feel like they were saving more; conversely, $5200 a year on groceries sounds like a lot – better restate that as $100 per week. Depressingly, it occurs to me that many Americans may not know what to do with percentages.
Another key point is found in the wording of the ad – why target shoppers who spend more than $100 a week? If Walmart’s prices are really lower, then all shoppers should reap a benefit, not just the high rollers. Since I do not think Walmart is price discriminating (offering discounts only to people spending more than $100), I have to conclude that they restricted their dataset to increase the dollar value of the average person’s savings. If every shopper saved 12.5%, then the average annual dollar savings per person might be, say, $250. But if we consider only people who spend more than $100, the average dollar savings jumps to $650 even though the percent savings remains 12.5%. I would guess that $100 was chosen as a cutoff because a) it’s a round, friendly number which b) creates a relatively high average dollar savings while c) remaining low enough to be in reach of many American families. This, of course, is further evidence that Americans don’t understand percentages well (or at least, that marketers think they can fool us by avoiding them).
Note also that all of my calculations use the stated minimum figure of $100 vs the average figure of $650 to get the 12.5% discount. That’s not a real discount – someone spending $100 wouldn’t get $650 in savings, as that is the average of all the people spending more than $100. That person would realize a smaller dollar savings, and the real discount rate must therefore be less than 12.5%.
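To make both effects concrete, here is a small Python sketch with ten made-up shoppers and an assumed flat 10% discount. None of these numbers come from Walmart; they only illustrate the mechanics. Restricting the sample to big spenders raises the average dollar savings, and dividing that average by the minimum spend overstates the discount rate.

```python
# Hypothetical weekly grocery spends for ten shoppers (made-up numbers).
weekly_spends = [40, 55, 60, 75, 90, 110, 130, 150, 180, 220]
DISCOUNT = 0.10        # assumed flat discount, identical for every shopper
CUTOFF = 100           # the ad's "more than $100 per week" threshold
WEEKS_PER_YEAR = 52

def average_annual_savings(spends):
    """Mean yearly dollar savings at the flat DISCOUNT rate."""
    return sum(s * DISCOUNT * WEEKS_PER_YEAR for s in spends) / len(spends)

everyone = average_annual_savings(weekly_spends)
big_spenders = average_annual_savings([s for s in weekly_spends if s > CUTOFF])

# Dividing the big spenders' average savings by the *minimum* spend
# overstates the rate, because their average spend sits above the cutoff.
implied_rate = big_spenders / (CUTOFF * WEEKS_PER_YEAR)

print(f"Average savings, all shoppers:  ${everyone:.2f}")
print(f"Average savings, over cutoff:   ${big_spenders:.2f}")
print(f"Implied rate from minimum spend: {implied_rate:.1%} (true rate {DISCOUNT:.0%})")
```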
We all know that you can get some funny/interesting responses by typing the first part of a question into a major search engine’s search box and letting it suggest the remainder. The NYT has gone so far as to investigate those suggestions themselves. I particularly enjoyed their description of search engines as “modern confessionals:”
This labor-saving device — part fortuneteller, part shrink? — has opened a window into our collective soul. With millions of people pouring their hearts into this modern-day confessional, we get a direct, if mysterious, glimpse into the heads of our fellow Web surfers.
And some nice visualizations of the questions people are asking don’t hurt, either:
I’d love to see an interactive tool for creating these diagrams.
November 27, 2009 in Data
The NYT has published an infographic showing the top recipe searches on Allrecipes.com. Searches are broken out by state, allowing some interesting comparisons. (Local dialects and preferences are an interest of mine, and when combined with maps I can’t resist… see also various words for soda.)
Here’s the chart for “apple pie”, the 5th most popular search. Purple states had above-average search volume; orange states were below:
It’s not a particularly even distribution – and sent me looking for a Thanksgiving dish that was more uniformly enjoyed by all Americans. Unsurprisingly, that turned out to be “turkey,” the 14th most popular search. Its graphic was a blend of muted purples and oranges, scattered across the map without any obvious regional pattern:
From there, I went searching for hyperlocal dishes or specialties. This would be much easier with the raw data, as a simple statistical test for dispersion and geographic correlation would toss up the winners – but it’s a testament to the NYT’s excellent graphics team that their visual maps serve the purpose just as well.
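As a rough idea of what the dispersion half of such a test could look like, here is a Python sketch that ranks dishes by the coefficient of variation of their state-level search index. The index values are invented, since the raw data isn’t available, and a real analysis would also need a spatial statistic to capture the geographic clustering.

```python
from statistics import mean, pstdev

# Invented search indices (100 = national average) for a handful of states.
search_index = {
    "turkey": {"GA": 98, "OH": 103, "CA": 101, "MA": 97, "TX": 102},
    "sweet potato pie": {"GA": 180, "OH": 85, "CA": 60, "MA": 55, "TX": 120},
}

def coefficient_of_variation(values):
    """Standard deviation relative to the mean; higher means more uneven."""
    values = list(values)
    return pstdev(values) / mean(values)

dispersion = {
    dish: coefficient_of_variation(by_state.values())
    for dish, by_state in search_index.items()
}

# Most regionally lopsided dishes first.
ranked = sorted(dispersion, key=dispersion.get, reverse=True)
for dish in ranked:
    print(f"{dish}: {dispersion[dish]:.2f}")
```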
First up, sweet potatoes. The #1 search in the country was “sweet potato casserole,” with most of the searches concentrated in the southeast:

Clocking in at #15 was “sweet potato pie,” another southeast favorite, and an even stronger one:

Interestingly, though, sweet potatoes themselves formed a pretty uniform search pattern across the states – and, after turkey, get my vote for “most American dish”:

The dataset reveals two interesting facts about sweet potatoes. First, some people don’t spell too good:

Second, there’s a vocabulary difference, as many people out west prefer to call their sweet potatoes “yams” (I can’t back that up empirically, as they might want actual yams, but there is enough of a difference in dialect that many “yams” sold in the United States are required to state that they are also sweet potatoes on their packaging):

Moving on from those delicious root vegetables to another family, corn, reveals further geographic breakdowns. Here’s Midwestern favorite #18, corn casserole:
#27: corn pudding, popular in the mid-Atlantic… and Alaska:
and #31 cornbread dressing in the south:

Meanwhile, New England likes its butternut squash:

By this point, you’re better off clicking through the actual graphic than staring at my reprints… I hope that all of TGR’s American readers had a happy Thanksgiving, regardless of what was on the table.
For a while, I’ve been following development of Indiemapper, a forthcoming web tool from the folks at Axis Maps. It should allow for easy map creation, including – yes – choropleths galore. However, the data analytics that will be available remain to be seen.
It’s been a while since I posted a video for the futurist set, so here we go: (This one is a commercial production for Freeband, heavy on the infographics and benefits of smart networking with a pinch of cheesiness. Sign me up.)
http://www.vimeo.com/7459305
(via Datavisualization.ch)
This morning, I was excited to see two of my interests collide as Nathan from FlowingData posted a tutorial for creating a choropleth: a map that uses color to convey values (I didn’t know that’s what they’re called either). He used county-level unemployment statistics to generate the following image:

However, the process appears quite intense, involving some Python scripts and mucking around inside an SVG file. I half-heartedly wondered if there wasn’t a simpler way to create the image. And just then, along came David from Revolutions to throw down the gauntlet: could anyone come up with a way to replicate Nathan’s map in R?
David’s post pointed me toward R’s maps package, and off I went to start downloading the tools…
It took some time to coerce the BLS data into a compatible form; R doesn’t understand the FIPS county identifiers, so I had to jump through some hoops to get the strings to match (BLS uses state abbreviations; R wants full names. BLS puts in the words “county”, “parish” or “borough”, R doesn’t expect those to be passed. The BLS has a “Miami-Dade” county in Florida; R recognizes only “Dade”. Etc.) Ultimately, I used the following code to format the strings:
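To illustrate the kind of cleanup involved, here is a minimal sketch written in Python rather than R. It is not the code from that evening: the function name, the partial state table, the “Autauga County, AL” input shape and the lowercase “state,county” output shape are all assumptions made for the example.

```python
import re

# Partial lookup table for illustration; a real one would cover every state.
STATE_NAMES = {"AL": "alabama", "AK": "alaska", "FL": "florida", "LA": "louisiana"}

# Hand-maintained exceptions, keyed by the already-normalized name.
SPECIAL_CASES = {"florida,miami-dade": "florida,dade"}

# Trailing words the mapping side does not expect.
SUFFIX = re.compile(r"\s+(county|parish|borough)$", re.IGNORECASE)

def normalize_county(bls_name):
    """Turn e.g. 'Autauga County, AL' into 'alabama,autauga' (hypothetical formats)."""
    county, state_abbr = (part.strip() for part in bls_name.rsplit(",", 1))
    county = SUFFIX.sub("", county).lower()
    key = f"{STATE_NAMES[state_abbr.upper()]},{county}"
    return SPECIAL_CASES.get(key, key)

for name in ["Autauga County, AL", "Orleans Parish, LA", "Miami-Dade County, FL"]:
    print(name, "->", normalize_county(name))
```

Keeping the one-off exceptions in a separate table makes it easier to see which counties still fail to match.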
With the data in the correct format, I aligned a color vector with R’s list of counties and plotted the result:
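The alignment step is the easy one to get wrong: the colors have to follow the map’s county order, not the data’s. A minimal Python sketch of the idea is below. The bucket edges, palette and gray fallback are arbitrary choices for illustration, not the values behind the image.

```python
from bisect import bisect_right

# Assumed bucket edges (percent unemployed) and a light-to-dark palette;
# there is one more color than there are edges.
BREAKS = [2, 4, 6, 8, 10]
PALETTE = ["#F1EEF6", "#D4B9DA", "#C994C7", "#DF65B0", "#DD1C77", "#980043"]
MISSING = "#CCCCCC"  # fallback for counties with no matching data row

def color_for(rate):
    """Pick the palette entry for the bucket that contains this rate."""
    return PALETTE[bisect_right(BREAKS, rate)]

def color_vector(map_counties, rates):
    """Return one color per county, in the map's own order."""
    return [color_for(rates[c]) if c in rates else MISSING for c in map_counties]

# Hypothetical inputs: the map's county order and a name-to-rate lookup.
map_counties = ["alabama,autauga", "florida,dade", "alaska,unmatched"]
rates = {"florida,dade": 11.0, "alabama,autauga": 3.0}
print(color_vector(map_counties, rates))
```

Counties that fall through to the fallback color are the ones whose names never matched, which is a quick way to spot the stragglers.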
It came out like this:

Not too bad, I think. It’s a little rough around the edges and a couple of counties are missing – I assume they are the ones with odd naming conventions (you’ll notice I manually adjusted Miami-Dade in my code). Also, I’m not sure how to bring Hawaii and Alaska into the picture. Moreover, the image doesn’t look too good in R itself. For example, I had given up on getting the county borders to show up as faint lines (I could only get them to be completely opaque) – imagine my surprise when I exported the chart and could see the borders just fine!
In any case, I wasn’t satisfied with this result. I’ve been experimenting with ggplot2 and remembered it had some mapping functions, so off I went to recreate the image with yet another library. Ggplot2 is an excellent general-purpose graphics library; the maps package feels positively last-gen after playing with ggplot2. It’s much more extensible and has many more parameters to experiment with – hard to believe it’s not the standard graphics package that ships with R (which itself is another last-gen experience).
Anyway, I kept the data formatted as above – which may have added an extra line or two to the ggplot2 code, but makes it simpler to jump back and forth – and used the following script to draw a new version of the map:
And the resulting image:

Again, a couple drawbacks: Alaska and Hawaii are nowhere to be seen and the borders are slightly aliased. The aliasing does make a difference, especially when compared to the maps output, but the ease with which I put together the latter graph and the frustration I experienced with the maps package, in my mind, more than erase that perceived shortcoming.
On the whole, I’d still take Nathan’s map over these as a finished product. However, I don’t think R can be beat for ease of use and all-in-one packageability – if I wanted, I could run regressions on the data, overlay my chart with more colors or new metrics, explode out certain counties or states… the possibilities are endless. With just a couple lines of code, I could overlay states that voted for Obama in blue, or highlight counties starting with the letter “C”. The static SVG method doesn’t allow any of that flexibility. Also, I’m completely confident that if I had any experience with these mapping packages – rather than using them for the first time tonight – I could mimic Nathan’s image perfectly.
The ggplot2 package, in particular, is fantastically powerful. I really wish I had discovered it sooner. As a matter of fact, Josh Reich runs a monthly R meetup for R users in the New York area and the next topic happens to be ggplot2 – it’ll be my first time attending, so I can’t really say what to expect, but I’m definitely looking forward to it.
Ben Fry has created a stunning image consisting of the 26 million roads in the United States (click to zoom):

Nothing other than asphalt (gravel, dirt…) has been drawn here, but geographic and political features emerge nonetheless. In a very real sense, the geography is a latent feature of the roads dataset, as it creates boundary conditions for the observable effect (that being the roads themselves). In other words, we see mountains, rivers, oceans, and the Canadian border because they are defined by contiguous regions without any streets.
Please see Ben’s project page for more information.
Inferred ratings and modelling teacher comments
June 24, 2009
Another aspect of my conversation dealt with inferred ratings, a problem I’ve come across before in other areas. There are two primary cases in which this arises: censored data and self-selection bias.
In the first case of censored data, a problem is caused by the ratings system not eliciting useful responses. An example is a system in [...]