Posts tagged as:

Data

The mathematician’s lens

January 25, 2010 in Data, Math

A beautiful article in the NYTimes contrasts abstract mathematics with the chilling reality of the Mexican drug cartel wars:

I was born in Mexico City, in a world that seems less and less familiar to me. I live now in the opposite corner of the continent. I am training to be a political scientist at Harvard. My passion has remained the afflictions of my homeland, but at Harvard I have found new ways to address them, to use mathematical models — matrices, vectors, equations, regressions — to understand the Mexican drug crisis.

The cartel wars are extremely violent, and the gangs are responsible for reprehensible kidnappings and deaths. They rank among the most deadly periods of organized crime in human history. The author’s goal isn’t to explain how she can analyze the wars from up in an ivory tower; it’s to describe how her mindset and toolkit inform her understanding of the world in any situation.

The article captured me because it never mentions what the author actually models. Instead, it presents her frightened thoughts and her efforts to calm herself by looking at the world through a mathematical lens. But it’s not what you think; there are no emotionally-distant mathematicians here. The author communicates her fascination with tying reality to abstract models, expecting and preempting the protest that reality is too complex and math too simple:

In this violent world, with the man in the blue Chevy whispering at me behind the window, math is my shield. Speaking up about drugs is in these parts a dangerous game. But not if you speak in the language of sigma and conditional expectations. Math protects me from the immediacy of the violence, and it protects me from them.

The beauty of my method lies in its simplicity. With mathematics I’m able to codify and simplify reality to make it manageable and, more important, malleable. I represent each possible individual as an equation in which each term symbolizes tastes, goals, profession and abilities. All people get portrayed: Policemen, politicians, citizens and drug cartels start living in this mathematical world as planes and hyperplanes and, as in real life, they interact and affect one another, sometimes colluding, sometimes colliding, sometimes neither.

I then use optimization to predict the form of interaction that will be the most probable to emerge and remain over time. Math starts speaking. It tells me, for example, under what conditions the outcome would be a drug war; when would the government prefer to cooperate with cartels; or when cruel intra-cartel purges will become the norm.

There is a part of every modeler’s mind which is constantly teasing out variables from constants. The statisticians among us may take a frequentist view, and wonder what would happen if a scene played itself out a million times; the programmers will deduce the underlying algorithms from the fuzzy result; the pure mathematicians will see manifolds everywhere:

In this abstract microcosmos, reality can be frozen or just slightly changed. I move and look at my hyperplanes from different angles. Let’s change the penalty code. No, let’s increase patrolling. Or reduce wages. Allow less contact between policemen and dealers. Assume the police force is corrupt. Assume it is not. I solve the equations and there it is. My answers come as Greek letters and probabilities.

But we all admit:

I know, I know, this is weird.

Ultimately, “free will” becomes the clarion of the independent. At least, it’s the best response to this explanation:

It may seem strange to examine this shadowy world with equations. But mathematics is transforming the social sciences. In the same way that physicists can predict the movement of atoms in space, we can use mathematics to model how individuals and groups will make decisions and interact in a society.

But free will has a (somewhat tentative) analogue in Heisenberg’s uncertainty principle, and with that philosophy and math (or theology and physics) are combined — but there’s been plenty of pop-sci written on that topic.

I found this brief article remarkable in how it was able to demonstrate the overlay of mathematical thought on an extremely “human” subject without ever needing to explain either one.

(Via Drew Conway)

{ 0 comments }

Data wars

January 11, 2010 in Data

The NYT writes about the military’s data problem:

Air Force drones collected nearly three times as much video over Afghanistan and Iraq last year as in 2007 — about 24 years’ worth if watched continuously. That volume is expected to multiply in the coming years as drones are added to the fleet and as some start using multiple cameras to shoot in many directions.

A very interesting read for the dataheads among us. The comparison to football broadcasts also caught my eye – televised sports are so frequently compared to battles and war, and here we see the military coming to the broadcasters for advice:

But while the biggest timesaver would be to automatically scan the video for trucks and armed men, that software is not yet reliable. And the military has run into the same problem that the broadcast industry has in trying to pick out football players swarming on a tackle.

So Cmdr. Joseph A. Smith, a Navy officer assigned to the National Geospatial-Intelligence Agency, which sets standards for video intelligence, said he and other officials had climbed into broadcast trucks outside football stadiums to learn how the networks tagged and retrieved highlight film.

{ 0 comments }

Walmart ad math

December 23, 2009 in Data

Walmart is running ads right now which claim that shoppers who spend more than $100 per week at the supermarket would save $650 a year by purchasing their groceries at the giant retailer instead.

That’s quite a jumble of conditionals and varying metrics: you have to first meet the requirements of shopping at a supermarket and spending over $100 per week; the savings are then presented in a completely different timeframe of one year. That works out to $12.50 a week, or a still sizable 12.5% discount.
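For anyone who wants to check the arithmetic, here is a minimal sketch using only the figures quoted from the ad; the variable names are mine, and the 52-week year is an assumption:

```python
# Figures quoted in the ad, as described above.
annual_savings = 650.0   # dollars saved per year, per the ad
weekly_spend = 100.0     # dollars per week (the ad's stated minimum)
weeks_per_year = 52      # assumption: a plain 52-week year

# Restate everything on a common footing.
weekly_savings = annual_savings / weeks_per_year
annual_spend = weekly_spend * weeks_per_year
implied_discount = annual_savings / annual_spend

print(f"Weekly savings:   ${weekly_savings:.2f}")
print(f"Annual spend:     ${annual_spend:.0f}")
print(f"Implied discount: {implied_discount:.1%}")
```

Putting both numbers on the same timeframe is the whole trick: once spend and savings are both weekly (or both annual), the percentage falls out directly.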

Why not present it as 12.5%? The simple answer is that “$650” is a substantial figure, and the marketing folks wanted to make people feel like they were saving more; conversely, $5200 a year on groceries sounds like a lot – better restate that as $100 per week. Depressingly, it occurs to me that many Americans may not know what to do with percentages.

Another key point is found in the wording of the ad – why target shoppers who spend more than $100 a week? If Walmart’s prices are really lower, then all shoppers should reap a benefit, not just the high rollers. Since I do not think Walmart is price discriminating (offering discounts only to people spending more than $100 a week), I have to conclude that they restricted their dataset to increase the dollar value of the average person’s savings. If every shopper saved 12.5%, then the average annual dollar savings per person might be, say, $250. But if we consider only people who spend more than $100 a week, the average dollar savings jumps to $650 even though the percent savings remains 12.5%. I would guess that $100 was chosen as a cutoff because a) it’s a round, friendly number which b) creates a relatively high average dollar savings while c) remaining low enough to be in reach of many American families. This, of course, is further evidence that Americans don’t understand percentages well (or at least, that marketers think they can fool us by avoiding them).

Note also that all of my calculations use the stated minimum figure of $100 vs the average figure of $650 to get the 12.5% discount. That’s not a real discount – someone spending $100 wouldn’t get $650 in savings, as that is the average of all the people spending more than $100. That person would realize a smaller dollar savings, and the real discount rate must therefore be less than 12.5%.
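To make the selection effect concrete, here is a rough sketch with made-up numbers. The weekly bills and the flat 8% discount are hypothetical, chosen only for illustration; nothing here comes from Walmart’s data.

```python
WEEKS_PER_YEAR = 52

def average_annual_savings(weekly_spends, discount_rate):
    """Mean yearly dollar savings for a group of shoppers at one flat discount rate."""
    total = sum(spend * WEEKS_PER_YEAR * discount_rate for spend in weekly_spends)
    return total / len(weekly_spends)

# Hypothetical weekly grocery bills (invented for illustration).
weekly_spends = [40, 60, 80, 110, 150, 200]
true_rate = 0.08  # hypothetical flat discount enjoyed by every shopper

# Average over everyone versus only the shoppers above the ad's $100 cutoff.
everyone = average_annual_savings(weekly_spends, true_rate)
big_spenders = [spend for spend in weekly_spends if spend > 100]
over_100 = average_annual_savings(big_spenders, true_rate)

# Backing the rate out of the $100 floor, as in the post, overstates it,
# because the average belongs to people who spend more than the floor.
naive_rate = over_100 / (100 * WEEKS_PER_YEAR)

print(f"Average savings, all shoppers:   ${everyone:.2f}")
print(f"Average savings, over $100/week: ${over_100:.2f}")
print(f"Rate implied by the $100 floor:  {naive_rate:.1%} (true rate {true_rate:.0%})")
```

Dropping the small spenders raises the average dollar savings without touching the discount rate, and dividing that average by the minimum spend inflates the apparent percentage.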

{ 0 comments }

Modern confessionals

December 22, 2009 in Data, Internet

We all know that you can get some funny/interesting responses by typing the first part of a question into a major search engine’s search box and letting it suggest the remainder. The NYT has gone so far as to investigate those suggestions themselves. I particularly enjoyed their description of search engines as “modern confessionals:”

This labor-saving device — part fortuneteller, part shrink? — has opened a window into our collective soul. With millions of people pouring their hearts into this modern-day confessional, we get a direct, if mysterious, glimpse into the heads of our fellow Web surfers.

And some nice visualizations of the questions people are asking don’t hurt, either:

I’d love to see an interactive tool for creating these diagrams.

{ 0 comments }

The NYT has published an infographic showing the top recipe searches on Allrecipes.com. Searches are broken out by state, allowing some interesting comparisons. (Local dialects and preferences are an interest of mine, and when combined with maps I can’t resist… see also various words for soda.)

Here’s the chart for “apple pie”, the 5th most popular search. Purple states had above-average search volume; orange states were below:

apple pie

It’s not a particularly even distribution – and it sent me looking for a Thanksgiving dish that was more uniformly enjoyed by all Americans. Unsurprisingly, that turned out to be “turkey,” the 14th most popular search. Its graphic was a blend of muted purples and oranges, scattered across the country with no obvious regional pattern:

turkey

From there, I went searching for hyperlocal dishes or specialties. This would be much easier with the raw data, as a simple statistical test for dispersion and geographic correlation would toss up the winners – but it’s a testament to the NYT’s excellent graphics team that their visual maps serve the purpose just as well.
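As a rough idea of what the dispersion half of that test could look like, here is a sketch that ranks search terms by the coefficient of variation of their per-state volume. The state values are invented placeholders, since I don’t have the raw data, and it makes no attempt at the geographic-correlation half (which would need some measure of spatial autocorrelation).

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """Population standard deviation divided by the mean; higher means less uniform."""
    return pstdev(values) / mean(values)

# Hypothetical relative search volumes by state (1.0 = national average).
searches = {
    "turkey": {"GA": 1.0, "OH": 1.1, "CA": 0.9, "MA": 1.0},
    "sweet potato pie": {"GA": 2.2, "OH": 0.8, "CA": 0.5, "MA": 0.5},
}

# Score each term, then list the most regionally lopsided searches first.
scores = {
    term: coefficient_of_variation(list(by_state.values()))
    for term, by_state in searches.items()
}
ranked = sorted(scores, key=scores.get, reverse=True)

for term in ranked:
    print(f"{term}: CV = {scores[term]:.2f}")
```

A term with a low score is searched at about the same rate everywhere; a high score flags a candidate regional specialty worth looking at on the map.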

First up, sweet potatoes. The #1 search in the country was “sweet potato casserole,” with most of the searches concentrated in the southeast:

sweet potato casserole

Clocking in at #15 was “sweet potato pie,” another (and even stronger) southeastern favorite:

sweet potato pie

Interestingly, though, sweet potatoes themselves formed a pretty uniform search pattern across the states – and, after turkey, get my vote for “most American dish”:

sweet potato

The dataset reveals two interesting facts about sweet potatoes. First, some people don’t spell too good:

sweet potato casserole 2

Second, there’s a vocabulary difference, as many people out west prefer to call their sweet potatoes “yams” (I can’t back that up empirically, as they might want actual yams, but there is enough of a difference in dialect that many “yams” sold in the United States are required to state that they are also sweet potatoes on their packaging):

yams

Moving on from those delicious root vegetables to another family, corn, reveals further geographic breakdowns. Here’s Midwestern favorite #18, corn casserole:

corn casserole

#27: corn pudding, popular in the mid-Atlantic… and Alaska:

corn pudding

And #31, cornbread dressing, in the South:

cornbread dressing

Meanwhile, New England likes its butternut squash:

butternut squash

By this point, you’re better off clicking through the actual graphic than staring at my reprints… I hope that all of TGR’s American readers had a happy Thanksgiving, regardless of what was on the table.

{ 0 comments }

Choropleths galore

November 16, 2009 in Data, Internet

For a while, I’ve been following development of Indiemapper, a forthcoming web tool from the folks at Axis Maps. It should allow for easy map creation, including – yes – choropleths galore. However, the data analytics that will be available remain to be seen.

{ 0 comments }

It’s been a while since I posted a video for the futurist set, so here we go: (This one is a commercial production for Freeband, heavy on the infographics and benefits of smart networking with a pinch of cheesiness. Sign me up.)

http://www.vimeo.com/7459305

(via Datavisualization.ch)

{ 0 comments }

This morning, I was excited to see two of my interests collide as Nathan from FlowingData posted a tutorial for creating a choropleth: a map that uses color to convey values (I didn’t know that’s what they’re called either). He used county-level unemployment statistics to generate the following image:

However, the process appears quite intense, involving some python scripts and mucking around inside an SVG file. I half-heartedly wondered if there wasn’t a simpler way to create the image. And just then, along came David from Revolutions to throw down the gauntlet: could anyone come up with a way to replicate Nathan’s map in R?

David’s post pointed me toward R’s maps package, and off I went to start downloading the tools…

It took some time to coerce the BLS data into a compatible form; R doesn’t understand the FIPS county identifiers, so I had to jump through some hoops to get the strings to match (BLS uses state abbreviations; R wants full names. BLS puts in the words “county”, “parish” or “borough”; R doesn’t expect those to be passed. The BLS has a “Miami-Dade” county in Florida; R recognizes only “Dade”. Etc.) Ultimately, I used the following code to format the strings:
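For readers who want the gist of that cleanup without R, here is a Python sketch of the same kind of normalization. It is an illustration, not the script used for the maps in this post: the input format (“Miami-Dade County, FL”), the lowercase “state,county” output key, the abbreviated lookup table, and the function name are all assumptions.

```python
import re

# Assumed lookup table; only a few states are shown for brevity.
STATE_NAMES = {"FL": "florida", "LA": "louisiana", "NY": "new york"}

# Hand-maintained exceptions, keyed by the normalized name.
SPECIAL_CASES = {"florida,miami-dade": "florida,dade"}

# Trailing words the BLS includes but the map data is assumed not to expect.
SUFFIX = re.compile(r"\s+(county|parish|borough)$", re.IGNORECASE)

def normalize_county(bls_name):
    """Turn a name like 'Orleans Parish, LA' into a key like 'louisiana,orleans'."""
    county, state_abbr = [part.strip() for part in bls_name.rsplit(",", 1)]
    county = SUFFIX.sub("", county).lower()
    key = f"{STATE_NAMES[state_abbr]},{county}"
    return SPECIAL_CASES.get(key, key)

print(normalize_county("Miami-Dade County, FL"))
print(normalize_county("Orleans Parish, LA"))
```

Keeping the one-off exceptions in their own table makes it easy to add the next oddly named county without touching the general rule.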

With the data in the correct format, I aligned a color vector with R’s list of counties and plotted the result:
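The color-alignment step is the heart of any choropleth, so here is a small Python sketch of the idea: bin each county’s rate into a class, then build the color vector in the map’s own county order. The class breaks, palette, gray fallback, and function names are assumptions for illustration, not the values used for the maps in this post.

```python
from bisect import bisect_right

# Assumed class breaks (percent unemployed) and a light-to-dark palette.
BREAKS = [2, 4, 6, 8, 10]
PALETTE = ["#F1EEF6", "#D4B9DA", "#C994C7", "#DF65B0", "#DD1C77", "#980043"]

def color_for(rate):
    """Pick the palette entry for the class that contains this rate."""
    return PALETTE[bisect_right(BREAKS, rate)]

def align_colors(map_names, rate_by_name, missing="#CCCCCC"):
    """Build a color list in the map's county order; unmatched counties get a fallback."""
    return [
        color_for(rate_by_name[name]) if name in rate_by_name else missing
        for name in map_names
    ]

# Hypothetical example: the second county has no matching data row.
map_names = ["florida,dade", "florida,broward"]
rates = {"florida,dade": 9.1}
print(align_colors(map_names, rates))
```

The fallback color matters more than it looks: any county whose name fails to match shows up as a visible gap rather than silently shifting every color after it.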

It came out like this:

maps package result

Not too bad, I think. It’s a little rough around the edges and a couple of counties are missing – I assume they are the ones with odd naming conventions (you’ll notice I manually adjusted Miami-Dade in my code). Also, I’m not sure how to bring Hawaii and Alaska into the picture. Moreover, the image doesn’t look too good in R itself. For example, I had given up on getting the county borders to show up as faint lines (I could only get them to be completely opaque) – imagine my surprise when I exported the chart and could see the borders just fine!

In any case, I wasn’t satisfied with this result. I’ve been experimenting with ggplot2 and remembered it had some mapping functions, so off I went to recreate the image with yet another library. Ggplot2 is an excellent general-purpose graphics library; the maps package feels positively last-gen after playing with ggplot2. It’s much more extensible and has many more parameters to experiment with – hard to believe it’s not the standard graphics package that ships with R (which itself is another last-gen experience).

Anyway, I kept the data formatted as above – which may have added an extra line or two to the ggplot2 code, but makes it simpler to jump back and forth – and used the following script to draw a new version of the map:

And the resulting image:


ggplot2 package result

Again, a couple drawbacks: Alaska and Hawaii are nowhere to be seen and the borders are slightly aliased. The aliasing does make a difference, especially when compared to the maps output, but the ease with which I put together the latter graph and the frustration I experienced with the maps package, in my mind, more than erase that perceived shortcoming.

On the whole, I’d still take Nathan’s map over these as a finished product. However, I don’t think R can be beat for ease of use and all-in-one packageability – if I wanted, I could run regressions on the data, overlay my chart with more colors or new metrics, explode out certain counties or states… the possibilities are endless. With just a couple lines of code, I could overlay states that voted for Obama in blue, or highlight counties starting with the letter “C”. The static SVG method doesn’t allow any of that flexibility. Also, I’m completely confident that if I had any experience with these mapping packages – rather than using them for the first time tonight – I could mimic Nathan’s image perfectly.

The ggplot2 package, in particular, is fantastically powerful. I really wish I had discovered it sooner. As a matter of fact, Josh Reich runs a monthly R meetup for R users in the New York area and the next topic happens to be ggplot2 – it’ll be my first time attending, so I can’t really say what to expect, but I’m definitely looking forward to it.


{ 3 comments }

Moral hazard and the NFL

November 11, 2009 in Data, Sports

The WSJ asks, “Is It Time to Retire the Football Helmet?” With the debate about football head injuries and CTE swirling, some are wondering if wearing helmets is actually exposing players to greater danger than if their heads were exposed. Though seemingly counter-intuitive, the argument follows well-established moral hazard reasoning that some have perceived in, for example, government bailouts for large financial institutions.

Moral hazard arises when an insured party takes greater risk because they know they are protected. In the NFL, that translates to players making and taking more violent hits because wearing a helmet makes them feel invulnerable. The reality, however, is that the helmet protects only from direct trauma to the skull; the brain remains very much at risk.

Taking helmets away would certainly change the sport. Though it’s hard to disagree that all things equal, players with helmets will play more aggressively than those without, not everything would stay equal with that rule change. I suspect the game would evolve to resemble rugby – a sport not without its share of head injuries.

For a data-driven perspective on the head injury debate, please see Jer Thorp and Jeff Clark’s independent analyses comparing two CTE narratives.

{ 0 comments }

How many roads…

October 29, 2009 in Data

Ben Fry has created a stunning image consisting of the 26 million roads in the United States (click to zoom):

Nothing other than asphalt (gravel, dirt…) has been drawn here, but geographic and political features emerge nonetheless. In a very real sense, the geography is a latent feature of the roads dataset, as it creates boundary conditions for the observable effect (that being the roads themselves). In other words, we see mountains, rivers, oceans, and the Canadian border because they are defined by contiguous regions without any streets.

Please see Ben’s project page for more information.

{ 0 comments }

Men are from mars, women are from gmail

October 22, 2009

ReadWriteWeb’s coverage of a new study on webmail demographics contains one sentence that left me a little confused:
Gmail, for instance, includes more females (53%) than males (47%). If those were election poll results, we would call it “too close to call,” but in terms of tens of thousands of users, these percentage point differences have [...]

2 comments Read the full post →

Data intervention

October 16, 2009

The always-excellent How I Met Your Mother addresses a major social problem:

(via FlowingData)

0 comments Read the full post →

How to fix a broken pie chart

September 8, 2009

Datavisualization.ch has a helpful step-by-step on how to turn this (from a Mashable post):

into this:

Of course, the motivation is worth more than the mechanics.

0 comments Read the full post →

Processing

August 28, 2009

John Maeda has written an article for the MIT Technology Review about Processing, the open source visualization language. It’s a very interesting look into the story behind the code. Maeda is the president of the Rhode Island School of Design and was once the director of MIT’s Media Lab, where Processing was born.
Lately, I’ve noticed [...]

0 comments Read the full post →

The million dollar question

August 20, 2009

Straight from GigaOm, emphasis mine:
Despite all the hype and excitement around the real-time web, access to real-time information online is hardly a new phenomenon. That fact stuck with me after talking to Chris Cox, Facebook’s product director, last week at the social networking company’s headquarters. As he noted, “Real time has been around since [the launch [...]

1 comment Read the full post →

Manhattan in flux

August 13, 2009

A very nice graphic is making the rounds (though I believe it originated in a 2007 issue of Time Magazine) which shows Manhattan’s population density by day and by night. The difference is striking:

Happily, the density bars mimic the placement of Manhattan’s skyscrapers – this follows because obviously the tallest buildings support the highest population [...]

1 comment Read the full post →

Dronish number nerds

August 6, 2009

It’s still not too late for Stats 101: The NYTimes published an article this morning titled “For Today’s Graduate, Just One Word: Statistics.” Of course I love to see articles like this, cognizant of the massive amounts of data we are faced with and acknowledging the efforts of the people trying to sort it all out:
In field [...]

0 comments Read the full post →

Test driving America’s dashboard

August 3, 2009

Recently, the CIO of the United States released a Federal IT Dashboard, to show people exactly how their money is being spent. I’ve played with the site, and found it ultimately heavy on style and light on substance (3D graphs with slick animated transitions only frustrate me while I wait for results). But why read [...]

0 comments Read the full post →

No, Twitter does not deserve a Nobel Peace Prize!

July 13, 2009

Bubble 2.0 datapoint of the day: ReadWriteWeb is running an article with the title “Does Twitter deserve a Nobel Peace Prize? Maybe not yet, but it could someday.” Fortunately, they acknowledge the idea is ridiculous for the moment and are really just responding to this outlandish post by Bush’s Deputy National Security Advisor. Nonetheless, besides [...]

1 comment Read the full post →

Asimov on perceiving the world

July 8, 2009

Via economist Dan Ariely’s blog, this is what Isaac Asimov thought about perceiving the world through data. It is an implicitly Bayesian approach and brings to mind the famous Keynes quote about changing one’s mind. Asimov wrote:
“Don’t you believe in flying saucers, they ask me? Don’t you believe in telepathy? — in ancient astronauts? — in [...]

0 comments Read the full post →

Twitter, exposed

July 2, 2009

Twitter’s data model is, interestingly enough, entirely user generated. Hashtags of every variety, retweets, and other methods of ascribing meta-information to tweets have developed outside any formal structural model or standard. The lone first-party implementation is that a “@” prefix links directly to a person, and even that isn’t fully functional.
All of my problems with [...]

2 comments Read the full post →

Kottke on Twitter

June 25, 2009

No less an authority than Jason Kottke is taking up the “Twitter’s data model sucks” mantle, instantly doubling the size of my little crusade. Actually, Kottke doesn’t even attack Twitter, but rather sites that claim to provide Twitter-organization services, but it’s close enough because it implicitly recognizes that Twitter doesn’t have even a shard of [...]

0 comments Read the full post →

A Google Reader wishlist

June 24, 2009

Google Reader has become an indispensable part of my daily life. It’s the only way I can keep up with the amount of reading I do each day, and as much as I love the service, there are a few things I miss.
Here’s my wishlist for Google Reader:
Intelligent favorites: Right now, I have a “favorites” [...]

0 comments Read the full post →

Inferred ratings and modelling teacher comments

June 24, 2009

Another aspect of my conversation dealt with inferred ratings, a problem I’ve come across before in other areas. There are two primary cases in which this arises: censored data and self-selection bias.
In the first case of censored data, a problem is caused by the ratings system not eliciting useful responses. An example is a system in [...]

0 comments Read the full post →

Personalized Yelp ratings

June 24, 2009

I had a great conversation last night which at one point verged into the pros and cons of various ratings systems. In particular, we discussed the “star+comment” system used by Yelp, in which between 1 and 5 stars can be assigned in addition to a text comment of arbitrary length.
Yelp does some clever things with [...]

1 comment Read the full post →

The flow of information

June 16, 2009

This NYT article on Twitter and Iran sums it all up (emphasis mine):
“We’ve been struck by the amount of video and eyewitness testimony,” said Jon Williams, the BBC world news editor. “The days when regimes can control the flow of information are over.”

It’s an amazing and deserved accolade for the young service.
But.
To abuse the common [...]

0 comments Read the full post →

Is Opera Unite the anti-cloud?

June 16, 2009

Opera Unite lets users turn their computers into zero-effort servers, allowing easy peer-to-peer access.
Unite: store data locally, access it globally.
Cloud: store data globally, access it globally.
I’m curious about what advantages there are in Unite, other than strict peer-to-peer uses (i.e. sharing photos with just one other person) and the “I don’t want Google to have [...]

0 comments Read the full post →

Illustrating the importance of data visualization

June 12, 2009

Andrew Gelman discusses research on attitudes toward gay marriage, by state, and notes this graph in particular, which shows the change in opinion over the last 15 years:

Critically, he points out that the states which experienced the greatest change in attitude were the ones that already were most receptive. A naive analysis of the data [...]

0 comments Read the full post →

Critiquing the Crimson

June 9, 2009

The Harvard Crimson has published its annual senior survey, which is making headlines in part because very few seniors are going into finance. Selected results were presented in an interesting visualization (the image below links to a full size pdf):

Now that my brother has graduated after successfully steering the Crimson’s business operations to one of [...]

0 comments Read the full post →

Deconstructing the Gaussian copula, part I

June 5, 2009

A number of misconceptions about the Gaussian copula are addressed.

7 comments Read the full post →