Posts about the analysis and presentation of:

Data

From an NYT article on Google’s translation services, this excerpt sums up the most critical transition in machine learning that has happened thus far:

Creating a translation machine has long been seen as one of the toughest challenges in artificial intelligence. For decades, computer scientists tried using a rules-based approach — teaching the computer the linguistic rules of two languages and giving it the necessary dictionaries.

But in the mid-1990s, researchers began favoring a so-called statistical approach. They found that if they fed the computer thousands or millions of passages and their human-generated translations, it could learn to make accurate guesses about how to translate new texts.

{ 0 comments }

Get LOST!

February 2, 2010 in Data, General

LOST is back tonight! And what better way to prepare than an interactive timeline from the excellent NYT graphics team? A good infographic should communicate otherwise-complex ideas in a simple and intuitive manner… oh, never mind, LOST is back and that’s really what matters. Check out the timeline here!

{ 0 comments }

The mathematician’s lens

January 25, 2010 in Data, Math

A beautiful article in the NYTimes contrasts abstract mathematics with the chilling reality of the Mexican drug cartel wars:

I was born in Mexico City, in a world that seems less and less familiar to me. I live now in the opposite corner of the continent. I am training to be a political scientist at Harvard. My passion has remained the afflictions of my homeland, but at Harvard I have found new ways to address them, to use mathematical models — matrices, vectors, equations, regressions — to understand the Mexican drug crisis.

The cartel wars are extremely violent, and the gangs are responsible for reprehensible kidnappings and deaths. They rank among the most deadly periods of organized crime in human history. The author’s goal isn’t to explain how she can analyze the wars from up in an ivory tower; it’s to describe how her mindset and toolkit inform her understanding of the world in any situation.

The article captured me because it never mentions what the author actually models. Instead, it presents her frightened thoughts and her efforts to calm herself by looking at the world through a mathematical lens. But it’s not what you think; there are no emotionally-distant mathematicians here. The author communicates her fascination with tying reality to abstract models, expecting and preempting the protest that reality is too complex and math too simple:

In this violent world, with the man in the blue Chevy whispering at me behind the window, math is my shield. Speaking up about drugs is in these parts a dangerous game. But not if you speak in the language of sigma and conditional expectations. Math protects me from the immediacy of the violence, and it protects me from them.

The beauty of my method lies in its simplicity. With mathematics I’m able to codify and simplify reality to make it manageable and, more important, malleable. I represent each possible individual as an equation in which each term symbolizes tastes, goals, profession and abilities. All people get portrayed: Policemen, politicians, citizens and drug cartels start living in this mathematical world as planes and hyperplanes and, as in real life, they interact and affect one another, sometimes colluding, sometimes colliding, sometimes neither.

I then use optimization to predict the form of interaction that will be the most probable to emerge and remain over time. Math starts speaking. It tells me, for example, under what conditions the outcome would be a drug war; when would the government prefer to cooperate with cartels; or when cruel intra-cartel purges will become the norm.

There is a part of every modeler’s mind which is constantly teasing out variables from constants. The statisticians among us may take a frequentist view, and wonder what would happen if a scene played itself out a million times; the programmers will deduce the underlying algorithms from the fuzzy result; the pure mathematicians will see manifolds everywhere:

In this abstract microcosmos, reality can be frozen or just slightly changed. I move and look at my hyperplanes from different angles. Let’s change the penalty code. No, let’s increase patrolling. Or reduce wages. Allow less contact between policemen and dealers. Assume the police force is corrupt. Assume it is not. I solve the equations and there it is. My answers come as Greek letters and probabilities.

But we all admit:

I know, I know, this is weird.

Ultimately, “free will” becomes the clarion of the independent. At least, it’s the best response to this explanation:

It may seem strange to examine this shadowy world with equations. But mathematics is transforming the social sciences. In the same way that physicists can predict the movement of atoms in space, we can use mathematics to model how individuals and groups will make decisions and interact in a society.

But free will has a (somewhat tentative) analogue in Heisenberg’s uncertainty principle, and with that philosophy and math (or theology and physics) are combined — but there’s been plenty of pop-sci written on that topic.

I found this brief article remarkable in how it was able to demonstrate the overlay of mathematical thought on an extremely “human” subject without ever needing to explain either one.

(Via Drew Conway)

{ 0 comments }

Microsoft has announced the system requirements for Office 2010.

That’s news in and of itself. Once upon a time, system requirements (at least, ones that anyone paid attention to) were strictly for high-end professional software, cutting-edge games and the like: software that actually needed powerful hardware. But the real news here is that Office 2010 requires a DirectX-compatible graphics card.

Now, I don’t think Word is going to be offloading word counts to a GPU anytime soon. But Microsoft’s announcement is making waves nonetheless, and I think it’s actually great. It means we’ve reached a point where our computing history is so mature that even our mass-market word processors have become sophisticated enough that we need to check their hardware compatibility. That’s exciting!

Certainly, Excel is an obvious candidate for hardware acceleration, which, besides accelerating simple tasks like opening large files and parallel tasks like running many equations, could finally bring true vector operations to the versatile software.

But there is bad news. I’ll let Microsoft break it to you:

If your computer has a GPU, it lets us perform graphics rendering tasks (like drawing charts in Excel, or transitions in PowerPoint) in the GPU instead of in the CPU, which parallelizes work and speeds up performance. This is particularly relevant for users of PowerPoint 2010, which will introduce some awesome new graphics and video integration features (more info at the PowerPoint team blog).

Yes, the true motivation behind the graphics upgrade is supercharging those awful 3D pie charts we know and despise.

(If you click the PowerPoint link, you’ll notice that PowerPoint 2010 looks a lot like Keynote. Just sayin’.)

{ 0 comments }

Very amusing… and true:

I especially love “The HDR Hole.” Presumably the y-axis is measured in percent of personal potential… there must be all sorts of Bayesian self-reflection stuff going on there.

(Via DataViz)

{ 0 comments }

Suggestions

January 24, 2010 in Data

(Via Piled Higher and Deeper)

{ 0 comments }

Data wars

January 11, 2010 in Data

The NYT writes about the military’s data problem:

Air Force drones collected nearly three times as much video over Afghanistan and Iraq last year as in 2007 — about 24 years’ worth if watched continuously. That volume is expected to multiply in the coming years as drones are added to the fleet and as some start using multiple cameras to shoot in many directions.

A very interesting read for the dataheads among us. The comparison to football broadcasts also caught my eye – televised sports are so frequently compared to battles and war, and here we see the military coming to the sports broadcasters for advice:

But while the biggest timesaver would be to automatically scan the video for trucks and armed men, that software is not yet reliable. And the military has run into the same problem that the broadcast industry has in trying to pick out football players swarming on a tackle.

So Cmdr. Joseph A. Smith, a Navy officer assigned to the National Geospatial-Intelligence Agency, which sets standards for video intelligence, said he and other officials had climbed into broadcast trucks outside football stadiums to learn how the networks tagged and retrieved highlight film.

{ 0 comments }

Chart Wars

January 8, 2010 in Data

Alex Lundry, Vice President and Director of Research of the consulting firm Target Point, has published a talk called Chart Wars which is simply brilliant: an excellent but brief (5 minutes!) overview of how easy it is to manipulate infographics and what tricks to be wary of. His specific focus is a chart (which was covered on TGR previously) whose designs – and it went through many iterations – were politically motivated. While there is no doubt about which charts are clearer, his implicit question – which charts are right? – resonates philosophically.

Here’s the video of his talk:

(Via Information Aesthetics)

{ 0 comments }

Walmart ad math

December 23, 2009 in Data

Walmart is running ads right now which claim that shoppers who spend more than $100 per week at the supermarket would save $650 a year by purchasing their groceries at the giant retailer instead.

That’s quite a jumble of conditionals and varying metrics: you have to first meet the requirements of shopping at a supermarket and spending over $100 per week; the savings are then presented in a completely different timeframe of one year. That works out to $12.50 a week, or a still sizable 12.5% discount.
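
As a quick check of the arithmetic above, here is a minimal sketch that converts the ad's annual figure into a weekly one and then into a percentage. The variable names are my own; the dollar figures are the ones quoted from the ad, and the calculation assumes a shopper who spends exactly the $100 minimum every week.

```python
# Figures as quoted from the ad in the paragraph above.
annual_savings = 650.0   # claimed dollars saved per year
weekly_spend = 100.0     # stated minimum weekly grocery spend
weeks_per_year = 52

# Put both numbers on the same weekly timeframe.
weekly_savings = annual_savings / weeks_per_year

# Express the savings as a fraction of the weekly spend.
discount_rate = weekly_savings / weekly_spend

# The same spend restated on an annual basis.
annual_spend = weekly_spend * weeks_per_year

print(f"Weekly savings: ${weekly_savings:.2f}")
print(f"Implied discount: {discount_rate:.1%}")
print(f"Annual spend: ${annual_spend:.0f}")
```

Putting both numbers on the same timeframe is the whole trick: once savings and spend are both weekly (or both annual), the percentage falls out directly.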

Why not present it as 12.5%? The simple answer is that “$650” is a substantial figure, and the marketing folks wanted to make people feel like they were saving more; conversely, $5200 a year on groceries sounds like a lot – better restate that as $100 per week. Depressingly, it occurs to me that many Americans may not know what to do with percentages.

Another key point is found in the wording of the ad – why target shoppers who spend more than $100 a week? If Walmart’s prices are really lower, then all shoppers should reap a benefit, not just the high rollers. Since I do not think Walmart is price discriminating (offering discounts only to people spending more than $100), I have to conclude that they restricted their dataset to increase the dollar value of the average person’s savings. If every shopper saved 12.5%, then the average annual dollar savings per person might be, say, $250. But if we consider only people who spend more than $100, the average dollar savings jumps to $650 even though the percent savings remains 12.5%. I would guess that $100 was chosen as a cutoff because a) it’s a round, friendly number which b) creates a relatively high average dollar savings while c) remaining low enough to be in reach of many American families. This, of course, is further evidence that Americans don’t understand percentages well (or at least, that marketers think they can fool us by avoiding them).

Note also that all of my calculations use the stated minimum figure of $100 vs the average figure of $650 to get the 12.5% discount. That’s not a real discount – someone spending $100 wouldn’t get $650 in savings, as that is the average of all the people spending more than $100. That person would realize a smaller dollar savings, and the real discount rate must therefore be less than 12.5%.
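
To make the last two points concrete, here is a small sketch with made-up numbers. The list of weekly spends and the 8% discount rate are purely hypothetical (nothing here comes from Walmart's data); the point is only to show how restricting the sample to big spenders raises the average dollar savings, and how dividing that average by the $100 minimum overstates the true rate.

```python
# Hypothetical weekly grocery spends for nine shoppers (illustrative only).
weekly_spends = [20, 40, 60, 80, 100, 120, 150, 200, 250]

true_rate = 0.08        # assumed flat discount for every shopper (hypothetical)
cutoff = 100            # the ad's stated minimum weekly spend
weeks_per_year = 52

def average_annual_savings(spends):
    """Mean yearly dollar savings if everyone gets the same percentage off."""
    return sum(true_rate * s * weeks_per_year for s in spends) / len(spends)

# Everyone versus only the shoppers above the cutoff.
big_spenders = [s for s in weekly_spends if s > cutoff]
avg_all = average_annual_savings(weekly_spends)
avg_big = average_annual_savings(big_spenders)

# The shortcut used in the post: average savings divided by the minimum spend.
naive_rate = avg_big / (cutoff * weeks_per_year)

print(f"Average annual savings, all shoppers: ${avg_all:.2f}")
print(f"Average annual savings, spend > ${cutoff}: ${avg_big:.2f}")
print(f"True rate: {true_rate:.1%}, rate implied by the shortcut: {naive_rate:.1%}")
```

Because every shopper above the cutoff spends more than $100, their average savings is always divided by too small a base, so the shortcut rate comes out higher than the true one whatever distribution is plugged in.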

{ 0 comments }

Modern confessionals

December 22, 2009 in Data, Internet

We all know that you can get some funny/interesting responses by typing the first part of a question into a major search engine’s search box and letting it suggest the remainder. The NYT has gone so far as to investigate those suggestions themselves. I particularly enjoyed their description of search engines as “modern confessionals:”

This labor-saving device — part fortuneteller, part shrink? — has opened a window into our collective soul. With millions of people pouring their hearts into this modern-day confessional, we get a direct, if mysterious, glimpse into the heads of our fellow Web surfers.

And some nice visualizations of the questions people are asking don’t hurt, either:

I’d love to see an interactive tool for creating these diagrams.

{ 0 comments }

Overcharting: airfare edition

November 28, 2009

Nate Silver writes about the dropping cost of air fares – yes, you read that correctly – over at Five Thirty Eight. His writing, as always, is excellent – I only want to point out a chart he uses and how it can be dangerous to draw conclusions at a glance (or, if you prefer, [...]

0 comments Read the full post →

It’s American as sweet potatoes (but not sweet potato pie)

November 27, 2009

The NYT has published an infographic showing the top recipe searches on Allrecipes.com. Searches are broken out by state, allowing some interesting comparisons. (Local dialects and preferences are an interest of mine, and when combined with maps I can’t resist… see also various words for soda.)
Here’s the chart for “apple pie”, the 5th most popular [...]

0 comments Read the full post →

Pie chart fail

November 27, 2009

Via FlowingData, I found this amusing pie chart from a local Fox News broadcast:

The survey plainly allowed people to give more than one answer, resulting in responses that were not mutually exclusive. It’s tiresome but bears repeating: pie charts are only suited to data which adds up to 100% (and then, only if there are [...]

0 comments Read the full post →

Choropleths galore

November 16, 2009

For a while, I’ve been following development of Indiemapper, a forthcoming web tool from the folks at Axis Maps. It should allow for easy map creation, including – yes – choropleths galore. However, the data analytics that will be available remain to be seen.

0 comments Read the full post →

Choropleths in R (yes, “choropleths”)

November 12, 2009

Using R to recreate color-indexed maps of US unemployment data.

3 comments Read the full post →

Moral hazard and the NFL

November 11, 2009

The WSJ asks, “Is It Time to Retire the Football Helmet?” With the debate about football head injuries and CTE swirling, some are wondering if wearing helmets is actually exposing players to greater danger than if their heads were exposed. Though seemingly counter-intuitive, the argument follows well-established moral hazard reasoning that some have perceived in, [...]

0 comments Read the full post →

Ten statisticians every psychologist should know

November 11, 2009

Psychologist Daniel Wright has published a list of ten statisticians every psychologist should know.
The list is comprised of The Founding Fathers:
1. Karl Pearson – who established statistics as an academic discipline
2. Ronald Fisher – who developed much of statistics’ mathematical foundation, including ANOVA and maximum likelihood, and the importance of p-values
3. Jerzy Neyman [...]

0 comments Read the full post →

How many roads…

October 29, 2009

Ben Fry has created a stunning image consisting of the 26 million roads in the United States (click to zoom):

Nothing other than asphalt (gravel, dirt…) has been drawn here, but geographic and political features emerge nonetheless. In a very real sense, the geography is a latent feature of the roads dataset, as it creates boundary [...]

0 comments Read the full post →

Don Draper would be proud

October 28, 2009

Recently, there have been countless ads for auto insurance all making a similar claim: drivers who switch to that firm save significant amounts of money. How can every major insurance company make a similar statement? They can’t all be cheaper than every other company, on average.
As a particularly egregious example, Allstate’s website declares it via [...]

0 comments Read the full post →

How Shazam works

October 28, 2009

Ever wondered how song-identifying iPhone app Shazam works?
Now you know.
(For the link-averse: it’s a pretty cool implementation of pattern matching across song spectrograms, and the key insight was to first reduce the spectrograms by including only peak frequencies. Simple, yet genius.)
(via Revolutions)
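
The peak-reduction idea can be sketched in a few lines. To be clear, this is a toy illustration and not Shazam's actual implementation: it assumes clean, frame-aligned audio, uses a naive Fourier transform, keeps only the single strongest frequency per frame, and matches by exact comparison. All function names and the synthetic "song" are my own inventions.

```python
import cmath
import math

def dft_magnitudes(frame):
    """Magnitudes of a naive discrete Fourier transform for one frame."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2)
    ]

def spectrogram(signal, frame_size):
    """Split the signal into non-overlapping frames and transform each one."""
    return [
        dft_magnitudes(signal[i:i + frame_size])
        for i in range(0, len(signal) - frame_size + 1, frame_size)
    ]

def peak_bins(spec):
    """Reduce each frame to the index of its strongest frequency bin."""
    return [max(range(len(frame)), key=frame.__getitem__) for frame in spec]

def tone(bin_index, frame_size):
    """One frame of a pure sine wave that lands exactly on the given bin."""
    return [math.sin(2 * math.pi * bin_index * t / frame_size)
            for t in range(frame_size)]

def find_offset(haystack, needle):
    """Index of the first frame where the clip's peaks line up with the song's."""
    for i in range(len(haystack) - len(needle) + 1):
        if haystack[i:i + len(needle)] == needle:
            return i
    return None

FRAME = 32
# A toy "song": four frames, each a pure tone at a different frequency bin.
song = tone(3, FRAME) + tone(7, FRAME) + tone(5, FRAME) + tone(10, FRAME)
# A "recording" of the middle two frames with a weaker interfering tone mixed in.
clean_clip = tone(7, FRAME) + tone(5, FRAME)
interference = tone(12, FRAME) * 2
clip = [s + 0.2 * n for s, n in zip(clean_clip, interference)]

song_peaks = peak_bins(spectrogram(song, FRAME))
clip_peaks = peak_bins(spectrogram(clip, FRAME))
offset = find_offset(song_peaks, clip_peaks)

print("Song peaks:", song_peaks)
print("Clip peaks:", clip_peaks)
print("Clip starts at frame:", offset)
```

Even with the interfering tone mixed in, the clip reduces to the same peaks as the matching stretch of the song, which is why throwing away everything except the peaks makes the comparison both cheap and tolerant of noise.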

1 comment Read the full post →

Men are from mars, women are from gmail

October 22, 2009

ReadWriteWeb’s coverage of a new study on webmail demographics contains one sentence that left me a little confused:
Gmail, for instance, includes more females (53%) than males (47%). If those were election poll results, we would call it “too close to call,” but in terms of tens of thousands of users, these percentage point differences have [...]

2 comments Read the full post →

Data intervention

October 16, 2009

The always-excellent How I Met Your Mother addresses a major social problem:

(via FlowingData)

0 comments Read the full post →

Suspicious poll distributions

September 25, 2009

I’ve covered Benford’s method for first-digit fraud analysis before, and now Nate Silver has applied a similar method to polling results. He looked at the last digit of various polls (e.g. a 48% McCain, 49% Obama, 3% undecided poll would be recorded as an 8 and a 9) and compiled histograms of their frequencies. Following [...]
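
The tallying step described above is easy to sketch. The poll numbers below are invented for illustration (only the 48/49 pair comes from the example in the text), and the comparison against a perfectly uniform spread of trailing digits is my own simplifying assumption, not necessarily the exact test used in the original analysis.

```python
from collections import Counter

def last_digit_counts(polls):
    """Count trailing digits of the two candidates' numbers in each poll."""
    counts = Counter()
    for candidate_a, candidate_b in polls:
        counts[candidate_a % 10] += 1
        counts[candidate_b % 10] += 1
    return [counts[d] for d in range(10)]

def chi_square_uniform(counts):
    """Chi-square statistic against equal expected counts for every digit."""
    expected = sum(counts) / 10
    return sum((c - expected) ** 2 / expected for c in counts)

# Invented polls whose trailing digits happen to be spread evenly.
even_polls = [(48, 49), (51, 44), (46, 47), (52, 43), (45, 50)]
# Invented polls that keep landing on the same trailing digits.
lumpy_polls = [(47, 48), (47, 48), (57, 38), (47, 48), (37, 58)]

even_counts = last_digit_counts(even_polls)
lumpy_counts = last_digit_counts(lumpy_polls)

print("Even digit histogram: ", even_counts)
print("Lumpy digit histogram:", lumpy_counts)
print("Chi-square (even): ", chi_square_uniform(even_counts))
print("Chi-square (lumpy):", chi_square_uniform(lumpy_counts))
```

A larger statistic means the histogram is lumpier than a uniform spread would suggest; with samples this small it proves nothing, but the same tally over hundreds of polls is the kind of histogram the post describes.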

0 comments Read the full post →

Users still aren’t right about changes

September 20, 2009

Once again, the self-proclaimed “experts” of social media are revealed to be not much more than some anecdotes and a keyboard. The latest is Dan Zarrella, who has written a vitriolic attack on Twitter’s planned adoption of the retweet as an official mechanism. Zarrella does some excellent work in other areas, but I find him completely [...]

0 comments Read the full post →

Twitter’s broken data model becomes slightly less broken

September 19, 2009

I don’t usually have anything nice to say about Twitter (though I still ignore my mother’s advice and say it anyway), but the company is finally taking steps to improve one of the most glaring faults with their service: retweets.
Previously, retweets were simply new tweets that happened to contain old information. This created clutter and [...]

0 comments Read the full post →

Radial clustering

September 14, 2009

Finally, a radial visualization which serves a purpose rather than just looking cool. Getting Genetics Done has a tutorial on using clustering functions in R. In it, they show how this analysis:

is much better represented like this:

There’s nothing wrong with making a chart which looks good – in fact it’s encouraged - so long as the visual [...]

0 comments Read the full post →

Questionable rankings

September 14, 2009

I read this morning about the drama at last night’s MTV video awards (does anyone actually watch this stuff?), but the episode was overshadowed in my mind by a quirky accident of rankings: if Taylor Swift beat Beyonce for the “Best Female Video”, how can Beyonce go on to win “Video of the Year”? Presumably, video [...]

0 comments Read the full post →

How to fix a broken pie chart

September 8, 2009

Datavisualization.ch has a helpful step-by-step on how to turn this (from a Mashable post):

into this:

Of course, the motivation is worth more than the mechanics.

0 comments Read the full post →

Augmenting reality

September 3, 2009

BMW is actively researching the use of augmented reality for servicing cars:

Augmented reality (AR) has been getting a lot of press for recent advancements on the iPhone and Android platforms. While it’s nice to see these developments, thus far I’ve thought the excitement is a bit premature. It’s as if we all know how amazing [...]

0 comments Read the full post →

More on solar coverage

September 2, 2009

Maybe there’s something in the water today – no sooner had I finished estimating the Earth’s solar radiation than this popped up on Cool Infographics:

The map was created by the Land Art Generator Initiative to show the amount of solar panel coverage required to power the Earth for one year. Very interesting, and this has [...]

0 comments Read the full post →