A number of posts about:

Math

My former department chair, Xiao-Li Meng, has published an excellent article on the emergent role of statistics and the challenge of teaching the science to non-statisticians. He addresses the negative perception of the field, often ingrained by a poor high school experience and summed up in a dismissive scoff that “the best speaker in statistics” is hardly an accolade at all.

Ultimately, he views statisticians as quantitative authorities:

We statisticians, as a police of science (a label some dislike but I am proud of; see the next section), have the fundamental duty of helping others to engage in statistical thinking as a necessary step of scientific inquiry and evidence-based policy formulation. In order to truly fulfill this task, we must constantly firm up and deepen our own foundation, and resist the temptation of competing for “methods and results” without pondering deeply whether we are helping others or actually harming them by effectively encouraging more false discoveries or misguided policies. Otherwise, we indeed can lose our identity, no matter how much we are desired or feared now.

I think the title is appropriate but limiting; it isn’t the statistician as a person who is a police officer as much as it is the proper application of the field itself that is designed to eliminate poor analyses. Statisticians are merely the people trained with such knowledge, and there is nothing preventing a statistician from performing analyses of his own so long as he is able to properly moderate his own work.

Professor Meng’s most salient point in my mind comes near the end, when he calls for more attention on the quality of teaching, particularly in introductory classes:

With their potential impact in mind, it is easy to see the necessity of having the most qualified teachers for these introductory courses, just as for more advanced ones. And if I had to make a choice (and sometimes I do as a department chair), I surely will give the general introductory courses the highest priority for a very simple and practical reason. If an advanced course is sabotaged by bad teaching, the chances are that it will only affect a relatively small number of students, most of whom would have, or already have had, another chance to study statistics and to be convinced of our beloved subject’s beauty and importance.

In sharp contrast, if a general introductory course is badly taught, it often will affect hundreds, or even thousands, of students, and the vast majority of them will never take another statistical course, even if some of them initially had some curiosity or interest in statistics. This is very much like a badly taught AP statistics course that can do more harm than help, permanently turning away many of its students, as all they saw was “Oh, this is what statistics is about—boy, am I glad that there are many more interesting and relevant subjects in college than this!” Indeed, among the Harvard undergraduates I asked, the most frequent reason for not considering a statistical major was a “turn-off” experience from an AP statistics course.

I think this is a great paper with a teaching message I firmly believe in. It’s a short read with absolutely no technical nonsense – its tone is actually almost colloquial.

(via Andrew Gelman)

{ 0 comments }

Bell curves in action

August 6, 2009 in Math

An exhibit at MOMA invites visitors to mark their heights on a wall. A normal distribution results:

Well, not quite. The distribution is actually slightly negatively skewed by the confounding presence of children, who are obviously shorter than adults – you can see this in the great number of names well below the central band which are not mirrored by names higher up. Rest assured, however, that the ex-children distribution is itself Gaussian.
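The skew is easy to reproduce in a quick simulation. The sketch below is purely illustrative: the height parameters and the 20% share of children are assumptions, not measurements taken from the exhibit, and `simulate_heights` and `sample_skewness` are hypothetical helper names.

```python
import random
import statistics

def sample_skewness(xs):
    """Sample skewness: the mean cubed z-score (population form)."""
    mu = statistics.fmean(xs)
    sigma = statistics.pstdev(xs)
    return sum(((x - mu) / sigma) ** 3 for x in xs) / len(xs)

def simulate_heights(n, child_share, rng):
    """Draw n heights (cm) from a mix of adults and children.

    All parameters are hypothetical, chosen only to illustrate the effect.
    """
    heights = []
    for _ in range(n):
        if rng.random() < child_share:
            heights.append(rng.gauss(125, 15))  # children: shorter, more spread out
        else:
            heights.append(rng.gauss(170, 9))   # adults
    return heights

rng = random.Random(0)
adults_only = simulate_heights(20_000, 0.0, rng)
with_children = simulate_heights(20_000, 0.2, rng)

print(f"adults only:   skew = {sample_skewness(adults_only):+.2f}")
print(f"with children: skew = {sample_skewness(with_children):+.2f}")
```

The adults-only sample should come out close to symmetric, while mixing in the shorter group drags a long tail out below the central band.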

{ 1 comment }

A deathly serious game

August 6, 2009 in Math


The FT provides a tool for simulating the spread of an infectious disease. Though they caution that the model is not based on any complex algorithm, simplicity should not be mistaken for error. Even a simple routing scheme like this one can capture many of the dynamics of the underlying process, as it doesn’t take a particularly complicated network to emulate the real thing.

I would guess that they have a few preprogrammed routes which travelers randomly follow, coupled with a random probability of becoming infected after coming in contact with a sick person. I suppose the whole thing could boil down to a few Markov chains if you wanted to get away from nice infographics – but this is an excellent and informative example of translating that math into a tangible and accessible exercise.
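To make the Markov chain idea concrete, here is a minimal sketch of a random-contact epidemic. It is a guess at the flavor of such a model, not the FT's actual method: the population size, contact count, infection probability and sickness duration are all hypothetical, as is the function name `simulate_outbreak`.

```python
import random

def simulate_outbreak(population, initial_sick, contacts_per_day,
                      p_infect, days_sick, rng):
    """Simulate a simple susceptible/sick/recovered Markov chain.

    Each day every sick person meets `contacts_per_day` random people;
    a susceptible contact becomes sick with probability `p_infect`.
    Returns (daily count of sick people, total ever infected).
    """
    # state[i]: -1 = susceptible, 0 = recovered, k > 0 = sick for k more days
    state = [-1] * population
    for i in range(initial_sick):
        state[i] = days_sick
    history = []
    while any(s > 0 for s in state):
        sick_today = [i for i, s in enumerate(state) if s > 0]
        history.append(len(sick_today))
        newly_sick = set()
        for _ in sick_today:
            for _ in range(contacts_per_day):
                j = rng.randrange(population)
                if state[j] == -1 and rng.random() < p_infect:
                    newly_sick.add(j)
        for i in sick_today:
            state[i] -= 1              # one day closer to recovery
        for j in newly_sick:
            state[j] = days_sick
    return history, sum(1 for s in state if s == 0)

rng = random.Random(1)
history, total_infected = simulate_outbreak(1000, 5, 4, 0.1, 5, rng)
print("peak sick:", max(history), "total ever infected:", total_infected)
```

The next day's state depends only on today's state plus fresh random draws, which is exactly what makes it a Markov chain.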

The tool is very interesting, in a morbid way, though I will point out that the highest possible infection rate is 101%, while the highest possible mortality rate is 99% (yes, I checked).


{ 0 comments }

A month ago, the million dollar Netflix prize was finally won by a coalition of leading teams called Bellkor’s Pragmatic Chaos, who blended their respective methods into a super-algorithm that finally crossed the 10% improvement barrier.

…or was it?

The 10% mark sent the competition into a final, 30-day countdown, during which time other teams could submit scores. After 30 days, the contest would be over. With just one day remaining before the contest ended, a bunch of other teams turned the tables on the leaders and formed their own blended coalition, The Ensemble. The Ensemble eked out a 0.01% improvement over BPC’s score, landing them squarely in first place.

A scant 20 minutes before the deadline, BPC submitted a new entry which tied The Ensemble for first place – but with just 4 minutes remaining in the years-long competition, The Ensemble put up a new, marginally-higher score, sealing first place.

…or did they? (yes, two plot twists!)

It looks preliminarily like BPC may have won the competition. The contest was structured to avoid a common problem in statistics: overfitting. Overfitting occurs when a model is trained to a dataset in such a way that it is not able to describe similar data outside the original set. Overfit models are useless for forecasting.

The Netflix prize avoided overfitting by providing two datasets, a training set and a test set: competitors used the training set to build and optimize their model, but scores were based on the model’s fit to the test set. Thus, an overfit model would fail. But it didn’t end there – the publicly reported scores were actually based on only half of the test set. These scores determined when the contest ended, but not the winning team; the winners would be the team that had the best score on the hidden half of the test set once the evaluation period ended.
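Why hold back half of the test set? A small simulation suggests one reason: whoever leads on the public half is partly there by luck, and that luck does not carry over to the hidden half. This is only a sketch under strong assumptions, not the contest's actual scoring: every submission is given the same true error, and each half adds independent Gaussian noise of a made-up size.

```python
import random
import statistics

def leaderboard_bias(n_submissions, noise_sd, n_trials, rng):
    """Average gap between the hidden and public score of the public leader.

    Every submission has the same true error (0 here, for simplicity); each
    half of the test set adds independent noise. Lower scores are better,
    as with RMSE.
    """
    gaps = []
    for _ in range(n_trials):
        public = [rng.gauss(0, noise_sd) for _ in range(n_submissions)]
        hidden = [rng.gauss(0, noise_sd) for _ in range(n_submissions)]
        leader = min(range(n_submissions), key=public.__getitem__)
        gaps.append(hidden[leader] - public[leader])
    return statistics.fmean(gaps)

rng = random.Random(7)
gap_single = leaderboard_bias(1, 0.001, 2000, rng)
gap_many = leaderboard_bias(20, 0.001, 2000, rng)
print(f"one submission:  average gap {gap_single:+.5f}")
print(f"best of twenty:  average gap {gap_many:+.5f}")
```

With a single submission the gap averages out to roughly zero; picking the best of many on the public half makes the public score look better than the hidden one, even though nothing about the models differs.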

So, The Ensemble managed to get the highest score on the public half of the test set but it seems that BPC may actually have the marginally higher score on the hidden half, which is the one that really matters. The implication in BPC’s blog post (which has a much better summary than I) is that The Ensemble overfit the test set – I don’t think that will end up as the most accurate assessment, however. The Ensemble will surely score highly on the hidden half; overfitting would mean they couldn’t describe it well at all.

I still find it somewhat amazing that in a contest that lasted for three years, the conclusion boils down to a few minutes and mere decimals. But I guess that’s what happens when you tell a bunch of nerds you’ll give them $1mm to do what they love to do anyway…

{ 0 comments }

A response to randomness

July 12, 2009 in Math

In response to my post on the WSJ’s recent randomness article, B emailed me the following (reproduced here with permission):

The quoted WSJ article writes

“We find false meaning in the patterns of randomness for good reason: we are animals built to do just that… Many studies illustrate how this basic aspect of human nature translates to a misperception of chance.”

This cannot be right. It cannot even be meaningfully wrong. Animals are not built; we are constantly re-encoded at every generation by a process that selects only for whatever helps the survival of offspring that will of course carry that novel encoded information. Over the past ~ 10 million years, this positive natural selection on our ancestral hominid DNA has selected our species for delayed brain development, with the longer period of vulnerability allowing – requiring – love and attention.

From that infant socialization unique to our species, comes the emergence of subjectivity, free will and self-consciousness, by imitation of the care-giving adult’s constant attentions. So if we are built for anything, we are built for reproducibility of emotional states: imitation in other words is a necessity, and imitation is by definition a pattern.

In short, love is a pattern, and a random sequence of emotional states is a torture; to make this a matter of intellectual habit, misses the point.

{ 0 comments }

(Parts I, II and a half, and III of this series are also available.)

Recently, I addressed a great deal of misinformation regarding the Gaussian copula and its role in the 2008 crisis. I would like to follow that up with a succinct description of the copula and its use in CDO pricing. (This may seem a defense of the math behind the process, but you know I’m just setting it up for a fall.)

Introduction

David Li’s contribution to quantitative finance was the rapidly-standardized “single factor Gaussian copula” CDO pricing framework. The real crux of the problem was the “single factor” part – not the Gaussian copula itself (though we won’t pull any punches here). In an extraordinarily broad sense, a copula is a mathematical function that describes how two or more random variables interact. “Correlation” is a simple way of describing the copula, which should give the function some intuitive grounding. But let’s back up a second and figure out why we even need a copula in the first place.

Aside: Why Copulas?

If you try to model the behavior of many random variables, you need a multivariate distribution. The most mathematically friendly distributions are from the Gaussian family, including the familiar bell (or normal) curve. This is why such models are prevalent in all manner of statistics. For most purposes, the model is not only easy to work with but asymptotically correct (which is a nice feature, to put it mildly). However, there are some areas where the model choice is more for pragmatic reasons than justified ones – finance being prime among them. Indeed, financial distributions do not behave normally, but only recently have tools been developed that can describe them – and even there large joint distributions are daunting.

So, it is unsurprising that the Gaussian copula arose as a natural choice for modeling the joint distribution inherent to CDOs – which are essentially just collections of many intercorrelated credits.

But I’m getting ahead of myself. (This is much easier to discuss than to write about, I think, because you can gauge your audience’s comfort with each boldfaced section before moving on. I hope, brave reader, that you are still there.) Let’s talk about CDOs.

CDOs

A CDO is nothing more than a collection of various bonds, all held together in a basket. The principal risk of a CDO is default: the chance that one or more of the bonds will not survive to maturity. To isolate this risk, it is instructive to think of the CDO as a basket of sold CDS contracts, rather than a basket of purchased bonds (and indeed, “synthetic CDOs” are nothing more than CDS portfolios and have rapidly gained market share from bond portfolios). Thus, the buyer of a CDO needs to draw two conclusions regarding the basket:

  1. Will any of the credits default?
  2. When will all of those defaults occur?

The first point is obvious; the second gets at the heart of the problem. Both the timing and the correlation of defaults matter. If the CDO basket is comprised disproportionately of financial companies, then default by one may imply a greater likelihood of default for the others; a more diversified basket may not exhibit such dependencies.

This issue is compounded by the introduction of tranches – a staple of the CDO industry. Again, it is helpful to consider a CDO as a basket of sold CDS. The most junior (or “equity”) tranche has, by definition, sold insurance on the first few issuers to default – say, the first 3. The next tranche does not experience a loss until the 4th issuer defaults. The key here is that when a portfolio is tranched, investors have not sold CDS on specific issuers by name, but rather by order of default. They cannot know ahead of time which issuers they are effectively on the hook for.

Bathtub Correlation

To understand why tranching compounds the correlation problem, think of the CDO as a rectangular bathtub interspersed with mines that represent each issuer’s default. The CDO investors are aboard a boat on one side of the bathtub, and need to cross to the other side. If the boat hits a mine, that issuer defaults, and the explosion of the mine will damage the boat. The equity tranche has an extremely thin hull and will sink quickly; the senior tranche has a thick hull and can withstand many blasts without taking damage. Finally, the boat moves across the bathtub via geometric Brownian motion – which is to say, randomly.

In a low-correlation world, the mines are dispersed uniformly at random across the bathtub; hitting one mine does not imply or necessitate hitting any other. With high correlation, the mines cluster somewhere in the water; hitting one mine makes it relatively certain that another will be hit.

As a consequence, equity investors prefer high correlation. They are indifferent to hitting just a few mines or many, as they are wiped out in both situations. Therefore, they prefer the mines to be clustered, as this leaves more clear paths across the bathtub. In contrast, senior investors prefer low correlation – they can withstand glancing off a few mines, but hitting a cluster would wipe them out.

From this intuitive example, it should be clear that not only the timing of the defaults, but also their expected clustering (i.e. correlation) is important when valuing a CDO tranche.
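The bathtub can be checked with a small Monte Carlo experiment. The sketch below uses a shared random factor as a stand-in for clustering; the 100-name basket, 5% default probability, tranche boundaries and correlation levels are all hypothetical, recoveries are ignored, and `expected_tranche_loss` is an illustrative name rather than anyone's production pricer.

```python
import random
from statistics import NormalDist

def expected_tranche_loss(n_names, p_default, rho, attach, detach, n_sims, rng):
    """Expected loss of a tranche, as a fraction of its width.

    Each name defaults when a latent normal variable falls below a threshold.
    The latent variable mixes a shared factor and a name-specific one, with
    `rho` as the pairwise correlation. The tranche absorbs defaults number
    attach+1 through detach (counted in names, ignoring recoveries).
    """
    threshold = NormalDist().inv_cdf(p_default)
    a, b = rho ** 0.5, (1 - rho) ** 0.5
    total = 0.0
    for _ in range(n_sims):
        market = rng.gauss(0, 1)
        defaults = sum(1 for _ in range(n_names)
                       if a * market + b * rng.gauss(0, 1) < threshold)
        hit = min(max(defaults - attach, 0), detach - attach)
        total += hit / (detach - attach)
    return total / n_sims

rng = random.Random(42)
results = {}
for label, rho in (("low", 0.05), ("high", 0.60)):
    equity = expected_tranche_loss(100, 0.05, rho, 0, 3, 2000, rng)
    senior = expected_tranche_loss(100, 0.05, rho, 10, 100, 2000, rng)
    results[label] = (equity, senior)
    print(f"{label:>4} correlation: equity loss {equity:.1%}, senior loss {senior:.2%}")
```

Under these assumptions the equity tranche loses less on average when correlation is high, and the senior tranche loses more, which is the preference ordering described above.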

Correlation in the Gaussian Copula

Let us first draw the connection I’ve sketched out already: CDOs are composed of many issuers that may interact with each other; and a multivariate normal distribution is a common method of describing such behavior. So far, so good.

Like any Gaussian multivariate model, the Gaussian copula takes as parameters the correlation of every pair of variables under consideration. (In other words, to make the model work, you need to “explain” to it how every issuer interacts with every other issuer – these are the parameters.) Thus, the number of parameters increases with the square of the number of variables being considered – specifically, there are $\frac{N(N-1)}{2}$ parameters. If you had a CDO of 100 names, you would need to compute 4,950 parameters to describe their behavior! It doesn’t take a statistical degree to appreciate the flimsiness of a model which relies on such assumptions – it’s just too many to estimate reliably. Clearly, the traditional model simply won’t do.
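The count is just the number of distinct pairs, which a couple of lines can confirm. The basket sizes other than 100 are arbitrary examples.

```python
from math import comb

def pairwise_correlations(n_names):
    """Number of distinct pairs among n_names variables: N(N-1)/2."""
    return comb(n_names, 2)

for n in (10, 100, 125):
    print(f"{n:>3} names -> {pairwise_correlations(n):>5} pairwise correlations")
```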

Enter David Li, whose principal contribution to this field is to boil 4,950 parameters down to just one.

Shocking! Dastardly! The decision that caused the 2008 crisis! Well, not really. Though I am full of doubts about the validity of the Gaussian copula for this task in the first place, I do not think that the compression of its parameter space is the chief culprit by any means.

What Li was suggesting amounted to this: instead of modeling the intricate inter-corporate correlation structure, in which financials are highly correlated to each other but bear little semblance to utilities, which themselves are very similar, he said why not just model everything at the average correlation of the CDO names? Actually, he just said that one correlation level will be enough to describe the CDO price – he did not say it was the average (I just added that to make the notion more tolerable at first glance). He didn’t care if you chose a higher or lower correlation than any pair in the whole CDO exhibited; his claim was that there was some single number that would get the model to output a price that matched the market.

Before we get up in arms about this, let’s remember that most financial instruments are priced this way. One or more variables of the equation are left free to change, such that for some level the model will output the “correct” (or market-observed) price. With options, this is called volatility; with swaps this is the fixed rate; with bonds this is the yield – I particularly like the last example because most people assume this is limited to derivatives. It’s not; “real” securities exhibit this problem too. For stocks, it’s called a P/E ratio.

So, we’ve boiled correlation down to one parameter which can take any value, but forces all issuers to have the same correlation to each other AND (this is a much more important caveat) exhibit a Gaussian dependence structure.

Now What? This Is Getting Boring.

Ok, let’s price a CDO.

If I have CDS prices for all the issuers in my CDO, I can back out the probability of each issuer defaulting. (That’s a whole other lecture, but please take my word that if we have the price of default insurance, we can calculate the probability of default. Otherwise I’ll go on for another 2000 words…) This answers my first question: will defaults occur? Combine that with a correlation number and I can answer the second question: when will all the defaults occur? So now I can price the CDO, right? Unfortunately, no.

The default probabilities backed out of the CDS data are conditional default probabilities, meaning they have the market’s 4,950 correlation factors baked into them. Company A may be doing fine, but it’s very correlated to company B which is not so healthy. The result is that company A’s CDS will exhibit a relatively high default probability even though that’s more B’s fault than A’s.

In statistics, we like to deal with independent or unconditional probabilities, because the math becomes dramatically easier. So the conditional probabilities extracted from the CDS are not so useful, and must be transformed into independent probabilities. To achieve this goal, we do something that I think is very clever:

We set up a model in which defaults are driven by a shared “market factor” and an idiosyncratic factor, similar to a regression with one explanatory variable and an error term, hence the name “single factor model.” Now, I know I just said there are two factors, but one is specific to each individual issuer, so it doesn’t count as one of the model factors – if this troubles you, chalk it up to statistical nuance. Anyway, the two drivers are weighted by a correlation term; as correlation increases the market factor dominates, and as it decreases the idiosyncratic factor dominates.

Now, suppose for a moment we knew the value of the [random] market factor. In this case, default would be driven solely by the idiosyncratic factor (since the market factor is fixed, and we have chosen it such that all names either are – or are not – in default). The idiosyncratic factor is, by definition, independent across all issuers. Therefore, we have artificially created a scenario in which defaults are independent for each issuer by conditioning on a certain level of the market factor. More specifically, we have generated a set of conditionally-independent default probabilities. Now, repeat the process for every issuer and every market factor level. The result is a complete picture of how every issuer behaves in every possible situation. From this, the unconditionally independent probabilities can be extracted.
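For readers who do want to see the conditioning step, here is a minimal sketch using the usual textbook form of a one-factor Gaussian model. The notation, function names, and the 5% default probability and 0.3 correlation are assumptions for illustration; nothing here is taken from Li's paper or from a production pricer.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def conditional_default_prob(p_default, rho, market):
    """P(default | market factor = market) in a one-factor Gaussian model.

    The latent variable is sqrt(rho)*M + sqrt(1-rho)*Z with M, Z standard
    normal; the name defaults when it falls below inv_cdf(p_default).
    """
    threshold = STD_NORMAL.inv_cdf(p_default)
    return STD_NORMAL.cdf((threshold - rho ** 0.5 * market) / (1 - rho) ** 0.5)

def expect_over_market(fn, steps=2001, width=8.0):
    """Average fn(market) over the standard normal density (Riemann sum)."""
    h = 2 * width / (steps - 1)
    return sum(fn(-width + k * h) * STD_NORMAL.pdf(-width + k * h) * h
               for k in range(steps))

p, rho = 0.05, 0.3
# Averaging the conditional probability over all market levels recovers p.
unconditional = expect_over_market(lambda m: conditional_default_prob(p, rho, m))
# Given the market level, two names default independently, so the chance that
# both default is the average of the squared conditional probability.
joint = expect_over_market(lambda m: conditional_default_prob(p, rho, m) ** 2)

print(f"recovered default probability: {unconditional:.4f}")
print(f"both of two names default:     {joint:.4f} (vs {p * p:.4f} if independent)")
```

Conditional on the market factor the defaults multiply like independent events; it is only after averaging over the market factor that the dependence between names reappears.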

(If that isn’t quite clear, suffice to say there’s a bit of math behind it. Interestingly, the math is surprisingly simple, but with the exception of the number of factors in a Gaussian model I have promised not to write out any equations in this post, so in the absence of symbols I hope you will accept my reasoning.)

So now, we have the probability of every issuer independently defaulting at any given time – with that information, it is relatively straightforward to figure out the expected loss on the portfolio. In fact, it’s mainly arithmetic at this point: the value of the portfolio is just the probability-weighted average payoff of all the issuers.

And that’s really it – that’s how the Gaussian copula is used to price a CDO, or a collection of sold CDS on many issuers. We calculate the default probabilities from the CDS, then we use the Gaussian copula to tell us how they relate to each other. You’ll notice that I never actually mentioned the copula when discussing the probability model – that’s because you don’t really need it. It happens that the copula math simplifies nicely into something that is almost, but not quite, entirely unlike a copula (hey! a Douglas Adams reference!). However, the copula-based approach is more informative, even if copula-specific math per se doesn’t enter the picture.

And why is this so bad?

A few of the modeling decisions I’ve described above are unquestionably poor ones, though it may not be obvious how to improve them. Here is my brief rundown:

  • The Gaussian dependence structure – what’s wrong with it? What alternatives are there? Why are they better?
  • The single factor – is it really sufficient to describe the behavior?
  • The single correlation number – is it sufficient to describe the behavior? Can we reliably estimate more relationships? Is correlation the right metric in the first place?

I’ll attempt to answer all these and more in part III…

{ 9 comments }

On the most recent Top Chef, one of the chefs received just half a star (out of five) for his initial dish – impressive because the dish hadn’t finished cooking and wasn’t even entirely served. More interesting perhaps was his response to getting half a star:

“I was sure I’d get nothing… but that’s 50% more than I thought I would get.”

Naturally, I found the statement confusing. What’s 50% more than zero? Not a half – it’s still zero! Of course, the error is semantic rather than mathematical – he clearly meant “that’s 50% of a star [stars being the relevant unit of measurement] more than the zero stars I expected.” Still, the phrasing was awkward and highlights the dangers of measuring differences in values by percent change.

Unless there is a clear base measure and transitive measure, percent changes can be misleading. When there is a known quantity which then transforms into another quantity, then the change has a clear direction and a percent change is ok: today a stock is worth $100, tomorrow it is worth $110; its percent change is unambiguously 10% because we know which value came first (the base) and which came second (the transitive).

But what if the question were unclear, and instead of “how much higher is the stock today than yesterday” it was “how much lower was the stock yesterday than today?”

Consider an extreme case: first the stock is at $100, then it moves to $150. It ends up 50% higher than it started; conversely it started just 33% lower than it finished. These two measurements are incompatible, and unless the direction of change is very clearly established, this is a prime area for distortion via statistics. For example, a company reducing inventory can inflate their activities by reporting how much higher the level was previously as opposed to how much lower it is now. In some cases, it will be obvious which percent change should be used; in others it may not.

Recently I dealt with an issue of comparing prices discovered in a liquidity exercise. The questions of “how much higher is price Y than price X” and “how much lower is price X than price Y” were equally valid and I could not determine ahead of time which would be of interest. To eliminate any ambiguity, I disregarded percent changes altogether and used log-differences instead. Log-differences have the nice property of roughly approximating percent changes for small changes, without the asymmetry of having to choose a base and transitive value. Back to my extreme example, 100 to 150 is either a 50% or 33% change, depending on the question of interest; using logs it is ln(150)-ln(100) = 40.5%. Note that 40.5% is quite close to the geometric mean of 50% and 33%, or 40.8%. Also, it doesn’t matter if 100 or 150 is my base; the size of the change is 40.5% either way (only the sign flips). Thus, the ambiguity of using percentages to describe relative changes is eliminated.
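The figures above can be checked in a few lines. The helper names `pct_change` and `log_diff` are mine, and the 100-to-101 move is an extra example added to show the small-change case.

```python
import math

def pct_change(base, new):
    """Percent change from base to new, measured relative to base."""
    return (new - base) / base

def log_diff(a, b):
    """Log-difference ln(b) - ln(a); swapping a and b flips the sign only."""
    return math.log(b) - math.log(a)

up = pct_change(100, 150)      # 0.5
down = pct_change(150, 100)    # about -0.333
print(f"percent change up {up:.1%}, down {down:.1%}")
print(f"log-difference up {log_diff(100, 150):.1%}, down {log_diff(150, 100):.1%}")
print(f"geometric mean of the two percent sizes: {math.sqrt(up * -down):.1%}")
print(f"small move 100 -> 101: percent {pct_change(100, 101):.3%}, "
      f"log {log_diff(100, 101):.3%}")
```

The two percent changes disagree in size, while the log-difference has the same magnitude in both directions and nearly matches the percent change when the move is small.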

Unfortunately, after all that, the chef’s statement still can’t be saved – the log difference of any number and 0 is undefined (or infinite, if you prefer, since ln(0) diverges to negative infinity). So yes, this was all just an excuse to discuss logs as an alternative to percents. C’est la vie.

It’s important to note that for large changes, as in my example, logs will not approximate the “true” difference, but rather something close to the average of the two asymmetric percent changes. For small changes, they will approximate it well. The math which determines this property is related to that of stationarity in Brownian motion – for example, stock prices are modeled in finance using log-differences, not percent changes. But that’s another story.

{ 0 comments }

The WSJ has printed one of the best “fooled by randomness” pieces I’ve seen in quite a while, titled “The Triumph of the Random.” This one uses streaks in sports as a central metaphor, with DiMaggio’s 56-game hitting streak as exhibit A. It presents an immediate disclaimer:

Recent academic studies have questioned whether DiMaggio’s streak is unambiguous evidence of a spurt of ability that exceeded his everyday talent, rather than an anomaly to be expected from some highly talented player, in some year, by chance, something like the occasional 150-yard drive in golf that culminates in a hole in one. No one is saying that talent doesn’t matter. They are just asking whether a similar streak would have happened sometime in the history of baseball even if each player hit with the unheroic and unmiraculous—but steady—ability of an emotionless robot.

The lengthy article then deals with the mathematics of streaks, demonstrating that they are far more probable than we would otherwise think:

A few years ago Bill Miller of the Legg Mason Value Trust Fund was the most celebrated fund manager on Wall Street because his fund outperformed the broad market for 15 years straight. It was a feat compared regularly to DiMaggio’s, but if all the comparable fund managers over the past 40 years had been doing nothing but flipping coins, the chances are 75% that one of them would have matched or exceeded Mr. Miller’s streak.
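The coin-flipping claim can be explored with an exact calculation. This is a sketch, not a reconstruction of the article's arithmetic: the number of comparable managers is a placeholder of mine, each manager is assumed to flip a fair coin every year independently, and the function names are hypothetical.

```python
def prob_streak(n_trials, streak_len, p_success):
    """Probability of at least one run of `streak_len` successes in n_trials.

    Dynamic programme over the length of the current run of successes.
    """
    # run[j] = probability that no full streak has occurred yet and the
    # current run of successes has length j
    run = [0.0] * streak_len
    run[0] = 1.0
    done = 0.0
    for _ in range(n_trials):
        nxt = [0.0] * streak_len
        for j, prob in enumerate(run):
            nxt[0] += prob * (1 - p_success)       # a failure resets the run
            if j + 1 == streak_len:
                done += prob * p_success           # streak completed
            else:
                nxt[j + 1] += prob * p_success
        run = nxt
    return done

def prob_any_manager(n_managers, n_years, streak_len, p_success=0.5):
    """Chance that at least one of n independent managers has such a streak."""
    single = prob_streak(n_years, streak_len, p_success)
    return 1 - (1 - single) ** n_managers

single = prob_streak(40, 15, 0.5)
print(f"one coin-flipper, 15 straight wins within 40 years: {single:.4%}")
for n in (100, 1000, 3000):  # hypothetical pool sizes
    print(f"{n:>5} managers: {prob_any_manager(n, 40, 15):.1%}")
```

For any single coin-flipper the streak is very unlikely, roughly 0.04% under these assumptions, but the chance that somebody in a pool of a few thousand manages it is substantial. The article's 75% figure depends on its own count of comparable managers, which is not given here.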

Next, it moves to psychology and describes the way in which humans seek patterns in randomness as a grounding mechanism, with a nice segue by way of my favorites, Kahneman and Tversky; Tversky co-authored (with Gilovich and Vallone) a seminal paper on hot hands in basketball:

If a person tossing a coin weighted to land on heads 80% of the time produces a streak of 10 heads in a row, few people would see that as a sign of increased skill. Yet when an 80% free throw shooter in the NBA has that level of success people have a hard time accepting that it isn’t. The Cognitive Psychology paper, and the many that followed, showed that despite appearances, the “hot hand” is a mirage. Such hot and cold streaks are identical to those you would obtain from a properly weighted coin.

Finally, it deals with the perception of random events:

Why do people have a hard time accepting the slings and arrows of outrageous fortune? One reason is that we expect the outcomes of a process to reflect the underlying qualities of the process itself. For example, if an initiative has a 60% chance of success, we expect that six out of every 10 times such an initiative is undertaken, it will succeed. That, however, is false.
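The 60% example is a one-line binomial calculation, assuming ten independent attempts that each succeed with probability 0.6. The helper name `binom_pmf` is mine.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

exactly_six = binom_pmf(6, 10, 0.6)
print(f"P(exactly 6 of 10 succeed) = {exactly_six:.1%}")
print(f"P(some other count)        = {1 - exactly_six:.1%}")
```

Six successes is the single most likely outcome, yet it happens only about a quarter of the time; the other three quarters of the time the tally looks "wrong" even though the underlying 60% never changed.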

A critical conclusion is laid out:

We find false meaning in the patterns of randomness for good reason: we are animals built to do just that… Many studies illustrate how this basic aspect of human nature translates to a misperception of chance.

Truly an excellent read and I can’t recommend it more.

{ 1 comment }

How did I miss this?

July 1, 2009 in Math

In a post called “So Long and Thanks for All the F-Tests”, Freakonomics writes about a new book called Mostly Harmless Econometrics: An Empiricist’s Companion, which they describe as:

…the rare book that captures the feeling of how to go about trying to attack an empirical question; and it does this by working through two or three dozen of the neatest empirical papers of the last decade…. It is also peppered with references to Douglas Adams’s writing — so what’s not to like?

Now, wait a second. TGR is peppered with Douglas Adams references. TGR has a post titled “F-tests begone!” How did I miss this?

(Freakonomics’ post title and the book’s title are themselves HHG2G references. I have refrained… for the moment.)

{ 1 comment }

On teaching math

June 29, 2009 in Math

Arthur Benjamin gives a short (3 minute) TED talk on the problems with how math is taught to high school students in America. He notes that the current curriculum is a sequence beginning with arithmetic and leading to the ultimate goal of calculus. But calculus isn’t something most people use once they graduate – how often have you heard students argue against math homework by saying, “But when am I ever going to need to know this?”

Instead, Benjamin suggests replacing calculus with statistics, noting “that’s a subject that you could – and should – use on a daily basis.” Moreover:

The world has changed from analog to digital. It’s time for our mathematics curriculum to change from analog to digital. From the more classical continuous mathematics to the more modern discrete mathematics – the mathematics of uncertainty, of randomness, of data – that being probability and statistics.

{ 0 comments }

Inferred ratings and modelling teacher comments

June 24, 2009

Another aspect of my conversation dealt with inferred ratings, a problem I’ve crossed before in other areas. There are two primary cases in which this arises: censored data and self-selection bias. In the first case of censored data, a problem is caused by the ratings system not eliciting useful responses. An example is a system [...]


Personalized Yelp ratings

June 24, 2009

I had a great conversation last night which at one point verged into the pros and cons of various ratings systems. In particular, we discussed the “star+comment” system used by Yelp, in which between 1 and 5 stars can be assigned in addition to a text comment of arbitrary length. Yelp does some clever things [...]


Misreading misleading charts: entrepreneur edition

June 18, 2009

Paul Kedrosky writes about a study on the rate of entrepreneurship among various age groups, which includes the following piece of junk (ch)art: Why is this chart 3D? It contains information in only two spatial dimensions (time and rate), with a third dimension coded by color. To make the chart itself is a purely superfluous [...]


Wilmott’s stages of derivatives

June 18, 2009

Wilmott adapts the Kübler-Ross stages of grief to describe derivatives. An excellent read. Confused disbelief: I’m a great believer in education playing a bigger role in derivatives in future. But not the sort of education that we’ve got at the moment. I understand Warren Buffett when he says “The more symbols they could work into their writing [...]


When worlds collide

June 17, 2009

I just learned from Andrew Gelman that Mandelbrot wrote a paper on taxonomies… in 1955.


Random forecasts (with echoes!)

June 15, 2009

And speaking of forecasts, I’m reminded today of one of my favorite forecasting errors: the echo. This morning, the manufacturing survey missed the forecasted amount, and many pundits commented that it contributed heavily to the market’s fall. Here is a plot of the manufacturing survey level as reported each month in red (prior to any [...]

0 comments Read the full post →

Copulas in squash

June 15, 2009

Ball marks on a squash court form an interesting scatterplot.

0 comments Read the full post →

The science of… traffic jams

June 15, 2009

Paul Kedrosky has shared a video demonstrating how traffic jams self-propagate. New Scientist refers to such incidents as “shockwave traffic jams.” (This fascinates me to the point that I once wrote a paper on it.)

1 comment Read the full post →

Illustrating the importance of data visualization

June 12, 2009

Andrew Gelman discusses research on attitudes toward gay marriage, by state, and notes this graph in particular, which shows the change in opinion over the last 15 years: Critically, he points out that the states which experienced the greatest change in attitude were the ones that already were most receptive. A naive analysis of the [...]

0 comments Read the full post →

Deconstructing the Gaussian copula, part I

June 5, 2009

A number of misconceptions about the Gaussian copula are addressed.

8 comments Read the full post →

Lies, damn lies…

June 2, 2009

A fascinating look at the politics of government statistics from Carl Bialik’s WSJ blog.

0 comments Read the full post →

Nessie?

May 30, 2009

Via TangYauHoong.

0 comments Read the full post →

LARS and the lasso

May 28, 2009

I just came across a paper on LARS, the linear model selection algorithm that’s sweeping the nation. The mathematically and/or masochistically inclined may view it here.* Ok, so it’s not quite that popular, but it is being heralded as one of the biggest advances in linear modelling in a few decades – and that’s saying a [...]

1 comment Read the full post →

Don’t know much about calculus

May 28, 2009

Another excellent guest column by Steven Strogatz for the NYT Wild Side blog. The post delves into the mathematical beauty of the natural world, using love as a knowingly over-simplified metaphor. Although these examples are whimsical, the equations that arise in them are of the far-reaching kind known as differential equations. They represent the most [...]

0 comments Read the full post →

Urban mathematics

May 20, 2009

Zipf’s law is another mathematical phenomenon not entirely unrelated to Benford’s law (in fact, some think that Benford is a special case of Zipf). (As an aside, it’s funny how after you discuss something, it seems to pop up everywhere – Kahneman and Tversky would have a lot to say on that, I’m sure.) Zipf’s law is [...]

0 comments Read the full post →

Visualizing randomness

May 19, 2009

Daniel Becker’s diploma dissertation was on the visualization of randomness – finding concrete ways to map the highly abstract idea of random behaviors and patterns. The resulting portfolio is fascinating, even for someone without a statistical background, in particular for the way in which it lends a semblance of order to these inherently chaotic processes. [...]

2 comments Read the full post →

F-tests begone!

May 18, 2009

Andrew Gelman just wrote a blog post regarding interaction terms in multiple regression and concludes: You never have to do an F test. Just forget about that stuff! I find it incredibly refreshing when someone (a professor, no less!) is willing to cut through the math and get down to common sense. And I particularly hate [...]

1 comment Read the full post →

On graphing horse races

May 8, 2009

In response to Andrew Gelman’s call for interesting visualizations of the Kentucky Derby, Megan Pledger created the following graph: I think it’s especially interesting because the data is fictional, based on a few simple rules to simulate horse behavior (that’s right – this is just like a single realization of a Monte Carlo process!). Andrew [...]

0 comments Read the full post →

Monte Carlo: house of cards?

May 8, 2009

The WSJ recently ran a piece on Monte Carlo risk management: Here is how a typical Monte Carlo retirement-planning tool might work: The user enters information about his age, earnings, assets, retirement-plan contributions, investment mix and other details. The calculator crunches the numbers on hundreds or thousands of potential market scenarios, guided by assumptions about inflation, [...]

0 comments Read the full post →

A Derivation of Benford’s Law or: Roll Your Own Fraud Detector

May 1, 2009

An explanation of Benford’s law, which describes how frequently certain first digits should appear.

1 comment Read the full post →