Posts tagged as: Bayesian statistics

More mainstream Bayesians

December 20, 2009 in Math

The NYT recently ran an article on the math behind the controversial change to the mammogram advisory. Unsurprisingly, it centers on a Bayesian argument. Of course, the key point here is not that the statistics dictated the change, but that budgets and political agendas dictated an acceptable level, which the statistics subsequently informed:

Let’s suppose 100,000 screenings for this cancer are conducted. Of these, how many are positive? On average, 500 of these 100,000 people (0.5 percent of 100,000) will have cancer, and so, since 95 percent of these 500 people will test positive, we will have, on average, 475 positive tests (.95 x 500). Of the 99,500 people without cancer, 1 percent will test positive for a total of 995 false-positive tests (.01 x 99,500 = 995). Thus of the total of 1,470 positive tests (995 + 475 = 1,470), most of them (995) will be false positives, and so the probability of having this cancer given that you tested positive for it is only 475/1,470, or about 32 percent! This is to be contrasted with the probability that you will test positive given that you have the cancer, which by assumption is 95 percent.
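The arithmetic in the quoted passage is easy to check by counting people. Here is a quick Python sketch that does so; the function name and parameter names are my own, and the inputs are simply the figures the article assumes (0.5 percent prevalence, 95 percent detection rate, 1 percent false-positive rate).

```python
def screening_breakdown(screened, prevalence, sensitivity, false_positive_rate):
    """Return (true positives, false positives, P(cancer | positive test))."""
    with_cancer = screened * prevalence
    without_cancer = screened - with_cancer
    # People with cancer who test positive
    true_positives = with_cancer * sensitivity
    # People without cancer who test positive anyway
    false_positives = without_cancer * false_positive_rate
    posterior = true_positives / (true_positives + false_positives)
    return true_positives, false_positives, posterior


tp, fp, posterior = screening_breakdown(100_000, 0.005, 0.95, 0.01)
print(f"{tp:.0f} true positives, {fp:.0f} false positives")
print(f"P(cancer | positive) = {posterior:.1%}")
```

Running it reproduces the article's 475 and 995, and a probability of roughly 32 percent.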

{ 0 comments }

Living in a Bayesian world

October 30, 2009 in Math

Increasingly, I’ve noted in my discussions with statisticians and practitioners a reliance on Bayesian methods. Bayesian statistics rely on an understanding of the uncertainty of a hypothesis: the probability assigned to a hypothesis is literally updated as new information becomes available. Bayesian analyses also rely heavily on conditional probabilities, or the understanding of likelihoods that depend on the occurrence of related events. One of the biggest Bayesian proponents is Professor Andrew Gelman, who maintains an excellent blog and is involved in fivethirtyeight.com.

In some ways, Bayesian methods have become a bit fad-like and, as with many fads (I’m looking at you, VaR), there should be concern that they will be applied blindly, without thought. Like anything else, it’s possible to do Bayesian statistics wrong – and even extremely wrong – but when wielded correctly, they make for an excellent investigative resource.

New Scientist has an article on the use – and misuse – of probability in criminal cases. Naturally, it focuses on Bayesian statistics. The key point the article makes is that while it’s important to consider the odds of something happening, it is just as critical to account for the odds of it happening by chance. That may seem contradictory (isn’t an event’s likelihood, by definition, the probability it happens by chance?) so let’s use a classic example, lifted from the article:

You have just tested positive for a disease that affects 1 in every 10,000 people. The test is 99% accurate. On the surface, that sounds like a sound diagnosis, and most people would say they are 99% confident that they do, in fact, have the disease. But consider the following: if all 10,000 people took the same test, then on average 1 of them would yield a true positive and about 100 more (1% of the 9,999 healthy people) would exhibit false positives just by chance. Therefore, among people who have tested positive, there is only about a 1% chance of actually having the disease – not the 99% likelihood we naively assumed before!

How does that work – wasn’t there only a 1% chance of the test being wrong? Well, yes – but if you think about it, that 1% chance of error is much larger than the 0.01% chance of having the disease in the first place and the test result must be placed in that context. For the more spatial readers, here is a picture from New Scientist:

The false positive problem is a classic textbook example of how Bayesian reasoning (that is, accounting for the ways in which chance can manifest itself) can affect a seemingly obvious result. It’s a very important consideration which could be overlooked without care. And besides, it makes for interesting pop sci articles.
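For anyone who wants to check the disease-test numbers, here is a minimal Python sketch of Bayes' rule. It assumes that "99% accurate" means both that the test catches 99% of true cases and that it wrongly flags 1% of healthy people; the article doesn't spell that out, and the function name is mine.

```python
def posterior_given_positive(prior, sensitivity, false_positive_rate):
    """P(disease | positive test) via Bayes' rule."""
    # Probability of being sick AND testing positive
    true_pos = prior * sensitivity
    # Probability of being healthy AND testing positive
    false_pos = (1 - prior) * false_positive_rate
    return true_pos / (true_pos + false_pos)


p = posterior_given_positive(prior=1 / 10_000, sensitivity=0.99,
                             false_positive_rate=0.01)
print(f"P(disease | positive) = {p:.2%}")
```

Under those assumptions the result comes out just under 1%, which matches the counting argument: the false positives swamp the single true positive.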

{ 0 comments }

Via economist Dan Ariely’s blog, this is what Isaac Asimov thought about perceiving the world through data. It is an implicitly Bayesian approach and brings to mind the famous Keynes quote about changing one’s mind. Asimov wrote:

“Don’t you believe in flying saucers, they ask me? Don’t you believe in telepathy? — in ancient astronauts? — in the Bermuda triangle? — in life after death?

No, I reply. No, no, no, no, and again no.

One person recently, goaded into desperation by the litany of unrelieved negation, burst out ‘Don’t you believe in anything?’

‘Yes’, I said. ‘I believe in evidence. I believe in observation, measurement, and reasoning, confirmed by independent observers. I’ll believe anything, no matter how wild and ridiculous, if there is evidence for it. The wilder and more ridiculous something is, however, the firmer and more solid the evidence will have to be.’”

{ 0 comments }

Google Reader has become an indispensable part of my daily life. It’s the only way I can keep up with the amount of reading I do each day, and as much as I love the service, there are a few things I miss.

Here’s my wishlist for Google Reader:

Intelligent favorites: Right now, I have a “favorites” folder, which includes feeds I designate as (drumroll) my favorites. My Reader loads the favorites at startup. Determining my favorite feeds automatically would be a trivial exercise for a Bayesian filter (the same sort of mechanism that decides whether email is spam or not). It could even be time sensitive, so that a feed I stopped reading a few months ago wouldn’t be included.

Intelligent presentation: Right now, Reader has a sort setting called “auto” which moves feeds that post infrequently to the top of the list. This is a nice start, but I think a few extra steps are needed before I make this my default sort. First, the algorithm boosts posts from a little too far back and puts them a little too high on my list. For some folders this works, for others it does not, depending on the rate of publishing. At a minimum, I wish I could adjust the settings. Relatedly, perhaps a different sort method is not the best way of presenting this information – an alternative would be fading out “less interesting” posts, making the posts that I’m more likely to want to read the ones I’m more likely to see as I browse.

Intelligent relevance: Related to the sorting method, frequency of posting is not necessarily how I determine relevance. I would also implement a Bayesian filter here to more intelligently guess what items I find interesting. But that’s not the only way – one of the benefits of a central aggregation system is that Google knows how many other people have read, starred, shared or commented (via Reader) on each post I subscribe to. Surely this should be indicative of relevance, to the extent that my behavior tends to mirror that of other readers.

Saved searches: Let me save searches the way Outlook does, in dynamic folders. This way, I could create a dynamic “Mets” folder which would include a post from a finance blog that nonetheless mentioned David Wright.

Better searches: And while I’m searching, this is a Google product, so why is search so limited? Let me restrict my search by author, title, or content, and let me sort by relevance instead of recency. Time isn’t necessarily the principal component of my search.

Grouping posts: Frequently, the same story is reported by various sources. A quick semantic analysis should be able to identify these posts and group them together, preferably with one of my preferred feeds as the top item. Google News does it. This is a little different from what Gmail does, however, since conversation tracking links emails that are explicitly related and this needs to imply similarity.

Filtering: What if I want to get Engadget’s feed, except for posts that dare mention Apple? Let me set up filters to customize my feeds. Allowing me to save advanced searches would accomplish the same goal, since search recognizes operators like + and - and I can restrict my search to a specific feed. In the meantime, services like Feed Rinse and Yahoo Pipes are my options.

And that’s all I’ve got off the top of my head. Basically, a mix of applied machine learning and explicit parameter definitions aimed at making Google Reader something more than just a chronological list of syndicated news. In particular, the trouble with Google’s current autosort is that when I turn it on, I get the feeling that all is not quite right. A good behind-the-scenes relevance engine will feel “right” because it aligns with what I want to see. Unfortunately, one can’t always depend on users to express what they feel, which is why explicitly defined filtering systems often fail (or at least are suboptimal). Bayesian filters and the like have the advantage of learning behavior; their development and implementation are nothing new and I think there are few areas begging to be addressed in this way as much as Google Reader.
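To make the "Bayesian filter" idea a little more concrete, here is a toy naive Bayes classifier in Python, the same family of technique used by classic spam filters. It is only a sketch under my own assumptions: the class name, the two labels, and the use of post titles as the only signal are all invented for illustration, and nothing here reflects how Google Reader actually works.

```python
import math
from collections import Counter


class InterestFilter:
    """Toy naive Bayes filter: learns from posts I read vs. posts I skipped."""

    LABELS = ("interesting", "boring")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.post_counts = Counter()

    def train(self, title, label):
        """Record one post title under 'interesting' or 'boring'."""
        self.post_counts[label] += 1
        self.word_counts[label].update(title.lower().split())

    def prob_interesting(self, title):
        """Estimate P(interesting | words in title)."""
        vocab = set()
        for counts in self.word_counts.values():
            vocab.update(counts)
        total_posts = sum(self.post_counts.values())
        log_scores = {}
        for label in self.LABELS:
            # Smoothed prior: how often I find posts interesting at all
            score = math.log((self.post_counts[label] + 1) / (total_posts + 2))
            total_words = sum(self.word_counts[label].values())
            for word in title.lower().split():
                # Add-one smoothing so unseen words don't zero out the product
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocab) + 1))
            log_scores[label] = score
        # Convert log scores back to a normalized probability
        top = max(log_scores.values())
        weights = {k: math.exp(v - top) for k, v in log_scores.items()}
        return weights["interesting"] / sum(weights.values())


f = InterestFilter()
f.train("mets win david wright homers", "interesting")
f.train("david wright injured", "interesting")
f.train("apple announces new iphone", "boring")
f.train("apple iphone rumor", "boring")
print(f.prob_interesting("david wright trade"))  # well above 0.5
print(f.prob_interesting("new apple iphone"))    # well below 0.5
```

The appeal is that I never have to state a rule like "hide Apple posts"; the filter picks it up from what I read and skip. A real version would want more signals than title words (feed, author, time of day) and some decay so that old habits fade.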

What would you change?

{ 0 comments }

Personalized Yelp ratings

June 24, 2009 in Data, Math

I had a great conversation last night which at one point verged into the pros and cons of various ratings systems. In particular, we discussed the “star+comment” system used by Yelp, in which between 1 and 5 stars can be assigned in addition to a text comment of arbitrary length.

Yelp does some clever things with their rankings, rather than just naively displaying restaurants with higher average ratings above ones with lower ratings. Most notably, I believe, they use a Bayesian process to assess the accuracy of the mean review. Thus, a 4 star rating based on 100 reviews could be presented above a 5 star rating based on 5 reviews, since there is uncertainty about the veracity of the 5 stars. On top of this, they take into account the people who have left comments (presumably adjusting for other reviews that person has given) as well as the content of the review comments.
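I don't know what Yelp actually does, but one common way to get this behavior is a "Bayesian average", which shrinks each restaurant's mean toward a site-wide prior until enough reviews accumulate. Here is a short Python sketch; the prior mean of 3.5 stars and the prior weight of 20 "phantom" reviews are numbers I made up for illustration.

```python
def bayesian_average(mean_rating, num_reviews, prior_mean=3.5, prior_weight=20):
    """Shrink a restaurant's mean rating toward a site-wide prior.

    prior_weight acts like a number of phantom reviews at prior_mean;
    both defaults are illustrative guesses, not Yelp's values.
    """
    total = prior_weight * prior_mean + num_reviews * mean_rating
    return total / (prior_weight + num_reviews)


well_reviewed = bayesian_average(4.0, 100)  # 4 stars from 100 reviews
barely_reviewed = bayesian_average(5.0, 5)  # 5 stars from 5 reviews
print(f"{well_reviewed:.2f} vs {barely_reviewed:.2f}")
```

With these settings the 4 star restaurant scores about 3.92 and the 5 star one about 3.80, so the better-established place ranks higher. The ordering does depend on the prior weight: with a weak enough prior, the 5 star newcomer would still come out on top.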

Here’s a feature I’d like to see: adjust the rating to account for how Yelp predicts I would rate that restaurant. Let’s say I’m looking at a certain restaurant, which has 4 stars. If in the past I tended to disagree with the people who have reviewed this restaurant, then perhaps it should be presented as a 3 or 2 star choice to me. Or perhaps I rate Italian restaurants very highly but hate sushi; even the highest-rated sushi place on Yelp should be given a low rating when I view it. Or perhaps I like small restaurants, or cheap restaurants – give those categories a ratings boost when I view them.
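As a rough illustration of the first idea (weighting reviewers by how much I have agreed with them before), here is a Python sketch of a similarity-weighted average, a very simple form of collaborative filtering. The data layout, the function name, and the agreement formula are all my own assumptions, not anything Yelp exposes.

```python
def personalized_rating(my_ratings, reviewers, restaurant):
    """Predict my rating as an agreement-weighted average of others' ratings.

    my_ratings: {restaurant_name: stars}
    reviewers:  {reviewer_name: {restaurant_name: stars}}
    Returns None when no reviewer shares any history with me.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for ratings in reviewers.values():
        if restaurant not in ratings:
            continue
        shared = [r for r in ratings if r in my_ratings and r != restaurant]
        if not shared:
            continue  # no common history, so no basis for a weight
        # Average star gap on shared restaurants: 0 (identical) to 4 (opposite)
        gap = sum(abs(ratings[r] - my_ratings[r]) for r in shared) / len(shared)
        weight = 1.0 - gap / 4.0
        weighted_sum += weight * ratings[restaurant]
        total_weight += weight
    if total_weight == 0:
        return None
    return weighted_sum / total_weight


me = {"Trattoria": 5, "Sushi Bar": 1}
others = {
    "alice": {"Trattoria": 5, "Sushi Bar": 1, "New Place": 2},
    "bob": {"Trattoria": 1, "Sushi Bar": 5, "New Place": 5},
}
print(personalized_rating(me, others, "New Place"))
```

In this made-up example the plain average for "New Place" is 3.5 stars, but since I have always agreed with alice and never with bob, the personalized figure is alice's 2 stars.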

There are a few caveats to this process: first, it requires me to have a reliable ratings history. This is just a necessary way to let Yelp know who I am. Second, the change doesn’t have to be dramatic – even a subtle shift in presented ratings could make a big difference to me. Finally, there are systemic effects at work. If a restaurant is dirty, or its staff rude, then everyone will feel that way whether they’ve agreed in the past or not. These have to be accounted for.

On the whole this should be a relatively easy thing to implement for anyone with a reliable ratings history – and Yelp has plenty of those. For all I know, this would be a case of overfitting and have little real impact – but I think it’s intriguing enough to try.


{ 1 comment }