<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>This is the Green Room &#187; Math</title>
	<atom:link href="http://www.thisisthegreenroom.com/category/math/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.thisisthegreenroom.com</link>
	<description>turn to page three hundred and ninety-four</description>
	<lastBuildDate>Tue, 07 Feb 2012 21:43:02 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
<cloud domain='www.thisisthegreenroom.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
		<item>
		<title>High tech&#039;s hottest calling</title>
		<link>http://www.thisisthegreenroom.com/2012/high-techs-hottest-calling/</link>
		<comments>http://www.thisisthegreenroom.com/2012/high-techs-hottest-calling/#comments</comments>
		<pubDate>Thu, 26 Jan 2012 23:45:26 +0000</pubDate>
		<dc:creator>J</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Math]]></category>

		<guid isPermaLink="false">http://www.thisisthegreenroom.com/?p=4480</guid>
		<description><![CDATA[The NYT's Bits blog has a new post on "high tech’s hottest calling:" statistical analysis. The article isn't just about the jobs market, focusing as well on students' increased demand for statistics classes at top universities. The opening anecdote will be familiar to anyone in the field: “Most of my life I went to parties [...]]]></description>
			<content:encoded><![CDATA[<p>The NYT's Bits blog has a <a href="http://bits.blogs.nytimes.com/2012/01/26/what-are-the-odds-that-stats-would-get-this-popular/?pagewanted=all">new post</a> on "high tech’s hottest calling:" statistical analysis. The article isn't just about the jobs market, focusing as well on students' increased demand for statistics classes at top universities.</p>
<p>The opening anecdote will be familiar to anyone in the field:</p>
<blockquote><p>“Most of my life I went to parties and heard a little groan when people heard what I did,” says Robert Tibshirani, a statistics professor at Stanford University. “Now they’re all excited to meet me.”</p></blockquote>
<p>But the observation that follows it is quite serious:</p>
<blockquote><p>Computing has become cheap and available enough to process any number of formulas.... What no one has are enough people to figure out the valuable patterns that lie inside the data.</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://www.thisisthegreenroom.com/2012/high-techs-hottest-calling/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Have your math and eat it, too</title>
		<link>http://www.thisisthegreenroom.com/2012/have-your-math-and-eat-it-too/</link>
		<comments>http://www.thisisthegreenroom.com/2012/have-your-math-and-eat-it-too/#comments</comments>
		<pubDate>Thu, 12 Jan 2012 02:01:33 +0000</pubDate>
		<dc:creator>J</dc:creator>
				<category><![CDATA[Math]]></category>
		<category><![CDATA[food]]></category>
		<category><![CDATA[Mathematica]]></category>
		<category><![CDATA[modeling]]></category>
		<category><![CDATA[pasta]]></category>

		<guid isPermaLink="false">http://www.thisisthegreenroom.com/?p=4436</guid>
		<description><![CDATA[Here are two of my favorite things, unexpectedly combined: This is from the slideshow accompanying a brief NYT article on an unusual book called Pasta by Design. The book is about, yes, modeling pasta in Mathematica. (via FlowingData)]]></description>
			<content:encoded><![CDATA[<p>Here are two of my favorite things, unexpectedly combined:</p>
<p style="text-align: center;"><img class="aligncenter  wp-image-4437" title="Pasta by Design" src="http://www.thisisthegreenroom.com/wordpress/wp-content/uploads/2012/01/Screen-Shot-2012-01-11-at-8.56.21-PM.png" alt="" width="582" height="443" /></p>
<p>This is from the <a href="http://www.nytimes.com/interactive/2012/01/10/science/20120110_pasta.html?pagewanted=all">slideshow</a> accompanying a <a href="http://www.nytimes.com/2012/01/10/science/pasta-inspires-scientists-to-use-their-noodle.html?_r=1&amp;pagewanted=all">brief NYT article</a> on an unusual book called <em><a href="http://www.amazon.com/Pasta-Design-George-L-Legendre/dp/0500515808/">Pasta by Design</a></em>. The book is about, yes, modeling pasta in Mathematica.</p>
<p><em>(via <a href="http://flowingdata.com/2012/01/10/geometry-of-pasta/">FlowingData</a>)</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.thisisthegreenroom.com/2012/have-your-math-and-eat-it-too/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Brilliant</title>
		<link>http://www.thisisthegreenroom.com/2011/brilliant/</link>
		<comments>http://www.thisisthegreenroom.com/2011/brilliant/#comments</comments>
		<pubDate>Mon, 28 Nov 2011 02:19:25 +0000</pubDate>
		<dc:creator>J</dc:creator>
				<category><![CDATA[Math]]></category>
		<category><![CDATA[artificial intelligence]]></category>
		<category><![CDATA[funny]]></category>
		<category><![CDATA[humor]]></category>
		<category><![CDATA[science]]></category>
		<category><![CDATA[thought experiment]]></category>
		<category><![CDATA[video]]></category>

		<guid isPermaLink="false">http://www.thisisthegreenroom.com/?p=4419</guid>
		<description><![CDATA[Six famous thought experiments, each presented in 60 seconds: Reminds me a bit of the Peabody and Sherman sketches from the old Rocky and Bullwinkle show...]]></description>
			<content:encoded><![CDATA[<p>Six famous thought experiments, each presented in 60 seconds:</p>
<p><a href="http://www.thisisthegreenroom.com/2011/brilliant/"><em>Click here to view the embedded video.</em></a></p>
<p>Reminds me a bit of the Peabody and Sherman sketches from the old Rocky and Bullwinkle show...</p>
]]></content:encoded>
			<wfw:commentRss>http://www.thisisthegreenroom.com/2011/brilliant/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Stat is magic</title>
		<link>http://www.thisisthegreenroom.com/2011/stat-is-magic/</link>
		<comments>http://www.thisisthegreenroom.com/2011/stat-is-magic/#comments</comments>
		<pubDate>Thu, 13 Oct 2011 16:40:48 +0000</pubDate>
		<dc:creator>J</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Math]]></category>
		<category><![CDATA[copula]]></category>
		<category><![CDATA[magic]]></category>
		<category><![CDATA[statistics]]></category>

		<guid isPermaLink="false">http://www.thisisthegreenroom.com/?p=4401</guid>
		<description><![CDATA[I really love the latest post on Lessons from my Twenties, called Stat Is Magic. Sometimes, things are better left as magic.]]></description>
			<content:encoded><![CDATA[<p>I really love the latest post on <a href="http://lessonsfrommytwenties.tumblr.com/">Lessons from my Twenties</a>, called <a href="http://lessonsfrommytwenties.tumblr.com/post/11397955349/lesson-12-stat-is-magic">Stat Is Magic</a>.</p>
<p>Sometimes, things are better left as magic.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.thisisthegreenroom.com/2011/stat-is-magic/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Elegance</title>
		<link>http://www.thisisthegreenroom.com/2011/elegance/</link>
		<comments>http://www.thisisthegreenroom.com/2011/elegance/#comments</comments>
		<pubDate>Mon, 19 Sep 2011 08:28:17 +0000</pubDate>
		<dc:creator>J</dc:creator>
				<category><![CDATA[Math]]></category>
		<category><![CDATA[Quotes]]></category>
		<category><![CDATA[elegance]]></category>
		<category><![CDATA[Erdos]]></category>
		<category><![CDATA[numbers]]></category>
		<category><![CDATA[proof]]></category>
		<category><![CDATA[puzzle]]></category>
		<category><![CDATA[quote]]></category>
		<category><![CDATA[Tom Lehrer]]></category>

		<guid isPermaLink="false">http://www.thisisthegreenroom.com/?p=4314</guid>
		<description><![CDATA[I'm a huge fan of Tom Lehrer and have mentioned him a number of times before. I just came across an interview with him from 2000 in which he discussed his dual life as a mathematician and performer. I especially loved this quote, on the concept of "elegance" in mathematics: I think the construction part, [...]]]></description>
			<content:encoded><![CDATA[<p>I'm a huge fan of Tom Lehrer and have mentioned him <a href="http://www.thisisthegreenroom.com/?s=Tom+Lehrer">a number of times</a> before. I just came across an <a href="http://www.sfweekly.com/2000-04-19/news/that-was-the-wit-that-was/">interview</a> with him from 2000 in which he discussed his dual life as a mathematician and performer. I especially loved this quote, on the concept of "elegance" in mathematics:</p>
<blockquote><p>I think the construction part, the math, how to say it, the logical mind, the precision, is the same that's involved in math as in lyrics.... And I guess in music too. It's gotta come out right. It's like a puzzle, to write a song. The idea of fitting all the pieces so it exactly comes right, the right word at the end of the sentence, and the rhyme goes there and not there. Mathematicians, as opposed to natural scientists, are so interested in elegance. That's the word you hear in mathematics all the time. "This proof is elegant!" It doesn't really matter what it proves. "Look at this -- isn't that amazing!" And it comes out at the end. It's neat. It's not just that it's proof, because there's plenty of proofs that are just boring proofs. But every now and then there's a really elegant proof.</p></blockquote>
<p>I have tried many times to capture and define the euphoria of achieving some kind of mathematical beauty or elegance -- almost always in vain. <a href="http://en.wikipedia.org/wiki/Paul_Erd%C5%91s">Erdos</a> said it well:</p>
<blockquote><p>Why are numbers beautiful? It's like asking why is Beethoven's Ninth Symphony beautiful. If you don't see why, someone can't tell you.</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://www.thisisthegreenroom.com/2011/elegance/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>What&#039;s a 4?</title>
		<link>http://www.thisisthegreenroom.com/2011/whats-a-4/</link>
		<comments>http://www.thisisthegreenroom.com/2011/whats-a-4/#comments</comments>
		<pubDate>Mon, 19 Sep 2011 07:19:50 +0000</pubDate>
		<dc:creator>J</dc:creator>
				<category><![CDATA[Math]]></category>
		<category><![CDATA[binary]]></category>
		<category><![CDATA[comic]]></category>
		<category><![CDATA[humor]]></category>
		<category><![CDATA[probability]]></category>
		<category><![CDATA[xkcd]]></category>

		<guid isPermaLink="false">http://www.thisisthegreenroom.com/?p=4305</guid>
		<description><![CDATA[I'm starting to feel this way sometimes.]]></description>
			<content:encoded><![CDATA[<p>I'm starting to feel this way sometimes:</p>
<p><a href="http://xkcd.com/953/"><img class="aligncenter" title="If you get an 11/100 on a CS test, but you claim it should be counted as a 'C', they'll probably decide you deserve the upgrade." src="http://imgs.xkcd.com/comics/1_to_10.png" alt="1 to 10" width="203" height="309" /></a></p>
<p><em>(via <a href="http://xkcd.com/953/">XKCD</a>)</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.thisisthegreenroom.com/2011/whats-a-4/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Bayes, prior to reading</title>
		<link>http://www.thisisthegreenroom.com/2011/bayes-prior-to-reading/</link>
		<comments>http://www.thisisthegreenroom.com/2011/bayes-prior-to-reading/#comments</comments>
		<pubDate>Tue, 16 Aug 2011 06:19:44 +0000</pubDate>
		<dc:creator>J</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Math]]></category>
		<category><![CDATA[applied statistics]]></category>
		<category><![CDATA[Bayes]]></category>
		<category><![CDATA[frequentist]]></category>
		<category><![CDATA[interview]]></category>
		<category><![CDATA[statistics]]></category>

		<guid isPermaLink="false">http://www.thisisthegreenroom.com/?p=4199</guid>
		<description><![CDATA[I may have to go pick up this book, which was reviewed in the NYT last week, if only because it opens with a favorite quote from Keynes. Titled The Theory That Would Not Die: How Bayes' Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of Controversy (Wow, titles are getting [...]]]></description>
			<content:encoded><![CDATA[<p>I may have to go pick up <a href="http://www.amazon.com/Theory-That-Would-Not-Die/dp/0300169698">this book</a>, which was <a href="http://www.nytimes.com/2011/08/07/books/review/the-theory-that-would-not-die-by-sharon-bertsch-mcgrayne-book-review.html?_r=1&amp;ref=books&amp;pagewanted=all">reviewed</a> in the NYT last week, if only because it opens with a <a href="http://www.thisisthegreenroom.com/2009/it-is-better-to-be-roughly-right-than-precisely-wrong/">favorite quote from Keynes</a>. Titled <em>The Theory That Would Not Die: How Bayes' Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of Controversy</em> (Wow, titles are getting ridiculously long! Is the front cover the new inside flap?), it is an overview of Bayesian statistics told through its history and applications.</p>
<p>I have a complicated relationship with Bayes: I have enormous respect for his work and the theory built on it, but extraordinary distrust of its practitioners. The mathematics of Bayes' theorem are unassailable, but its Achilles heel is the assumption of a prior distribution. Yes, you get the "right" answer with the right prior, but how do you choose that prior? I am confident that if I were a Bayesian statistician I would have a snappy and confident response, but I'm not. Instead, I'm merely a strongly applied statistician who has endured many interviews containing the following exchange:</p>
<blockquote><p>J: So you use Bayesian statistics. Tell me how you choose your priors.</p>
<p>Interviewee: Well, first I look at the data...</p></blockquote>
<p>And the interview generally ends shortly thereafter. I certainly don't claim to have a good method for deciding on a prior, well, prior to seeing the data -- but if you don't have that, what good is Bayesian statistics? It pretty much boils down to a frequentist method. It actually appears to me that the search for an objective prior has brought considerably more intellectual horsepower to bear than Bayesian statistics themselves! When properly applied, these methods have incredible power -- look at von Neumann utility, for example. But when improperly put to work, they can lead to disaster -- and that is not an acceptable outcome. All of statistics is fraught with pitfalls, but at the end of the day I'm much more comfortable debating which [frequentist] tool to apply than I am trying to pretend I haven't seen the data and arguing over whose prior makes more sense.</p>
<p>I'm not trying to start a flame war (and yes, I realize this isn't a post on politics, or something as critically important as iOS vs Android). I admire the elegance and abilities of Bayesian statistics. I'm just disappointed in the degree of care often put into its implementation, and find it difficult to justify one choice over another. Not that I particularly admire most frequentist statisticians for their attention to detail, either...</p>
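<p>To make the sensitivity to the prior concrete, here is a minimal sketch in Python using the conjugate Beta-Binomial model. The two priors and the data counts are hypothetical, chosen only for illustration; nothing here comes from the book. With ten observations the two priors give very different posterior means, while with a thousand they nearly agree:</p>

```python
def posterior_mean(prior_a, prior_b, successes, trials):
    """Posterior mean of a success probability under a Beta(prior_a, prior_b) prior.

    The Beta prior is conjugate to the binomial likelihood, so the posterior
    is Beta(prior_a + successes, prior_b + trials - successes).
    """
    return (prior_a + successes) / (prior_a + prior_b + trials)


# Hypothetical priors: one flat, one strongly convinced the rate is near 10%.
priors = {"flat Beta(1, 1)": (1, 1), "skeptical Beta(2, 18)": (2, 18)}

# Hypothetical data sets with the same 60% observed success rate.
datasets = {"small sample": (6, 10), "large sample": (600, 1000)}

for data_name, (successes, trials) in datasets.items():
    for prior_name, (a, b) in priors.items():
        mean = posterior_mean(a, b, successes, trials)
        print(f"{data_name}, {prior_name}: posterior mean = {mean:.3f}")
```

<p>When data is scarce, the choice of prior does much of the work, which is exactly where "first I look at the data" becomes a problem.</p>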
<p>Anyway, I guess it would be appropriate to wonder about the probability that I'll like this book -- what's the prior of that?</p>
]]></content:encoded>
			<wfw:commentRss>http://www.thisisthegreenroom.com/2011/bayes-prior-to-reading/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Defining chaos</title>
		<link>http://www.thisisthegreenroom.com/2011/defining-chaos/</link>
		<comments>http://www.thisisthegreenroom.com/2011/defining-chaos/#comments</comments>
		<pubDate>Fri, 29 Jul 2011 13:51:13 +0000</pubDate>
		<dc:creator>J</dc:creator>
				<category><![CDATA[Math]]></category>
		<category><![CDATA[butterfly effect]]></category>
		<category><![CDATA[chaos]]></category>
		<category><![CDATA[random]]></category>

		<guid isPermaLink="false">http://www.thisisthegreenroom.com/?p=4101</guid>
		<description><![CDATA[I've been doing a lot of reading on chaos, in particular on the nature of chaotic systems. I was recently trying to explain to a friend why a dynamic system, which can be perfectly captured by a "deterministic" equation, can nonetheless exhibit chaotic behavior. His refusal at first to accept that fact reminded me of my [...]]]></description>
			<content:encoded><![CDATA[<p>I've been doing a lot of reading on chaos, in particular on the nature of chaotic systems. I was recently trying to explain to a friend why a dynamic system, which can be perfectly captured by a "deterministic" equation, can nonetheless exhibit chaotic behavior. His refusal at first to accept that fact reminded me of my initial skepticism upon being told that statisticians study and characterize randomness -- it just doesn't seem to make sense. It's easy to conflate "randomness" or "chaos" with "unpredictable" (or with each other!) when that's not necessarily the case.</p>
<p>What inspired me to write this, however, was a succinct definition that I came across in a paper which described a form of chaos simply as the following property: In a chaotic system, two different sets of initial conditions which are separated by an arbitrarily small distance will grow exponentially farther apart as the system evolves.</p>
<p>This will be recognizable to some readers as a characterization of the <a href="http://en.wikipedia.org/wiki/Lyapunov_exponent">Lyapunov exponent</a>, and to others as an overly mathematized version of the <a href="http://en.wikipedia.org/wiki/Butterfly_effect">butterfly effect</a>. In any case, despite my familiarity with the subject I found this relatively simple definition to be quite illuminating in its clarity. There's something to be said for the literature majors who have thankfully applied their talents to mathematics.</p>
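<p>As a rough illustration, consider the logistic map, which sends x to 4x(1 - x). It is a fully deterministic rule and a common textbook example of chaos; it is an assumed stand-in here, not the system from the paper. The Python sketch below starts two trajectories 1e-10 apart (the starting point and the gap are arbitrary) and prints their separation every ten steps:</p>

```python
def logistic_map(x, r=4.0):
    """One step of the logistic map, a standard deterministic toy system."""
    return r * x * (1.0 - x)


def separation_history(x0, delta, steps, r=4.0):
    """Iterate two trajectories that start `delta` apart and record their distance."""
    a, b = x0, x0 + delta
    history = []
    for _ in range(steps):
        history.append(abs(a - b))
        a, b = logistic_map(a, r), logistic_map(b, r)
    return history


history = separation_history(x0=0.2, delta=1e-10, steps=60)
for step in range(0, 60, 10):
    print(f"step {step:2d}: separation = {history[step]:.3e}")
```

<p>The separation grows by orders of magnitude until it is comparable to the size of the interval itself, at which point the two trajectories are effectively unrelated even though nothing random ever happened.</p>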
]]></content:encoded>
			<wfw:commentRss>http://www.thisisthegreenroom.com/2011/defining-chaos/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>If only...</title>
		<link>http://www.thisisthegreenroom.com/2011/if-only/</link>
		<comments>http://www.thisisthegreenroom.com/2011/if-only/#comments</comments>
		<pubDate>Thu, 28 Jul 2011 06:30:08 +0000</pubDate>
		<dc:creator>J</dc:creator>
				<category><![CDATA[Math]]></category>
		<category><![CDATA[comic]]></category>
		<category><![CDATA[funny]]></category>
		<category><![CDATA[gambling]]></category>
		<category><![CDATA[humor]]></category>
		<category><![CDATA[mathematician]]></category>
		<category><![CDATA[statistics]]></category>

		<guid isPermaLink="false">http://www.thisisthegreenroom.com/?p=4096</guid>
		<description><![CDATA[Via SMBC:]]></description>
			<content:encoded><![CDATA[<p>Via <a href="http://www.smbc-comics.com/index.php?db=comics&amp;id=2320">SMBC</a>:</p>
<p><img class="aligncenter" src="http://www.smbc-comics.com/comics/20110728.gif" alt="" width="576" height="1198" /></p>
]]></content:encoded>
			<wfw:commentRss>http://www.thisisthegreenroom.com/2011/if-only/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Data science vs business intelligence</title>
		<link>http://www.thisisthegreenroom.com/2011/data-science-vs-business-intelligence/</link>
		<comments>http://www.thisisthegreenroom.com/2011/data-science-vs-business-intelligence/#comments</comments>
		<pubDate>Fri, 01 Jul 2011 00:12:20 +0000</pubDate>
		<dc:creator>J</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Math]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[business intelligence]]></category>
		<category><![CDATA[chart]]></category>
		<category><![CDATA[data science]]></category>

		<guid isPermaLink="false">http://www.thisisthegreenroom.com/?p=4028</guid>
		<description><![CDATA[Steve Miller has written a nice two-part piece on data science for Information Management. Part 1 overviews the topic, including links to many pieces that have been profiled on TGR. Part 2 is a more direct comparison of data science and "business intelligence," a somewhat lackluster (but growing) field of data analytics. One quote stood [...]]]></description>
			<content:encoded><![CDATA[<p>Steve Miller has written a nice two-part piece on data science for Information Management. <a href="http://www.information-management.com/blogs/data_science_integration_statistics_databases-10020194-1.html">Part 1</a> overviews the topic, including links to many pieces that have been profiled on TGR. <a href="http://www.information-management.com/blogs/data_science_BI_analytics_big_data_visualizations-10020259-1.html">Part 2</a> is a more direct comparison of data science and "business intelligence," a somewhat lackluster (but growing) field of data analytics.</p>
<p>One quote stood out to me:</p>
<blockquote><p>Although there are many very large data warehouses in the BI world, data science seems obsessed with handling “big data – when the size of the data itself becomes part of the problem.”</p></blockquote>
<p>I actually dislike the popular equivalence of "big data" and "data science". While massive volumes of data -- both in observations (rows) and number of variables (columns) -- certainly necessitated the development of quantitative and infrastructural tools that are central to "data science", the field is by no means limited to large datasets. A good data scientist should be able to find insight in any type of data, big or small. <a href="http://www.kaggle.com/c/overfitting">This Kaggle contest</a> speaks to the dangers of overfitting, a problem which doesn't go away just because the number of observations gets higher. I'm all for the "big data" movement, but "data science" is a larger field than just working with massive datasets.</p>
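<p>A small sketch of why a large dataset does not automatically protect against overfitting: when there are many candidate variables, some column of pure noise will appear related to a pure-noise target. The Python below is an illustration under assumed sizes (250 rows, up to 2000 columns) and is not taken from the Kaggle contest:</p>

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def best_spurious_correlation(n_rows, n_columns, seed=0):
    """Largest |correlation| between a pure-noise target and pure-noise columns."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    best = 0.0
    for _ in range(n_columns):
        column = [rng.gauss(0, 1) for _ in range(n_rows)]
        best = max(best, abs(pearson(column, target)))
    return best


# Same rows, more and more unrelated columns to choose from.
for n_columns in (1, 20, 200, 2000):
    best = best_spurious_correlation(n_rows=250, n_columns=n_columns)
    print(f"{n_columns:4d} columns: best |correlation| = {best:.3f}")
```

<p>Every column is unrelated to the target by construction, so whatever "best" relationship the search finds is an artifact of searching.</p>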
<p>Miller offers this contrastive chart:</p>
<p><a href="http://www.information-management.com/blogs/data_science_BI_analytics_big_data_visualizations-10020259-1.html"><img class="aligncenter" src="http://cdn.information-management.com/media/newspics/Miller050311_1.gif" alt="" width="440" height="495" /></a>To me, BI is "diet data science". BI is not interested in modeling the processes that generate the observed data; it is interested in correlating the observations themselves. One of the best BI tools that I've used is Tableau, a sort of pivot table on steroids which makes it very easy to graph and view the relationships of various variables. But it doesn't offer much for extrapolating new meaning from the data, or applying insights to new data. BI is what data science would be if there were no latent processes, and showing that "these two things move up together" was a sufficient characterization.</p>
<p>I think Miller's chart does capture the chief differences between the two sciences (except for the very last point) but, again, I see no reason for BI to persist as DS methods become more commonplace and accessible. DS is not different from BI; it is just better. Every linear regression and correlation that produces "exact" BI results (as opposed to this assumption that DS only gives "approximate" answers) is in fact a key tool in the data scientist's belt.</p>
<p>Part 1 of Miller's piece can be found <a href="http://www.information-management.com/blogs/data_science_integration_statistics_databases-10020194-1.html">here</a>; part 2 is available <a href="http://www.information-management.com/blogs/data_science_BI_analytics_big_data_visualizations-10020259-1.html">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.thisisthegreenroom.com/2011/data-science-vs-business-intelligence/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Lightning does strike twice</title>
		<link>http://www.thisisthegreenroom.com/2010/lightning-does-strike-twice/</link>
		<comments>http://www.thisisthegreenroom.com/2010/lightning-does-strike-twice/#comments</comments>
		<pubDate>Wed, 20 Oct 2010 23:06:54 +0000</pubDate>
		<dc:creator>J</dc:creator>
				<category><![CDATA[Math]]></category>

		<guid isPermaLink="false">http://www.thisisthegreenroom.com/?p=3899</guid>
		<description><![CDATA[Remember that Bulgarian lottery that happened in a bar that drew the same numbers in consecutive weeks? (TGR covered it extensively here and less extensively here.) Well, it turns out lightning does strike twice: the Israeli lottery had the same winning combinations come up just three weeks apart - though the numbers were drawn in a [...]]]></description>
			<content:encoded><![CDATA[<p>Remember that Bulgarian lottery <del datetime="2010-10-20T19:07:13+00:00">that happened in a bar</del> that drew the same numbers in consecutive weeks? (TGR covered it extensively <a href="http://www.thisisthegreenroom.com/2009/adventures-in-probability/">here</a> and less extensively <a href="http://www.thisisthegreenroom.com/2009/lottery-math-is-not-so-easy/">here</a>.) Well, it turns out lightning does strike twice: the Israeli lottery had <a href="http://www.haaretz.com/print-edition/news/israeli-lottery-draws-same-winning-numbers-twice-in-one-month-1.319671">the same winning combinations come up just three weeks apart</a> - though the numbers were drawn in a different order each time.</p>
<p>Too bad the draws were spread out just a little too much for the "play last week's numbers" crowd to score a big win. </p>
]]></content:encoded>
			<wfw:commentRss>http://www.thisisthegreenroom.com/2010/lightning-does-strike-twice/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Benoit Mandelbrot, 1924 - 2010</title>
		<link>http://www.thisisthegreenroom.com/2010/benoit-mandelbrot-1924-2010/</link>
		<comments>http://www.thisisthegreenroom.com/2010/benoit-mandelbrot-1924-2010/#comments</comments>
		<pubDate>Mon, 18 Oct 2010 23:16:51 +0000</pubDate>
		<dc:creator>J</dc:creator>
				<category><![CDATA[Math]]></category>
		<category><![CDATA[News]]></category>
		<category><![CDATA[Mandelbrot]]></category>
		<category><![CDATA[obituary]]></category>

		<guid isPermaLink="false">http://www.thisisthegreenroom.com/?p=3891</guid>
		<description><![CDATA[Benoit Mandelbrot had a greater academic impact on my life than perhaps any other person. I was deeply saddened to learn he had passed away. The NYT has prepared an obituary.]]></description>
			<content:encoded><![CDATA[<p>Benoit Mandelbrot had a greater academic impact on my life than perhaps any other person. I was deeply saddened to learn he had passed away.</p>
<p>The NYT has prepared an <a href="http://www.nytimes.com/2010/10/17/us/17mandelbrot.html?_r=1">obituary</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.thisisthegreenroom.com/2010/benoit-mandelbrot-1924-2010/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Risk &amp; risk management</title>
		<link>http://www.thisisthegreenroom.com/2010/risk-risk-management/</link>
		<comments>http://www.thisisthegreenroom.com/2010/risk-risk-management/#comments</comments>
		<pubDate>Wed, 30 Jun 2010 06:15:06 +0000</pubDate>
		<dc:creator>J</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Finance]]></category>
		<category><![CDATA[Math]]></category>
		<category><![CDATA[Risk]]></category>
		<category><![CDATA[distribution]]></category>
		<category><![CDATA[HHG2G]]></category>
		<category><![CDATA[Hitchhiker's Guide to the Galaxy]]></category>
		<category><![CDATA[process]]></category>
		<category><![CDATA[risk management]]></category>
		<category><![CDATA[risk measure]]></category>
		<category><![CDATA[risk metric]]></category>
		<category><![CDATA[statistics]]></category>
		<category><![CDATA[tutorial]]></category>
		<category><![CDATA[VaR]]></category>

		<guid isPermaLink="false">http://www.thisisthegreenroom.com/?p=3791</guid>
		<description><![CDATA[An overview of financial risk and the risk management process.]]></description>
			<content:encoded><![CDATA[<p>In the last few weeks, I've been asked more questions about risk and risk management than I recall hearing in the last year, and at no time has that been more clear than on a day that saw global indices fall 4%. For something we refer to so often, "risk" has proved an elusive concept. Still, it appears every day in the media, not to mention our own conversations. But what is "risk", exactly?</p>
<h2>What is "risk"?</h2>
<p>We can't even begin to discuss risk management without a clear understanding of the underlying concept itself. (To be clear, I'm going to talk about financial risk: that which is associated with a specific investment or portfolio. This includes risk due to market forces as opposed to operational or liquidity constraints.) Many possible definitions of "risk" may spring to mind:</p>
<ul>
<li>The most you can lose on an investment</li>
<li>The most you can lose on an investment, with some confidence level <em>alpha</em></li>
<li>The average return of the investment</li>
<li>The market value of an investment</li>
<li>The notional value of an investment</li>
<li>A one-standard deviation loss</li>
<li>A six-standard deviation loss</li>
<li>The chance that a company goes bankrupt</li>
<li>The chance that a counterparty goes bankrupt</li>
<li>The chance that you go bankrupt</li>
</ul>
<p>These are all very useful ideas -- we'll talk about why in a second -- but they dance around the issue. They are merely shadows or projections of financial risk. I list them here because ultimately "risk" must be defined in a way that is consistent with all of these projections; in fact it must actually encompass them all. In order to complete that definition, we'll need to borrow some statistical thinking -- but no math, don't worry.</p>
<p><strong>I propose that "risk" is a distribution of probable outcomes</strong>. Specifying "probable outcomes" is somewhat redundant because, in a statistical sense, a distribution is a catalogue of every possible outcome as well as its associated probability. Nonetheless I state it explicitly here because it's important to realize that we must consider <em>all</em> outcomes, even those which are extremely unlikely.</p>
<h2>Risk as a distribution</h2>
<p>What does it mean to say risk is a distribution? Put another way, this suggests that if I truly know the risk of an investment, I know the probability of any given outcome. I think that's a fairly broad characterization that satisfies both the requirement of encompassing the examples I listed earlier and an intuitive understanding of the concept. Volatility is frequently substituted for risk, as investors interpret volatility as uncertainty and risk, when viewed as a distribution, represents uncertainty in future outcomes.</p>
<p>We can now discuss the nature of distributions and their study. In some cases, it's actually possible to know the true distribution. Flipping a fair coin is the canonical example, but we can also consider rolling a die or drawing a card. In fact, it should come as no surprise that the entire gambling industry is premised on the idea that the public will only be comfortable putting their money at risk if they feel fully informed about possible outcomes. With a coin, there are two outcomes, for argument's sake let's say 0 and 1, and each has a 50% probability of being realized. That's it, we just fully characterized the risk in this investment with a simple Bernoulli distribution. How about the die? There are six outcomes -- for simplicity let's say {1, 2, 3, 4, 5, 6} -- and each one has a 16.67% chance of realization. Thus, the risk of the investment is fully captured by a six-part uniform distribution.</p>
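<p>As a small illustration, the coin and the die can each be written down as a complete list of outcomes and probabilities. This is only a sketch in Python; the helper names <code>is_valid_distribution</code> and <code>expected_value</code> are my own, chosen for the example.</p>

```python
from fractions import Fraction

# Each distribution is a mapping from outcome to probability.
coin = {0: Fraction(1, 2), 1: Fraction(1, 2)}          # Bernoulli with p = 1/2
die = {face: Fraction(1, 6) for face in range(1, 7)}   # discrete uniform on 1..6

def is_valid_distribution(dist):
    """Probabilities must be non-negative and sum to exactly 1."""
    return all(p >= 0 for p in dist.values()) and sum(dist.values()) == 1

def expected_value(dist):
    """Probability-weighted average of the outcomes."""
    return sum(outcome * p for outcome, p in dist.items())

print(is_valid_distribution(coin), is_valid_distribution(die))
print(expected_value(coin), expected_value(die))
```

<p>Using exact fractions avoids the rounding hidden in "16.67%": six probabilities of exactly 1/6 sum to exactly 1.</p>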
<p>Coins and dice are nice illustrations, but they are only toy examples. In the real world, the full list of outcomes may be difficult to ascertain and their respective probabilities even harder. This is where statistics enters the picture. At its core, statistics is the study of distributions. All I've received in years of studying is a bunch of tools for analyzing and describing these lists of potential outcomes. If an investment lacks an easily described set of outcomes, we search for clues as to what the underlying distribution could look like. This could include the type of security, its sensitivities to various external shocks, its historical movements, our expectations of the future, etc. From these indications, we can put together an arbitrarily complex picture of an investment's underlying distribution.</p>
<p>Or at least, we <em>think</em> we can. Creating that picture is a little like trying to draw an object based only on its shadow. In statistics, we refer to this as a hidden or latent factor, or one that cannot be observed directly. By sifting the data -- the clues -- in the right way, we can gain insight into what characteristics the distribution must have and, subsequently, its general form.</p>
<h2>Choosing the distribution</h2>
<p>Many distributions have properties called <em>sufficient statistics</em>. These quantities fully characterize the distribution, allowing it to be perfectly (or sometimes approximately) reconstructed without needing to carry around all the data that originally led to its discovery. Some of these summary statistics lurk in plain sight: mean and standard deviation are two of the most obvious. A dataset that follows a normal distribution, or standard bell curve, can be perfectly summed up with these two quantities. For example, if you made a list of the heights of everyone in your office, it would likely lie on a normal distribution (and for example's sake, let's say that it does). If you want to work with that distribution or build any sort of measurement of it without knowing its form, you need to keep a list of all (say) 200 people and their heights. But if you know it's a normal distribution, all you need is the mean (average) and standard deviation (dispersion around the mean). Those two numbers give you enough information to know the probability of observing any height in your original dataset, without the need to consult the data itself. They are sufficient statistics for the distribution.</p>
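<p>To make the height example concrete, here is a minimal sketch using Python's standard <code>statistics</code> module. The ten heights are invented for illustration, and normality is assumed rather than tested.</p>

```python
from statistics import NormalDist, mean, stdev

# Hypothetical office heights in inches (illustrative data only).
heights = [64.0, 66.5, 67.0, 68.0, 68.5, 69.0, 70.0, 70.5, 71.5, 73.0]

mu = mean(heights)
sigma = stdev(heights)  # sample standard deviation

# Under the normality assumption, these two numbers stand in for the whole list.
fitted = NormalDist(mu, sigma)

# Probability that a height falls between 66 and 72 inches, from mu and sigma alone.
p_between = fitted.cdf(72.0) - fitted.cdf(66.0)
print(round(mu, 2), round(sigma, 2), round(p_between, 3))
```

<p>Once <code>mu</code> and <code>sigma</code> are computed, the list <code>heights</code> is never consulted again, which is the practical meaning of "sufficient" here.</p>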
<p>For the coin toss, the sufficient statistic is the probability of 50%, which fully describes the underlying Bernoulli distribution. For the die, it is the range [1,6], which characterizes the discrete uniform distribution in question. When the list of potential outcomes deviates from well-known distributions, we have two options:</p>
<ol>
<li>Work with the unknown distribution</li>
<li>Approximate the unknown distribution with a well-known one that has similar properties</li>
</ol>
<p>While it seems like option 1 is the best choice, it can be a dangerous one. Recall that we may not actually know what the underlying distribution looks like; all we have is a picture based on its shadows. If we made mistakes creating that picture, we'll have trouble making informed decisions later. Moreover, we will likely be stuck with a branch of statistics called "nonparametric analysis" that can be difficult to make good use of.</p>
<p>Option 2 is likely the better choice, provided that we can glean enough information about the underlying distribution to make an informed choice for the approximating distribution. There is a tendency to always choose a normal distribution, but I think the anti-Gaussian media has beaten that horse to death. Alternatively, there are many families of distributions available; we just want to pick one that describes the investment's outcomes well while retaining a simplicity that makes any math tractable (and, hopefully, easy).</p>
<p>Option 2 also lets us come up with sufficient statistics for the investment. If all investments were normally distributed, then our portfolio analysis would boil down to their means and standard deviations (and correlations with each other, because the portfolio is a multivariate distribution). This assumption drove the mean-variance finance paradigm that was pioneered by Harry Markowitz in the 1950s. Today we try to use more sophisticated distributional assumptions, but the idea remains the same: come up with a simple set of numbers that summarize your data and use them to analyze the whole.</p>
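<p>For two assets, the mean-variance summary reduces to a short calculation. The sketch below uses the standard two-asset portfolio variance formula; the weights, expected returns, volatilities and correlations are hypothetical numbers chosen for the example.</p>

```python
import math

def portfolio_mean(weights, means):
    """Weighted average of the assets' expected returns."""
    return sum(w * m for w, m in zip(weights, means))

def two_asset_volatility(w1, w2, sigma1, sigma2, correlation):
    """Portfolio standard deviation from weights, volatilities and correlation."""
    variance = ((w1 * sigma1) ** 2 + (w2 * sigma2) ** 2
                + 2 * w1 * w2 * correlation * sigma1 * sigma2)
    return math.sqrt(variance)

# Hypothetical annualized inputs: a stock-like and a bond-like asset.
weights = (0.6, 0.4)
means = (0.08, 0.03)
sigmas = (0.20, 0.05)

print("expected return:", round(portfolio_mean(weights, means), 4))
for rho in (-0.5, 0.0, 0.5, 1.0):
    vol = two_asset_volatility(weights[0], weights[1], sigmas[0], sigmas[1], rho)
    print("correlation", rho, "volatility", round(vol, 4))
```

<p>The expected return does not depend on the correlation, but the volatility does, which is why the correlations belong in the list of numbers that summarize the portfolio.</p>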
<p>Returning for a second to the height example, imagine I asked you to estimate the probability of a colleague being over 6'5". If you retained the original dataset (option 1), you would start by counting tall people, divide them by the total count and give me your probability estimate. If you used an approximation (option 2), you'd pop the sufficient statistics into a well-known and exhaustively studied equation and know immediately not just the probability but also a measure of confidence in that number. More complicated analyses might be simply impossible without the distributional assumption. When we are unsure of the best approximation, some compromise of options 1 and 2 will result.</p>
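<p>The two approaches can be compared side by side. In this sketch the 200 heights are simulated (an assumption, since I have no real office data), and 6'5" is taken as 77 inches.</p>

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(0)
# Simulated heights in inches for 200 colleagues (illustrative assumption).
heights = [random.gauss(69.0, 3.0) for _ in range(200)]
threshold = 77.0  # 6'5" expressed in inches

# Option 1: count directly in the retained data.
empirical = sum(h > threshold for h in heights) / len(heights)

# Option 2: summarize with mean and standard deviation, then use the normal tail.
fitted = NormalDist(mean(heights), stdev(heights))
parametric = 1.0 - fitted.cdf(threshold)

print(empirical, parametric)
```

<p>With only 200 observations, the counting estimate can only move in steps of 0.5%, and it is exactly zero if nobody in the sample happens to be that tall. The fitted normal always assigns the tail a small positive probability.</p>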
<p>It's very important to note that <strong>in describing the distributions or risk of these investments we made no judgments about quality</strong>. Surprisingly, we can't even say whether they are "risky" or "safe"! Despite my claiming that "we know the risk of the investment," all we've done is describe the outcomes; subjective and qualitative assessments are yet to come.</p>
<h2>Risk as a metric</h2>
<p>Once we have some idea of what an investment's distribution of outcomes looks like, we have identified its "risk". But as I've mentioned, we can't yet do anything with that information. We need to create some sort of measurement that allows us to make comparisons and decisions. <em>Risk metrics</em> are those measurements.</p>
<p>Risk metrics are usually <em>summary statistics</em> of the underlying risk distribution. Summary statistics give information about the distribution, but, unlike sufficient statistics, they may not provide enough detail to recreate the distribution entirely. For example, the mean by itself or the standard deviation by itself or the minimum value all give some insight into the distribution but fail to characterize it completely. Frequently, estimates of these summary statistics are the "shadows" from which a picture of the true distribution is formed. When you measure the heights of everyone in your office, the observed mean and standard deviation constitute two of the clues you would use to construct the representative bell curve.</p>
<p>We have now learned enough to understand that the risks I listed earlier were actually summary statistics of an investment's true distribution, or underlying risk. At the risk of redundancy, here they are again with explanations (note that some of these refer to the distribution of returns, others to the distribution of portfolio values; it is easy enough to convert between the two):</p>
<ul>
<li>The most you can lose on an investment <em>(the minimum of the distribution)</em></li>
<li>The most you can lose on an investment, with some confidence level <em>alpha (the 1 - <span style="font-style: normal;">alpha</span> quantile of the distribution, also referred to as Value at Risk)</em></li>
<li>The average return of the investment <em>(the mean of the distribution)</em></li>
<li>The market value of an investment <em>(the most recent observation from the distribution)</em></li>
<li>The notional value of an investment <em>(the minimum or maximum of the distribution)</em></li>
<li>A one-standard deviation loss <em>(the standard deviation of the investment)</em></li>
<li>A six-standard deviation loss <em>(six times the standard deviation of the investment)</em></li>
<li>The chance that a company goes bankrupt <em>(a specific outcome from the distribution and its associated probability)</em></li>
<li>The chance that a counterparty goes bankrupt <em>(a specific outcome from the distribution and its associated probability)</em></li>
<li>The chance that you go bankrupt <em>(a specific outcome from the distribution and its associated probability)</em></li>
</ul>
<p>It is clear that without knowledge of the underlying distribution, none of these quantities can be known. I want to hammer home the difference between knowing <em>risk</em>, the distribution, and <em>risk metrics</em>, summary statistics of that distribution. The distinction is even more important -- and confusing -- because sometimes the summary statistics are observed first and the distribution is inferred thereafter.</p>
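<p>Several of the metrics in the list above can be computed from a single sample of returns. The following is a rough sketch with invented monthly returns; <code>empirical_quantile</code> is a simple order-statistic helper written for this example, one of several reasonable quantile conventions.</p>

```python
import math
from statistics import mean, stdev

# Hypothetical monthly returns (illustrative numbers, not real data).
returns = [0.021, -0.013, 0.008, -0.042, 0.015, 0.030, -0.007, 0.012, -0.025, 0.018,
           0.004, -0.061, 0.027, 0.009, -0.016, 0.033, -0.002, 0.011, -0.031, 0.022]

def empirical_quantile(data, q):
    """Smallest observation with at least a fraction q of the data at or below it."""
    ordered = sorted(data)
    index = max(0, math.ceil(q * len(ordered)) - 1)
    return ordered[index]

worst_case = min(returns)                    # the minimum of the sample
var_10 = empirical_quantile(returns, 0.10)   # 10% quantile: VaR at 90% confidence
average = mean(returns)                      # the mean
volatility = stdev(returns)                  # the standard deviation
# Estimated chance of losing more than 3% in a month.
p_big_loss = sum(-0.03 > r for r in returns) / len(returns)

print(worst_case, var_10, round(average, 4), round(volatility, 4), p_big_loss)
```

<p>Each line reduces the same twenty observations to one number, and none of those numbers alone would let you rebuild the list.</p>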
<p>I mentioned earlier that volatility is frequently used to describe risk, because of its tie to uncertainty. We can now view it as just one more summary statistic (specifically, standard deviation). However, volatility has a special place in the risk paradigm because it was explicitly labeled as such in the mean-variance paradigm (its counterpart, return, is played by the mean). That legacy has held and is in many ways justified: more stable returns (less volatility) are associated with return distributions that are well-known and usually characterized by a lack of large losses. As volatility increases, the probability of losses generally increases as well. The distribution becomes more dispersed and various risk metrics take turns for the worse. Thus, volatility is a risk bellwether: easy to calculate and usually indicative of most other metrics.</p>
<p>(Another way to think of risk metrics is as low-dimensional projections of the underlying (and potentially high-dimensional) distribution.)</p>
<h2>Choosing the metric</h2>
<p>And now I'd like you to forget everything we just discussed. In practice, when we talk about "risk" we're referring to risk metrics rather than the underlying distribution. The reason for that is pragmatic: what good does it do to tell someone what the distribution is? Returning to the heights example, knowing the distribution doesn't give you any answers. In fact, if you're a statistician it probably gives you a bunch of questions. Summary statistics (and more advanced results) provide answers. They take the large risk distribution and condense it into a useable form. The appeal is clear: I could tell you every possible outcome of the stock you're about to buy, or I could tell you that you're 90% likely to never lose more than 20%. Which is more useful (putting aside all arguments of whether the latter can truly be known)?</p>
<p>So when we talk about risk we're talking about metrics. How do we choose those metrics? Well, if part 1 of the risk manager's job is to model the underlying distribution, then part 2 is deciding which metrics are useful and calculating them. Needless to say, this part is more art than science. Contrary to popular belief, there is no magic number that contains all risk information and lets you make investment decisions without further analysis. You may have heard of these holy grails; they go by names like "value at risk", "Sharpe ratio", "Sortino ratio", "return over maximum drawdown", "omega ratio", and so forth. These are like weight loss pills -- they make promises grounded in just enough math to either convince or confuse (depending on the customer) and appear to work as advertised on the surface. <em>Caveat emptor</em>.</p>
<p>We have already learned why there is no "one number" solution: because risk metrics are summary statistics and not sufficient statistics. Now, even if they were sufficient statistics for the risk distribution, there still wouldn't be a silver bullet, because the risk distribution does not allow qualitative judgments. It is merely a list of outcomes. If you could condense it to one number, you'd have a number that represented all your outcomes, good and bad, and not necessarily one that would provide an indication of value.</p>
<p>What's really necessary is to look at many of these metrics together. Each one provides some information about the risk distribution, like various shadows from different light sources. By considering many of them at once, our understanding of risk (and equivalently, our picture of the underlying distribution) is enhanced.</p>
<p>There are a few risk metrics that are always useful:</p>
<ul>
<li>The most you can lose is an important one: investors need to bear in mind that zero is a real possibility. For most cash investments, this will be equal to the market value of the investment. Why isn't this enough? If you bought a million shares of stock and sold a million puts on the same, the max loss on the stock would be greater than that of the options, and you might conclude that the stock was the riskier play. However, I don't know anyone who would agree that buying stock is riskier than selling puts. We reach that conclusion by considering other outcomes of the respective distributions, or other summary statistics.</li>
<li>A reasonable upside estimate is also key. This may not fit the traditional intuition behind a "risk measure", but it would help differentiate between the stock and option portfolios just described. The stock has large potential for gains; the puts are capped. Thus, the downside in the stock is mitigated by the positives but the put's downside -- though almost equal to the stock's -- is not similarly offset. The decision of what constitutes a "reasonable" upside is in the art category rather than science, so unfortunately I can't provide an algorithm.</li>
<li>An understanding of an investment's volatility. Volatility, as mentioned, is like a risk bellwether. As it increases, so does the uncertainty about the future outcomes. Another way to express this idea is to say that the entropy of the risk distribution rises as the volatility increases (this idea hasn't been explored nearly enough in the literature). Popular metrics like the Sharpe ratio try to capitalize on this idea by expressing the "return per unit of risk [volatility]". Presumably, the more risk one takes through an investment, the greater the return that should be received. (This notion took a disastrous turn when, in late 2008, angry investors wondered why they lost money in stocks as compared to bonds -- the answer (that stocks are more risky) was staring them in the face, but they were accustomed to that risk resulting in greater yields and refused to accept any alternatives.)</li>
<li>Event-driven idiosyncrasies. Is your investment subject to legal/regulatory risk? Operational risk? Other highly-targeted risks unique to that security? If so, the risk distribution becomes much harder to estimate accurately because these characteristics distort it to the point that approximations fail to capture it fully. It is important to understand not only what these idiosyncrasies are, but how they can impact your estimates of risk. As a simple example, consider an illiquid stock that doesn't trade except for a few times a year, when it jumps up or down 15%. Any distributional assumptions should be tossed out the window here; stick with more "nonparametric" qualifications like maximum loss and rely on an excellent understanding of the risk specific to the investment.</li>
</ul>
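<p>The stock-versus-puts comparison from the first two bullets can be sketched with per-share payoffs at expiry. The purchase price, strike and premium below are hypothetical, and the functions ignore financing, dividends and early exercise.</p>

```python
def long_stock_pnl(final_price, purchase_price):
    """Profit or loss per share from buying stock at purchase_price."""
    return final_price - purchase_price

def short_put_pnl(final_price, strike, premium):
    """Profit or loss per share from selling a put: keep the premium,
    pay out the shortfall if the stock ends below the strike."""
    return premium - max(strike - final_price, 0.0)

# Hypothetical numbers: stock bought at 100; put struck at 100 sold for 5.
purchase_price, strike, premium = 100.0, 100.0, 5.0
final_prices = [0.0, 50.0, 100.0, 150.0, 200.0]

for s in final_prices:
    print(s, long_stock_pnl(s, purchase_price), short_put_pnl(s, strike, premium))
```

<p>If the stock goes to zero, the stock position loses 100 per share and the short put loses 95, so maximum loss alone ranks the stock as slightly worse. Looking across the other prices shows why that ranking misleads: the stock's gain keeps growing with the price while the put's gain stops at the 5 premium.</p>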
<p>No discussion of risk metrics would be complete without addressing value at risk. Value at risk, or VaR, was once a celebrated risk metric, introduced to the public by J.P. Morgan in 1994. More recently, it has become demonized and blamed for its contributions to excessive risk-taking and the collapse of many financial institutions. VaR has a clear definition: it represents a level of returns that will only be exceeded some small percent of the time, typically 5% or 1%. In a strict statistical sense, VaR defines the beginning of a distribution's tail. Unfortunately, it provides no information about what happens when returns actually exceed VaR and make it <em>into</em> the tail. As more financial institutions came to see VaR as a minimum return, rather than an unlikely-but-still-possible return, they increased the level of risk they were willing to accept. On days when returns exceeded VaR -- and they tended to do so by quite a bit -- those institutions took losses far greater than they ever anticipated were even possible. In other words, they failed to consider that the risk distribution extended past the VaR level.</p>
<p>In a statistical sense beyond the scope of this writing, VaR does not satisfy certain axioms that good risk metrics require (see the 1999 paper by Artzner et al. on coherent risk measures). Nonetheless, when used in compliance with its strict definition, it serves as just another summary statistic and can give limited insight into the risk distribution. It is useful to observe the evolution of VaR over time, for example (if VaR increases, risk is increasing, even if the absolute level of VaR is uninteresting). Extensions of VaR like expected shortfall (the average loss, conditional on that loss exceeding VaR in the first place) are also quite useful. An institution is not doing something "wrong" by calculating a VaR; it may be a red flag if they rely <em>solely</em> on the number, however.</p>
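<p>A toy example shows what VaR misses and expected shortfall catches. This is a sketch using a simple empirical quantile on two invented return series; production systems use more careful estimators, and the function names here are my own.</p>

```python
import math

def value_at_risk(returns, tail_probability):
    """Empirical quantile of returns at the given tail probability (e.g. 0.10)."""
    ordered = sorted(returns)
    index = max(0, math.ceil(tail_probability * len(ordered)) - 1)
    return ordered[index]

def expected_shortfall(returns, tail_probability):
    """Average of the returns at or beyond the VaR level."""
    var = value_at_risk(returns, tail_probability)
    tail = [r for r in returns if var >= r]
    return sum(tail) / len(tail)

# Two hypothetical series that differ only in their single worst outcome.
mild_tail = [-0.05, -0.04] + [0.01] * 18
fat_tail = [-0.40, -0.04] + [0.01] * 18

for name, series in (("mild", mild_tail), ("fat", fat_tail)):
    print(name, value_at_risk(series, 0.10), round(expected_shortfall(series, 0.10), 4))
```

<p>Both series report the same VaR at the 10% level, because VaR only marks where the tail begins. Expected shortfall averages what lies inside the tail, so it separates the two.</p>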
<h2>The risk management process</h2>
<p>What I've laid out here is a rather dry blueprint of the risk management process. The procedure is initiated by searching for clues to an investment's underlying distribution. This could be any combination of quantitative (historical or modeled outcomes) and qualitative (fundamental analysis, opinions about the future) factors that provide the "shadows" of the distribution. From these, a complete picture of the distribution is constructed, either through the use of sufficient statistics or tailored models (if the distribution defies simple approximation). Finally, the distribution is used to generate risk metrics that allow investments to be assessed and compared. Those outputs become a critical input for the investment process, as decisions must be made in the context of the portfolio risk, and that risk must not be outsized relative to expected returns.</p>
<p>Once the investment is made, the risk manager will continue to exert influence on the portfolio distribution. For example, if the left tail becomes too big, he may take steps to reduce it by taking offsetting positions, or hedging. If exposure to a specific market force (such as interest rates, or currencies) becomes too large or too small, he may buy or sell securities to bring it back in line. This monitoring process is very important -- the risk of an investment continues to change long after the investment is put on (in fact, you should hope it does, for otherwise nothing has happened at all!).</p>
<p>There are a few key lessons that can be taken from this process:</p>
<ul>
<li>First, an appreciation for the lack of a silver bullet: there is no magic risk number that will protect your portfolio. I'm sorry.</li>
<li>Second, a grasp of the constantly changing nature of an investment's risk. There is no "set it and forget it" in this process.</li>
<li>Third, an understanding of noise vs signal: investments will tend to sample from all over their distributions, both on the upside and down. It is important to observe whether or not the observed returns (themselves summary statistics, or "shadows") match your understanding of the underlying distribution. If they deviate too much, be prepared to consider that your original assumption was wrong and start over.</li>
<li>Fourth, but most important, an understanding that the forest must not be lost for the trees. Seizing on one or two risk measures will inevitably lead to ignorance of the complete distribution (with possibly disastrous consequences). Conversely, trying to compute every summary statistic there is will lead to information overflow and indecision. Risk metrics are tools that provide insight; there's a healthy balance between sparsity and indulgence. Thinking of the metrics as shadows from different lights really is a useful metaphor: too few and some details won't be resolved; too many and the data's redundancy will overwhelm any chance of learning from it.</li>
</ul>
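<p>The third lesson, checking observed returns against the assumed distribution, can be sketched as a simple count of VaR breaches. The normal model, its parameters and the observed returns below are all hypothetical; a real backtest would use far more data and a formal test.</p>

```python
from statistics import NormalDist

# Assumed model for daily returns (hypothetical parameters).
assumed = NormalDist(mu=0.0005, sigma=0.01)
tail_probability = 0.05
# Return level the model says should be breached on about 5% of days.
var_level = assumed.inv_cdf(tail_probability)

# Hypothetical observed daily returns.
observed = [0.004, -0.012, 0.007, -0.021, 0.001, -0.003, 0.010, -0.018, 0.002, -0.025,
            0.006, -0.001, -0.019, 0.003, 0.008, -0.030, 0.005, -0.002, 0.009, -0.017]

breaches = sum(var_level > r for r in observed)
expected_breaches = tail_probability * len(observed)

print(round(var_level, 4), breaches, expected_breaches)
```

<p>When breaches occur several times more often than the model predicts, the mismatch is more likely a flaw in the assumed distribution than bad luck, and that is the cue to revisit the assumption.</p>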
<p>Aside from these tips, I can't stress enough the importance of practicing good risk management. Many investors do it implicitly, as simply understanding each investment is usually tantamount to intuiting its distribution. It doesn't have to be a burdensome regime of additional steps, though many investors will find it useful to ask themselves, as an exercise, "What is the largest loss I can sustain and what is the likelihood of that event? What is the volatility of my portfolio, and am I earning enough to justify that allocation?" and so forth.</p>
<p>The risk management process is not unlike solving a puzzle by piecing together clues and constantly checking that the emerging picture matches up with expectations. I hope this explanation has been satisfactory and not too mathy (you don't want to see me when I'm mathy). There's a richness to the process which I'm afraid I won't be able to describe here -- for your sake and mine -- but I think this should serve as a good jumping-off point for further discussion.</p>
<p>In conclusion, the Hitchhiker's Guide to the Galaxy has this to say on the subject of <a href="http://voices.washingtonpost.com/ezra-klein/2010/06/the_hitchhikers_guide_to_the_g.html">tail risk</a>:</p>
<blockquote><p>The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at or repair.</p>
</blockquote>
]]></content:encoded>
			<wfw:commentRss>http://www.thisisthegreenroom.com/2010/risk-risk-management/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>What is data science?</title>
		<link>http://www.thisisthegreenroom.com/2010/what-is-data-science/</link>
		<comments>http://www.thisisthegreenroom.com/2010/what-is-data-science/#comments</comments>
		<pubDate>Fri, 04 Jun 2010 00:21:53 +0000</pubDate>
		<dc:creator>J</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Internet]]></category>
		<category><![CDATA[Math]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[data scientist]]></category>
		<category><![CDATA[science]]></category>
		<category><![CDATA[statistics]]></category>

		<guid isPermaLink="false">http://www.thisisthegreenroom.com/?p=3697</guid>
		<description><![CDATA[The latest in a series of articles on the topic, Mike Loukides of O'Reilly Radar asks, "What is data science?": We've all heard it: according to Hal Varian, statistics is the next sexy job. Five years ago, in What is Web 2.0, Tim O'Reilly said that "data is the next Intel Inside." But what does that statement [...]]]></description>
			<content:encoded><![CDATA[<p>The latest in a series of articles on the topic, Mike Loukides of O'Reilly Radar asks, "<a href="http://radar.oreilly.com/2010/06/what-is-data-science.html">What is data science?</a>":</p>
<blockquote><p>We've all heard it: according to Hal Varian, <a href="http://www.nytimes.com/2009/08/06/technology/06stats.html">statistics is the next sexy job</a>. Five years ago, in <a href="http://oreilly.com/web2/archive/what-is-web-20.html">What is Web 2.0</a>, Tim O'Reilly said that "data is the next Intel Inside." But what does that statement mean? Why do we suddenly care about statistics and about data?</p>
</blockquote>
<p>The article is excellent, insightful, and long. It's not just an overview; it's an in-depth discussion of the who's, how's, what's and why's of data science - and required reading for anyone curious about what we data scientists actually do.</p>
<p>A few phrases that really stood out to me:</p>
<blockquote><p>CDDB views music as data, not as audio, and creates new value in doing so.</p>
</blockquote>
<p>One of the keys to data science is the realization that data is data is data; it doesn't really matter what that data represents. A computer (read: algorithm, test, procedure) is content-agnostic. It just does what it's told. It is up to the scientist -- the human -- to impose meaning and context on the results of the data manipulation. You might run two distinct analyses on the same dataset; or use the same analysis for two very different datasets. The procedure doesn't care and -- critically --  has no way of inferring its own success without a meta-algorithm layered on top of it. It's easiest to let the data scientist be that top layer.</p>
<blockquote><p>The question facing every company today, every startup, every non-profit, every project site that wants to attract a community, is how to use data effectively -- not just their own data, but all the data that's available and relevant. Using data effectively requires something different from traditional statistics, where actuaries in business suits perform arcane but fairly well-defined kinds of analysis. What differentiates data science from statistics is that data science is a holistic approach. We're increasingly finding data in the wild, and data scientists are involved with gathering data, massaging it into a tractable form, making it tell its story, and presenting that story to others.</p>
</blockquote>
<p>This goes hand-in-hand with my last point: there's no definition of the "right" analysis. Data science is a two-stage process: first, an exploration and second, an implementation (or communication). Repeat.</p>
<blockquote><p>Once you've parsed the data, you can start thinking about the quality of your data. Data is frequently missing or incongruous. If data is missing, do you simply ignore the missing points? That isn't always possible. If data is incongruous, do you decide that something is wrong with badly behaved data (after all, equipment fails), or that the incongruous data is telling its own story, which may be more interesting?</p>
</blockquote>
<p>There's a nice section, including the above paragraph, on the life-cycle of data itself. The one thing I would add is that data frequently needs to be transformed before it becomes usable. Too many applications today just take data in its raw form and try to correlate it (I'm looking at you, every-application-that-counts-words-in-tweets!). Standardization, whitening, dimension reduction and transformation are crucial steps in getting informed results. If I gave you audio data, you wouldn't just use it as it appears; you'd probably run it through an FFT first. I suppose you could argue that this step of the analysis is actually part of the analysis itself, and not part of the data preparation.</p>
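<p>Standardization is the simplest of those transformations. Here is a minimal sketch with two made-up features on very different scales; the <code>standardize</code> helper is written for this example and uses the population standard deviation.</p>

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    center = mean(values)
    spread = pstdev(values)
    return [(v - center) / spread for v in values]

# Two hypothetical features measured on very different scales.
word_counts = [120.0, 80.0, 200.0, 160.0, 40.0]
sentiment = [0.10, -0.30, 0.50, 0.20, -0.50]

z_counts = standardize(word_counts)
z_sentiment = standardize(sentiment)

print([round(z, 2) for z in z_counts])
print([round(z, 2) for z in z_sentiment])
```

<p>After rescaling, both features are expressed in standard deviations from their own mean, so neither dominates a comparison simply because its raw numbers are bigger.</p>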
<blockquote><p>The problem with most data analysis algorithms is that they generate a set of numbers. To understand what the numbers mean, the stories they are really telling, you need to generate a graph.</p>
</blockquote>
<p>Sometimes, sometimes not. The data-visualization/infographic movement is one of the best things that has happened to data science in a long time. Unfortunately, it has also trained us that "pictures are good; simple pictures are better." There's nothing more communicative than a good chart, true, but some datasets defy graphic communication. Multi-dimensional datasets are certainly hard to draw without some process like MDS or projection pursuit. I would argue that for many data applications, visualizations are part of the exploratory process but would/should not be considered a final product. For complex data, visualizations show you the question and how the data relates to it; they may not actually show you the answer.</p>
<blockquote><p>According to DJ Patil, chief scientist at LinkedIn, the best data scientists tend to be "hard scientists," particularly physicists, rather than computer science majors. Physicists have a strong mathematical background, computing skills, and come from a discipline in which survival depends on getting the most from the data. They have to think about the big picture, the big problem. When you've just spent a lot of grant money generating data, you can't just throw the data out if it isn't as clean as you'd like. You have to make it tell its story. You need some creativity for when the story the data is telling isn't what you think it's telling.</p>
</blockquote>
<p>This is a really interesting point -- being able to code does not a data scientist make (though it certainly doesn't preclude the possibility). Data science is about creative thinking as much as it is about creative implementation.</p>
<blockquote><p>Data scientists combine entrepreneurship with patience, the willingness to build data products incrementally, the ability to explore, and the ability to iterate over a solution. They are inherently interdisciplinary. They can tackle all aspects of a problem, from initial data collection and data conditioning to drawing conclusions. They can think outside the box to come up with new ways to view the problem, or to work with very broadly defined problems: "here's a lot of data, what can you make from it?"</p>
</blockquote>
<p>I've actually used exactly the same question to describe the field. It is the central, driving objective behind data science and its simplicity speaks to the incredible diversity of projects and pursuits that the field allows.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.thisisthegreenroom.com/2010/what-is-data-science/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Beware statisticians bearing gifts</title>
		<link>http://www.thisisthegreenroom.com/2010/beware-statisticians-bearing-gifts/</link>
		<comments>http://www.thisisthegreenroom.com/2010/beware-statisticians-bearing-gifts/#comments</comments>
		<pubDate>Tue, 25 May 2010 01:15:07 +0000</pubDate>
		<dc:creator>J</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Math]]></category>
		<category><![CDATA[education]]></category>
		<category><![CDATA[hypothesis]]></category>
		<category><![CDATA[statistics]]></category>
		<category><![CDATA[tools]]></category>

		<guid isPermaLink="false">http://www.thisisthegreenroom.com/?p=3639</guid>
		<description><![CDATA[The NYT is running a great article about the influx of data in today's world. The prime argument borrows from Einstein's quote, "Not everything that can be counted counts, and not everything that counts can be counted." I think this speaks volumes and should be heeded by the sites that persist in churning out infographics [...]]]></description>
			<content:encoded><![CDATA[<p></p><div class="googlePlusOneButton"><g:plusone href="http://www.thisisthegreenroom.com/2010/beware-statisticians-bearing-gifts/"  size="small"   annotation="none"  ></g:plusone></div><p>The NYT is running a <a href="http://www.nytimes.com/2010/05/16/magazine/16FOB-WWLN-t.htm">great article</a> about the influx of data in today's world. The prime argument borrows from Einstein's quote, "Not everything that can be counted counts, and not everything that counts can be counted."</p>
<p>I think this speaks volumes and should be heeded by the sites that persist in churning out infographics that do little to educate (or illustrate) about anything, except maybe how easy it is to draw monochromatic pie charts. A notable (and humorous) exception may be seen <a href="http://www.flickr.com/photos/philgyford/4505748943/sizes/o/">here</a>.</p>
<p>One of the article's most salient points is that it is not enough to take raw data, run it through a battery of statistical tests, and publish the results. And yes, pie charts are a statistical test. The data must be understood and interpreted - and statisticians will use a first set of tests to illuminate the nature of the data, even before we begin testing hypotheses. After all, how can you answer a question without truly understanding what it is? Remember that any statistical test involves a null hypothesis and an alternative - without understanding exactly what the data represents, it is impossible to properly express those options.</p>
<p>But the statistician's work is not done once the data is understood and the tests are performed - the results of those tests must be interpreted as well. "Lies, damned lies, and statistics" isn't just a saying - it's truth! Show me a result from a dataset, and I'll show you a convincing way to present an alternative conclusion. It is only by ensuring the integrity of the data and the tests, by knowing exactly what questions are being asked <em>and the manner in which they will be answered</em>, that we can have confidence in our results.</p>
<p>I think it's wonderful that the tools of statistics have become democratized. But we need to make sure that statistical thinking is as widely disseminated as the math. Tools aren't much use without the knowledge to wield them. I can hold a hammer and screwdriver, sure, but I'm no master carpenter. Until we can be confident that our statistics come from statisticians, it will remain necessary to question all analyses. As I write that, I'm well aware that even in that scenario we'd probably need a healthy dose of skepticism just the same. Who better to disguise meaning than the master statisticians themselves?</p>
<p>Beware statisticians bearing gifts.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.thisisthegreenroom.com/2010/beware-statisticians-bearing-gifts/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

