In the latest of a series of articles on the topic, Mike Loukides of O'Reilly Radar asks, "What is data science?":
We've all heard it: according to Hal Varian, statistics is the next sexy job. Five years ago, in What is Web 2.0, Tim O'Reilly said that "data is the next Intel Inside." But what does that statement mean? Why do we suddenly care about statistics and about data?
The article is excellent, insightful, and long. It's not just an overview; it's an in-depth discussion of the whos, hows, whats, and whys of data science -- and required reading for anyone curious about what we data scientists actually do.
A few phrases that really stood out to me:
CDDB views music as data, not as audio, and creates new value in doing so.
One of the keys to data science is the realization that data is data is data; it doesn't really matter what that data represents. A computer (read: algorithm, test, procedure) is content-agnostic. It just does what it's told. It is up to the scientist -- the human -- to impose meaning and context on the results of the data manipulation. You might run two distinct analyses on the same dataset; or use the same analysis for two very different datasets. The procedure doesn't care and -- critically -- has no way of inferring its own success without a meta-algorithm layered on top of it. It's easiest to let the data scientist be that top layer.
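The "data is data is data" point can be made concrete with a minimal sketch. The datasets and numbers below are invented for illustration; the point is that the procedure itself carries no notion of what it is summarizing:

```python
import statistics

def summarize(values):
    # The procedure is content-agnostic: it has no idea what the numbers mean,
    # and no way of judging whether its output is meaningful.
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
    }

# The same analysis applied to two very different datasets -- only the
# human attaches meaning ("amplitude", "traffic") to the results.
audio_samples = [0.0, 0.3, -0.2, 0.5, -0.4, 0.1]    # waveform amplitudes
daily_visits = [1200, 980, 1430, 1105, 1320, 990]   # page hits per day

for name, data in [("audio", audio_samples), ("visits", daily_visits)]:
    print(name, summarize(data))
```

The meta-layer -- deciding whether a mean of 0.05 is a sensible thing to report about an audio signal -- is exactly the part the code cannot supply.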
The question facing every company today, every startup, every non-profit, every project site that wants to attract a community, is how to use data effectively -- not just their own data, but all the data that's available and relevant. Using data effectively requires something different from traditional statistics, where actuaries in business suits perform arcane but fairly well-defined kinds of analysis. What differentiates data science from statistics is that data science is a holistic approach. We're increasingly finding data in the wild, and data scientists are involved with gathering data, massaging it into a tractable form, making it tell its story, and presenting that story to others.
This goes hand-in-hand with my last point: there's no definition of the "right" analysis. Data science is a two-stage process: first an exploration, then an implementation (or communication). Repeat.
Once you've parsed the data, you can start thinking about the quality of your data. Data is frequently missing or incongruous. If data is missing, do you simply ignore the missing points? That isn't always possible. If data is incongruous, do you decide that something is wrong with badly behaved data (after all, equipment fails), or that the incongruous data is telling its own story, which may be more interesting?
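The two choices Loukides describes -- silently dropping missing points versus flagging the badly behaved ones for a closer look -- can be sketched in a few lines. The sensor readings here are hypothetical, and the two-standard-deviation cutoff is just one common heuristic, not a rule:

```python
import statistics

# Hypothetical sensor log: None marks missing readings, and one value
# looks wildly out of range -- equipment failure, or a story of its own?
readings = [21.3, 21.8, None, 22.1, 95.0, 21.6, None, 22.0]

# Strategy 1: simply ignore the missing points (not always defensible).
present = [r for r in readings if r is not None]

# Strategy 2: flag incongruous values rather than silently discarding them,
# so a human can decide whether they are noise or signal.
mu = statistics.mean(present)
sigma = statistics.stdev(present)
flagged = [r for r in present if abs(r - mu) > 2 * sigma]

print(f"kept {len(present)} of {len(readings)} readings")
print("flagged as incongruous:", flagged)
```

Note that the code only surfaces the anomaly; deciding whether 95.0 is a broken thermometer or the most interesting data point in the set is the data scientist's call.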
There's a nice section, including the above paragraph, on the life-cycle of data itself. The one thing I would add is that data frequently needs to be transformed before it becomes usable. Too many applications today just take data in its raw form and try to correlate it (I'm looking at you, every-application-that-counts-words-in-tweets!). Standardization, whitening, dimension reduction, and transformation are crucial steps in getting informed results. If I gave you audio data, you wouldn't just use it as it appears; you'd probably run it through an FFT first. I suppose you could argue that this step of the analysis is actually part of the analysis itself, and not part of the data preparation.
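Both transformations mentioned above -- standardization and the Fourier transform -- fit in a short sketch. The naive DFT below is for illustration only (an FFT computes the same thing far faster), and the pure tone is a made-up signal:

```python
import math
import statistics

def standardize(xs):
    # Rescale to zero mean and unit variance (z-scores), so features
    # measured on wildly different scales become comparable.
    mu = statistics.mean(xs)
    sigma = statistics.stdev(xs)
    return [(x - mu) / sigma for x in xs]

def dft_magnitudes(signal):
    # Naive discrete Fourier transform; the FFT is a fast version of this.
    n = len(signal)
    mags = []
    for k in range(n):
        re = sum(x * math.cos(-2 * math.pi * k * t / n) for t, x in enumerate(signal))
        im = sum(x * math.sin(-2 * math.pi * k * t / n) for t, x in enumerate(signal))
        mags.append(math.hypot(re, im))
    return mags

raw = [3.0, 7.0, 5.0, 9.0, 1.0]
z = standardize(raw)  # now zero mean, unit variance

# A pure tone at 3 cycles per window: in the raw samples the structure is
# invisible, but the spectrum concentrates all the energy in one bin.
tone = [math.sin(2 * math.pi * 3 * t / 32) for t in range(32)]
spectrum = dft_magnitudes(tone)
```

The spectrum peaks at bin 3 (and its mirror, bin 29) -- the "story" the raw samples were hiding until the transform made it legible.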
The problem with most data analysis algorithms is that they generate a set of numbers. To understand what the numbers mean, the stories they are really telling, you need to generate a graph.
Sometimes, sometimes not. The data-visualization/infographic movement is one of the best things that has happened to data science in a long time. Unfortunately, it has also trained us that "pictures are good; simple pictures are better." There's nothing more communicative than a good chart, true, but some datasets resist graphic communication. Multi-dimensional datasets are certainly hard to draw without some process like MDS or projection pursuit. I would argue that for many data applications, visualizations are part of the exploratory process but would/should not be considered a final product. For complex data, visualizations show you the question and how the data relates to it; they may not actually show you the answer.
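To make the "you can't draw it until you squeeze it" point concrete, here is a sketch using a random linear projection -- a much cruder cousin of MDS and projection pursuit, chosen only because it fits in a few lines. The six-dimensional observations are invented:

```python
import random

random.seed(0)

def random_projection(points, out_dim=2):
    # Map high-dimensional points to out_dim via a random linear map.
    # Crude, but often good enough for a first exploratory scatter plot;
    # it is emphatically not a final product.
    in_dim = len(points[0])
    proj = [[random.gauss(0, 1) for _ in range(in_dim)] for _ in range(out_dim)]
    return [
        tuple(sum(w * x for w, x in zip(row, p)) for row in proj)
        for p in points
    ]

# Ten hypothetical 6-dimensional observations squeezed into 2-D,
# the form a scatter plot can actually display.
data = [[random.random() for _ in range(6)] for _ in range(10)]
flat = random_projection(data)
```

A method like MDS would instead choose the 2-D layout to preserve pairwise distances; either way, the picture that results is a question-finding tool, not an answer.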
According to DJ Patil, chief scientist at LinkedIn, the best data scientists tend to be "hard scientists," particularly physicists, rather than computer science majors. Physicists have a strong mathematical background, computing skills, and come from a discipline in which survival depends on getting the most from the data. They have to think about the big picture, the big problem. When you've just spent a lot of grant money generating data, you can't just throw the data out if it isn't as clean as you'd like. You have to make it tell its story. You need some creativity for when the story the data is telling isn't what you think it's telling.
This is a really interesting point -- being able to code does not a data scientist make (though it certainly doesn't preclude the possibility). Data science is about creative thinking as much as it is about creative implementation.
Data scientists combine entrepreneurship with patience, the willingness to build data products incrementally, the ability to explore, and the ability to iterate over a solution. They are inherently interdisciplinary. They can tackle all aspects of a problem, from initial data collection and data conditioning to drawing conclusions. They can think outside the box to come up with new ways to view the problem, or to work with very broadly defined problems: "here's a lot of data, what can you make from it?"
I've actually used exactly the same question to describe the field. It is the central, driving objective behind data science, and its simplicity speaks to the incredible diversity of projects and pursuits that the field allows.