Real time and the data society

May 16, 2009 in Internet

The Times has gone all Twitterish - a new feature called Times Wire displays stories as they are published. It is the latest effort to cash in on the growing phenomenon of "real time search." The new Times page implies very simply that recency = importance, which just isn't the case, but somehow our need for instant gratification has spread - with massive errors - to the web. Indeed, real time search is the latest buzzword in the Web 2.0 world, a trend that has been kicked into high gear by the astonishing growth of Twitter. In fact, the top result for "real time search" is Twitter's search page.

Timing is critical in evaluating social information (the sort of details one would acquire from a social site like Facebook, Twitter, etc.). Given a sample of such data, the "most important" items would lie on two dimensions: familiarity and time. In other words, users care about people they know, and they care about their most recent actions. Therefore, it is hardly surprising that social sites have adopted a consistent framework in which "familiarity" is implied by users through the act of defining networks and "time" is managed by the site as it delivers the most recently generated content. Facebook walls, Myspace comments, Twitter feeds and even blogs in general present data in chronological order. Naturally, the logical conclusion of this process is to include just-published data, hence the rise of "real time."

With droves of users flocking to social sites, it is natural that other, non-social sites would want to emulate the real time paradigm (just as they have tried, with varying success, to incorporate every social fad that comes along). However, only a very small piece of the vast amount of data on the web has "time" as a principal component of its importance, and that is why the enthusiasm for real time search is misplaced. In fact, most information is time agnostic (within reason) - static information can remain relevant days and even years after its first publication.

TechCrunch recently opined that real time search is Google's number one priority. But step back a second and think about what it really means for Google to search in real time from a logistical standpoint. Currently, Google's index gets updated as it crawls the web. The faster it crawls, the sooner the index is updated. To include real time information, all Google needs to do is increase its crawl frequency (my phrasing belies the difficulty of the endeavor). Real time search in this sense is not a new paradigm; it is merely the evolution of existing methods. Right now, Google picks up news stories within about 20 minutes of publication; query those feeds more often and voila: real time data. But such an acceleration is as computationally difficult as it is conceptually simple.

The current crop of real time search engines believes it has found a shortcut by reducing the universe of material to recently published items. But this makes the erroneous assumption that time is the principal component of relevance! Unless the search is explicitly for breaking news, this will result in far more noise than signal.

To the extent that recency is a determinant of relevance, it is safe to say it is not the principal component. A search engine cannot start out by eliminating "old" stories. It may, however, give extra weight to more recent stories. But this is a tough trick. Presently, Google assesses relevance with incoming links. It is not trivial to add time to the equation, since nascent pages have fewer links (in almost every case), despite their potential relevance to a query. One possible solution is to use the rate of new links as an indicator. Google presently does something similar to identify "hot" search trends, as does Twitter. But even this is not quite real time; it's just an educated guess about what will be popular in the near future.
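To make the "rate of new links" idea concrete, here is a rough sketch of how such a score might be put together. To be clear, this is not how Google (or anyone else) actually ranks pages; the function name, the weights and the 24-hour half-life are all made up for illustration. The point is only that recency acts as a boost on top of accumulated links, never as a filter that throws old pages away.

```python
import math


def relevance_score(link_count, new_links_last_hour, age_hours,
                    velocity_weight=2.0, half_life_hours=24.0):
    """Hypothetical score: accumulated links plus a decaying boost for link velocity.

    All parameters and weights are illustrative assumptions, not a real ranking formula.
    """
    # Long-run authority: total incoming links, log-scaled so huge counts don't dominate.
    authority = math.log1p(link_count)
    # Momentum: how fast new links are arriving right now.
    velocity = math.log1p(new_links_last_hour)
    # Freshness: halves every `half_life_hours`, so the boost fades as a page ages.
    freshness = 0.5 ** (age_hours / half_life_hours)
    return authority + velocity_weight * velocity * freshness


# An old, heavily linked reference page vs. a two-hour-old story gaining links quickly.
old_reference = relevance_score(link_count=10_000, new_links_last_hour=0, age_hours=1_000)
breaking_story = relevance_score(link_count=50, new_links_last_hour=40, age_hours=2)
quiet_new_page = relevance_score(link_count=5, new_links_last_hour=1, age_hours=2)
print(breaking_story > old_reference, quiet_new_page < old_reference)
```

With these made-up weights, a new page only outranks an established one if links are arriving quickly; merely being recent is not enough, which is the distinction the paragraph above is drawing.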

How about this - if more people are linking to something, or discussing it, then it must be more important? Do you really believe that enough people share the same web page through Twitter to create a useful universe of reliable information? If you do, try using Digg as a search engine for a few days. Popular Digg stories have had time to be evaluated (i.e. popularity is not assessed in real time), so one would expect the information there to be somewhat more reliable than Twitter. Of course, a Digg search engine is futile because it simply doesn't encompass enough data. Twitter, while a step above since it enables the easy introduction of micro-thoughts into the digital sphere, is similarly too limited in scope and capacity if we only examine what people are linking to.

Indeed, the key hurdle for real time search as an idealized concept is that the current system is many-to-one, but real time information would have to be interpreted in one-to-many fashion. What I mean is this: when you search for something, you are essentially asking Google to find a single page that many people have indicated is relevant to your query, after the publication of that page. In real time search, one doesn't have time to wait for people to set up links or even for a website to be written. We are interested in the raw story - the single piece of information that made 1000 different Twitter accounts suddenly light up. But how can you point to that information? It could be something that just happened outside, or on TV. The challenge for true real time search engines is not to point to web pages more quickly - it's to point to information that might not even be on the web yet. Such an engine will have to be intelligent enough to see these 1000 different Tweets, distill the common thread, and report that *something* is happening, even if there's no web page for it.
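As a toy illustration of "distilling the common thread," one naive approach is to look for words that suddenly show up in far more messages than usual. The sketch below is an assumption-laden simplification: the function names, the stopword list and the thresholds are all hypothetical, and a real system would need much more than word counts. It only shows that the signal can come from the messages themselves, with no web page involved.

```python
import re
from collections import Counter

# A tiny, illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "and", "that", "this", "with", "just", "for", "you", "was", "did"}


def tokenize(message):
    """Return the set of distinct, non-trivial words in one message."""
    words = re.findall(r"[a-z']+", message.lower())
    return {w for w in words if len(w) > 2 and w not in STOPWORDS}


def bursting_terms(recent, baseline, min_count=3, ratio=5.0):
    """Find terms far more common in `recent` messages than in `baseline` ones.

    `min_count` and `ratio` are arbitrary thresholds chosen for illustration.
    """
    if not recent:
        return []
    recent_counts = Counter(w for m in recent for w in tokenize(m))
    base_counts = Counter(w for m in baseline for w in tokenize(m))
    bursts = []
    for term, count in recent_counts.items():
        if count < min_count:
            continue
        recent_rate = count / len(recent)
        # Add-one smoothing so terms never seen before don't divide by zero.
        base_rate = (base_counts[term] + 1) / (len(baseline) + 1)
        if recent_rate / base_rate >= ratio:
            bursts.append((term, count))
    # Most widely mentioned terms first; ties broken alphabetically.
    return sorted(bursts, key=lambda pair: (-pair[1], pair[0]))


baseline = [
    "having coffee this morning", "traffic is terrible today",
    "watching a movie tonight", "more coffee please",
    "new blog post about search", "lunch with friends",
    "reading the paper", "long day at work",
]
recent = [
    "whoa earthquake just now", "did anyone feel that earthquake",
    "earthquake downtown!", "having coffee", "big earthquake shaking",
]
print(bursting_terms(recent, baseline))
```

Nothing here points to a URL. The output is simply a word that many unrelated accounts started using at once, which is the "something is happening" report described above.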

And so we get to something actually worth some excitement.

How can Google point to something for which there is no web page? The current internet, for all its prevalence, does not constitute a fully networked society. News is still reported by people typing stories; photos must be manually uploaded; information must be transcribed, coded and published. But Google is making strides toward integrating the "real world" and the virtual one - their power meter effort is one example, with a telling Lord Kelvin quote headlining the project: "If you can not measure it, you can not improve it." Indeed, in the future any action in the physical world will be mirrored in some fashion online, automatically. As more applications move toward a cloud setup, this transition will accelerate - for example, your camera's photos will be stored online, not on physical media. The more data that is made available, the less we need to rely on the artificial concept of "web sites" - after all, a web site is nothing more than a place a person or organization has claimed to store data. And who will take care of the organization, aggregation and classification of all that data? Who will make sense of it? It might not be Google, but it will be someone similar. And that will be the true advent of "real time search": the ability to sift data as it comes in, derive some sense, and report it intelligently.

Welcome to the data society.
