The flow of information

June 16, 2009 in Data, Internet

This NYT article on Twitter and Iran sums it all up (emphasis mine):

“We’ve been struck by the amount of video and eyewitness testimony,” said Jon Williams, the BBC world news editor. “The days when regimes can control the flow of information are over.”

It's an amazing and deserved accolade for the young service.


To abuse the common metaphor, we're drinking from a fire hydrant... and have absolutely no way to collect the drops we miss. Twitter's data model is horribly broken. Instead of signaling a tweet's value with a marker such as a link (a la Google) or a thumbs up (a la Digg) or a rating (a la Yelp), users throw the letters "RT" in front of it and blast the text out as their own. Each time this happens, the information is echoed - duplicated - into the Twitterverse; for the most retweeted messages, the growth can be exponential. We end up with hundreds or thousands of instances of the same datapoint! This is horribly inefficient (even impossible) from a data organization standpoint. Do we really need to recreate data to show support for it?

This is akin to emailing a document to various people for commenting and editing. In order to send it to 5 people, 5 different instances of the document must be created. If those people make modifications and send it back, there are now 5 unique pieces of data to address. Modern communications systems like Google Wave are aimed at solving this very problem; systems like Twitter exacerbate it.

Go search for Iran on Twitter, or view the #iranelection topic. Most of the messages there are retweets, which is to say noise. Why can't I just "RT" someone else's tweet by clicking a button, in which case the tweet is referenced to all my followers? Why do I need to actually rebroadcast it as my own?
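To make the "click a button" idea concrete, here is a minimal sketch of a retweet stored as a reference rather than a copy. Everything in it (the `Tweet` and `TweetStore` names, the in-memory dictionaries, the delivery logic) is a hypothetical illustration, not how Twitter actually works:

```python
from dataclasses import dataclass, field


@dataclass
class Tweet:
    tweet_id: int
    author: str
    text: str
    # Users who endorsed this tweet. A set, so repeat endorsements don't pile up.
    endorsed_by: set = field(default_factory=set)


class TweetStore:
    """Hypothetical store where each message exists exactly once."""

    def __init__(self):
        self.tweets = {}     # tweet_id -> Tweet
        self.followers = {}  # author -> set of follower names
        self.timelines = {}  # user -> list of tweet ids (references, not text)

    def follow(self, follower, author):
        self.followers.setdefault(author, set()).add(follower)

    def post(self, author, text):
        tweet = Tweet(len(self.tweets) + 1, author, text)
        self.tweets[tweet.tweet_id] = tweet
        self._deliver(author, tweet.tweet_id)
        return tweet

    def retweet(self, user, tweet_id):
        # No new Tweet is created: record the endorsement and pass the
        # reference along to this user's followers.
        self.tweets[tweet_id].endorsed_by.add(user)
        self._deliver(user, tweet_id)

    def _deliver(self, sender, tweet_id):
        for follower in self.followers.get(sender, set()):
            timeline = self.timelines.setdefault(follower, [])
            if tweet_id not in timeline:
                timeline.append(tweet_id)
```

If bob retweets alice's message, bob's followers see it in their timelines, still attributed to alice, and the store still holds a single tweet whose `endorsed_by` set doubles as its importance score.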

Jeff Clark at Neoformix recently took all the Iran-related tweets from the last 4 days and statistically selected the "most representative tweet" from each of 30 successive time slots. The astounding result is a list in which 24 out of 30 "representative tweets" are RTs. In other words, 80% of the most representative tweets are not even attributed to their own author, but rather to someone else who decided to rebroadcast the message (and whose selection as representative is thus purely by chance)! At least two of the messages are retweets of retweets, and in one case the most representative tweet of an hour block is a retweet of the message from the previous block (itself a retweet!).

Neoformix's "representative tweets" are an excellent approach, if hampered by the abundance of retweets. The retweets cannot be stripped out of the data without some work, since for the moment they provide the only metric of a tweet's importance. I was just reading about force-directed edge bundling for graphs; an analogous system would work wonders for Twitter.
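The "some work" might look something like the sketch below: peel the "RT @user" prefixes off each message, credit the original author, and keep the number of copies as the importance score. This is my own rough heuristic, not Jeff's method, and it assumes retweets follow the informal "RT @user: text" convention; the function names are made up for illustration.

```python
import re
from collections import Counter

# Matches a leading "RT @user" or "RT @user:" (an informal convention,
# so this is a heuristic rather than a guaranteed format).
RT_PREFIX = re.compile(r"^\s*RT\s+@(\w+):?\s*", re.IGNORECASE)


def unwrap(author, text):
    """Peel leading RT prefixes; return (original_author, original_text)."""
    while True:
        match = RT_PREFIX.match(text)
        if not match:
            return author, text.strip()
        # The innermost "RT @user" names the original author.
        author, text = match.group(1), text[match.end():]


def rank_originals(tweets):
    """Collapse copies of the same message and count its appearances.

    `tweets` is an iterable of (author, text) pairs. The count includes
    the original post itself plus every retweet of it.
    """
    counts = Counter()
    for author, text in tweets:
        original_author, original_text = unwrap(author, text)
        counts[(original_author.lower(), original_text)] += 1
    return counts.most_common()
```

Retweets of retweets collapse onto the same original, so the message is counted once per appearance but listed once. The obvious weakness is that retweets edited or truncated to fit 140 characters will not match their source, which is part of why copying text is such a poor way to record a vote.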

I'm getting as tired of writing "retweet" as I am of putting up with this as a data distribution system. It's archaic and incompatible with modern archival systems. Imagine if Google decided that they would rank websites by how many other people copied content - we would be overwhelmed by noise. Fortunately, right now that noise only comes 140 characters at a time.

Jeff does awesome work, in particular with Twitter visualization, so please don't think I'm directing any frustration at him; my rant is solely aimed at Twitter.
