Twitter, exposed

July 2, 2009 in Data,Internet

Twitter's data model is, interestingly enough, entirely user-generated. Hashtags of every variety, retweets, and other methods of ascribing meta-information to tweets have developed outside any formal structural model or standard. The lone first-party implementation is that an "@" prefix links directly to a person, and even that isn't fully functional.

All of my problems with the service stem from a deep-seated paranoia that as it gains acceptance as a communications tool, people will begin to treat it as a real method for data syndication, something it absolutely does not support. This fear keeps me from recognizing that Twitter excels when used for its stated purpose: telling people, in real time, what you're up to. But as we move beyond this superficial usage, the massive limitations of the system will simply erase data which otherwise would be maintained, catalogued, and available to future users of the tool.

In a series of examples, I would like to demonstrate my frustration with the system as it stands. All of these can be remedied with trivial patches and I remain astounded that they persist at all.

Spurious information via retweet copies

As I've discussed before, retweets are particularly problematic because they create loosely-attributed copies of information without referencing the original idea (in a linked-data sense), thereby flooding Twitter with noise. Consider what would happen if I tweeted "Twitter's data model is broken" and it was retweeted by a handful of my followers, then by a handful of their followers, and so on. The product of this exponential distribution system is many separate copies of the same message: "Twitter's data model is broken."

Now someone comes along and performs a search for "Twitter" and "data". The first 1,000 results that come up are retweets of my original message. Why? Because each retweet is considered a separate piece of information by the system - since it has no way to know that they are merely relays of a single, original idea - and they all satisfy the search string.

It should be obvious that a more useful system would display my original tweet at the top of the list, and indicate that it had been retweeted X times, which is why it had been assigned high relevance to the query. The second result would not be someone else's copy of my information, but an entirely different (though relevant) thought. In this system, retweets are an officially-supported mechanism which, rather than simply sending out a new tweet that happens to contain the same words as someone else's tweet, merely rebroadcasts an instance of the original tweet to the retweeter's followers. Same or stronger signal, no additional noise.
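To make the idea concrete, here is a minimal sketch in Python of a search that folds retweets into their original. Everything in it is my own illustration, not anything Twitter provides: the `Tweet` record, the `retweet_of` field, and the `search` function are hypothetical names, and the relevance rule (most-retweeted first) is just the simplest one that fits the description above.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tweet:
    id: int
    author: str
    text: str
    # Hypothetical field: id of the original tweet if this is a retweet.
    retweet_of: Optional[int] = None


def search(tweets, terms):
    """Return (original tweet, retweet count) pairs that match every term,
    most-retweeted first. Retweets never appear as results themselves."""
    counts = defaultdict(int)
    for t in tweets:
        if t.retweet_of is not None:
            counts[t.retweet_of] += 1

    wanted = [w.lower() for w in terms]
    hits = [
        t for t in tweets
        if t.retweet_of is None and all(w in t.text.lower() for w in wanted)
    ]
    # A retweet only raises the rank of the original; it adds no new result.
    hits.sort(key=lambda t: counts[t.id], reverse=True)
    return [(t, counts[t.id]) for t in hits]
```

With this shape, a thousand retweets of one message produce one search result with a count of 1,000, and the second result is free to be a different thought.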

Spam via retweet copies

Consider the example:

  1. Person A sends out a tweet ("Twitter's data model is broken").
  2. Person B retweets Person A's tweet.
  3. Person A retweets Person B's retweet of Person A's original tweet.
  4. Repeat ad nauseam.

Again, perform a search for "Twitter" and "data". The results page would be flooded by the message "Twitter's data model is broken", but coming over and over from just two people! This much should be apparent as an extreme case of the first example I described. But this spam could be avoided by considering not only the number of retweets in assessing relevance, but also the number of retweets by unique users.

In a data model which views copies as a measure of relevance, the infinite progression described here would be viewed as extremely important. Obviously, that isn't the case; the data model needs to jibe with our intuition, especially in a simple case like this.

Now, to be fair, this could be done in any data model (just send out slight variants of the message, or employ more than two people)... but traditional spam filters would come into play here.
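The difference between counting retweets and counting unique retweeters is easy to show. The sketch below is hypothetical (the `retweet_scores` name and the `(retweeter, original_tweet_id)` input shape are my assumptions), and it presumes a model in which every retweet, however many hops removed, resolves back to the original tweet's id.

```python
def retweet_scores(retweets):
    """retweets: iterable of (retweeter, original_tweet_id) pairs.

    Returns {original_tweet_id: (raw_count, unique_user_count)}.
    """
    raw = {}
    users = {}
    for user, original_id in retweets:
        raw[original_id] = raw.get(original_id, 0) + 1
        users.setdefault(original_id, set()).add(user)
    return {tid: (raw[tid], len(users[tid])) for tid in raw}


# Person A and Person B bounce tweet 1 back and forth fifty times each.
ping_pong = [("B", 1), ("A", 1)] * 50
scores = retweet_scores(ping_pong)
```

Here the raw count for tweet 1 is 100, but the unique-user count is 2, which is much closer to how important the message actually is.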

Conversation tracking

Over the course of a few days, two people have a conversation over Twitter. In the meantime, they carry on conversations with other people as well. At the end, I want to follow their conversation from beginning to end.

I can't.

There is no mechanism to follow a conversation (even Facebook has this in a very basic sense as "wall-to-wall"). I could search for messages from one user containing another user's name, and do the same for the second user, and then look back and forth at the two results... but really? Is that an effective mechanism? Just recently, Felix Salmon tried to re-capture a Twitter conversation for publication on his blog and found it quite difficult.
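If each tweet simply recorded which tweet it was answering, following a conversation would be a matter of walking those links. Here is a minimal sketch under that assumption; the `in_reply_to` field, the dictionary layout, and the `conversation` function are hypothetical names of my own.

```python
def conversation(tweets_by_id, last_id):
    """Follow reply links backwards from last_id and return the thread,
    oldest tweet first. Stops at a missing tweet or a repeated id."""
    thread = []
    seen = set()
    current = last_id
    while current is not None and current not in seen:
        tweet = tweets_by_id.get(current)
        if tweet is None:
            break  # the chain points at a tweet we do not have
        seen.add(current)
        thread.append(tweet)
        current = tweet["in_reply_to"]  # hypothetical field
    thread.reverse()
    return thread
```

Tweets the two people exchanged with anyone else in the meantime never enter the chain, so they drop out without any searching or cross-referencing.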

Archiving and retrieval

Finally, there is no mechanism to retrieve past data that hasn't been explicitly saved by a user. Search results are unequivocally sorted chronologically. Apologists will argue that the system asks people "What are you doing?" and is only concerned with right now - but unfortunately the fact that someone enters information in the present tense does not mean it will be read in the same. Moreover, at times it will actually be desirable to find out what someone was doing an hour, a day, a year ago.

Link sharing is one of Twitter's most popular uses. And yet good luck finding a link that wasn't published today. Imagine being told "Oh, I saw this great picture on Twitter... so and so sent it out last week." Are you really going to dig one by one in reverse order through so-and-so's incomprehensible URL-shortened links to find it? Wouldn't it be nice to call up that person's history from a certain day?
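Calling up a person's history for a given day is not a hard query. A rough sketch, assuming each tweet carries an author and a timestamp (the `tweets_on` function and the `author` and `posted` keys are hypothetical):

```python
from datetime import date, datetime


def tweets_on(tweets, author, day):
    """Return the given author's tweets posted on one calendar day, oldest first."""
    matches = [
        t for t in tweets
        if t["author"] == author and t["posted"].date() == day
    ]
    return sorted(matches, key=lambda t: t["posted"])
```

Asking for so-and-so's tweets from last Thursday then becomes a single call, rather than paging backwards through everything they have sent since.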


I'm genuinely scared that in the absence of these key features (or presence of these flaws), Twitter remains unable to act as a true communications tool as we've come to expect in an internet-enabled, data-driven society. It is, in fact, the anti-Google Wave... and I think that's not a good position to be in.
