Here's a great analysis from Ben Blatt of the Harvard Sports Analysis Collective. He looked at three well-known sports writers -- Bill Simmons, Rick Reilly and Jason Whitlock -- and performed a lexical analysis to create a statistical representation of their writing styles.
What can you do with that analysis? Well, you can see which descriptive words each author uses most frequently:
- Simmons: Biggest, Excited, Eventually, Almost, Low
- Whitlock: Spoiled, Several, Allegedly, Particular, Important
- Reilly: Tiny, Large, Very, Nice, Dumbest
Or, perhaps more interestingly, you can attribute unknown works to one of the authors. On six different papers -- the first and last from the period under consideration, which were left out of the training data -- Ben's model went 6 for 6 in choosing the correct writer. More impressively, there was no doubt: the model was 100% certain of its choices. Ben writes that he was surprised at the accuracy.
This method is more frequently seen in attributing historical works, like Shakespeare's plays or the Federalist Papers. However, Ben's success shows how powerful a statistical analysis can be. There weren't any complex algorithms at work here -- all he did was a (relatively) simple Bayesian analysis. Great and entertaining work.
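For a feel of what a simple Bayesian attribution can look like, here is a minimal sketch of a naive Bayes classifier over word counts. It is not Ben's actual model, whose details aren't given here. The function names (`tokenize`, `train`, `posterior`), the author labels, and the toy sentences are all made up for illustration, and the uniform prior and add-one smoothing are assumptions.

```python
import math
from collections import Counter

PUNCT = ".,;:!?\"'()"

def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(PUNCT).lower() for w in text.split())
    return [w for w in words if w]

def train(docs_by_author):
    """Count word occurrences per author and collect the shared vocabulary."""
    counts = {author: Counter() for author in docs_by_author}
    for author, docs in docs_by_author.items():
        for doc in docs:
            counts[author].update(tokenize(doc))
    vocab = set().union(*counts.values())
    return counts, vocab

def posterior(text, counts, vocab):
    """Return P(author | text) under naive Bayes with a uniform prior."""
    log_scores = {}
    for author, c in counts.items():
        total = sum(c.values())
        score = math.log(1 / len(counts))  # assumed uniform prior
        for w in tokenize(text):
            if w in vocab:  # skip words never seen in training
                # add-one (Laplace) smoothing avoids log(0)
                score += math.log((c[w] + 1) / (total + len(vocab)))
        log_scores[author] = score
    # normalize in log space for numerical stability
    top = max(log_scores.values())
    weights = {a: math.exp(s - top) for a, s in log_scores.items()}
    norm = sum(weights.values())
    return {a: w / norm for a, w in weights.items()}

# Hypothetical toy corpus; real training data would be full columns.
docs = {
    "author_a": ["the biggest game was almost over",
                 "fans were excited and eventually loud"],
    "author_b": ["a tiny crowd saw a very nice game",
                 "the dumbest call was very large"],
}
counts, vocab = train(docs)
probs = posterior("a very tiny crowd", counts, vocab)
print(max(probs, key=probs.get), probs)
```

Because the score multiplies one probability per word, small per-word differences compound quickly over a full-length column. That is one plausible reason a model of this kind can report near-total certainty, though whether it explains Ben's 100% figures is a guess.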