A collusion by wetpaint and the Altimeter Group has resulted in a fanciful study on social media. Normally, a paper like this wouldn't be worth addressing, but the amount of attention being paid to its questionable conclusion warrants a closer look. And that conclusion is:
[T]his landmark study has found that the most valuable brands in the world are experiencing a direct correlation between top financial performance and deep social media engagement. The relationship is apparent and significant: socially engaged companies are in fact more financially successful.
More than that, before issuing the boilerplate precaution ("While no one yet has the data to determine direct cause and effect..."), the paper boldly states: "Why do social media? We finally have a good answer: Because it pays off." Not content with this finding alone, it goes on to say "we're looking at statistical significance among the world's most valuable brands" - a claim that goes unverified and unsupported by the rest of the paper. Besides, statistical significance does not a real finding make.
Ultimately, to spare you reading through what is about to come, I am left wondering whether social media really drives financial success... or whether companies that were already industry leaders before the rise of social media had extra cash to throw into otherwise-frivolous and ultimately unproductive internet pursuits. I'm not necessarily taking a stand either way (though it will be clear which view I prefer); I'm just pointing out that both claims are equally upheld by the results of this study.
Of course, we know exactly why the paper arrives at its conclusion: like any sponsored study, it is aimed at demonstrating that the authors' time is well spent - in this case, the authors being groups that sell social media tools and consulting services. The challenge of any study is not merely to express an idea; it is to do so while eliminating doubt about the alternatives. Not only does this paper fail in that regard, but the manner in which the analysis was performed leaves me skeptical of its proposition in the first place.
The paper's very methodology is extraordinarily questionable. Consider its premier chart:
The chart compares two key metrics: "channels" (the number of social media outlets that a firm participates in) and "engagement" (a score of that firm's social presence across all of its channels). The paper states that the engagement score is the sum of the sub-score from each channel it participates in, which is the first problem here.
Imagine someone handing out buckets that contain a random number of marbles, from zero to ten. Two people walk by, the first holding two buckets and the second balancing seven. Would you be at all surprised if the person with seven buckets had more marbles? Of course not - because despite being random, the number of marbles should increase with the number of buckets!
So it is with this study - the engagement score is the sum of the channel sub-scores, so it is not at all surprising or even interesting that firms with more channels have higher scores. I can even predict the equation of the regression line: [score] = [number of channels] * [average score per channel]. If this seems incredibly obvious, it's because it is. The study purports to detect a more-than-linear effect, but the finding is so clouded by the methodology that it is difficult to discern quantitatively. To see it, we would have to calculate the difference between the slope of the regression line and the average score per channel, a comparison this model setup does not make particularly easy.
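The bucket logic is easy to check with a quick simulation. Everything here is made up - the sub-scores are just uniform draws from 0 to 10 - but that is the point: even with purely random sub-scores, total score rises mechanically with channel count.

```python
import random

random.seed(0)

def engagement_score(n_channels):
    """Hypothetical score: the sum of one random 0-10 sub-score per channel."""
    return sum(random.uniform(0, 10) for _ in range(n_channels))

# Average score for firms with 2 channels vs. 7 channels, over many trials
trials = 10_000
avg_two = sum(engagement_score(2) for _ in range(trials)) / trials
avg_seven = sum(engagement_score(7) for _ in range(trials)) / trials

# With an average sub-score of 5 per channel, we expect roughly 10 and 35:
# the "regression line" [score] = [channels] * [average score per channel]
# falls out of the score's construction, with no engagement effect at all.
print(round(avg_two, 1), round(avg_seven, 1))
```

The upward slope is baked in before a single real firm is observed.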
As a side effect of the odd construction of the engagement score, the study introduced heteroskedasticity into its variables - a phenomenon that compromises many statistical procedures (not to mention the paper's claim of statistical significance). Heteroskedasticity means that a variable does not exhibit constant variance - in this case, the engagement scores of the "many channels" firms are much more widely dispersed than those of the "few channels" firms.
This isn't surprising - the engagement scores of firms with few channels can range from zero to some relatively low number, while high-channel scores can range from zero to a much larger number. Thus, high-channel scores will exhibit greater variance than low-channel scores. You could also show this with some math: the sum of many random variables will have greater variance than a sum of few random variables.
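That math is worth seeing in miniature. The same made-up sub-scores as before are enough to show that the variance of a sum grows with the number of terms being summed - which is exactly the heteroskedasticity at issue:

```python
import random

random.seed(1)

def engagement_score(n_channels):
    """Hypothetical score: the sum of one random 0-10 sub-score per channel."""
    return sum(random.uniform(0, 10) for _ in range(n_channels))

trials = 10_000
low = [engagement_score(2) for _ in range(trials)]   # few-channel firms
high = [engagement_score(9) for _ in range(trials)]  # many-channel firms

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Variance of a sum of n independent sub-scores grows linearly in n,
# so high-channel scores are far more dispersed than low-channel scores.
print(variance(low), variance(high))
```

With independent sub-scores, nine channels should show roughly 4.5 times the variance of two - dispersion that has nothing to do with any firm being better at social media.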
Two regression lines
In order to illustrate a bullet point claiming that "as the number of channels increase, overall engagement increases at a faster rate", the study's authors decided to run two separate regressions on their dataset, as can be seen in the chart above.
Under what circumstances is it okay to run two regressions on subsets of a single dataset?
- When your data comes from disparate sources, such that a single model simply isn't applicable to the entire sample... and that's about it.
These companies were not drawn from separate populations - and even if they were, it is possible to incorporate two distinct models within a single regression by using dummy variables. This is preferable under any circumstances, and it is what the study should have done once its authors decided two regressions were the way to go. I do not agree with the decision to split the dataset, since there is no obvious real-world reason to have done so (the argument that six channels is the halfway point is itself arbitrary). Moreover, it is not at all trivial to test the difference between two coefficients from two separate regressions for statistical significance. The heteroskedasticity alone means the error on the "high channel" coefficient would be very different from that on the "low channel" coefficient, further complicating any comparison.
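The dummy-variable approach is straightforward to sketch. On hypothetical data with a built-in kink at six channels, one regression with an indicator variable recovers both slopes, and the slope *difference* becomes a single coefficient that standard machinery can test directly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 100 firms, 1-11 channels, with a kink at 6 channels
channels = rng.integers(1, 12, size=100)
scores = np.where(channels <= 6,
                  5 * channels,
                  5 * channels + 3 * (channels - 6)) + rng.normal(0, 2, 100)

# One regression with a dummy: d = 1 for firms with more than six channels.
#   score = b0 + b1*channels + b2*d + b3*(d*channels)
# b1 is the low-channel slope; b1 + b3 is the high-channel slope, so b3
# (the slope difference) is a single estimated coefficient.
d = (channels > 6).astype(float)
X = np.column_stack([np.ones_like(d), channels, d, d * channels])
coef, *_ = np.linalg.lstsq(X, scores, rcond=None)
low_slope, slope_change = coef[1], coef[3]
print(low_slope, slope_change)
```

Instead of eyeballing two separate trend lines, one asks whether b3 differs from zero - the question the study actually cared about.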
Here is the study's (footnoted) justification:
Running a regression analysis on the full set of 100 brands resulted in a best fit line that favored companies skewed towards fewer channels. In order to provide a meaningful benchmark, we incorporated a break at six channels, which reflected both the natural data distribution and the average number of channels for all 100 companies. The two resulting trend lines generated stronger regression coefficients, more relevant comparisons for any given peer set, and provided further insights regarding social media behaviors across the range of channel presence.
Now, regressions don't "favor" things - they merely fit data. If a regression is biased toward a certain part of your data, it's because that's a real characteristic of the dataset. Moreover, are they surprised that they got stronger coefficients? Next time, try running a separate regression for each individual company - see how strong the coefficients are then!
What the statisticians really mean is that running just one regression doesn't illustrate the fact that high-channel companies have ever-higher scores - and that they didn't know how to actually incorporate that effect. Let's take a look at what their rejected chart really looked like (fortunately, the data was provided along with the study):
Honestly, I don't think it looks that bad, though I agree it doesn't capture the curvature in the data, and the fact that it predicts negative scores means it doesn't fully respect the real-world constraint that scores are non-negative. There are a few ways the authors could have improved it while keeping the dataset whole, before resorting to chopping it in half. Most simply, they could have employed a polynomial regression, which is as simple as including the square of the number of channels as an independent variable. That would have looked like this:
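In code, the polynomial fit really is a one-column change to the design matrix. On hypothetical data with faster-than-linear growth:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data with faster-than-linear growth in the score
channels = rng.integers(1, 12, size=100).astype(float)
scores = 3 * channels + 0.4 * channels**2 + rng.normal(0, 3, 100)

# Polynomial regression: add channels**2 as a second regressor.
#   score = b0 + b1*channels + b2*channels**2
X = np.column_stack([np.ones_like(channels), channels, channels**2])
b0, b1, b2 = np.linalg.lstsq(X, scores, rcond=None)[0]

# A positive b2 is the "more-than-linear" effect the study was after,
# estimated from the whole dataset in a single regression.
print(b1, b2)
```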
Looks pretty good to me! And it provides a nice explanation - because I can't justify this model naively (as we know the score should grow linearly, not with the square), the excess growth must be due to more effective social media use. The squared variable, in other words, captures the effect we are trying to measure. We are still left with the heteroskedasticity, however. One way to remove it is to transform the dataset. A common transformation of data such as this is the natural logarithm. The resulting plot looks like this:
A linear regression of log-transformed data is the same as an exponential regression of the original data. Additionally, the transformation has the benefit of more or less removing the heteroskedasticity - if anything, the low channels now have greater variance (though not so much that I'd be too worried).
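The equivalence is easy to verify on hypothetical exponential data: fitting a line to log(score) recovers the base level and growth rate of score = A * exp(b * channels), and multiplicative noise in the original scale becomes constant-variance noise in the log scale.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical scores that grow exponentially, with multiplicative noise
channels = rng.integers(1, 12, size=100).astype(float)
scores = 2.0 * np.exp(0.3 * channels) * np.exp(rng.normal(0, 0.2, 100))

# Fit a line to the log-transformed scores:
#   log(score) = a + b*channels,  i.e.  score = exp(a) * exp(b*channels)
X = np.column_stack([np.ones_like(channels), channels])
a, b = np.linalg.lstsq(X, np.log(scores), rcond=None)[0]

print(np.exp(a), b)  # estimates of the base level and the growth rate
```

This is what an actual "exponential growth" claim would look like as a model.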
What really bugs me is that the authors describe their model by saying, "There is an exponential growth in the depth of engagement as the brand extends itself into more and more channels." But then why did they use two linear regressions instead of an exponential regression? Frankly, I think it's a combination of them not knowing how, and "exponential growth" becoming a social media buzzword used without consideration of its meaning.
So we have seen three ways that the regression could have been improved (I include the simple linear regression as an improvement). However, none of them really make up for the fact that the engagement score, as constructed, is a silly thing to compare to the number of channels, as we have seen. There is no interesting information that can be gleaned from the upward-sloping regression, since we expect that outcome anyway.
My options are somewhat limited, given that I only have score and channel data, but nonetheless I propose using the average score per channel as a better metric. We would not expect the average score to increase with the number of channels, and so to the extent that it is correlated, the simple linear regression will capture that effect. This results in the following plot:
Much better! There's no heteroskedasticity and the linear fit looks good. And, helpfully, we can actually learn something useful from the regression coefficients: for every additional channel, firms tend to enjoy an extra half a point in their engagement score.
This is a vast improvement over what the authors implied (but did not elaborate) their model did - illustrating that the engagement scores of firms with more channels were higher than one would naively have expected. In the original setup, we had to compare two regression slopes to the average slope across all points - a flawed analysis in and of itself - followed by a comparison of those two differences. Here, we merely need to read off a single coefficient.
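On hypothetical data, the proposed metric looks like this. Under the null hypothesis that channel count doesn't matter, the per-channel average is flat; any positive slope is the engagement effect itself, read off directly:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical per-channel averages: a base level of 5, plus half a point
# of extra engagement per additional channel (the effect we want to find)
channels = rng.integers(1, 12, size=100).astype(float)
avg_score = 5 + 0.5 * channels + rng.normal(0, 1, 100)

# Simple linear regression of the per-channel average on channel count.
#   avg_score = intercept + slope*channels
X = np.column_stack([np.ones_like(channels), channels])
intercept, slope = np.linalg.lstsq(X, avg_score, rcond=None)[0]
print(slope)
```

A slope near zero would mean firms just accumulate channels; a clearly positive slope means multi-channel firms engage more per channel - no slope-versus-slope gymnastics required.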
Social media drives profits
Quotes like these should be a red flag in any study:
To be specific, companies that are both deeply and widely engaged in social media surpass their peers in terms of both revenue and profit performance by a significant difference. In fact, these Mavens have sustained strong revenue and margin growth in spite of the current economy. Coincidence? Perhaps, but we’re looking at statistical significance among the world’s most valuable brands.
...especially when no statistical details are provided. Instead, we get this chart:
(It is suggested but not clearly shown that these financial values are all relative to peer groups rather than absolute changes).
These graphs are meant to show that Mavens, or high-engagement, high-channel firms, have better financials than other firms. The implication, as quoted earlier, is that "[social media] pays off."
Here's my alternative: "floundering companies spend less money on social media, and are less effective in that arena." We would also accept "superlative companies have excess cash and spend it on social media." Investing in social media is a cost measured in tens or hundreds of thousands of dollars; revenues are measured in the tens, hundreds or even thousands of millions of dollars. Are we really to think that social media has a thousandfold multiplicative impact?
Can't say it enough - the study even says it itself - correlation does not imply causation. The phrase is thrown around so often that it has become a cliché and is consequently ignored, but it is nonetheless true. And as a rule, the simplest answer is probably the right one. So what do you think - does spending money on Facebook and Twitter result in booming profit, or are profitable companies simply the only ones who can afford to do so?