I objected in comments to the relationships having been fitted with straight lines when they would necessarily be curved, because percentages are bounded by 0 and 100%. Sure enough, the curvature that we would expect to see was plainly there in some of the graphs.
Now, leaving aside a whole lot of other issues (some of which are addressed in the Friendly Atheist thread, some of which aren't), one way to make the relationships easier to see is to transform the percentages in order to stretch out the values close to the boundaries. One popular transform for proportions is the logit transform (the log-odds, log(p/(1-p)) for a proportion p).
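To make the stretching concrete, here is a minimal sketch of the logit and its inverse in plain Python. The function names are my own choice for illustration, and the percentages in the loop are arbitrary values rather than anything from the survey.

```python
import math

def logit(p):
    """Log-odds of a proportion p, where 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inv_logit(z):
    """Inverse of the logit: maps any real number back into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Arbitrary illustrative percentages (not data from the post).
# Values near 0% and 100% get pushed far apart; values near 50% barely move.
for pct in (1, 5, 10, 50, 90, 95, 99):
    print(f"{pct:3d}%  ->  logit = {logit(pct / 100):+.2f}")
```

The gap between 1% and 5% comes out much larger on the logit scale than the gap between 46% and 50%, which is exactly the stretching near the boundaries that we want.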
Let's see its effect on the graph of the relationship between percentage of people with postgrad education and percentage who take the bible literally.
The R² value is the square of the linear correlation between the two variables. But the relationship is strongly nonlinear! There's little value in this number.
As I said over in the Friendly Atheist comment thread:
These variables should NOT be having straight lines fitted to their relationships, unless someone really thinks percentages can go outside 0-100!
Look at the fitted “percentage with postgrad education” for the “Church of God In Christ”. It’s NEGATIVE! That makes no sense at all.
At the very least, a functional relationship that obeys the a priori facts about the situation (those fractions being bounded to [0,1], for example) should be used. In the first graph, the IQs also have a lower bound of 0, but we’re so many s.d.s from zero it doesn’t matter quite so much for that variable (there’s still the issue that there’s no a priori reason to expect linearity, though).
It really does matter for the percentages, because they approach their limits in this data. Notice the actual relationship from the points is curved in the second graph? That’s because the boundaries force it to be curved. Why is a straight line being fitted to a relationship that is plainly not (and worse, pretty obviously won’t be before we even see data)?
The linear equation, the R-squared and so on are all nonsense - worse than useless! (Indeed, since neither variable is necessarily thought to be causative, why use a technique - regression, whether linear or not - that treats one variable as the predictor and the other as the response?)
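To show how a straight line ends up predicting a negative percentage, here is a small sketch. The numbers are made up purely for illustration (they are NOT the survey data), and `least_squares_line` is a hypothetical helper written for this example.

```python
def least_squares_line(x, y):
    """Ordinary least-squares fit of y on x; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope

# Made-up percentages for six hypothetical groups (not the real data).
literal = [62, 45, 30, 18, 9, 4]    # % taking the bible literally
postgrad = [2, 4, 7, 12, 20, 30]    # % with postgrad education

intercept, slope = least_squares_line(literal, postgrad)
for x in literal:
    print(f"literal = {x:2d}%  fitted postgrad = {intercept + slope * x:6.2f}%")
```

With these invented values the fitted line dips below zero for the group with the highest "literal" percentage, even though every observed value is positive.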
The original post even refers to the "Roman Catholic" point as an outlier. It isn't! It only looks like an outlier if you're crazy enough to fit a straight line. If you look at it as a curved relationship, it fits in just fine.
Instead of using a linear correlation, we could measure a nonparametric correlation - one that measures the monotonic association between the two variables. That is, something that measures the extent to which one variable consistently increases (or, as here, consistently decreases) as the other increases, whether or not the relationship is a straight line. There are many such quantities - two common ones are the Kendall measure of concordance (Kendall's tau) and the rank correlation (Spearman's rho).
Because there is a strong relationship - just not a linear one - the monotonic association is higher than the linear correlation for this data. The linear correlation is -0.86, while the Spearman measure of the monotonic association is -0.92.
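Here is a rough sketch of how the three measures compare on a curved but monotonic relationship. The data are invented for illustration (so the numbers will not match the -0.86 and -0.92 above), the function names are my own, and the Kendall version is the simple tau-a with no correction for ties.

```python
import math

def pearson(x, y):
    """Ordinary (linear) correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho: the linear correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant pairs) / total pairs."""
    n, s = len(x), 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)

# Made-up percentages (not the survey data): curved, but perfectly monotonic.
postgrad = [2, 4, 7, 12, 20, 30]
literal = [62, 45, 30, 18, 9, 4]

print("Pearson :", round(pearson(postgrad, literal), 3))
print("Spearman:", round(spearman(postgrad, literal), 3))
print("Kendall :", round(kendall_tau_a(postgrad, literal), 3))
```

Because the invented relationship is perfectly monotonic, both rank-based measures reach their extreme value while the linear correlation falls short of it, which is the same pattern (in exaggerated form) as in the real data.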
However, I think the main issue is simply a better display, so let's return to the graph of the suggested transformation.
I did this one in a bit of a hurry, so it's a little rough, but it gives the idea. In order that we can see the relationship more clearly I have omitted the labels for the individual points, but they could be included. Notice that the "Catholic" point (the one to the right of the "10%" tickmark on the vertical axis) is clearly not an unusual point - it fits the pattern nicely.
Note that since the logit transformation is monotonic, this transformation doesn't alter the nonparametric correlations at all. So, for example, the Spearman measure of monotonic association is still -0.92.
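The reason is that rank-based measures only look at the ordering of the values, and the logit leaves the ordering untouched. A minimal check, using made-up proportions and a hypothetical helper `rank_order`:

```python
import math

def logit(p):
    """Log-odds of a proportion p, where 0 < p < 1."""
    return math.log(p / (1.0 - p))

def rank_order(values):
    """Indices of the observations, sorted from smallest value to largest."""
    return sorted(range(len(values)), key=lambda i: values[i])

# Made-up proportions (not the survey data).
props = [0.62, 0.04, 0.30, 0.09, 0.45, 0.18]
transformed = [logit(p) for p in props]

# The ordering is identical before and after the transformation, so any
# statistic computed from ranks alone (Spearman, Kendall) cannot change.
same_order = rank_order(props) == rank_order(transformed)
print("ordering preserved:", same_order)
```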
The linear correlation on this transformed scale has changed, however, because the relationship is now almost linear: it is -0.90, compared with -0.86 on the raw percentage scale. I still don't advocate drawing a line on the plot (though a line is now a pretty good description); if a relationship must be drawn in, any of a number of standard nonparametric smoothers could be used. I think we can see the relationship just fine on the second plot.
As to whether the relationship means much of anything, that's another issue, but at least we can now clearly see it, without some distracting straight line (and equation, and r-squared value) on the graph, mis-relating the raw percentages.
I think both graphs provide valuable information. Ideally, I'd be tempted to display both side by side and, since the table of data is small, to give that as well.
Here's a plot with an added smooth, done on the original raw percentage scale. This one was generated by an old version of SPlus, but R can also generate stuff like this (as can numerous other packages). The smooth here is just the default spline smooth, though the supersmoother was about as good, and the loess smoother would probably work fine if I tweaked its parameters a bit (the default is too local).
This smooth (and the others I mentioned), being done on the original scale, doesn't recognize the inherent restrictions I discussed above, namely that percentages are bounded by 0 and 100%. I think a better way to smooth would be to transform the data (as in the second plot above), smooth that, and then, if desired, take that smooth back to the original scale. Of course, the back-transformed curve is no longer estimating a mean on the percentage scale, but that's not such a huge deal - we're just trying to describe a relationship.
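As a sketch of that transform, smooth, back-transform idea: the code below uses a simple Gaussian kernel smoother as a stand-in for the spline smooth (it is not what S-PLUS does by default), with made-up proportions and an arbitrary bandwidth of 1.0 on the logit scale. All names here are my own.

```python
import math

def logit(p):
    """Log-odds of a proportion p, where 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inv_logit(z):
    """Inverse of the logit: maps any real number back into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def kernel_smooth(x, y, x0, bandwidth):
    """Gaussian-kernel weighted average of y, evaluated at the point x0."""
    w = [math.exp(-0.5 * ((xi - x0) / bandwidth) ** 2) for xi in x]
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)

# Made-up proportions (not the survey data).
literal = [0.62, 0.45, 0.30, 0.18, 0.09, 0.04]
postgrad = [0.02, 0.04, 0.07, 0.12, 0.20, 0.30]

# Step 1: move both variables to the logit scale.
zx = [logit(p) for p in literal]
zy = [logit(p) for p in postgrad]

def smooth_postgrad(p_literal, bandwidth=1.0):
    """Smooth on the logit-logit scale, then map back to a proportion."""
    z = kernel_smooth(zx, zy, logit(p_literal), bandwidth)
    return inv_logit(z)

# Step 2: evaluate the back-transformed smooth inside the range of the data.
for p in (0.05, 0.20, 0.40, 0.60):
    print(f"literal = {p:.0%}: smoothed postgrad = {smooth_postgrad(p):.1%}")
```

Whatever the smoother does on the logit scale, the inverse logit can only return values strictly between 0 and 1, so the curve that comes back automatically respects the bounds.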
Here's a graph of what happens when you smooth on the transformed scale and transform back to the original (percentage) scale. The blue curve is the smooth curve shown above, while the more strongly bent green curve is the smooth done on the logit-logit scale and then brought back. On the transformed scale the default spline smooth was somewhat curved (though much less so than a spline smooth on the original scale), and of course, when we come back, it's definitely curved on the percentage scale.
We shouldn't extrapolate any of these relationships outside the range of the data, but at least within the range of the data, the smooth curves above are not implausible descriptions of how the variables are related to one another.