Thursday, May 29, 2008

Relationships between percentages are not linear (usually)

Friendly Atheist had some graphs relating to biblical literalism and some other variables, which in turn came from several posts by razib at Gene Expression (e.g.).

I objected in the comments that the relationships had been fitted with straight lines when they would necessarily be curved, because percentages are bounded by 0% and 100%. Sure enough, the curvature that we would expect to see was plainly there in some of the graphs.

Now, leaving aside a whole lot of other issues (some of which are addressed in the Friendly Atheist thread, some of which aren't), one way to make the relationships easier to see is to transform the percentages in order to stretch out the values close to the boundaries. One popular transform for proportions is the logit transform (the log-odds, log(p/(1-p)) for a proportion p).
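To make the "stretching" concrete, here is a minimal sketch of the logit and its inverse in Python. The function names and the example proportions are my own choices for illustration, not anything taken from the graphs.

```python
import math

def logit(p):
    """Log-odds of a proportion p, defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inv_logit(z):
    """Inverse of the logit: maps any real number back into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# The same 4-point step in proportion is a much bigger step in log-odds
# near the boundaries than it is near the middle.
for lo, hi in [(0.01, 0.05), (0.48, 0.52), (0.95, 0.99)]:
    step = logit(hi) - logit(lo)
    print(f"{lo:.2f} -> {hi:.2f}: logit step = {step:.2f}")
```

Values near 0 and 1 get pushed out towards minus and plus infinity, so points crowded against a boundary are spread apart on the transformed scale.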

Let's see its effect on the graph of the relationship between percentage of people with postgrad education and percentage who take the bible literally.

Original:


The R² value is the square of the linear correlation between the two variables. But the relationship is strongly nonlinear! There's little value in this number.

As I said over in the Friendly Atheist comment thread:
These variables should NOT be having straight lines fitted to their relationships, unless someone really thinks percentages can go outside 0-100!

Look at the fitted “percentage with postgrad education” for the “Church of God In Christ”. It’s NEGATIVE! That makes no sense at all.

At the very least, a functional relationship that obeys the a priori facts about the situation (those fractions being bounded to [0,1], for example) should be used. In the first graph, the IQs also have a lower bound of 0, but we’re so many s.d.s from zero it doesn’t matter quite so much for that variable (there’s still the issue that there’s no a priori reason to expect linearity, though).

It really does matter for the percentages, because they approach their limits in this data. Notice the actual relationship from the points is curved in the second graph? That’s because the boundaries force it to be curved. Why is a straight line being fitted to a relationship that is plainly not linear (and worse, pretty obviously won’t be, even before we see any data)?

The linear equation, the R-squared and so on are all nonsense - worse than useless! (Indeed, since neither variable is necessarily thought to be causative, why use a technique - regression, whether linear or not - that treats one variable as the predictor and the other as the response?)
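The negative fitted value is easy to reproduce. The sketch below fits an ordinary least-squares line to some invented percentages that curve towards zero; the numbers are made up for illustration and are NOT the data behind the graphs.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the ordinary least-squares line."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

# Invented percentages (hypothetical, not the real survey data).
literal_pct = [5, 10, 20, 30, 45, 60, 75, 90]
postgrad_pct = [40, 28, 18, 12, 7, 4, 2.5, 1.5]

intercept, slope = least_squares_line(literal_pct, postgrad_pct)
fitted_at_max = intercept + slope * max(literal_pct)
print(f"fitted postgrad % at {max(literal_pct)}% literalist: {fitted_at_max:.1f}")
```

Every observed value is positive, yet the line's fitted value at the right-hand end is below zero, which is exactly the kind of impossible "prediction" complained about above.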

The original post even refers to the "Roman Catholic" point as an outlier. It isn't! It only looks like an outlier if you're crazy enough to fit a straight line. If you look at it as a curved relationship, it fits in just fine.

Instead of using a linear correlation, we could measure a nonparametric correlation - one that measures the monotonic association between the two variables. That is, something that measures the extent to which one variable consistently increases (or consistently decreases) as the other increases, whether or not the relationship is a straight line. There are many such quantities - two common ones are the Kendall measure of concordance (Kendall's tau) and the rank correlation (Spearman's rho).

Because there is a strong relationship - just not a linear one - the monotonic association is higher than the linear correlation for this data. The linear correlation is -0.86, while the Spearman measure of the monotonic association is -0.92.
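For anyone who wants to see how these measures differ, here is a rough sketch that computes all three from scratch. The data are invented (a curved but perfectly monotonic relationship), so the numbers it prints are not the -0.86 and -0.92 quoted above; the helper names are my own.

```python
import math

def pearson(xs, ys):
    """Ordinary linear (Pearson) correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks 1..n, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(xs)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (xs[i] - xs[j]) * (ys[i] - ys[j])
            score += (product > 0) - (product < 0)
    return score / (n * (n - 1) / 2)

# Invented percentages (hypothetical, not the real survey data).
literal_pct = [5, 10, 20, 30, 45, 60, 75, 90]
postgrad_pct = [40, 28, 18, 12, 7, 4, 2.5, 1.5]

print("Pearson :", round(pearson(literal_pct, postgrad_pct), 2))
print("Spearman:", round(spearman(literal_pct, postgrad_pct), 2))
print("Kendall :", round(kendall_tau(literal_pct, postgrad_pct), 2))
```

Because the invented relationship is strictly decreasing, both rank-based measures come out at exactly -1, while the linear correlation is weaker simply because the points don't lie on a line.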

However, I think the main issue is simply a better display, so let's return to the graph of the suggested transformation.

Transformed:


I did this one in a bit of a hurry, so it's a little rough, but it gives the idea. In order that we can see the relationship more clearly I have omitted the labels for the individual points, but they could be included. Notice that the "Catholic" point (the one to the right of the "10%" tickmark on the vertical axis) is clearly not an unusual point - it fits the pattern nicely.

Note that since the logit transformation is monotonic, this transformation doesn't alter the nonparametric correlations at all. So, for example, the Spearman measure of monotonic association is still -0.92.
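The reason is that rank-based measures only look at the ordering of the values, and a strictly increasing transformation leaves the ordering alone. A small check, using invented proportions and a helper name of my own:

```python
import math

def logit(p):
    """Log-odds of a proportion p, defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def rank_order(values):
    """Indices of the values, listed from smallest value to largest."""
    return sorted(range(len(values)), key=lambda i: values[i])

# Invented proportions (hypothetical, not the real survey data).
postgrad = [0.40, 0.015, 0.18, 0.07, 0.28, 0.04, 0.12, 0.025]
transformed = [logit(p) for p in postgrad]

# Same ordering before and after, so any statistic built purely from
# ranks (Spearman's rho, Kendall's tau) cannot change.
print(rank_order(postgrad) == rank_order(transformed))
```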

The linear correlation on this transformed scale is, however, changed from what it was before, because the relationship is now almost linear (it's now -0.90). I still don't advocate drawing a line on the plot, however (though a line is now a pretty good description); if a relationship must be drawn in, any of a number of standard nonparametric smoothers could be used. I think we can see the relationship just fine on the second plot.

As to whether the relationship means much of anything, that's another issue, but at least we can now clearly see it, without some distracting straight line (and equation, and r-squared value) on the graph, mis-relating the raw percentages.

I think both graphs provide valuable information - ideally, I'd be tempted to display both, side-by-side, and since the table of data is small, to give that as well.

Update:
Here's a plot with an added smooth, done on the original raw percentage scale. This one was generated by an old version of SPlus, but R can also generate stuff like this (as can numerous other packages). The smooth here is just the default spline smooth, though the supersmoother was about as good, and the loess smoother would probably work fine if I tweaked its parameters a bit (the default is too local).



This smooth (like the others I mentioned) is done on the original scale, so it doesn't recognize the inherent restrictions I discuss above; I think a better way to smooth would be to transform the data (like the second plot above), smooth that and then, if desired, take that smooth back to the original scale. Of course, it's no longer estimating a mean, but that's not such a huge deal - we're just trying to describe a relationship.
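A minimal sketch of the transform, smooth, back-transform idea follows. It is not the spline smooth used for the plots: I've swapped in a simple Gaussian kernel smoother so that it runs with nothing but the standard library, and the data, the bandwidth and the function names are all invented for illustration.

```python
import math

def logit(p):
    """Log-odds of a proportion p, defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inv_logit(z):
    """Inverse of the logit: maps any real number back into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def kernel_smooth(xs, ys, x0, bandwidth):
    """Gaussian-kernel weighted average of ys, evaluated at x0."""
    weights = [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)

# Invented proportions (hypothetical, not the real survey data).
literal = [0.05, 0.10, 0.20, 0.30, 0.45, 0.60, 0.75, 0.90]
postgrad = [0.40, 0.28, 0.18, 0.12, 0.07, 0.04, 0.025, 0.015]

# 1. Transform both variables to the logit scale.
zx = [logit(p) for p in literal]
zy = [logit(p) for p in postgrad]

# 2. Smooth on the logit-logit scale, 3. bring the smooth back.
grid = [0.05 + 0.05 * i for i in range(18)]  # roughly 0.05 up to 0.90
curve = [inv_logit(kernel_smooth(zx, zy, logit(g), bandwidth=1.0))
         for g in grid]

for g, c in zip(grid, curve):
    print(f"literalist {100 * g:5.1f}%  ->  smoothed postgrad {100 * c:5.1f}%")
```

Whatever smoother is used in the middle step, the back-transformed curve can never leave the 0-100% range, because the inverse logit only produces values strictly between 0 and 1.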

Further Update:
Here's a graph of what happens when you smooth on the transformed scale and transform back to the original (percentage) scale. The blue curve is the smooth curve shown above, while the more strongly bent green curve is the smooth done on the logit-logit scale and then brought back. On the transformed scale the default spline smooth was somewhat curved (though much less so than a spline smooth on the original scale), and of course, when we come back, it's definitely curved on the percentage scale.



We shouldn't extrapolate any of these relationships outside the range of the data, but at least within the range of the data, the smooth curves above are not implausible descriptions of how the variables are related to one another.

Also fixed first two links.

14 comments:

Razib said...

thanks! really good critique; i thought about transforming, but i was just more interested in the general trend.....

Efrique said...

Hi razib,

thanks for coming over to read it.

Yes, I completely agree that what we're interested in here is the general trend. I'm just not convinced that a straight line fit is a useful way to describe the general trend.

If you're interested in the fact that it's generally decreasing, an unadorned plot, with the comment "the proportion with postgrad degrees generally decreases as the percentage of literalists increases" should cover it. On the other hand if you want to mark something on the plot, I think a standard smooth would help.

There's good free software that does this stuff (fits smooth curves) automatically (though I just used Excel for generating my plot, since I wasn't fitting curves).

I'd suggest R (http://cran.r-project.org/); it requires a bit of effort up front to learn it, but if you're doing a lot of analysis or graphs, it's handy. (There are some other good choices.)

When time permits I may put up a graph with a smooth plotted on it.

Razib said...

yeah, i've used R before...but you know, never even think of using it for my blog related stuff. seems more "work" related ;-) weird how psychology is like that.....

Efrique said...

Well, I put up an example smooth - just a default one. I may do something a bit fancier later if I get time.

Dana Hunter said...

I can haz math genius?

That's really an amazing graph - I may have to snag it for a future post, iffen you don't mind!

Douglas Knight said...

Your attack on percentage regressions seems to be mainly that its implicit error model is incoherent. That doesn't seem like such a strong argument to me. Can you make a positive argument about assumptions implicit in the logit model?

Do you always go to the logit representation first? How do you choose?

Pluralist (Adrian Worsfold) said...

You are right, but Riaz was doing a statistical relationship based on the data available - one line. Yours is that marginal relationship along the way just as in supply and demand diagrams.

Efrique said...

douglas:

I disagree that it's simply an argument about the characterization of the error.

It's a more fundamental disagreement that the model for the underlying relationship (say the model for the underlying mean) is itself incoherent.

A model relating proportions where those proportions approach the boundaries* simply cannot be linear, because it implies that the mean itself goes outside the allowed bounds for the variable.

*(unless it is a pure 45 degree line through diagonally opposite corners of the boundary)

Efrique said...

Dana: steal any of my own graphs any time, unless I specifically indicate otherwise. The first graph isn't mine, of course; I stole it from friendlyatheist who stole it from razib.

Efrique said...

douglas: as for justifying a logit model, my aim wasn't to justify the logit, just to present an example of a scale which allows us to deal sensibly with the boundaries (by rescaling them to +/- infinity).

The log of the odds is probably the most commonly used scale for modelling proportions for a good reason; for example, in many cases the rescaling provided by the log-odds really seems to measure in some sense how much harder it gets to move proportions as they approach the boundaries.

However, since my aim here was to suggest that rescaling, fitting a smooth curve and scaling back was a good way to summarize the data, the actual transformation wasn't particularly critical for that exercise as long as it satisfied a few conditions. The logit just happens to be commonly used and convenient.

Chris Blanchard said...

"Look at the fitted “percentage with postgrad education” for the “Church of God In Christ”. It’s NEGATIVE! That makes no sense at all."

I disagree. A negative percentage means that a small percentage have anti-postgrad educations (or postgrad anti-educations, if you prefer). Which would imply that these people not only lack education, but they actively cancel out the educations of others. And, of course, anyone who has ever spoken to a Biblical literalist for extended periods of time can tell you that, if you pay attention, you can actually feel yourself growing stupider during the conversation. It makes perfect sense! :)

gary said...

Great post! Thank you.
Just one question and one suggestion:

I usually transform proportions to logits when doing linear regressions, but have wondered about what to do when proportions are also the independent variable. While these are also bounded between 0 and 1, it doesn't seem to me to be an issue, as the predictions will never be outside the 0-1 range.
Why is it preferable to do a linear regression of logit-logit, as opposed to a linear regression of proportion-logit?

The suggestion is that when doing this analysis, it is sometimes recommended that cases be weighted by the inverse of the variance of the logit estimator. This gives more weight to proportions based on larger samples and (I think) that are closer to 0% and 100%.

Thanks, again

gary said...

Sorry, I meant closer to the 50% midpoint, as variance increases when approaching 0 and 100%. Or something like that.

Efrique said...

Why is it preferable to do a linear regression of logit-logit, as opposed to a linear regression of proportion-logit?

Simply that one wouldn't generally expect that relationship to be linear a priori, unless the range of x's was bounded well away from the ends. But a smooth-curve fit might work.

As for the weighting, it depends on what scale the variance is constant on, but yes, your point is taken.
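For reference, the weighting gary describes can be sketched as follows, assuming each proportion is a simple binomial estimate from n respondents. Under that assumption the usual large-sample approximation to the variance of the logit of an estimated proportion is 1/(n p (1-p)), so the inverse-variance weight is n p (1-p). The function name and the sample sizes are hypothetical.

```python
def logit_weight(p, n):
    """Approximate inverse variance of logit(p_hat) for a binomial
    proportion p_hat = p estimated from n observations."""
    return n * p * (1.0 - p)

# Hypothetical sample sizes and proportions, for illustration only.
for p, n in [(0.50, 100), (0.05, 100), (0.50, 400), (0.05, 400)]:
    print(f"p = {p:.2f}, n = {n:3d}: weight = {logit_weight(p, n):6.2f}")
```

This agrees with gary's correction: for a given sample size the weight is largest at 50% and shrinks towards 0% and 100%, and it grows in proportion to the sample size.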