Sunday, April 3, 2011

How NOT to regress murder rates on religious belief

This post on reddit's r/atheism ran a linear regression of murder rates on "importance of religion" figures (both sets of data from Wikipedia).

The poster there also looked at IHDI (inequality-adjusted human development index) and its effect on the relationship.

The poster found a weak (and statistically insignificant) relationship between importance of religion and murder, but after adjusting for IHDI the sign changed (though the relationship remained weak).

But much about the analysis, and hence the conclusions, is wrong or suspect.

(I'd normally have replied on reddit, but since this discussion is relatively long for a comment and involves figures, it's better written up elsewhere. Further, since this sort of analysis is the very raison d'être of my benighted blog, it goes here.)

While I usually work in R these days, I'm going to do the calculations for this in a spreadsheet, like the original - so that those looking at the original poster's spreadsheet can follow along.

First, I noticed that the murder rates are highly skewed. Since the relationships are fairly weak, this skewness applies to both the conditional and unconditional distributions of murder rate. This instantly invalidates all the significance testing, so any conclusions about the significance or otherwise of the relationships go out the window.
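As a rough illustration of the skewness point, here is a minimal Python sketch rather than a spreadsheet. The murder rates in it are invented for illustration (they are not the Wikipedia figures), and `sample_skewness` is a hypothetical helper name. It compares the moment-based skewness of the raw values with that of their logs.

```python
import math

def sample_skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (zero for a symmetric sample)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Invented murder rates per 100,000 (illustrative only, NOT real data):
# most values are small and a few are very large.
rates = [0.5, 0.8, 1.0, 1.2, 1.5, 2.0, 2.5, 3.0, 5.0, 8.0, 15.0, 30.0, 60.0]
log_rates = [math.log(r) for r in rates]

skew_raw = sample_skewness(rates)
skew_log = sample_skewness(log_rates)
print(f"skewness of raw rates: {skew_raw:.2f}")
print(f"skewness of log rates: {skew_log:.2f}")
```

On data shaped like this, the log scale is far closer to symmetric than the raw scale, which is the usual motivation for working with log murder rate.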

Second, the relationship with importance of religious belief is not monotonic, let alone linear. Any conclusions about the direction of the relationship are meaningless without taking this into account. (In what follows I am going to look at "religion is unimportant" percentages rather than "religion is important" - they mostly add to 100%, or nearly so. I do this for a particular reason, though the other figures should give similar conclusions.)

Third, some of the "religion is unimportant" figures are for countries where religious belief is compulsory or effectively so. Let's take Indonesia as an example. In Indonesia, you must choose one of a small number of religions. Lack of religious belief is not allowed. So some countries are "jammed up" against the origin, and the extremely high religious belief figures are highly suspect. Seriously, everyone in some countries thinks religion is important? Absolutely everyone? (This is one reason why for most of my analyses these days I use Wikipedia's "irreligion" figures instead, as in my previous post.)

The "jamming up against zero" issue tends to make relationships curve there, so I transformed that variable too. The usual transform with percentages is the logit transform, but those few suspect "0%" figures make that impossible. I could regularize the logit transform, which usually works quite well, but in this case I just took square roots. (In a previous analysis with the type of irreligion figures used here I tried a cube-root transformation, since for low percentages it spreads the figures better; it's more like a logit.) With this analysis, either succeeds fairly well, but I figured the square root would be better understood.
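To make the choice of transform concrete, here is a small Python sketch of the options mentioned above. The offset `eps` in `regularized_logit` is an assumed value chosen for illustration, one common way of regularizing a logit and not necessarily the one I would use, and the proportions are arbitrary example inputs.

```python
import math

def logit(p):
    """Plain logit; undefined at p = 0 and p = 1."""
    return math.log(p / (1.0 - p))

def regularized_logit(p, eps=0.005):
    """Logit with a small offset (eps is an assumed value) so 0 and 1 stay finite."""
    return math.log((p + eps) / (1.0 - p + eps))

# Arbitrary example proportions, including a suspect exact zero.
proportions = [0.0, 0.01, 0.05, 0.20, 0.40, 0.80]

print("     p    logit  reg.logit   sqrt  cube root")
for p in proportions:
    try:
        plain = f"{logit(p):8.2f}"
    except (ValueError, ZeroDivisionError):
        plain = "   undef"  # log(0) or division by zero at the boundaries
    print(f"{p:6.2f} {plain} {regularized_logit(p):10.2f} "
          f"{p ** 0.5:6.3f} {p ** (1.0 / 3.0):10.3f}")
```

The plain logit fails at exactly 0%, while the square root and cube root are defined there. The cube root separates 1% from 5% more than the square root does, which is the sense in which it behaves more like a logit at the low end.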

Since pictures speak much more clearly, let's look at a picture.
I have split the unimportance of religion data into four ranges - first, high figures (in blue - there's a large gap that makes a convenient breakpoint), then medium (teal) and low (green) figures, and finally the 0% figures (red-brown), which I regard as suspect:


(I got the data from Wikipedia again myself and cleaned it a little, as there were some errors in the data that had to be fixed but which shouldn't have affected the original poster's figures.)

We see that the 0% figures are inconsistent with the trend in the low figures, and the low figures show a distinctly different pattern to the higher two groups. The upper two groups are reasonably consistent, however - we could probably use a single straight line to describe both. (But on the untransformed scale for religious unimportance, there is a stronger suggestion of changing slope.)

The log of murder rate is also not monotonic in IHDI, though the change is less spectacular (the relationship between IHDI and "religion is unimportant" percentage is strong and close to linear over a fair portion of the range - but again, not clearly monotonic over the whole range).

All of these issues make the conclusions of the original analysis nonsense.

What can we see? The least religious countries do indeed have a lower murder rate. The question remains as to whether this effect persists after considering IHDI - but here's the final concern, though it's not a statistical issue:

Since IHDI is strongly associated with religious belief, if IHDI is substantively caused by religious belief, IHDI could be mediating the relationship between the other two variables. If religion is causative, it might be "acting through" IHDI to reduce murder rates. So we have to be cautious about concluding it isn't causative if it becomes insignificant after adjusting for IHDI, at least without some rather in-depth analysis (and even then with heavy caveats).
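A small simulation shows why that caution is needed. This is only a sketch under invented assumptions: `x`, `m` and `y` are simulated stand-ins (not the religion, IHDI and murder figures), the coefficients 0.8 and -1.0 are made up, and by construction `x` affects `y` only through `m`.

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    """Population covariance of two equal-length lists."""
    ma, mb = mean(a), mean(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / len(a)

def simple_slope(x, y):
    """Slope of x in the least-squares fit y ~ x."""
    return cov(x, y) / cov(x, x)

def adjusted_slope(x, m, y):
    """Coefficient of x in the least-squares fit y ~ x + m (with intercept)."""
    sxx, smm, sxm = cov(x, x), cov(m, m), cov(x, m)
    sxy, smy = cov(x, y), cov(m, y)
    return (smm * sxy - sxm * smy) / (sxx * smm - sxm ** 2)

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 2000
x = [rng.gauss(0, 1) for _ in range(n)]            # hypothetical cause
m = [0.8 * xi + rng.gauss(0, 0.6) for xi in x]     # mediator, driven by x
y = [-1.0 * mi + rng.gauss(0, 0.5) for mi in m]    # outcome, driven only by m

total = simple_slope(x, y)        # expected near 0.8 * -1.0 = -0.8, up to noise
direct = adjusted_slope(x, m, y)  # expected near 0, up to noise
print(f"slope of y on x alone:       {total:.3f}")
print(f"slope of y on x, given m:    {direct:.3f}")
```

Here `x` is fully causal for `y`, yet its coefficient all but vanishes once the mediator is in the model. A coefficient that disappears after adjustment is therefore not, on its own, evidence against causation.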

I plan to do a more in-depth analysis of these figures in R at some point, which will take account of the nonlinearity properly, via additive models.


snowman250 said...

1. Why do you say that religion is strongly associated with IHDI? IHDI is based on life expectancy, education per capita and GNI per capita.

2. The original Gallup poll question was not whether religion is important, but whether religion is important in your life.

3. Why did you split the importance of the religion into four ranges?

The rest of your method as well makes it sound like you are "model shopping" for a model that will give you the conclusion you want.

Efrique said...

Why do you say that religion is strongly associated with IHDI? IHDI is based on life expectancy, education per capita and GNI per capita.

In the sense that there's a strong relationship between the two variables. I didn't show the plot for it, but those two variables are more strongly related than either is with murder rate.

I may put an addendum to the original post if I get time later today.

The original Gallup poll question was not whether religion is important, but whether religion is important in your life.

Yes, I know. More correctly, the numbers being analyzed in the original post were "the proportion of respondents answering 'Yes, important' to a question like 'Is religion an important part of your daily life?'". I cannot reasonably use the entire text of the question when I refer to the response, so I necessarily use an abbreviated description. The actual information can be found via the link to your original post, which links to the Wikipedia page, which gives the question at the bottom and links to the Gallup poll, and so allows anyone to find the details of the question.

I note that you yourself called it "importance of religion by country" in your original post, and Wikipedia did the same thing. I followed your lead and Wikipedia's in using "Importance of religion" in place of "The proportion of people answering 'yes' to 'is religion an important part of your daily life?'."

Why did you split the importance of the religion into four ranges?

In order to answer the question "is the relationship linear, as the OP assumed?".

My preferred approach would be to use a loess smooth or some other nonparametric smoother, but I wanted a procedure that you could easily carry out in a spreadsheet.

One way to do that is to show that the slope is different in different parts of the range. I originally decided to split it into three parts. I chose the right-side split at a natural gap in the x-values (near 40% on the original variable, which was about 0.6 on the transformed scale), and then chose another at a round number to the left of that, which I arbitrarily picked at 0.4, yielding "small", "medium" and "large". If I had been choosing by reference to the data, I'd have put it at 0.3, not 0.4, since that's where close eyeballing would have put the turning point. I subsequently split off the zero points when I realized that I had already argued they wouldn't necessarily fit the pattern of the others (indeed, the argument could apply to all the very low percentages in my analysis, so arguably the split should go higher than immediately to the right of zero).

There's a need to strike a balance between finding where the slope changes (lots of breaks) and having enough data to estimate it (few breaks).
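The procedure can be sketched in a few lines of Python as well as in a spreadsheet. The breakpoints 0.4 and 0.6 on the square-root scale are the ones described above; everything else is assumed for illustration. The `points` are invented (x is a square-root "unimportant" proportion, y a log murder rate), the helper names are hypothetical, and which side of each breakpoint a boundary value falls on is an arbitrary choice here.

```python
from collections import defaultdict

def range_label(x):
    """Assign a transformed x-value to one of the four groups."""
    if x == 0.0:
        return "zero"    # the suspect 0% figures, kept apart
    if x < 0.4:
        return "low"
    if x < 0.6:
        return "medium"
    return "high"

def ols_slope(points):
    """Least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    return sxy / sxx

# Invented (x, y) pairs for illustration only, NOT the real data.
points = [
    (0.0, 1.0), (0.0, 2.2), (0.0, 0.4),
    (0.10, 1.0), (0.15, 1.4), (0.20, 1.5), (0.30, 2.0), (0.35, 2.1),
    (0.40, 1.9), (0.45, 1.6), (0.50, 1.5), (0.55, 1.1),
    (0.65, 0.8), (0.70, 0.6), (0.80, 0.3), (0.85, 0.0),
]

groups = defaultdict(list)
for x, y in points:
    groups[range_label(x)].append((x, y))

# The zero group has no spread in x, so only the other three get a slope.
slopes = {}
for label in ("low", "medium", "high"):
    slopes[label] = ols_slope(groups[label])
    print(f"{label:>6}: n = {len(groups[label])}, slope = {slopes[label]:.2f}")
```

If the relationship were linear, the three slopes would agree to within noise. A sign change between groups, as in this made-up example, is what a non-monotonic relationship looks like under this check.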

As I said, normally I'd use a loess smooth or a spline fit, but I did it in a spreadsheet rather than in R in order that what I did was easy to replicate. I have since been working with it in R; I thought I'd see if there was an additive model and was trying to choose between using MARS, GAM or ACE to do the analysis. I am currently dealing with a bug in ace() that seems to have crashed my computer, and I have been trying to sort out whether it's just that I haven't updated ace() recently or whether there's a hardware problem with my computer. If I can't resolve that I may move to GAM next.
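For readers unfamiliar with loess, the idea behind it can be sketched in plain Python. This is a simplified stand-in (local linear fits with tricube weights over the nearest neighbours), not R's loess() and without its robustness iterations. The function name, the `span` default and the example data are all assumptions made for illustration.

```python
def local_linear_smooth(xs, ys, span=0.5):
    """Fit a tricube-weighted straight line around each x and return the fits."""
    n = len(xs)
    k = max(3, int(round(span * n)))  # number of neighbours per local fit
    fitted = []
    for x0 in xs:
        dists = sorted(abs(x - x0) for x in xs)
        h = dists[k - 1] or 1e-12  # bandwidth: distance to the k-th nearest point
        w = [(1.0 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, xs)) / sw
        my = sum(wi * y for wi, y in zip(w, ys)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0  # fall back to a local mean
        fitted.append(my + slope * (x0 - mx))
    return fitted

# Invented example: a hump-shaped curve plus a small alternating wiggle.
xs = [i / 20 for i in range(21)]
ys = [-(x - 0.4) ** 2 + 0.01 * (-1) ** i for i, x in enumerate(xs)]

smooth = local_linear_smooth(xs, ys)
x_peak = xs[smooth.index(max(smooth))]
print(f"smoothed curve peaks near x = {x_peak:.2f}")
```

Unlike the four-range split, a smoother of this kind needs no breakpoints: the turning point shows up wherever the data put it, at the cost of having to choose a span.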

The rest of your method as well makes it sound like you are "model shopping" for a model that will give you the conclusion you want.

Could you be more explicit about this? I can't see what you're saying is model shopping. To some extent, checking whether the assumptions hold, finding that they don't, and then trying to address that can sometimes look that way.