Wednesday, July 23, 2008

The parable of the histogram

I must be some kind of heretic. I'm a statistician, and here I am pointing out the problems in yet another common statistical tool.

We'll see how the histogram, which is a very popular way of displaying the distributional shape of a set of data, must be viewed with a good deal of caution.

Even though histograms are often found in the media, the problems with histograms are almost unknown among the general public. Indeed, most places that teach statistics at university completely fail to mention them.

I'd like to say that the problems are well known among professional statisticians, but that might be too strong. Certainly problems have been pointed out in the literature, and many statisticians are aware of the problems, but it seems many still are not, and the appropriate cautions are not always explained.

I'm going to show you a simple example.

Here's some data (40 observations in this sample), which I'm going to draw a histogram of. I have rounded the numbers off to two decimal places.
 
3.15 2.28 2.06 3.43 4.85 3.22 4.01 4.43 
5.46 3.12 5.53 5.51 5.56 5.52 5.31 4.96 
3.28 4.10 5.19 2.54 1.89 1.84 2.56 1.90 
4.20 3.42 2.39 3.64 4.84 4.31 5.11 5.60 
1.98 3.91 1.88 4.33 5.74 2.01 2.58 1.92

I give the numbers so you can (if you are so inclined) confirm for yourself what I will tell you in my little parable.
(Edit added Feb 2012: I noticed that the results didn't quite reproduce in R - three observations in the original data set I gave occurred exactly on bin boundaries for some situations. This was either a problem caused by rounding, or possibly by different conventions of different packages for handling observations at bin boundaries;  I have accordingly altered those three observations by tiny amounts to move them off boundaries and avoid the issue, whatever its source. There is R code at the end of the post that works.)

The parable

This data set was given to a student, Annie. She constructs her histogram of the data by counting the number of values from 0 up to (but not including) 1, from 1 up to (but not including) 2, and so on, and then drawing a series of boxes, each with a base covering the interval its count came from and a height equal to that count. Annie's histogram is shown in the top-left of the picture below.
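Annie's counting procedure is small enough to sketch in a few lines. (The sketch is in Python purely for illustration; the post's own code, at the end, is R.)

```python
# Annie's procedure: count the values falling in [k, k+1) for k = 1..5.
data = [3.15, 2.28, 2.06, 3.43, 4.85, 3.22, 4.01, 4.43,
        5.46, 3.12, 5.53, 5.51, 5.56, 5.52, 5.31, 4.96,
        3.28, 4.10, 5.19, 2.54, 1.89, 1.84, 2.56, 1.90,
        4.20, 3.42, 2.39, 3.64, 4.84, 4.31, 5.11, 5.60,
        1.98, 3.91, 1.88, 4.33, 5.74, 2.01, 2.58, 1.92]

counts = [sum(1 for x in data if k <= x < k + 1) for k in range(1, 6)]
print(counts)  # [6, 7, 8, 9, 10] - steadily rising counts, i.e. skewed to the left
```

The box heights are just these counts, so the rising staircase 6, 7, 8, 9, 10 is exactly the left-skewed shape Annie sees.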

She obtains a histogram whose shape corresponds to a distribution that is skewed to the left. See, for example, this description of using histograms to assess distributional shape (edit: broken link replaced with an alternative) - that's pretty much the way many elementary books on statistics describe how to assess the shape of a distribution (and usually it will give you the right sort of impression).

Note that I could remove the scale and I could still describe the shape - I don't need to know the numbers on the scale in order to arrive at my description.

Three of Annie's friends, Brian, Chris and Zoe (Hah! Psych!) also get data sets with 40 observations, and they all do exactly as Annie did. Their histograms are given below (Annie's data is V1, Brian's is V2 and so on).

[Figure: a 2-by-2 grid of the four histograms - Annie (V1) top left, then Brian (V2), Chris (V3) and Zoe (V4).]

Correspondingly, Brian describes his distribution as symmetric (and he might add "uniform"). Chris describes his as skewed to the right. Zoe describes hers as symmetric and bimodal (it has two main peaks).

So far so good - this is exactly how the books tell you it all works.

So while they're comparing their histograms, Annie idly starts looking at Brian's actual numbers. She realizes something odd is going on. She quickly places all their data sets side-by-side.

"Look, Chris!" Annie says, "all Brian's values are smaller than mine by 0.25. All yours are a quarter smaller than Brian's, and Zoe's are a quarter smaller than yours!"

They all confirm that she is correct - each set of values is the same, but with its origin merely shifted a little. Their data sets are identical in shape, but the resulting histograms are not.

That is to say, assessment of distributional shape in histograms can be dramatically affected by choice of scale (specifically, by the choice of the origin and width of the histogram bins). Here ends the parable.
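You can confirm the whole parable by recomputing the bin counts for each shifted copy of the data, keeping the same bins throughout (again a Python sketch for illustration; the R that draws the actual histograms is at the end of the post):

```python
# Same data as given above, same [k, k+1) bins; only the origin of the data moves.
data = [3.15, 2.28, 2.06, 3.43, 4.85, 3.22, 4.01, 4.43,
        5.46, 3.12, 5.53, 5.51, 5.56, 5.52, 5.31, 4.96,
        3.28, 4.10, 5.19, 2.54, 1.89, 1.84, 2.56, 1.90,
        4.20, 3.42, 2.39, 3.64, 4.84, 4.31, 5.11, 5.60,
        1.98, 3.91, 1.88, 4.33, 5.74, 2.01, 2.58, 1.92]

def bin_counts(xs):
    """Counts of xs in the bins [1,2), [2,3), ..., [5,6)."""
    return [sum(1 for x in xs if k <= x < k + 1) for k in range(1, 6)]

for name, shift in [("Annie", 0.0), ("Brian", 0.25),
                    ("Chris", 0.5), ("Zoe", 0.75)]:
    print(name, bin_counts([x - shift for x in data]))
# Annie [6, 7, 8, 9, 10]   - skewed left
# Brian [8, 8, 8, 8, 8]    - uniform
# Chris [10, 9, 8, 7, 6]   - skewed right
# Zoe   [13, 7, 7, 13, 0]  - bimodal (the last bin is empty)
```

Four quarter-unit shifts of identical data, four completely different "shapes".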

It usually isn't this dramatic, of course, but the fact is, if one can generate a seemingly innocuous set of numbers whose histogram will look completely different (and for which many people will assert the distributional shape is completely different) every time we merely add or subtract a quarter, it can happen with real data too. And it does happen. Mostly the difference in impression is more modest... but not always.

So if you see a histogram, just keep in the back of your mind that it's perfectly possible that a different choice of bin boundaries would yield a somewhat different impression of the data.

Imagine I want to show some students that I write "easy" tests (I don't know why this should be such an object of fascination for students, since they all do the same test and marks are generally scaled, but it is). In preparation, I draw a histogram and it turns out to look like Chris's - it looks like most students score below the middle of the range of marks. But lo, I discover with a bit of fiddling around that if I make my bin centres where the edges were (and so on), the completely opposite impression is given - just like Annie's histogram. Yay, "easy test" ... and many fewer worried queries from students in the run-up to the test, because they tend to feel there's a good chance of scoring "above the middle".

Did I lie? No. Did I fudge the data? Well, no. I did something, though. Or rather, I didn't do something.

This is a sin of omission. I fail to explain what the data would have looked like given a different choice of bin location.

Clearly, when circumstances are right, the ability to choose the location and width of the bins can give us the opportunity to somewhat alter the impression given by a histogram. Without fudging the numbers themselves, we can sometimes fudge the impression they give.

What do statisticians do? Well, there are other ways to look at distributional shape. Kernel density estimates are popular, and they completely get rid of the "bin-location" issue, though there's still the equivalent of a "bin-width" issue (choice of bandwidth, also called the "window"), which is often dealt with by looking at more than one choice of width (usually a width that gives a nice smooth result and then one that is smaller, giving a "rougher" result, in order that we can see there's nothing unusual hiding away - like the blue and green curves in the graph at top left right** at the wikipedia link a few lines up). But there are a variety of other tools that might be used (which I don't plan on going into here).

**(did I ever mention that I have trouble with correctly attributing the words "left" and "right"? - well as you see, sometimes I do. But not when describing the shape of a distribution, isn't that odd?)


What can you do? Well, assuming you don't have anything more sophisticated than a basic histogram tool, at the least (with continuous data, anyway), try shifting your bin starts forward or back by a fraction of a bin-width (if you're lazy, maybe try something near a half; otherwise try a couple of values). Also try a narrower bin width. If a few different histograms all give the same general impression, it doesn't matter much which one you use. And if they don't give the same impression, you'd better either say so, show more than one, or find some other way to convey the information.

[Or you can do a kernel density estimate readily enough - many packages (including some free ones) will do them; there are pages online that can draw them if you just paste in some data. Implementing a kernel density estimate of your own is fairly straightforward - you can compute one in a spreadsheet easily enough - if anything, it's probably slightly simpler to compute one than it is to compute counts for a histogram, which is in itself pretty straightforward. ]
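To give an idea of just how small that computation is, here is a bare-bones Gaussian kernel density estimate (a Python sketch, for illustration only; in practice you'd use your package's built-in routine, such as R's density):

```python
import math

def kde(xs, grid, bw):
    """Gaussian kernel density estimate of the sample xs,
    evaluated at each point of grid, with bandwidth bw."""
    n = len(xs)
    const = n * bw * math.sqrt(2 * math.pi)
    return [sum(math.exp(-0.5 * ((g - x) / bw) ** 2) for x in xs) / const
            for g in grid]

# Unlike a histogram, there is no bin origin to choose: subtracting
# 0.25 from the data just slides the whole curve along by 0.25,
# leaving its shape exactly the same.
```

Each data point contributes a little Gaussian bump centred at itself; the estimate is the average of the bumps, so shifting all the data merely shifts the curve.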


Caveat Emptor
___

Added in edit in Feb 2012: 

Here is some R code to create the data:

histdata <- c(3.15,5.46,3.28,4.2,1.98,2.28,3.12,4.1,3.42,3.91,2.06,5.53
,5.19,2.39,1.88,3.43,5.51,2.54,3.64,4.33,4.85,5.56,1.89,4.84,5.74,3.22
,5.52,1.84,4.31,2.01,4.01,5.31,2.56,5.11,2.58,4.43,4.96,1.9,5.6,1.92)

Here is some R code to generate the histograms:

opar<-par()
par(mfrow=c(2,2))
hist(histdata,breaks=1:6,main="Annie",xlab="V1",col="lightblue")
hist(histdata-0.25,breaks=1:6,main="Brian",xlab="V2",col="lightblue")
hist(histdata-0.5,breaks=1:6,main="Chris",xlab="V3",col="lightblue")
hist(histdata-0.75,breaks=1:6,main="Zoe",xlab="V4",col="lightblue")
par(opar)

Here is some R code to generate some density estimates:

opar<-par()
par(mfrow=c(2,2))
plot(density(histdata,bw=.2),main="Annie")
plot(density(histdata-.25,bw=.2),main="Brian")
plot(density(histdata-.5,bw=.2),main="Chris")
plot(density(histdata-.75,bw=.2),main="Zoe")
par(opar)

Here is some R code to generate some other informative displays:
First - the sample cumulative distribution function
plot(ecdf(histdata))  

Second, a stripchart that shows the positions of the individual observations as they move back.
x <- c(histdata, histdata-.25, histdata-.5, histdata-.75)
g <- rep(1:4, each=40)
stripchart(x ~ g, pch="|")
abline(v=(2:5), col=6, lty=3)


end edit

7 comments:

Anonymous said...

Very interesting. I never would have guessed this!

Efrique said...

Thanks for the comment

Indeed - neither did I until someone mentioned the fact to me. (I already knew from trying it that they could look a little different, but I didn't realize how different.)

The vast majority of people who construct histograms would never guess that it could happen. Which is why I explain it.

Unknown said...

Hello Efrique,

I appreciate how you substantiated your arguments with examples and data. Our resident expert on stats is out of town but I believe he has some type of response for you. Good post, good read.

Quan

Anonymous said...

efrique,

Great example, it inspires a follow-on investigation of how MATLAB handles such risks (I'll send you the link when I post it).

Here http://www.blinkdagger.com/matlab/descriptive-statistics-parte-deux#comment-2670 is a reply to your previous comment. Sorry for the length but there was a lot of good material you brought up.

Best,
Rob

Unknown said...

Hi, just stumbled across your post. I think it's a good example of how to fool yourself (and others) with statistics.

I kind of miss the one, very basic, solution to the problem though. Essentially, it all boils down to the point of how to choose bins. I think the answer here is rather obvious. Instead of a priori choosing the bin size, simply define it as max(x) - min(x) and you're done.

And you don't have to resort to complex issues like kernel functions or, worst of all, start fiddling with your parameters without rhyme and reason.

And concerning people who use histogram tools: They won't have to bother about this at all, since any program computing histograms does determine bin width according to the data.

Unknown said...

I meant of course, the bin size should be defined as (max(x) - min(x)) / nBins.

AprilS said...

Wow! This is so interesting! I just finished watching a video on histograms and posted it to my blog. However, this possibility wasn't even mentioned and I would never have thought the same group of numbers would turn up so many different graphs!

At least I learned all about histograms today!

http://blog.thinkwell.com/2010/08/7th-grade-math-bar-graphs-and-histograms.html