10 June 2011

The Right Way to Do a Histogram

A histogram is not a bar chart with vertical bars.


In a bar chart, the bar provides the graphical representation of a particular value. So the bar might be marked '1995' and the length of the bar would be the GDP for 1995.

In a histogram, the bar represents the number of samples that fall between two values. The important word is 'between'. In a vertical bar chart, each axis label is the value associated with one bar, so it sits under that bar; in a histogram, each bar is about what's happening between two values, so labeling each bar with one value is at best ambiguous and usually confusing.

If the value is under the bar, which value is it? Is it the low value? The high value? The average?

None of these. The labels should sit on the lines between the bars so that the low and high are obvious. Then the only thing left to interpretation is which of the two bounds is 'or equal'. Because we're going to include a cumulative probability line that expresses the probability of 'this value or less', it makes sense for the bar to express the same relationship, so the high side is 'or equal'.
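To make the 'or equal on the high side' rule concrete, here is a minimal Python sketch of the counting. The function name `count_right_closed` and the sample numbers are made up for illustration; they don't come from any particular charting library.

```python
from bisect import bisect_left


def count_right_closed(samples, edges):
    """Count samples per bar, where bar i covers edges[i] < x <= edges[i + 1].

    `edges` must be sorted ascending. Samples at or below the first edge,
    or above the last edge, are ignored in this sketch.
    """
    counts = [0] * (len(edges) - 1)
    for x in samples:
        k = bisect_left(edges, x)  # index of the first edge that is >= x
        if 1 <= k <= len(edges) - 1:
            counts[k - 1] += 1
    return counts


# Illustrative data only: 2.2 lands in the first bar, 2.4 in the second.
edges = [2.0, 2.2, 2.4, 2.6]
samples = [2.2, 2.3, 2.4, 2.5]
print(count_right_closed(samples, edges))  # [1, 2, 1]
```

A sample that falls exactly on a boundary goes to the bar on its left, which is the same 'this value or less' reading the cumulative line uses.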

A picture demonstrates the principle:

[Figure: Good Histogram]

There are a number of things about this presentation that make it ideal:

  1. The x-axis markings are on the lines between the bars. We can see near the center of the bars that the value has about 5% probability of being between 2.6 and 2.8.
  2. In the interests of legibility, we haven't tried to mark every grid line.
  3. We're usually not so much interested in the probability that the project will cost $3 million as we are that it will cost at most $3 million. That's what the cumulative probability plot is about. We can see that there's an 80% probability that the project will cost $3.4 million or less.
  4. The axis labels are in the middle of the chart. This puts them closer to interesting points on the chart.
  5. It puts the median value, the even-money bet--usually the most interesting value--near the center of the crosshairs formed by the two sets of axis labels. We can see immediately that the odds are the same that the project will cost more than or less than $2.8 million.
  6. It's pretty to look at.
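The cumulative line in points 3 and 5 is just a running total of the bars divided by the number of samples. A rough sketch, assuming the bars are already counted; the function names, edges, and counts here are hypothetical and are not read off the chart:

```python
from itertools import accumulate


def cumulative_probabilities(counts):
    """Probability of 'this bar's upper edge or less', one value per bar."""
    total = sum(counts)
    return [running / total for running in accumulate(counts)]


def edge_at_probability(edges, counts, p):
    """Smallest upper edge whose cumulative probability reaches p."""
    for upper, cum in zip(edges[1:], cumulative_probabilities(counts)):
        if cum >= p:
            return upper
    return edges[-1]


# Made-up example: five bars between six edges.
edges = [10, 20, 30, 40, 50, 60]
counts = [1, 4, 5, 6, 4]
print(cumulative_probabilities(counts))          # [0.05, 0.25, 0.5, 0.8, 1.0]
print(edge_at_probability(edges, counts, 0.5))   # 40, the even-money bet
print(edge_at_probability(edges, counts, 0.8))   # 50
```

Because the answer is always an upper edge, it is only as precise as the bar width. Interpolating inside a bar would be a further assumption about how samples are spread within it.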

It takes a little extra effort to do one of these but it's worth it. A couple of notes:

First, there's one more axis label than there are bars. Histogram widgets usually expect the same number of each.
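One way around the mismatch, sketched in Python: compute the tick positions yourself and hand them to the widget. This assumes a hypothetical widget that draws bar i centred at x = i with width 1 and accepts ticks at arbitrary positions; `boundary_ticks` and `label_every` are names invented for this example.

```python
def boundary_ticks(edges, label_every=1):
    """Return (position, label) pairs that sit on the lines between bars.

    Assumes bar i is centred at x = i with width 1, so the line to the
    left of bar i is at x = i - 0.5. There is one more edge than bars.
    `label_every=2` labels every second line to keep the axis legible.
    """
    ticks = []
    for i, edge in enumerate(edges):
        if i % label_every == 0:
            ticks.append((i - 0.5, f"{edge:g}"))
    return ticks


# Three bars need four boundary labels.
print(boundary_ticks([2.0, 2.2, 2.4, 2.6]))
# [(-0.5, '2'), (0.5, '2.2'), (1.5, '2.4'), (2.5, '2.6')]
```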

Second, it's a lot plainer if you make sure the axis label range goes from smaller than the minimum to bigger than the maximum. You can do this in concert with choosing nicely rounded interval values so you aren't marking axes with 10-digit fractions.
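Here is one possible way to pick such edges in Python. It uses the common 1-2-5 rounding trick for the interval width, which is a choice made for this sketch rather than anything the chart above depends on; `nice_edges` and `target_bins` are made-up names.

```python
import math


def nice_edges(data_min, data_max, target_bins=10):
    """Rounded bar edges that start below data_min and end above data_max.

    The interval width is the smallest of 1, 2, 5, or 10 times a power of
    ten that gives roughly `target_bins` bars. Edges are rounded to 10
    decimal places to hide floating-point noise such as 2.2000000000000002.
    """
    if not data_max > data_min:
        raise ValueError("data_max must be greater than data_min")
    raw = (data_max - data_min) / target_bins
    exponent = math.floor(math.log10(raw))
    for mantissa in (1, 2, 5, 10):
        step = mantissa * 10.0 ** exponent
        if step >= raw:
            break
    lo = math.floor(data_min / step)
    hi = math.ceil(data_max / step)
    if lo * step >= data_min:  # keep the first edge strictly below the minimum
        lo -= 1
    if hi * step <= data_max:  # keep the last edge strictly above the maximum
        hi += 1
    return [round(i * step, 10) for i in range(lo, hi + 1)]


# Illustrative range only.
edges = nice_edges(2.13, 3.87)
print(edges[0], edges[1], edges[-1], len(edges))  # 2.0 2.2 4.0 11
```

The bar count can come out a little above or below `target_bins`, since the rounded width and the padded ends take priority.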

3 comments:

  1. Nice post, Marc. We recently added a cumulative probability view to our project and portfolio value uncertainty histograms.

    We still use the traditional "on-the-axis" labels instead of the ones that you have illustrated, but I like your idea. What we do in addition is add lines at the less-than-10%, median, and less-than-90% probabilities. This allows managers to understand that there is an 80% probability that the actual value is between the 10% and 90% lines. We have found that explaining it this way makes it easier to understand, and it puts "long tails" into their proper perspective.

    You can see a description and example for a project histogram here:

    http://www.help.datamachines.com/content/Prioritization_Dist_F.php

  2. Thanks for the feedback, George. I took a look at your pictures. I find them cluttered, but definitely a step or two in the useful direction.

    I'm not big on confidence intervals, mainly because they homogenize big risk and big opportunity and put them both outside the interval as if they were equivalent. When you're making a commitment or setting a deadline, which you ultimately must, there's only one point on the curve that's of any interest.

  3. With regard to graphs and charts, I like to think about Edward Tufte's admonition to always try to minimize the quantity of ink you use (real or digital) in visually displaying quantitative information. In his view, if it doesn't add information, it shouldn't be there. For example, he recommends not connecting the intersection of the Y and X axes; i.e., each line should stop at the last labeled tic. In your example, you left them out altogether (good job!).

    I try to keep his advice in mind when I design charts to avoid clutter, but I also like certain conventional things, like connecting the axes. Maybe we'll add the option of putting the tics in the center in the next version.
