30 January 2011

Meaningless Averages -- Why We Use Probability Distributions

"Average" is a simple concept: we measure a bunch of things, add the measurements, and divide by the number of measurements.

Averages are not always that simple.

Simple: We have two baskets full of apples. To get the average number of apples per basket, we count all the apples and divide by two. We can even count the apples in each basket, divide each number by two, and add the results. We'll get the same number for the average because the two operations are arithmetically equivalent.

(100 + 112) / 2 == 100 / 2 + 112 / 2 == 106

Simple but irrelevant: In the game of golf, we can calculate the average number of strokes per hole. We can do the math, the result is meaningful, but nobody wants to know what it is.

Not Simple: We have a golf cart and we drive it exactly one kilometer, measuring the time it takes. Let's say it takes two minutes, averaging 30kph. We do it again, and this time it takes three minutes, averaging 20kph.

Doing the math, (30+20)/2 is 25. So our average speed over the two trials was 25kph. Right?

Wrong. We travelled two kilometers in five minutes; that's 24kph.
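The two calculations can be checked with a few lines of Python. This is only a sketch using the golf cart numbers from above; the variable names are made up for the illustration.

```python
# Two trials from the text: one kilometer each, taking 2 and 3 minutes.
distances_km = [1.0, 1.0]
times_min = [2.0, 3.0]

# Speed of each trial in kph: kilometers times 60, divided by minutes.
speeds_kph = [d * 60.0 / t for d, t in zip(distances_km, times_min)]

# The tempting calculation: add the speeds and divide by two.
naive_mean_kph = sum(speeds_kph) / len(speeds_kph)

# What the cart actually did: total distance over total time.
overall_kph = sum(distances_km) * 60.0 / sum(times_min)

print(speeds_kph)       # [30.0, 20.0]
print(naive_mean_kph)   # 25.0
print(overall_kph)      # 24.0
```

The slower trial lasted longer, so it counts for more of the five minutes than the naive average gives it credit for.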

Being able to do the math doesn't mean the average makes any sense. So, what's going on here?

Let's go back to the baskets of apples. The total number of apples is real and meaningful; we can pick them all up and count them. 212 apples as the sum of 100 apples and 112 apples makes sense.

Contrast this with the speed of the golf cart. The sum of 20kph and 30kph is 50kph, but nothing in this system travelled at 50kph. If we have a basket of apples travelling at 20kph and another travelling at 30kph, they aren't travelling together at 50kph. This is the first sign of trouble: can we add the measures and get something that makes sense in context? Here, we can't.

There is a way to add speeds and get a meaningful result; to get 50kph, we'd need a train going down the track at 30kph, and the golf cart speeding down the aisle in one of the railcars at 20kph. The cart's ground speed of 50kph would be meaningful (for a few seconds), but dividing that by two wouldn't give us anything that makes any sense.

A notable thing here is that we're dealing with a rate. In general, averages of rates are meaningless or wrong, unless the denominators are the same. (This is a necessary but not sufficient constraint; the denominators can be the same and the average might still be meaningless.) In kilometers per hour the denominator is time, and the two trials took different amounts of it: two minutes and three minutes. If we restate the golf cart data in minutes per kilometer, the denominator becomes a constant one kilometer, and we can add the 2 and 3 minutes to get 5 minutes. Dividing by two gives us 2.5 minutes per kilometer which, flipped back, is 24kph.
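Flipping the rate, averaging, and flipping back is what is usually called the harmonic mean. As a sketch, assuming Python 3.6 or later (where the standard library's statistics module provides harmonic_mean), the restated calculation looks like this:

```python
from statistics import harmonic_mean, mean

# Speeds of the two one-kilometer trials, from the text.
speeds_kph = [30.0, 20.0]

# Restate each speed as minutes per kilometer, so the denominator
# is the same one kilometer for both trials.
paces_min_per_km = [60.0 / s for s in speeds_kph]

# Now an ordinary average is meaningful: 2.5 minutes per kilometer.
mean_pace = mean(paces_min_per_km)

# Flip back to kph.
speed_from_pace_kph = 60.0 / mean_pace

# The same thing in one step.
hm_kph = harmonic_mean(speeds_kph)

print(paces_min_per_km, mean_pace, speed_from_pace_kph, hm_kph)
```

This only works because both trials covered the same distance. With unequal distances, the paces would need to be weighted by distance.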

Also, the individual speeds are each averages themselves, and averaging averages is the other thing to avoid.
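If the averages are all we have, the way out is to weight each one by the thing it was averaged over. For the golf cart, each speed is an average over time, so a sketch of the repair (with illustrative names, not anything standard) weights each speed by the minutes it covers:

```python
# Per-trial average speeds and the time each one was averaged over.
speeds_kph = [30.0, 20.0]
times_min = [2.0, 3.0]

# Unweighted average of the averages: treats both trials as equal.
unweighted_kph = sum(speeds_kph) / len(speeds_kph)

# Time-weighted average: each speed counts in proportion to its duration.
weighted_kph = (
    sum(s * t for s, t in zip(speeds_kph, times_min)) / sum(times_min)
)

print(unweighted_kph)  # 25.0
print(weighted_kph)    # 24.0
```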

Not Simple: Over some period of time at some specific spot, we can measure the height of the tides. We can get the average tide by adding up the measurements and dividing by their number. What does that average mean, and how would you use it?

Is it the most likely tide? No--that's the mode, a different number over here.

Is it the even money bet, as likely low as high? No, that's the median, a different number over there.

Is it the "fifty-year tide?" No, that's the 98th percentile, way over there.

What are you going to do with an average? What does it mean?

Hint: This is where you invoke the rule that says, "When you're tempted to take an average, find the distribution instead."
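To see how far apart those numbers can sit, here is a small sketch using Python's statistics module (quantiles needs Python 3.8 or later). The tide heights are invented purely for illustration; they are not real measurements, and the lopsided shape is an assumption.

```python
from statistics import mean, median, mode, quantiles

# Hypothetical tide heights in meters (made-up data, skewed toward
# a few unusually high tides).
heights_m = [1.1, 1.2, 1.2, 1.3, 1.3, 1.3, 1.4, 1.4,
             1.5, 1.6, 1.8, 2.1, 2.9]

average = mean(heights_m)       # the number we were tempted to report
most_likely = mode(heights_m)   # the most common height
even_money = median(heights_m)  # half the tides are lower, half higher

# quantiles(..., n=100) returns the 99 cut points between percentiles;
# index 97 is the 98th percentile. method="inclusive" keeps the result
# within the range of the observed data.
p98 = quantiles(heights_m, n=100, method="inclusive")[97]

print("mode  ", most_likely)
print("median", even_money)
print("mean  ", round(average, 3))
print("p98   ", round(p98, 3))
```

With data shaped like this, the four summaries are four different numbers, and the average is not the answer to any of the questions above.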

Excruciatingly Not Simple: What are "average temperature" and "average rainfall"?

1 comment:

  1. avg of avg... oft hits upon Simpson's Paradox.