29 January 2012

The Shape of a Sample Distribution

When I'm writing about sample distributions, I realize that I throw the 'shape' and 'order' terminology around assuming they're understood in this context. What I've never done is a proper treatment of either one, so here's 'shape'.

The term shape comes from its informal use in statistics, where it refers to the characteristic shape of the graphs used to picture various probability distributions. A distribution’s shape is the relationship between values and their probabilities. The normal, or Gaussian, distribution has its characteristic bell shape, the power law distribution has its ski-hill shape, and so on. Sample distributions have shapes too, but they are whatever the samples make them, so we don't give them names.

There are a bunch of things you can say about a sample distribution's shape:

  1. Each sample has a value and different samples may have the same value.
  2. A sample distribution's shape is determined by the sample values independent of their order in the sample vector.
  3. A distribution's shape is what determines its conventional statistical attributes, like mean and median.
  4. Shape is independent of the number of samples in the distribution.
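Points 2 and 3 are easy to check with a small sketch. The sample values below are made up for illustration, and `shuffled` is just a hypothetical reordering of the same vector; the point is only that sorting throws the order away and the usual statistics don't notice.

```python
import random
import statistics

# Made-up sample values; note that 7.5 appears twice (point 1).
samples = [12.0, 7.5, 9.0, 7.5, 15.0, 10.5]

# The same samples in a different order (fixed seed so the run is repeatable).
shuffled = samples[:]
random.Random(1).shuffle(shuffled)

# Point 2: once sorted, the two vectors are identical, so the shape is the same.
same_shape = sorted(samples) == sorted(shuffled)

# Point 3: mean and median depend only on the values, not on their order.
mean_original = statistics.mean(samples)
mean_shuffled = statistics.mean(shuffled)
median_original = statistics.median(samples)
median_shuffled = statistics.median(shuffled)

print(same_shape, mean_original, mean_shuffled, median_original, median_shuffled)
```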

My favourite is this: If two sample distributions have the same shape, the probability of finding a sample whose value is less than any chosen value will be the same for both distributions – allowing for the resolution afforded by the number of samples. A consequence of this is that their histograms, percentile charts and characteristic curves will all match.
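Here is a minimal sketch of that property. The helper name `prob_less_than` is my own for this example, and building the second distribution by repeating the first is an artificial shortcut: it gives a vector with three times as many samples whose shape is the same by construction.

```python
from bisect import bisect_left

def prob_less_than(samples, x):
    """Fraction of samples whose value is strictly less than x."""
    ordered = sorted(samples)
    # bisect_left returns how many sorted values lie strictly below x.
    return bisect_left(ordered, x) / len(ordered)

a = [3, 1, 4, 1, 5, 9, 2, 6]
b = a * 3  # same values repeated: more samples, same shape

# Compare the two distributions at a range of chosen values.
for x in range(0, 11):
    print(x, prob_less_than(a, x), prob_less_than(b, x))
```

With real data the two columns would agree only to within the resolution the smaller sample count allows; here they agree exactly because `b` was built from `a`.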

The histogram is probably the most familiar of these charts, though people usually get it wrong and Excel makes it hard to get right.

The S-shaped percentile chart, my favourite, plots each sample value against the probability that the actual outcome will be equal to or less than that value. This is good for costs and times. It has a complementary form, plotting the probability of equal to or more than, that's good for things like profit and demand.
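As a rough sketch, the points for both forms of the chart can be computed like this. The function names and the `costs` values are invented for the example, and the probability is taken as a simple count divided by the number of samples, which is one convention among several.

```python
from bisect import bisect_left, bisect_right

def percentile_points(samples):
    """(value, P(actual <= value)) for each distinct sample value."""
    ordered = sorted(samples)
    n = len(ordered)
    # bisect_right counts the samples that are equal to or less than v.
    return [(v, bisect_right(ordered, v) / n) for v in sorted(set(ordered))]

def complementary_points(samples):
    """(value, P(actual >= value)) for each distinct sample value."""
    ordered = sorted(samples)
    n = len(ordered)
    # n minus the count strictly below v is the count equal to or more than v.
    return [(v, (n - bisect_left(ordered, v)) / n) for v in sorted(set(ordered))]

costs = [100, 120, 120, 150]
print(percentile_points(costs))     # [(100, 0.25), (120, 0.75), (150, 1.0)]
print(complementary_points(costs))  # [(100, 1.0), (120, 0.75), (150, 0.25)]
```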

The characteristic curve is just the sample values plotted against their rank order. That's the values, in ascending order, plotted left-to-right.
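In code terms that is little more than a sort. A minimal sketch, with a made-up function name and sample values:

```python
def characteristic_curve(samples):
    """(rank, value) pairs: values in ascending order, ranks starting at 1."""
    return list(enumerate(sorted(samples), start=1))

curve = characteristic_curve([30, 10, 20, 10])
print(curve)  # [(1, 10), (2, 10), (3, 20), (4, 30)]
```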

The characteristic curve is interesting for another reason. Conceptually, you can change the number of samples in a distribution, without losing its shape, by changing the scale of the x-axis to go from 1 to the size you want. The new sample values are the height of the curve at each new rank increment. My Excel/VBA DIST resizing function uses the corresponding algorithm and gets really good results.
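To make the idea concrete, here is a small Python sketch of resizing by reading heights off the characteristic curve. It is not the DIST function itself: the name `resize` is hypothetical, and using straight-line interpolation between neighbouring ranks is my assumption for the example.

```python
def resize(samples, new_size):
    """Return new_size values read off the characteristic curve of samples.

    Assumes linear interpolation between adjacent ranks (an illustrative
    choice, not necessarily what any particular tool does).
    """
    ordered = sorted(samples)
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one sample")
    if new_size < 2:
        raise ValueError("new_size must be at least 2")
    if n == 1:
        return [ordered[0]] * new_size

    resized = []
    for j in range(new_size):
        # Map the new rank onto the old rank axis (0-based positions).
        pos = j * (n - 1) / (new_size - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        # Height of the curve at that position.
        resized.append(ordered[lo] + frac * (ordered[hi] - ordered[lo]))
    return resized

print(resize([10, 20, 30], 5))          # [10.0, 15.0, 20.0, 25.0, 30.0]
print(resize([50, 10, 40, 20, 30], 3))  # [10.0, 30.0, 50.0]
```

The smallest and largest values always survive the resize, and everything in between is stretched or squeezed along the rank axis.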

Have I left anything out?
