08 February 2012

What's a Sample Distribution?

A sample distribution is a list of numbers, in most cases lots of numbers – hundreds or thousands of them. Each item in the list holds a possible value of something whose actual value is unknown – an uncertain variable.

How long will it take to complete this project or task? There isn’t just one answer; there’s a whole bunch of answers – each with its own probability of being right. This is an uncertain variable. A sample distribution establishes a connection between all those answers and their probabilities. A sample distribution is a simple and versatile form of probability distribution.

The essential feature of a conventional, parametric probability distribution is a formula. The formula is derived from expert analysis and curve-fitting of a suitably large number of observations of an uncertain variable. Coupled with a pseudo-random number generator, it produces equally probable samples that are intended to be characteristic of the uncertainty in the real-world variable being modeled.
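As a rough illustration of the "formula plus pseudo-random number generator" approach, here is a minimal Python sketch. The triangular distribution, its parameters (2, 10 and 4 days) and the function name draw_parametric_samples are assumptions chosen for the example, not something taken from a real project.

```python
import random

def draw_parametric_samples(n, low, high, mode, seed=0):
    """Draw n equally probable samples from a triangular distribution.

    The triangular shape stands in for whatever formula an analyst
    might have fitted; it is an assumption made for this sketch.
    """
    rng = random.Random(seed)  # seeded so the run is repeatable
    return [rng.triangular(low, high, mode) for _ in range(n)]

# Hypothetical task duration: at least 2 days, at most 10, most likely 4.
samples = draw_parametric_samples(1000, 2.0, 10.0, 4.0)
print(len(samples), min(samples), max(samples))
```

Every value in the list is one equally probable draw; the formula only shows up through how densely the values cluster around the most likely duration.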

To make a sample distribution, we do away with the analysis and the formula but keep the observations – renamed ‘samples’ – and put them in a vector (array, matrix, list). We let the real-world data speak for itself. There are some constraints on how we do this, so that the sample distribution meets some basic specifications, but that is essentially all there is to it.

Each sample in a sample distribution is as likely as any other to be the one closest to the variable's actual value. Also, the number of samples whose values fall into a particular range is proportional to the probability that the uncertain variable will take on a value in that range. Sorted, the series of values forms a low-discrepancy sequence.
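Because every sample carries the same weight, the probability of a range can be estimated by simple counting. The sketch below assumes a made-up list of ten task durations; the helper name probability_in_range is hypothetical.

```python
def probability_in_range(samples, low, high):
    """Estimate P(low <= X < high) as the fraction of samples in that range."""
    hits = sum(1 for s in samples if low <= s < high)
    return hits / len(samples)

# Hypothetical task durations in days; each sample is equally likely.
durations = [4.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 9.0, 10.5, 12.0]

# Four of the ten samples (6.0, 6.5, 7.0, 7.5) fall in [6, 8).
p = probability_in_range(durations, 6.0, 8.0)
print(p)  # 0.4
```

With only ten samples the estimate is coarse; hundreds or thousands of samples give finer resolution.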

For simulation with a model, each result sample is the output of the model calculation on the corresponding input samples. If we number the simulation trials, the nth trial uses the nth sample from each of the input variables and produces the nth result sample. Calculating this way makes the result a valid sample distribution, with every result sample as likely as any other to be the actual outcome.

Array math functions give us the means to calculate with all the samples in parallel, so that the first through nth result samples are produced all at once. Excel’s array formula feature provides that capability – with some limitations. Add-ins provide functions for operating on sample distributions without filling worksheets with columns of numbers.

In coding terms, let's say you have two sample distributions A and B, an array function F, and a result sample distribution R = F(A, B).

Then R[i] = f(A[i], B[i]), where f is the scalar version of F and i is an index over all of the samples in the three sample distributions.
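The relationship R[i] = f(A[i], B[i]) can be sketched in a few lines of plain Python. The helper apply_elementwise and the two four-sample inputs are hypothetical, kept tiny so the result is easy to check by hand.

```python
def apply_elementwise(f, a, b):
    """Return R where R[i] = f(a[i], b[i]) for every trial index i."""
    if len(a) != len(b):
        raise ValueError("sample distributions must have the same length")
    return [f(x, y) for x, y in zip(a, b)]

# Hypothetical inputs: days to design and days to build, one pair per trial.
design_days = [3, 5, 4, 6]
build_days = [10, 8, 12, 9]

# The scalar model f is simple addition: total duration of each trial.
total_days = apply_elementwise(lambda x, y: x + y, design_days, build_days)
print(total_days)  # [13, 13, 16, 15]
```

An array library or a spreadsheet array formula does the same pairing by index, only without the explicit loop.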

A sample distribution has two independent attributes: its shape, determined by the sample values, and its order, determined by the sequence in which the values occur in the sample vector.
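A small sketch can show why shape and order are separate attributes. The values here are invented: reversing B leaves its shape (the set of values) unchanged, yet changes which sample of A each value is paired with, and so changes the result.

```python
a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
b_reversed = list(reversed(b))  # same values, different order

# Shape is unchanged: both versions hold exactly the same values.
same_shape = sorted(b) == sorted(b_reversed)

# Order matters once samples are combined trial by trial.
r_original = [x + y for x, y in zip(a, b)]
r_reversed = [x + y for x, y in zip(a, b_reversed)]

print(same_shape)   # True
print(r_original)   # [11, 22, 33, 44]
print(r_reversed)   # [41, 32, 23, 14]
```

The two results have different spreads (33 against 27) even though neither input changed shape, which is why order has to be treated as an attribute in its own right.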

See The Shape of a Sample Distribution for more about shape.
