04 May 2012

Order in the Distribution

Sample distributions have two independent properties: shape and order. The shape is the list of values in the distribution without reference to the sequence they're in. Order is the sequence they're in. Order has no effect on the statistical properties of a distribution considered in isolation, but it matters as soon as we calculate with several distributions at once.

I've posted about Shape a few times; now it's Order's turn.

Rank Order

There are two ways of looking at the order of a list: sort order and rank order. Both are best described by associating an integer with each list item, where the integers go from 1 to n and n is the number of items in the list - the list count. The list of integers is a permutation index: applying it to a list builds a new list whose ith item is the original item at the position named by the ith integer. The difference between the two orders is in what that permutation produces.

The sort order will permute the list so that the result is the list in order of ascending values. For example:

Item #:      1  2  3  4  5  6
List:       44 42 46 45 41 43
Sort Order:  5  2  6  1  4  3
Permuted:   41 42 43 44 45 46 
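As a minimal sketch of the table above in Python: the helper names `sort_order` and `permute` are made up for illustration (they are not from any library), and the indexes are kept 1-based to match the example.

```python
def sort_order(values):
    """Return the 1-based permutation index that sorts `values` ascending."""
    # sorted() is stable, so tied values keep their original relative order.
    return sorted(range(1, len(values) + 1), key=lambda i: values[i - 1])


def permute(values, index):
    """Apply a 1-based permutation index: result[i] = values[index[i]]."""
    return [values[k - 1] for k in index]


data = [44, 42, 46, 45, 41, 43]
s = sort_order(data)
print(s)                 # [5, 2, 6, 1, 4, 3]
print(permute(data, s))  # [41, 42, 43, 44, 45, 46]
```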

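The rank order is the inverse of the sort order - it undoes what the sort order does and vice-versa. After sorting a list, its rank order will put it back the way it was. Each entry is also the rank of the corresponding item in the original list, with 1 for the smallest value. Using the same example: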

Item #:      1  2  3  4  5  6
List:       41 42 43 44 45 46
Rank Order:  4  2  6  5  1  3
Permuted:   44 42 46 45 41 43
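The rank order can be sketched the same way. The function name `rank_order` is hypothetical, indexes are 1-based as in the table, and tied values (which the example doesn't have) would simply get distinct ranks in their original order.

```python
def rank_order(values):
    """Return the 1-based rank of each item, where 1 is the smallest value."""
    # Positions of the items in ascending order of value (0-based).
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, pos in enumerate(order, start=1):
        ranks[pos] = rank
    return ranks


data = [44, 42, 46, 45, 41, 43]
r = rank_order(data)
print(r)  # [4, 2, 6, 5, 1, 3]

# Applying the rank order to the sorted list restores the original sequence.
restored = [sorted(data)[k - 1] for k in r]
print(restored == data)  # True
```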

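To convert one of the orders to the other, use the fact that each is the inverse permutation of the other. With S as the sort order and R as the rank order:

S[R[i]] = i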
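One way to do the conversion, sketched in Python with a hypothetical helper `invert` and 1-based indexes: walk the index once and write each position into the slot it points at. The same function converts in either direction.

```python
def invert(index):
    """Invert a 1-based permutation index, so that inverse[index[i]] = i."""
    inverse = [0] * len(index)
    for i, k in enumerate(index, start=1):
        inverse[k - 1] = i
    return inverse


rank_idx = [4, 2, 6, 5, 1, 3]
sort_idx = invert(rank_idx)
print(sort_idx)          # [5, 2, 6, 1, 4, 3]
print(invert(sort_idx))  # [4, 2, 6, 5, 1, 3]
```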

What we're most interested in is rank order, because it has the same value relationships as the sample distribution. If we have a sample distribution D and its rank order R, then

if D[i] > D[j] then R[i] > R[j]

Correlation

The reason we're concerned about order is that the correlation between two distributions is a function of their respective orders. We want to avoid manufacturing correlation where there is none, and we want to preserve correlation where it exists.

Also, when we're using sample distributions in a simulation, we want to mix up the inputs so that the results are representative of the possible outcomes of the system we're modeling. If their orders are similar, that won't happen.

Take the extreme case of two distributions, both sorted ascending. If we add them together, we'll be adding the smallest values of the one to the smallest values of the other, the largest to the largest and the middling to the middling. The result would be sorted ascending. There wouldn't be any small+large or middling+small, or any other unaligned inputs. The result would not be a valid sample distribution of the sum: it would have too many values at the extremes and too few in the middle.

Let's say we throw two dice repeatedly and add the faces. We'd get all the numbers between 2 and 12 with more results around seven than around two or twelve. When we throw two dice, we're mixing up their order - and that's what gives us valid results.
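A small deterministic sketch of the difference, using made-up samples of 36 throws per die in which each face appears six times. In the aligned case both samples are sorted; in the mixed case the second sample is arranged so that every face of one die meets every face of the other exactly once.

```python
from collections import Counter

faces = [1, 2, 3, 4, 5, 6]

die_a = sorted(faces * 6)          # 1,1,1,1,1,1,2,2,...
die_b_aligned = sorted(faces * 6)  # same order as die_a
die_b_mixed = faces * 6            # 1,2,3,4,5,6,1,2,...

# Add the two samples position by position and count each total.
aligned = Counter(a + b for a, b in zip(die_a, die_b_aligned))
mixed = Counter(a + b for a, b in zip(die_a, die_b_mixed))

print(sorted(aligned.items()))
# [(2, 6), (4, 6), (6, 6), (8, 6), (10, 6), (12, 6)]
print(sorted(mixed.items()))
# [(2, 1), (3, 2), (4, 3), (5, 4), (6, 5), (7, 6), (8, 5), (9, 4), (10, 3), (11, 2), (12, 1)]
```

With aligned orders only even totals appear, each equally often, so two and twelve are as common as six or eight. With mixed orders the totals peak at seven, as they should.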

It's the same with sample distributions. So, before calculating with them, we shuffle them - permute them with different random permutation indexes. And we go to some length to make sure they're different. In general, to be safe, we'll force an incoming distribution to a random rank order, using a generator that's designed to produce unique permutation indexes. If this is done right, we don't have to worry about accidental correlation between our inputs messing up our results.
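A minimal sketch of that shuffling step, assuming Python's standard `random` module as the source of randomness. The function `distinct_permutation_indexes` is a hypothetical stand-in, not the generator referred to above: it only guarantees that no two indexes are identical, which is a weaker property than being uncorrelated. The input values are invented.

```python
import math
import random


def distinct_permutation_indexes(n, count, seed=None):
    """Return `count` different random 1-based permutation indexes of length n."""
    if count > math.factorial(n):
        raise ValueError("not enough distinct permutations of this length")
    rng = random.Random(seed)
    seen, result = set(), []
    while len(result) < count:
        index = list(range(1, n + 1))
        rng.shuffle(index)
        if tuple(index) not in seen:  # redraw any duplicate
            seen.add(tuple(index))
            result.append(index)
    return result


def permute(values, index):
    """Apply a 1-based permutation index: result[i] = values[index[i]]."""
    return [values[k - 1] for k in index]


# Two made-up input distributions, both arriving sorted.
inputs = {
    "a": [41, 42, 43, 44, 45, 46],
    "b": [10, 20, 30, 40, 50, 60],
}
indexes = distinct_permutation_indexes(6, len(inputs), seed=1)
shuffled = {
    name: permute(values, index)
    for (name, values), index in zip(inputs.items(), indexes)
}
print(shuffled)
```

Each input keeps its shape (the same values) and gets its own order.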

Sometimes we have inputs that are correlated. Let's say that two of our inputs are the price of oranges and the price of orange juice. There are two things we need to do:

First, when we collect the data, we need to make sure the two prices are lined up in the distributions, so the nth value in the orange price distribution was collected at the same time as the nth value in the juice distribution.

Second, we make sure they stay that way. Sam Savage calls this collection of distributions a SLURP - Scenario Library Unit with Relationships Preserved - and it gets special treatment.

We still have to permute them to avoid correlation with other inputs, but we'll permute all the distributions in a SLURP with the same permutation index. In one step, this preserves intended correlation while avoiding accidental correlation.
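Sketched in Python, with invented prices and a hypothetical helper `permute_together`: one random permutation index is drawn and applied to every distribution in the group, so the values that were collected together stay together.

```python
import random


def permute(values, index):
    """Apply a 1-based permutation index: result[i] = values[index[i]]."""
    return [values[k - 1] for k in index]


def permute_together(group, index):
    """Apply one permutation index to every distribution in a group."""
    return {name: permute(values, index) for name, values in group.items()}


# Made-up prices; position i holds the same collection date in both lists.
slurp = {
    "oranges": [1.10, 1.25, 0.95, 1.40, 1.30],
    "juice": [3.20, 3.55, 3.00, 3.90, 3.70],
}

rng = random.Random(7)
index = list(range(1, 6))
rng.shuffle(index)

shuffled = permute_together(slurp, index)

# The (orange, juice) pairs are the same set before and after the shuffle.
pairs_before = set(zip(slurp["oranges"], slurp["juice"]))
pairs_after = set(zip(shuffled["oranges"], shuffled["juice"]))
print(pairs_before == pairs_after)  # True
```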

Proper management of sample distribution order provides the mixing of inputs we need for simulation, while avoiding accidental correlation and preserving deliberate correlation.
