Let's start at the beginning:
Modeling with Uncertainty
We use computer models to help us foresee the course and consequences of decisions we might make. As with all science, the goal is clairvoyance.
If it's well-designed, a model is faithful to the real-world process the decisions will affect; it tells us what's likely to change in the real world if we make this decision or that assumption.
Modeling a process when the inputs are assumed to be correct and exact – using a formula that puts out The One True Answer as a result – is relatively straightforward. It's what we do when we calculate what will happen to our bank balance if we decide to buy an expensive toy. It's also what we do when we estimate projects with CPM or PERT, producing the optimistic estimates that help get projects approved.
But it's not always so simple; sometimes a scalar mathematical model won't give us what we need to make an informed decision. This is the case when some of the input assumptions are uncertain variables.
Before you flip a coin, the outcome, heads or tails, is unknown. That result is an uncertain variable. It's a variable because it can have more than one value, and uncertain because we don't know what the value will be. Our travel time to work tomorrow, the time to complete a development project, and the future price of a commodity are all uncertain variables. If they're inputs to a model, the result should reflect that uncertainty.
Calculating with an uncertain variable is different from calculating with a scalar (a plain variable without uncertainty). Instead of doing the model calculation once with one value of input, we run the model with many values. For each case (called a trial or scenario), we use a different one of the possible values and record the different results.
For models with multiple uncertain inputs (the usual case) we do each scenario with a different combination of inputs. All the different results then give us a picture of the range of possible outcomes from the model (and, if we're any good at modeling, the real world).
In most non-trivial modeling situations, the number of uncertain inputs and the number of possible values of each input make it impractical to model all possible combinations. Instead, we decide how many scenarios we want to simulate and use a random number generator to select the value of each input for each scenario. If we do it right, the random selections will encompass the range of possible input combinations, and the model output will accurately reflect the real-world outcomes and their probabilities. This is what we call Monte Carlo Simulation.
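In Python, that loop might look like the sketch below. The model, the candidate input values, and the scenario count are all illustrative assumptions, not part of any standard:

```python
import random

def model(price, demand):
    """A toy revenue model; names and numbers are illustrative."""
    return price * demand

# Possible values for each uncertain input (illustrative).
prices = [9.99, 10.99, 11.99]
demands = range(800, 1200)

random.seed(42)            # reproducible for the example
scenarios = 1000           # we decide how many scenarios to simulate
results = []
for _ in range(scenarios):
    # A random number generator selects each input's value per scenario.
    results.append(model(random.choice(prices), random.choice(demands)))
```

The collected results, taken together, are the picture of possible outcomes described above.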
That's the theory. In practice, conventional Monte Carlo Simulation and the Probability Management implementation are not much alike.
Conventional Monte Carlo
Conventional Monte Carlo has two phases. We do the preparation phase once; we won't need to repeat it unless our input assumptions change. The simulation phase is when we do the model calculations. We can repeat it without revisiting the preparation phase.
The preparation phase involves analysing the uncertain variables. In each case the modeler assigns the variable a shape – a parametric probability distribution type, such as normal, log-normal or power-law. Then she calculates the parameters that most closely approximate the possible values of the variable. It should go without saying that doing this right needs a firm grounding in statistical analytics.
The chosen shape and parameters are coded as input to the simulation phase. The simulation is a programmed loop that repeats the model calculation as many times as we want scenarios. For each scenario, it uses a pseudo-random number generator and the designated shape function to produce a random value, a sample, for each input variable. It feeds the samples to the model calculation and does something with the result before going around again.
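A minimal sketch of that simulation loop, assuming a toy model and made-up shape parameters standing in for whatever the preparation phase produced:

```python
import random

random.seed(1)  # reproducible for the example

def model(task_days, overhead_days):
    """Toy model: total elapsed days. Illustrative only."""
    return task_days + overhead_days

results = []
for _ in range(5000):                                 # one pass per scenario
    # Shape functions chosen in the preparation phase (parameters assumed).
    task = random.lognormvariate(2.0, 0.4)            # log-normal shape
    overhead = max(0.0, random.normalvariate(5.0, 1.5))  # normal, floored at 0
    results.append(model(task, overhead))             # keep the result, go around
```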
Depending on the tool in use, the simulation might keep all the results and present them as a histogram and/or percentile chart. Or it might just put out an average with a confidence interval. It might also just keep looping until some test of convergence has been passed.
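Summarizing the collected results – an average with a confidence interval, plus percentile cut points – can be sketched with the standard library; the numbers here are placeholders for real simulation output:

```python
import math
import statistics

# 'results' stands in for the values collected during the simulation loop.
results = [4.2, 5.1, 3.8, 6.0, 4.9, 5.5, 4.4, 5.8, 3.9, 5.2]

mean = statistics.mean(results)
sem = statistics.stdev(results) / math.sqrt(len(results))
ci95 = (mean - 1.96 * sem, mean + 1.96 * sem)  # normal approximation

# Nine cut points dividing the sample into deciles, for a percentile chart.
deciles = statistics.quantiles(results, n=10)

print(f"mean={mean:.2f}, 95% CI=({ci95[0]:.2f}, {ci95[1]:.2f})")
```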
Passing the results of a simulation to another model that will use them as input is application-specific, if it's supported at all. Passing uncertain variable data from one application or platform to another is difficult without something like a DIST – which we're about to meet.
The Probability Management Way
I like to call the PM way of simulation 'Las Vegas Simulation.' It's less sophisticated, more pragmatic, and accessible to anybody who wants to go there. Like conventional Monte Carlo, it has a preparation phase and a simulation phase, but there the similarity pretty much ends.
Probability Management has a data management core. It replaces much of the complexity of Monte Carlo with straightforward records management.
Of all of Sam Savage's contributions, one of the most important is a data type – the DIST – that represents a probability distribution as a sample distribution: a vector of samples from an uncertain variable.
A DIST is an XML string that encodes and describes a sample distribution. It can be stored, indexed, retrieved, updated and communicated as a single entity. It can be stored as an element in an HTML file, a single spreadsheet cell, a single database field, or an element in an XML structure. It includes metadata that can be used for searching and indexing.
<dist name="LRT Cost Projection" avg="4.5878" min="1.1" max="10.5" count="100" type="Double" origin="DistShaper3 at smpro.ca" ver="1.1" > EstjZ0AAOo2NnVObN9SA0SfDDM1m8ZMQqwdHW rLtczMK5SYhbGQyHIrkSPw5MeTDegLXJVf3mI KVyVaaHFOuTH6jNRsYg2UJaJRR+VCdVPhEWyB oHjubOxSzYcVcmBabbtfShIgrd9RBXHx1PUeD REpZS/teOj6jEJ0zvnWm8mGFcgVyu+krTSJQy Ycs7xprLpIweiQ5nfRC/1r2zimQV3EFpDigrj vqNnjCuGAiKau3SE1Yp31ZmUX+CCz//w61Ark AAE76ajYAAAAA</dist>
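Because a DIST is plain XML, ordinary XML tooling can handle it. The sketch below reads only the metadata attributes, which is what searching and indexing need; decoding the sample payload itself is defined by the DIST standard and isn't attempted here. The payload is truncated for brevity:

```python
import xml.etree.ElementTree as ET

# A DIST string like the example above, with the payload truncated.
dist_xml = (
    '<dist name="LRT Cost Projection" avg="4.5878" min="1.1" max="10.5" '
    'count="100" type="Double" origin="DistShaper3 at smpro.ca" ver="1.1">'
    'EstjZ0AAOo2NnVObN9SA0...'
    '</dist>'
)

element = ET.fromstring(dist_xml)
metadata = {
    "name": element.get("name"),
    "avg": float(element.get("avg")),
    "count": int(element.get("count")),
    "origin": element.get("origin"),
}
print(metadata["name"], metadata["count"])  # ready for indexing and search
```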
A DIST can represent any uncertain variable, any shape, including shapes for which there are no parametric generators and that defy any attempt to fit a curve.
Just as important, a DIST specifies the order in which the samples are presented. In conventional Monte Carlo, order is a side effect of the sample generation; sample values and the order they come out can't be separated. In a DIST, order is an explicit and independent attribute of the encoded sample distribution.
Correlation is all about order. Control of order makes it possible for us to manage complex correlation that, with conventional Monte Carlo, would be lost behind a scalar correlation coefficient. One of the easy and fun things to do is to find pairs of entangled uncertain variables that a correlation coefficient would disastrously misrepresent, disconnecting the model's behaviour from the real world.
This scatterplot plots temperature against its rate of change. Their correlation coefficient is close to zero. Conventional Monte Carlo would miss the negative feedback near the temperature extremes; the model results would be misleading.
Sam Savage has a great example of two distributions whose scatterplot makes a happy face. Their coefficient of correlation is a barely-noticeable 0.008.
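It's easy to manufacture such a pair. In the sketch below, y is completely determined by x, yet because the relationship is nonlinear the Pearson coefficient comes out near zero (the data is illustrative):

```python
import random

random.seed(7)

def pearson(xs, ys):
    """Plain Pearson correlation coefficient, no libraries needed."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# x is symmetric about zero; y is completely determined by x.
xs = [random.uniform(-1, 1) for _ in range(10_000)]
ys = [x * x for x in xs]

r = pearson(xs, ys)
print(r)  # near zero, despite the total dependence
```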
We avoid this problem by using the data as it stands.
The DIST Standard is gaining support from vendors of modeling tools. An immediate benefit is the ability to pass sample distributions between the different vendor platforms and applications that support DISTs.
Let the data speak for itself. The preparation phase of Las Vegas Simulation builds DISTs for future use. Instead of Procrustean curve-fitting, we just use the data.
If data is scarce, a subject matter expert can estimate a distribution. They might use an estimating tool like DistShaper.
Given a sample distribution, the next step is to permute it with a unique permutation index, to avoid manufacturing correlation with other model input DISTs. On the other hand, if two or more of the inputs are known to be correlated, we'll permute them with the same index, to preserve that correlation.
Permutation doesn't need interesting generators; all it needs is a uniform random integer generator of the most ordinary kind. This is the only place we need a random number generator.
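A sketch of the idea: an ordinary seeded generator drives a Fisher-Yates shuffle, and sharing or not sharing the index decides whether the pairing between two sample distributions survives. The sample values and seeds are illustrative:

```python
import random

def permutation_index(n, seed):
    """A permutation of range(n) from an ordinary uniform generator
    (random.shuffle is a Fisher-Yates shuffle under the hood)."""
    rng = random.Random(seed)
    index = list(range(n))
    rng.shuffle(index)
    return index

def permute(samples, index):
    return [samples[i] for i in index]

# Two correlated sample distributions: cost rises with duration (illustrative).
duration = [5, 6, 7, 8, 9, 10]
cost = [50, 62, 71, 83, 90, 104]

# Same index for both: the pairing, and so the correlation, is preserved.
shared = permutation_index(len(duration), seed=1)
d_same, c_same = permute(duration, shared), permute(cost, shared)
assert sorted(zip(d_same, c_same)) == sorted(zip(duration, cost))

# Unique index per variable: pairing is scrambled, so no accidental
# correlation is manufactured between independent inputs.
d_solo = permute(duration, permutation_index(len(duration), seed=2))
c_solo = permute(cost, permutation_index(len(cost), seed=3))
```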
To be fair, considering the 'unique' in 'unique permutation index', it's not quite so ordinary when we're dealing with very large models or many models whose results are combined. That needs a long-period generator. There's a whole administrative discipline wrapped around handling extreme cases. Vector Economics has a useful service if you have this kind of modeling challenge.
Permutation and resampling can easily be done in a single step. In any case, the result is tucked away as a DIST.
One of the intended uses of DISTs is to collect them into DIST libraries. Useful real world data can be packaged in DISTs and indexed for future retrieval by any model that needs it.
With most of the work already done, there's comparatively little for the simulation to do. This is good, because it's the part that's likely to be repeated many times. Instead of looping through the model hundreds or thousands of times, generating a new set of random input values each time, we do the model calculation just once. The difference is that the model is built to do each calculation using array math with DISTs and sample distributions instead of scalars. We don't need a random number generator because the DISTs are already randomized.
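The "array math" point can be illustrated with a toy profit model: the calculation runs once, but every operand is a vector of aligned samples. Element i across all the vectors is scenario i, which is exactly the alignment the permutation indices protect. The names and values are illustrative, not from a real DIST:

```python
# Each variable is a sample distribution, not a scalar.
revenue = [120.0, 95.0, 140.0, 110.0]   # samples from a revenue DIST
cost = [80.0, 88.0, 75.0, 83.0]         # samples from a cost DIST

# "Array math": one expression produces the whole output distribution.
profit = [r - c for r, c in zip(revenue, cost)]

print(profit)  # [40.0, 7.0, 65.0, 27.0]
```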
We pack the model outputs into DISTs. Those DISTs can be used to create graphs and other views of the model outcome, they can be passed on to a higher level aggregating model, they can be filed for future reference. The input DISTs, output DISTs and model specification can be packaged up and kept indefinitely for audit or re-use, using whatever records management facilities are already in place.
The Bottom Line
The Probability Management Way, Las Vegas Simulation, is simpler, less of a computer hog, and more easily managed than conventional Monte Carlo. It's also not bound to any particular technology; a DIST is a portable, platform-independent uncertain variable.