28 June 2012

More Evidence for More Graphics

An article in the Harvard Business Review adds more evidence that presenting analytic results as charts instead of numbers improves the interpretation of the data.

Economists Are Overconfident. So Are You reports on a study that makes a good case for just charts and no numbers – charts and numbers produced worse interpretations.

It's a good read

11 June 2012

Sample Distributions for Excel - SDXL Ver.0.4.0

SDXL is a free, open-source Excel Add-in I developed to bring Probability Management techniques and strong array handling to Excel. Most of the functions manipulate and calculate with sample distributions represented as arrays. The arrays passed into functions as arguments can be of several types that the functions detect and handle automatically. The types are:

  • a VBA array
  • an XML string.
  • a CSV (Comma Separated Values) string
  • A cell range

There are functions for converting one format to another and for reading and writing XML files containing sample distributions. There are also simple functions to present sample distributions graphically with histograms, scattercharts and percentile charts (cumulative probability).

The math functions include all the usual arithmentic, boolean and comparison functions. It's also possible to mix array and scalar arguments.

The code to multiply two sample distributions and return the result as a CSV string in a cell is:

= toCSV( sdMul( arrayA, arrayB))

The arguments arrayA and arrayB could be just about anything that can be interpreted as a sample distribution.

There are also a lot of functions that assume the arrays are sample distributions and do statistical stuff. There's a complete set of sampling, sorting and permuting functions dedicated to manipulating the array elements – in particular, their order.

And, of course, there are a bunch of random number generators using both VBA's built-in generator and an efficient Mersenne Twister implementation.

I think everything needed to develop fairly interesting simulations and models is there. There are two things that are conspicuous by their absence: Coherent sets of distributions and distribution time-series. I like to have real use cases to drive a design and since I've been concentrating on project planning and estimating, I haven't seen any of these – yet.

To get SDXL, go to smpro.ca/SDXL .

There's also a Google+ Community at +SDXL Users

10 June 2012

Probability Management for Project Estimates

Calculating with averages, point estimates and expected values runs afoul of the Flaw of Averages. The answers we get aren’t just wrong; in the case of project estimates, they’re hopelessly optimistic. This bears getting hammered on. The math, the formulas, the computation is fundamentally, intrinsically, wrong, on several fronts.

And the errors are one-sided; they underestimate cost and time.

There’s a long dog-eared list of things that contribute to project failure. Most of them have to do with human factors or natural disasters like hundred-year storms – hard problems to deal with. Here, on the other hand, is an easy problem to deal with – just fix the math.

03 June 2012

A Tale of Two Dice

Let’s do a little experiment.

Take two dice, one six-sided, the other ten-sided. If you don’t have the dice and there isn’t a game shop handy, you can always use Excel to simulate the experiment. We’re going to throw both dice many times and each time we’re going to record the maximum value of the two sides that turn up. After a few hundred throws, we’ll calculate the average of the values we’ve recorded.

But do we need to do the calculation? We know that the average of the six-sided throws is 3.5 and that the average of the ten-sided throws is 5.5. The maximum of 3.5 and 5.5 is 5.5.

But let's do the calculation anyway. What we get is an average of about 6.1. That can’t be! So we do it again. And again the average is about 6.1.

So what's going on here? The common sense explanation is that, since you always take the larger of the two values, the combined average should be bigger. How much bigger?

More generally, is there a general-purpose mathematical/statistical function that will produce the probability distribution that's the maximum of two probability distributions? The dice problem is easy – it involves the simplest distribution there is. And yet ...

I dredged the internet looking for a formula and came up empty – except for this valiant effort: http://www.cecs.uci.edu/~papers/iccad06/papers/3C_4.pdf

These guys find a formula that works for a very special shape of distribution that may or may not match the real world and, if I read it right, they force the result into the same shape.

The irony in this paper is that they use simulation to validate the formula. If the simulation results are the measure of the formula's validity, why not just simulate and be done with it, especially since simulation solves the problem for any shape of distribution?

That's one reason I say "Don't formulate. Simulate!"