29 January 2012

The Shape of a Sample Distribution

When I'm writing about sample distributions, I realize that I throw the 'shape' and 'order' terminology around assuming they're understood in this context. What I've never done is a proper treatment of either one, so here's 'shape'.

The use of the term shape comes from its informal use in statistics, referring to the characteristic shape of the graphs used to picture various probability distributions. The normal, or Gaussian, distribution has its characteristic bell shape, the power law distribution has its ski-hill shape, and so on. That also applies to sample distributions, but they just are what they are – we don't give them names. Well, not usually, but there's one I've named Wile E. Coyote, both for its profile and for the low-probability, high-impact events that terminate his projects. A bit like sub-prime mortgages.

There are a bunch of things you can say about a sample distribution's shape:

  1. A sample distribution's shape is determined by the sample values independent of their order in the sample vector.
  2. A distribution's shape is what determines its conventional statistical attributes, like mean and median.
  3. Each sample in a sample distribution has the same probability as every other of being the one closest to the 'true' value.
  4. Shape is independent of the number of samples in the distribution.

My favourite is this: If two sample distributions have the same shape, the probability of finding a sample whose value is inside any particular value range will be the same for both distributions – allowing for the resolution afforded by the number of samples. A consequence of this is that their histograms, percentile charts and characteristic curves will all match.
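Here's a minimal Python sketch of that property. The function name `prob_in_range` and the sample values are made up for illustration, and it assumes that repeating every sample the same number of times is a fair stand-in for "same shape, more samples".

```python
def prob_in_range(samples, low, high):
    """Fraction of samples whose value lies in [low, high]."""
    hits = sum(1 for v in samples if low <= v <= high)
    return hits / len(samples)


# Hypothetical sample values, for illustration only.
a = [3, 1, 4, 1, 5, 9, 2, 6]
b = sorted(a, reverse=True)  # same values in a different order
c = a * 3                    # same shape with three times the samples

for name, dist in (("a", a), ("b", b), ("c", c)):
    print(name, prob_in_range(dist, 2, 5))
```

All three report the same probability for the range 2 to 5, because neither the order nor the sample count changes the shape.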

The histogram is probably the most familiar of these charts, though people usually get it wrong and Excel makes it hard to get right.

The S-shaped percentile chart, my favourite, plots each sample value against the probability that the actual will be equal to or less than that value. This is good for costs and times. It has a complementary form, for equal to or more than, that's good for things like profit and demand.
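The points behind both forms of the percentile chart can be sketched in a few lines of Python. This is only an illustration: the function names and the sample values are invented here, and each probability is simply the fraction of samples at or below (or at or above) the value.

```python
from bisect import bisect_left, bisect_right


def percentile_points(samples):
    """(value, P(actual <= value)) pairs, in ascending value order."""
    ordered = sorted(samples)
    n = len(ordered)
    return [(v, bisect_right(ordered, v) / n) for v in ordered]


def complementary_points(samples):
    """(value, P(actual >= value)) pairs, in ascending value order."""
    ordered = sorted(samples)
    n = len(ordered)
    return [(v, (n - bisect_left(ordered, v)) / n) for v in ordered]


# Hypothetical cost samples, for illustration only.
costs = [12, 10, 15, 12]
print(percentile_points(costs))
print(complementary_points(costs))
```

Plot the first list for costs and times, the second for things like profit and demand.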

The characteristic curve is just the sample values plotted against their rank order. That's the values, in ascending order, plotted left-to-right.

The characteristic curve is interesting for another reason. Conceptually, you can change the number of samples in a distribution, without losing its shape, by changing the scale of the x-axis to go from 1 to the size you want. The new sample values are the height of the curve at each new rank increment. My Excel/VBA DIST resizing function uses the corresponding algorithm and gets really good results.
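As a rough sketch of the idea (not the VBA function itself), here is one way to do it in Python. It assumes straight-line interpolation between neighbouring points on the characteristic curve, which is my choice for the illustration; the name `resize` is hypothetical.

```python
def resize(samples, new_size):
    """Resample a distribution to new_size values, keeping its shape.

    Assumption: linear interpolation along the characteristic curve
    (the sorted values plotted against rank).
    """
    if not samples:
        raise ValueError("need at least one sample")
    if new_size < 2:
        raise ValueError("new_size must be at least 2")

    curve = sorted(samples)  # the characteristic curve
    n = len(curve)
    if n == 1:
        return [curve[0]] * new_size

    resized = []
    for k in range(new_size):
        # Map the new rank k onto the old rank axis 0 .. n-1.
        pos = k * (n - 1) / (new_size - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        # Height of the curve at the new rank increment.
        resized.append(curve[lo] + frac * (curve[hi] - curve[lo]))
    return resized


print(resize([30, 10, 20], 5))
```

The result comes back in rank order, so the original sample order is not preserved; this sketch only deals with shape.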

Have I left anything out?

26 January 2012

Risk is not discovered, it's chosen

Every so often I get a thought that puts an edge on one of my knives, and "Risk is not discovered, it's chosen" is one of them.

When you tell a stakeholder, "This project will be done in seven months," are you honest enough to tell her that there's a risk of not meeting that target, but she shouldn't worry about it, that you've made the decision about how much risk is acceptable on her behalf?

Given whatever method you use for estimating, do you even know what the probability of missing the target is?

Risk is a choice. The big question to ask is, "Has the choice been made knowingly? – And by the right person?"

17 January 2012

What's a DIST?

Uncertainty

A DIST is a portable uncertain variable.

That's the short (cryptic) description.

Most common computer languages make it easy to work with individual numbers. Things get interesting when we have a bunch of numbers to deal with. It's not usually straightforward if you want to treat an array, vector or matrix as a single entity.

13 January 2012

Bake Risk Management into the Plan

How do you manage the risk in a project? The default is a risk management program that's essentially out of band. The project and the risk management are separate tracks, often staffed with different people.

The first problem this creates is synchronization - making sure that changes in the project are reflected in the risk management plan and vice versa. The next problem is that there are two different groups of people with different objectives - and only one of them is focused on a successful project.

The probability management solution to this is to bake the risks into the plan with everything else. This way, there's one plan, one process, one team, one manager executing the project.

What does that mean? It means that we include risk events and responses as elements of the plan along with the tasks and milestones.

04 January 2012

Monte Carlo Simulation the Probability Management Way


Let's start at the beginning:

Modeling with Uncertainty

We use computer models to help us foresee the course and consequences of decisions we might make. As it is with all science, the goal is clairvoyance.

If it's well-designed, a model is faithful to the real-world process the decisions will affect; it tells us what's likely to change in the real world if we make this decision or that assumption.

Modeling a process when the inputs are assumed to be correct and exact – using a formula that puts out The One True Answer as a result – is relatively straightforward. It's what we do when we calculate what will happen to our bank balance if we decide to buy an expensive toy. It's also what we do to estimate projects with CPM or PERT, assuring optimistic estimates that help get projects approved.

But it's not always so simple; sometimes a simple mathematical model won't give us what we need to make an informed decision. This is the case when some of the input assumptions are uncertain variables.

02 January 2012

Parametrics are evil

It's reasonable to ask, 'What evidence do you have that this variable has a Normal distribution?'

The only admissible evidence is real-world data. If you have real-world data, you have a sample set that can be used to make a sample distribution. It may even look Normal (or whatever other shape you claim), near the mode, almost.

Eliminate the error introduced by a parametric approximation, and save yourself the pain of having to figure out which of the many possibilities best supports your preconceptions.

Let the data speak for itself, expressed as a sample distribution.

29 December 2011

The Computer Science of Insecurity

If you want to build systems that aren't easily hacked, this is a good article with a great presentation embedded in it.

22 December 2011

Probability Management for Projects

Probability Management improves our odds of success with realistic estimates and attainable targets.

How much will it cost?
How long will it take?

This kind of question doesn't have just one answer; it has a whole bunch of answers, each with its own probability of turning out to be right.

When we pluck one number out of the bunch to use as a target, we're also choosing our probability of success - the probability that we can meet or beat that target.
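With a sample distribution in hand, that trade-off is easy to compute in either direction. The Python sketch below is only an illustration: the function names and the duration samples are invented, and the probability of success is taken as the fraction of samples at or below the target.

```python
import math


def probability_of_success(samples, target):
    """Fraction of samples that meet or beat (are <= ) the target."""
    return sum(1 for v in samples if v <= target) / len(samples)


def target_for(samples, probability):
    """Smallest sample value with at least the given probability of success."""
    ordered = sorted(samples)
    k = math.ceil(probability * len(ordered))
    return ordered[max(k, 1) - 1]


# Hypothetical project durations in months, for illustration only.
durations = [5, 6, 6, 7, 7, 8, 9, 10, 12, 15]

print(probability_of_success(durations, 7))  # chance of finishing within 7 months
print(target_for(durations, 0.5))            # target with an even chance of success
```

Pick the target and you get the probability; pick the probability and you get the target. You can't choose both.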

04 December 2011

Counting Stuff – Statistics by Simulation

Ontario's Lotto 6/49 lottery involves a draw of 7 different numbers between 1 and 49. Each ticket has 6 different numbers from the same range. Prizes are awarded for various coincidences. This is a good excuse to do some 'statistics by simulation.'

Here's the challenge: What's the probability that at least one number on a ticket will match one of the numbers drawn?
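One way to count it out, sketched in Python. The function names, the trial count and the seed are arbitrary choices for the illustration, and it assumes both the draw and the ticket are picked uniformly at random. The exact figure is included only as a cross-check on the simulation.

```python
import math
import random


def simulate(trials, seed=649):
    """Estimate P(at least one ticket number matches the draw) by counting."""
    rng = random.Random(seed)
    numbers = range(1, 50)  # 1 to 49 inclusive
    hits = 0
    for _ in range(trials):
        draw = set(rng.sample(numbers, 7))  # 7 different numbers drawn
        ticket = rng.sample(numbers, 6)     # 6 different numbers on the ticket
        if any(n in draw for n in ticket):
            hits += 1
    return hits / trials


def exact():
    """1 minus the chance that all 6 ticket numbers miss the 7 drawn."""
    return 1 - math.comb(42, 6) / math.comb(49, 6)


print(simulate(20_000))
print(exact())
```

The simulated estimate should settle close to the exact value, which is about 0.625, as the number of trials grows.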

25 October 2011

PMI Symposium Presentation


The Project Management Institute Ottawa Chapter held their annual Symposium last week.

I got to give a presentation on Probability Management for Projects to a couple of hundred project managers. That gave me a captive audience and an hour to talk about banishing the Flaw of Averages from project estimates. The feedback I've had makes it clear that there's a lot of interest.

A related paper and Excel model are at smpro.ca/ProbMan.