17 January 2012

What's a DIST?

Uncertainty

A DIST is a portable uncertain variable.

That's the short (cryptic) description.

Most common computer languages make it easy to work with individual numbers. Things get interesting when we have a bunch of numbers to deal with: it's usually not straightforward to treat an array, vector or matrix as a single entity.

On the other hand, a single character or a string of characters is treated as a single entity, and there's lots of support for manipulating strings.

Spreadsheets, the most commonly used analytic tool, draw this distinction; a spreadsheet cell can hold a number, a string, or a formula that evaluates to a number or a string.

If we want to work with an array of numbers, say the sample values in a sample distribution, we're stuck with huge rows or columns of numbers, unwieldy array formulas, and no convenient way to move those bunches of numbers around from one model or system to another.

Compress

Sam Savage's genius was to draw on the fact that a cell could hold a string, and a string could encode a whole bunch of numbers. This lets us deal with a whole sample distribution as an entity. Since it's a string, we can take advantage of the fact that most modern communications and storage are designed to work well with strings, and that strings move easily between technical platforms (the Internet is built on that feature).

A DIST string can represent an uncertain variable that's easily preserved and moved around - a portable uncertain variable. A lot of problems related to modeling and simulation on a large scale are reduced to records management.

The immediate advantage is that, because a whole distribution fits in one cell, we can build a stochastic model whose structure is exactly the same as its corresponding deterministic model. Where there was a single number in a cell, there's now a DIST. Where there was a scalar calculation there's now an array calculation. Easily doing array calculations is where new tools come into play. Instead of a cell formula (a+b) we do something like (sdAdd(a,b)).
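To make "array calculation" concrete, here is a minimal sketch in Python rather than in a spreadsheet. The function `sd_add` is a hypothetical stand-in for something like sdAdd, not its actual implementation, and the sample values are made up for illustration.

```python
def sd_add(a, b):
    """Add two sample arrays trial by trial (hypothetical stand-in for sdAdd)."""
    if len(a) != len(b):
        raise ValueError("sample arrays must have the same number of trials")
    # Trial i of the result is trial i of a plus trial i of b.
    return [x + y for x, y in zip(a, b)]


# Illustrative sample values, not taken from any real model.
cost = [10.0, 12.5, 9.0]
delay = [1.0, 0.5, 2.0]
total = sd_add(cost, delay)
```

The point is that the model keeps the shape of the deterministic one: one formula per cell, with each operand standing for a whole array of trials.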

A DIST is an XML string. Using XML not only puts it in a widely supported format, it makes it easy to expand and adapt. The DIST 1.1 standard specifies eight attributes and a compressed encoding of the array of values.

Here's a small DIST:

<dist name="A DIST String" avg="2.3127"
min="-0.83" max="8.5" count="100" type="Double"
origin="DistShaper3 at smpro.ca" ver="1.1" >
lumbTLtnHaI1OzI2TAEp+1KXMNdItkGUyR8rWgyfTa
YAAHL3fwpncXxLawKByEpbYjozTzkSG3A6K3ELtIuJ
MUzUJd1JiWzujZTW1/JHLLk+1a2vW14onEJmRAw6/
gkOP+6f9wRkD+ppFkDBJH4veEcQJzw86jYOdONVD
zf5YJWm0y4ZUX43JoSGIkzkj1jmGPhFsWPgR+NdBP
//TwVu2UsuV4dZ/5I/QzlWbjwXU/YWxnbORN9GhDQie
Y1QHl7vH9Q+AxN7ZcsAAAAA</dist>

Attributes

A DIST's attributes are:

name
A text string used to identify the DIST
avg
The average of the encoded values
min
The minimum value
max
The maximum value
count
The number of sample values
type
The data type of the values, which also prescribes the compression resolution. The DIST 1.1 choices are "Binary", "Single", and "Double", with resolutions of 1 bit, 8 bits, and 16 bits respectively.
origin
A user-defined text string
ver
The DIST format version - currently "1.1"
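Because a DIST is plain XML, any standard XML parser can read these attributes. The sketch below uses Python's `xml.etree.ElementTree`; the function name `read_dist_attributes` is my own, and the payload in the example string is a placeholder, not real encoded data.

```python
import xml.etree.ElementTree as ET

# Attributes copied from the small DIST above; the payload "AAAA" is a
# placeholder standing in for the real encoded values.
DIST_XML = """<dist name="A DIST String" avg="2.3127"
min="-0.83" max="8.5" count="100" type="Double"
origin="DistShaper3 at smpro.ca" ver="1.1" >AAAA</dist>"""


def read_dist_attributes(xml_text):
    """Return the eight DIST attributes, with numeric ones converted."""
    root = ET.fromstring(xml_text)
    a = root.attrib
    return {
        "name": a["name"],
        "avg": float(a["avg"]),
        "min": float(a["min"]),
        "max": float(a["max"]),
        "count": int(a["count"]),
        "type": a["type"],
        "origin": a["origin"],
        "ver": a["ver"],
    }


attrs = read_dist_attributes(DIST_XML)
```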

Encoding

There are two aspects to the encoding: compression and translation.

Translation breaks the compressed data into six-bit chunks and translates each chunk into an ASCII byte using a variation of the Base64 standard. The resulting text strings become the encoded element of the XML string.

The Binary type isn't compressed. The binary ones and zeros are converted six at a time to the Base64 string.
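Here is a rough sketch of that six-at-a-time translation. It assumes the standard Base64 alphabet, most-significant-bit-first ordering, and zero padding of a short final chunk. DIST 1.1 uses a variation of Base64 whose details aren't covered here, so real DIST output may differ.

```python
# Assumption: the standard Base64 alphabet (the DIST variation may differ).
B64_ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)


def bits_to_text(bits):
    """Translate a sequence of 0/1 values into text, six bits per character."""
    chars = []
    for i in range(0, len(bits), 6):
        chunk = list(bits[i:i + 6])
        chunk += [0] * (6 - len(chunk))  # assumption: pad the last chunk with zeros
        index = 0
        for bit in chunk:
            index = (index << 1) | bit   # assumption: first bit is most significant
        chars.append(B64_ALPHABET[index])
    return "".join(chars)


encoded = bits_to_text([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1])
```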

The Single and Double types are compressed first. In essence the sample values are converted to 8-bit integers and 16-bit integers, respectively, and those bits are then fed to the translation stage. The effect is that three Singles (24 bits) are encoded into four characters, and three Doubles (48 bits) are encoded into eight characters.
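The character counts are easy to check with Python's standard `base64` module. This is only a sketch of the arithmetic: it assumes standard Base64 and big-endian byte order, neither of which is confirmed for the DIST variation, and `encode_ints` is a name I made up.

```python
import base64


def encode_ints(values, bits):
    """Pack 8- or 16-bit unsigned integers into bytes and Base64-encode them."""
    width = bits // 8
    # Assumption: big-endian byte order for the 16-bit case.
    raw = b"".join(v.to_bytes(width, "big") for v in values)
    return base64.b64encode(raw).decode("ascii")


singles = encode_ints([0, 128, 255], 8)       # three Singles, 24 bits
doubles = encode_ints([0, 32768, 65535], 16)  # three Doubles, 48 bits
```

Three Singles come out as four characters and three Doubles as eight, which is the 6-bits-per-character ratio at work.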

The conversion between the floating-point sample value and the integer, with some details left out, is

n = (2^b-1)*(v-min)/(max-min)

Here n is the integer result, v is the value and b is the number of bits to encode. In essence, the values are scaled and offset so that they all fit in the range between 0 and 2^b-1. For decoding, the min and max attributes in the DIST are used to reverse the scaling and recover the values to within the quantization step.
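The formula and its inverse can be sketched in a few lines of Python. The function names are mine, rounding to the nearest integer is an assumption (the formula leaves that detail out), and the degenerate case where max equals min is not handled.

```python
def quantize(v, lo, hi, bits):
    """Scale v from the range [lo, hi] to an integer in [0, 2**bits - 1]."""
    levels = 2 ** bits - 1
    # Assumption: round to the nearest integer.
    return round(levels * (v - lo) / (hi - lo))


def dequantize(n, lo, hi, bits):
    """Map an encoded integer back to an approximate value using min and max."""
    levels = 2 ** bits - 1
    return lo + n * (hi - lo) / levels


# Using the min and max from the small DIST above, with the Double type.
n = quantize(2.3127, -0.83, 8.5, 16)
approx = dequantize(n, -0.83, 8.5, 16)
```

The decoded value is not exactly the original; it lands on the nearest of the available quantization levels.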

Precision

The choice of 8 or 16 bits is, effectively, a choice of precision. The Single type yields a precision between two and three significant digits (1 part in 256); the Double type yields between four and five significant digits (1 part in 65,536). On decoding, the quantization sets the maximum number of different values the distribution can have. It also sets the minimum difference between encoded values that will result in the decoded values being different.
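To put numbers on that, the sketch below computes the count of distinct values and the spacing between them for a given range. The function name `resolution` is my own, and the step size follows from the scaling formula above rather than from anything the standard states explicitly.

```python
def resolution(lo, hi, bits):
    """Return (number of distinct decoded values, spacing between them)."""
    levels = 2 ** bits               # distinct integers from 0 to 2**bits - 1
    step = (hi - lo) / (levels - 1)  # gap between adjacent decoded values
    return levels, step


# Using the min and max from the small DIST above.
single_levels, single_step = resolution(-0.83, 8.5, 8)
double_levels, double_step = resolution(-0.83, 8.5, 16)
```

Two sample values closer together than the step may decode to the same number.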

There are some things that need to be done to improve the standard, but the DIST as it is has proven remarkably useful in many contexts, including some that can only be characterized as extreme.
