Discrete Distributions

Using Excel for Statistics

Excel can be quite convenient for doing statistics, so we'll use it unless we need more firepower.

Generating Random Numbers

Let's start simply. The randbetween(min,max) Excel function generates a random number between the given values, every time the worksheet is recalculated. Excel recalculates the worksheet when you change a value or enter a new cell, so this is pretty often.

To save the values, copy them and use the paste special (values) command.

Sorting

It's often useful to sort the random numbers. Select the column (it does need to be in column form) and choose Data > Sort... and choose how you want it sorted.

Histogram

Even better than sorting is to get a histogram of the data. Choose Tools > Data Analysis ... and then choose "Histogram" from the box of tools. You then have to select the data you want histogrammed. You also should select a second set of data that gives the bins you want, because Excels' default are terrible. Click on "chart" if you'd like a graphic of the histogram. Notice the check box for a "Pareto" histogram!

Location and Spread functions

Excel has the following functions built-in, and they work just as you'd think. Just select the range that they are computed over, and they return the desired value.

Excel can also compute spread statistics:

There's no built-in "interquartile range" that I know of, but Excel does have a built-in for the quartiles, so it's easy enough to do. It also does percentiles, and you can use non-integer values and it will interpolate!

Here's how we can compute the interquartile range

quartile(range,3)-quartile(range,1)

Other Discrete Distributions

Lots of kinds of random numbers show up in the world; dice (integer uniform) are just one. Let's look at several and cases in which they might be used.

Binomial

Suppose you toss a coin N times: the number of heads you get follows a distribution called the "binomial" (for reasons that will be more clear next week).

Q: Suppose you toss a coin twice. What's the most likely number of heads? Why is that the most likely number? What does the whole distribution look like?

Q: Same questions, with three tosses.

The binomial is often used in modeling systems, anytime you have a number of independent items, each of which can succeed or fail, and you're interested in how many total successes or failures you have. Here are some examples. Note that I have made up these numbers; any relationship with reality is purely coincidental.

In general, the binomial is characterized by two parameters:

Let's investigate these using Extend and Excel. In Extend, we just use our random number generator and a histogram block. In Excel, we can use the "Random Number Generator Analysis Tool": Tools > Data Analysis > Random Number Generation. This opens a dialog box in which you get to

Negative Binomial

Suppose we toss a coin until we get a head. What does that distribution look like? Suppose we wait for two heads? What does that distribution look like? We'll spend a little time on this, drawing out event trees and doing some calculations by hand.

A distribution like this is called a negative binomial. It is characterized by P, the probability of success on each trial, and S, the number of successes you want to wait for. The distribution is of the number of failures (since the number of successes is known).

Here are some examples. Again, I made up the numbers.

We'll play with these in Extend and in Excel.

Q: How might we use the negative binomial in our simulations?

Poisson

The Poisson distribution is sometimes defined as a limit of the binomial distribution, and sometimes it is defined as its own separate distribution.

As its own distribution, it's a distribution of the integer number of things that you get, given their rate, which can be any positive real number. For example, the number of calls to a hotline in a particular hour, given that the average number of calls is 3.2 per hour.

The Poisson is related to the binomial in the following way: In the binomial, the product NP is the expected number of successes, and as N gets larger and P gets smaller (so that the product stays the same), we get the Poisson distribution. The product NP is the rate of the Poisson process. For example, there's some non-contagious tree disease, where the expected number of diseased trees per acre is 1.3, the distribution of trees with the disease per acre is Poisson.

How might we use this in our simulations?

All Bets are Off

Everything that we've looked at so far is nice and unimodal. Suppose we've collected some data and it has the following distribution:

0: 50%
1: 15%
2: 20%
3: 0%
4: 15%

Imagine this is the distribution of number of bicycles per household. (Why does this make any sense?)

We could do this in Extend using the "Empirical Table" distribution. This is a tremendously flexible, though occasionally tedious, way of specifing a distribution.

It's not a bad idea to first try to understand a given source of randomness using one of the conventional families first (such as binomial), since that helps with a mental model of what is going on. But if those don't work, it's better to use an empirical table than to use the wrong kind of random numbers in your simulation.

This work is licensed under a Creative Commons License | Creative Commons License | Viewable With Any
Browser | Valid HTML 4.01! | Valid CSS!