The word "statistics" comes from governments (states) counting the population several hundred years ago. Soon, people realized that numbers mattered and could be used to win arguments and further political agendas. So, statistics should not be viewed just as mindless bean-counting, but as framing an argument.
Categorical versus numerical
The population is all the data: every possible value, the appropriate number of times. We can, for example, write down all possible tosses of a six-sided die.
A sample is the actual data we collect. In rare cases, we can sample the entire population, in which case they are equal. Is the U.S. census data the population or a sample?
The biggest problem when collecting a sample is avoiding bias. There are many examples of this. A classic is the "Dewey Defeats Truman" headline that Truman held up. Why were the polls wrong then? What about modern polls?
Most fields need to worry about this a lot. In simulation, we worry about it too, but in a different way. We need to avoid bias when building our models and choosing their parameters, including their random number generators. Otherwise, they'll only tell us what we expect to see.
descriptive: How to describe or summarize the sample: its most important features and such. For example, the mean, the median or the standard deviation.
inferential: How to use the properties of the sample to make inferences about (decide properties of) the population. For example, what is the probability of transmission of bird flu? Is Prozac better than placebo in treating bipolar disorder?
Just a way of visualizing categorical data: counts of various things. We could do one of the majors in this class.
How do you sort a dot diagram? Depends on the purpose and what information you want people to see. Since people are often interested in the biggest, smallest, etc. sorting by number can be useful.
Tufte (Visual Explanations) describes how, if NASA had sorted the graphic of space shuttle launches by temperature instead of date, the problem would have jumped out at them.
This is a great way to give a picture of data without losing any information. It doesn't work as well with large datasets, though.
A frequency distribution (often just "distribution") gives the number of items that fall in each category. It's a function where f(x) is the number of items for that value of x.
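Here is a small sketch of a frequency distribution in Python. The list of majors is made up for illustration (it is not this class's data), and the names `majors` and `freq` are just ones chosen here.

```python
from collections import Counter

# Hypothetical sample: the majors of ten students (made-up data).
majors = ["CS", "Math", "CS", "Econ", "Bio", "CS", "Math", "Econ", "CS", "Bio"]

# f(x) = number of items with value x.
freq = Counter(majors)

# Print the categories sorted by count, largest first, one star per item.
for major, count in freq.most_common():
    print(f"{major:5s} {'*' * count}  ({count})")
```

Sorting by count is only one choice; sorting alphabetically would just mean iterating over `sorted(freq)` instead.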
What would be a frequency distribution of ages at Wellesley College?
Some "rules of thumb":
Instead of using classes defined by "between x1 and x2," if all classes are defined by "less than x," we have a cumulative distribution.
Cumulative distributions can be nice because issues about class size can be eliminated. They're also nice because any range can be gotten with just a single subtraction, while in a frequency distribution, you'd have to add up all the classes. Still, they're hard to read.
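The "single subtraction" point can be sketched in Python as follows. The ages and counts are invented, and the helper `count_between` is a hypothetical name. Note one assumption: this sketch accumulates "less than or equal to x" rather than strictly "less than x," which is more convenient for whole-number ages.

```python
from itertools import accumulate

# Hypothetical frequency distribution: number of students at each age.
ages = [17, 18, 19, 20, 21, 22]
freq = [5, 120, 150, 140, 130, 20]

# Cumulative distribution: cum[x] = number of students with age <= x.
cum = dict(zip(ages, accumulate(freq)))

def count_between(lo, hi):
    """Number of students with lo < age <= hi, using one subtraction."""
    return cum[hi] - cum[lo]

# Students aged 19 through 21: everyone up to 21, minus everyone up to 18.
print(count_between(18, 21))
```

With the plain frequency distribution, the same answer needs the sum of three classes (19, 20 and 21).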
Histograms are a very important way to picture some data. If you have some spurious data in there, it will probably stick out like a sore thumb. It's usually easy to see the mode, and the median and mean can usually be eyeballed.
Rules of thumb:
Histograms are great, but sometimes we want a summary of some aspect of the distribution. Sometimes we want it for rhetorical reasons ("the average height of men is greater than the average height of women") or for technical reasons (some statistical test) or for something else.
Still, it's important to remember that when you use a summary instead of all the data, you're losing detail that can sometimes be important.
The most common is the mean (the arithmetic average). It's not necessarily the best one, but it's the most common. Using the mean is a very common way to "lie with statistics." Examples:
Notation:
Example: we roll a die four times and get two 3's, a 1 and a 5. What's the sample mean? What's the population mean?
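One way to check the die example is the short Python sketch below. It assumes the die is fair, so the population is the six faces, each equally likely. The variable names are chosen here for illustration.

```python
from fractions import Fraction
from statistics import mean

# The sample: the four rolls we actually observed.
sample = [3, 3, 1, 5]
sample_mean = mean(sample)  # (3 + 3 + 1 + 5) / 4

# The population: every face of a fair die, each with probability 1/6.
faces = range(1, 7)
population_mean = sum(Fraction(face, 6) for face in faces)

print("sample mean:", sample_mean)
print("population mean:", float(population_mean))
```

The two numbers differ because four rolls are only a sample; with more rolls the sample mean tends to drift toward the population mean.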
The trimmed mean has some advantages because it discards the "high leverage" values.
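As a rough sketch, a trimmed mean can be computed like this. The function `trimmed_mean` and its parameter `k` (how many values to drop from each end) are assumptions made here, and the incomes are made up. Other definitions trim a percentage of the data instead of a fixed count.

```python
from statistics import mean

def trimmed_mean(data, k=1):
    """Mean after dropping the k smallest and the k largest values."""
    if 2 * k >= len(data):
        raise ValueError("trimming would discard every value")
    ordered = sorted(data)
    return mean(ordered[k:len(ordered) - k])

# Hypothetical incomes in thousands, with one very large value.
incomes = [30, 32, 35, 38, 40, 1000]

print("plain mean:  ", mean(incomes))
print("trimmed mean:", trimmed_mean(incomes))
```

The single income of 1000 pulls the plain mean far above every other value, while the trimmed mean stays near the middle of the data.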
The weights are usually decided beforehand. For example, your grade in this course is a weighted mean.
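A weighted mean might be sketched as below. The weights (30% homework, 30% midterm, 40% final) and the scores are hypothetical; they are not this course's actual grading scheme.

```python
def weighted_mean(values, weights):
    """Sum of value * weight, divided by the total weight."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight

# Hypothetical scores for homework, midterm and final.
scores = [90, 80, 70]
# Hypothetical weights, in percent, decided before any scores are known.
weights = [30, 30, 40]

print(weighted_mean(scores, weights))
```

Dividing by the total weight means the weights don't have to add up to 1 or to 100. The ordinary mean is the special case where every weight is equal.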
The median is the center of the distribution and, unlike the mean, isn't easily affected by high-leverage outliers. When the sample or population is symmetrical, the mean and median are the same. When the distribution isn't symmetrical, the median is often the better description of a typical value.
So why isn't it used more? It's hard to mathematically analyze. With modern computing (simulation and other techniques), the median can be used nearly as easily, so I think the median will continue to gain ground.
You can compute the median in Excel using the median(range) function.
The symbol is x-tilde.
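To see how differently the mean and the median react to an outlier, here is a small Python sketch using the standard library's `statistics` module. The salary figures are made up.

```python
from statistics import mean, median

# Hypothetical salaries in thousands: a symmetrical sample.
salaries = [40, 45, 50, 55, 60]
# The same sample plus one high-leverage outlier.
with_outlier = salaries + [1000]

print("symmetrical: mean", mean(salaries), "median", median(salaries))
print("one outlier: mean", mean(with_outlier), "median", median(with_outlier))
```

In the symmetrical sample the two agree. After adding the outlier, the mean is larger than every salary but one, while the median barely moves.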
Using the 50% point is nice, but there can be times when you're interested in the 90% point or something else. Remember in school when you used to be tested and your parents would say proudly that you were reading at the 90th percentile?
Quartiles are just the 25%, 50% and 75% marks. Imagine using it for grading: A, B, and C. Is this grading "on a curve"?
You can compute a quartile in Excel using the quartile(range,n) function, where "n" is 1, 2 or 3. Note quartile 2 is the same as the median.
Even better, because it's more general, is to use the percentile(range,f) function, where f is a number from 0 to 1 and indicates the fractile you want. If you want the 90th percentile, use 0.9; if you want the 99th percentile, use 0.99; and so forth. If you want the first quartile, use 0.25.
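For readers working outside a spreadsheet, a percentile function can be sketched in Python as below. This `percentile` is a hypothetical helper written for these notes, and it uses one common convention: linear interpolation between the two nearest ordered values. Tools don't all agree on the convention, so small samples can give slightly different answers elsewhere.

```python
def percentile(data, f):
    """Value at fraction f (0 to 1) of the sorted data, with linear interpolation."""
    if not data:
        raise ValueError("data must not be empty")
    if not 0 <= f <= 1:
        raise ValueError("f must be between 0 and 1")
    ordered = sorted(data)
    pos = f * (len(ordered) - 1)           # fractional position in the sorted list
    lower = int(pos)                       # index just at or below that position
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower                     # how far we are toward the upper value
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])

# Made-up data: the whole numbers 1 through 9, in scrambled order.
data = [7, 1, 9, 3, 5, 2, 8, 4, 6]

print("first quartile: ", percentile(data, 0.25))
print("median:         ", percentile(data, 0.5))
print("third quartile: ", percentile(data, 0.75))
print("90th percentile:", percentile(data, 0.9))
```

The quartiles and the median are just the special cases f = 0.25, 0.5 and 0.75.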
Box-and-whisker plots are a nice visualization technique.
The mode is simply the most frequent value. It's rarely used in statistical work, but it's very commonly used to describe distributions:
This one is uni-modal, but that one is bi-modal.
Here's an example of when it's useful though: If you're in the manufacturing business, say shoes, you don't want to know what the mean or median shoe size is, you want to find out what the mode is. Why?
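A quick Python sketch of the shoe example, using invented shoe sizes. `multimode` from the standard `statistics` module returns every value tied for most frequent, so it also covers the simplest bi-modal case, an exact tie.

```python
from statistics import mean, median, multimode

# Hypothetical shoe sizes sold in one day (made-up data).
shoe_sizes = [6, 7, 7, 7, 7, 8, 9, 10, 11, 12]

print("mean:  ", mean(shoe_sizes))
print("median:", median(shoe_sizes))
print("mode:  ", multimode(shoe_sizes))

# A sample with two equally frequent values has two modes.
print("two modes:", multimode([1, 1, 2, 3, 3]))
```

With this data, the mean is not a size that any customer actually bought, while the mode is the size that sold most often.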
Often, we want to know not just where our "cloud" of data is, but how spread out it is.
This is rarely used in statistical work, since it's highly variable in a sample and often useless for a population (where the range is often infinite), but it gives a nice intuition.
Would you be interested in the range on a set of exam scores? Why or why not?
If you want the range of some data in Excel, you can use the max(array) and min(array) functions. The range is the max minus the min.
This one is really commonly used, and we'll spend some time developing it.
This is the analog of the median for measuring spread. It's just the 75th percentile minus the 25th percentile, or Q3-Q1 (the interquartile range). The median is Q2. Just as the median isn't used much, neither is this, for the same reasons.
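Both the range and the Q3-Q1 spread can be sketched in a few lines of Python. The exam scores are made up, and `statistics.quantiles` with `method="inclusive"` is just one of several quartile conventions, so other tools may give slightly different quartiles.

```python
from statistics import quantiles

# Hypothetical exam scores (made-up data).
scores = [55, 62, 70, 71, 75, 78, 80, 84, 90]

# Range: the max minus the min.
data_range = max(scores) - min(scores)

# Quartiles: n=4 cuts the data into four parts and returns Q1, Q2 and Q3.
q1, q2, q3 = quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1

print("range:", data_range)
print("Q1, Q2, Q3:", q1, q2, q3)
print("Q3 - Q1:", iqr)
```

The range depends only on the two most extreme scores, while Q3-Q1 describes the spread of the middle half of the data.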
This work is licensed under a Creative Commons License.