Statistics
Concepts

This is an acquaintance lesson; it provides some basic background. Don't miss the Summary, and don't skip the Problems.

The Probability Concept

Typical math or physics problems have exact answers. For example:

There are 10 eggs in the basket.
He is 68.3 inches tall.
That guy is cheating.

The usual statistics problem has an answer in terms of probability. It deals with things that are true within limits, or conclusions adopted with a given degree of certainty. For instance:

The most likely case is that there are 10 eggs in the basket.
He is 68.3 inches tall, give or take 0.2 inches.
It is 99% certain that that guy is cheating.

The numbers in these statements do not define a precisely known outcome. Instead, they define an area of probability. A situation may have an inherent tendency (a coin falls Heads half the time; on days like this it usually rains), but the tendency and the event are not tightly linked. In between lies a zone governed by chance. Understanding chance is the key to understanding statistics. As Lucien Le Cam observed, it is this chance element, this loose connection between tendency and event, that distinguishes statistics from the rest of mathematics.

Chance

There is a certain amount of leeway between what is most likely to happen and what can actually happen. That prevents our being certain, but it doesn't prevent our making sensible choices. Suppose there are two lotteries. Each costs a dollar to play. In both, if you draw a red ticket from an urn containing 100 tickets, you get a $1,000 prize. If you draw a black ticket, you get nothing. The urn for Lottery A contains 90 red tickets and 10 black tickets. The urn for Lottery B contains 50 of each color. Which lottery will you play?

The tickets in Urn A are 90% red, so the chance of winning in Lottery A is 90%. In Lottery B it is 50%. Anything can happen on your one chance, but some things are likelier than others. If you chose to play Lottery A, you are not going to have trouble with the concept of chance. And if you can handle chance, you can handle statistics.
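The comparison can be checked with a little arithmetic. The sketch below is only an illustration (the function name lottery_odds is made up for this purpose, not part of the lesson); it computes the chance of winning and the average net gain per play for each urn, using the ticket counts, the $1,000 prize, and the $1 cost given above.

```python
from fractions import Fraction

def lottery_odds(red, black, prize=1000, cost=1):
    """Return (chance of winning, average net gain per play) for one urn."""
    p_win = Fraction(red, red + black)      # share of winning (red) tickets
    expected_net = p_win * prize - cost     # long-run average result of one play
    return p_win, expected_net

lottery_a = lottery_odds(red=90, black=10)
lottery_b = lottery_odds(red=50, black=50)

for name, (p_win, net) in (("A", lottery_a), ("B", lottery_b)):
    print(f"Lottery {name}: chance of winning {float(p_win):.2f}, "
          f"average net gain ${float(net):.2f}")
```

Either lottery can still lose on a single play; the numbers only say which choice is the more sensible one.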

Certainty Levels

Suppose we observe an event which, in the nature of things, would tend to arise by chance once in a hundred times. That is, its probability of occurrence p is given by

p = 1/100 = 0.0100

But the event has occurred. Is this grounds for assuming that we are dealing with nonrandom forces? To put it another way: We are 99% sure that the event is not due to chance. Is that enough? Do we need more evidence before acting on this information? The action level is technically called an alpha or α level. What would be a rational α level?

There are several conventional levels. When legislatures want to be sure that a meaningful vote has been cast, they require a two-thirds rather than a simple majority vote. That is an alpha level of α = 0.6667. Some people follow a 9/10 guideline; for them, α = 0.9000. Managers often use a 19 in 20 or 95% level (α = 0.9500) as defining "probably significant," and a 99 in 100 or 99% level (α = 0.9900) as defining "definitely significant." At the 99% threshold, chance alone will set off a false alarm only about 1 time in 100 when nothing is actually going on.

Some people claim that these 95% and 99% levels are merely arbitrary. That is too strong. For one thing, they would probably not be still so widely used if they had proved grossly ineffective in real life. In these pages, we will generally follow the 95% "probably" and the 99% "definitely" levels, but we will also subject them to scrutiny. It turns out that they are not entirely arbitrary. It also turns out that many practical decisions are made at much lower α levels than 95% or 99%, and that the 2/3 and 9/10 guidelines have their logic too.

Some textbook writers love counterintuitive results. They will favor problems where the above α rules say that nothing significant is happening, but where the reader will strongly feel that action is warranted. This merely illustrates the fact that ordinary people use α levels lower than 95% or 99%. They accept less certainty, partly due to their need for decisions now and not later.

Kinds of Errors

For any given α level, there will be errors in practice. The α level is not a line between True and False. It is the degree of certainty which we have previously decided we want to have, before acting on a decision about True vs False. Certainty can never be complete, and there is always some chance of error. One way to make this concept more precise is to consider the effects of the errors we may make.

Suppose that we observe a result that may be due to chance, such as normal variation in the output of a machine, in which case we want to ignore it. But it may also be caused by something, such as the machine having drifted out of alignment, in which case the output will be systematically wrong, and we want to fix it. Which is right? Our basic assumption for any such situation is that nothing but chance variation is going on. This is called the null ("nothing") hypothesis. Data are then collected to see if they give grounds for deciding otherwise ("rejecting the null hypothesis"), and concluding that in all likelihood "something is going on." Our alpha or action level defines the point at which we will interfere in the process. In choosing that level, we run the risk of two types of error:

(1) We can set the action level too low. Random results will rather often reach that low level. When they do, we will reject the null hypothesis and assume that something significant is happening, when the evidence for "something happening" isn't statistically very strong. We will intervene, but we will find that in a good number of those cases, nothing was actually happening. Overreacting in this way, fixing something when it doesn't need fixing, is called a Type 1 Error. We might think of it as the Nervous Error. In these cases, we will wrongly decide that some malign cause is operating when mere chance has produced the result. The red light is going on too often. We are being too jittery.

(2) Or, we can set the action level too high. We will then often fail to reject the null hypothesis, and decide that nothing is happening, when the evidence for "something happening" is actually pretty good. In some of those cases, something bad will actually be happening. In those cases, we should have gone ahead and fixed the machine. Not fixing the situation when it really needs fixing is a Type 2 Error. We may think of it as the Complacency Error. Our warning system is not telling us soon enough that bad stuff is happening, and some bad stuff is thus not getting detected. The green light is staying on too long. We are being too lax.

With a given amount of data, the chance of making one type of error cannot be reduced without increasing the risk of making the other type of error. Why? Because you can set your alpha level only in one place. You must choose. How do we decide where to set it? One seemingly rational way is to set the level so that the chance of Type 1 Errors (overreacting) is the same as the chance of Type 2 Errors (underreacting). That has an attractive quality of seeming fairness to it. But sometimes practical considerations may make one type of error much more tolerable than the other. One such consideration is financial. "Intervening will cost a lot of money, so we need to be extra sure before we do so." This tends to push the action level higher. Or, conversely, "our contract calls for a shipment to be rejected if a sample is found to contain more than a tiny fraction of defectives." In this case, the only prudent course is to aim for a very low proportion of defectives.
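The tradeoff can be made concrete with a small sketch. Everything numerical here is an assumption made for illustration: the machine's readings are taken to be normally distributed, centered at 0 when only chance is at work and at 2 when the machine has drifted, with a spread of 1 in both cases. Because this lesson states the action level as a degree of certainty, the Type 1 rate is 1 minus that level.

```python
from statistics import NormalDist

# Assumed distributions, chosen only for illustration.
in_alignment = NormalDist(mu=0.0, sigma=1.0)   # chance variation only
drifted = NormalDist(mu=2.0, sigma=1.0)        # machine out of alignment

def error_rates(certainty):
    """Type 1 and Type 2 error rates when we act at the given certainty level."""
    cutoff = in_alignment.inv_cdf(certainty)   # we intervene on readings above this
    type_1 = 1.0 - in_alignment.cdf(cutoff)    # intervening when only chance is at work
    type_2 = drifted.cdf(cutoff)               # not intervening when the machine has drifted
    return type_1, type_2

for level in (0.6667, 0.90, 0.95, 0.99):
    t1, t2 = error_rates(level)
    print(f"action level {level:.4f}: Type 1 = {t1:.3f}, Type 2 = {t2:.3f}")
```

Reading down the printed rows, the Type 1 rate falls as the action level rises, and the Type 2 rate rises along with it. Under these assumptions, no single setting makes both small.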

Another way to visualize these Error Types is in terms of a medical test for a rare disease. The test reads positive if you have the disease. But it is not perfect. Sometimes it will read positive when you don't in fact have the disease; this is a "false positive." The opposite error is to miss the disease when in fact you have it; this could be called a "false negative." The false positive is a Type 1 Error: too sensitive and thus picking up on an actually harmless situation. The false negative is a Type 2 Error: not sensitive enough, and so missing some cases of the disease. Graphically

                      Test is Positive          Test is Negative
Have Disease          OK                        False Negative: Type 2
Don't Have Disease    False Positive: Type 1    OK

This is all very clear. But we still have the problem that the test can be calibrated only to one level. Any level you choose will give some mixture of Type 1 and Type 2 Errors (or else it will be set at such an extreme level as to be useless).
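To see what such a mixture can look like, here is a rough sketch that fills in the four cells of the table with expected counts. All the figures are assumptions invented for the example: 100,000 people tested, 1 person in 1,000 having the disease, a test that misses 1% of real cases and raises a false alarm on 5% of healthy people. The function name expected_cells is likewise hypothetical.

```python
from fractions import Fraction

def expected_cells(population, prevalence, miss_rate, false_alarm_rate):
    """Expected count in each cell of the table, for one calibration of the test."""
    diseased = population * prevalence
    healthy = population - diseased
    return {
        "Have Disease, Test Positive (OK)": diseased * (1 - miss_rate),
        "Have Disease, Test Negative (Type 2)": diseased * miss_rate,
        "Don't Have Disease, Test Positive (Type 1)": healthy * false_alarm_rate,
        "Don't Have Disease, Test Negative (OK)": healthy * (1 - false_alarm_rate),
    }

# Assumed figures, for illustration only.
cells = expected_cells(
    population=100_000,
    prevalence=Fraction(1, 1000),
    miss_rate=Fraction(1, 100),
    false_alarm_rate=Fraction(5, 100),
)

for name, count in cells.items():
    print(f"{name}: {float(count):.0f}")
```

With a rare disease, even a small false alarm rate applies to a very large healthy group, so under these assumptions the Type 1 cell holds far more people than the Type 2 cell.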

Ethics

Separately, there are the ethical questions. For example: Wrongly accusing people of fraud is bad. It involves social costs, and sometimes legal costs.

• Will you ignore a car steering defect that if not corrected will kill hundreds of unsuspecting motorists and their little children? If you don't want this on your conscience, you may set the action level lower, and increase the risk of a Type 1 or Nervous Error (spending money needlessly).
• Will you wrongly accuse your best friend, who is getting "all Heads" in a rather long series of coin tosses, of using a dishonest coin? If you shrink from doing this, you may set the action level higher, and as a result, you will increase the risk of a Type 2 or Complacency Error (getting cheated needlessly).
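For the coin-tossing friend, the arithmetic behind "a rather long series" is easy to sketch. Assuming a fair coin, a run of n straight Heads has probability 0.5 to the power n. The hypothetical helper below finds the shortest run whose probability falls to or below 1 minus the chosen action level.

```python
def heads_needed(certainty):
    """Shortest run of straight Heads that a fair coin would produce
    with probability no greater than 1 - certainty."""
    n = 1
    while 0.5 ** n > 1.0 - certainty:
        n += 1
    return n

for level in (0.6667, 0.90, 0.95, 0.99):
    print(f"action level {level}: accuse after {heads_needed(level)} straight Heads")
```

Raising the action level means waiting for a longer run before speaking up, which is exactly the choice between the Nervous Error and the Complacency Error.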

Rumshiskii 70 points out that no answer to these ethical considerations can be given by probability theory alone. A practical choice will always depend in part on human factors. His example is that a 1% chance of a defective part getting into some manufactured gizmo is tolerable, because after all, a gizmo is only a gizmo. But an output of 1% defective parachutes will kill 1% of your parachutists, which is not acceptable, so in that situation every parachute must be inspected, regardless of cost.

Emotions run high on other subjects too, such as the risk of convicting an innocent person, or conversely, the risk of releasing from prison a still dangerous criminal. It makes a difference how much is riding on the decision, not only in the experimenter's or the manager's mind, but in the mind of the larger public. Probabilities as such are not subjective, but responses to them may easily be affected by cultural factors.

Bias

This means that for some situations, biased statistics may be preferable to good statistics. This does not change the definition of good statistics. Good statistics is to give the best interpretation which the evidence permits. Errors will happen, but if we can't accept the idea of being wrong some of the time, we are not really in a position to make use of statistics. In that case, why bother with statistics in the first place?

You are an archaeologist. Some measurement or other shows that an artifact is more likely to come from Tribe A than from Tribe B, but all other evidence favors Tribe B. One response to this dilemma is to choose a very high α level for the artifact test, so that only an overwhelming indication will be accepted as evidence for Tribe A. This is madness. It contaminates the evidence of the artifact with evidence from other things. Balancing conflicting evidence properly belongs to the next higher level. It should not be built into the primary analysis level.

Confidence Intervals

The concept of the confidence interval was introduced by Jerzy Neyman in the 1930's, not long after he and Egon Pearson had worked out much of the modern doctrine of hypothesis testing. Its basic statement is that one may be X% certain that some desired value, say the center point of a population, lies in the interval between Y and Z. This is different from the "action level" discussed above. A confidence interval is not a level at which you do something; it is a zone within which the expected value should occur, with a stated degree of probability.
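As a rough sketch of what "between Y and Z" means in practice, the code below computes an interval for a population mean from a small sample. The heights are made-up data, and the normal approximation used here is an assumption chosen for simplicity; with a sample this small, a statistician would more likely use Student's t distribution, which gives a somewhat wider interval.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def confidence_interval(sample, confidence=0.95):
    """Normal-approximation interval (Y, Z) for the population mean."""
    center = mean(sample)
    standard_error = stdev(sample) / sqrt(len(sample))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # two-sided critical value
    return center - z * standard_error, center + z * standard_error

# Made-up heights in inches, for illustration only.
heights = [67.9, 68.4, 68.1, 68.6, 68.3, 68.0, 68.5, 68.2, 68.7, 68.3]

low, high = confidence_interval(heights)
print(f"95% interval for the mean height: {low:.2f} to {high:.2f} inches")
```

Asking for more confidence widens the zone: calling confidence_interval(heights, 0.99) returns a broader interval around the same center.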

We will look at this theory a little later on. Meanwhile, distinguish "degree of certainty" (as in the alpha or α level discussed above) from "degree of confidence" (in the sense of "confidence interval"). The two are not the same.

Experiment Design

There are whole volumes on this subject, especially as applied to agricultural and medical testing. The theory of sampling is a closely related topic. We will not devote much space to either (there are endless textbooks with precisely that orientation, and our purpose is different), but here are a few points.

(1) To best reflect the population from which it is taken, a sample should be random, meaning that every member of the population has an equal chance of being selected for the sample.

(2) If you want your test of some agricultural or medical treatment to be valid, you must do it on two groups in parallel, one group receiving the treatment under test, and one not. The latter group, which is chosen to resemble the test group as much as possible, is called the "control" group. The idea is to eliminate other possible causes of any effect that may later be observed. It must be borne in mind that experiments with human subjects are infinitely complicated by ego phenomena (the Hawthorne Effect), and in other ways as well.

(3) A controlled experiment should be planned to give enough information that the question of interest can be answered with the desired degree of assurance. If we know what test will be used to analyze the data, and what threshold of significance will be applied (or what alpha or α level will be adopted), we can calculate backward to see how much data will be required.
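The "calculate backward" step in point (3) can be sketched for one simple case. The assumptions are loud ones: a one-sided test on a mean, normally distributed measurements with a known spread sigma, and a wish to catch a shift of a given size with a stated probability (often called the power of the test, a term this lesson has not introduced). The function name sample_size and its defaults are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def sample_size(shift, sigma, certainty=0.95, power=0.80):
    """Observations needed to detect a mean shift of the given size.

    Assumes a one-sided z-test with known sigma; certainty is the action
    level as used in this lesson, power is the chance of catching the shift.
    """
    standard_normal = NormalDist()
    z_action = standard_normal.inv_cdf(certainty)   # guards against Type 1 errors
    z_power = standard_normal.inv_cdf(power)        # guards against Type 2 errors
    return ceil(((z_action + z_power) * sigma / shift) ** 2)

for level in (0.95, 0.99):
    n = sample_size(shift=0.5, sigma=1.0, certainty=level)
    print(f"action level {level}: about {n} observations")
```

Demanding more certainty, or wanting to detect a smaller shift, raises the amount of data required.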

We cannot plan experiments in history; we take what history happens to give us. But it can still be useful to know what a good design would be like, and what the data at hand are (and are not) capable of telling us. One valid conclusion for historians is that there is not enough evidence to reach a conclusion. To say even that is better than to abide in a state of perplexed confusion. "Confucius" has said that to know when you know, and to know when you don't know, is what we really mean by "knowing." Every statistician in the world will agree.

Summary

[Under this rubric, in each Lesson, we collect the chief things to be retained from that Lesson]

Probabilities are written as decimals, thus p = 0.3333.
"Certainty" means "degree of certainty that a result is not due to chance."

Our usual action or alpha level will be α = 0.99 certainty.
It means that there is only a 1% chance that the result is a random event.
The "worry" or beta level is β = 0.95 certainty.
This means that we are getting significantly close to the chosen action level.
Practical people often operate with much lower alpha and beta levels.

The choice of an action level may sometimes be affected by cultural factors.

Thinking you have a significant result when you don't is a Type 1 error.
You are setting your alpha level too low, and reacting to some harmless signals.
Failing to see that you have a significant result is a Type 2 error.
You are setting your alpha level too high, and ignoring some meaningful signals.

Mnemonic: "1" is the lower number, and a Type 1 Error is setting your response level too low. "2" is the higher number, and a Type 2 Error is setting your response level too high.

End Matter

[This section of each Lesson contains practice and other material which is useful for a fuller understanding of the topics introduced. Much of the substance of the Lesson is in this material, for which the Lesson proper serves as a mere orientation. Have your calculator and your pad of squared paper ready before starting].
