V. Drawing Inferences

This section introduces inference for the mean. We start by working with large samples and appealing to the central limit theorem to justify a normal sampling distribution. We concentrate on the reasoning of confidence intervals, estimating the standard deviation from the data where necessary and without apology. We then consider what to do when the sample size is small and the standard deviation is estimated from the data, introducing Student's t.

This is a suitable place to introduce an additional statistics package, if you wish, either just before attacking inference, or (probably easier on students) just afterwards. From this point on, students can and should see the full output of a real statistics program. The Data Desk package is provided with ActivStats, but it is possible for teachers to add lessons on other statistics programs (or, indeed, on any subject) to any page of the lesson book.

17. Estimating with Confidence

We renew our attempt to estimate the mean of a population density. Now we recognize the population mean as a parameter, and we know something of the distribution and variability of the sample mean. Working with the histogram that depicts the sampling distribution of the mean, students construct an empirical 95% interval. We know that the histogram is centered at µ (in the long run) and that the middle 95% should lie within about 2 standard deviations of the sampling distribution (that is, 2σ/√n) on either side of µ. With this knowledge, we consider what we can say about µ, thereby discovering the reasoning of confidence intervals. Standard inference acts as if the sample comes from a randomized data production design.

We ask the question "What would happen if we did this many times?" and look to sampling distributions for answers.

We draw conclusions of the form "If we drew many samples, the intervals calculated by this method would catch the true population parameter in about C% of all samples." We note that it is the deliberately introduced randomness of random sampling or randomized assignment of subjects to treatments that makes inference possible. It is only because we randomize that repeating the sample or experiment yields a sampling distribution.

A video introduces the problem of estimating dissolved oxygen levels in Chesapeake Bay, and a case study follows up on the Bay pollution data.

We now bring back (from the first lesson on the Normal distribution) the Normal tool as a theoretically justified (from the CLT) model for what the histogram would look like with a large enough sample size and many samples. As with the histogram in the sampling distribution tool, the x-axis of the density is in real units rather than in z-score units. We construct confidence intervals for real data, setting the mean of the Normal tool to the sample mean and the standard deviation to the standard error, s/√n. We work only with large samples and do not apologize for estimating σ with s.
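
The construction just described can be sketched in a few lines of Python. This is only an illustration, not the ActivStats or Data Desk procedure: the function name normal_confidence_interval is hypothetical, and the data are simulated stand-ins rather than the Chesapeake Bay measurements. The sketch sets a Normal model's mean to the sample mean and its standard deviation to the standard error, then reads off the middle 95% in the units of the data.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev


def normal_confidence_interval(data, confidence=0.95):
    """Large-sample interval for the mean, in the units of the data."""
    n = len(data)
    xbar = mean(data)
    standard_error = stdev(data) / sqrt(n)  # s / sqrt(n)
    # Normal model centered at the sample mean, with the standard error
    # as its standard deviation (the two settings described in the text).
    model = NormalDist(mu=xbar, sigma=standard_error)
    tail = (1 - confidence) / 2
    return model.inv_cdf(tail), model.inv_cdf(1 - tail)


# Made-up illustration data (NOT the Chesapeake Bay measurements):
# 120 simulated readings from a Normal population.
rng = random.Random(1)
sample = [rng.gauss(6.0, 2.0) for _ in range(120)]

low, high = normal_confidence_interval(sample)
print(f"95% interval for the mean: {low:.2f} to {high:.2f}")
```

Because the model is expressed in data units, no conversion to z-scores and back is needed, which is the point the lesson makes with the Normal tool.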

We simulate many confidence intervals for the same population in Data Desk, noting that they have the same width, but that because their centers are random, some may not cover the true mean.
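
For teachers who want to reproduce this kind of simulation outside Data Desk, here is a minimal Python sketch. It is an assumption-laden stand-in, not the Data Desk simulation itself: the function name coverage_rate and all parameter values are hypothetical, and the population standard deviation is treated as fixed so that every interval has the same width, as noted above.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def coverage_rate(mu, sigma, n, trials, confidence=0.95, seed=0):
    """Fraction of simulated intervals that cover the true mean mu."""
    rng = random.Random(seed)
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * sigma / sqrt(n)  # same half-width for every interval
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = fmean(sample)  # the center of the interval is random
        if xbar - margin <= mu <= xbar + margin:
            hits += 1
    return hits / trials


# Hypothetical population: mean 10, standard deviation 2, samples of 100.
rate = coverage_rate(mu=10.0, sigma=2.0, n=100, trials=2000, seed=1)
print(f"Proportion of intervals covering the true mean: {rate:.3f}")
```

The proportion should land near 0.95 but will vary from run to run with the seed, which is itself a useful talking point about "about C% of all samples."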

We note the value of increasing the sample size to simultaneously narrow the confidence interval and improve the confidence level.

Teacher's Notes: Students can discover the reasoning of confidence intervals for themselves. The Normal density tool facilitates this because students can work directly in the units of the variable. We avoid the awkward reasoning (and computing) path that takes students from data to z-score to z-table value and back to data units to construct the interval. Later in the lesson we use the tool that provides an interactive version of the z-table commonly found in the back of textbooks.

The Normal density tool and, on the later page, the Normal table tool are available directly from the tool bar at the top of the lesson book pages in this lesson. (In fact, any tool used in a lesson is available at the top of the pages of that lesson.) Encourage students to use them to solve additional confidence interval exercises from their text or from other sources.

As long as the sample sizes are large (greater than 50 should do fine, but in fact all of ours are greater than 100), the use of the estimated standard deviation does not invalidate reference to the Normal distribution because the estimated standard deviation is good enough and the Central Limit Theorem has plenty of room to work. We'll be more precisely correct presently.

Note: If you are troubled by the use of examples in which σ is estimated from the data but the calculation refers to the Normal distribution and the lack of differentiation based on whether we "know" σ or not, see the Logic note at the end of the discussion of the Lesson on Confidence Intervals for a Mean.

18. Confidence Intervals for a Mean

What if we must estimate the standard deviation from the data and have a modest sample size? We call the sample-based estimate of the standard deviation of the sampling distribution of the mean its standard error. Estimating a standard error introduces additional variability in our intervals, as we can see with a simulation.
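
One way to see this extra variability is a small simulation along the following lines. This is a sketch under stated assumptions, not the ActivStats simulation: the function name and the population values are hypothetical, and the sample size is fixed at n = 5 so that the standard 95% table value for 4 degrees of freedom (2.776) can be used directly. Each sample supplies its own s, and the same samples are checked against both the Normal and the t critical values.

```python
import random
from math import sqrt
from statistics import fmean, stdev

Z_STAR = 1.960  # Normal critical value for 95% confidence
T_STAR = 2.776  # Student's t critical value for 95% confidence, 4 df


def small_sample_coverage(n=5, trials=4000, mu=50.0, sigma=10.0, seed=0):
    """Return (coverage using z*, coverage using t*) when s estimates sigma.

    T_STAR above is for 4 df, so keep n = 5 unless it is changed to match.
    """
    rng = random.Random(seed)
    z_hits = t_hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = fmean(sample)
        standard_error = stdev(sample) / sqrt(n)  # varies sample to sample
        miss = abs(xbar - mu)
        z_hits += miss <= Z_STAR * standard_error
        t_hits += miss <= T_STAR * standard_error
    return z_hits / trials, t_hits / trials


z_cov, t_cov = small_sample_coverage()
print(f"z-based coverage: {z_cov:.3f}   t-based coverage: {t_cov:.3f}")
```

With samples this small, the intervals built with the Normal critical value should cover the true mean noticeably less often than the nominal 95%, while the t-based intervals should come close to it.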

We introduce the t-distribution family as an alternative density that works within the Density tool and see that we construct confidence intervals in much the same way as we did before. Students can slide the df to see the change in the density. They learn in this way that changes in degrees of freedom change the width of the confidence interval very little above 10 degrees of freedom.

Students discover that the width of a confidence interval for small samples is affected far more by the increase in the standard error due to a small n. They construct confidence intervals with the density tool and learn to construct confidence intervals in Data Desk for real data.
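
The comparison can also be made numerically. The sketch below is an illustration only; ActivStats does this with the density tool, and a statistics package would supply t critical values directly. To stay within the Python standard library, the hypothetical helpers t_pdf, t_cdf, and t_critical compute 95% critical values by numerical integration and bisection. The loop then prints the critical value and the interval half-width (in units of s) for several sample sizes.

```python
from math import exp, lgamma, log, pi, sqrt


def t_pdf(x, df):
    """Density of Student's t with df degrees of freedom."""
    log_c = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_c - (df + 1) / 2 * log(1 + x * x / df))


def t_cdf(x, df, steps=1000):
    """P(T <= x) for x >= 0: 0.5 plus Simpson's rule on [0, x]."""
    h = x / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_critical(df, confidence=0.95):
    """Two-sided critical value t*, found by bisection (assumes t* < 100)."""
    target = 1 - (1 - confidence) / 2
    lo, hi = 0.0, 100.0
    for _ in range(40):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


for n in (5, 10, 30, 100):
    t_star = t_critical(n - 1)
    print(f"n={n:3d}  df={n - 1:3d}  t*={t_star:.3f}  "
          f"half-width={t_star / sqrt(n):.3f} s")
```

Reading down the output, the critical value changes modestly while the 1/√n factor shrinks the half-width much more, which is the point students are meant to discover.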

Student's t-based procedures are natural in statistics packages because packages can compute the standard deviation and sample size. Because the t-distribution automatically becomes indistinguishable from the Normal distribution as the sample size grows, t-based procedures are the common default.

We introduce an interactive t-table that works in much the same way as the interactive Normal table. The interactive table offers several advantages over paper-based tables. First, the illustration at the top of the table automatically adjusts to the correct shape for the degrees of freedom. Second, the table extends to 999 degrees of freedom, which reduces the misplaced concern students often have about when to move from the t-table to the z-table. Finally, the table allows students to insert a new column for any P-value, freeing them from the impression that only the standard values are useful.

Teacher's Notes: The density tool provides striking visualizations of the effect of sample size on confidence procedures. In particular, it is easy to see that changes in degrees of freedom change confidence intervals only slightly (until the df fall below 10). This helps to justify the argument of the previous lesson that for large samples it is OK to estimate the standard deviation from the data and then refer to a normal sampling distribution.

This insight is not available from the t-tables because they are not scaled to a constant variance (the variance of t with k degrees of freedom is k/(k-2) for k > 2, so its standard deviation is √(k/(k-2))), so this visualization of the effect of df is rarely seen in introductory (or advanced!) courses. Later in the lesson, ActivStats introduces an interactive t-table. In this tool, the diagram at the top of the table changes as a cell of the table is selected to reflect the appropriate shape corresponding to the degrees of freedom. Unlike book tables, students can insert a new column for any P-value they like; this is worth doing at least once just to see it.

The bottom line for this section is that students using the computer should expect to use t-based methods, because that is what the common statistics packages provide. Fortunately, t-based methods automatically evolve into normal-based methods as the sample size grows. The only exception is when we have a small sample, know the population standard deviation, and are confident that the population is Normally distributed. In that case, a normal-based method that uses the known standard deviation should be used. But, of course, in practice this combination of conditions is very rare.

When this material is taught with a book t-table, students are often overly concerned with when to use the t-table and when to use the Normal table. The smooth transition and natural visualization of the t-distribution densities eliminate this problem altogether. In addition, students see that the important issue for determining the width of a confidence interval is not the number of degrees of freedom (because the density shape changes little above 10 df), but the denominator in the standard error calculation. This is an important fact that many who use t-based procedures do not visualize correctly.

Logic Note: The order of the argument here is a bit tricky. It goes like this:

  1. The CLT tells us that, for large samples, the sampling distribution of the mean is approximately Normal with a standard deviation of σ/√n.
  2. For large samples, we can estimate σ with s and use the Normal distribution, because the CLT says the sample mean is Normal and the LLN says that s approaches σ.
  3. For small samples, we should use Student's t distribution instead of the Normal to allow for the additional variability in estimating σ with s. This starts out looking like a special case. However...
  4. Actually, we should always use Student's t because it naturally and smoothly approaches the Normal as the sample size grows. The difference in density shape is hard to see even at moderate sample sizes. This is what most statistics software does. Simply put, for small samples, when it matters, use t. For large samples, when it doesn't matter, it doesn't matter, so use t.
  5. So the remaining special case is a small sample with known σ. For this one special case, use the Normal distribution, but then only if we are also confident that the population is Normal because the CLT lacks the sample size to be effective. But, of course, in practice this never happens.

(Whatever happened to the CLT? Well, we are no longer really using it, having set it aside in favor of Student's t. But it probably does not help to point this out to students.)

By following this reasoning, ActivStats downplays the common differentiation of "known σ" vs "unknown σ" as a basis for choosing an inference procedure. We thus do not tell students the common lie that "σ is known," but rather tell them the half-truth that we can use the Normal distribution when we estimate σ from the data (initially showing only examples with large sample sizes). In practice, σ is almost never known. Students often realize that and notice that the assumption of known σ is unreasonable. The second lesson repairs the half-truth, telling students that when we estimate σ from the data we should use the t-distribution and that therefore, because we virtually always estimate σ from the data, we should always use t. Moreover, since the t-distribution family smoothly approaches the Normal, there is no need for concern about switching from t to z.

Background Note: We reserve the term "standard error" for the sample-based estimate of the standard deviation of the sampling distribution of a statistic. In particular, we do not use it to refer to the parameter-based estimate of the standard deviation of the sampling distribution. We made this choice to avoid the confusion of using the same term for two related, but different, quantities. We chose to use the term standard error for the data-based calculation to correspond to the common use in major statistics packages of calling the sample-based estimate the standard error.

19. Testing Hypotheses

Returning to the randomness tool, students find blank bars (that is, they cannot see the colors) and the assertion that the true probability of a red outcome is 50%. They collect data as they did before and find this hypothesis plausible at first but, once they have collected sufficient data, increasingly unlikely, leading them to reject the hypothesis.

They are then asked to examine the reasoning that they followed to reach this conclusion. They find that they have reasoned according to the formal reasoning of hypothesis testing.

We seek the probability of observing an outcome at least as far from the one initially claimed as the outcome we actually observed. If that probability is sufficiently small, it leads us to doubt the initial claim.
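
For the red-outcome activity, this probability can be computed exactly from the binomial distribution. The following Python sketch is one possible way to do it and is not taken from ActivStats: the function name two_sided_p_value and the counts in the example are hypothetical. It adds up the probabilities of every count at least as far from the expected count as the one observed, assuming the claimed probability is true.

```python
from math import comb


def two_sided_p_value(successes, n, p0=0.5):
    """Chance, if the true probability is p0, of a count at least as far
    from the expected count n * p0 as the observed count."""
    expected = n * p0
    observed_gap = abs(successes - expected)
    return sum(
        comb(n, k) * p0**k * (1 - p0) ** (n - k)
        for k in range(n + 1)
        if abs(k - expected) >= observed_gap
    )


# Hypothetical record from one student: 60 red outcomes in 100 draws.
p_value = two_sided_p_value(60, 100)
print(f"P-value for 60 red in 100 draws: {p_value:.4f}")
```

Measuring "as far" by distance from the expected count is natural when the claimed probability is 50%, because the distribution is then symmetric; for other claimed values this is only one of several conventions.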

We see a video story about an experiment run by a young student and published in a major journal to test whether Therapeutic Touch practitioners can detect the "human energy field" that they claim to be manipulating. The story leads to a hypothesis test, which the students perform for themselves (failing to reject the null hypothesis, which corresponds to failing to detect the field).

Returning to the density tool, we see that it serves as a hypothesis testing tool in much the same way as it is a confidence interval tool. The principal difference is that the null hypothesis provides a parameter value, which can now be used to center the density.

Students see hypothesis testing in Data Desk and also simulate tests based on many samples from the same population. They also learn to use the Normal table to perform hypothesis tests.

Finally, we consider the context of making a decision in which an alpha-level is specified before the test is performed and discuss Type I and Type II errors, performing a simulation to show that the probability of a Type I error is α.
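
A simulation of this kind can be sketched in Python as follows. It is a stand-in for the ActivStats simulation, not a copy of it: the function name and population values are hypothetical. Every sample is drawn from a population in which the null hypothesis is true, so each rejection is a Type I error. The test uses the large-sample Normal reference with s estimated from the data, so the rejection rate should be close to, though not exactly, α.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean, stdev


def type_one_error_rate(mu0, sigma, n, trials, alpha=0.05, seed=0):
    """Fraction of samples, drawn with the null true, that reject it."""
    rng = random.Random(seed)
    z_star = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided cutoff
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        xbar = fmean(sample)
        z = (xbar - mu0) / (stdev(sample, xbar) / sqrt(n))
        if abs(z) > z_star:
            rejections += 1  # null is true here, so this is a Type I error
    return rejections / trials


# Hypothetical null value 100, population SD 15, samples of 100.
rate = type_one_error_rate(mu0=100.0, sigma=15.0, n=100, trials=2000, seed=2)
print(f"Observed Type I error rate: {rate:.3f} (alpha = 0.05)")
```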

Teacher's Notes: Students can and do derive the reasoning of significance testing for themselves. The work to set up this exercise can be traced back to the lesson on Randomness. The reasoning of testing is much more natural to students who have discovered it than to those who have been told, so here again we urge that you let students complete their discovery before telling them how to test hypotheses.

By working directly in the units of the variable and avoiding the z-tables, students can concentrate on the reasoning of testing.

The true parameter value (proportion red) in the activity in which students discover testing is set randomly, so it will differ for each student and for successive attempts by the same student. It will, however, never be within 2 percentage points of 50%, so the null hypothesis can always be rejected with a large enough sample.

The emphasis of this chapter is on significance testing using P-values rather than on traditional reject/fail to reject hypothesis tests. This reflects both modern teaching wisdom and the way most statistics software and statisticians actually work. Students are asked to record the number of observations they drew before stating a conclusion and the proportion of red outcomes that they observed. This will allow them to compute a personal P-value corresponding to when their discomfort with the result became great enough to force rejection of the null hypothesis.

20. Tests for a Mean

A video shows a taste testing experiment using 10 trained tasters, and students immediately examine the data from that experiment. Hypothesis testing for small samples follows naturally from what we already know about hypothesis testing in larger samples and about estimation with small samples.

When we have a small sample and must estimate the standard deviation from the data, we often test a null hypothesis about a mean with Student's t distribution.

The second page shows how to use t-tables to perform standard tests for the mean, discusses the formal assumptions behind the t-test, and considers the choice between t and z tests.

Page three returns to the Therapeutic Touch story to discuss the power of a test. Because the experiment in that story failed to reject the null hypothesis (and because that failure suggests that Therapeutic Touch may be invalid), the central concern is whether the test was sufficiently powerful to detect the phenomenon that was sought. The Normal density tool provides a visualization of power that can help students understand this sometimes confusing topic.
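
The picture the Normal density tool gives of power can be put into numbers with a short Python sketch. This is an illustration under assumptions, not the ActivStats tool: the function name power_two_sided is hypothetical, the test is a two-sided Normal-based test for a mean, and σ is treated as a planning value chosen in advance. The cutoffs are set under the null hypothesis, and power is the area beyond them under the density centered at the alternative.

```python
from math import sqrt
from statistics import NormalDist


def power_two_sided(mu0, mu_alt, sigma, n, alpha=0.05):
    """Chance that a two-sided Normal-based test rejects when the mean is mu_alt."""
    standard_error = sigma / sqrt(n)
    z_star = NormalDist().inv_cdf(1 - alpha / 2)
    # Rejection cutoffs, in the units of the variable, set under the null.
    lower_cut = mu0 - z_star * standard_error
    upper_cut = mu0 + z_star * standard_error
    # Sampling distribution of the mean if the alternative is true.
    alternative = NormalDist(mu=mu_alt, sigma=standard_error)
    return alternative.cdf(lower_cut) + (1 - alternative.cdf(upper_cut))


# Hypothetical planning values: null mean 0, alternative 0.5, sigma 1.
for n in (10, 25, 50, 100):
    print(n, round(power_two_sided(mu0=0.0, mu_alt=0.5, sigma=1.0, n=n), 3))
```

When the alternative equals the null value, the function returns α, which ties power back to the Type I error rate; as n grows, the two densities separate and power rises.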

Page four recalls the data on couples, this time examining ages, and shows that we can test for a difference between the means of two paired groups using the same t-test. Paired data, such as in this example, have the advantage of a natural null hypothesis.
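
The reduction of the paired problem to a one-sample t-test on the differences can be sketched as follows. The function name paired_t_statistic is hypothetical and the ages are made up for illustration; they are not the couples data used in the lesson. The sketch forms the pairwise differences, then computes the usual t statistic for their mean against a null difference of zero.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(first, second, null_difference=0.0):
    """One-sample t statistic computed on the pairwise differences."""
    if len(first) != len(second):
        raise ValueError("paired data need equal-length lists")
    differences = [a - b for a, b in zip(first, second)]
    n = len(differences)
    standard_error = stdev(differences) / sqrt(n)
    t = (mean(differences) - null_difference) / standard_error
    return t, n - 1  # t statistic and its degrees of freedom


# Made-up ages for three couples, for illustration only.
ages_a = [30, 40, 50]
ages_b = [28, 39, 47]
t, df = paired_t_statistic(ages_a, ages_b)
print(f"t = {t:.3f} on {df} df")  # prints: t = 3.464 on 2 df
```

Note that the degrees of freedom come from the number of pairs, not the number of individuals, because the analysis works with one difference per pair.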

Teacher's Notes: This is a good point to be sure that students are focused on practical applications. Students sometimes focus on the process of testing hypotheses, forgetting the underlying meaning. The issue they should focus on is still understanding the world from data.

Our point of view on choosing between t and z is that people who use computers or modern calculators should choose t-based procedures regardless of the sample size. Computers and calculators have no trouble finding t values for large degrees of freedom, and the smooth transition from t to Normal as the df grows takes care of itself. The only time when a z-based method is preferred is when you know that you have a Normal population, you have a small sample size, and you know the population standard deviation. We know of no such situation in the real world.

The discussion of power is optional, although the topic is part of the Advanced Placement Statistics syllabus.

The discussion of paired comparisons leads naturally into the discussion of comparing groups that usually follows this section.

This lesson holds much new material. Although it has a central theme, the subjects of power and paired comparisons are not central to the issue of t-tests for means. As a result, many teachers split this lesson into two class periods.