I. Understanding Data
Statistics is about understanding the world through data. We must first understand data and the kinds of patterns and relationships we are likely to find in data.
2. Data and Measurement
We begin with a discussion of the nature of data, and discuss the importance of understanding data in its context.
Without knowing the context of data (who was measured? what was measured? when was it measured? why was it measured?), we have no hope of understanding data.
The "Who, What, When, Why" questions are considered throughout the course for every dataset to continually remind students of the importance of knowing the context of the data. We differentiate quantitative from categorical data and define the concept of a variable. Students collect some data on themselves and take a first look at it in Data Desk.
Key Points:
| Data | systematically recorded information, whether numbers or labels, together with a context. |
| Variables | hold the same information about many individuals. |
| Cases | each hold all the information for a single individual. |
| Context | typically tells: who was measured, what was measured, and why the study was performed. |
| Categorical data | data that name categories (whether with text or numerals). |
| Quantitative data | numeric data in which the numbers are measurements. Quantitative data always have... |
| Units | a quantity or amount adopted as a standard of measurement, such as dollars, hours, or grams. |
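The vocabulary above can be sketched in a few lines of Python. This is only an illustration, not part of ActivStats or Data Desk: the survey records, the `variables` table, and the helper `column` are all hypothetical. Each dictionary is one case, each key is one variable, and the zip code shows numerals used as category labels rather than measurements.

```python
# Hypothetical class-survey data; every name and value below is invented.
cases = [
    {"major": "Biology", "zip_code": "14850", "height_cm": 170.0},
    {"major": "History", "zip_code": "14853", "height_cm": 158.5},
    {"major": "Biology", "zip_code": "14850", "height_cm": 181.0},
]

# Context kept alongside the values: the kind of each variable and its units.
variables = {
    "major": {"kind": "categorical", "units": None},
    "zip_code": {"kind": "categorical", "units": None},  # numerals used as labels
    "height_cm": {"kind": "quantitative", "units": "cm"},
}


def column(cases, name):
    """Return one variable: the same piece of information for every case."""
    return [case[name] for case in cases]


heights = column(cases, "height_cm")
majors = column(cases, "major")
print(heights)
print(majors)
```

Recording the kind and units explicitly, rather than inferring them from the stored type, mirrors the point that numerals alone do not make a variable quantitative.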
Classroom Exercise: This is an excellent place to take a "survey" of the class. Be sure to include some categorical variables (sex, class, major) and some quantitative variables (height, weight, age). It is easy to generate some variables that are not clearly one or the other (shoe size, number of siblings, self-rating on a 5-point scale from liberal to conservative). Other fun questions: "guess the teacher's age", "Here is a ribbon; estimate its length." "Think of a number at random between 1 and 10 and write it down quickly." "How many coins do you have with you right now?"
Class discussion here can fill in the W's and identify quantitative and categorical variables. Students see quickly that the real world isn't all that simple. (Is the opinion variable quantitative? What are the units of shoe size? Is shoe size in the same units for men and women?) The class data can be put into a Data Desk file and returned to the class for homework exercises or used in analyses to illustrate subsequent classes.
Teachers' Notes: It is important that students "get their feet wet" on the computer with a variety of activities. The lesson contains an experiment that students perform on themselves, testing their reaction time and "mousing" skills. Students save their data and return to it for homework assignments throughout the course. (A spare set of data is on the disk for students who lose theirs.) The experiment also serves as an exercise in using the mouse efficiently, a skill students will need throughout the course.
The initial activity in Data Desk introduces the statistics environment and shows students how easy it is to make a few displays. Two easy self-examination quizzes introduce the quiz methods and help reassure students. Activities that have students read selected portions of the Data Desk documentation are included in the Lesson. These are less exciting than other activities because students are asked to read several pages from the screen. Nevertheless, they are important, and students should be specifically encouraged to read the material.
Background Notes: Variables are categorical or quantitative at this stage because we are dealing with data. Random variables, when seen later in the course, are discrete or continuous. The concepts are similar, but should not be confused. One important difference: categorical variables need not be numeric, but all random variables, including discrete ones, are numeric.
Also note the focus here on data and on understanding the data. There is deliberately no mention of a population or sample. Our goal at first is to understand the data at hand. We'll draw inferences from the data only later in the course, after we understand something about the data.
3. The Distribution of One Variable
We start with tools and tactics for exploring data. For a single measured variable, the central idea is a distribution of values.
The distribution of a variable gives the values the variable can take and tells how frequently it takes each of them.
ActivStats illustrates this idea with a categorical variable, showing how bar charts and pie charts display the relative frequencies and computing proportions as numerical measures of relative frequency. An exposition defines the Area Principle as a guide for data display.
A video shows the Army's concern with the distribution of soldiers' sizes. We introduce the stem-and-leaf display (of heights) as a simple display of one quantitative variable and the histogram as a more general display. We introduce the concepts of shape, center, and spread of the distribution.
We introduce the dotplot as a graph of values against a single vertical axis, and teach the skills of dragging, adding, and identifying points on the plot. We show the stem-and-leaf and the histogram for the same data, and discuss shape, center, and spread for this display. We learn to display the distribution of variables in Data Desk, and relate dotplots and histograms with brushing and slicing. We then show the Normal distribution shape as an example.
Key Points:
| Distribution | The distribution of a variable gives * the possible values of the variable and * the relative frequency of each value. |
| Bar Charts | Bar charts show a bar for each category of a categorical variable. |
| Area Principle | In a statistical display, each data value should be represented by the same amount of area. |
| Frequency Table | A frequency table lists the categories in a categorical variable and gives the counts or percentage of observations of each category. |
| Stem-and-Leaf display | A stem-and-leaf display is best defined by example. |
| Histogram | A histogram shows the distribution of values in a quantitative variable with adjacent bars. Each bar represents the relative frequency of values falling in an interval of values. |
| Dotplot | Dotplots represent individual values of a quantitative variable along a single axis. |
| Distribution shape | To describe the distribution of data, look for * symmetry vs skewness * single vs multiple modes * possible outliers or gaps. |
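For a categorical variable, the distribution described above amounts to a frequency table of relative frequencies. A minimal sketch in Python follows; the function name `relative_frequencies` and the sample values are hypothetical, and this is not how ActivStats or Data Desk computes its displays.

```python
from collections import Counter


def relative_frequencies(values):
    """Map each category to its proportion of all observations."""
    counts = Counter(values)
    n = len(values)
    return {category: count / n for category, count in counts.items()}


# Invented survey responses, for illustration only.
majors = ["Biology", "History", "Biology", "Math"]
dist = relative_frequencies(majors)
for category, proportion in dist.items():
    print(category, proportion)
```

The proportions sum to 1, which is what lets a bar chart or pie chart represent each data value by the same amount of area.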
Classroom Exercise: Working with the class survey data from the previous class, discuss the following questions:
Teachers' Notes: Our approach to distributions is deliberately laced with vague concepts. Students often approach statistics expecting formulas and algorithms. We want them to think about data and what it says about the world. For example, it is more important at this lesson to think about the concept of the "center" of a distribution than to learn 3 ways to measure the center (that comes in the next lesson). Encourage students to discover that they can already reason about data using common sense, so that when we introduce formulas it won't turn off their brains.
The ability of ActivStats and of Data Desk to rescale histograms in place is an early encouragement for students to be skeptical of statistical displays made by others. ActivStats continues to warn against bad data displays throughout the course.
The Area Principle is worth noting because it is intuitively pleasing here. Later, it will generalize into the representation of relative frequency by area under a density curve and from there to the principal visualization of continuous random variables and of probabilities for inference.
4. Summary Statistics
A video about the "Old Faithful" geyser raises the problem of predicting the wait until the next eruption, and students examine real data from the geyser.
The center of a distribution, summarized with the mean, median, or midrange, is the primary summary statistic. The spread, summarized with the standard deviation, IQR, or range, tells us how well the center summarizes the data.
The interactive dotplot provides a tool in which students discover the sensitivity or resistance of these measures to changes in individual data values. In the histogram, students discover that the middle 38% of a unimodal, symmetric histogram is about 1 standard deviation wide (and, of course, that the middle 50% is 1 IQR wide). Students learn to compute summary statistics in Data Desk.
By creating datasets in the dotplot tool that have specified center and spread, students discover more about how these concepts and their estimates behave.
Finally, we address how measures of center and spread behave under addition of a constant or multiplication by a constant, and introduce standardization.
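The behavior under shifting and rescaling can be checked numerically. The sketch below uses Python's standard `statistics` module; the helper names `shift` and `rescale` and the waiting times are invented for illustration and are not the actual Old Faithful data. Adding a constant moves the mean by that constant and leaves the standard deviation alone, while multiplying by a positive constant multiplies both.

```python
import statistics


def shift(values, c):
    """Add the constant c to every value."""
    return [v + c for v in values]


def rescale(values, k):
    """Multiply every value by the constant k."""
    return [v * k for v in values]


# Invented waiting times in minutes, for illustration only.
waits_min = [50.0, 60.0, 70.0, 80.0, 90.0]

shifted = shift(waits_min, 5.0)        # every wait is 5 minutes longer
in_seconds = rescale(waits_min, 60.0)  # change of units: minutes to seconds

for label, data in [("original", waits_min), ("shifted", shifted), ("seconds", in_seconds)]:
    print(label, statistics.mean(data), statistics.stdev(data))
```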
Key Concepts:
| Center | We summarize the center of a distribution with the mean, median, or midrange. |
| Mean | The mean is found by summing all the data values and dividing by the count. |
| Median | The median is the middle value, with half the data above and half below it. |
| Midrange | The midrange averages the maximum and minimum values. |
| Spread | We summarize the spread of a distribution with the standard deviation, interquartile range, and range. |
| Variance | The variance is the sum of squared deviations from the mean, divided by the count minus one. |
| Standard deviation | The standard deviation is the square root of the variance. |
| Interquartile range (IQR) | The IQR is the nonnegative difference between the first and third quartiles. |
| Range | The range is the nonnegative difference between the maximum and minimum values. |
| Standardizing | We standardize data values by subtracting their mean and dividing by their standard deviation. |
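The definitions above translate directly into code. The following is a sketch using Python's standard `statistics` module, not Data Desk's own computation; `summarize` and `standardize` are hypothetical helper names and the data values are invented. Quartile conventions differ between packages, so the IQR here assumes the default method of `statistics.quantiles` and may not match other software exactly.

```python
import statistics


def summarize(values):
    """Compute the measures of center and spread listed in the key concepts."""
    ordered = sorted(values)
    # Quartiles by the default method of statistics.quantiles (an assumption;
    # other programs may use a different quartile rule).
    q1, _, q3 = statistics.quantiles(ordered, n=4)
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "midrange": (ordered[0] + ordered[-1]) / 2,
        "variance": statistics.variance(ordered),  # divides by count minus one
        "sd": statistics.stdev(ordered),
        "iqr": q3 - q1,
        "range": ordered[-1] - ordered[0],
    }


def standardize(values):
    """Subtract the mean and divide by the standard deviation."""
    m = statistics.mean(values)
    s = statistics.stdev(values)
    return [(v - m) / s for v in values]


data = [2.0, 4.0, 4.0, 6.0, 9.0]  # invented values
summary = summarize(data)
z = standardize(data)
print(summary)
print(z)
```

After standardizing, the values have mean 0 and standard deviation 1, up to floating-point rounding.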
Teachers' Notes: Students' interactive discovery of the properties of these estimates is the first introduction of discovery learning about statistics concepts. The dotplot tool can also be used for classroom demonstration of these ideas (without diluting its effectiveness for hands-on learning; there is value to students in physically dragging a datapoint and watching the mean or standard deviation change in response). The danger of outliers and the effect they can have on statistical analyses is a theme repeated often throughout the course.
The rule that the middle 38% of the data is about one standard deviation wide will return several times. It is found in discussions of the Normal distribution and in the discovery of confidence intervals.
The Old Faithful data return later when we introduce the duration of the previous eruption first as a category and then as a quantitative predictor.
Notation: ActivStats denotes an arbitrary data value by y, and thus denotes the sample mean by y-bar. This choice maintains consistency with the dotplot visualization, which shows values on the vertical (y-) axis. The vertical axis is more natural for discussing values of data and statistics with beginning students because "up" corresponds naturally to "higher value". The drawing of the dotplot axis deliberately resembles the y-axis of the scatterplot, which will be introduced shortly (along with similar abilities to drag and place points on the display). This choice is also consistent with the use of y to denote the variable that we predict in a regression equation, a usage that will be discussed in only a few more lessons.
Nevertheless, most statistics texts denote the arbitrary data value with an x, and denote the sample mean x-bar. You may wish to draw students' attention to the difference, and note that a "bar" over any symbol or variable name in statistics is commonly taken to indicate the mean.
Background Notes: ActivStats does not discuss the mode as a center. The mode is a useful description of a continuous density function and is discussed in that context in the Normal Distribution lesson, but it is rarely useful for describing data. Most statistics programs don't compute the mode because the location of the mode depends so heavily on the scaling of a histogram. An asterisk discusses these points briefly. Teachers who wish to teach the mode can do so either in this lesson or (possibly more naturally) as part of the Normal Distribution lesson.
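The dependence of the mode on histogram scaling is easy to demonstrate. In the sketch below, `modal_bin` is a hypothetical helper and the values are invented so that the effect is visible: the same data put the tallest bar in a different place when only the bin width changes.

```python
import math
from collections import Counter


def modal_bin(values, width, origin=0.0):
    """Return (low, high) for the histogram bin holding the most values.

    Bins are [origin + i*width, origin + (i+1)*width); if several bins tie,
    the one encountered first in the data is returned.
    """
    counts = Counter(math.floor((v - origin) / width) for v in values)
    index, _ = max(counts.items(), key=lambda item: item[1])
    low = origin + index * width
    return (low, low + width)


values = [1, 2, 3, 4, 6, 6.5, 7, 11]  # invented values

print(modal_bin(values, width=4))  # tallest bar covers 4 to 8
print(modal_bin(values, width=5))  # tallest bar covers 0 to 5
```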
5. Normal Distributions
Density curves relate a range of data values to the relative frequency of the values in that range.
We meet the Normal distribution as an idealization of a "nice" histogram shape. A tool relates area under the Normal curve to a horizontal axis, and we learn why the 38% rule of thumb works. Students learn to make and use normal probability plots in Data Desk. Students discover the 68-95-99.7 rule (sometimes called the empirical rule) for themselves. Finally, we discuss z-scores (building on the standardization discussion of the previous lesson). Students measure their own pulse rate and find their relative location in the population based on their z-score.
ActivStats offers an interactive Normal table that looks like the table in the back of every introductory statistics text except that the diagram at the top of the table moves to correspond to the selected cell of the table, helping students to visualize how the table works.
Key Points:
| Density curve | A density curve is a curve that * is always on or above the horizontal axis * has an area of exactly 1 underneath it. |
| Normal density | The normal density is characterized by several important features: * It is specified by its mean and standard deviation. * It is unimodal and symmetric * It is characterized approximately by the 68-95-99.7 rule. * The central 38% of the normal density is one standard deviation wide. |
| z-score | z-scores measure differences from the mean in terms of the standard deviation of the variable in question. |
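The areas quoted above can be checked with Python's standard `statistics.NormalDist`. This is a sketch, not the ActivStats Normal tool: `central_area` and `z_score` are hypothetical helpers, and the pulse-rate mean and standard deviation are invented purely for illustration.

```python
from statistics import NormalDist


def central_area(k, dist=NormalDist()):
    """Area under the Normal curve within k standard deviations of the mean."""
    low = dist.mean - k * dist.stdev
    high = dist.mean + k * dist.stdev
    return dist.cdf(high) - dist.cdf(low)


def z_score(value, mean, sd):
    """Distance from the mean, measured in standard deviations."""
    return (value - mean) / sd


# Half a standard deviation either side of the mean spans one standard
# deviation in total: the 38% rule of thumb.
print(round(central_area(0.5), 3))

# The 68-95-99.7 rule.
for k in (1, 2, 3):
    print(k, round(central_area(k), 4))

# Invented pulse-rate figures (beats per minute), for illustration only.
z = z_score(84.0, mean=72.0, sd=12.0)
print(z, round(NormalDist().cdf(z), 3))  # z-score and fraction of values below it
```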
Teachers' Notes: The normal density tool introduced here will be the workhorse of the chapters on inference. Although this lesson discusses densities, we never use the term probability. At this stage of the course, we only refer to relative frequencies. This is done to set up the students' discovery of probability for themselves a bit later.
ActivStats introduces the area under a density curve by generalizing from a histogram. Thus the area principle introduced as an intuitive rule for data displays motivates the sophisticated concept of a density curve. Students will encounter an animation that reinforces this concept further when we draw random quantities from Normally distributed populations in animated simulations.
The Normal table tool introduced here can also be opened easily from the Appendix (lesson 25). It is the prototype for animated versions of other common density tables that will be introduced as needed. The diagram at the top of the table is deliberately similar to the look and feel of the Normal Density tool. In particular, students can grab the dividing line with their mouse and drag it to a desired tail probability; the table highlights the corresponding cells automatically.