II. Understanding Relationships
Relationships among variables are fundamental to describing the world. Most statements of science and social science, and most decisions in business and medicine, are at their core statements about how two or more variables are, or should be, related.
6 Comparing Groups
A video introduces the question of whether a severely restricted diet can prolong life. Students examine data from experiments on rat diet and lifetimes and compare the lifetimes of two treatment groups with boxplots. Students relate boxplots to dotplots, and learn to do all of this in Data Desk. Five number summaries provide a numerical backup for the graphical displays.
To compare groups, consider the difference of their centers relative to the size of their spreads.
A case study considers the relative safety of passengers and drivers in car crashes, analyzing the data in Data Desk.
Key Points
| Boxplots | When comparing groups with boxplots: * Compare the medians; which group has the higher center? * Compare the IQRs; which group is more spread out? * Compare the difference between the medians to the IQRs. On the scale of the IQRs, are the medians very different? * Check for possible outliers. Identify them if you can. |
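The comparison described in these key points can be sketched in a few lines of Python. This is a minimal illustration, not the Data Desk procedure: the function names are hypothetical, the lifetimes are invented for the example (they are not the rat diet data), and the quartiles use the median-of-halves rule, which may differ slightly from the convention Data Desk uses.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]            # values below the median position
    upper = data[half + n % 2:]    # values above the median position
    return (data[0], statistics.median(lower), statistics.median(data),
            statistics.median(upper), data[-1])

def compare_groups(a, b):
    """Difference of medians (b minus a), also on the scale of the larger IQR."""
    fa, fb = five_number_summary(a), five_number_summary(b)
    iqr_a, iqr_b = fa[3] - fa[1], fb[3] - fb[1]
    diff = fb[2] - fa[2]
    return diff, iqr_a, iqr_b, diff / max(iqr_a, iqr_b)

# Hypothetical lifetimes in days, invented for illustration only.
normal_diet = [500, 560, 610, 640, 650, 690, 720, 760]
restricted_diet = [700, 820, 900, 950, 1000, 1040, 1100, 1190]

diff, iqr_a, iqr_b, ratio = compare_groups(normal_diet, restricted_diet)
print("difference of medians:", diff)
print("IQRs:", iqr_a, iqr_b)
print("difference in units of the larger IQR:", round(ratio, 2))
```

A ratio well above 1 says the medians differ by more than a full IQR, which is the kind of statement students should be able to turn into a sentence about the world, with units.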
Teachers' Notes: This is the first introduction of relationships among variables. One important goal is getting students to write intelligent sentences about the world based on what they see in the data. To do this they must remember the "W's" of the variables (who, what, why, when, where) and know the variables' units. Even so, relationships are easier to write about than distributions.
The Web links point to a discussion of restricted diet and life extension. This can be fodder for a class discussion.
This is a good place for assignments that ask students to find uses of statistics in magazines, newspapers, and web sites and write short essays explaining them.
The concepts introduced here will return when we test for differences between the means of two groups.
7 Scatterplots
A video about the Boston Beanstalks suggests that people seek others of the same height to marry. A dataset giving heights of husbands and wives provides data for a scatterplot. The scatterplot tool generalizes the skills learned in the dotplot tool; we can place, drag, and identify points.
To summarize a scatterplot, describe the direction, form, and strength of the relationship between the variables. The simplest form is a straight line.
Students learn to make scatterplots in Data Desk and to identify points in the plots.
The final page of the lesson discusses data transformation, introducing the family of powers. Most of the work is done in Data Desk where students can slide the power and view changes immediately.
Key Points
| Scatterplots | show the relationship between two quantitative variables measured on the same cases. |
| In a Scatterplot, look for: | * direction * form * strength |
| In a Scatterplot, look for: | * pattern (form) * deviation from that pattern |
| Reexpress data | * to improve symmetry * to make several groups have more nearly equal spreads * to make the form of a scatterplot more nearly linear * to make a scatterplot more nearly have consistent spread throughout |
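As a rough sketch of the first goal, improving symmetry, the Python below steps a right-skewed set of values down the family of powers and reports a skewness measure at each step. The function names and the data are hypothetical, and the moment coefficient of skewness is only one possible symmetry check; the lesson itself does this work graphically by sliding the power in Data Desk.

```python
import math

def reexpress(values, power):
    """Re-express positive data with a power; power 0 stands for the logarithm."""
    if power == 0:
        return [math.log10(v) for v in values]
    return [v ** power for v in values]

def skewness(values):
    """Moment coefficient of skewness; near 0 for symmetric data."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed values, invented for illustration only.
skewed = [1, 2, 4, 8, 16, 32, 64]

for power in (1, 0.5, 0):
    print("power", power, "skewness", round(skewness(reexpress(skewed, power)), 3))
```

For these particular values the logarithm makes the spacing even, so the skewness drops to essentially zero; real data rarely behave so neatly, and the best power is usually chosen by eye.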
Teachers' Notes: Scatterplots are the base for correlation and regression and also appear in later simulations. The basic skills of reading and understanding scatterplots are among the skills students should take from the course. Moreover, they will be used in other lessons, so students should become comfortable with scatterplots now.
Transformations come back later, primarily in exercises, where data are sometimes reexpressed to simplify analysis. They are not needed in subsequent lessons and can be omitted, but they are powerful and effective methods of data analysis that extend the reach of the course to more kinds of data, so we urge their inclusion.
Background Notes: Many textbooks treat scatterplots lightly or not at all. If you are supplementing such a text with ActivStats, be sure to emphasize to students the importance of interpreting relationships in scatterplots and the value of scatterplots as tools for understanding relationships and identifying extraordinary cases.
8 Correlation
Correlation summarizes the strength of a linear relationship.
We construct scatterplots with specified correlations, placing points on a scatterplot and dragging them, to learn about the sensitivity of correlation to outliers and nonlinearities. An animation in Data Desk shows typical plots for various correlations, allowing the student to slide the correlation and watch the change in the scatterplot.
Key Points
| Correlation | is a numerical measure of the direction and strength of a linear association. |
| Correlation is not sensitive to: | * center * scale * units |
| Correlation gives little information about: | * form of the relationship * outliers |
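These properties can be checked numerically. The sketch below computes the correlation coefficient from its definition, then shows that changing center, scale, or units leaves it unchanged while a single unusual point can change it drastically. The function name and the heights are hypothetical, invented for illustration; they are not the husbands-and-wives dataset.

```python
import math

def correlation(x, y):
    """Pearson correlation computed from deviations about the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical heights in inches, invented for illustration only.
husband = [66, 68, 69, 70, 72, 74]
wife = [62, 64, 63, 66, 67, 68]

r = correlation(husband, wife)

# Change units (inches to centimeters) and shift the center: r is unchanged.
husband_cm = [2.54 * h for h in husband]
wife_shifted = [w - 60 for w in wife]
r_rescaled = correlation(husband_cm, wife_shifted)

# Add one unusual couple: r changes sharply, and r alone does not say why.
r_outlier = correlation(husband + [80], wife + [55])

print(round(r, 3), round(r_rescaled, 3), round(r_outlier, 3))
```

The last value is the reason to look at the scatterplot before trusting a correlation: the number alone gives no hint that one point is responsible.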
Teachers' Notes: The tactile experience of dragging a point around a scatterplot while watching the correlation change is a remarkably effective way to learn about the sensitivity of correlation to changes in individual data values. Constructing scatterplots with a specified correlation is the active learning version of viewing a page holding a variety of scatterplots with different correlations, found in most statistics texts. The related experience of dragging the correlation value while watching the plot adjust helps in visualizing what "r = 80%" looks like.
9 Least Squares Lines
A video about increasing manatee deaths introduces the problem of relating two quantitative variables. Students make a scatterplot of the data on manatees killed by motorboats and the number of motorboats registered, and fit a line to the data. ActivStats offers an optional review of the algebra of lines (in "statistics notation" where the slope is "b", not "m").
Students learn to describe and interpret linear relationships. A tool shows the relationship between lines on the scatterplot and their equations. We define residuals, and learn to fit lines by satisfying the least squares criterion. A tool lets students drag the line to try to minimize the sum of squared residuals.
Least squares fits lines by minimizing the sum of squared residuals.
We construct scatterplots with specified slope, learning about the sensitivity of the LS line to individual values.
Finally, we discuss the relationship between correlation and regression: the slope of the least squares line is b = r * sy/sx, where sy and sx are the standard deviations of y and x. This expression holds much useful insight. Correlation describes the strength of a linear relationship, but not the relationship itself. If there is a linear relationship, it is best to describe it directly, supporting that description with the correlation. Because correlation has no units, the ratio of the standard deviations introduces the units of the slope, y-units per x-unit.
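The identity can be confirmed numerically. The sketch below computes the least squares slope directly, then recomputes it as r times the ratio of standard deviations. The function names are hypothetical, and the boat and manatee counts are invented for illustration; they are not the data used in the lesson.

```python
import math
import statistics

def least_squares(x, y):
    """Return (intercept a, slope b) of the least squares line y = a + bx."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x   # the line passes through (mean_x, mean_y)
    return a, b

def correlation(x, y):
    """Pearson correlation computed from deviations about the means."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values: boats registered (thousands) and manatees killed.
boats = [450, 480, 510, 560, 600, 650, 700]
manatees = [15, 20, 22, 30, 34, 42, 48]

a, b = least_squares(boats, manatees)
r = correlation(boats, manatees)
b_from_r = r * statistics.stdev(manatees) / statistics.stdev(boats)

print("slope from least squares:", b)
print("slope from r * sy/sx:   ", b_from_r)
```

The slope here is in manatees killed per thousand boats registered, which makes the point about units concrete: r contributes the strength and direction, and sy/sx contributes the y-units per x-unit.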
Key Points
| Linear Equation | is an equation of the form y = a + bx. To interpret a linear equation we need to know only the variables and their units. |
| Slope | The slope gives a value in "y-units per x-unit". |
| Intercept | The intercept gives a starting value in y-units. |
| y and x axes | are typically assigned so that the y-axis holds the response variable and the x-axis holds the explanatory or factor variable. |
| Residuals | are the vertical differences between the data values and the fitted line. |
| Least Squares | The least squares criterion finds the line that minimizes the sum of the squared residuals. |
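The least squares criterion can be illustrated in the spirit of the drag-the-line tool: fit the line, then nudge its intercept and slope and confirm that every nudge increases the sum of squared residuals. This is a minimal sketch with hypothetical function names and invented data, not a description of how the ActivStats tool works internally.

```python
def fit_least_squares(x, y):
    """Return (intercept, slope) of the least squares line."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def sum_squared_residuals(intercept, slope, x, y):
    """Sum of squared vertical differences between the data and the line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_least_squares(x, y)
best = sum_squared_residuals(a, b, x, y)

# "Drag" the line: try nearby intercepts and slopes and record the criterion.
nudged = [sum_squared_residuals(a + da, b + db, x, y)
          for da in (-0.5, 0, 0.5)
          for db in (-0.2, 0, 0.2)
          if (da, db) != (0, 0)]

print("least squares line: y =", round(a, 3), "+", round(b, 3), "x")
print("minimum sum of squares:", round(best, 4))
print("every nudged line is worse:", all(s > best for s in nudged))
```

Trying a handful of nearby lines is only an informal check, like the tool itself; it suggests, but does not prove, that the minimum is unique.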
Teachers' Notes: The least squares criterion is, for most students, their first encounter with an optimality condition. The least squares tool provides immediate experience with this optimality and an informal demonstration that there is a single minimum sum of squares and thus that the least squares line is unique.
Lines also provide the first model for data and thus the first real discussion of residuals as deviations from a model. The interaction of individual points with the least squares line in the tool demonstrates the sensitivity of regression to individual data values, continuing our concern with outliers. Astute students may discover leverage for themselves. Considering all of these topics, this lesson introduces many fundamental ideas, although the statistics discussed are elementary.
We return to regression later to discuss inference and residual displays, but even classes that do not plan to cover those topics should not omit this lesson.