## Checkpoint - Course 2, Unit 3 Patterns of Association

Each unit ends with a final lesson and Checkpoint that help students summarize the key ideas in the unit. The final Checkpoint will generally be discussed in class, with the teacher facilitating the summary and students making notes in their Math Toolkits (teachers may just refer to this as "notes") of any points they need to remember, adding illustrative examples as needed. If your student is having difficulty with any investigation in this unit, this Checkpoint and the accompanying answers may help you recall the concepts involved and give you the big picture of what the entire unit is about. If your student has completed the unit, then a version of this should be in his or her notes or toolkit. Students should also have Technology Tips in their toolkits, which may be useful for this unit.

Possible Responses to Unit Summary Checkpoint
In this unit, students studied regression and correlation - ways of summarizing the center and spread of an elliptical pattern of paired data. The bolded words are vocabulary and concepts with which your student should become familiar.
a. Describe how the idea of a sum of squared differences is used in correlation coefficients and in regression equations.
The difference referred to is the vertical gap between the y value of an observed data point and the point on the regression line directly above or below it. The point on the line represents a prediction of the y value (dependent variable) for the same x value (independent variable) as the data point. Since the points on the line come from the regression equation, the difference between the actual y value (the data point) and the predicted y value (on the line) can be thought of as an error in prediction. The smaller the error, the better the prediction. Some points will be closer to the regression line than others, so the sum of squared differences gives a single measure of how well a line fits the data as a whole. The calculator is programmed to find the line that fits the data best. This line, called the **least squares regression line**, has the smallest possible sum of squared differences.

**Pearson's correlation coefficient**, r, is a single number that measures the strength of the linear association between two variables. Perfectly linear data will have a correlation of 1 (for a positive association) or -1 (for a negative association). A set of data that is quite random will have a correlation close to zero. One common way to write the formula for r is

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \, \sum (y - \bar{y})^2}}$$

(See page 188 in the student text.) The denominator also has squared differences, but this time the difference is not between an observed point and the predicted point. The differences in this formula refer to how far any particular observation is from the **centroid**, $(\bar{x}, \bar{y})$.
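The ideas above can be sketched in a few lines of Python. This is only an illustration: the five data points are hypothetical (they are not from the student text), and the function names are made up for this sketch. It computes the least squares line, checks that another candidate line has a larger sum of squared differences, and computes r from differences to the centroid.

```python
from math import sqrt

# Hypothetical paired data, chosen only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]


def mean(values):
    return sum(values) / len(values)


def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least squares regression line."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar  # the line passes through the centroid
    return slope, intercept


def sum_squared_differences(xs, ys, slope, intercept):
    """Sum of squared vertical gaps between the data and a given line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


def correlation(xs, ys):
    """Pearson's r, built from differences to the centroid."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


slope, intercept = least_squares_line(xs, ys)
best_sse = sum_squared_differences(xs, ys, slope, intercept)
# Any other line, such as y = 2.5 + 0.5x, should give a larger sum.
other_sse = sum_squared_differences(xs, ys, 0.5, 2.5)
r = correlation(xs, ys)

print(f"line: y = {intercept:.2f} + {slope:.2f}x")
print(f"sum of squared differences: {best_sse:.2f} (other line: {other_sse:.2f})")
print(f"r = {r:.3f}")
```

For this made-up data the least squares line is y = 2.2 + 0.6x, with a sum of squared differences of 2.4 and r of about 0.77. The alternative line gives 2.5, slightly worse, which is the sense in which the regression line "fits best."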
b. Does a strong correlation imply a cause-and-effect relationship?
No. There were many examples in this unit where two variables were strongly correlated but there was no cause and effect. For example, reading scores and shoe size are strongly correlated. As one increases, the other does also, and a line will fit the pattern in a scatterplot quite well. But we would never claim that having bigger feet causes a child to read better. Rather, both of these variables are correlated with a third variable, age. Many other anomalous relationships have been documented, some humorous, like the strong correlation between women's hem lines and the performance of the stock market, or between the consumption of liquor and teacher salaries. (In addition, there are situations where a high correlation arises simply because a single data point is far from the rest of the points. This may create a large correlation even though the data do not in fact show a linear trend overall. In this case, the high correlation makes one think there is a strong linear relationship when there may be no relationship at all, cause-and-effect or otherwise, between the variables. A point like this is called an **influential point**. It should be noted that the value of r indicates how closely the points cluster about the regression line. It says nothing about whether such a line is an appropriate model.)
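The effect of a single faraway point can be seen in a small Python sketch. The data here are hypothetical and deliberately extreme: nine points arranged in a 3-by-3 grid, which has no linear trend at all, plus one distant point. The function name is an assumption made for this illustration.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson's r, built from differences to the centroid."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical cluster: a 3-by-3 grid of points with no linear trend.
cluster_x = [1, 2, 3, 1, 2, 3, 1, 2, 3]
cluster_y = [1, 1, 1, 2, 2, 2, 3, 3, 3]

# The same cluster plus one point far from the rest.
with_far_x = cluster_x + [20]
with_far_y = cluster_y + [20]

r_cluster = correlation(cluster_x, cluster_y)
r_with_far_point = correlation(with_far_x, with_far_y)

print(f"r for the cluster alone:       {r_cluster:.3f}")
print(f"r with the faraway point added: {r_with_far_point:.3f}")
```

The cluster alone has a correlation of zero, but adding the one distant point pushes r to about 0.98. Looking at the scatterplot, not just the value of r, is what reveals that the "strong" correlation comes from a single point.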
c. How are regression lines used?
There are two important ways. The first is to make an estimate or prediction of y for some particular value of x. For example, you might make a prediction of ninth grade GPA for a student with a GPA of 2.0 in eighth grade. (See Lesson 3 in the student text.) Or, if a researcher collected data on annual cigarette consumption and annual cancer deaths for various countries, they might find that the points formed an elliptical cloud on the scatterplot, so computing a line of regression would be appropriate. The regression equation could then be used to predict cancer deaths for different cigarette consumption rates. The second way a regression line can be used is to make a model for a relationship, not with the idea of making predictions, but with the idea of studying how much of the variation in the response variable is related to changes in the input variable. For example, a scientist might use a regression equation for the (exposure to radioactive waste, cancer deaths) data on page 211 of the student text to try to understand the degree to which exposure to radioactivity contributes to cancer deaths. The scientist may not be interested in predicting cancer deaths for some other community, but could be looking for a model of the effect of exposure to radioactivity.
d. Refer back to the "Think About This Situation" questions on page 171. How would you answer those questions now?
a. A scatterplot shows that graduation rate and overall score have a positive association. The least squares regression line is Score = 39.0 + 0.57(Graduation Rate), where the graduation rate is measured in percent. The slope indicates the rate at which the score changes in response to changes in graduation rate: for every 1 percentage point improvement in graduation rate, the predicted overall score increases by 0.57. The strength of this relationship is measured by the correlation coefficient, r = 0.49. To decide if the relationship is cause-and-effect, students must consider whether deliberately making a change in one of the variables will affect the other. In this case, will making a policy change that leads to graduation rates improving also cause people to rate a school higher? Using the regression line Score = 39.0 + 0.57(Graduation Rate), substitute 86 (Wake Forest's graduation rate of 86%) for the graduation rate and find the corresponding overall score. This gives a prediction of about 88 (reported as 87.8, presumably computed with unrounded coefficients), which is not a good prediction because Wake Forest's score must be below 80, or Wake Forest would be on this top 20 list. (In fact, Wake Forest's score is 77, so the regression line does not always make good predictions.)
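The prediction step can be checked with a short Python sketch. It uses the rounded coefficients quoted above (39.0 and 0.57) and assumes the graduation rate is entered in percent, so an 86% rate is entered as 86. The function and variable names are made up for this illustration.

```python
def predicted_score(graduation_rate_percent):
    """Overall score predicted by the regression line quoted in the text."""
    return 39.0 + 0.57 * graduation_rate_percent


wake_forest_rate = 86      # graduation rate in percent
wake_forest_actual = 77    # actual overall score, as given in the text

prediction = predicted_score(wake_forest_rate)
error = wake_forest_actual - prediction  # actual minus predicted

# A common slip: entering the rate as a decimal gives a very different answer.
prediction_if_decimal = predicted_score(0.86)

print(f"predicted score: {prediction:.2f}")
print(f"error in prediction: {error:.2f}")
print(f"prediction if 0.86 is entered by mistake: {prediction_if_decimal:.2f}")
```

With the rounded coefficients the prediction comes out to about 88.0, close to the 87.8 quoted above; the small gap is likely due to rounding of the slope and intercept. The error in prediction is about -11, meaning the line overestimates Wake Forest's score by roughly 11 points.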

If you would like to see specific problems from Course 2, Unit 3, a link is provided to Examples of Tasks from Course 2, Unit 3. If you are interested in following up on the Statistics strand in general, then the Scope and Sequence will help you see where different concepts are introduced. On the Statistics page, you can read an explanation of the main statistics concepts as they are developed in all four courses.