

Lecture 25 — Collected Practice Questions

Below are practice problems tagged for Lecture 25 (rendered directly from the original exam/quiz sources).


Problem 1

Now let’s study the relationship between a penguin’s bill length (in millimeters) and mass (in grams). Suppose we’re given that


Problem 1.1

Which of the four scatter plots below describes the relationship between bill length and body mass, based on the information provided in the question?


Answer Option 3

Given that the correlation coefficient is 0.55, bill length and body mass have a moderate positive correlation. We eliminate Option 1 (strong correlation) and Option 4 (weak correlation).

Given that the average bill length is 44 mm, we expect our x-axis to have 44 at the middle, so we eliminate Option 2.


Difficulty: ⭐️

The average score on this problem was 91%.


Problem 1.2

Suppose we want to find the regression line that uses bill length, x, to predict body mass, y. The line is of the form y = mx + b. What are m and b?

What is m? Give your answer as a number without any units, rounded to three decimal places.

What is b? Give your answer as a number without units, rounded to three decimal places.

Answer: m = 77, b = 812

m = r \cdot \frac{\text{SD of }y }{\text{SD of }x} = 0.55 \cdot \frac{840}{6} = 77

b = \text{mean of }y - m \cdot \text{mean of }x = 4200 - 77 \cdot 44 = 812
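As a quick check, the two formulas above can be evaluated in Python. This is a minimal sketch: the function name regression_line is hypothetical (it is not part of the course materials), and the summary statistics are the ones used in the solution above.

```python
def regression_line(r, mean_x, sd_x, mean_y, sd_y):
    """Return (slope, intercept) of the regression line in original units."""
    m = r * sd_y / sd_x          # slope: r times (SD of y) / (SD of x)
    b = mean_y - m * mean_x      # intercept: line passes through the means
    return m, b

# Values used in the solution: r = 0.55, bill length mean 44 mm and SD 6 mm,
# body mass mean 4200 g and SD 840 g.
m, b = regression_line(0.55, 44, 6, 4200, 840)
print(round(m, 3), round(b, 3))
```

Rounding is applied only when printing, since floating-point arithmetic may leave a tiny error in the last digits.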


Difficulty: ⭐️

The average score on this problem was 92%.


Problem 1.3

What is the predicted body mass (in grams) of a penguin whose bill length is 44 mm? Give your answer as a number without any units, rounded to three decimal places.

Answer: 4200

y = mx + b = 77 \cdot 44 + 812 = 3388 + 812 = 4200


Difficulty: ⭐️

The average score on this problem was 95%.


Problem 1.4

A particular penguin had a predicted body mass of 6800 grams. What is that penguin’s bill length (in mm)? Give your answer as a number without any units, rounded to three decimal places.

Answer: 77.766

In this question, we want to compute the x value given the y value:

\begin{aligned} y &= mx + b \\ y - b &= mx \\ x &= \frac{y - b}{m} \quad \text{(m is nonzero)} \\ &= \frac{6800 - 812}{77} = \frac{5988}{77} \approx 77.766 \end{aligned}


Difficulty: ⭐️⭐️

The average score on this problem was 88%.


Problem 1.5

Below is the residual plot for our regression line.

Which of the following is a valid conclusion that we can draw solely from the residual plot above?

Answer: The accuracy of the regression line’s predictions depends on bill length

The vertical spread in this residual plot is uneven, which implies that the regression line’s predictions aren’t equally accurate for all inputs. This doesn’t necessarily mean that fitting a nonlinear curve would be better. It just impacts how we interpret the regression line’s predictions.


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 40%.



Problem 2

For each IKEA desk, we know the cost of producing the desk, in dollars, and the current sale price of the desk, in dollars. We want to predict sale price based on production cost using linear regression.


Problem 2.1

For this scenario, which of the following most likely describes the slope of the regression line when both variables are measured in dollars?

Answer: more than 1

The slope of a line represents the change in y for each change of 1 in x. Therefore, the slope of the regression line is the amount we’d predict the sale price to increase when the production cost of an item increases by one dollar. In other words, it’s the sale price per dollar of production cost. This is almost certainly more than 1, otherwise the company would not make a profit. We’d expect that for any company, the sale price of an item should exceed the production cost, meaning the slope of the regression line has a value greater than one.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 72%.


Problem 2.2

For this scenario, which of the following most likely describes the slope of the regression line when both variables are measured in standard units?

Answer: between 0 and 1, exclusive

When both variables are measured in standard units, the slope of the regression line is the correlation coefficient. Recall that correlation coefficients are always between -1 and 1. However, it’s not realistic for production cost and sale price to be negatively correlated (as that would mean products sell for less if they cost more to produce), so we can limit our choice of answer to values between 0 and 1. A coefficient of 0 would mean there is no linear association and 1 would mean perfect correlation (that is, plotting the data would create a line); both are unlikely, leaving us with the answer being between 0 and 1, exclusive.
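The claim that the slope in standard units equals the correlation coefficient can be checked numerically. The sketch below uses made-up production costs and sale prices (not the IKEA data) and compares r with the least-squares slope from np.polyfit after converting both variables to standard units.

```python
import numpy as np

def su(vals):
    """Convert an array to standard units."""
    return (vals - vals.mean()) / np.std(vals)

# Made-up data for illustration only (not from the IKEA dataset).
rng = np.random.default_rng(0)
cost = rng.uniform(20, 200, size=500)
price = 1.8 * cost + rng.normal(0, 25, size=500)

r = (su(cost) * su(price)).mean()

# Least-squares slope and intercept when both variables are in standard units.
slope_su, intercept_su = np.polyfit(su(cost), su(price), 1)
print(r, slope_su, intercept_su)
```

The printed slope should agree with r up to floating-point error, and the intercept should be essentially 0.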


Difficulty: ⭐️⭐️

The average score on this problem was 86%.


Problem 2.3

The residual plot for this regression is shown below.

What is being represented on the horizontal axis of the residual plot?

Answer: actual production cost

Residual plots show x on the horizontal axis and the residuals, or differences between actual y values and predicted y values, on the vertical axis. Therefore, the horizontal axis here shows the production cost. Note that we are not predicting production costs at all, so production cost means the actual cost to produce a product.


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 43%.


Problem 2.4

Which of the following is a correct conclusion based on this residual plot? Select all that apply.

Answer: It would be better to fit a nonlinear curve.

Let’s go through each answer choice.

  • The correlation between production cost and sale price could be very strong. After all, we are able to predict the sale price within ten dollars almost all the time, since residuals are almost all between -10 and 10.

  • It would be better to fit a nonlinear curve because the residuals show a pattern. Reading from left to right, they go from mostly negative to mostly positive to mostly negative again. This suggests that a nonlinear curve might be a better fit for our data.

  • Our predictions are typically within ten dollars of the actual sale price, and this is consistent throughout. We see this on the residual plot by a fairly even vertical spread of dots as we scan from left to right. This data is not heteroscedastic.

  • We can do regression on a dataset of any size, even a very small data set. Further, this dataset is decently large, since there are a good number of points in the residual plot.

  • The regression line is always the best-fitting line for any dataset. There may be other curves that are better fits than lines, but when we restrict to lines, the best of the bunch is the regression line.

  • We have no way of knowing how representative our data set is of the population. This is not something we can discern from a residual plot because such a plot contains no information about the population from which the data was drawn.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 61%.



Problem 3

Suppose the price of an IKEA product and the cost to have it assembled are linearly associated with a correlation of 0.8. Product prices have a mean of 140 dollars and a standard deviation of 40 dollars. Assembly costs have a mean of 80 dollars and a standard deviation of 10 dollars. We want to predict the assembly cost of a product based on its price using linear regression.


Problem 3.1

The NORDMELA 4-drawer dresser sells for 200 dollars. How much do we predict its assembly cost to be?

Answer: 92 dollars

We first use the formulas for the slope, m, and intercept, b, of the regression line to find the equation. For our application, x is the price and y is the assembly cost since we want to predict the assembly cost based on price.

\begin{aligned} m &= r*\frac{\text{SD of }y}{\text{SD of }x} \\ &= 0.8*\frac{10}{40} \\ &= 0.2\\ b &= \text{mean of }y - m*\text{mean of }x \\ &= 80 - 0.2*140 \\ &= 80 - 28 \\ &= 52 \end{aligned}

Now we know the formula of the regression line and we simply plug in x=200 to find the associated y value.

\begin{aligned} y &= mx+b \\ y &= 0.2x+52 \\ &= 0.2*200+52 \\ &= 92 \end{aligned}


Difficulty: ⭐️⭐️

The average score on this problem was 76%.


Problem 3.2

The IDANÄS wardrobe sells for 80 dollars more than the KLIPPAN loveseat, so we expect the IDANÄS wardrobe will have a greater assembly cost than the KLIPPAN loveseat. How much do we predict the difference in assembly costs to be?

Answer: 16 dollars

The slope of a line describes the change in y for each change of 1 in x. The difference in x values for these two products is 80, so the difference in y values is m*80 = 0.2*80 = 16 dollars.

An equivalent way to state this is:

\begin{aligned} m &= \frac{\text{ rise, or change in } y}{\text{ run, or change in } x} \\ 0.2 &= \frac{\text{ rise, or change in } y}{80} \\ 0.2*80 &= \text{ rise, or change in } y \\ 16 &= \text{ rise, or change in } y \end{aligned}
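To see that the predicted difference depends only on the slope, here is a small sketch. The function name predict_assembly_cost is hypothetical, and since the KLIPPAN price is not given, the prices in the loop are arbitrary stand-ins.

```python
def predict_assembly_cost(price, r=0.8, mean_x=140, sd_x=40, mean_y=80, sd_y=10):
    """Predict assembly cost from price using the regression line."""
    m = r * sd_y / sd_x
    b = mean_y - m * mean_x
    return m * price + b

# The KLIPPAN price is not given, so try a few hypothetical prices.
for klippan_price in [100, 250, 400]:
    diff = predict_assembly_cost(klippan_price + 80) - predict_assembly_cost(klippan_price)
    print(klippan_price, round(diff, 2))
```

Whatever price we start from, the intercept cancels in the subtraction, so the difference is always the slope times 80.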


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 65%.


Problem 3.3

If we create a 95% prediction interval for the assembly cost of a 100 dollar product and another 95% prediction interval for the assembly cost of a 120 dollar product, which prediction interval will be wider?

Answer: The one for the 100 dollar product.

Prediction intervals get wider the further we get from the point (\text{mean of } x, \text{mean of } y) since all regression lines must go through this point. Since the average product price is 140 dollars, the prediction interval will be wider for the 100 dollar product, since it’s the further of 100 and 120 from 140.


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 45%.



Source: sp24-final — Q12

Problem 4

Suppose you know the following information.

For all parts of this question, give your answer as an integer.


Problem 4.1

Suppose the rents are normally distributed. What is the rent below which 84% of apartments are priced?

Answer: $3,400

We can use the 68-95-99.7 rule to approximate this answer. The 68-95-99.7 rule is a handy shortcut for approximating how much data from a distribution lies below/above/within certain value ranges. It states that, for a normal distribution:

  • Roughly 68% of the data will lie within 1 standard deviation from the mean.
  • Roughly 95% of the data will lie within 2 standard deviations from the mean.
  • Roughly 99.7% of the data will lie within 3 standard deviations from the mean.

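Since roughly 68% of the data lies within 1 standard deviation of the mean, about 34% lies between the mean and 1 standard deviation above it. Adding the 50% of the data that lies below the mean gives 84%. The bottom 84% of the data is therefore roughly equivalent to “all data that lies below 1 standard deviation above the mean.” In this case, the mean of our distribution is $3,000 and the standard deviation is $400, so the rent below which 84% of apartments are priced is $3,400.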
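The approximation can be compared against an exact normal curve using Python’s standard library. This is only a sanity check under the assumption that rents are exactly normal with the mean and SD used above; the exam answer relies on the 68-95-99.7 rule, not on this computation.

```python
from statistics import NormalDist

# Mean and SD of rent as used in the solution above.
rents = NormalDist(mu=3000, sigma=400)

# Proportion of apartments renting for less than mean + 1 SD.
below_one_sd = rents.cdf(3400)

# Rent at the exact 84th percentile of this normal curve.
exact_cutoff = rents.inv_cdf(0.84)

print(round(below_one_sd, 4), round(exact_cutoff))
```

The proportion below $3,400 comes out very close to 0.84, and the exact 84th percentile lands just under $3,400, which is why the rule-of-thumb answer is accepted.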


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 31%.


Problem 4.2

Sophie’s apartment rents for $5,000. What is this rent in standard units?

Answer: 5

A value in standard units (also called a z-score) is the number of standard deviations an observation is away from the mean of a distribution. In this case, we want to find how many standard deviations ($400) our observation ($5000) is away from the mean ($3000). The math works out to five standard deviations:

\frac{5000 - 3000}{400} = 5


Difficulty: ⭐️

The average score on this problem was 93%.


Problem 4.3

Based on what you know about the rent of Sophie’s apartment, use the regression line to predict the square footage of Sophie’s apartment.

Answer: 2450

The correlation coefficient of 0.9 tells us about the slope of the regression line to predict square footage from rent; this means that “for every standard unit traveled right in the x-direction (rent), the regression line heads 0.9 standard units up in the y-direction (square footage).”

Sophie’s apartment rent is $5000 (or five standard units in the x-direction, rent). So, to get our regression line prediction for the square footage of Sophie’s apartment, we should head 5 \cdot 0.9 = 4.5 standard units upwards from the mean in the y-direction, square footage. The standard deviation for square footage is 100 square feet; this implies that the prediction for Sophie’s apartment square footage should be 100 \cdot 4.5 = 450 square feet above the mean (2000 square feet), for a final prediction of 2450 square feet.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 66%.


Problem 4.4

Sophie’s apartment is actually 2,300 square feet. What is the residual of your prediction?

Answer: -150

A residual just measures the difference between the observed and the predicted value. If our observation is 2300 square feet, and our prediction is 2450 square feet, our residual is then -150 square feet.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 59%.


Problem 4.5

Cici’s apartment is 1,800 square feet. Based on this information, use the regression line to predict the rent of Cici’s apartment.

Answer: $2,280

The correlation coefficient of 0.9 also tells us about the slope of the regression line to predict rent from square footage; this means that “for every standard unit traveled right in the x-direction (square footage), the regression line heads 0.9 standard units up in the y-direction (rent).”

Cici’s apartment square footage is 1,800 square feet (or negative two standard units in the x-direction, square footage). So, to get our regression line prediction for the rent of Cici’s apartment, we should head -2 \cdot 0.9 = -1.8 standard units from the mean in the y-direction, rent. The standard deviation for rent is $400; this implies that the prediction for Cici’s apartment rent should be 400 \cdot 1.8 = 720 dollars below the mean (3000 dollars), for a final prediction of $2280.
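The predictions in Problems 4.3 through 4.5 all follow the same three steps (convert to standard units, multiply by r, convert back), which can be sketched with one helper. The function name predict_via_su is hypothetical. The last few lines show one possible pitfall: algebraically inverting the rent-to-square-footage line is not the same as regressing rent on square footage.

```python
def predict_via_su(x, mean_x, sd_x, mean_y, sd_y, r):
    """Predict y from x by going through standard units."""
    x_su = (x - mean_x) / sd_x      # x in standard units
    y_su = r * x_su                 # regression line in standard units
    return mean_y + y_su * sd_y     # back to original units

RENT_MEAN, RENT_SD = 3000, 400
SQFT_MEAN, SQFT_SD = 2000, 100
R = 0.9

# Problem 4.3 and 4.4: predict square footage from rent, then the residual.
sophie_sqft = predict_via_su(5000, RENT_MEAN, RENT_SD, SQFT_MEAN, SQFT_SD, R)
sophie_residual = 2300 - sophie_sqft

# Problem 4.5: predict rent from square footage (x and y swap roles).
cici_rent = predict_via_su(1800, SQFT_MEAN, SQFT_SD, RENT_MEAN, RENT_SD, R)

# Pitfall: solving the rent -> square footage line for rent instead.
m = R * SQFT_SD / RENT_SD
b = SQFT_MEAN - m * RENT_MEAN
inverted_rent = (1800 - b) / m

print(sophie_sqft, sophie_residual, cici_rent, round(inverted_rent, 2))
```

The inverted line gives a rent noticeably lower than $2,280, because the two regression lines coincide only when r is exactly 1 or -1.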


Difficulty: ⭐️⭐️⭐️⭐️⭐️

The average score on this problem was 23%.



Problem 5

Sam wants to fit a linear model to predict a dog’s height using its weight.

He first runs the following code:

x = df.get('weight')
y = df.get('height')

def su(vals):
    return (vals - vals.mean()) / np.std(vals)


Problem 5.1

Select all of the Python snippets that correctly compute the correlation coefficient into the variable r.

Snippet 1:

r = (su(x) * su(y)).mean()

Snippet 2:

r = su(x * y).mean()

Snippet 3:

t = 0
for i in range(len(x)):
    t = t + su(x[i]) * su(y[i])
r = t / len(x)

Snippet 4:

t = np.array([])
for i in range(len(x)):
    t = np.append(t, su(x)[i] * su(y)[i])
r = t.mean()

Answer: Snippet 1 & 4

  • Snippet 1: Recall from the reference sheet, the correlation coefficient is r = (su(x) * su(y)).mean().

  • Snippet 2: We have to standardize each variable separately, so this snippet doesn’t work.

  • Snippet 3: Here su() is applied to one data point at a time, so neither variable is standardized using its full mean and SD. A single value minus its own mean is zero, and the SD of a single value is also zero, so su(x[i]) never yields a meaningful standardized value.

  • Snippet 4: This code computes the same thing as Snippet 1, except that it multiplies each pair of corresponding data points individually. Hence this snippet works.
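The four snippets can be run side by side on a small example. The weights and heights below are made up, and NumPy arrays stand in for the DataFrame columns so the sketch is self-contained; np.corrcoef is used only as an independent reference value.

```python
import numpy as np

def su(vals):
    return (vals - vals.mean()) / np.std(vals)

# Made-up weights (kg) and heights (cm), for illustration only.
x = np.array([5.0, 12.0, 20.0, 31.0, 8.0, 25.0])
y = np.array([30.0, 42.0, 55.0, 70.0, 38.0, 58.0])

r1 = (su(x) * su(y)).mean()                  # Snippet 1
r2 = su(x * y).mean()                        # Snippet 2

# Snippet 3 divides 0 by 0 for every point, so silence the warnings.
with np.errstate(invalid="ignore", divide="ignore"):
    t = 0
    for i in range(len(x)):
        t = t + su(x[i]) * su(y[i])
    r3 = t / len(x)

t = np.array([])                             # Snippet 4
for i in range(len(x)):
    t = np.append(t, su(x)[i] * su(y)[i])
r4 = t.mean()

reference = np.corrcoef(x, y)[0, 1]
print(r1, r2, r3, r4, reference)
```

Snippets 1 and 4 match the reference, Snippet 2 is essentially 0 (the mean of any standardized variable), and Snippet 3 is nan.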


Difficulty: ⭐️⭐️

The average score on this problem was 81%.


Problem 5.2

Sam computes the following statistics for his sample:

The best-fit line predicts that a dog with a weight of 10 kg has a height of 45 cm.

What is the SD of dog heights?

Answer: Option 3: 10

The best-fit line in original units is given by y = mx + b where m = r * (SD of y) / (SD of x) and b = (mean of y) - m * (mean of x) (refer to the reference sheet). Let c be the SD of y, which we’re trying to find. Then our best-fit line is y = (0.8*c/8)x + (50-(0.8*c/8)*15). Plugging in x = 10 and y = 45 and simplifying gives 45 = 0.1*c*10 + (50 - 1.5*c), which simplifies to 45 = 50 - 0.5*c, so c = 10.
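The same algebra can be written as a one-line solve. This sketch rearranges the regression line, written as predicted y = mean of y + r * (SD of y / SD of x) * (x - mean of x), to isolate the SD of y. The function name is hypothetical, and the statistics are the ones that appear in the solution above.

```python
def sd_y_from_prediction(x, predicted_y, r, mean_x, sd_x, mean_y):
    """Solve predicted_y = mean_y + r * (sd_y / sd_x) * (x - mean_x) for sd_y.

    Requires r != 0 and x != mean_x, otherwise sd_y cannot be recovered.
    """
    return (predicted_y - mean_y) * sd_x / (r * (x - mean_x))

# Statistics used in the solution: r = 0.8, mean weight 15 kg, SD of weight 8 kg,
# mean height 50 cm; a 10 kg dog is predicted to be 45 cm tall.
sd_height = sd_y_from_prediction(x=10, predicted_y=45, r=0.8,
                                 mean_x=15, sd_x=8, mean_y=50)
print(sd_height)
```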


Difficulty: ⭐️⭐️

The average score on this problem was 89%.


Problem 5.3

Assume that the statistics in part b) still hold. Select all of the statements below that are true. (You don’t need to finish part b) in order to solve this question.)

Answer: Option 3 & 4

  • Option 1: We cannot determine whether two variables have a linear relationship simply from a line of best fit. The line of best fit is just the best linear fit to the two variables; it exists whether or not the relationship between them is actually linear.

  • Option 2: To calculate the root mean squared error, we need the actual data points so we can calculate residual values. Seeing that we don’t have access to the data points, we cannot say that the root mean squared error of the best-fit line is smaller than 5.

  • Option 3: This is true according to the problem statement given in part b).

  • Option 4: This is true since we expect there to be a positive correlation between dog height and weight. So dogs that are lighter will also most likely be shorter (i.e., a dog that is lighter than 15 kg will most likely be shorter than 50 cm).


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 72%.



Problem 6

Let’s switch our attention to the relationship between the number of points per game and the number of assists per game for all players in season. Using season, we compute the following information:


Problem 6.1

Let’s start by using points per game (x) to predict assists per game (y).

Tina Charles had 27 points per game in 2021, the most of any player in the WNBA. What is her predicted assists per game, according to the regression line? Round your answer to 3 decimal places.

Answer: 5.4

We need to find and use the regression line to find the predicted y for an x of 27. There are two ways to proceed:

  1. Use the regression line in standard units. To do this, we’d need to convert 27 from original units to standard units, use the regression line y_\text{su} = r \cdot x_\text{su}, and convert the output back to original units.
  2. Use the regression line in original units. To do this, we’d need to find the slope m and intercept b in the regression line y = mx + b, using the formulas m = r \cdot \frac{\text{SD of }y }{\text{SD of }x} and b = \text{mean of }y - m \cdot \text{mean of }x.

Both solutions work; for the sake of completeness, we’ll show both. Recall, r is the correlation coefficient between x and y, which we are told is 0.65.

Solution 1:

First, we need to convert 27 points per game to standard units. Doing so yields

x_{\text{su}} = \frac{x - \text{mean of }x}{\text{SD of }x} = \frac{27 - 7}{5} = 4

Per the regression line, y_\text{su} = r \cdot x_\text{su}, we have y_\text{su} = 0.65 \cdot 4 = 2.6, which is Tina Charles’ predicted assists per game in standard units. All that’s left is to convert this value back to original units:

\begin{aligned} y_{\text{su}} &= \frac{y - \text{mean of }y}{\text{SD of }y} \\ 2.6 &= \frac{y - 1.5}{1.5} \\ 2.6 \cdot 1.5 + 1.5 &= y \\ y &= \boxed{5.4} \end{aligned}

So, the regression line predicts Tina Charles will have 5.4 assists per game (in original units).


Solution 2:

First, we need to find the slope m and intercept b:

m = r \cdot \frac{\text{SD of }y }{\text{SD of }x} = 0.65 \cdot \frac{1.5}{5} = 0.195

b = \text{mean of }y - m \cdot \text{mean of }x = 1.5 - 0.195 \cdot 7 = 0.135

Then,

y = mx + b \implies y = 0.195 \cdot 27 + 0.135 = \boxed{5.4}

So, once again, the regression line predicts Tina Charles will have 5.4 assists per game.

Note: The numbers in this problem may seem ugly, but students taking this exam had access to calculators since this exam was online. It also turns out that the numbers were easier to work with in Solution 1 over Solution 2; this was intentional.
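Both solutions can be carried out in a few lines to confirm that they agree. This is a minimal sketch using the statistics quoted in the solutions above; the variable names are our own.

```python
# Statistics used in the solutions above.
r, mean_x, sd_x, mean_y, sd_y = 0.65, 7, 5, 1.5, 1.5
x = 27  # Tina Charles' points per game

# Solution 1: go through standard units.
x_su = (x - mean_x) / sd_x
y_via_su = r * x_su * sd_y + mean_y

# Solution 2: slope and intercept in original units.
m = r * sd_y / sd_x
b = mean_y - m * mean_x
y_via_line = m * x + b

print(round(y_via_su, 3), round(y_via_line, 3))
```

The two routes are algebraically identical, so any difference in the printed values would only be floating-point noise.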


Difficulty: ⭐️⭐️

The average score on this problem was 81%.


Problem 6.2

Tina Charles actually had 2.1 assists per game in the 2021 season.

What is the error, or residual, for the prediction in the previous subpart? Round your answer to 3 decimal places.

Answer: -3.3

Residuals are defined as follows:

\text{residual} = \text{actual } y - \text{predicted }y

2.1 - 5.4 = -3.3, which gives us our answer.

Note: Many students answered 3.3. Pay attention to the order of the calculation!


Difficulty: ⭐️⭐️

The average score on this problem was 82%.


Problem 6.3

Select all true statements below regarding the regression line between points per game (x) and assists per game (y).

Answers:

  • The point (0, 0) is guaranteed to be on the regression line when both x and y are in standard units (Option 1).
  • The point (7, 1.5) is guaranteed to be on the regression line when both x and y are in original units (Option 4).

The main idea being assessed here is the fact that the point (\text{mean of }x, \text{mean of }y) always lies on the regression line. Indeed, in original units, 7 is the average x (PPG) and 1.5 is the average y (APG); this information was provided to us at the start of the problem. The nuance behind this problem lies in the units that are being used in the regression line.

When the regression line is in standard units:

  • In standard units, 0 means “0 standard deviations above the average”, i.e. 0 means “average”. When the regression line is in standard units, we have y_\text{su} = r \cdot x_\text{su}. If x is average, i.e. if x_\text{su} = 0, then y_\text{su} = r \cdot x_\text{su} = r \cdot 0 = 0, regardless of what r is. So the point (0, 0) is on the regression line when both x and y are in standard units, meaning that Option 1 is correct.
  • The point (7, 1.5) is not on the regression line when x and y are in standard units. We know this because in this problem r = 0.65, and if x_\text{su} = 7, then y_\text{su} = r \cdot x_\text{su} = 0.65 \cdot 7 = 4.55 \neq 1.5. This means that Option 3 is incorrect.

When the regression line is in original units:

  • In original units, the average x is 7 and the average y is 1.5. From class, we may remember that this automatically means that (7, 1.5) is on the regression line in original units. If we didn’t remember that, we can look to the formula for the slope m and intercept b in y = mx + b. The formula for the slope is actually not relevant here; what’s relevant is the fact that b = \text{mean of }y - m \cdot \text{mean of }x. Substituting the formula for b into y = mx + b yields

y = mx + b = mx + \text{mean of }y - m \cdot \text{mean of }x

  • If x = \text{mean of }x, then the above simplifies to: y = m \cdot \text{mean of }x + \text{mean of }y - m \cdot \text{mean of }x = \text{mean of }y, meaning that (\text{mean of }x, \text{mean of }y) — which is (7, 1.5) in this case — is on the regression line in original units, so Option 4 is correct.
  • In the above equation, if x = 0, then y = \text{mean of }y - m \cdot \text{mean of }x, which in this case simplifies to 1.5 - 0.195 \cdot 7 = 0.135 \neq 0. This means that (0, 0) is not on the regression line in original units and Option 2 is incorrect.

Difficulty: ⭐️⭐️

The average score on this problem was 87%.


Problem 6.4

So far, we’ve been using points per game (x) to predict assists per game (y). Suppose we found the regression line (when both x and y are in original units) to be y = ax + b.

Now, let’s reverse x and y. That is, we will now use assists per game (x) to predict points per game (y). The resulting regression line (when both x and y are in original units) is y = cx + d.

Which of the following statements is guaranteed to be true?

Answer: a < c

The formula for the slope of the regression line is m = r \cdot \frac{\text{SD of }y}{\text{SD of }x}. Note that the correlation coefficient r is symmetric, meaning that the correlation between x and y is the same as the correlation between y and x.

In the two regression lines mentioned in this problem, we have

\begin{aligned} a &= r \cdot \frac{\text{SD of assists per game}}{\text{SD of points per game}} \\ c &= r \cdot \frac{\text{SD of points per game}}{\text{SD of assists per game}} \end{aligned}

We’re told in the problem that the SD of points per game is 5 and the SD of assists per game is 1.5. So, a = r \cdot \frac{1.5}{5} and c = r \cdot \frac{5}{1.5}; since r = 0.65 is positive and \frac{1.5}{5} < \frac{5}{1.5}, a < c.
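Plugging in the numbers makes the comparison concrete. The sketch below also checks a general fact that follows from the two slope formulas: the product of the two slopes is r squared, so c is not simply 1/a unless r is 1 or -1.

```python
r, sd_points, sd_assists = 0.65, 5, 1.5

a = r * sd_assists / sd_points   # slope when predicting assists from points
c = r * sd_points / sd_assists   # slope when predicting points from assists

print(round(a, 3), round(c, 3), round(a * c, 4), round(r ** 2, 4))
```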


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 74%.



Source: wi25-final — Q8

Problem 7

Professor Filius Flitwick is conducting a study whose results will be used to help new Hogwarts students select appropriately sized broomsticks for their flying lessons. Professor Flitwick measures several wizards’ heights and broomstick lengths, both in centimeters. Since broomsticks can only be purchased in specific lengths, the scatterplot of broomstick length vs. height has a pattern of horizontal stripes:

If we group the wizards in Professor Flitwick’s research study by their broomstick length, and average the heights of the wizards in each group, we get the following results.

It turns out that the regression line that predicts broomstick length (y) based on wizard height (x) passes through the four points representing the means of each group. For example, the first row of the DataFrame above means that (165, 150) is a point on the regression line, as you can see in the scatterplot.


Problem 7.1

Based only on the fact that the regression line goes through these points, which of the following could represent the relationship between the standard deviation of broomstick length (y) and wizard height (x)? Select all that apply.

Answer: Options 4 and 5.

To solve this problem, we use the relationship between the slope of the regression line, the correlation coefficient r, and the standard deviations:

\text{slope} = r \cdot \frac{\text{SD}(y)}{\text{SD}(x)}

From the mean points given, we can calculate the slope:

\frac{225 - 150}{187.5 - 165.0} = \frac{75}{22.5} = \frac{10}{3}

We set up the equation:

r \cdot \frac{\text{SD}(y)}{\text{SD}(x)} = \frac{10}{3}

Now consider each option:

  • If \text{SD}(y) = \text{SD}(x): r = \frac{10}{3} \text{(not valid, since } r > 1\text{)}

  • If \text{SD}(y) = 2 \cdot \text{SD}(x): r \cdot 2 = \frac{10}{3} \Rightarrow r = \frac{5}{3} \approx 1.67 \quad \text{(not valid, since } r > 1\text{)}

  • If \text{SD}(y) = 3 \cdot \text{SD}(x): r \cdot 3 = \frac{10}{3} \Rightarrow r = \frac{10}{9} \approx 1.11 \quad \text{(not valid, since } r > 1\text{)}

  • If \text{SD}(y) = 4 \cdot \text{SD}(x): r \cdot 4 = \frac{10}{3} \Rightarrow r = \frac{10}{12} = \frac{5}{6} \approx 0.833 \quad \text{(valid)}

  • If \text{SD}(y) = 5 \cdot \text{SD}(x): r \cdot 5 = \frac{10}{3} \Rightarrow r = \frac{10}{15} = \frac{2}{3} \approx 0.667 \quad \text{(valid)}

Therefore, \text{SD}(y) = 4 \cdot \text{SD}(x) and \text{SD}(y) = 5 \cdot \text{SD}(x) are the only valid options.
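The case-by-case check can be automated with exact fractions. In this sketch, k stands for the candidate ratio SD(y) / SD(x) from each option, and an option is kept only if the implied r lies between -1 and 1.

```python
from fractions import Fraction

# Slope through the group-mean points (165, 150) and (187.5, 225).
slope = Fraction(225 - 150) / (Fraction("187.5") - Fraction(165))

valid = []
for k in [1, 2, 3, 4, 5]:      # candidate values of SD(y) / SD(x)
    r = slope / k              # from slope = r * SD(y) / SD(x)
    if -1 <= r <= 1:
        valid.append(k)
    print(k, r)

print(valid)
```

Using Fraction instead of floats keeps values like 10/3 exact, so the comparison with 1 cannot be thrown off by rounding.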


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 64%.


Problem 7.2

Now suppose you know that SD(y) = 3.5 \cdot SD(x). What is the correlation coefficient, r, between these variables? Give your answer as a simplified fraction.

Answer: \frac{20}{21}

We use the formula for slope:

\text{slope} = r \cdot \frac{\text{SD}(y)}{\text{SD}(x)}

From the mean points given, we can calculate the slope:

\frac{225 - 150}{187.5 - 165.0} = \frac{75}{22.5} = \frac{10}{3}

Since \text{SD}(y) = 3.5 \cdot \text{SD}(x), we plug this into the slope formula:

r \cdot 3.5 = \frac{10}{3}

Solving for r:

\begin{align*} r &= \frac{10}{3} \cdot \frac{1}{3.5} \\ &= \frac{10}{3} \cdot \frac{2}{7} \\ &= \frac{20}{21} \end{align*}


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 56%.


Problem 7.3

Suppose we convert all wizard heights from centimeters to inches (1 inch = 2.54 cm). Which of the following will change? Select all that apply.

Answer: Options 1, 4, and 5.

  • Standard deviation of wizard heights will change because heights are rescaled by a factor of 1/2.54.
  • Proportion within 3 standard deviations does NOT change since z-scores are unitless and the transformation is linear.
  • Correlation remains unchanged since it’s a unitless measure of linear relationship.
  • Slope predicting broom length from height will change because the x-variable (height) is rescaled while y remains the same.
  • Slope predicting height from broom length will change because the output is now in different units.

Difficulty: ⭐️⭐️

The average score on this problem was 80%.


Problem 7.4

Suppose we convert all wizard heights and all broomstick lengths from centimeters to inches (1 inch = 2.54 cm). Which of the following will change, as compared to the original data when both variables were measured in centimeters? Select all that apply.

Answer: None of the above

  • Correlation remains the same because it’s unitless.
  • Both slopes are unchanged because both SD(y) and SD(x) are rescaled by the same factor (1/2.54), so the ratio SD(y)/SD(x) stays the same. When both variables are converted by the same factor, their relative relationship remains unchanged.
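The effect of unit conversion on r and on the slope can be verified numerically, both for this problem and for the previous one where only heights were converted. The heights and broomstick lengths below are simulated for illustration and are not Professor Flitwick’s data; the helper name r_and_slope is hypothetical.

```python
import numpy as np

def r_and_slope(x, y):
    """Return (correlation, regression slope) for predicting y from x."""
    r = np.corrcoef(x, y)[0, 1]
    return r, r * np.std(y) / np.std(x)

# Simulated heights and broomstick lengths in cm, for illustration only.
rng = np.random.default_rng(1)
height_cm = rng.normal(175, 10, size=200)
broom_cm = 3 * height_cm + rng.normal(0, 20, size=200)

height_in, broom_in = height_cm / 2.54, broom_cm / 2.54

r_cm, slope_cm = r_and_slope(height_cm, broom_cm)      # both in cm
r_mixed, slope_mixed = r_and_slope(height_in, broom_cm)  # only heights in inches
r_in, slope_in = r_and_slope(height_in, broom_in)      # both in inches

print(r_cm, r_mixed, r_in)
print(slope_cm, slope_mixed, slope_in)
```

The correlation is identical in all three cases. The slope changes by a factor of 2.54 when only heights are converted and returns to its original value when both variables are converted.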

Difficulty: ⭐️

The average score on this problem was 95%.


Problem 7.5

Professor Flitwick calculates the root mean square error (RMSE) for his regression line to be 36 cm. What does this RMSE value suggest about the accuracy of the regression line’s broomstick length predictions?

Answer: None of the above

RMSE is the square root of the average squared differences between predicted and actual values. None of the options accurately describes what RMSE represents because:

  • It’s not the average absolute error (which would be MAE, not RMSE)
  • It doesn’t mean every residual equals 36 cm
  • It doesn’t tell us that predictions vary by wizard height
  • It does provide information about prediction accuracy, but not in the ways described

RMSE gives us the typical size of the error in the same units as the response variable. It tells us that the typical prediction error is around 36 cm, but this is not the same as any of the given options.
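To make the distinction between RMSE and the average absolute error concrete, here is a tiny sketch. The residuals are hypothetical values chosen only to illustrate the definitions; they are not taken from Professor Flitwick’s data.

```python
import numpy as np

# Hypothetical residuals in cm, chosen only to illustrate the definitions.
residuals = np.array([-60.0, -10.0, 5.0, 15.0, 50.0])

rmse = np.sqrt(np.mean(residuals ** 2))   # root mean squared error
mae = np.mean(np.abs(residuals))          # mean absolute error

print(round(rmse, 2), round(mae, 2))
```

For these values the RMSE is close to 36 cm while the mean absolute error is noticeably smaller and no single residual equals the RMSE, so an RMSE of 36 cm should not be read as either of those statements.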


Difficulty: ⭐️⭐️⭐️⭐️⭐️

The average score on this problem was 7%.


Problem 7.6

Which of the following plots is the residual plot for Professor Flitwick’s data?

Answer: Option B

A residual plot shows the difference between actual and predicted values plotted against the predictor variable (x).

Since broomsticks come in specific sizes (150, 175, 200, 225 cm), the residuals will form slanted lines across the x axis.

We can immediately rule out Option C, as all the points lie on 4 specific wizard heights, which is totally different from the original plot.

Now, pick a point on the original scatterplot that lies above the regression line (the highlighted point appears in a figure that is not reproduced here). Because this point is above the line, the difference between actual and predicted is positive (actual - predicted > 0).

Thus, if we check the corresponding point in Option A and Option B, we see that Option B’s graph agrees with the original scatterplot.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 59%.