Discussion 10: Regression

← return to practice.dsc10.com


The problems in this worksheet are taken from past exams. Work on them on paper, since the exams you take in this course will also be on paper.

We encourage you to complete this worksheet in a live discussion section. Solutions will be made available after all discussion sections have concluded. You don’t need to submit your answers anywhere.

Note: We do not plan to cover all problems here in the live discussion section; the problems we don’t cover can be used for extra practice.


Problem 1


Problem 1.1

True or False: The slope of the regression line, when both variables are measured in standard units, is never more than 1.


Problem 1.2

True or False: The slope of the regression line, when both variables are measured in original units, is never more than 1.



Problem 2

Let’s study the relationship between a penguin’s bill length (in millimeters) and mass (in grams). Suppose we’re given that


Problem 2.1

Which of the four scatter plots below describe the relationship between bill length and body mass, based on the information provided in the question?



Problem 2.2

Suppose we want to find the regression line that uses bill length, x, to predict body mass, y. The line is of the form y = mx +\ b. What are m and b?

What is m? Give your answer as a number without any units, rounded to three decimal places.

What is b? Give your answer as a number without units, rounded to three decimal places.


Problem 2.3

What is the predicted body mass (in grams) of a penguin whose bill length is 44 mm? Give your answer as a number without any units, rounded to three decimal places.


Problem 2.4

A particular penguin had a predicted body mass of 6800 grams. What is that penguin’s bill length (in mm)? Give your answer as a number without any units, rounded to three decimal places.


Problem 2.5

Below is the residual plot for our regression line.


Which of the following is a valid conclusion that we can draw solely from the residual plot above?



Problem 3

Suppose the price of an IKEA product and the cost to have it assembled are linearly associated with a correlation of 0.8. Product prices have a mean of 140 dollars and a standard deviation of 40 dollars. Assembly costs have a mean of 80 dollars and a standard deviation of 10 dollars. We want to predict the assembly cost of a product based on its price using linear regression.


Problem 3.1

The NORDMELA 4-drawer dresser sells for 200 dollars. How much do we predict its assembly cost to be?


Problem 3.2

The IDANÄS wardrobe sells for 80 dollars more than the KLIPPAN loveseat, so we expect the IDANÄS wardrobe will have a greater assembly cost than the KLIPPAN loveseat. How much do we predict the difference in assembly costs to be?


Problem 3.3

If we create a 95% prediction interval for the assembly cost of a 100 dollar product and another 95% prediction interval for the assembly cost of a 120 dollar product, which prediction interval will be wider?



Problem 4

In this question, we’ll explore the relationship between the ages and incomes of credit card applicants.


Problem 4.1

The credit card company that owns the data in apps, BruinCard, has decided not to give us access to the entire apps DataFrame, but instead just a sample of apps called small apps. We’ll start by using the information in small_apps to compute the regression line that predicts the age of an applicant given their income.

For an applicant with an income that is \frac{8}{3} standard deviations above the mean income, we predict their age to be \frac{4}{5} standard deviations above the mean age. What is the correlation coefficient, r, between incomes and ages in small_apps? Give your answer as a fully simplified fraction.


Problem 4.2

Now, we want to predict the income of an applicant given their age. We will again use the information in small_apps to find the regression line. The regression line predicts that an applicant whose age is \frac{4}{5} standard deviations above the mean age has an income that is s standard deviations above the mean income. What is the value of s? Give your answer as a fully simplified fraction.


Problem 4.3

BruinCard has now taken away our access to both apps and small_apps, and has instead given us access to an even smaller sample of apps called mini_apps. In mini_apps, we know the following information: - All incomes and ages are positive numbers. - There is a positive linear association between incomes and ages.

We use the data in mini_apps to find the regression line that will allow us to predict the income of an applicant given their age. Just to test the limits of this regression line, we use it to predict the income of an applicant who is -2 years old, even though it doesn’t make sense for a person to have a negative age.

Let I be the regression line’s prediction of this applicant’s income. Which of the following inequalities are guaranteed to be satisfied? Select all that apply.


Problem 4.4

Yet again, BruinCard, the company that gave us access to apps, small_apps, and mini_apps, has revoked our access to those three DataFrames and instead has given us micro_apps, an even smaller sample of apps.

Using micro_apps, we are again interested in finding the regression line that will allow us to predict the income of an applicant given their age. We are given the following information:

Suppose the standard deviation of incomes in micro_apps is an integer multiple of the standard deviation of ages in micro_apps. That is,

\text{standard deviation of income} = k \cdot \text{standard deviation of age}.

What is the value of k? Give your answer as an integer.



Problem 5

Raine is helping settle a debate between two friends on the “superior" season — winter or summer. In doing so, they try to understand the relationship between the number of sunshine hours per month in January and the number of sunshine hours per month in July across all cities in California in sun.

Raine finds the regression line that predicts the number of sunshine hours in July (y) for a city given its number of sunshine hours in January (x). In doing so, they find that the correlation between the two variables is \frac{2}{5}.


Problem 5.1

Which of these could be a scatter plot of number of sunshine hours in July vs. number of sunshine hours in January?



Problem 5.2

Suppose the standard deviation of the number of sunshine hours in January for cities in California is equal to the standard deviation of the number of sunshine hours in July for cities in California.

Raine’s hometown of Santa Clarita saw 60 more sunshine hours in January than the average California city did. How many more sunshine hours than average does the regression line predict that Santa Clarita will have in July? Give your answer as a positive integer. (Hint: You’ll need to use the fact that the correlation between the two variables is \frac{2}{5}.)


As we know, San Diego was particularly cloudy this May. More generally, Anthony, another California native, feels that California is getting cloudier and cloudier overall.

To imagine what the dataset may look like in a few years, Anthony subtracts 5 from the number of sunshine hours in both January and July for all California cities in the dataset – i.e., he subtracts 5 from each x value and 5 from each y value in the dataset. He then creates a regression line to use the new xs to predict the new ys.


Problem 5.3

What is the slope of Anthony’s new regression line?



Problem 5.4

Suppose the intercept of Raine’s original regression line – that is, before Anthony subtracted 5 from each x and each y – was 10. What is the intercept of Anthony’s new regression line?


Jasmine is trying to get as far away from Anthony as possible and has a trip to Chicago planned after finals. Chicago is known for being very warm and sunny in the summer but cold, rainy, and snowy in the winter. She decides to build a regression line that uses month of the year (where 1 is January, 2 is February, 12 is December, etc.) to predict the number of sunshine hours in Chicago.


Problem 6

The DataFrame games contains information about a sample of popular games. Besides other columns, there is a column "Complexity" that contains the average complexity of the game, a column "Rating" that contains the average rating of the game, and a column "Play Time" that contains the average play time of the game.

We use the regression line to predict a game’s "Rating" based on its "Complexity". We find that for the game Wingspan, which has a "Complexity" that is 2 points higher than the average, the predicted "Rating" is 3 points higher than the average.


Problem 6.1

What can you conclude about the correlation coefficient r?


Problem 6.2

What can you conclude about the standard deviations of “Complexity” and “Rating”?



Problem 7

Suppose that for children’s games, "Play Time" and "Rating" are negatively linearly associated due to children having short attention spans. Suppose that for children’s games, the standard deviation of "Play Time" is twice the standard deviation of "Rating", and the average "Play Time" is 10 minutes. We use linear regression to predict the "Rating" of a children’s game based on its "Play Time". The regression line predicts that Don’t Break the Ice, a children’s game with a "Play Time" of 8 minutes will have a "Rating" of 4. Which of the following could be the average "Rating" for children’s games?


Problem 8

The American Kennel Club (AKC) organizes information about dog breeds. We’ve loaded their dataset into a DataFrame called df. The index of df contains the dog breed names as str values. Besides other columns, there is a column 'weight' (float) that contains typical weight (kg) and a column 'height' (float) that contains typical height (cm).

Sam wants to fit a linear model to predict a dog’s height using its weight.

He first runs the following code:

x = df.get('weight')
y = df.get('height')

def su(vals):
    return (vals - vals.mean()) / np.std(vals)


Problem 8.1

Select all of the Python snippets that correctly compute the correlation coefficient into the variable r.

Snippet 1:

r = (su(x) * su(y)).mean()

Snippet 2:

r = su(x * y).mean()

Snippet 3:

t = 0
for i in range(len(x)):
    t = t + su(x[i]) * su(y[i])
r = t / len(x)

Snippet 4:

t = np.array([])
for i in range(len(x)):
    t = np.append(t, su(x)[i] * su(y)[i])
r = t.mean()


Problem 8.2

Sam computes the following statistics for his sample:

The best-fit line predicts that a dog with a weight of 10 kg has a height of 45 cm.

What is the SD of dog heights?


Problem 8.3

Assume that the statistics in part b) still hold. Select all of the statements below that are true. (You don’t need to finish part b) in order to solve this question.)



Problem 9


Problem 9.1

Are nonfiction books longer than fiction books?

Choose the best data science tool to help you answer this question.


Problem 9.2

Do people have more friends as they get older?

Choose the best data science tool to help you answer this question.


Problem 9.3

Does an ice cream shop sell more chocolate or vanilla ice cream cones?

Choose the best data science tool to help you answer this question.



👋 Feedback: Find an error? Still confused? Have a suggestion? Let us know here.