The problems in this worksheet are taken from past exams. Work on
them on paper, since the exams you take in this course
will also be on paper.
We encourage you to complete this
worksheet in a live discussion section. Solutions will be made available
after all discussion sections have concluded. You don’t need to submit
your answers anywhere.
Note: We do not plan to cover all
problems here in the live discussion section; the problems we don’t
cover can be used for extra practice.
True or False: The slope of the regression line, when both variables are measured in standard units, is never more than 1.
True or False: The slope of the regression line, when both variables are measured in original units, is never more than 1.
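One way to explore the two statements above is to compute the regression slope in both unit systems on some data. The following is a minimal sketch on made-up data, not a solution key: the helper names (`standard_units`, `correlation`, `regression_slope`) and the synthetic dataset are assumptions introduced here for illustration. It relies on the standard facts that r is the mean of the products of the two variables in standard units, and that the slope of the regression line is r times SD(y) divided by SD(x).

```python
import numpy as np

def standard_units(vals):
    """Convert an array to standard units (mean 0, SD 1)."""
    vals = np.asarray(vals, dtype=float)
    return (vals - vals.mean()) / np.std(vals)

def correlation(x, y):
    """Correlation coefficient: mean of the products in standard units."""
    return (standard_units(x) * standard_units(y)).mean()

def regression_slope(x, y):
    """Slope of the regression line: r * SD(y) / SD(x)."""
    return correlation(x, y) * np.std(y) / np.std(x)

# Made-up data, purely for illustration (not from any exam dataset).
rng = np.random.default_rng(10)
x = rng.normal(50, 5, size=200)
y = 30 * x + rng.normal(0, 100, size=200)

r = correlation(x, y)
slope_original = regression_slope(x, y)
slope_standard = regression_slope(standard_units(x), standard_units(y))

print("r:", r)
print("slope in original units:", slope_original)
print("slope in standard units:", slope_standard)
```

Try changing the scale of `y` (for example, the multiplier 30) and compare how each of the two slopes responds.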
Let’s study the relationship between a penguin’s bill length (in millimeters) and mass (in grams). Suppose we’re given that
Which of the four scatter plots below describe the relationship between bill length and body mass, based on the information provided in the question?
Option 1
Option 2
Option 3
Option 4
Suppose we want to find the regression line that uses bill length, x, to predict body mass, y. The line is of the form y = mx + b. What are m and b?
What is m? Give your answer as a number without any units, rounded to three decimal places.
What is b? Give your answer as a number without units, rounded to three decimal places.
What is the predicted body mass (in grams) of a penguin whose bill length is 44 mm? Give your answer as a number without any units, rounded to three decimal places.
A particular penguin had a predicted body mass of 6800 grams. What is that penguin’s bill length (in mm)? Give your answer as a number without any units, rounded to three decimal places.
Below is the residual plot for our regression line.
Which of the following is a valid conclusion that we can draw solely from the residual plot above?
For this dataset, there is another line with a lower root mean squared error
The root mean squared error of the regression line is 0
The accuracy of the regression line’s predictions depends on bill length
The relationship between bill length and body mass is likely non-linear
None of the above
Suppose the price of an IKEA product and the cost to have it assembled are linearly associated with a correlation of 0.8. Product prices have a mean of 140 dollars and a standard deviation of 40 dollars. Assembly costs have a mean of 80 dollars and a standard deviation of 10 dollars. We want to predict the assembly cost of a product based on its price using linear regression.
The NORDMELA 4-drawer dresser sells for 200 dollars. How much do we predict its assembly cost to be?
The IDANÄS wardrobe sells for 80 dollars more than the KLIPPAN loveseat, so we expect the IDANÄS wardrobe will have a greater assembly cost than the KLIPPAN loveseat. How much do we predict the difference in assembly costs to be?
If we create a 95% prediction interval for the assembly cost of a 100 dollar product and another 95% prediction interval for the assembly cost of a 120 dollar product, which prediction interval will be wider?
The one for the 100 dollar product.
The one for the 120 dollar product.
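After working the IKEA questions by hand, a short script can be used to check the arithmetic. This is a sketch under the usual regression formulas (slope = r · SD(y) / SD(x), intercept = mean(y) − slope · mean(x)); the function names `regression_line` and `predict` are hypothetical helpers, and the two prices in the last line (180 and 100 dollars) are arbitrary values chosen only because they differ by 80 dollars.

```python
def regression_line(r, mean_x, sd_x, mean_y, sd_y):
    """Return (slope, intercept) of the regression line from summary statistics."""
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(x, slope, intercept):
    """Predicted y for a given x."""
    return slope * x + intercept

# Summary statistics as stated in the question: x is price, y is assembly cost.
slope, intercept = regression_line(r=0.8, mean_x=140, sd_x=40, mean_y=80, sd_y=10)

# Predicted assembly cost for a 200 dollar product.
print(predict(200, slope, intercept))

# Difference in predicted assembly cost for two products whose prices differ by 80 dollars
# (180 and 100 are arbitrary example prices).
print(predict(180, slope, intercept) - predict(100, slope, intercept))
```

Note that the difference in predictions depends only on the slope and the price gap, not on the particular prices chosen.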
In this question, we’ll explore the relationship between the ages and incomes of credit card applicants.
The credit card company that owns the data in apps, BruinCard, has decided not to give us access to the entire apps DataFrame, but instead just a sample of apps called small_apps. We’ll start by using the information in small_apps to compute the regression line that predicts the age of an applicant given their income.
For an applicant with an income that is \frac{8}{3} standard deviations above the mean income, we predict their age to be \frac{4}{5} standard deviations above the mean age. What is the correlation coefficient, r, between incomes and ages in small_apps? Give your answer as a fully simplified fraction.
Now, we want to predict the income of an applicant given their age. We will again use the information in small_apps to find the regression line. The regression line predicts that an applicant whose age is \frac{4}{5} standard deviations above the mean age has an income that is s standard deviations above the mean income. What is the value of s? Give your answer as a fully simplified fraction.
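Both questions above rest on the same fact: in standard units, the regression line's prediction is r times the input, and r is the same whichever variable is used as the predictor. The sketch below shows that relationship with exact fractions. The function names are hypothetical, and the numbers (2 and 1 standard deviations) are made up so that they do not match the question.

```python
from fractions import Fraction

def predict_in_standard_units(r, x_su):
    """Regression prediction in standard units: predicted y (SU) = r * x (SU)."""
    return r * x_su

def r_from_prediction(x_su, predicted_y_su):
    """Recover r from one (input, prediction) pair given in standard units."""
    return predicted_y_su / x_su

# Made-up example: an input 2 SDs above the mean leads to a prediction 1 SD above the mean.
r = r_from_prediction(Fraction(2), Fraction(1))
print("r =", r)

# Predicting in the other direction uses the same r.
print("prediction for an input 1 SD above the mean:", predict_in_standard_units(r, Fraction(1)))
```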
BruinCard has now taken away our access to both apps and small_apps, and has instead given us access to an even smaller sample of apps called mini_apps. In mini_apps, we know the following information:
- All incomes and ages are positive numbers.
- There is a positive linear association between incomes and ages.
We use the data in mini_apps to find the regression line that will allow us to predict the income of an applicant given their age. Just to test the limits of this regression line, we use it to predict the income of an applicant who is -2 years old, even though it doesn’t make sense for a person to have a negative age.
Let I be the regression line’s prediction of this applicant’s income. Which of the following inequalities are guaranteed to be satisfied? Select all that apply.
I < 0
I < \text{mean income}
| I - \text{mean income}| \leq | \text{mean age} + 2 |
\dfrac{| I - \text{mean income}|}{\text{standard deviation of incomes}} \leq \dfrac{| \text{mean age} + 2 |}{\text{standard deviation of ages}}
None of the above.
Yet again, BruinCard, the company that gave us access to apps, small_apps, and mini_apps, has revoked our access to those three DataFrames and instead has given us micro_apps, an even smaller sample of apps.
Using micro_apps, we are again interested in finding the regression line that will allow us to predict the income of an applicant given their age. We are given the following information:
Suppose the standard deviation of incomes in micro_apps is an integer multiple of the standard deviation of ages in micro_apps. That is,
\text{standard deviation of income} = k \cdot \text{standard deviation of age}.
What is the value of k? Give your answer as an integer.
Raine is helping settle a debate between two friends on the “superior” season: winter or summer. In doing so, they try to understand the relationship between the number of sunshine hours per month in January and the number of sunshine hours per month in July across all cities in California in sun.
Raine finds the regression line that predicts the number of sunshine hours in July (y) for a city given its number of sunshine hours in January (x). In doing so, they find that the correlation between the two variables is \frac{2}{5}.
Which of these could be a scatter plot of number of sunshine hours in July vs. number of sunshine hours in January?
Option 1
Option 2
Option 3
Option 4
Suppose the standard deviation of the number of sunshine hours in January for cities in California is equal to the standard deviation of the number of sunshine hours in July for cities in California.
Raine’s hometown of Santa Clarita saw 60 more sunshine hours in January than the average California city did. How many more sunshine hours than average does the regression line predict that Santa Clarita will have in July? Give your answer as a positive integer. (Hint: You’ll need to use the fact that the correlation between the two variables is \frac{2}{5}.)
As we know, San Diego was particularly cloudy this May. More generally, Anthony, another California native, feels that California is getting cloudier and cloudier overall.
To imagine what the dataset may look like in a few years, Anthony subtracts 5 from the number of sunshine hours in both January and July for all California cities in the dataset – i.e., he subtracts 5 from each x value and 5 from each y value in the dataset. He then creates a regression line to use the new xs to predict the new ys.
What is the slope of Anthony’s new regression line?
Suppose the intercept of Raine’s original regression line – that is, before Anthony subtracted 5 from each x and each y – was 10. What is the intercept of Anthony’s new regression line?
-7
-5
-3
0
3
5
7
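To check answers about Anthony's shifted dataset numerically, one can fit a line before and after subtracting a constant from every x and y. The sketch below uses made-up data and NumPy's `np.polyfit` for the least squares fit; the helper `fit_line` and the synthetic values are assumptions for illustration, not the data in sun.

```python
import numpy as np

def fit_line(x, y):
    """Least squares line through (x, y); returns (slope, intercept)."""
    slope, intercept = np.polyfit(x, y, 1)
    return slope, intercept

# Made-up sunshine-hour data, purely for illustration.
rng = np.random.default_rng(0)
x = rng.uniform(100, 250, size=50)               # hypothetical January hours
y = 0.4 * x + 200 + rng.normal(0, 20, size=50)   # hypothetical July hours

shift = 5
slope_before, intercept_before = fit_line(x, y)
slope_after, intercept_after = fit_line(x - shift, y - shift)

print("slope before / after:", slope_before, slope_after)
print("intercept before / after:", intercept_before, intercept_after)
```

Comparing the printed intercepts against the slope and the shift suggests how the new intercept relates to the old one; the same relationship can be derived by substituting x + 5 and y + 5 into the original line.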
Jasmine is trying to get as far away from Anthony as possible and has a trip to Chicago planned after finals. Chicago is known for being very warm and sunny in the summer but cold, rainy, and snowy in the winter. She decides to build a regression line that uses month of the year (where 1 is January, 2 is February, 12 is December, etc.) to predict the number of sunshine hours in Chicago.
The DataFrame games contains information about a sample of popular games. Besides other columns, there is a column "Complexity" that contains the average complexity of the game, a column "Rating" that contains the average rating of the game, and a column "Play Time" that contains the average play time of the game.
We use the regression line to predict a game’s "Rating" based on its "Complexity". We find that for the game Wingspan, which has a "Complexity" that is 2 points higher than the average, the predicted "Rating" is 3 points higher than the average.
What can you conclude about the correlation coefficient r?
r < 0
r = 0
r > 0
We cannot make any conclusions about the value of r based on this information alone.
What can you conclude about the standard deviations of “Complexity” and “Rating”?
SD of "Complexity" < SD of "Rating"
SD of "Complexity" = SD of "Rating"
SD of "Complexity" > SD of "Rating"
We cannot make any conclusions about the relationship between these two standard deviations based on this information alone.
Suppose that for children’s games, "Play Time" and "Rating" are negatively linearly associated due to children having short attention spans. Suppose that for children’s games, the standard deviation of "Play Time" is twice the standard deviation of "Rating", and the average "Play Time" is 10 minutes. We use linear regression to predict the "Rating" of a children’s game based on its "Play Time". The regression line predicts that Don’t Break the Ice, a children’s game with a "Play Time" of 8 minutes, will have a "Rating" of 4. Which of the following could be the average "Rating" for children’s games?
2
2.8
3.1
4
The American Kennel Club (AKC) organizes information about dog breeds. We’ve loaded their dataset into a DataFrame called df. The index of df contains the dog breed names as str values. Besides other columns, there is a column 'weight' (float) that contains typical weight (kg) and a column 'height' (float) that contains typical height (cm).
Sam wants to fit a linear model to predict a dog’s height
using its weight.
He first runs the following code:
x = df.get('weight')
y = df.get('height')

def su(vals):
    return (vals - vals.mean()) / np.std(vals)
Select all of the Python snippets that correctly compute the correlation coefficient into the variable r.
Snippet 1:
r = (su(x) * su(y)).mean()
Snippet 2:
r = su(x * y).mean()
Snippet 3:
t = 0
for i in range(len(x)):
    t = t + su(x[i]) * su(y[i])
r = t / len(x)
Snippet 4:
t = np.array([])
for i in range(len(x)):
    t = np.append(t, su(x)[i] * su(y)[i])
r = t.mean()
Snippet 1
Snippet 2
Snippet 3
Snippet 4
Sam computes the following statistics for his sample:
The best-fit line predicts that a dog with a weight of 10 kg has a height of 45 cm.
What is the SD of dog heights?
2
4.5
10
25
45
None of the above
Assume that the statistics in part b) still hold. Select all of the statements below that are true. (You don’t need to finish part b) in order to solve this question.)
The relationship between dog weight and height is linear.
The root mean squared error of the best-fit line is smaller than 5.
The best-fit line predicts that a dog that weighs 15 kg will be 50 cm tall.
The best-fit line predicts that a dog that weighs 10 kg will be shorter than 50 cm.
Are nonfiction books longer than fiction books?
Choose the best data science tool to help you answer this question.
hypothesis testing
permutation (A/B) testing
Central Limit Theorem
regression
Do people have more friends as they get older?
Choose the best data science tool to help you answer this question.
hypothesis testing
permutation (A/B) testing
Central Limit Theorem
regression
Does an ice cream shop sell more chocolate or vanilla ice cream cones?
Choose the best data science tool to help you answer this question.
hypothesis testing
permutation (A/B) testing
Central Limit Theorem
regression