Welcome! The problems shown below should be worked on
paper, since the quizzes and exams you take in this course will
also be on paper. You do not need to submit your solutions anywhere.
We encourage you to complete this worksheet in groups during an
extra practice session on Friday, March 8th. Solutions will be posted
after all sessions have finished. This problem set is not designed to
take any particular amount of time - focus on understanding concepts,
not on getting through all the questions.
For the next several problems, we will use data from the 2021 Women’s National Basketball Association (WNBA) season. In basketball, players score points by shooting the ball into a hoop. The team that scores the most points wins the game.
Kelsey Plum, a WNBA player, attended La Jolla Country Day School,
which is adjacent to UCSD’s campus. Her current team is the Las Vegas
Aces (three-letter code 'LVA'
). In 2021, the Las
Vegas Aces played 31 games, and Kelsey Plum played in all
31.
The DataFrame plum
contains her stats for all games the
Las Vegas Aces played in 2021. The first few rows of plum
are shown below (though the full DataFrame has 31 rows, not 5):
Each row in plum corresponds to a single game. For each game, we have:
- 'Date' (str), the date on which the game was played
- 'Opp' (str), the three-letter code of the opponent team
- 'Home' (bool), True if the game was played in Las Vegas (“home”) and False if it was played at the opponent’s arena (“away”)
- 'Won' (bool), True if the Las Vegas Aces won the game and False if they lost
- 'PTS' (int), the number of points Kelsey Plum scored in the game
- 'AST' (int), the number of assists (passes) Kelsey Plum made in the game
- 'TOV' (int), the number of turnovers Kelsey Plum made in the game (a turnover is when you lose the ball – turnovers are bad!)

Consider the definition of the function diff_in_group_means:
def diff_in_group_means(df, group_col, num_col):
    s = df.groupby(group_col).mean().get(num_col)
    return s.loc[False] - s.loc[True]
It turns out that Kelsey Plum averages 0.61 more assists in games
that she wins (“winning games”) than in games that she loses (“losing
games”). Fill in the blanks below so that observed_diff
evaluates to -0.61.
observed_diff = diff_in_group_means(plum, __(a)__, __(b)__)
What goes in blank (a)?
What goes in blank (b)?
Answers: 'Won'
, 'AST'
To compute the number of assists Kelsey Plum averages in winning and
losing games, we need to group by 'Won'
. Once doing so, and
using the .mean()
aggregation method, we need to access
elements in the 'AST'
column.
The second argument to diff_in_group_means
,
group_col
, is the column we’re grouping by, and so blank
(a) must be filled by 'Won'
. Then, the third argument,
num_col
, must be 'AST'
.
Note that after extracting the Series containing the average number
of assists in wins and losses, we are returning the value with the index
False
(“loss”) minus the value with the index
True
(“win”). So, throughout this problem, keep in mind
that we are computing “losses minus wins”. Since our observation was
that she averaged 0.61 more assists in wins than in losses, it makes
sense that diff_in_group_means(plum, 'Won', 'AST')
is -0.61
(rather than +0.61).
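For concreteness, here is the call with both blanks filled in — a minimal sketch, assuming plum and diff_in_group_means are defined as above:

observed_diff = diff_in_group_means(plum, 'Won', 'AST')
# Roughly -0.61: average assists in losses minus average assists in wins.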
The average score on this problem was 94%.
After observing that Kelsey Plum averages more assists in winning games than in losing games, we become interested in conducting a permutation test for the following hypotheses:
Null Hypothesis: Kelsey Plum’s number of assists in winning games and her number of assists in losing games come from the same distribution.
Alternative Hypothesis: Kelsey Plum’s number of assists in winning games is higher on average than her number of assists in losing games.
To conduct our permutation test, we place the following code in a
for
-loop.
won = plum.get('Won')
ast = plum.get('AST')
shuffled = plum.assign(Won_shuffled=np.random.permutation(won)) \
               .assign(AST_shuffled=np.random.permutation(ast))
Which of the following options does not compute a valid simulated test statistic for this permutation test?
diff_in_group_means(shuffled, 'Won', 'AST')
diff_in_group_means(shuffled, 'Won', 'AST_shuffled')
diff_in_group_means(shuffled, 'Won_shuffled', 'AST')
diff_in_group_means(shuffled, 'Won_shuffled', 'AST_shuffled')
More than one of these options do not compute a valid simulated test statistic for this permutation test
Answer:
diff_in_group_means(shuffled, 'Won', 'AST')
As we saw in the previous subpart,
diff_in_group_means(shuffled, 'Won', 'AST')
computes the
observed test statistic, which is -0.61. There is no randomness involved
in the observed test statistic; each time we run the line
diff_in_group_means(shuffled, 'Won', 'AST')
we will see the
same result, so this cannot be used for simulation.
To perform a permutation test here, we need to simulate under the null by randomly assigning assist counts to groups; here, the groups are “win” and “loss”.
As such, Options 2 through 4 are all valid, and Option 1 is the only invalid one.
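To make this concrete, here is one way the full simulation might look, using Option 2’s statistic. This is a sketch that assumes plum, diff_in_group_means, and numpy (as np) are available as above; the array name simulated_stats is our own choice, not from the problem.

simulated_stats = np.array([])
for i in np.arange(10000):
    won = plum.get('Won')
    ast = plum.get('AST')
    shuffled = plum.assign(Won_shuffled=np.random.permutation(won)) \
                   .assign(AST_shuffled=np.random.permutation(ast))
    # Real 'Won' labels paired with shuffled assist counts (Option 2).
    stat = diff_in_group_means(shuffled, 'Won', 'AST_shuffled')
    simulated_stats = np.append(simulated_stats, stat)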
The average score on this problem was 68%.
Suppose we generate 10,000 simulated test statistics, using one of
the valid options from part 1. The empirical distribution of test
statistics, with a red line at observed_diff
, is shown
below.
Roughly one-quarter of the area of the histogram above is to the left of the red line. What is the correct interpretation of this result?
There is roughly a one quarter probability that Kelsey Plum’s number of assists in winning games and in losing games come from the same distribution.
The significance level of this hypothesis test is roughly a quarter.
Under the assumption that Kelsey Plum’s number of assists in winning games and in losing games come from the same distribution, and that she wins 22 of the 31 games she plays, the chance of her averaging at least 0.61 more assists in wins than losses is roughly a quarter.
Under the assumption that Kelsey Plum’s number of assists in winning games and in losing games come from the same distribution, and that she wins 22 of the 31 games she plays, the chance of her averaging 0.61 more assists in wins than losses is roughly a quarter.
Answer: Under the assumption that Kelsey Plum’s number of assists in winning games and in losing games come from the same distribution, and that she wins 22 of the 31 games she plays, the chance of her averaging at least 0.61 more assists in wins than losses is roughly a quarter. (Option 3)
First, we should note that the area to the left of the red line (a quarter) is the p-value of our hypothesis test. Generally, the p-value is the probability of observing an outcome as or more extreme than the observed, under the assumption that the null hypothesis is true. The direction to look in depends on the alternate hypothesis; here, since our alternative hypothesis is that the number of assists Kelsey Plum makes in winning games is higher on average than in losing games, a “more extreme” outcome is where the assists in winning games are higher than in losing games, i.e. where \text{(assists in wins)} - \text{(assists in losses)} is positive or where \text{(assists in losses)} - \text{(assists in wins)} is negative. As mentioned in the solution to the first subpart, our test statistic is \text{(assists in losses)} - \text{(assists in wins)}, so a more extreme outcome is one where this is negative, i.e. to the left of the observed statistic.
Let’s first rule out the first two options. Option 1 is incorrect because the p-value is not the probability that the null hypothesis is true; it is a probability computed under the assumption that the null hypothesis is true. Option 2 is incorrect because the significance level is a cutoff we choose before running the test; it is not something we read off of the empirical distribution of test statistics.
Now, the only difference between Options 3 and 4 is the inclusion of “at least” in Option 3. Remember, to compute a p-value we must compute the probability of observing something as or more extreme than the observed, under the null. The “or more” corresponds to “at least” in Option 3. As such, Option 3 is the correct choice.
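As a sketch of how this p-value would be computed in code (assuming observed_diff and the simulated_stats array from the earlier sketches): since “more extreme” here means further to the left, we count the simulated statistics that are at most the observed one.

p_value = np.count_nonzero(simulated_stats <= observed_diff) / 10000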
The average score on this problem was 70%.
An IKEA fan created an app where people can log the amount of time it
took them to assemble their IKEA furniture. The DataFrame
app_data
has a row for each product build that was logged
on the app. The column 'product'
contains the name of the
product, and the column 'minutes'
contains integer values
representing the number of minutes it took to assemble each product.
You are browsing the IKEA showroom, deciding whether to purchase the
BILLY bookcase or the LOMMARP bookcase. You are concerned about the
amount of time it will take to assemble your new bookcase, so you look
up the assembly times reported in app_data
. Thinking of the
data in app_data
as a random sample of all IKEA purchases,
you want to perform a permutation test to test the following
hypotheses.
Null Hypothesis: The assembly time for the BILLY bookcase and the assembly time for the LOMMARP bookcase come from the same distribution.
Alternative Hypothesis: The assembly time for the BILLY bookcase and the assembly time for the LOMMARP bookcase come from different distributions.
Suppose we query app_data
to keep only the BILLY
bookcases, then average the 'minutes'
column. In addition,
we separately query app_data
to keep only the LOMMARP
bookcases, then average the 'minutes'
column. If the null
hypothesis is true, which of the following statements about these two
averages is correct?
These two averages are the same.
Any difference between these two averages is due to random chance.
Any difference between these two averages cannot be ascribed to random chance alone.
The difference between these averages is statistically significant.
Answer: Any difference between these two averages is due to random chance.
If the null hypothesis is true, this means that the time recorded in
app_data
for each BILLY bookcase is a random number that
comes from some distribution, and the time recorded in
app_data
for each LOMMARP bookcase is a random number that
comes from the same distribution. Each assembly time is a
random number, so even if the null hypothesis is true, if we take one
person who assembles a BILLY bookcase and one person who assembles a
LOMMARP bookcase, there is no guarantee that their assembly times will
match. Their assembly times might match, or they might be different,
because assembly time is random. Randomness is the only reason that
their assembly times might be different, as the null hypothesis says
there is no systematic difference in assembly times between the two
bookcases. Specifically, it’s not the case that one typically takes
longer to assemble than the other.
With those points in mind, let’s go through the answer choices.
The first answer choice is incorrect. Just because two sets of
numbers are drawn from the same distribution, the numbers themselves
might be different due to randomness, and the averages might also be
different. Maybe just by chance, the people who assembled the BILLY
bookcases and recorded their times in app_data
were slower
on average than the people who assembled LOMMARP bookcases. If the null
hypothesis is true, this difference in average assembly time should be
small, but it very likely exists to some degree.
The second answer choice is correct. If the null hypothesis is true, the only reason for the difference is random chance alone.
The third answer choice is incorrect for the same reason that the second answer choice is correct. If the null hypothesis is true, any difference must be explained by random chance.
The fourth answer choice is incorrect. If there is a difference between the averages, it should be very small and not statistically significant. In other words, if we did a hypothesis test and the null hypothesis was true, we should fail to reject the null.
The average score on this problem was 77%.
For the permutation test, we’ll use as our test statistic the average assembly time for BILLY bookcases minus the average assembly time for LOMMARP bookcases, in minutes.
Complete the code below to generate one simulated value of the test
statistic in a new way, without using
np.random.permutation
.
billy = (app_data.get('product') ==
         'BILLY Bookcase, white, 31 1/2x11x79 1/2')
lommarp = (app_data.get('product') ==
           'LOMMARP Bookcase, dark blue-green, 25 5/8x78 3/8')
billy_lommarp = app_data[billy | lommarp]
billy_mean = np.random.choice(billy_lommarp.get('minutes'), billy.sum(), replace=False).mean()
lommarp_mean = _________
billy_mean - lommarp_mean
What goes in the blank?
billy_lommarp[lommarp].get('minutes').mean()
np.random.choice(billy_lommarp.get('minutes'), lommarp.sum(), replace=False).mean()
billy_lommarp.get('minutes').mean() - billy_mean
(billy_lommarp.get('minutes').sum() - billy_mean * billy.sum())/lommarp.sum()
Answer:
(billy_lommarp.get('minutes').sum() - billy_mean * billy.sum())/lommarp.sum()
The first line of code creates a boolean Series with a True value for
every BILLY bookcase, and the second line of code creates the analogous
Series for the LOMMARP bookcase. The third line queries to define a
DataFrame called billy_lommarp
containing all products that
are BILLY or LOMMARP bookcases. In other words, this DataFrame contains
a mix of BILLY and LOMMARP bookcases.
From this point, the way we would normally proceed in a permutation
test would be to use np.random.permutation
to shuffle one
of the two relevant columns (either 'product'
or
'minutes'
) to create a random pairing of assembly times
with products. Then we would calculate the average of all assembly times
that were randomly assigned to the label BILLY. Similarly, we’d
calculate the average of all assembly times that were randomly assigned
to the label LOMMARP. Then we’d subtract these averages to get one
simulated value of the test statistic. To run the permutation test, we’d
have to repeat this process many times.
In this problem, we need to generate a simulated value of the test
statistic, without randomly shuffling one of the columns. The code
starts us off by defining a variable called billy_mean
that
comes from using np.random.choice
. There’s a lot going on
here, so let’s break it down. Remember that the first argument to
np.random.choice
is a sequence of values to choose from,
and the second is the number of random choices to make. And we set
replace=False
, so that no element that has already been
chosen can be chosen again. Here, we’re making our random choices from
the 'minutes'
column of billy_lommarp
. The
number of choices to make from this collection of values is
billy.sum()
, which is the sum of all values in the
billy
Series defined in the first line of code. The
billy
Series contains True/False values, but in Python,
True counts as 1 and False counts as 0, so billy.sum()
evaluates to the number of True entries in billy
, which is
the number of BILLY bookcases recorded in app_data
. It
helps to think of the random process like this: put all of the BILLY and LOMMARP assembly times from billy_lommarp into a bag, mix them up, and draw billy.sum() of them at random, without replacement.
If we think of the random times we draw as being labeled BILLY, then the remaining assembly times still leftover in the bag represent the assembly times randomly labeled LOMMARP. In other words, this is a random association of assembly times to labels (BILLY or LOMMARP), which is the same thing we usually accomplish by shuffling in a permutation test.
From here, we can proceed the same way as usual. First, we need to
calculate the average of all assembly times that were randomly assigned
to the label BILLY. This is done for us and stored in
billy_mean
. We also need to calculate the average of all
assembly times that were randomly assigned the label LOMMARP. We’ll call
that lommarp_mean
. Thinking of picking times out of a large
bag, this is the average of all the assembly times left in the bag. The
problem is there is no easy way to access the assembly times that were
not picked. We can take advantage of the fact that we can easily
calculate the total assembly time of all BILLY and LOMMARP bookcases
together with billy_lommarp.get('minutes').sum()
. Then if
we subtract the total assembly time of all bookcases randomly labeled
BILLY, we’ll be left with the total assembly time of all bookcases
randomly labeled LOMMARP. That is,
billy_lommarp.get('minutes').sum() - billy_mean * billy.sum()
represents the total assembly time of all bookcases randomly labeled
LOMMARP. The count of the number of LOMMARP bookcases is given by
lommarp.sum()
so the average is
(billy_lommarp.get('minutes').sum() - billy_mean * billy.sum())/lommarp.sum()
.
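Putting it together, here is a sketch of one complete iteration with the blank filled in, assuming app_data is loaded and billy, lommarp, and billy_lommarp are defined as in the problem (the name simulated_stat is ours):

billy_mean = np.random.choice(billy_lommarp.get('minutes'), billy.sum(),
                              replace=False).mean()
# Total minutes not assigned to BILLY, divided by the number of LOMMARP rows.
lommarp_mean = (billy_lommarp.get('minutes').sum()
                - billy_mean * billy.sum()) / lommarp.sum()
simulated_stat = billy_mean - lommarp_mean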
A common wrong answer for this question was the second answer choice,
np.random.choice(billy_lommarp.get('minutes'), lommarp.sum(), replace=False).mean()
.
This mimics the structure of how billy_mean
was defined so
it’s a natural guess. However, this corresponds to the following random
process, which doesn’t associate each assembly with a unique label
(BILLY or LOMMARP): independently draw lommarp.sum() assembly times at random from the full bag of BILLY and LOMMARP times, without first removing the billy.sum() times that were already drawn for billy_mean.
We could easily get the same assembly time once for BILLY and once for LOMMARP, while other assembly times could get picked for neither. This process doesn’t split the data into two random groups as desired.
The average score on this problem was 12%.
The DataFrame apps
contains application data for a
random sample of 1,000 applicants for a particular credit card from the
1990s. The columns are:
"status" (str)
: Whether the credit card application
was approved: "approved"
or "denied"
values
only.
"age" (float)
: The applicant’s age, in years, to the
nearest twelfth of a year.
"income" (float)
: The applicant’s annual income, in
tens of thousands of dollars.
"homeowner" (str)
: Whether the credit card applicant
owns their own home: "yes"
or "no"
values
only.
"dependents" (int)
: The number of dependents, or
individuals that rely on the applicant as a primary source of income,
such as children.
The first few rows of apps
are shown below, though
remember that apps
has 1,000 rows.
In apps
, our sample of 1,000 credit card applications,
applicants who were approved for the credit card have fewer dependents,
on average, than applicants who were denied. The mean number of
dependents for approved applicants is 0.98, versus 1.07 for denied
applicants.
To test whether this difference is purely due to random chance, or whether the distributions of the number of dependents for approved and denied applicants are truly different in the population of all credit card applications, we decide to perform a permutation test.
Consider the incomplete code block below.
def shuffle_status(df):
    shuffled_status = np.random.permutation(df.get("status"))
    return df.assign(status=shuffled_status).get(["status", "dependents"])

def test_stat(df):
    grouped = df.groupby("status").mean().get("dependents")
    approved = grouped.loc["approved"]
    denied = grouped.loc["denied"]
    return __(a)__

stats = np.array([])
for i in np.arange(10000):
    shuffled_apps = shuffle_status(apps)
    stat = test_stat(shuffled_apps)
    stats = np.append(stats, stat)

p_value = np.count_nonzero(__(b)__) / 10000
Below are six options for filling in blanks (a) and (b) in the code above.
| | Blank (a) | Blank (b) |
|---|---|---|
| Option 1 | denied - approved | stats >= test_stat(apps) |
| Option 2 | denied - approved | stats <= test_stat(apps) |
| Option 3 | approved - denied | stats >= test_stat(apps) |
| Option 4 | np.abs(denied - approved) | stats >= test_stat(apps) |
| Option 5 | np.abs(denied - approved) | stats <= test_stat(apps) |
| Option 6 | np.abs(approved - denied) | stats >= test_stat(apps) |
The correct way to fill in the blanks depends on how we choose our null and alternative hypotheses.
Suppose we choose the following pair of hypotheses.
Null Hypothesis: In the population, the number of dependents of approved and denied applicants come from the same distribution.
Alternative Hypothesis: In the population, the number of dependents of approved applicants and denied applicants do not come from the same distribution.
Which of the six presented options could correctly fill in blanks (a) and (b) for this pair of hypotheses? Select all that apply.
Option 1
Option 2
Option 3
Option 4
Option 5
Option 6
None of the above.
Answer: Option 4, Option 6
For blank (a), we want to choose a test statistic that helps us
distinguish between the null and alternative hypotheses. The alternative
hypothesis says that denied
and approved
should be different, but it doesn’t say which should be larger. Options
1 through 3 therefore won’t work, because high values and low values of
these statistics both point to the alternative hypothesis, and moderate
values point to the null hypothesis. Options 4 through 6 all work
because large values point to the alternative hypothesis, and small
values close to 0 suggest that the null hypothesis should be true.
For blank (b), we want to calculate the p-value in such a way that it
represents the proportion of trials for which the simulated test
statistic was equal to the observed statistic or further in the
direction of the alternative. For all of Options 4 through 6, large
values of the test statistic indicate the alternative, so we need to
calculate the p-value with a >=
sign, as in Options 4
and 6.
While Option 3 filled in blank (a) correctly, it did not fill in blank (b) correctly. Options 4 and 6 fill in both blanks correctly.
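Written out, Option 4 corresponds to the following fills — a sketch that reuses shuffle_status, apps, and the stats array from the code above:

def test_stat(df):
    grouped = df.groupby("status").mean().get("dependents")
    approved = grouped.loc["approved"]
    denied = grouped.loc["denied"]
    return np.abs(denied - approved)                               # blank (a)

p_value = np.count_nonzero(stats >= test_stat(apps)) / 10000      # blank (b)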
The average score on this problem was 78%.
Now, suppose we choose the following pair of hypotheses.
Null Hypothesis: In the population, the number of dependents of approved and denied applicants come from the same distribution.
Alternative Hypothesis: In the population, the number of dependents of approved applicants is smaller on average than the number of dependents of denied applicants.
Which of the six presented options could correctly fill in blanks (a) and (b) for this pair of hypotheses? Select all that apply.
Answer: Option 1
As in the previous part, we need to fill blank (a) with a test
statistic such that large values point towards one of the hypotheses and
small values point towards the other. Here, the alternative hypothesis
suggests that approved
should be less than
denied
, so we can’t use Options 4 through 6 because these
can only detect whether approved
and denied
are different, not which is larger. Any of Options 1 through 3
should work, however. For Options 1 and 2, large values point towards
the alternative, and for Option 3, small values point towards the
alternative. This means we need to calculate the p-value in blank (b)
with a >=
symbol for the test statistic from Options 1
and 2, and a <=
symbol for the test statistic from
Option 3. Only Option 1 fills in blank (b) correctly based on the test
statistic used in blank (a).
The average score on this problem was 83%.
Option 6 from the start of this question is repeated below.
| | Blank (a) | Blank (b) |
|---|---|---|
| Option 6 | np.abs(approved - denied) | stats >= test_stat(apps) |
We want to create a new option, Option 7, that replicates the behavior of Option 6, but with blank (a) filled in as shown:
| | Blank (a) | Blank (b) |
|---|---|---|
| Option 7 | approved - denied | |
Which expression below could go in blank (b) so that Option 7 is equivalent to Option 6?
np.abs(stats) >= test_stat(apps)
stats >= np.abs(test_stat(apps))
np.abs(stats) >= np.abs(test_stat(apps))
np.abs(stats >= test_stat(apps))
Answer:
np.abs(stats) >= np.abs(test_stat(apps))
First, we need to understand how Option 6 works. Option 6 produces
large values of the test statistic when approved
is very
different from denied
, then calculates the p-value as the
proportion of trials for which the simulated test statistic was larger
than the observed statistic. In other words, Option 6 calculates the
proportion of trials in which approved
and
denied
are more different in a pair of random samples than
they are in the original samples.
For Option 7, the test statistic for a pair of random samples may
come out very large or very small when approved
is very
different from denied
. Similarly, the observed statistic
may come out very large or very small when approved
and
denied
are very different in the original samples. We want
to find the proportion of trials in which approved
and
denied
are more different in a pair of random samples than
they are in the original samples, which means we want the proportion of
trials in which the absolute value of approved - denied
in
a pair of random samples is larger than the absolute value of
approved - denied
in the original samples.
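In code, Option 7’s blank (b) would therefore look like the following sketch, using the stats array and test_stat from above with blank (a) returning approved - denied:

p_value = np.count_nonzero(np.abs(stats) >= np.abs(test_stat(apps))) / 10000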
The average score on this problem was 56%.
In our implementation of this permutation test, we followed the
procedure outlined in lecture to draw new pairs of samples under the
null hypothesis and compute test statistics — that is, we randomly
assigned each row to a group (approved or denied) by shuffling one of
the columns in apps
, then computed the test statistic on
this random pair of samples.
Let’s now explore an alternative solution to drawing pairs of samples under the null hypothesis and computing test statistics. Here’s the approach:
"dependents"
column as the new “denied” sample, and the values at the at the bottom
of the resulting "dependents"
column as the new “approved”
sample. Note that we don’t necessarily split the DataFrame exactly in
half — the sizes of these new samples depend on the number of “denied”
and “approved” values in the original DataFrame!Once we generate our pair of random samples in this way, we’ll compute the test statistic on the random pair, as usual. Here, we’ll use as our test statistic the difference between the mean number of dependents for denied and approved applicants, in the order denied minus approved.
Fill in the blanks to complete the simulation below.
Hint: np.random.permutation
shouldn’t appear
anywhere in your code.
def shuffle_all(df):
    '''Returns a DataFrame with the same rows as df, but reordered.'''
    return __(a)__

def fast_stat(df):
    # This function does not and should not contain any randomness.
    denied = np.count_nonzero(df.get("status") == "denied")
    mean_denied = __(b)__.get("dependents").mean()
    mean_approved = __(c)__.get("dependents").mean()
    return mean_denied - mean_approved

stats = np.array([])
for i in np.arange(10000):
    stat = fast_stat(shuffle_all(apps))
    stats = np.append(stats, stat)
Answer: The blanks should be filled in as follows:

(a): df.sample(df.shape[0])
(b): df.take(np.arange(denied))
(c): df.take(np.arange(denied, df.shape[0]))
For blank (a), we are told to return a DataFrame with the same rows
but in a different order. We can use the .sample
method for
this question. We want each row of the input DataFrame df
to appear once, so we should sample without replacement, and we should
have as many rows in the output as in df
, so our sample
should be of size df.shape[0]
. Since sampling without
replacement is the default behavior of .sample
, it is
optional to specify replace=False
.
The average score on this problem was 59%.
For blank (b), we need to implement the strategy outlined, where
after we shuffle the DataFrame, we use the values at the top of the
DataFrame as our new “denied” sample. In a permutation test, the two
random groups we create should have the same sizes as the two original
groups we are given. In this case, the size of the “denied” group in our
original data is stored in the variable denied
. So we need
the rows in positions 0, 1, 2, …, denied - 1
, which we can
get using df.take(np.arange(denied))
.
The average score on this problem was 39%.
For blank (c), we need to get all remaining applicants, who form the
new “approved” sample. We can .take
the rows corresponding
to the ones we didn’t put into the “denied” group. That is, the first
applicant who will be put into this group is at position
denied
, and we’ll take all applicants from there onwards.
We should therefore fill in blank (c) with
df.take(np.arange(denied, df.shape[0]))
.
For example, if apps
had only 10 rows, 7 of them
corresponding to denied applications, we would shuffle the rows of
apps
, then take rows 0, 1, 2, 3, 4, 5, 6 as our new
“denied” sample and rows 7, 8, 9 as our new “approved” sample.
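For reference, here is a sketch of the completed simulation with all three blanks filled in, assuming apps and numpy (as np) are available as above:

def shuffle_all(df):
    '''Returns a DataFrame with the same rows as df, but reordered.'''
    return df.sample(df.shape[0])                                         # blank (a)

def fast_stat(df):
    # This function does not and should not contain any randomness.
    denied = np.count_nonzero(df.get("status") == "denied")
    # The top `denied` rows form the new "denied" sample...
    mean_denied = df.take(np.arange(denied)).get("dependents").mean()     # blank (b)
    # ...and the remaining rows form the new "approved" sample.
    mean_approved = df.take(np.arange(denied, df.shape[0])).get("dependents").mean()  # blank (c)
    return mean_denied - mean_approved

stats = np.array([])
for i in np.arange(10000):
    stat = fast_stat(shuffle_all(apps))
    stats = np.append(stats, stat)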
The average score on this problem was 38%.
Choose the best tool to answer each of the following questions. Note the following:
Are incomes of applicants with 2 or fewer dependents drawn randomly from the distribution of incomes of all applicants?
Hypothesis Testing
Permutation Testing
Bootstrapping
Answer: Hypothesis Testing
This is a question of whether a certain set of incomes (corresponding to applicants with 2 or fewer dependents) are drawn randomly from a certain population (incomes of all applicants). We need to use hypothesis testing to determine whether this model for how samples are drawn from a population seems plausible.
The average score on this problem was 47%.
What is the median income of credit card applicants with 2 or fewer dependents?
Hypothesis Testing
Permutation Testing
Bootstrapping
Answer: Bootstrapping
The question is looking for an estimate of a specific parameter (the median income of applicants with 2 or fewer dependents), so we know bootstrapping is the best tool.
The average score on this problem was 88%.
Are credit card applications approved through a random process in which 50% of applications are approved?
Hypothesis Testing
Permutation Testing
Bootstrapping
Answer: Hypothesis Testing
The question asks about the validity of a model in which applications are approved randomly such that each application has a 50% chance of being approved. To determine whether this model is plausible, we should use a standard hypothesis test to simulate this random process many times and see if the data generated according to this model is consistent with our observed data.
The average score on this problem was 74%.
Is the median income of applicants with 2 or fewer dependents less than the median income of applicants with 3 or more dependents?
Hypothesis Testing
Permutation Testing
Bootstrapping
Answer: Permutation Testing
Recall, a permutation test helps us decide whether two random samples come from the same distribution. This question is about whether two random samples for different groups of applicants have the same distribution of incomes, or whether they don’t because one group’s median income is less than the other’s.
The average score on this problem was 57%.
What is the difference in median income of applicants with 2 or fewer dependents and applicants with 3 or more dependents?
Hypothesis Testing
Permutation Testing
Bootstrapping
Answer: Bootstrapping
The question at hand is looking for a specific parameter value (the difference in median incomes for two different subsets of the applicants). Since this is a question of estimating an unknown parameter, bootstrapping is the best tool.
The average score on this problem was 63%.
In this question, we’ll explore the relationship between the ages and incomes of credit card applicants.
The credit card company that owns the data in apps, BruinCard, has
decided not to give us access to the entire apps
DataFrame,
but instead just a sample of apps
called
small_apps
. We’ll start by using the information in
small_apps
to compute the regression line that predicts the
age of an applicant given their income.
For an applicant with an income that is \frac{8}{3} standard deviations above the
mean income, we predict their age to be \frac{4}{5} standard deviations above the
mean age. What is the correlation coefficient, r, between incomes and ages in
small_apps
? Give your answer as a fully simplified
fraction.
Answer: r = \frac{3}{10}
To find the correlation coefficient r we use the equation of the regression line in standard units and solve for r as follows. \begin{align*} \text{predicted } y_{\text{(su)}} &= r \cdot x_{\text{(su)}} \\ \frac{4}{5} &= r \cdot \frac{8}{3} \\ r &= \frac{4}{5} \cdot \frac{3}{8} \\ r &= \frac{3}{10} \end{align*}
The average score on this problem was 52%.
Now, we want to predict the income of an applicant given their age.
We will again use the information in small_apps
to find the
regression line. The regression line predicts that an applicant whose
age is \frac{4}{5} standard deviations
above the mean age has an income that is s standard deviations above the mean income.
What is the value of s? Give your
answer as a fully simplified fraction.
Answer: s = \frac{6}{25}
We again use the equation of the regression line in standard units, with the value of r we found in the previous part. \begin{align*} \text{predicted } y_{\text{(su)}} &= r \cdot x_{\text{(su)}} \\ s &= \frac{3}{10} \cdot \frac{4}{5} \\ s &= \frac{6}{25} \end{align*}
Notice that when we predict income based on age, our predictions are different than when we predict age based on income. That is, the answer to this question is not \frac{8}{3}. We can think of this phenomenon as a consequence of regression to the mean which means that the predicted variable is always closer to average than the original variable. In part (a), we start with an income of \frac{8}{3} standard units and predict an age of \frac{4}{5} standard units, which is closer to average than \frac{8}{3} standard units. Then in part (b), we start with an age of \frac{4}{5} and predict an income of \frac{6}{25} standard units, which is closer to average than \frac{4}{5} standard units. This happens because whenever we make a prediction, we multiply by r which is less than one in magnitude.
The average score on this problem was 21%.
BruinCard has now taken away our access to both apps
and
small_apps
, and has instead given us access to an even
smaller sample of apps
called mini_apps
. In
mini_apps
, we know the following information: - All incomes
and ages are positive numbers. - There is a positive linear association
between incomes and ages.
We use the data in mini_apps
to find the regression line
that will allow us to predict the income of an applicant given their
age. Just to test the limits of this regression line, we use it to
predict the income of an applicant who is -2 years old,
even though it doesn’t make sense for a person to have a negative
age.
Let I be the regression line’s prediction of this applicant’s income. Which of the following inequalities are guaranteed to be satisfied? Select all that apply.
I < 0
I < \text{mean income}
| I - \text{mean income}| \leq | \text{mean age} + 2 |
\dfrac{| I - \text{mean income}|}{\text{standard deviation of incomes}} \leq \dfrac{| \text{mean age} + 2 |}{\text{standard deviation of ages}}
None of the above.
Answer: I < \text{mean income}, \dfrac{| I - \text{mean income}|}{\text{standard deviation of incomes}} \leq \dfrac{| \text{mean age} + 2 |}{\text{standard deviation of ages}}
To understand this answer, we will investigate each option.
This option asks whether income is guaranteed to be negative. This is not necessarily true. For example, it’s possible that the slope of the regression line is 2 and the intercept is 10, in which case the income associated with a -2 year old would be 6, which is positive.
This option asks whether the predicted income is guaranteed to be lower than the mean income. It helps to think in standard units. In standard units, the regression line goes through the point (0, 0) and has slope r, which we are told is positive. This means that for a below-average x, the predicted y is also below average. So this statement must be true.
First, notice that | \text{mean age} + 2 | = | -2 - \text{mean age}|, which represents the horizontal distance between these two points on the regression line: (\text{mean age}, \text{mean income}), (-2, I). Likewise, | I - \text{mean income}| represents the vertical distance between those same two points. So the inequality can be interpreted as a question of whether the rise of the regression line is less than or equal to the run, or whether the slope is at most 1. That’s not guaranteed when we’re working in original units, as we are here, so this option is not necessarily true.
Since standard deviation cannot be negative, we have \dfrac{| I - \text{mean income}|}{\text{standard deviation of incomes}} = \left| \dfrac{I - \text{mean income}}{\text{standard deviation of incomes}} \right| = \left| I_{\text{(su)}} \right|. Similarly, \dfrac{|\text{mean age} + 2|}{\text{standard deviation of ages}} = \left| \dfrac{-2 - \text{mean age}}{\text{standard deviation of ages}} \right| = \left| (-2)_{\text{(su)}} \right|. So this option is asking about whether the predicted income, in standard units, is guaranteed to be less (in absolute value) than the age. Since we make predictions in standard units using the equation of the regression line \text{predicted } y_{\text{(su)}} = r \cdot x_{\text{(su)}} and we know |r|\leq 1, this means |\text{predicted } y_{\text{(su)}}| \leq | x_{\text{(su)}}|. Applying this to ages (x) and incomes (y), this says exactly what the given inequality says. This is the phenomenon we call regression to the mean.
The average score on this problem was 69%.
Yet again, BruinCard, the company that gave us access to
apps
, small_apps
, and mini_apps
,
has revoked our access to those three DataFrames and instead has given
us micro_apps
, an even smaller sample of
apps
.
Using micro_apps
, we are again interested in finding the
regression line that will allow us to predict the income of an applicant
given their age. We are given the following information:
- The mean age in micro_apps is 33, and the mean income in micro_apps is \frac{7}{2} (in tens of thousands of dollars).
- The correlation coefficient between ages and incomes in micro_apps is r = -\frac{1}{3}.
- The regression line predicts that an applicant who is 24 years old has an income of \frac{31}{2} (in tens of thousands of dollars).
Suppose the standard deviation of incomes in micro_apps
is an integer multiple of the standard deviation of ages in
micro_apps
. That is,
\text{standard deviation of income} = k \cdot \text{standard deviation of age}.
What is the value of k? Give your answer as an integer.
Answer: k = 4
To find this answer, we’ll use the definition of the regression line in original units, which is \text{predicted } y = mx+b, where m = r \cdot \frac{\text{SD of } y}{\text{SD of }x}, \: \: b = \text{mean of } y - m \cdot \text{mean of } x
Next we substitute these values for m and b into \text{predicted } y = mx + b, interpret x as age and y as income, and use the given information to find k. \begin{align*} \text{predicted } y &= mx+b \\ \text{predicted } y &= r \cdot \frac{\text{SD of } y}{\text{SD of }x} \cdot x+ \text{mean of } y - r \cdot \frac{\text{SD of } y}{\text{SD of }x} \cdot \text{mean of } x\\ \text{predicted income}&= r \cdot \frac{\text{SD of income}}{\text{SD of age}} \cdot \text{age}+ \text{mean income} - r \cdot \frac{\text{SD of income}}{\text{SD of age}} \cdot \text{mean age} \\ \frac{31}{2}&= -\frac{1}{3} \cdot k \cdot 24+ \frac{7}{2} + \frac{1}{3} \cdot k \cdot 33 \\ \frac{31}{2}&= -8k+ \frac{7}{2} + 11k \\ \frac{31}{2}&= 3k+ \frac{7}{2} \\ 3k &= \frac{31}{2} - \frac{7}{2} \\ 3k &= 12 \\ k &= 4 \end{align*}
Another way to solve this problem uses the equation of the regression line in standard units and the definition of standard units.
\begin{align*} \text{predicted } y_{\text{(su)}} &= r \cdot x_{\text{(su)}} \\ \frac{\text{predicted income} - \text{mean income}}{\text{SD of income}} &= r \cdot \frac{\text{age} - \text{mean age}}{\text{SD of age}} \\ \frac{\frac{31}{2} - \frac{7}{2}}{k\cdot \text{SD of age}} &= -\frac{1}{3} \cdot \frac{24 - 33}{\text{SD of age}} \\ \frac{12}{k\cdot \text{SD of age}} &= -\frac{1}{3} \cdot \frac{-9}{\text{SD of age}} \\ \frac{12}{k\cdot \text{SD of age}} &= \frac{3}{\text{SD of age}} \\ \frac{k\cdot \text{SD of age}}{\text{SD of age}} &= \frac{12}{3}\\ k &= 4 \end{align*}
The average score on this problem was 45%.
Raine is helping settle a debate between two friends on the
“superior" season — winter or summer. In doing so, they try to
understand the relationship between the number of sunshine hours per
month in January and the number of sunshine hours per month in July
across all cities in California in sun
.
Raine finds the regression line that predicts the number of sunshine hours in July (y) for a city given its number of sunshine hours in January (x). In doing so, they find that the correlation between the two variables is \frac{2}{5}.
Which of these could be a scatter plot of number of sunshine hours in July vs. number of sunshine hours in January?
Option 1
Option 2
Option 3
Option 4
Answer: Option 1
Since r = \frac{2}{5}, the correct option must be a scatter plot with a mild positive (up and to the right) linear association. Option 3 can be ruled out immediately, since the linear association in it is negative (down and to the right). Option 2’s linear association is too strong for r = \frac{2}{5}, and Option 4’s linear association is too weak for r = \frac{2}{5}, which leaves Option 1.
The average score on this problem was 57%.
Suppose the standard deviation of the number of sunshine hours in January for cities in California is equal to the standard deviation of the number of sunshine hours in July for cities in California.
Raine’s hometown of Santa Clarita saw 60 more sunshine hours in January than the average California city did. How many more sunshine hours than average does the regression line predict that Santa Clarita will have in July? Give your answer as a positive integer. (Hint: You’ll need to use the fact that the correlation between the two variables is \frac{2}{5}.)
Answer: 24
At a high level, we’ll start with the formula for the regression line in standard units, and re-write it in a form that will allow us to use the information provided to us in the question.
Recall, the regression line in standard units is
\text{predicted }y_{\text{(su)}} = r \cdot x_{\text{(su)}}
Using the definitions of \text{predicted }y_{\text{(su)}} and x_{\text{(su)}} gives us
\frac{\text{predicted } y - \text{mean of }y}{\text{SD of }y} = r \cdot \frac{x - \text{mean of }x}{\text{SD of }x}
Here, the x variable is sunshine hours in January and the y variable is sunshine hours in July. Given that the standard deviations of January and July sunshine hours are equal, we can simplify our formula to
\text{predicted } y - \text{mean of }y = r \cdot (x - \text{mean of }x)
Since we’re asked how much more sunshine Santa Clarita will have in July compared to the average, we’re interested in the difference \text{predicted } y - \text{mean of }y. We were given that Santa Clarita had 60 more sunshine hours in January than the average, and that the correlation between the two variables (the correlation coefficient) is \frac{2}{5}. In terms of the variables above, then, we know:
x - \text{mean of }x = 60.
r = \frac{2}{5}.
Then,
\text{predicted } y - \text{mean of }y = r \cdot (x - \text{mean of }x) = \frac{2}{5} \cdot 60 = 24
Therefore, the regression line predicts that Santa Clarita will have 24 more sunshine hours than the average California city in July.
The average score on this problem was 68%.
As we know, San Diego was particularly cloudy this May. More generally, Anthony, another California native, feels that California is getting cloudier and cloudier overall.
To imagine what the dataset may look like in a few years, Anthony subtracts 5 from the number of sunshine hours in both January and July for all California cities in the dataset – i.e., he subtracts 5 from each x value and 5 from each y value in the dataset. He then creates a regression line to use the new xs to predict the new ys.
What is the slope of Anthony’s new regression line?
Answer: \frac{2}{5}
To determine the slope of Anthony’s new regression line, we need to understand how the modifications he made to the dataset (subtracting 5 hours from each x and y value) affect the slope. In simple linear regression, the slope of the regression line (m in y = mx + b) is calculated using the formula:
m = r \cdot \frac{\text{SD of y}}{\text{SD of x}}
r, the correlation coefficient between the two variables, remains unchanged in Anthony’s modifications. Remember, the correlation coefficient is the mean of the product of the x values and y values when both are measured in standard units; by subtracting the same constant amount from each x value, we aren’t changing what the x values convert to in standard units. If you’re not convinced, convert the following two arrays in Python to standard units; you’ll see that the results are the same.
x1 = np.array([5, 8, 4, 2, 9])
x2 = x1 - 5
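A quick check might look like the following sketch; the helper standard_units is our own (not from the problem) and mirrors the usual definition, and numpy is assumed imported as np:

def standard_units(vals):
    # Convert values to standard units: how many SDs above or below the mean.
    return (vals - vals.mean()) / np.std(vals)

x1 = np.array([5, 8, 4, 2, 9])   # repeated from above for completeness
x2 = x1 - 5
print(standard_units(x1))
print(standard_units(x2))        # identical to standard_units(x1)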
Furthermore, Anthony’s modifications also don’t change the standard deviations of the x values or y values, since the xs and ys aren’t any more or less spread out after being shifted “down” by 5. So, since r, \text{SD of }y, and \text{SD of }x are all unchanged, the slope of the new regression line is the same as the slope of the old regression line, pre-modification!
Given the fact that the correlation coefficient is \frac{2}{5} and the standard deviation of sunshine hours in January (\text{SD of }x) is equal to the standard deviation of sunshine hours in July (\text{SD of }y), we have
m = r \cdot \frac{\text{SD of }y}{\text{SD of }x} = \frac{2}{5} \cdot 1 = \frac{2}{5}
The average score on this problem was 73%.
Suppose the intercept of Raine’s original regression line – that is, before Anthony subtracted 5 from each x and each y – was 10. What is the intercept of Anthony’s new regression line?
-7
-5
-3
0
3
5
7
Answer: 7
Let’s denote the original intercept as b and the new intercept in the new dataset as b'. The equation for the original regression line is y = mx + b, where m is the slope and b is the intercept.
When Anthony subtracts 5 from each x and y value, the new regression line becomes y - 5 = m \cdot (x - 5) + b'
Expanding and rearranging this equation, we have
y = mx - 5m + 5 + b'
Remember, x and y here represent the number of sunshine hours in January and July, respectively, before Anthony subtracted 5 from each number of hours. This means that the equation for y above is equivalent to y = mx + b. Comparing, we see that
-5m + 5 + b' = b
Since m = \frac{2}{5} (from the previous part) and b = 10, we have
-5 \cdot \frac{2}{5} + 5 + b' = 10 \implies b' = 10 - 5 + 2 = 7
Therefore, the intercept of Anthony’s new regression line is 7.
The average score on this problem was 34%.
Jasmine is trying to get as far away from Anthony as possible and has a trip to Chicago planned after finals. Chicago is known for being very warm and sunny in the summer but cold, rainy, and snowy in the winter. She decides to build a regression line that uses month of the year (where 1 is January, 2 is February, 12 is December, etc.) to predict the number of sunshine hours in Chicago.
The DataFrame games
contains information about a sample
of popular games. Besides other columns, there is a column
"Complexity"
that contains the average complexity of the
game, a column "Rating"
that contains the average rating of
the game, and a column "Play Time"
that contains the
average play time of the game.
We use the regression line to predict a game’s "Rating"
based on its "Complexity"
. We find that for the game
Wingspan, which has a "Complexity"
that is 2
points higher than the average, the predicted "Rating"
is 3
points higher than the average.
What can you conclude about the correlation coefficient r?
r < 0
r = 0
r > 0
We cannot make any conclusions about the value of r based on this information alone.
Answer: r > 0
To answer this problem, it’s useful to recall the regression line in standard units:
\text{predicted } y_{\text{(su)}} = r \cdot x_{\text{(su)}}
If a value is positive in standard units, it means that it is above
the average of the distribution that it came from, and if a value is
negative in standard units, it means that it is below the average of the
distribution that it came from. Since we’re told that Wingspan
has a "Complexity"
that is 2 points higher than the
average, we know that x_{\text{(su)}}
is positive. Since we’re told that the predicted "Rating"
is 3 points higher than the average, we know that \text{predicted } y_{\text{(su)}} must also
be positive. As a result, r must also
be positive, since you can’t multiply a positive number (x_{\text{(su)}}) by a negative number and end
up with another positive number.
The average score on this problem was 74%.
What can you conclude about the standard deviations of “Complexity” and “Rating”?
SD of "Complexity"
< SD of "Rating"
SD of "Complexity"
= SD of "Rating"
SD of "Complexity"
> SD of "Rating"
We cannot make any conclusions about the relationship between these two standard deviations based on this information alone.
Answer: SD of "Complexity"
< SD of
"Rating"
Since the distance of the predicted "Rating"
from its
average is larger than the distance of the "Complexity"
from its average, it might be reasonable to guess that the values in the
"Rating"
column are more spread out. This is true, but
let’s see concretely why that’s the case.
Let’s start with the equation of the regression line in standard
units from the previous subpart. Remember that here, x refers to "Complexity"
and
y refers to "Rating"
.
\text{predicted } y_{\text{(su)}} = r \cdot x_{\text{(su)}}
We know that to convert a value to standard units, we subtract the value by the mean of the column it came from, and divide by the standard deviation of the column it came from. As such, x_{\text{(su)}} = \frac{x - \text{mean of } x}{\text{SD of } x}. We can substitute this relationship in the regression line above, which gives us
\frac{\text{predicted } y - \text{mean of } y}{\text{SD of } y} = r \cdot \frac{x - \text{mean of } x}{\text{SD of } x}
To simplify things, let’s use what we were told. We were told that
the predicted "Rating"
was 3 points higher than average.
This means that the numerator of the left side, \text{predicted } y - \text{mean of } y, is
equal to 3. Similarly, we were told that the "Complexity"
was 2 points higher than average, so x -
\text{mean of } x is 2. Then, we have:
\frac{3}{\text{SD of } y} = \frac{2r}{\text{SD of }x}
Note that for convenience, we included r in the numerator on the right-hand side.
Remember that our goal is to compare the SD of "Rating"
(y) to the SD of
"Complexity"
(x). We now
have an equation that relates these two quantities! Since they’re both
currently on the denominator, which can be tricky to work with, let’s
take the reciprocal (i.e. “flip”) both fractions.
\frac{\text{SD of } y}{3} = \frac{\text{SD of }x}{2r}
Now, re-arranging gives us
\text{SD of } y \cdot \frac{2r}{3} = \text{SD of }x
Since we know that r is somewhere
between 0 and 1, we know that \frac{2r}{3} is somewhere between 0 and \frac{2}{3}. This means that \text{SD of } x is somewhere between 0 and
two-thirds of the value of \text{SD of }
y, which means that no matter what, \text{SD of } x < \text{SD of } y.
Remembering again that here "Complexity"
is our x and "Rating"
is our y, we have that the SD of
"Complexity"
is less than the SD of
"Rating"
.
The average score on this problem was 42%.
Suppose that for children’s games, "Play Time"
and
"Rating"
are negatively linearly associated due to children
having short attention spans. Suppose that for children’s games, the
standard deviation of "Play Time"
is twice the standard
deviation of "Rating"
, and the average
"Play Time"
is 10 minutes. We use linear regression to
predict the "Rating"
of a children’s game based on its
"Play Time"
. The regression line predicts that Don’t
Break the Ice, a children’s game with a "Play Time"
of
8 minutes will have a "Rating"
of 4. Which of the following
could be the average "Rating"
for children’s games?
2
2.8
3.1
4
Answer: 3.1
Let’s recall the formulas for the regression line in original units,
since we’re given information in original units in this question (such
as the fact that for a "Play Time"
of 8
minutes, the predicted "Rating"
is 4
stars). Remember that throughout this question, "Play Time"
is our x and "Rating"
is
our y.
The regression line is of the form y = mx + b, where
m = r \cdot \frac{\text{SD of } y}{\text{SD of }x}, b = \text{mean of }y - m \cdot \text{mean of } x
There’s a lot of information provided to us in the question – let’s think about what it means in the context of our xs and ys:
- "Play Time" and "Rating" are negatively linearly associated, so -1 \leq r < 0.
- The standard deviation of "Play Time" is twice the standard deviation of "Rating", i.e. \text{SD of } x = 2 \cdot \text{SD of } y, so the slope is m = r \cdot \frac{\text{SD of } y}{\text{SD of }x} = \frac{r}{2}.
- The average "Play Time" is 10 minutes, i.e. \text{mean of } x = 10, so the intercept is b = \text{mean of }y - \frac{r}{2} \cdot 10 = \text{mean of }y - 5r.

Given all of this information, we need to find possible values for the \text{mean of } y. Substituting our known values for m and b into y = mx + b gives us
y = \frac{r}{2} x + \text{mean of }y - 5r
Now, using the fact that if x = 8, the predicted y is 4, we have
\begin{align*}4 &= \frac{r}{2} \cdot 8 + \text{mean of }y - 5r\\4 &= 4r - 5r + \text{mean of }y\\ 4 + r &= \text{mean of } y\end{align*}
Cool! We now know that the \text{mean of } y is 4 + r. We know that r must satisfy the relationship -1 \leq r < 0. By adding 4 to all pieces of this inequality, we have that 3 \leq r + 4 < 4, which means that 3 \leq \text{mean of } y < 4. Of the four options provided, only one is greater than or equal to 3 and less than 4, which is 3.1.
The average score on this problem was 55%.
The American Kennel Club (AKC) organizes information about dog
breeds. We’ve loaded their dataset into a DataFrame called
df
. The index of df
contains the dog breed
names as str
values. Besides other columns, there is a
column 'weight' (float)
that contains typical weight (kg)
and a column 'height' (float)
that contains typical height
(cm).
Sam wants to fit a linear model to predict a dog’s height
using its weight.
He first runs the following code:
x = df.get('weight')
y = df.get('height')

def su(vals):
    return (vals - vals.mean()) / np.std(vals)
Select all of the Python snippets that correctly compute the
correlation coefficient into the variable r
.
Snippet 1:
r = (su(x) * su(y)).mean()
Snippet 2:
r = su(x * y).mean()
Snippet 3:
t = 0
for i in range(len(x)):
    t = t + su(x[i]) * su(y[i])
r = t / len(x)
Snippet 4:
t = np.array([])
for i in range(len(x)):
    t = np.append(t, su(x)[i] * su(y)[i])
r = t.mean()
Snippet 1
Snippet 2
Snippet 3
Snippet 4
Answer: Snippets 1 & 4
Snippet 1: Recall from the reference sheet, the correlation
coefficient is r = (su(x) * su(y)).mean()
.
Snippet 2: We have to standardize each variable separately, so this snippet doesn’t work.
Snippet 3: Note that for this snippet we’re standardizing each
data point within each variable separately, and so we’re not really
standardizing the entire variable correctly. In other words, applying
su(x[i])
to a singular data point is just going to convert
this data point to zero, since we’re only inputting one data point into
su()
.
Snippet 4: Note that this code is just the same as Snippet 1, except we’re now directly computing the product of each corresponding data points individually. Hence this Snippet works.
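As a quick sanity check on made-up data (not from the problem), Snippets 1 and 4 produce the same value of r:

import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0])   # made-up weights
y = np.array([3.0, 5.0, 4.0, 8.0])       # made-up heights

def su(vals):
    return (vals - vals.mean()) / np.std(vals)

# Snippet 1: mean of the product of the standardized variables.
r1 = (su(x) * su(y)).mean()

# Snippet 4: the same products, accumulated one pair at a time.
t = np.array([])
for i in range(len(x)):
    t = np.append(t, su(x)[i] * su(y)[i])
r4 = t.mean()

print(r1, r4)   # the two values are equal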
The average score on this problem was 81%.
Sam computes the following statistics for his sample:
- mean weight: 15 kg, SD of weights: 8 kg
- mean height: 50 cm
- correlation coefficient: r = 0.8
The best-fit line predicts that a dog with a weight of 10 kg has a height of 45 cm.
What is the SD of dog heights?
2
4.5
10
25
45
None of the above
Answer: Option 3: 10
The best-fit line in original units is given by y = mx + b, where m = r \cdot \frac{\text{SD of } y}{\text{SD of } x} and b = (\text{mean of } y) - m \cdot (\text{mean of } x) (refer to the reference sheet). Let c be the SD of y, which we’re trying to find; then our best-fit line is y = \frac{0.8c}{8}x + \left(50 - \frac{0.8c}{8} \cdot 15\right). Plugging the two values they gave us (a weight of 10 kg and a predicted height of 45 cm) into our best-fit line and simplifying gives 45 = 0.1c \cdot 10 + (50 - 1.5c), which simplifies to 45 = 50 - 0.5c, which gives us an answer of c = 10.
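As a quick numeric check of this working (a sketch using the values that appear in the algebra above — r = 0.8, SD of weights 8, mean weight 15, mean height 50):

r = 0.8
sd_heights = 10              # the answer we found for c
m = r * sd_heights / 8       # slope of the best-fit line
b = 50 - m * 15              # intercept of the best-fit line
print(m * 10 + b)            # 45.0, matching the given prediction at 10 kg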
The average score on this problem was 89%.
Assume that the statistics in part b) still hold. Select all of the statements below that are true. (You don’t need to finish part b) in order to solve this question.)
The relationship between dog weight and height is linear.
The root mean squared error of the best-fit line is smaller than 5.
The best-fit line predicts that a dog that weighs 15 kg will be 50 cm tall.
The best-fit line predicts that a dog that weighs 10 kg will be shorter than 50 cm.
Answer: Option 3 & 4
Option 1: We cannot determine whether two variables are linear simply from a line of best fit. The line of best fit just happens to find the best linear relationship between two varaibles, not whether or not the variables have a linear relationship.
Option 2: To calculate the root mean squared error, we need the actual data points so we can calculate residual values. Seeing that we don’t have access to the data points, we cannot say that the root mean squared error of the best-fit line is smaller than 5.
Option 3: This is true accrding to the problem statement given in part b
Option 4: This is true since we expect there to be a positive correlation between dog height and weight. So dogs that are lighter will also most likely be shorter. (ie a dog that is lighter than 15 kg will most likely be shorter than 50cm)
The average score on this problem was 72%.