Extra Practice Session 9: Permutation Testing and Regression



Welcome! The problems shown below should be worked out on paper, since the quizzes and exams you take in this course will also be on paper. You do not need to submit your solutions anywhere.

We encourage you to complete this worksheet in groups during an extra practice session on Friday, March 8th. Solutions will be posted after all sessions have finished. This problem set is not designed to take any particular amount of time - focus on understanding concepts, not on getting through all the questions.


Problem 1

For the next several problems, we will use data from the 2021 Women's National Basketball Association (WNBA) season. In basketball, players score points by shooting the ball into a hoop. The team that scores the most points wins the game.

Kelsey Plum, a WNBA player, attended La Jolla Country Day School, which is adjacent to UCSD’s campus. Her current team is the Las Vegas Aces (three-letter code 'LVA'). In 2021, the Las Vegas Aces played 31 games, and Kelsey Plum played in all 31.

The DataFrame plum contains her stats for all games the Las Vegas Aces played in 2021. The first few rows of plum are shown below (though the full DataFrame has 31 rows, not 5):

Each row in plum corresponds to a single game. For each game, we have:

Consider the definition of the function diff_in_group_means:

def diff_in_group_means(df, group_col, num_col):
    # Average of num_col within each group: a Series indexed by the
    # values of group_col (here, False and True).
    s = df.groupby(group_col).mean().get(num_col)
    # Difference in group means: the False group minus the True group.
    return s.loc[False] - s.loc[True]


Problem 1.1

It turns out that Kelsey Plum averages 0.61 more assists in games that she wins (“winning games”) than in games that she loses (“losing games”). Fill in the blanks below so that observed_diff evaluates to -0.61.

observed_diff = diff_in_group_means(plum, __(a)__, __(b)__)
  1. What goes in blank (a)?

  2. What goes in blank (b)?

Answers: 'Won', 'AST'

To compute the number of assists Kelsey Plum averages in winning and losing games, we need to group by 'Won'. After doing so and applying the .mean() aggregation method, we need to access the values in the 'AST' column.

The second argument to diff_in_group_means, group_col, is the column we're grouping by, so blank (a) must be filled with 'Won'. Then, the third argument, num_col, must be 'AST'.

Note that after extracting the Series containing the average number of assists in wins and losses, we are returning the value with the index False (“loss”) minus the value with the index True (“win”). So, throughout this problem, keep in mind that we are computing “losses minus wins”. Since our observation was that she averaged 0.61 more assists in wins than in losses, it makes sense that diff_in_group_means(plum, 'Won', 'AST') is -0.61 (rather than +0.61).
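To make this concrete, here is a minimal sketch (it assumes the plum DataFrame described above; the two group means shown in the comments are hypothetical values, chosen only so that their difference is -0.61):

observed_diff = diff_in_group_means(plum, 'Won', 'AST')

# Inside the function, plum.groupby('Won').mean().get('AST') is a Series
# indexed by False (losses) and True (wins), e.g. (hypothetical values):
# Won
# False    4.20
# True     4.81
# so the function returns 4.20 - 4.81 = -0.61.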


Difficulty: ⭐️

The average score on this problem was 94%.


Problem 1.2

After observing that Kelsey Plum averages more assists in winning games than in losing games, we become interested in conducting a permutation test for the following hypotheses:

To conduct our permutation test, we place the following code in a for-loop.


won = plum.get('Won')
ast = plum.get('AST')
shuffled = plum.assign(Won_shuffled=np.random.permutation(won)) \
               .assign(AST_shuffled=np.random.permutation(ast))

Which of the following options does not compute a valid simulated test statistic for this permutation test?

Answer: diff_in_group_means(shuffled, 'Won', 'AST')

As we saw in the previous subpart, diff_in_group_means(shuffled, 'Won', 'AST') computes the observed test statistic, which is -0.61. There is no randomness involved in the observed test statistic; each time we run the line diff_in_group_means(shuffled, 'Won', 'AST') we will see the same result, so this cannot be used for simulation.

To perform a permutation test here, we need to simulate under the null by randomly assigning assist counts to groups; here, the groups are “win” and “loss”.

  • Option 2: Here, assist counts are shuffled and the group names are kept in the same order. The end result is a random pairing of assists to groups.
  • Option 3: Here, the group names are shuffled and the assist counts are kept in the same order. The end result is a random pairing of assist counts to groups.
  • Option 4: Here, both the group names and assist counts are shuffled, but the end result is still the same as in the previous two options.

As such, Options 2 through 4 are all valid, and Option 1 is the only invalid one.
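Putting one of the valid options into a simulation, the full permutation test might look like the sketch below (this assumes numpy has been imported as np and that the plum DataFrame and diff_in_group_means function from above are available; it uses the approach of shuffling only the group labels):

simulated_stats = np.array([])
for i in np.arange(10000):
    # Randomly pair each game's assist count with a shuffled win/loss label.
    shuffled = plum.assign(Won_shuffled=np.random.permutation(plum.get('Won')))
    stat = diff_in_group_means(shuffled, 'Won_shuffled', 'AST')
    simulated_stats = np.append(simulated_stats, stat)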


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 68%.


Problem 1.3

Suppose we generate 10,000 simulated test statistics, using one of the valid options from the previous part. The empirical distribution of test statistics, with a red line at observed_diff, is shown below.

Roughly one-quarter of the area of the histogram above is to the left of the red line. What is the correct interpretation of this result?

Answer: Under the assumption that Kelsey Plum’s number of assists in winning games and in losing games come from the same distribution, and that she wins 22 of the 31 games she plays, the chance of her averaging at least 0.61 more assists in wins than losses is roughly a quarter. (Option 3)

First, we should note that the area to the left of the red line (a quarter) is the p-value of our hypothesis test. Generally, the p-value is the probability of observing an outcome as extreme as, or more extreme than, the observed statistic, under the assumption that the null hypothesis is true. The direction to look in depends on the alternative hypothesis; here, since our alternative hypothesis is that the number of assists Kelsey Plum makes in winning games is higher on average than in losing games, a “more extreme” outcome is one where the assists in winning games are higher than in losing games, i.e. where \text{(assists in wins)} - \text{(assists in losses)} is positive, or equivalently where \text{(assists in losses)} - \text{(assists in wins)} is negative. As mentioned in the solution to the first subpart, our test statistic is \text{(assists in losses)} - \text{(assists in wins)}, so a more extreme outcome is one where this statistic is negative, i.e. to the left of the observed statistic.

Let’s first rule out the first two options.

  • Option 1: This option states that the probability that the null hypothesis (the number of assists she makes in winning and losing games comes from the same distribution) is true is roughly a quarter. However, the p-value is not the probability that the null hypothesis is true.
  • Option 2: The significance level is the formal name for the p-value “cutoff” that we specify in our hypothesis test. There is no cutoff mentioned in the problem. The observed significance level is another name for the p-value, but Option 2 did not contain the word observed.

Now, the only difference between Options 3 and 4 is the inclusion of “at least” in Option 3. Remember, to compute a p-value we must compute the probability of observing something as or more extreme than the observed, under the null. The “or more” corresponds to “at least” in Option 3. As such, Option 3 is the correct choice.
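In code, this p-value could be computed as in the sketch below (assuming simulated_stats is the array of 10,000 simulated statistics from the previous part and observed_diff is -0.61, as defined earlier); it should come out to roughly 0.25:

# Simulated statistics are "losses minus wins", so "as or more extreme"
# means as small as or smaller than the observed statistic.
p_value = np.count_nonzero(simulated_stats <= observed_diff) / 10000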


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 70%.



Problem 2

An IKEA fan created an app where people can log the amount of time it took them to assemble their IKEA furniture. The DataFrame app_data has a row for each product build that was logged on the app. The column 'product' contains the name of the product, and the column 'minutes' contains integer values representing the number of minutes it took to assemble each product.

You are browsing the IKEA showroom, deciding whether to purchase the BILLY bookcase or the LOMMARP bookcase. You are concerned about the amount of time it will take to assemble your new bookcase, so you look up the assembly times reported in app_data. Thinking of the data in app_data as a random sample of all IKEA purchases, you want to perform a permutation test to test the following hypotheses.

Null Hypothesis: The assembly time for the BILLY bookcase and the assembly time for the LOMMARP bookcase come from the same distribution.

Alternative Hypothesis: The assembly time for the BILLY bookcase and the assembly time for the LOMMARP bookcase come from different distributions.


Problem 2.1

Suppose we query app_data to keep only the BILLY bookcases, then average the 'minutes' column. In addition, we separately query app_data to keep only the LOMMARP bookcases, then average the 'minutes' column. If the null hypothesis is true, which of the following statements about these two averages is correct?

Answer: Any difference between these two averages is due to random chance.

If the null hypothesis is true, this means that the time recorded in app_data for each BILLY bookcase is a random number that comes from some distribution, and the time recorded in app_data for each LOMMARP bookcase is a random number that comes from the same distribution. Each assembly time is a random number, so even if the null hypothesis is true, if we take one person who assembles a BILLY bookcase and one person who assembles a LOMMARP bookcase, there is no guarantee that their assembly times will match. Their assembly times might match, or they might be different, because assembly time is random. Randomness is the only reason that their assembly times might be different, as the null hypothesis says there is no systematic difference in assembly times between the two bookcases. Specifically, it’s not the case that one typically takes longer to assemble than the other.

With those points in mind, let’s go through the answer choices.

The first answer choice is incorrect. Just because two sets of numbers are drawn from the same distribution, the numbers themselves might be different due to randomness, and the averages might also be different. Maybe just by chance, the people who assembled the BILLY bookcases and recorded their times in app_data were slower on average than the people who assembled LOMMARP bookcases. If the null hypothesis is true, this difference in average assembly time should be small, but it very likely exists to some degree.

The second answer choice is correct. If the null hypothesis is true, the only reason for the difference is random chance alone.

The third answer choice is incorrect for the same reason that the second answer choice is correct. If the null hypothesis is true, any difference must be explained by random chance.

The fourth answer choice is incorrect. If there is a difference between the averages, it should be very small and not statistically significant. In other words, if we did a hypothesis test and the null hypothesis was true, we should fail to reject the null.


Difficulty: ⭐️⭐️

The average score on this problem was 77%.


Problem 2.2

For the permutation test, we’ll use as our test statistic the average assembly time for BILLY bookcases minus the average assembly time for LOMMARP bookcases, in minutes.

Complete the code below to generate one simulated value of the test statistic in a new way, without using np.random.permutation.

billy = (app_data.get('product') == 
        'BILLY Bookcase, white, 31 1/2x11x79 1/2')
lommarp = (app_data.get('product') == 
          'LOMMARP Bookcase, dark blue-green, 25 5/8x78 3/8')
billy_lommarp = app_data[billy|lommarp]
billy_mean = np.random.choice(billy_lommarp.get('minutes'), billy.sum(), replace=False).mean()
lommarp_mean = _________
billy_mean - lommarp_mean

What goes in the blank?

Answer: (billy_lommarp.get('minutes').sum() - billy_mean * billy.sum())/lommarp.sum()

The first line of code creates a boolean Series with a True value for every BILLY bookcase, and the second line of code creates the analogous Series for the LOMMARP bookcase. The third line queries app_data to define a DataFrame called billy_lommarp containing all products that are BILLY or LOMMARP bookcases. In other words, this DataFrame contains a mix of BILLY and LOMMARP bookcases.

From this point, the way we would normally proceed in a permutation test would be to use np.random.permutation to shuffle one of the two relevant columns (either 'product' or 'minutes') to create a random pairing of assembly times with products. Then we would calculate the average of all assembly times that were randomly assigned to the label BILLY. Similarly, we’d calculate the average of all assembly times that were randomly assigned to the label LOMMARP. Then we’d subtract these averages to get one simulated value of the test statistic. To run the permutation test, we’d have to repeat this process many times.

In this problem, we need to generate a simulated value of the test statistic, without randomly shuffling one of the columns. The code starts us off by defining a variable called billy_mean that comes from using np.random.choice. There’s a lot going on here, so let’s break it down. Remember that the first argument to np.random.choice is a sequence of values to choose from, and the second is the number of random choices to make. And we set replace=False, so that no element that has already been chosen can be chosen again. Here, we’re making our random choices from the 'minutes' column of billy_lommarp. The number of choices to make from this collection of values is billy.sum(), which is the sum of all values in the billy Series defined in the first line of code. The billy Series contains True/False values, but in Python, True counts as 1 and False counts as 0, so billy.sum() evaluates to the number of True entries in billy, which is the number of BILLY bookcases recorded in app_data. It helps to think of the random process like this:

  1. Collect all the assembly times of any BILLY or LOMMARP bookcase in a large bag.
  2. Pull out a random assembly time from this bag.
  3. Repeat step 2, drawing as many times as there are BILLY bookcases, without replacement.

If we think of the random times we draw as being labeled BILLY, then the remaining assembly times still leftover in the bag represent the assembly times randomly labeled LOMMARP. In other words, this is a random association of assembly times to labels (BILLY or LOMMARP), which is the same thing we usually accomplish by shuffling in a permutation test.

From here, we can proceed the same way as usual. First, we need to calculate the average of all assembly times that were randomly assigned to the label BILLY. This is done for us and stored in billy_mean. We also need to calculate the average of all assembly times that were randomly assigned the label LOMMARP. We’ll call that lommarp_mean. Thinking of picking times out of a large bag, this is the average of all the assembly times left in the bag. The problem is there is no easy way to access the assembly times that were not picked. We can take advantage of the fact that we can easily calculate the total assembly time of all BILLY and LOMMARP bookcases together with billy_lommarp.get('minutes').sum(). Then if we subtract the total assembly time of all bookcases randomly labeled BILLY, we’ll be left with the total assembly time of all bookcases randomly labeled LOMMARP. That is, billy_lommarp.get('minutes').sum() - billy_mean * billy.sum() represents the total assembly time of all bookcases randomly labeled LOMMARP. The count of the number of LOMMARP bookcases is given by lommarp.sum() so the average is (billy_lommarp.get('minutes').sum() - billy_mean * billy.sum())/lommarp.sum().
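Putting it all together, one simulated value of the test statistic can be computed without np.random.permutation as in the sketch below (assuming app_data and the billy, lommarp, and billy_lommarp variables defined above, and numpy imported as np):

minutes = billy_lommarp.get('minutes')
# Assembly times randomly labeled BILLY: draw as many times as there are
# BILLY bookcases, without replacement.
billy_mean = np.random.choice(minutes, billy.sum(), replace=False).mean()
# The leftover times are the ones randomly labeled LOMMARP. Their total is
# the overall total minus the total of the times labeled BILLY.
lommarp_mean = (minutes.sum() - billy_mean * billy.sum()) / lommarp.sum()
one_simulated_stat = billy_mean - lommarp_mean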

A common wrong answer for this question was the second answer choice, np.random.choice(billy_lommarp.get('minutes'), lommarp.sum(), replace=False).mean(). This mimics the structure of how billy_mean was defined so it’s a natural guess. However, this corresponds to the following random process, which doesn’t associate each assembly with a unique label (BILLY or LOMMARP):

  1. Collect all the assembly times of any BILLY or LOMMARP bookcase in a large bag.
  2. Pull out a random assembly time from this bag.
  3. Repeat step 2, drawing as many times as there are BILLY bookcases, without replacement.
  4. Collect all the assembly times of any BILLY or LOMMARP bookcase in a large bag.
  5. Pull out a random assembly time from this bag.
  6. Repeat step 5, drawing as many times as there are LOMMARP bookcases, without replacement.

We could easily get the same assembly time once for BILLY and once for LOMMARP, while other assembly times could get picked for neither. This process doesn’t split the data into two random groups as desired.


Difficulty: ⭐️⭐️⭐️⭐️⭐️

The average score on this problem was 12%.



Problem 3

The DataFrame apps contains application data for a random sample of 1,000 applicants for a particular credit card from the 1990s. The columns are:

The first few rows of apps are shown below, though remember that apps has 1,000 rows.




In apps, our sample of 1,000 credit card applications, applicants who were approved for the credit card have fewer dependents, on average, than applicants who were denied. The mean number of dependents for approved applicants is 0.98, versus 1.07 for denied applicants.

To test whether this difference is purely due to random chance, or whether the distributions of the number of dependents for approved and denied applicants are truly different in the population of all credit card applications, we decide to perform a permutation test.

Consider the incomplete code block below.

def shuffle_status(df):
    shuffled_status = np.random.permutation(df.get("status"))
    return df.assign(status=shuffled_status).get(["status", "dependents"])

def test_stat(df):
    grouped = df.groupby("status").mean().get("dependents")
    approved = grouped.loc["approved"]
    denied = grouped.loc["denied"]
    return __(a)__

stats = np.array([])
for i in np.arange(10000):
    shuffled_apps = shuffle_status(apps)
    stat = test_stat(shuffled_apps)
    stats = np.append(stats, stat)

p_value = np.count_nonzero(__(b)__) / 10000

Below are six options for filling in blanks (a) and (b) in the code above.

             Blank (a)                     Blank (b)
Option 1     denied - approved             stats >= test_stat(apps)
Option 2     denied - approved             stats <= test_stat(apps)
Option 3     approved - denied             stats >= test_stat(apps)
Option 4     np.abs(denied - approved)     stats >= test_stat(apps)
Option 5     np.abs(denied - approved)     stats <= test_stat(apps)
Option 6     np.abs(approved - denied)     stats >= test_stat(apps)

The correct way to fill in the blanks depends on how we choose our null and alternative hypotheses.


Problem 3.1

Suppose we choose the following pair of hypotheses.

Which of the six presented options could correctly fill in blanks (a) and (b) for this pair of hypotheses? Select all that apply.

Answer: Option 4, Option 6

For blank (a), we want to choose a test statistic that helps us distinguish between the null and alternative hypotheses. The alternative hypothesis says that denied and approved should be different, but it doesn't say which should be larger. Options 1 through 3 therefore won't work, because high values and low values of these statistics both point to the alternative hypothesis, and moderate values point to the null hypothesis. Options 4 through 6 all work because large values point to the alternative hypothesis, and small values close to 0 are consistent with the null hypothesis.

For blank (b), we want to calculate the p-value in such a way that it represents the proportion of trials for which the simulated test statistic was equal to the observed statistic or further in the direction of the alternative. For all of Options 4 through 6, large values of the test statistic indicate the alternative, so we need to calculate the p-value with a >= sign, as in Options 4 and 6.

While Option 5 filled in blank (a) correctly, it did not fill in blank (b) correctly. Options 4 and 6 fill in both blanks correctly.


Difficulty: ⭐️⭐️

The average score on this problem was 78%.


Problem 3.2

Now, suppose we choose the following pair of hypotheses.

Which of the six presented options could correctly fill in blanks (a) and (b) for this pair of hypotheses? Select all that apply.

Answer: Option 1

As in the previous part, we need to fill blank (a) with a test statistic such that large values point towards one of the hypotheses and small values point towards the other. Here, the alternative hypothesis suggests that approved should be less than denied, so we can't use Options 4 through 6, because these can only detect whether approved and denied are different, not which one is larger. Any of Options 1 through 3 should work, however. For Options 1 and 2, large values point towards the alternative, and for Option 3, small values point towards the alternative. This means we need to calculate the p-value in blank (b) with a >= symbol for the test statistic from Options 1 and 2, and a <= symbol for the test statistic from Option 3. Only Option 1 fills in blank (b) correctly based on the test statistic used in blank (a).


Difficulty: ⭐️⭐️

The average score on this problem was 83%.


Problem 3.3

Option 6 from the start of this question is repeated below.

             Blank (a)                     Blank (b)
Option 6     np.abs(approved - denied)     stats >= test_stat(apps)

We want to create a new option, Option 7, that replicates the behavior of Option 6, but with blank (a) filled in as shown:

             Blank (a)                     Blank (b)
Option 7     approved - denied

Which expression below could go in blank (b) so that Option 7 is equivalent to Option 6?

Answer: np.abs(stats) >= np.abs(test_stat(apps))

First, we need to understand how Option 6 works. Option 6 produces large values of the test statistic when approved is very different from denied, then calculates the p-value as the proportion of trials for which the simulated test statistic was larger than the observed statistic. In other words, Option 6 calculates the proportion of trials in which approved and denied are more different in a pair of random samples than they are in the original samples.

For Option 7, the test statistic for a pair of random samples may come out very large or very small when approved is very different from denied. Similarly, the observed statistic may come out very large or very small when approved and denied are very different in the original samples. We want to find the proportion of trials in which approved and denied are more different in a pair of random samples than they are in the original samples, which means we want the proportion of trials in which the absolute value of approved - denied in a pair of random samples is larger than the absolute value of approved - denied in the original samples.
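In code, Option 7's blank (b) would therefore be used as in this sketch (assuming the stats array and test_stat function from the code block earlier in this problem):

p_value = np.count_nonzero(np.abs(stats) >= np.abs(test_stat(apps))) / 10000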


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 56%.



Problem 3.4

In our implementation of this permutation test, we followed the procedure outlined in lecture to draw new pairs of samples under the null hypothesis and compute test statistics — that is, we randomly assigned each row to a group (approved or denied) by shuffling one of the columns in apps, then computed the test statistic on this random pair of samples.

Let’s now explore an alternative solution to drawing pairs of samples under the null hypothesis and computing test statistics. Here’s the approach:

  1. Shuffle, i.e. re-order, the rows of the DataFrame.
  2. Use the values at the top of the resulting "dependents" column as the new “denied” sample, and the values at the bottom of the resulting "dependents" column as the new “approved” sample. Note that we don't necessarily split the DataFrame exactly in half — the sizes of these new samples depend on the number of “denied” and “approved” values in the original DataFrame!

Once we generate our pair of random samples in this way, we’ll compute the test statistic on the random pair, as usual. Here, we’ll use as our test statistic the difference between the mean number of dependents for denied and approved applicants, in the order denied minus approved.

Fill in the blanks to complete the simulation below.

Hint: np.random.permutation shouldn’t appear anywhere in your code.

    def shuffle_all(df):
        '''Returns a DataFrame with the same rows as df, but reordered.'''
        return __(a)__

    def fast_stat(df):
        # This function does not and should not contain any randomness.
        denied = np.count_nonzero(df.get("status") == "denied")
        mean_denied = __(b)__.get("dependents").mean()
        mean_approved = __(c)__.get("dependents").mean()
        return mean_denied - mean_approved

    stats = np.array([])
    for i in np.arange(10000):
        stat = fast_stat(shuffle_all(apps))
        stats = np.append(stats, stat)

Answer: The blanks should be filled in as follows:

  • Blank (a): df.sample(df.shape[0])
  • Blank (b): df.take(np.arange(denied))
  • Blank (c): df.take(np.arange(denied, df.shape[0]))

For blank (a), we are told to return a DataFrame with the same rows but in a different order. We can use the .sample method for this question. We want each row of the input DataFrame df to appear once, so we should sample without replacement, and we should have as many rows in the output as in df, so our sample should be of size df.shape[0]. Since sampling without replacement is the default behavior of .sample, it is optional to specify replace=False.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 59%.

For blank (b), we need to implement the strategy outlined, where after we shuffle the DataFrame, we use the values at the top of the DataFrame as our new “denied” sample. In a permutation test, the two random groups we create should have the same sizes as the two original groups we are given. In this case, the size of the “denied” group in our original data is stored in the variable denied. So we need the rows in positions 0, 1, 2, …, denied - 1, which we can get using df.take(np.arange(denied)).


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 39%.

For blank (c), we need to get all remaining applicants, who form the new “approved” sample. We can .take the rows corresponding to the ones we didn’t put into the “denied” group. That is, the first applicant who will be put into this group is at position denied, and we’ll take all applicants from there onwards. We should therefore fill in blank (c) with df.take(np.arange(denied, df.shape[0])).

For example, if apps had only 10 rows, 7 of them corresponding to denied applications, we would shuffle the rows of apps, then take rows 0, 1, 2, 3, 4, 5, 6 as our new “denied” sample and rows 7, 8, 9 as our new “approved” sample.
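Putting the three blanks together, the completed simulation might look like the sketch below (assuming the apps DataFrame and numpy imported as np):

def shuffle_all(df):
    '''Returns a DataFrame with the same rows as df, but reordered.'''
    # Sampling df.shape[0] rows without replacement is a random reordering.
    return df.sample(df.shape[0])

def fast_stat(df):
    # This function does not and should not contain any randomness.
    denied = np.count_nonzero(df.get("status") == "denied")
    # The first `denied` rows form the new "denied" group...
    mean_denied = df.take(np.arange(denied)).get("dependents").mean()
    # ...and the remaining rows form the new "approved" group.
    mean_approved = df.take(np.arange(denied, df.shape[0])).get("dependents").mean()
    return mean_denied - mean_approved

stats = np.array([])
for i in np.arange(10000):
    stat = fast_stat(shuffle_all(apps))
    stats = np.append(stats, stat)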


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 38%.



Problem 4

Choose the best tool to answer each of the following questions. Note the following:


Problem 4.1

Are incomes of applicants with 2 or fewer dependents drawn randomly from the distribution of incomes of all applicants?

Answer: Hypothesis Testing

This is a question of whether a certain set of incomes (corresponding to applicants with 2 or fewer dependents) are drawn randomly from a certain population (incomes of all applicants). We need to use hypothesis testing to determine whether this model for how samples are drawn from a population seems plausible.


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 47%.


Problem 4.2

What is the median income of credit card applicants with 2 or fewer dependents?

Answer: Bootstrapping

The question is looking for an estimate of a specific parameter (the median income of applicants with 2 or fewer dependents), so we know bootstrapping is the best tool.
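As a rough illustration of what that bootstrap might look like, here is a minimal sketch (it assumes apps has 'dependents' and 'income' columns, which are assumptions based on the surrounding problems, and that numpy is imported as np):

# Keep only applicants with 2 or fewer dependents.
few_dependents = apps[apps.get('dependents') <= 2]

boot_medians = np.array([])
for i in np.arange(10000):
    # Resample with replacement, keeping the sample size the same.
    resample = few_dependents.sample(few_dependents.shape[0], replace=True)
    boot_medians = np.append(boot_medians, resample.get('income').median())

# An approximate 95% confidence interval for the median income.
left = np.percentile(boot_medians, 2.5)
right = np.percentile(boot_medians, 97.5)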


Difficulty: ⭐️⭐️

The average score on this problem was 88%.


Problem 4.3

Are credit card applications approved through a random process in which 50% of applications are approved?

Answer: Hypothesis Testing

The question asks about the validity of a model in which applications are approved randomly such that each application has a 50% chance of being approved. To determine whether this model is plausible, we should use a standard hypothesis test to simulate this random process many times and see if the data generated according to this model is consistent with our observed data.
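A minimal sketch of how this simulation might look (it assumes apps has a 'status' column as in the earlier problems, and uses the number of approved applications among the 1,000 in our sample as the statistic):

observed_approved = np.count_nonzero(apps.get('status') == 'approved')

simulated_counts = np.array([])
for i in np.arange(10000):
    # Under the model, each of the 1,000 applications is approved with
    # probability 0.5, independently of the others.
    outcomes = np.random.choice(['approved', 'denied'], 1000)
    simulated_counts = np.append(simulated_counts, np.count_nonzero(outcomes == 'approved'))

# Compare observed_approved to the empirical distribution of simulated_counts.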


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 74%.


Problem 4.4

Is the median income of applicants with 2 or fewer dependents less than the median income of applicants with 3 or more dependents?

Answer: Permutation Testing

Recall, a permutation test helps us decide whether two random samples come from the same distribution. This question is about whether two random samples, for two different groups of applicants, come from the same distribution of incomes, or whether they don't because one group's median income is less than the other's.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 57%.


Problem 4.5

What is the difference in median income of applicants with 2 or fewer dependents and applicants with 3 or more dependents?

Answer: Bootstrapping

The question at hand is looking for a specific parameter value (the difference in median incomes for two different subsets of the applicants). Since this is a question of estimating an unknown parameter, bootstrapping is the best tool.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 63%.



Problem 5

In this question, we’ll explore the relationship between the ages and incomes of credit card applicants.


Problem 5.1

The credit card company that owns the data in apps, BruinCard, has decided not to give us access to the entire apps DataFrame, but instead just a sample of apps called small_apps. We'll start by using the information in small_apps to compute the regression line that predicts the age of an applicant given their income.

For an applicant with an income that is \frac{8}{3} standard deviations above the mean income, we predict their age to be \frac{4}{5} standard deviations above the mean age. What is the correlation coefficient, r, between incomes and ages in small_apps? Give your answer as a fully simplified fraction.

Answer: r = \frac{3}{10}

To find the correlation coefficient r we use the equation of the regression line in standard units and solve for r as follows. \begin{align*} \text{predicted } y_{\text{(su)}} &= r \cdot x_{\text{(su)}} \\ \frac{4}{5} &= r \cdot \frac{8}{3} \\ r &= \frac{4}{5} \cdot \frac{3}{8} \\ r &= \frac{3}{10} \end{align*}


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 52%.


Problem 5.2

Now, we want to predict the income of an applicant given their age. We will again use the information in small_apps to find the regression line. The regression line predicts that an applicant whose age is \frac{4}{5} standard deviations above the mean age has an income that is s standard deviations above the mean income. What is the value of s? Give your answer as a fully simplified fraction.

Answer: s = \frac{6}{25}

We again use the equation of the regression line in standard units, with the value of r we found in the previous part. \begin{align*} \text{predicted } y_{\text{(su)}} &= r \cdot x_{\text{(su)}} \\ s &= \frac{3}{10} \cdot \frac{4}{5} \\ s &= \frac{6}{25} \end{align*}

Notice that when we predict income based on age, our predictions are different than when we predict age based on income. That is, the answer to this question is not \frac{8}{3}. We can think of this phenomenon as a consequence of regression to the mean which means that the predicted variable is always closer to average than the original variable. In part (a), we start with an income of \frac{8}{3} standard units and predict an age of \frac{4}{5} standard units, which is closer to average than \frac{8}{3} standard units. Then in part (b), we start with an age of \frac{4}{5} and predict an income of \frac{6}{25} standard units, which is closer to average than \frac{4}{5} standard units. This happens because whenever we make a prediction, we multiply by r which is less than one in magnitude.


Difficulty: ⭐️⭐️⭐️⭐️⭐️

The average score on this problem was 21%.


Problem 5.3

BruinCard has now taken away our access to both apps and small_apps, and has instead given us access to an even smaller sample of apps called mini_apps. In mini_apps, we know the following information:

  • All incomes and ages are positive numbers.
  • There is a positive linear association between incomes and ages.

We use the data in mini_apps to find the regression line that will allow us to predict the income of an applicant given their age. Just to test the limits of this regression line, we use it to predict the income of an applicant who is -2 years old, even though it doesn’t make sense for a person to have a negative age.

Let I be the regression line’s prediction of this applicant’s income. Which of the following inequalities are guaranteed to be satisfied? Select all that apply.

Answer: I < \text{mean income}, \dfrac{| I - \text{mean income}|}{\text{standard deviation of incomes}} \leq \dfrac{| \text{mean age} + 2 |}{\text{standard deviation of ages}}

To understand this answer, we will investigate each option.

  • I < 0:

This option asks whether income is guaranteed to be negative. This is not necessarily true. For example, it’s possible that the slope of the regression line is 2 and the intercept is 10, in which case the income associated with a -2 year old would be 6, which is positive.

  • I < \text{mean income}:

This option asks whether the predicted income is guaranteed to be lower than the mean income. It helps to think in standard units. In standard units, the regression line goes through the point (0, 0) and has slope r, which we are told is positive. This means that for a below-average x, the predicted y is also below average. So this statement must be true.

  • | I - \text{mean income}| \leq | \text{mean age} + 2 |:

First, notice that | \text{mean age} + 2 | = | -2 - \text{mean age}|, which represents the horizontal distance between these two points on the regression line: (\text{mean age}, \text{mean income}) and (-2, I). Likewise, | I - \text{mean income}| represents the vertical distance between those same two points. So the inequality can be interpreted as asking whether the rise of the regression line between these two points is at most the run, i.e. whether the slope is at most 1 in magnitude. That's not guaranteed when we're working in original units, as we are here, so this option is not necessarily true.

  • \dfrac{| I - \text{mean income}|}{\text{standard deviation of incomes}} \leq \dfrac{| \text{mean age} + 2 |}{\text{standard deviation of ages}}:

Since standard deviations cannot be negative, we have \dfrac{| I - \text{mean income}|}{\text{standard deviation of incomes}} = \left| \dfrac{I - \text{mean income}}{\text{standard deviation of incomes}} \right| = |I_{\text{(su)}}|. Similarly, \dfrac{|\text{mean age} + 2|}{\text{standard deviation of ages}} = \left| \dfrac{-2 - \text{mean age}}{\text{standard deviation of ages}} \right| = |-2_{\text{(su)}}|. So this option is asking whether the predicted income, in standard units, is guaranteed to be no larger in absolute value than the age, in standard units. Since we make predictions in standard units using the equation of the regression line \text{predicted } y_{\text{(su)}} = r \cdot x_{\text{(su)}}, and we know |r|\leq 1, it follows that |\text{predicted } y_{\text{(su)}}| \leq | x_{\text{(su)}}|. Applying this to ages (x) and incomes (y), this says exactly what the given inequality says. This is the phenomenon we call regression to the mean.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 69%.


Problem 5.4

Yet again, BruinCard, the company that gave us access to apps, small_apps, and mini_apps, has revoked our access to those three DataFrames and instead has given us micro_apps, an even smaller sample of apps.

Using micro_apps, we are again interested in finding the regression line that will allow us to predict the income of an applicant given their age. We are given the following information:

  • The mean age is 33 and the mean income is \frac{7}{2}.
  • The correlation between ages and incomes is r = -\frac{1}{3}.
  • The regression line predicts that an applicant who is 24 years old has an income of \frac{31}{2}.

Suppose the standard deviation of incomes in micro_apps is an integer multiple of the standard deviation of ages in micro_apps. That is,

\text{standard deviation of income} = k \cdot \text{standard deviation of age}.

What is the value of k? Give your answer as an integer.

Answer: k = 4

To find this answer, we’ll use the definition of the regression line in original units, which is \text{predicted } y = mx+b, where m = r \cdot \frac{\text{SD of } y}{\text{SD of }x}, \: \: b = \text{mean of } y - m \cdot \text{mean of } x

Next we substitute these values for m and b into \text{predicted } y = mx + b, interpret x as age and y as income, and use the given information to find k. \begin{align*} \text{predicted } y &= mx+b \\ \text{predicted } y &= r \cdot \frac{\text{SD of } y}{\text{SD of }x} \cdot x+ \text{mean of } y - r \cdot \frac{\text{SD of } y}{\text{SD of }x} \cdot \text{mean of } x\\ \text{predicted income}&= r \cdot \frac{\text{SD of income}}{\text{SD of age}} \cdot \text{age}+ \text{mean income} - r \cdot \frac{\text{SD of income}}{\text{SD of age}} \cdot \text{mean age} \\ \frac{31}{2}&= -\frac{1}{3} \cdot k \cdot 24+ \frac{7}{2} + \frac{1}{3} \cdot k \cdot 33 \\ \frac{31}{2}&= -8k+ \frac{7}{2} + 11k \\ \frac{31}{2}&= 3k+ \frac{7}{2} \\ 3k &= \frac{31}{2} - \frac{7}{2} \\ 3k &= 12 \\ k &= 4 \end{align*}

Another way to solve this problem uses the equation of the regression line in standard units and the definition of standard units.

\begin{align*} \text{predicted } y_{\text{(su)}} &= r \cdot x_{\text{(su)}} \\ \frac{\text{predicted income} - \text{mean income}}{\text{SD of income}} &= r \cdot \frac{\text{age} - \text{mean age}}{\text{SD of age}} \\ \frac{\frac{31}{2} - \frac{7}{2}}{k\cdot \text{SD of age}} &= -\frac{1}{3} \cdot \frac{24 - 33}{\text{SD of age}} \\ \frac{12}{k\cdot \text{SD of age}} &= -\frac{1}{3} \cdot \frac{-9}{\text{SD of age}} \\ \frac{12}{k\cdot \text{SD of age}} &= \frac{3}{\text{SD of age}} \\ \frac{k\cdot \text{SD of age}}{\text{SD of age}} &= \frac{12}{3}\\ k &= 4 \end{align*}


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 45%.



Problem 6

Raine is helping settle a debate between two friends on the “superior” season — winter or summer. In doing so, they try to understand the relationship between the number of sunshine hours per month in January and the number of sunshine hours per month in July across all California cities in the DataFrame sun.

Raine finds the regression line that predicts the number of sunshine hours in July (y) for a city given its number of sunshine hours in January (x). In doing so, they find that the correlation between the two variables is \frac{2}{5}.


Problem 6.1

Which of these could be a scatter plot of number of sunshine hours in July vs. number of sunshine hours in January?

Answer: Option 1

Since r = \frac{2}{5}, the correct option must be a scatter plot with a mild positive (up and to the right) linear association. Option 3 can be ruled out immediately, since the linear association in it is negative (down and to the right). Option 2’s linear association is too strong for r = \frac{2}{5}, and Option 4’s linear association is too weak for r = \frac{2}{5}, which leaves Option 1.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 57%.



Problem 6.2

Suppose the standard deviation of the number of sunshine hours in January for cities in California is equal to the standard deviation of the number of sunshine hours in July for cities in California.

Raine’s hometown of Santa Clarita saw 60 more sunshine hours in January than the average California city did. How many more sunshine hours than average does the regression line predict that Santa Clarita will have in July? Give your answer as a positive integer. (Hint: You’ll need to use the fact that the correlation between the two variables is \frac{2}{5}.)

Answer: 24

At a high level, we’ll start with the formula for the regression line in standard units, and re-write it in a form that will allow us to use the information provided to us in the question.

Recall, the regression line in standard units is

\text{predicted }y_{\text{(su)}} = r \cdot x_{\text{(su)}}

Using the definitions of \text{predicted }y_{\text{(su)}} and x_{\text{(su)}} gives us

\frac{\text{predicted } y - \text{mean of }y}{\text{SD of }y} = r \cdot \frac{x - \text{mean of }x}{\text{SD of }x}

Here, the x variable is sunshine hours in January and the y variable is sunshine hours in July. Given that the standard deviations of January and July sunshine hours are equal, we can simplify our formula to

\text{predicted } y - \text{mean of }y = r \cdot (x - \text{mean of }x)

Since we're asked how much more sunshine Santa Clarita will have in July compared to the average, we're interested in the difference \text{predicted } y - \text{mean of }y. We were given that Santa Clarita had 60 more sunshine hours in January than the average, and that the correlation between the two variables (the correlation coefficient) is \frac{2}{5}. In terms of the variables above, then, we know:

  • x - \text{mean of }x = 60.

  • r = \frac{2}{5}.

Then,

\text{predicted } y - \text{mean of }y = r \cdot (x - \text{mean of }x) = \frac{2}{5} \cdot 60 = 24

Therefore, the regression line predicts that Santa Clarita will have 24 more sunshine hours than the average California city in July.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 68%.


As we know, San Diego was particularly cloudy this May. More generally, Anthony, another California native, feels that California is getting cloudier and cloudier overall.

To imagine what the dataset may look like in a few years, Anthony subtracts 5 from the number of sunshine hours in both January and July for all California cities in the dataset – i.e., he subtracts 5 from each x value and 5 from each y value in the dataset. He then creates a regression line to use the new xs to predict the new ys.


Problem 6.3

What is the slope of Anthony’s new regression line?

Answer: \frac{2}{5}

To determine the slope of Anthony’s new regression line, we need to understand how the modifications he made to the dataset (subtracting 5 hours from each x and y value) affect the slope. In simple linear regression, the slope of the regression line (m in y = mx + b) is calculated using the formula:

m = r \cdot \frac{\text{SD of y}}{\text{SD of x}}

r, the correlation coefficient between the two variables, remains unchanged in Anthony’s modifications. Remember, the correlation coefficient is the mean of the product of the x values and y values when both are measured in standard units; by subtracting the same constant amount from each x value, we aren’t changing what the x values convert to in standard units. If you’re not convinced, convert the following two arrays in Python to standard units; you’ll see that the results are the same.

x1 = np.array([5, 8, 4, 2, 9])
x2 = x1 - 5
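Here is a minimal sketch of that check, where su is a small helper (like the one in Problem 9) that converts values to standard units:

def su(vals):
    # Convert an array to standard units: deviations from the mean, in SDs.
    return (vals - vals.mean()) / np.std(vals)

# Shifting every value down by 5 changes the mean by 5 but leaves each
# deviation from the mean, and the SD, unchanged, so su(x1) equals su(x2).
print(np.allclose(su(x1), su(x2)))   # True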

Furthermore, Anthony’s modifications also don’t change the standard deviations of the x values or y values, since the xs and ys aren’t any more or less spread out after being shifted “down” by 5. So, since r, \text{SD of }y, and \text{SD of }x are all unchanged, the slope of the new regression line is the same as the slope of the old regression line, pre-modification!

Given the fact that the correlation coefficient is \frac{2}{5} and the standard deviation of sunshine hours in January (\text{SD of }x) is equal to the standard deviation of sunshine hours in July (\text{SD of }y), we have

m = r \cdot \frac{\text{SD of }y}{\text{SD of }x} = \frac{2}{5} \cdot 1 = \frac{2}{5}


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 73%.



Problem 6.4

Suppose the intercept of Raine’s original regression line – that is, before Anthony subtracted 5 from each x and each y – was 10. What is the intercept of Anthony’s new regression line?

Answer: 7

Let’s denote the original intercept as b and the new intercept in the new dataset as b'. The equation for the original regression line is y = mx + b, where:

  • y is a predicted number of sunshine hours in July, before 5 was subtracted from each number of hours.
  • m is the slope of the line, which we know is \frac{2}{5} from the previous part.
  • x is a number of sunshine hours in January, before 5 was subtracted from each number of hours.
  • b is the original intercept. This is 10.

When Anthony subtracts 5 from each x and y value, the new regression line becomes y - 5 = m \cdot (x - 5) + b'

Expanding and rearranging this equation, we have

y = mx - 5m + 5 + b'

Remember, x and y here represent the number of sunshine hours in January and July, respectively, before Anthony subtracted 5 from each number of hours. This means that the equation for y above is equivalent to y = mx + b. Comparing, we see that

-5m + 5 + b' = b

Since m = \frac{2}{5} (from the previous part) and b = 10, we have

-5 \cdot \frac{2}{5} + 5 + b' = 10 \implies b' = 10 - 5 + 2 = 7

Therefore, the intercept of Anthony’s new regression line is 7.


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 34%.


Jasmine is trying to get as far away from Anthony as possible and has a trip to Chicago planned after finals. Chicago is known for being very warm and sunny in the summer but cold, rainy, and snowy in the winter. She decides to build a regression line that uses month of the year (where 1 is January, 2 is February, 12 is December, etc.) to predict the number of sunshine hours in Chicago.


Problem 7

The DataFrame games contains information about a sample of popular games. Besides other columns, there is a column "Complexity" that contains the average complexity of the game, a column "Rating" that contains the average rating of the game, and a column "Play Time" that contains the average play time of the game.

We use the regression line to predict a game’s "Rating" based on its "Complexity". We find that for the game Wingspan, which has a "Complexity" that is 2 points higher than the average, the predicted "Rating" is 3 points higher than the average.


Problem 7.1

What can you conclude about the correlation coefficient r?

Answer: r > 0

To answer this problem, it’s useful to recall the regression line in standard units:

\text{predicted } y_{\text{(su)}} = r \cdot x_{\text{(su)}}

If a value is positive in standard units, it means that it is above the average of the distribution that it came from, and if a value is negative in standard units, it means that it is below the average of the distribution that it came from. Since we’re told that Wingspan has a "Complexity" that is 2 points higher than the average, we know that x_{\text{(su)}} is positive. Since we’re told that the predicted "Rating" is 3 points higher than the average, we know that \text{predicted } y_{\text{(su)}} must also be positive. As a result, r must also be positive, since you can’t multiply a positive number (x_{\text{(su)}}) by a negative number and end up with another positive number.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 74%.


Problem 7.2

What can you conclude about the standard deviations of “Complexity” and “Rating”?

Answer: SD of "Complexity" < SD of "Rating"

Since the distance of the predicted "Rating" from its average is larger than the distance of the "Complexity" from its average, it might be reasonable to guess that the values in the "Rating" column are more spread out. This is true, but let’s see concretely why that’s the case.

Let’s start with the equation of the regression line in standard units from the previous subpart. Remember that here, x refers to "Complexity" and y refers to "Rating".

\text{predicted } y_{\text{(su)}} = r \cdot x_{\text{(su)}}

We know that to convert a value to standard units, we subtract the value by the mean of the column it came from, and divide by the standard deviation of the column it came from. As such, x_{\text{(su)}} = \frac{x - \text{mean of } x}{\text{SD of } x}. We can substitute this relationship in the regression line above, which gives us

\frac{\text{predicted } y - \text{mean of } y}{\text{SD of } y} = r \cdot \frac{x - \text{mean of } x}{\text{SD of } x}

To simplify things, let’s use what we were told. We were told that the predicted "Rating" was 3 points higher than average. This means that the numerator of the left side, \text{predicted } y - \text{mean of } y, is equal to 3. Similarly, we were told that the "Complexity" was 2 points higher than average, so x - \text{mean of } x is 2. Then, we have:

\frac{3}{\text{SD of } y} = \frac{2r}{\text{SD of }x}

Note that for convenience, we included r in the numerator on the right-hand side.

Remember that our goal is to compare the SD of "Rating" (y) to the SD of "Complexity" (x). We now have an equation that relates these two quantities! Since they’re both currently on the denominator, which can be tricky to work with, let’s take the reciprocal (i.e. “flip”) both fractions.

\frac{\text{SD of } y}{3} = \frac{\text{SD of }x}{2r}

Now, re-arranging gives us

\text{SD of } y \cdot \frac{2r}{3} = \text{SD of }x

Since we know that r is somewhere between 0 and 1, we know that \frac{2r}{3} is somewhere between 0 and \frac{2}{3}. This means that \text{SD of } x is somewhere between 0 and two-thirds of the value of \text{SD of } y, which means that no matter what, \text{SD of } x < \text{SD of } y. Remembering again that here "Complexity" is our x and "Rating" is our y, we have that the SD of "Complexity" is less than the SD of "Rating".


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 42%.



Problem 8

Suppose that for children’s games, "Play Time" and "Rating" are negatively linearly associated due to children having short attention spans. Suppose that for children’s games, the standard deviation of "Play Time" is twice the standard deviation of "Rating", and the average "Play Time" is 10 minutes. We use linear regression to predict the "Rating" of a children’s game based on its "Play Time". The regression line predicts that Don’t Break the Ice, a children’s game with a "Play Time" of 8 minutes will have a "Rating" of 4. Which of the following could be the average "Rating" for children’s games?

Answer: 3.1

Let’s recall the formulas for the regression line in original units, since we’re given information in original units in this question (such as the fact that for a "Play Time" of 8 minutes, the predicted "Rating" is 4 stars). Remember that throughout this question, "Play Time" is our x and "Rating" is our y.

The regression line is of the form y = mx + b, where

m = r \cdot \frac{\text{SD of } y}{\text{SD of }x}, b = \text{mean of }y - m \cdot \text{mean of } x

There’s a lot of information provided to us in the question – let’s think about what it means in the context of our xs and ys.

  • The first piece is that r is negative, so -1 \leq r < 0.
  • The second piece is that \text{SD of } x = 2 \cdot (\text{SD of } y). Equivalently, we can say that \frac{\text{SD of } y}{\text{SD of } x} = \frac{1}{2}. This form is convenient, because it’s close to the definition of the slope of the regression line, m. Using this fact, the slope of the regression line is m = r \cdot \frac{\text{SD of } y}{\text{SD of }x} = r \cdot \frac{1}{2} = \frac{r}{2}.
  • The \text{mean of } x is 10. This means that the intercept of the regression line, b, is b = \text{mean of }y - m \cdot \text{mean of } x = \text{mean of }y - \frac{r}{2} \cdot 10 = \text{mean of }y - 5r.
  • If x is 8, the predicted y is 4.

Given all of this information, we need to find possible values for the \text{mean of } y. Substituting our known values for m and b into y = mx + b gives us

y = \frac{r}{2} x + \text{mean of }y - 5r

Now, using the fact that if x = 8, the predicted y is 4, we have

\begin{align*}4 &= \frac{r}{2} \cdot 8 + \text{mean of }y - 5r\\4 &= 4r - 5r + \text{mean of }y\\ 4 + r &= \text{mean of }y\end{align*}

Cool! We now know that the \text{mean of } y is 4 + r. We know that r must satisfy the relationship -1 \leq r < 0. By adding 4 to all pieces of this inequality, we have that 3 \leq r + 4 < 4, which means that 3 \leq \text{mean of } y < 4. Of the four options provided, only one is greater than or equal to 3 and less than 4, which is 3.1.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 55%.


Problem 9

The American Kennel Club (AKC) organizes information about dog breeds. We’ve loaded their dataset into a DataFrame called df. The index of df contains the dog breed names as str values. Besides other columns, there is a column 'weight' (float) that contains typical weight (kg) and a column 'height' (float) that contains typical height (cm).

Sam wants to fit a linear model to predict a dog’s height using its weight.

He first runs the following code:

x = df.get('weight')
y = df.get('height')

def su(vals):
    return (vals - vals.mean()) / np.std(vals)


Problem 9.1

Select all of the Python snippets that correctly compute the correlation coefficient into the variable r.

Snippet 1:

r = (su(x) * su(y)).mean()

Snippet 2:

r = su(x * y).mean()

Snippet 3:

t = 0
for i in range(len(x)):
    t = t + su(x[i]) * su(y[i])
r = t / len(x)

Snippet 4:

t = np.array([])
for i in range(len(x)):
    t = np.append(t, su(x)[i] * su(y)[i])
r = t.mean()

Answer: Snippet 1 & 4

  • Snippet 1: Recall from the reference sheet, the correlation coefficient is r = (su(x) * su(y)).mean().

  • Snippet 2: We have to standardize each variable separately, so this snippet doesn't work.

  • Snippet 3: Note that for this snippet, we're standardizing each data point within each variable separately, so we're not standardizing the entire variable correctly. Applying su to a single data point isn't meaningful: that point minus its own mean is 0, and the standard deviation of a single value is also 0, so we don't recover where the point sits relative to the rest of the data.

  • Snippet 4: Note that this code is the same as Snippet 1, except we're now computing the product of each pair of corresponding standardized data points individually and then taking the mean. Hence this snippet works.


Difficulty: ⭐️⭐️

The average score on this problem was 81%.


Problem 9.2

Sam computes the following statistics for his sample:

  • The mean weight is 15 kg and the standard deviation of weights is 8 kg.
  • The mean height is 50 cm.
  • The correlation between weights and heights is r = 0.8.

The best-fit line predicts that a dog with a weight of 10 kg has a height of 45 cm.

What is the SD of dog heights?

Answer: Option 3: 10

The best-fit line in original units is given by y = mx + b, where m = r \cdot \frac{\text{SD of } y}{\text{SD of } x} and b = \text{mean of } y - m \cdot \text{mean of } x (refer to the reference sheet). Let c be the SD of y, the quantity we're trying to find. Then the best-fit line is y = \frac{0.8c}{8} \cdot x + \left(50 - \frac{0.8c}{8} \cdot 15\right), or equivalently y = 0.1c \cdot x + (50 - 1.5c). Plugging in the values we're given (a predicted height of 45 cm for a weight of 10 kg) gives 45 = 0.1c \cdot 10 + (50 - 1.5c), which simplifies to 45 = 50 - 0.5c, so c = 10.
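As a quick sanity check of c = 10, here is a minimal sketch using the sample statistics above:

r, mean_weight, sd_weight, mean_height, sd_height = 0.8, 15, 8, 50, 10

m = r * sd_height / sd_weight        # slope: 0.8 * 10 / 8 = 1.0
b = mean_height - m * mean_weight    # intercept: 50 - 1.0 * 15 = 35.0
print(m * 10 + b)                    # predicted height for a 10 kg dog: 45.0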


Difficulty: ⭐️⭐️

The average score on this problem was 89%.


Problem 9.3

Assume that the statistics in part b) still hold. Select all of the statements below that are true. (You don’t need to finish part b) in order to solve this question.)

Answer: Option 3 & 4

  • Option 1: We cannot determine whether two variables have a linear relationship simply from a line of best fit. The line of best fit just finds the best linear approximation to the relationship between two variables, whether or not that relationship is actually linear.

  • Option 2: To calculate the root mean squared error, we need the actual data points so that we can calculate residuals. Since we don't have access to the data points, we cannot say that the root mean squared error of the best-fit line is smaller than 5.

  • Option 3: This is true according to the problem statement given in part (b).

  • Option 4: This is true since we expect there to be a positive correlation between dog height and weight, so dogs that are lighter will also most likely be shorter (i.e., a dog that is lighter than 15 kg will most likely be shorter than 50 cm).


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 72%.


