Lecture 21 — Practice



Lecture 21 — Collected Practice Questions

Below are practice problems tagged for Lecture 21 (rendered directly from the original exam/quiz sources).


Problem 1

At the San Diego Model Railroad Museum, there are different admission prices for children, adults, and seniors. Over a period of time, as tickets are sold, employees keep track of how many of each type of ticket are sold. These ticket counts (in the order child, adult, senior) are stored as follows.

admissions_data = np.array([550, 1550, 400])


Problem 1.1

Complete the code below so that it creates an array admissions_proportions with the proportions of tickets sold to each group (in the order child, adult, senior).

def as_proportion(data):
    return __(a)__

admissions_proportions = as_proportion(admissions_data)

What goes in blank (a)?

Answer: data/data.sum()

To calculate the proportion for each group, we divide each value in the array (tickets sold to each group) by the sum of all values (total tickets sold). Remember that operations on arrays are applied elementwise, so the whole array can be processed at once.
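
As a quick check, here is the complete computation as a runnable sketch (nothing beyond the setup above is assumed):

import numpy as np

admissions_data = np.array([550, 1550, 400])

def as_proportion(data):
    # Elementwise division: each count divided by the total count.
    return data / data.sum()

admissions_proportions = as_proportion(admissions_data)
print(admissions_proportions)   # [0.22 0.62 0.16]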


Difficulty: ⭐️

The average score on this problem was 95%.


Problem 1.2

The museum employees have a model in mind for the proportions in which they sell tickets to children, adults, and seniors. This model is stored as follows.

model = np.array([0.25, 0.6, 0.15])

We want to conduct a hypothesis test to determine whether the admissions data we have is consistent with this model. Which of the following is the null hypothesis for this test?

Answer: Child, adult, and senior tickets are purchased in proportions 0.25, 0.6, and 0.15. (Option 2)

Recall, the null hypothesis is the hypothesis that there is no significant difference between the specified populations, with any observed difference being due to sampling or experimental error. So, we assume the distribution of ticket sales is the same as the model.


Difficulty: ⭐️⭐️

The average score on this problem was 88%.


Problem 1.3

Which of the following test statistics could we use to test our hypotheses? Select all that could work.

Answer: sum of squared differences in proportions, mean of squared differences in proportions (Options 2 and 4)

We square the differences so that large positive and negative differences cannot cancel each other out when we take the sum or mean, which would otherwise produce a small sum or mean of differences that does not reflect the actual deviation from the model. So, we eliminate Options 1 and 3.


Difficulty: ⭐️⭐️

The average score on this problem was 77%.


Problem 1.4

Below, we’ll perform the hypothesis test with a different test statistic, the mean of the absolute differences in proportions.

Recall that the ticket counts we observed for children, adults, and seniors are stored in the array admissions_data = np.array([550, 1550, 400]), and that our model is model = np.array([0.25, 0.6, 0.15]).

For our hypothesis test to determine whether the admissions data is consistent with our model, what is the observed value of the test statistic? Give your answer as a number between 0 and 1, rounded to three decimal places. (Suppose that the value you calculated is assigned to the variable observed_stat, which you will use in later questions.)

Answer: 0.02

We first calculate the proportion for each value in admissions_data:

\frac{550}{550+1550+400} = 0.22 \qquad \frac{1550}{550+1550+400} = 0.62 \qquad \frac{400}{550+1550+400} = 0.16

So, the distribution of admissions_data is (0.22, 0.62, 0.16).

Then, we calculate the observed value of the test statistic (the mean of the absolute differences in proportions):

\frac{|0.22-0.25|+|0.62-0.6|+|0.16-0.15|}{\text{number of groups}} = \frac{0.03+0.02+0.01}{3} = 0.02
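
In code, the observed statistic is one line, using admissions_data, model, and as_proportion from earlier:

observed_stat = np.abs(as_proportion(admissions_data) - model).mean()
# (0.03 + 0.02 + 0.01) / 3 = 0.02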


Difficulty: ⭐️⭐️

The average score on this problem was 82%.


Problem 1.5

Now, we want to simulate the test statistic 10,000 times under the assumptions of the null hypothesis. Fill in the blanks below to complete this simulation and calculate the p-value for our hypothesis test. Assume that the variables admissions_data, admissions_proportions, model, and observed_stat are already defined as specified earlier in the question.

simulated_stats = np.array([]) 
for i in np.arange(10000):
    simulated_proportions = as_proportion(np.random.multinomial(__(a)__, __(b)__))
    simulated_stat = __(c)__
    simulated_stats = np.append(simulated_stats, simulated_stat)

p_value = __(d)__

What goes in blank (a)? What goes in blank (b)? What goes in blank (c)? What goes in blank (d)?

Answer: (a) admissions_data.sum() (b) model (c) np.abs(simulated_proportions - model).mean() (d) np.count_nonzero(simulated_stats >= observed_stat) / 10000

Recall, in np.random.multinomial(n, [p_1, ..., p_k]), n is the number of experiments and [p_1, ..., p_k] is a sequence of probabilities. The function returns an array of length k in which the ith element counts the number of occurrences of the ith event, where the probability of the ith event is p_i.

We want our simulated sample to be the same size as the observed sample, so in (a) we use admissions_data.sum(), the total number of tickets sold.

Since our null hypothesis says the data were drawn according to model, we simulate from the distribution in model, so we have model in (b).

In (c), we compute the mean of the absolute differences in proportions. np.abs(simulated_proportions - model) gives us a series of absolute differences, and .mean() computes the mean of the absolute differences.

In (d), we calculate p_value. Recall, the p-value is the chance, under the null hypothesis, that the test statistic is equal to the value that was observed in the data or is even further in the direction of the alternative. np.count_nonzero(simulated_stats >= observed_stat) gives us the number of simulated statistics greater than or equal to observed_stat across the 10,000 simulations, so we divide by 10000 to get the proportion of simulated statistics at least as large as the observed one. This proportion is the p-value.
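
For reference, here is the simulation with all four blanks filled in as above:

simulated_stats = np.array([])
for i in np.arange(10000):
    simulated_proportions = as_proportion(np.random.multinomial(admissions_data.sum(), model))
    simulated_stat = np.abs(simulated_proportions - model).mean()
    simulated_stats = np.append(simulated_stats, simulated_stat)

p_value = np.count_nonzero(simulated_stats >= observed_stat) / 10000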


Difficulty: ⭐️⭐️

The average score on this problem was 79%.


Problem 1.6

True or False: the p-value represents the probability that the null hypothesis is true.

Answer: False

Recall, the p-value is the chance, under the null hypothesis, that the test statistic is equal to the value that was observed in the data or is even further in the direction of the alternative. It measures the strength of the evidence against the null hypothesis, which is different from “the probability that the null hypothesis is true.”


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 64%.


Problem 1.7

The new statistic that we used for this hypothesis test, the mean of the absolute differences in proportions, is in fact closely related to the total variation distance. Given two arrays of length three, array_1 and array_2, suppose we compute the mean of the absolute differences in proportions between array_1 and array_2 and store the result as madp. What value would we have to multiply madp by to obtain the total variation distance between array_1 and array_2? Give your answer as a number rounded to three decimal places.

Answer: 1.5

Recall, the total variation distance (TVD) is the sum of the absolute differences in proportions, divided by 2. When we compute the mean of the absolute differences in proportions, we are computing the sum of the absolute differences in proportions divided by the number of groups (which is 3). Thus, to get the TVD, we first multiply our current statistic (the mean of the absolute differences in proportions) by 3 to recover the sum of the absolute differences in proportions. Then, according to the definition of TVD, we divide this value by 2. Thus, we have \text{madp} \cdot 3 / 2 = \text{madp} \cdot 1.5.
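
A small numeric check, using two hypothetical length-3 distributions:

dist_1 = np.array([0.25, 0.60, 0.15])
dist_2 = np.array([0.22, 0.62, 0.16])

madp = np.abs(dist_1 - dist_2).mean()     # (0.03 + 0.02 + 0.01) / 3 = 0.02
tvd = np.abs(dist_1 - dist_2).sum() / 2   # 0.06 / 2 = 0.03
# tvd == 1.5 * madp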


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 65%.



Source: sp23-final — Q6

Problem 2

Oren’s favorite bakery in San Diego is Wayfarer. After visiting frequently, he decides to learn how to make croissants and baguettes himself, and to do so, he books a trip to France.

Oren is interested in estimating the mean number of sunshine hours in July across all 10,000+ cities in France. Using the 16 French cities in sun, Oren constructs a 95% Central Limit Theorem (CLT)-based confidence interval for the mean sunshine hours of all cities in France. The interval is of the form [L, R], where L and R are positive numbers.


Problem 2.1

Which of the following expressions is equal to the standard deviation of the number of sunshine hours of the 16 French cities in sun?

Answer: R - L

Note that the 95% CI has the following form:

[\text{Sample Mean} - 2 \cdot \text{SD of Distribution of Possible Sample Means}, \text{Sample Mean} + 2 \cdot \text{SD of Distribution of Possible Sample Means}]

This makes its width 4 \cdot \text{SD of Distribution of Possible Sample Means}. We can use the square root law, the fact that we can use our sample’s SD as an estimate of our population’s SD when creating a confidence interval, and the fact that the sample size is 16, to re-write the width as:

\begin{align*} \text{width} &= 4 \cdot \text{SD of Distribution of Possible Sample Means} \\ &= 4 \cdot \left(\frac{\text{Population SD}}{\sqrt{\text{Sample Size}}}\right) \\ &\approx 4 \cdot \left(\frac{\text{Sample SD}}{\sqrt{\text{Sample Size}}}\right) \\ &= 4 \cdot \left(\frac{\text{Sample SD}}{4}\right) \\ &= \text{Sample SD} \end{align*}

Since \text{width} = \text{Sample SD}, and since \text{width} = R - L, we have that \text{Sample SD} = R - L.
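
The same algebra as a code sketch, assuming a hypothetical array sample holding the 16 observed sunshine-hour values:

sample_mean = sample.mean()
sample_sd = np.std(sample)
# Square root law, using the sample SD as an estimate of the population SD.
sd_of_sample_mean = sample_sd / np.sqrt(16)
L = sample_mean - 2 * sd_of_sample_mean
R = sample_mean + 2 * sd_of_sample_mean
# R - L == 4 * (sample_sd / 4) == sample_sd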


Difficulty: ⭐️⭐️⭐️⭐️⭐️

The average score on this problem was 27%.



Problem 2.2

True or False: There is a 95% chance that the interval [L, R] contains the mean number of sunshine hours in July of all 16 French cities in sun.

Answer: False

[L, R] contains the sample mean for sure, since it is centered at the sample mean. There is no probability associated with this fact, since neither [L, R] nor the sample mean is random (given that our sample has already been drawn).


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 62%.


Problem 2.3

True or False: If we collected 1,000 new samples of 16 French cities and computed the mean of each sample, then about 95% of the new sample means would be contained in [L, R].

Answer: False

It is true that if we collected many samples and used each one to make a 95% confidence interval, about 95% of those confidence intervals would contain the population mean. However, that’s not what this statement is addressing. Instead, it’s asking whether the one interval we created in particular, [L,R], would contain 95% of other samples’ means. In general, there’s no guarantee of the proportion of means of other samples that would fall in [L, R]; for instance, it’s possible that the sample that we used to create [L, R] was not a representative sample.


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 42%.



Problem 2.4

True or False: If we collected 1,000 new samples of 16 French cities and created a 95% confidence interval using each one, then chose one of the 1,000 new intervals at random, the chance that the randomly chosen interval contains the mean sunshine hours in July across all cities in France is approximately 95%.

Answer: True

It is true that if we collected many samples and used each one to make a 95% confidence interval, about 95% of those confidence intervals would contain the population mean, as we mentioned above. So, if we picked one of those confidence intervals at random, there’s an approximately 95% chance it would contain the population mean.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 57%.



Problem 2.5

True or False: The interval [L, R] is centered at the mean number of sunshine hours in July across all cities in France.

Answer: False

It is centered at our sample mean, which is the mean number of sunshine hours in July across the 16 French cities in sun, but not necessarily at the population mean. We don’t know where the population mean is!


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 58%.


In addition to creating a 95% CLT-based confidence interval for the mean sunshine hours of all cities in France, Oren would like to create a 72% bootstrap-based confidence interval for the mean sunshine hours of all cities in France.

Oren resamples from the sunshine hours of the 16 French cities in sun 10,000 times and creates an array named french_sunshine containing 10,000 resampled means. He wants to find the left and right endpoints of his 72% confidence interval:

    boot_left = np.percentile(french_sunshine, __(a)__)
    boot_right = np.percentile(french_sunshine, __(b)__)


Problem 2.6

Fill in the blanks so that boot_left and boot_right evaluate to the left and right endpoints of a 72% confidence interval for the mean sunshine hours in July across all cities in France.

What goes in blanks (a) and (b)?

Answer: (a): 14, (b): 86

A 72% confidence interval is constructed by taking the middle 72% of the distribution of resampled means. This means we need to exclude 100\% - 72\% = 28\% of values – the smallest 14% and the largest 14%. Blank (a), then, is 14, and blank (b) is 100 - 14 = 86.
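
For context, a hedged sketch of how french_sunshine might be generated, assuming the 16 observed sunshine-hour values live in a hypothetical array sun_hours:

french_sunshine = np.array([])
for i in np.arange(10000):
    resample = np.random.choice(sun_hours, 16, replace=True)
    french_sunshine = np.append(french_sunshine, resample.mean())

boot_left = np.percentile(french_sunshine, 14)
boot_right = np.percentile(french_sunshine, 86)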


Difficulty: ⭐️⭐️

The average score on this problem was 81%.


Suppose we are interested in testing the following pair of hypotheses.

Null Hypothesis: The mean number of sunshine hours in July across all cities in France is 225.

Alternative Hypothesis: The mean number of sunshine hours in July across all cities in France is not 225.


Problem 2.7

Suppose that when Oren uses [boot_left, boot_right], his 72% bootstrap-based confidence interval, he fails to reject the null hypothesis above. If that’s the case, then when using [L, R], his 95% CLT-based confidence interval, what is the conclusion of his hypothesis test?

Answer: Impossible to tell

First, remember that we fail to reject the null whenever the parameter stated in the null hypothesis (225 in this case) is in the interval. So we’re told 225 is in the 72% bootstrapped interval. There’s a possibility that the 72% bootstrapped confidence interval isn’t completely contained within the 95% CLT interval, since the specific interval we get back with bootstrapping depends on the random resamples we get. What that means is that it’s possible for 225 to be in the 72% bootstrapped interval but not the 95% CLT interval, and it’s also possible for it to be in the 95% CLT interval. Therefore, given no other information it’s impossible to tell.


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 47%.



Problem 2.8

Suppose that Oren also creates a 72% CLT-based confidence interval for the mean sunshine hours of all cities in France in July using the same 16 French cities in sun he started with. When using his 72% CLT-based confidence interval, he fails to reject the null hypothesis above. If that’s the case, then when using [L, R], what is the conclusion of his hypothesis test?

Answer: Fail to reject the null

If 225 is in the 72% CLT interval, it must be in the 95% CLT interval, since the two intervals are centered at the same location and the 95% interval is just wider than the 72% interval. The main difference between this part and the previous one is the fact that this 72% interval was made with the CLT, not via bootstrapping, even though they’re likely to be similar.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 72%.



Problem 2.9

True or False: The significance levels of the two hypothesis tests described in the previous part (one using the 72% CLT-based interval, one using [L, R]) are equal.

Answer: False

When using a 72% confidence interval, the significance level, i.e. p-value cutoff, is 28%. When using a 95% confidence interval, the significance level is 5%.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 62%.



Source: sp24-final — Q10

Problem 3

We want to use the data in apts to test the following hypotheses:

Null Hypothesis: Rents of apartments in UTC and rents of apartments elsewhere come from the same distribution.

Alternative Hypothesis: On average, apartments in UTC have higher rents than apartments elsewhere.

While we could answer this question with a permutation test, in this problem we will explore another way to test these hypotheses. Since this is a question of whether two samples come from the same unknown population distribution, we need to construct a “population” to sample from. We will construct our “population” in the same way as we would for a permutation test, except we will draw our sample differently. Instead of shuffling, we will draw our two samples with replacement from the constructed “population”. We will use as our test statistic the difference in means between the two samples (in the order UTC minus elsewhere).


Problem 3.1

Suppose the data in apts, which has 800 rows, includes 85 apartments in UTC. Fill in the blanks below so that p_val evaluates to the p-value for this hypothesis test, which we will test according to the strategy outlined above.

diffs = np.array([])
for i in np.arange(10000):
    utc_sample_mean = __(a)__
    elsewhere_sample_mean = __(b)__
    diffs = np.append(diffs, utc_sample_mean - elsewhere_sample_mean)
observed_utc_mean = __(c)__
observed_elsewhere_mean = __(d)__
observed_diff = observed_utc_mean - observed_elsewhere_mean
p_val = np.count_nonzero(diffs __(e)__ observed_diff) / 10000

Answer:

    (a) apts.sample(85, replace=True).get("Rent").mean()
    (b) apts.sample(715, replace=True).get("Rent").mean()
    (c) apts[apts.get("neighborhood")=="UTC"].get("Rent").mean()
    (d) apts[apts.get("neighborhood")!="UTC"].get("Rent").mean()
    (e) >=

For blanks (a) and (b), we can gather from context (hypothesis test description, variable names, and being inside of a for loop) that this portion of our code needs to repeatedly generate samples of size 85 (the number of observations in our dataset that are from UTC) and size 715 (the number of observations in our dataset that are not from UTC). We will then take the means of these samples and assign them to utc_sample_mean and elsewhere_sample_mean. We can generate these samples, with replacement, from the rows in our dataframe, hinting that the correct code for blanks (a) and (b) are: apts.sample(85, replace=True).get("Rent").mean() and apts.sample(715, replace=True).get("Rent").mean().

For blanks (c) and (d), this portion of the code needs to take our original dataframe and gather the observed means for apartments from UTC and apartments not from UTC. We can achieve this by querying our dataframe, grabbing the rent column, and taking the mean. This implies our correct code for blanks (c) and (d) are: apts[apts.get("neighborhood")=="UTC"].get("Rent").mean() and apts[apts.get("neighborhood")!="UTC"].get("Rent").mean().

For blank (e), we need to determine, based on our null and alternative hypotheses, how we should compare the differences found in our simulations against our observed difference. The alternative hypothesis states that rents of apartments in UTC are higher than rents of apartments in other neighborhoods. Since the observed statistic is calculated as observed_utc_mean - observed_elsewhere_mean, we want to use >= in (e): greater values of diffs indicate that utc_sample_mean is greater than elsewhere_sample_mean, which is in the direction of the alternative hypothesis.
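
With all five blanks filled in as above, the completed code reads:

diffs = np.array([])
for i in np.arange(10000):
    utc_sample_mean = apts.sample(85, replace=True).get("Rent").mean()
    elsewhere_sample_mean = apts.sample(715, replace=True).get("Rent").mean()
    diffs = np.append(diffs, utc_sample_mean - elsewhere_sample_mean)
observed_utc_mean = apts[apts.get("neighborhood") == "UTC"].get("Rent").mean()
observed_elsewhere_mean = apts[apts.get("neighborhood") != "UTC"].get("Rent").mean()
observed_diff = observed_utc_mean - observed_elsewhere_mean
p_val = np.count_nonzero(diffs >= observed_diff) / 10000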


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 70%.


Problem 3.2

Now suppose we tested the same hypotheses with a permutation test using the same test statistic. Which of your answers in the previous part would need to change? Select all that apply.

Answer: Blanks (a) and (b) would need to change. For a permutation test, we would shuffle the labels in our apts dataset and find the utc_sample_mean and elsewhere_sample_mean of this new shuffled dataset. Note that this process is done without replacement and that both of these sample means are calculated from the same shuffle of our dataset.

As it currently stands, our code for blanks (a) and (b) does not reflect this: the current process draws two samples with replacement, independently, from the full dataset rather than splitting a single shuffle of it. So, blanks (a) and (b) must change.
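
As a hedged sketch (assuming babypandas-style .assign and .get, as used elsewhere in this problem), one permutation-test replacement for blanks (a) and (b) might look like:

shuffled = apts.assign(shuffled_neighborhood=np.random.permutation(apts.get("neighborhood")))
utc_sample_mean = shuffled[shuffled.get("shuffled_neighborhood") == "UTC"].get("Rent").mean()
elsewhere_sample_mean = shuffled[shuffled.get("shuffled_neighborhood") != "UTC"].get("Rent").mean()

Note that both means come from the same shuffle, and no sampling with replacement is involved.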


Difficulty: ⭐️

The average score on this problem was 93%.


Problem 3.3

Now suppose we test the following pair of hypotheses.

Null Hypothesis: On average, the rent of apartments in UTC is equal to the rent of apartments elsewhere.

Alternative Hypothesis: On average, the rent of apartments in UTC is not equal to the rent of apartments elsewhere.

Then we can test this pair of hypotheses by constructing a 95% confidence interval for a parameter and checking if some particular number, x, falls in that confidence interval. To do this:

  1. What parameter should we construct a 95% confidence interval for? Your answer should be a phrase or short sentence.

  2. What is the value of x? Your answer should be a number.

  3. Suppose x is in the 95% confidence interval we create. Select all valid conclusions below.

Answer:

  • (i) The average rent of an apartment in UTC minus the average rent of an apartment elsewhere, or vice versa.
  • (ii) 0.
  • (iii) 3rd and 4th options.

For (i), we need to construct a confidence interval for a parameter that allows us to make assessments about our null and alternative hypotheses. Since these two hypotheses discuss whether or not there exists a difference, on average, for rents of apartments in UTC versus rents of apartments elsewhere, our parameter should be: the difference in rent for apartments in UTC and apartments elsewhere on average, or vice versa (The average rent of an apartment in UTC minus the average rent of an apartment elsewhere, or vice versa.)

For (ii), x must be 0 because the value zero holds special significance in our confidence interval; the inclusion of zero within our confidence interval suggests that “there is no difference between rent of apartments in UTC and apartments elsewhere, on average”. Whether or not zero is included within our confidence interval tells us whether we should fail to reject or reject the null hypothesis.

For (iii), if x = 0 lies within our 95% confidence interval, it suggests that there is a sizable chance that there is no difference between the rent of apartments in UTC and apartments elsewhere, on average, which is consistent with our null hypothesis; this means that any options which reject the null hypothesis, such as the first and second options, are wrong. The third option (correctly) fails to reject the null hypothesis at the 5% significance level, which is exactly what a 95% confidence interval that includes x = 0 supports. The fourth option is also correct because any evidence weak enough to fail to reject the null hypothesis at the 5% significance level will also fail to reject at a stricter significance level (such as 1%).


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 36%.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 53%.


Difficulty: ⭐️⭐️

The average score on this problem was 77%.



Problem 4

Let’s suppose there are 4 different types of shots a basketball player can take – layups, midrange shots, threes, and free throws.

The DataFrame breakdown has 4 rows and 50 columns – one row for each of the 4 shot types mentioned above, and one column for each of 50 different players. Each column of breakdown describes the distribution of shot types for a single player.

The first few columns of breakdown are shown below; the first column, for Kelsey Plum, contains the following distribution.

                Kelsey Plum
    Layup              0.30
    Midrange           0.30
    Three              0.20
    Free Throw         0.20
For instance, 30% of Kelsey Plum’s shots are layups, 30% of her shots are midrange shots, 20% of her shots are threes, and 20% of her shots are free throws.


Problem 4.1

Below, we’ve drawn an overlaid bar chart showing the shot distributions of Kelsey Plum and Chiney Ogwumike, a player on the Los Angeles Sparks.

[Overlaid bar chart. Kelsey Plum: layups 0.30, midrange 0.30, threes 0.20, free throws 0.20. Chiney Ogwumike: layups 0.25, midrange 0.35, threes 0.35, free throws 0.05.]
What is the total variation distance (TVD) between Kelsey Plum’s shot distribution and Chiney Ogwumike’s shot distribution? Give your answer as a proportion between 0 and 1 (not a percentage) rounded to three decimal places.

Answer: 0.2

Recall, the TVD is the sum of the absolute differences in proportions, divided by 2. The absolute differences in proportions for each category are as follows:

  • Free Throws: |0.05 - 0.2| = 0.15
  • Threes: |0.35 - 0.2| = 0.15
  • Midrange: |0.35 - 0.3| = 0.05
  • Layups: |0.25 - 0.3| = 0.05

Then, we have

\text{TVD} = \frac{1}{2} (0.15 + 0.15 + 0.05 + 0.05) = 0.2
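
The same computation in code, with the two distributions read off the chart (in the order layups, midrange, threes, free throws):

plum = np.array([0.30, 0.30, 0.20, 0.20])
ogwumike = np.array([0.25, 0.35, 0.35, 0.05])
tvd = np.abs(plum - ogwumike).sum() / 2   # 0.2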


Difficulty: ⭐️⭐️

The average score on this problem was 84%.


Problem 4.2

Recall, breakdown has information for 50 different players. We want to find the player whose shot distribution is the most similar to Kelsey Plum, i.e. has the lowest TVD with Kelsey Plum’s shot distribution.

Fill in the blanks below so that most_sim_player evaluates to the name of the player with the most similar shot distribution to Kelsey Plum. Assume that the column named 'Kelsey Plum' is the first column in breakdown (and again that breakdown has 50 columns total).

most_sim_player = ''
lowest_tvd_so_far = __(a)__
other_players = np.array(breakdown.columns).take(__(b)__)
for player in other_players:
    player_tvd = tvd(breakdown.get('Kelsey Plum'),
                     breakdown.get(player))
    if player_tvd < lowest_tvd_so_far:
        lowest_tvd_so_far = player_tvd
        __(c)__
  1. What goes in blank (a)?
  2. What goes in blank (b)?
  3. What goes in blank (c)?

Answers: 1, np.arange(1, 50), most_sim_player = player

Let’s try and understand the code provided to us. It appears that we’re looping over the names of all other players, each time computing the TVD between Kelsey Plum’s shot distribution and that player’s shot distribution. If the TVD calculated in an iteration of the for-loop (player_tvd) is less than the previous lowest TVD (lowest_tvd_so_far), the current player (player) is now the most “similar” to Kelsey Plum, and so we store their TVD and name (in most_sim_player).

Before the for-loop, we haven’t looked at any other players, so we don’t have values to store in most_sim_player and lowest_tvd_so_far. On the first iteration of the for-loop, both of these values need to be updated to reflect Kelsey Plum’s similarity with the first player in other_players. This is because, if we’ve only looked at one player, that player is the most similar to Kelsey Plum. most_sim_player is already initialized as an empty string, and we will specify how to “update” most_sim_player in blank (c). For blank (a), we need to pick a value of lowest_tvd_so_far that we can guarantee will be updated on the first iteration of the for-loop. Recall, TVDs range from 0 to 1, with 0 meaning “most similar” and 1 meaning “most different”. This means that no matter what, the TVD between Kelsey Plum’s distribution and the first player’s distribution will be less than 1*, and so if we initialize lowest_tvd_so_far to 1 before the for-loop, we know it will be updated on the first iteration.

  • It’s possible that the TVD between Kelsey Plum’s shot distribution and the first other player’s shot distribution is equal to 1, rather than being less than 1. If that were to happen, our code would still generate the correct answer, but lowest_tvd_so_far and most_sim_player wouldn’t be updated on the first iteration. Rather, they’d be updated on the first iteration where player_tvd is strictly less than 1. (We’d expect that the TVDs between all pairs of players are neither exactly 0 nor exactly 1, so this is not a practical issue.) To avoid this issue entirely, we could change if player_tvd < lowest_tvd_so_far to if player_tvd <= lowest_tvd_so_far, which would make sure that even if the first TVD is 1, both lowest_tvd_so_far and most_sim_player are updated on the first iteration.
  • Note that we could have initialized lowest_tvd_so_far to a value larger than 1 as well. Suppose we initialized it to 55 (an arbitrary positive integer). On the first iteration of the for-loop, player_tvd will be less than 55, and so lowest_tvd_so_far will be updated.

Then, we need other_players to be an array containing the names of all players other than Kelsey Plum, whose name is stored at position 0 in breakdown.columns. We are told that there are 50 players total, i.e. that there are 50 columns in breakdown. We want to take the elements in breakdown.columns at positions 1, 2, 3, …, 49 (the last element), and the call to np.arange that generates this sequence of positions is np.arange(1, 50). (Remember, np.arange(a, b) does not include the second integer!)

In blank (c), as mentioned in the explanation for blank (a), we need to update the value of most_sim_player. (Note that we only arrive at this line if player_tvd is the lowest pairwise TVD we’ve seen so far.) All this requires is most_sim_player = player, since player contains the name of the player who we are looking at in the current iteration of the for-loop.
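
With the blanks filled in, the completed code is:

most_sim_player = ''
lowest_tvd_so_far = 1
other_players = np.array(breakdown.columns).take(np.arange(1, 50))
for player in other_players:
    # Assumes the tvd function given in the problem is defined.
    player_tvd = tvd(breakdown.get('Kelsey Plum'),
                     breakdown.get(player))
    if player_tvd < lowest_tvd_so_far:
        lowest_tvd_so_far = player_tvd
        most_sim_player = player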


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 70%.


Problem 4.3

Let’s again consider the shot distributions of Kelsey Plum and Chiney Ogwumike.

We define the maximum squared distance (MSD) between two categorical distributions as the largest squared difference between the proportions of any category.

What is the MSD between Kelsey Plum’s shot distribution and Chiney Ogwumike’s shot distribution? Give your answer as a proportion between 0 and 1 (not a percentage) rounded to three decimal places.

Answer: 0.023

Recall, in the solution to the first subpart of this problem, we calculated the absolute differences between the proportions of each category.

  • Free Throws: |0.05 - 0.2| = 0.15
  • Threes: |0.35 - 0.2| = 0.15
  • Midrange: |0.35 - 0.3| = 0.05
  • Layups: |0.25 - 0.3| = 0.05

The squared differences between the proportions of each category are computed by squaring the results in the list above (e.g. for Free Throws we’d have (0.05 - 0.2)^2 = 0.15^2). To find the maximum squared difference, then, all we need to do is find the largest of 0.15^2, 0.15^2, 0.05^2, and 0.05^2. Since 0.15 > 0.05, we have that the maximum squared distance is 0.15^2 = 0.0225, which rounds to 0.023.
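
Using the same plum and ogwumike arrays from the first subpart, the MSD is one line:

msd = ((plum - ogwumike) ** 2).max()   # 0.15 ** 2 = 0.0225, which rounds to 0.023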


Difficulty: ⭐️⭐️

The average score on this problem was 85%.


Problem 4.4

For your convenience, recall Kelsey Plum’s column of breakdown: 30% of her shots are layups, 30% are midrange shots, 20% are threes, and 20% are free throws.

In basketball, layups and midrange shots are worth 2 points each, threes are worth 3 points, and free throws are worth 1 point.

Suppose that Kelsey Plum is guaranteed to shoot exactly 10 shots a game. The type of each shot is drawn from the 'Kelsey Plum' column of breakdown (meaning that, for example, there is a 30% chance each shot is a layup).

Fill in the blanks below to complete the definition of the function simulate_points, which simulates the number of points Kelsey Plum scores in a single game. (simulate_points should return a single number.)

def simulate_points():
    shots = np.random.multinomial(__(a)__, breakdown.get('Kelsey Plum'))
    possible_points = np.array([2, 2, 3, 1])
    return __(b)__
  1. What goes in blank (a)?
  2. What goes in blank (b)?

Answers: 10, (shots * possible_points).sum()

To simulate the number of points Kelsey Plum scores in a single game, we need to:

  1. Simulate the number of shots she takes of each type (layups, midranges, threes, free throws).
  2. Using the simulated distribution in step 1, find the total number of points she scores – specifically, add 2 for every layup, 2 for every midrange, 3 for every three, and 1 for every free throw.

To simulate the number of shots she takes of each type, we use np.random.multinomial. This is because each shot, independently of all other shots, has a 30% chance of being a layup, a 30% chance of being a midrange, and so on. What goes in blank (a) is the number of shots she is taking in total; here, that is 10. shots will be an array of length 4 containing the number of shots of each type - for instance, shots may be np.array([3, 4, 2, 1]), which would mean she took 3 layups, 4 midranges, 2 threes, and 1 free throw.

Now that we have shots, we need to factor in how many points each type of shot is worth. This can be accomplished by multiplying shots with possible_points, which was already defined for us. Using the example where shots is np.array([3, 4, 2, 1]), shots * possible_points evaluates to np.array([6, 8, 6, 1]), which would mean she scored 6 points from layups, 8 points from midranges, and so on. Then, to find the total number of points she scored, we need to compute the sum of this array, either using the np.sum function or .sum() method. As such, the two correct answers for blank (b) are (shots * possible_points).sum() and np.sum(shots * possible_points).
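
The completed function:

def simulate_points():
    # Simulate the type of each of her 10 shots...
    shots = np.random.multinomial(10, breakdown.get('Kelsey Plum'))
    # ...then weight each shot type by its point value and add up.
    possible_points = np.array([2, 2, 3, 1])
    return (shots * possible_points).sum()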


Difficulty: ⭐️⭐️

The average score on this problem was 84%.



Problem 4.5

True or False: If we call simulate_points() 10,000 times and plot a histogram of the results, the distribution will look roughly normal.

Answer: True

The answer is True because of the Central Limit Theorem. Recall, the CLT states that no matter what the population distribution looks like, if you take many repeated samples with replacement, the distribution of the sample means and sample sums will be roughly normal. simulate_points() returns the sum of a sample of size 10 drawn with replacement from a population, and so if we generate many sample sums, the distribution of those sample sums will be roughly normal.

The distribution we are drawing from is the one below.

    Type         Points   Probability
    Layups            2           0.3
    Midrange          2           0.3
    Threes            3           0.2
    Free Throws       1           0.2
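
A hedged sketch of the experiment described above; the histogram itself could be drawn with any plotting tool:

results = np.array([])
for i in np.arange(10000):
    results = np.append(results, simulate_points())
# The distribution of results is roughly normal, centered near the expected
# number of points per game: 10 * (2 * 0.3 + 2 * 0.3 + 3 * 0.2 + 1 * 0.2) = 20.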

Difficulty: ⭐️⭐️

The average score on this problem was 78%.