# Discussion 7: Permutation Testing and Bootstrapping

The problems in this worksheet are taken from past exams. Work on them on paper, since the exams you take in this course will also be on paper.

We encourage you to complete this worksheet in a live discussion section. Solutions will be made available after all discussion sections have concluded. You don’t need to submit your answers anywhere.

Note: We do not plan to cover all problems here in the live discussion section; the problems we don’t cover can be used for extra practice.

## Problem 1

Researchers from the San Diego Zoo, located within Balboa Park, collected physical measurements of three species of penguins (Adelie, Chinstrap, or Gentoo) in a region of Antarctica. One piece of information they tracked for each of 330 penguins was its mass in grams. The average penguin mass is 4200 grams, and the standard deviation is 840 grams.

We’re interested in investigating the differences between the masses of Adelie penguins and Chinstrap penguins. Specifically, our null hypothesis is that their masses are drawn from the same population distribution, and any observed differences are due to chance only.

Below, we have a snippet of working code for this hypothesis test, for a specific test statistic. Assume that adelie_chinstrap is a DataFrame of only Adelie and Chinstrap penguins, with just two columns – 'species' and 'mass'.

```python
stats = np.array([])
num_reps = 500
for i in np.arange(num_reps):
    # --- line (a) starts ---
    # --- line (a) ends ---

    # --- line (b) starts ---
    # --- line (b) ends ---

    grouped = with_shuffled.groupby('species').mean()

    # --- line (c) starts ---
    stat = grouped.get('mass').iloc[0] - grouped.get('mass').iloc[1]
    # --- line (c) ends ---

    stats = np.append(stats, stat)
```

### Problem 1.1

Which of the following statements best describes the procedure above?

• This is a standard hypothesis test, and our test statistic is the total variation distance between the distribution of Adelie masses and Chinstrap masses

• This is a standard hypothesis test, and our test statistic is the difference between the expected proportion of Adelie penguins and the proportion of Adelie penguins in our resample

• This is a permutation test, and our test statistic is the total variation distance between the distribution of Adelie masses and Chinstrap masses

• This is a permutation test, and our test statistic is the difference in the mean Adelie mass and mean Chinstrap mass

Answer: This is a permutation test, and our test statistic is the difference in the mean Adelie mass and mean Chinstrap mass (Option 4)

Recall, a permutation test helps us decide whether two random samples come from the same distribution. This matches our goal of testing whether the masses of Adelie penguins and Chinstrap penguins are drawn from the same population distribution. The code above also carries out the steps of a permutation test. In line (a), it shuffles the 'species' column and stores the shuffled series in shuffled. In line (b), it assigns the shuffled series back to the 'species' column, producing with_shuffled. Then, it uses grouped = with_shuffled.groupby('species').mean() to calculate the mean mass of each species. In line (c), it computes the difference between the mean masses of the two species by first getting the 'mass' column and then accessing the mean mass of each group (Adelie and Chinstrap) with positional indices 0 and 1.
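As a concrete illustration, here is one way the completed test might look on a tiny made-up DataFrame. The names shuffled and with_shuffled follow the explanation above, but the exact code for lines (a) and (b) is a sketch; regular pandas stands in for babypandas, whose .get, .assign, and .groupby behave the same way here.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for adelie_chinstrap (pandas mimics babypandas here).
adelie_chinstrap = pd.DataFrame({
    'species': ['Adelie'] * 4 + ['Chinstrap'] * 3,
    'mass':    [3700, 3800, 3550, 3900, 3600, 3650, 3800],
})

stats = np.array([])
num_reps = 500
np.random.seed(42)  # for reproducibility of this sketch only
for i in np.arange(num_reps):
    # line (a): shuffle the group labels
    shuffled = np.random.permutation(adelie_chinstrap.get('species'))
    # line (b): assign the shuffled labels back to the 'species' column
    with_shuffled = adelie_chinstrap.assign(species=shuffled)

    grouped = with_shuffled.groupby('species').mean()

    # line (c): difference in mean mass, Adelie minus Chinstrap
    stat = grouped.get('mass').iloc[0] - grouped.get('mass').iloc[1]
    stats = np.append(stats, stat)

print(len(stats))  # 500 simulated statistics
```

Note that shuffling the labels preserves the group sizes (4 Adelie, 3 Chinstrap here), which is what makes this a valid permutation test.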

##### Difficulty: ⭐️

The average score on this problem was 98%.

### Problem 1.2

Currently, line (c) (marked with a comment) uses .iloc. Which of the following options compute the exact same statistic as line (c) currently does?

Option 1:

```python
stat = grouped.get('mass').loc['Adelie'] - grouped.get('mass').loc['Chinstrap']
```

Option 2:

```python
stat = grouped.get('mass').loc['Chinstrap'] - grouped.get('mass').loc['Adelie']
```

• Option 1 only

• Option 2 only

• Both options

• Neither option

Answer: Option 1 only

We use df.get(column_name).iloc[positional_index] to access a value in a column by its integer position, and df.get(column_name).loc[index] to access a value in a column by its index label. Remember that grouped is the result of groupby('species'), so the species names 'Adelie' and 'Chinstrap' form the index of grouped.

Option 2 is incorrect since it performs the subtraction in the reverse order, which produces a different value than line (c); its output will be -1 \cdot stat. Recall, in grouped = with_shuffled.groupby('species').mean(), since 'species' is a column with string values, the resulting index is sorted in alphabetical order. So, .iloc[0] corresponds to 'Adelie' and .iloc[1] corresponds to 'Chinstrap', and Option 1 computes exactly the same statistic as line (c).
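The label-versus-position distinction can be seen on a hypothetical two-row grouped result (pandas used in place of babypandas; the mass values are made up):

```python
import pandas as pd

# Hypothetical grouped means, indexed by species in alphabetical order.
grouped = pd.DataFrame({'mass': [3700.0, 3733.0]},
                       index=pd.Index(['Adelie', 'Chinstrap'], name='species'))

# By position: row 0 is 'Adelie', row 1 is 'Chinstrap' (alphabetical index).
by_position = grouped.get('mass').iloc[0] - grouped.get('mass').iloc[1]
# By label: same rows, accessed by index label instead.
by_label = grouped.get('mass').loc['Adelie'] - grouped.get('mass').loc['Chinstrap']

print(by_position == by_label)  # True: Option 1 matches line (c)
```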

##### Difficulty: ⭐️⭐️

The average score on this problem was 81%.

### Problem 1.3

Is it possible to re-write line (c) in a way that uses .iloc twice, without any other uses of .loc or .iloc?

• Yes, it’s possible

• No, it’s not possible

Answer: Yes, it's possible

There are multiple ways to achieve this. For instance:

```python
stat = grouped.get('mass').iloc[0] - grouped.sort_index(ascending=False).get('mass').iloc[0]
```

Sorting the index in descending order puts 'Chinstrap' first, so the second .iloc[0] accesses the mean Chinstrap mass.

##### Difficulty: ⭐️⭐️⭐️

The average score on this problem was 64%.

### Problem 1.4

For your convenience, we copy the code for the hypothesis test below.

```python
stats = np.array([])
num_reps = 500
for i in np.arange(num_reps):
    # --- line (a) starts ---
    # --- line (a) ends ---

    # --- line (b) starts ---
    # --- line (b) ends ---

    grouped = with_shuffled.groupby('species').mean()

    # --- line (c) starts ---
    stat = grouped.get('mass').iloc[0] - grouped.get('mass').iloc[1]
    # --- line (c) ends ---

    stats = np.append(stats, stat)
```

What would happen if we removed line (a), and replaced line (b) with

```python
with_shuffled = adelie_chinstrap.sample(adelie_chinstrap.shape[0], replace=False)
```

• This would still run a valid hypothesis test

• This would not run a valid hypothesis test, as all values in the stats array would be exactly the same

• This would not run a valid hypothesis test, even though there would be several different values in the stats array

• This would not run a valid hypothesis test, as it would incorporate information about Gentoo penguins

Answer: This would not run a valid hypothesis test, as all values in the stats array would be exactly the same (Option 2)

Recall, DataFrame.sample(n, replace=False) (or DataFrame.sample(n), since replace=False is the default) returns a DataFrame of n rows randomly sampled from the DataFrame, without replacement. Since our n is adelie_chinstrap.shape[0] and we are sampling without replacement, we will get back exactly the same DataFrame every time. (The order of the rows may differ, but the group means do not depend on row order, so all values in the stats array would be exactly the same.)

##### Difficulty: ⭐️⭐️

The average score on this problem was 87%.

### Problem 1.5

For your convenience, we copy the code for the hypothesis test below.

```python
stats = np.array([])
num_reps = 500
for i in np.arange(num_reps):
    # --- line (a) starts ---
    # --- line (a) ends ---

    # --- line (b) starts ---
    # --- line (b) ends ---

    grouped = with_shuffled.groupby('species').mean()

    # --- line (c) starts ---
    stat = grouped.get('mass').iloc[0] - grouped.get('mass').iloc[1]
    # --- line (c) ends ---

    stats = np.append(stats, stat)
```

What would happen if we removed line (a), and replaced line (b) with

```python
with_shuffled = adelie_chinstrap.sample(adelie_chinstrap.shape[0], replace=True)
```

• This would still run a valid hypothesis test

• This would not run a valid hypothesis test, as all values in the stats array would be exactly the same

• This would not run a valid hypothesis test, even though there would be several different values in the stats array

• This would not run a valid hypothesis test, as it would incorporate information about Gentoo penguins

Answer: This would not run a valid hypothesis test, even though there would be several different values in the stats array (Option 3)

Recall, DataFrame.sample(n, replace=True) returns a new DataFrame of n rows randomly sampled from the DataFrame, with replacement. Since we are sampling with replacement, different iterations will generally produce different resamples, so the stats array will contain several different values. However, the key idea behind a permutation test is to shuffle the group labels while keeping the group sizes fixed. The code above does not meet this requirement: we only want to shuffle the 'species' column without changing the number of penguins of each species, but sampling with replacement may change the size of the two groups.

##### Difficulty: ⭐️⭐️⭐️

The average score on this problem was 66%.

### Problem 1.6

For your convenience, we copy the code for the hypothesis test below.

```python
stats = np.array([])
num_reps = 500
for i in np.arange(num_reps):
    # --- line (a) starts ---
    # --- line (a) ends ---

    # --- line (b) starts ---
    # --- line (b) ends ---

    grouped = with_shuffled.groupby('species').mean()

    # --- line (c) starts ---
    stat = grouped.get('mass').iloc[0] - grouped.get('mass').iloc[1]
    # --- line (c) ends ---

    stats = np.append(stats, stat)
```

What would happen if we replaced line (a) with

```python
with_shuffled = adelie_chinstrap.assign(
)
```

and replaced line (b) with

```python
with_shuffled = with_shuffled.assign(
)
```

• This would still run a valid hypothesis test

• This would not run a valid hypothesis test, as all values in the stats array would be exactly the same

• This would not run a valid hypothesis test, even though there would be several different values in the stats array

• This would not run a valid hypothesis test, as it would incorporate information about Gentoo penguins

Answer: This would still run a valid hypothesis test (Option 1)

Our goal for the permutation test is to randomly assign masses to species labels, without changing the group sizes. The code above shuffles the 'species' and 'mass' columns and assigns them back to the DataFrame. This fulfills our goal.

##### Difficulty: ⭐️⭐️

The average score on this problem was 81%.

### Problem 1.7

Suppose we run the code for the hypothesis test and see the following empirical distribution for the test statistic. In red is the observed statistic. Suppose our alternative hypothesis is that Chinstrap penguins weigh more on average than Adelie penguins. Which of the following is closest to the p-value for our hypothesis test?

• 0

• \frac{1}{4}

• \frac{1}{3}

• \frac{2}{3}

• \frac{3}{4}

• 1

Answer: \frac{1}{3}

Recall, the p-value is the chance, under the null hypothesis, that the test statistic is equal to the value that was observed in the data or is even further in the direction of the alternative. Thus, we compute the proportion of simulated test statistics that are less than or equal to the observed statistic. (It is "less than" because smaller values correspond to the alternative hypothesis "Chinstrap penguins weigh more on average than Adelie penguins": when computing the statistic, we use Adelie's mean mass minus Chinstrap's mean mass, so if Chinstrap's mean mass is larger, the statistic is negative, i.e. further in the direction of "less than" the observed statistic.)

Thus, we look at the proportion of the area at or to the left of the red line (which represents the observed statistic); it is around \frac{1}{3}.
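In code, once the stats array and the observed statistic are in hand, this left-tailed p-value is the proportion of simulated statistics at or below the observed one. A sketch with made-up numbers:

```python
import numpy as np

# Hypothetical simulated statistics (Adelie mean minus Chinstrap mean) and
# a hypothetical observed statistic.
stats = np.array([-300, -150, -50, 20, 80, 200])
observed_stat = -120

# Proportion of simulated statistics <= observed (direction of the alternative).
p_value = np.mean(stats <= observed_stat)
print(p_value)  # 2 of the 6 statistics are <= -120
```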

##### Difficulty: ⭐️⭐️

The average score on this problem was 80%.

## Problem 2

Let’s suppose there are 4 different types of shots a basketball player can take – layups, midrange shots, threes, and free throws.

The DataFrame breakdown has 4 rows and 50 columns – one row for each of the 4 shot types mentioned above, and one column for each of 50 different players. Each column of breakdown describes the distribution of shot types for a single player.

The first few columns of breakdown are shown below. For instance, 30% of Kelsey Plum’s shots are layups, 30% of her shots are midrange shots, 20% of her shots are threes, and 20% of her shots are free throws.

### Problem 2.1

Below, we’ve drawn an overlaid bar chart showing the shot distributions of Kelsey Plum and Chiney Ogwumike, a player on the Los Angeles Sparks. What is the total variation distance (TVD) between Kelsey Plum’s shot distribution and Chiney Ogwumike’s shot distribution? Give your answer as a proportion between 0 and 1 (not a percentage) rounded to three decimal places.

Recall, the TVD is the sum of the absolute differences in proportions, divided by 2. The absolute differences in proportions for each category are as follows:

• Free Throws: |0.05 - 0.2| = 0.15
• Threes: |0.35 - 0.2| = 0.15
• Midrange: |0.35 - 0.3| = 0.05
• Layups: |0.25 - 0.3| = 0.05

Then, we have

\text{TVD} = \frac{1}{2} (0.15 + 0.15 + 0.05 + 0.05) = 0.2
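The same computation as a small function, with the proportions read off the bar chart as listed above (array order: free throws, threes, midrange, layups):

```python
import numpy as np

def tvd(dist1, dist2):
    # Total variation distance: half the sum of absolute differences.
    return np.abs(dist1 - dist2).sum() / 2

plum = np.array([0.2, 0.2, 0.3, 0.3])
ogwumike = np.array([0.05, 0.35, 0.35, 0.25])

print(tvd(plum, ogwumike))  # ≈ 0.2
```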

##### Difficulty: ⭐️⭐️

The average score on this problem was 84%.

### Problem 2.2

Recall, breakdown has information for 50 different players. We want to find the player whose shot distribution is the most similar to Kelsey Plum, i.e. has the lowest TVD with Kelsey Plum’s shot distribution.

Fill in the blanks below so that most_sim_player evaluates to the name of the player with the most similar shot distribution to Kelsey Plum. Assume that the column named 'Kelsey Plum' is the first column in breakdown (and again that breakdown has 50 columns total).

```python
most_sim_player = ''
lowest_tvd_so_far = __(a)__
other_players = np.array(breakdown.columns).take(__(b)__)
for player in other_players:
    player_tvd = tvd(breakdown.get('Kelsey Plum'),
                     breakdown.get(player))
    if player_tvd < lowest_tvd_so_far:
        lowest_tvd_so_far = player_tvd
        __(c)__
```

1. What goes in blank (a)?

• -1

• -0.5

• 0

• 0.5

• 1

• np.array([])

• ''

2. What goes in blank (b)?

3. What goes in blank (c)?

Answers: 1, np.arange(1, 50), most_sim_player = player

Let’s try and understand the code provided to us. It appears that we’re looping over the names of all other players, each time computing the TVD between Kelsey Plum’s shot distribution and that player’s shot distribution. If the TVD calculated in an iteration of the for-loop (player_tvd) is less than the previous lowest TVD (lowest_tvd_so_far), the current player (player) is now the most “similar” to Kelsey Plum, and so we store their TVD and name (in most_sim_player).

Before the for-loop, we haven’t looked at any other players, so we don’t have values to store in most_sim_player and lowest_tvd_so_far. On the first iteration of the for-loop, both of these values need to be updated to reflect Kelsey Plum’s similarity with the first player in other_players. This is because, if we’ve only looked at one player, that player is the most similar to Kelsey Plum. most_sim_player is already initialized as an empty string, and we will specify how to “update” most_sim_player in blank (c). For blank (a), we need to pick a value of lowest_tvd_so_far that we can guarantee will be updated on the first iteration of the for-loop. Recall, TVDs range from 0 to 1, with 0 meaning “most similar” and 1 meaning “most different”. This means that no matter what, the TVD between Kelsey Plum’s distribution and the first player’s distribution will be less than 1*, and so if we initialize lowest_tvd_so_far to 1 before the for-loop, we know it will be updated on the first iteration.

• It’s possible that the TVD between Kelsey Plum’s shot distribution and the first other player’s shot distribution is equal to 1, rather than being less than 1. If that were to happen, our code would still generate the correct answer, but lowest_tvd_so_far and most_sim_player wouldn’t be updated on the first iteration. Rather, they’d be updated on the first iteration where player_tvd is strictly less than 1. (We’d expect that the TVDs between all pairs of players are neither exactly 0 nor exactly 1, so this is not a practical issue.) To avoid this issue entirely, we could change if player_tvd < lowest_tvd_so_far to if player_tvd <= lowest_tvd_so_far, which would make sure that even if the first TVD is 1, both lowest_tvd_so_far and most_sim_player are updated on the first iteration.
• Note that we could have initialized lowest_tvd_so_far to a value larger than 1 as well. Suppose we initialized it to 55 (an arbitrary positive integer). On the first iteration of the for-loop, player_tvd will be less than 55, and so lowest_tvd_so_far will be updated.

Then, we need other_players to be an array containing the names of all players other than Kelsey Plum, whose name is stored at position 0 in breakdown.columns. We are told that there are 50 players total, i.e. that there are 50 columns in breakdown. We want to take the elements in breakdown.columns at positions 1, 2, 3, …, 49 (the last element), and the call to np.arange that generates this sequence of positions is np.arange(1, 50). (Remember, np.arange(a, b) does not include the second integer!)

In blank (c), as mentioned in the explanation for blank (a), we need to update the value of most_sim_player. (Note that we only arrive at this line if player_tvd is the lowest pairwise TVD we’ve seen so far.) All this requires is most_sim_player = player, since player contains the name of the player who we are looking at in the current iteration of the for-loop.
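Putting the three blanks together, the completed loop might look as follows. Here a tiny three-column DataFrame and the player names are made up for illustration (with the real 50-column breakdown, blank (b) would be np.arange(1, 50)); tvd is defined as in the earlier subpart.

```python
import numpy as np
import pandas as pd

def tvd(dist1, dist2):
    # Total variation distance between two categorical distributions.
    return np.abs(dist1 - dist2).sum() / 2

# Hypothetical miniature breakdown: 4 shot types, 3 players.
breakdown = pd.DataFrame({
    'Kelsey Plum': [0.3, 0.3, 0.2, 0.2],
    'Player A':    [0.25, 0.35, 0.35, 0.05],
    'Player B':    [0.3, 0.3, 0.25, 0.15],
})

most_sim_player = ''
lowest_tvd_so_far = 1                                # blank (a)
n = breakdown.shape[1]
other_players = np.array(breakdown.columns).take(np.arange(1, n))  # blank (b)
for player in other_players:
    player_tvd = tvd(breakdown.get('Kelsey Plum'),
                     breakdown.get(player))
    if player_tvd < lowest_tvd_so_far:
        lowest_tvd_so_far = player_tvd
        most_sim_player = player                     # blank (c)

print(most_sim_player)  # Player B has the lower TVD with Kelsey Plum here
```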

##### Difficulty: ⭐️⭐️⭐️

The average score on this problem was 70%.

### Problem 2.3

Let’s again consider the shot distributions of Kelsey Plum and Chiney Ogwumike. We define the maximum squared distance (MSD) between two categorical distributions as the largest squared difference between the proportions of any category.

What is the MSD between Kelsey Plum’s shot distribution and Chiney Ogwumike’s shot distribution? Give your answer as a proportion between 0 and 1 (not a percentage) rounded to three decimal places.

Recall, in the solution to the first subpart of this problem, we calculated the absolute differences between the proportions of each category.

• Free Throws: |0.05 - 0.2| = 0.15
• Threes: |0.35 - 0.2| = 0.15
• Midrange: |0.35 - 0.3| = 0.05
• Layups: |0.25 - 0.3| = 0.05

The squared differences between the proportions of each category are computed by squaring the results in the list above (e.g. for Free Throws we’d have (0.05 - 0.2)^2 = 0.15^2). To find the maximum squared difference, then, all we need to do is find the largest of 0.15^2, 0.15^2, 0.05^2, and 0.05^2. Since 0.15 > 0.05, we have that the maximum squared distance is 0.15^2 = 0.0225, which rounds to 0.023.
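In code, using the same proportions as above (array order: free throws, threes, midrange, layups):

```python
import numpy as np

plum = np.array([0.2, 0.2, 0.3, 0.3])
ogwumike = np.array([0.05, 0.35, 0.35, 0.25])

# Maximum squared distance: the largest squared difference across categories.
msd = ((plum - ogwumike) ** 2).max()
print(round(msd, 3))  # 0.023
```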

##### Difficulty: ⭐️⭐️

The average score on this problem was 85%.

### Problem 2.4

For your convenience, we show the first few columns of breakdown again below. In this problem:

• layups are worth 2 points,
• midrange shots are worth 2 points,
• threes are worth 3 points, and
• free throws are worth 1 point.

Suppose that Kelsey Plum is guaranteed to shoot exactly 10 shots a game. The type of each shot is drawn from the 'Kelsey Plum' column of breakdown (meaning that, for example, there is a 30% chance each shot is a layup).

Fill in the blanks below to complete the definition of the function simulate_points, which simulates the number of points Kelsey Plum scores in a single game. (simulate_points should return a single number.)

```python
def simulate_points():
    shots = np.random.multinomial(__(a)__, breakdown.get('Kelsey Plum'))
    possible_points = np.array([2, 2, 3, 1])
    return __(b)__
```

1. What goes in blank (a)?
2. What goes in blank (b)?

Answers: 10, (shots * possible_points).sum()

To simulate the number of points Kelsey Plum scores in a single game, we need to:

1. Simulate the number of shots she takes of each type (layups, midranges, threes, free throws).
2. Using the simulated distribution in step 1, find the total number of points she scores – specifically, add 2 for every layup, 2 for every midrange, 3 for every three, and 1 for every free throw.

To simulate the number of shots she takes of each type, we use np.random.multinomial. This is because each shot, independently of all other shots, has a 30% chance of being a layup, a 30% chance of being a midrange, and so on. What goes in blank (a) is the number of shots she is taking in total; here, that is 10. shots will be an array of length 4 containing the number of shots of each type - for instance, shots may be np.array([3, 4, 2, 1]), which would mean she took 3 layups, 4 midranges, 2 threes, and 1 free throw.

Now that we have shots, we need to factor in how many points each type of shot is worth. This can be accomplished by multiplying shots with possible_points, which was already defined for us. Using the example where shots is np.array([3, 4, 2, 1]), shots * possible_points evaluates to np.array([6, 8, 6, 1]), which would mean she scored 6 points from layups, 8 points from midranges, and so on. Then, to find the total number of points she scored, we need to compute the sum of this array, either using the np.sum function or .sum() method. As such, the two correct answers for blank (b) are (shots * possible_points).sum() and np.sum(shots * possible_points).
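Filled in and made self-contained, the function might look like this; a hypothetical probability array stands in for breakdown.get('Kelsey Plum'):

```python
import numpy as np

# Hypothetical shot distribution: layups, midrange, threes, free throws.
plum_distribution = np.array([0.3, 0.3, 0.2, 0.2])

def simulate_points():
    # blank (a): simulate the types of her 10 shots.
    shots = np.random.multinomial(10, plum_distribution)
    possible_points = np.array([2, 2, 3, 1])
    # blank (b): total points, weighting each shot type by its point value.
    return (shots * possible_points).sum()

np.random.seed(0)  # for reproducibility of this sketch only
points = simulate_points()
print(points)
```

Since all 10 shots are worth between 1 and 3 points, the result always lands between 10 and 30.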

##### Difficulty: ⭐️⭐️

The average score on this problem was 84%.

## Problem 3

For this question we will use data from the 2021 Women’s National Basketball Association (WNBA) season for the next several problems. In basketball, players score points by shooting the ball into a hoop. The team that scores the most points wins the game.

We have access to the season DataFrame, which contains statistics on all players in the WNBA in the 2021 season. The first few rows of season are shown below. Each row in season corresponds to a single player. For each player, we have:

• 'Player' (str), their name
• 'Team' (str), the three-letter code of the team they play on
• 'G' (int), the number of games they played in the 2021 season
• 'PPG' (float), the number of points they scored per game played
• 'APG' (float), the number of assists (passes) they made per game played
• 'TPG' (float), the number of turnovers they made per game played

Note that all of the numerical columns in season must contain values that are greater than or equal to 0.

Suppose we only have access to the DataFrame small_season, which is a random sample of size 36 from season. We’re interested in learning about the true mean points per game of all players in season given just the information in small_season.

To start, we want to bootstrap small_season 10,000 times and compute the mean of the resample each time. We want to store these 10,000 bootstrapped means in the array boot_means.

Here is a broken implementation of this procedure.

```python
boot_means = np.array([])
for i in np.arange(10000):
    resample = small_season.sample(season.shape[0], replace=False)  # Line 1
    resample_mean = small_season.get('PPG').mean()                  # Line 2
    np.append(boot_means, new_mean)                                 # Line 3
```

For each of the 3 lines of code above (marked by comments), specify what is incorrect about the line by selecting one or more of the corresponding options below. Or, select “Line _ is correct as-is” if you believe there’s nothing that needs to be changed about the line in order for the above code to run properly.

### Problem 3.1

What is incorrect about Line 1? Select all that apply.

• Currently the procedure samples from small_season, when it should be sampling from season

• The sample size is season.shape[0], when it should be small_season.shape[0]

• Sampling is currently being done without replacement, when it should be done with replacement

• Line 1 is correct as-is

Answers:

• The sample size is season.shape[0], when it should be small_season.shape[0]
• Sampling is currently being done without replacement, when it should be done with replacement

Here, our goal is to bootstrap from small_season. When bootstrapping, we sample with replacement from our original sample, with a sample size that’s equal to the original sample’s size. Here, our original sample is small_season, so we should be taking samples of size small_season.shape[0] from it.

Option 1 is incorrect; season has nothing to do with this problem, as we are bootstrapping from small_season.

##### Difficulty: ⭐️

The average score on this problem was 95%.

### Problem 3.2

What is incorrect about Line 2? Select all that apply.

• Currently it is taking the mean of the 'PPG' column in small_season, when it should be taking the mean of the 'PPG' column in season

• Currently it is taking the mean of the 'PPG' column in small_season, when it should be taking the mean of the 'PPG' column in resample

• .mean() is not a valid Series method, and should be replaced with a call to the function np.mean

• Line 2 is correct as-is

Answer: Currently it is taking the mean of the 'PPG' column in small_season, when it should be taking the mean of the 'PPG' column in resample

The current implementation of Line 2 doesn’t use the resample at all, when it should. If we were to leave Line 2 as it is, all of the values in boot_means would be identical (and equal to the mean of the 'PPG' column in small_season).

Option 1 is incorrect since our bootstrapping procedure is independent of season. Option 3 is incorrect because .mean() is a valid Series method.

##### Difficulty: ⭐️

The average score on this problem was 98%.

### Problem 3.3

What is incorrect about Line 3? Select all that apply.

• The result of calling np.append is not being reassigned to boot_means, so boot_means will be an empty array after running this procedure

• The indentation level of the line is incorrect – np.append should be outside of the for-loop (and aligned with for i)

• new_mean is not a defined variable name, and should be replaced with resample_mean

• Line 3 is correct as-is

Answers:

• The result of calling np.append is not being reassigned to boot_means, so boot_means will be an empty array after running this procedure
• new_mean is not a defined variable name, and should be replaced with resample_mean

np.append returns a new array and does not modify the array it is called on (boot_means, in this case), so Option 1 is a necessary fix. Furthermore, Option 3 is a necessary fix since new_mean wasn’t defined anywhere.

Option 2 is incorrect; if np.append were outside of the for-loop, none of the 10,000 resampled means would be saved in boot_means.
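With all three lines repaired, a working version of the procedure might look like this. A small fake small_season stands in for the real 36-row sample, and 1,000 repetitions keep the sketch fast (pandas mimics babypandas here):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for small_season.
small_season = pd.DataFrame({'PPG': [18.2, 7.5, 12.1, 9.9, 15.3, 4.8]})

boot_means = np.array([])
np.random.seed(42)  # for reproducibility of this sketch only
for i in np.arange(1000):
    # Line 1 fixed: resample small_season, same size, WITH replacement.
    resample = small_season.sample(small_season.shape[0], replace=True)
    # Line 2 fixed: take the mean of the resample, not the original sample.
    resample_mean = resample.get('PPG').mean()
    # Line 3 fixed: reassign the result of np.append, using the right name.
    boot_means = np.append(boot_means, resample_mean)

print(len(boot_means))  # 1000
```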

##### Difficulty: ⭐️

The average score on this problem was 94%.

## Problem 4

IKEA is a Swedish furniture company that designs and sells ready-to-assemble furniture and other home furnishings. An IKEA fan created an app where people can log the amount of time it took them to assemble their IKEA furniture. The DataFrame app_data has a row for each product build that was logged on the app. The columns are:

• 'product' (str): the name of the product, which includes the product line as the first word, followed by a description of the product
• 'category' (str): a categorical description of the type of product
• 'assembly_time' (str): the amount of time to assemble the product, formatted as 'x hr, y min' where x and y represent integers, possibly zero

The first few rows of app_data are shown below, though app_data has many more rows than pictured (5000 rows total). Assume that we have already run import babypandas as bpd and import numpy as np.

We want to use app_data to estimate the average amount of time it takes to build an IKEA bed (any product in the ‘bed’ category). Which of the following strategies would be an appropriate way to estimate this quantity? Select all that apply.

• Query to keep only the beds. Then resample with replacement many times. For each resample, take the mean of the 'minutes' column. Compute a 95% confidence interval based on those means.

• Query to keep only the beds. Group by 'product' using the mean aggregation function. Then resample with replacement many times. For each resample, take the mean of the 'minutes' column. Compute a 95% confidence interval based on those means.

• Resample with replacement many times. For each resample, first query to keep only the beds and then take the mean of the 'minutes' column. Compute a 95% confidence interval based on those means.

• Resample with replacement many times. For each resample, first query to keep only the beds. Then group by 'product' using the mean aggregation function, and finally take the mean of the 'minutes' column. Compute a 95% confidence interval based on those means.

Only the first answer is correct. This is a question of parameter estimation, so our approach is to use bootstrapping to create many resamples of our original sample, computing the average of each resample. Each resample should always be the same size as the original sample. The first answer choice accomplishes this by querying first to keep only the beds, then resampling from the DataFrame of beds only. This means resamples will have the same size as the original sample. Each resample’s mean will be computed, so we will have many resample means from which to construct our 95% confidence interval.

In the second answer choice, we are actually taking the mean twice. We first average the build times for all builds of the same product when grouping by product. This produces a DataFrame of different products with the average build time for each. We then resample from this DataFrame, computing the average of each resample. But this is a resample of products, not of product builds. The size of the resample is the number of unique products in app_data, not the number of reported product builds in app_data. Further, we get incorrect results by averaging numbers that are already averages. For example, if 5 people build bed A and it takes them each 1 hour, and 3 people build bed B and it takes them each 5 hours, the average amount of time to build a bed is \frac{5 \cdot 1 + 3 \cdot 5}{8} = 2.5. But if we average the times for bed A (1 hour) and average the times for bed B (5 hours), then average those, we get \frac{1+5}{2} = 3, which is not the same. More generally, grouping is not a part of the bootstrapping process because we want each data value to be weighted equally.
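The average-of-averages pitfall can be checked numerically. Here we use hypothetical build times, five 1-hour builds of one bed and three 5-hour builds of another, which produce a 2.5-versus-3 discrepancy:

```python
import numpy as np

# Hypothetical build times for two bed products.
bed_a_times = [1, 1, 1, 1, 1]   # five 1-hour builds
bed_b_times = [5, 5, 5]         # three 5-hour builds

# Mean over all builds, each build weighted equally.
overall_mean = np.mean(bed_a_times + bed_b_times)
# Mean of the per-product means, each PRODUCT weighted equally.
mean_of_means = np.mean([np.mean(bed_a_times), np.mean(bed_b_times)])

print(overall_mean, mean_of_means)  # 2.5 3.0
```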

The last two answer choices are incorrect because they involve resampling from the full app_data DataFrame before querying to keep only the beds. This is incorrect because it does not preserve the sample size. For example, if app_data contains 1000 reported bed builds and 4000 other product builds, then the only relevant data is the 1000 bed build times, so when we resample, we want to consider another set of 1000 beds. If we resample from the full app_data DataFrame, our resample will contain 5000 rows, but the number of beds will be random, not necessarily 1000. If we query first to keep only the beds, then resample, our resample will contain exactly 1000 beds every time. As an added bonus, since we only care about beds, it’s much faster to resample from a smaller DataFrame of beds only than it is to resample from all app_data with plenty of rows we don’t care about.

##### Difficulty: ⭐️⭐️⭐️

The average score on this problem was 71%.

## Problem 5

True or False: Suppose that from a sample, you compute a 95% bootstrapped confidence interval for a population parameter to be the interval [L, R]. Then the average of L and R is the mean of the original sample.

False. A 95% confidence interval indicates that we are 95% confident that the true population parameter falls within the interval [L, R]. Note that the problem specifies that the confidence interval is bootstrapped. As noted on the reference sheet, a bootstrapped confidence interval is created by re-sampling the data with replacement over and over again and taking percentiles of the resulting distribution of statistics. The mean of the original sample is not used in calculating the interval’s endpoints, and the distribution of bootstrapped means need not be symmetric, so the average of L and R will not, in general, equal the mean of the original sample.
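A quick sketch of why the midpoint need not equal the sample mean, using a skewed hypothetical sample (numpy used in place of babypandas):

```python
import numpy as np

np.random.seed(1)  # for reproducibility of this sketch only
sample = np.array([1, 1, 2, 2, 3, 3, 4, 50])  # skewed hypothetical sample

# Bootstrap the sample mean 5,000 times.
boot_means = np.array([])
for i in np.arange(5000):
    resample = np.random.choice(sample, size=len(sample), replace=True)
    boot_means = np.append(boot_means, resample.mean())

# 95% confidence interval endpoints: the 2.5th and 97.5th percentiles.
L = np.percentile(boot_means, 2.5)
R = np.percentile(boot_means, 97.5)
print((L + R) / 2, sample.mean())  # midpoint vs. sample mean: not equal
```

The interval typically contains the sample mean, but its midpoint is determined by percentiles of a (here, skewed) bootstrap distribution, not by the sample mean itself.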

##### Difficulty: ⭐️⭐️

The average score on this problem was 87%.

## Problem 6

Suppose Tiffany has a random sample of dogs. Select the most appropriate technique to answer each of the following questions using Tiffany’s dog sample.

### Problem 6.1

Do small dogs typically live longer than medium and large dogs?

• Standard hypothesis test

• Permutation test

• Bootstrapping

Answer: Permutation test (Option 2)

We are comparing two variables: dog size and life expectancy. If there were no statistically significant difference between the life expectancies of the different dog sizes, then randomly assigning the sampled life expectancies to the dogs should produce statistics similar to the observed statistic. Thus, using a permutation test to compare the groups makes the most sense. We’re not trying to estimate a specific value, so bootstrapping isn’t a good idea here. Also, there’s no good way to randomly generate life expectancies, so a standard hypothesis test is not a good idea here either.

##### Difficulty: ⭐️⭐️

The average score on this problem was 77%.

### Problem 6.2

Does Tiffany’s sample have an even distribution of dog kinds?

• Standard hypothesis test

• Permutation test

• Bootstrapping

Answer: Option 1: Standard hypothesis test.

We’re not comparing a variable between two groups, but rather looking at the overall distribution of a single categorical variable, so a permutation test wouldn’t work well here. Again, we’re not trying to estimate anything, so bootstrapping isn’t a good idea either. This leaves us with the standard hypothesis test, which makes sense if we use the total variation distance as our test statistic.

##### Difficulty: ⭐️⭐️⭐️

The average score on this problem was 51%.

### Problem 6.3

What’s the median weight for herding dogs?

• Standard hypothesis test

• Permutation test

• Bootstrapping

Answer: Bootstrapping (Option 3)

Here we’re trying to estimate a specific value, which immediately leads us to bootstrapping. The other two techniques wouldn’t make sense in this context.

##### Difficulty: ⭐️⭐️

The average score on this problem was 83%.

### Problem 6.4

Do dogs live longer than 12 years on average?

• Standard hypothesis test

• Permutation test

• Bootstrapping