
These problems are taken from past quizzes and exams. Work on them
**on paper**, since the quizzes and exams you take in this
course will also be on paper.

We encourage you to complete these
problems during discussion section. Solutions will be made available
after all discussion sections have concluded. You don’t need to submit
your answers anywhere. **Note: We do not plan to cover all of
these problems during the discussion section**; the problems we don’t
cover can be used for extra practice.

Arya was curious how many UCSD students used Hulu over Thanksgiving break. He surveyed 250 students and found that 130 of them used Hulu over break and 120 did not.

Using this data, Arya decides to test the following hypotheses:

**Null Hypothesis**: Over Thanksgiving break, an equal number of UCSD students did use Hulu and did not use Hulu.

**Alternative Hypothesis**: Over Thanksgiving break, **more** UCSD students did use Hulu than did not use Hulu.

Which of the following could be used as a test statistic for the hypothesis test?

The proportion of students who did use Hulu minus the proportion of students who did not use Hulu.

The absolute value of the proportion of students who did use Hulu minus the proportion of students who did not use Hulu.

The proportion of students who did use Hulu plus the proportion of students who did not use Hulu.

The absolute value of the proportion of students who did use Hulu plus the proportion of students who did not use Hulu.

For the test statistic that you chose in part (a), what is the observed value of the statistic? Give your answer either as an exact decimal or a simplified fraction.

If the p-value of the hypothesis test is 0.053, what can we conclude, at the standard 0.05 significance level?

We reject the null hypothesis.

We fail to reject the null hypothesis.

We accept the null hypothesis.

At the San Diego Model Railroad Museum, there are different admission prices for children, adults, and seniors. Over a period of time, as tickets are sold, employees keep track of how many of each type of ticket are sold. These ticket counts (in the order child, adult, senior) are stored as follows.

`admissions_data = np.array([550, 1550, 400])`

Complete the code below so that it creates an array
`admissions_proportions` with the proportions of tickets sold
to each group (in the order child, adult, senior).

```
def as_proportion(data):
    return __(a)__

admissions_proportions = as_proportion(admissions_data)
```

What goes in blank (a)?

The museum employees have a model in mind for the proportions in which they sell tickets to children, adults, and seniors. This model is stored as follows.

`model = np.array([0.25, 0.6, 0.15])`

We want to conduct a hypothesis test to determine whether the admissions data we have is consistent with this model. Which of the following is the null hypothesis for this test?

Child, adult, and senior tickets might plausibly be purchased in proportions 0.25, 0.6, and 0.15.

Child, adult, and senior tickets are purchased in proportions 0.25, 0.6, and 0.15.

Child, adult, and senior tickets might plausibly be purchased in proportions other than 0.25, 0.6, and 0.15.

Child, adult, and senior tickets are purchased in proportions other than 0.25, 0.6, and 0.15.

Which of the following test statistics could we use to test our hypotheses? Select all that could work.

sum of differences in proportions

sum of squared differences in proportions

mean of differences in proportions

mean of squared differences in proportions

none of the above

Below, we’ll perform the hypothesis test with a different test statistic, the mean of the absolute differences in proportions.

Recall that the ticket counts we observed for children, adults, and
seniors are stored in the array
`admissions_data = np.array([550, 1550, 400])`, and that our
model is `model = np.array([0.25, 0.6, 0.15])`.

For our hypothesis test to determine whether the admissions data is
consistent with our model, what is the observed value of the test
statistic? Input your answer as a decimal between 0 and 1. Round to
three decimal places. (Suppose that the value you calculated is assigned
to the variable `observed_stat`, which you will use in later
questions.)

Now, we want to simulate the test statistic 10,000 times under the
assumptions of the null hypothesis. Fill in the blanks below to complete
this simulation and calculate the p-value for our hypothesis test.
Assume that the variables `admissions_data`,
`admissions_proportions`, `model`, and
`observed_stat` are already defined as specified earlier in
the question.

```
simulated_stats = np.array([])
for i in np.arange(10000):
    simulated_proportions = as_proportion(np.random.multinomial(__(a)__, __(b)__))
    simulated_stat = __(c)__
    simulated_stats = np.append(simulated_stats, simulated_stat)

p_value = __(d)__
```

What goes in blank (a)? What goes in blank (b)? What goes in blank (c)? What goes in blank (d)?

True or False: the p-value represents the probability that the null hypothesis is true.

True

False

The new statistic that we used for this hypothesis test, the mean of
the absolute differences in proportions, is in fact closely related to
the total variation distance. Given two arrays of length three,
`array_1` and `array_2`, suppose we compute the
mean of the absolute differences in proportions between
`array_1` and `array_2` and store the result as
`madp`. What value would we have to multiply
`madp` by to obtain the total variation distance between
`array_1` and `array_2`? Input your answer below,
rounding to three decimal places.

For this question, let’s think of the data in `app_data`
as a random sample of all IKEA purchases and use it to test the
following hypotheses.

**Null Hypothesis**: IKEA sells an equal amount of beds
(category `'bed'`) and outdoor furniture (category
`'outdoor'`).

**Alternative Hypothesis**: IKEA sells more beds than
outdoor furniture.

The DataFrame `app_data` contains 5000 rows, which form
our sample. Of these 5000 products,

- 1000 are beds,
- 1500 are outdoor furniture, and
- 2500 are in another category.

Which of the following **could** be used as the test
statistic for this hypothesis test? Select all that apply.

Among 2500 beds and outdoor furniture items, the absolute difference between the proportion of beds and the proportion of outdoor furniture.

Among 2500 beds and outdoor furniture items, the proportion of beds.

Among 2500 beds and outdoor furniture items, the number of beds.

Among 2500 beds and outdoor furniture items, the number of beds plus the number of outdoor furniture items.

Let’s do a hypothesis test with the following test statistic: among 2500 beds and outdoor furniture items, the proportion of outdoor furniture minus the proportion of beds.

Complete the code below to calculate the observed value of the test
statistic and save the result as `obs_diff`.

```
outdoor = (app_data.get('category')=='outdoor')
bed = (app_data.get('category')=='bed')
obs_diff = ( ___(a)___ - ___(b)___ ) / ___(c)___
```

The table below contains several Python expressions. Choose the correct expression to fill in each of the three blanks. Three expressions will be used, and two will be unused.

Which of the following is a valid way to generate one value of the test statistic according to the null model? Select all that apply.

Way 1:

```
multi = np.random.multinomial(2500, [0.5, 0.5])
(multi[0] - multi[1]) / 2500
```

Way 2:

```
outdoor = np.random.multinomial(2500, [0.5, 0.5])[0] / 2500
bed = np.random.multinomial(2500, [0.5, 0.5])[1] / 2500
outdoor - bed
```

Way 3:

```
choice = np.random.choice([0, 1], 2500, replace=True)
choice_sum = choice.sum()
(choice_sum - (2500 - choice_sum)) / 2500
```

Way 4:

```
choice = np.random.choice(['bed', 'outdoor'], 2500, replace=True)
bed = np.count_nonzero(choice=='bed')
outdoor = np.count_nonzero(choice=='outdoor')
outdoor/2500 - bed/2500
```

Way 5:

```
outdoor = (app_data.get('category')=='outdoor')
bed = (app_data.get('category')=='bed')
samp = app_data[outdoor|bed].sample(2500, replace=True)
samp[samp.get('category')=='outdoor'].shape[0]/2500 - samp[samp.get('category')=='bed'].shape[0]/2500
```

Way 6:

```
outdoor = (app_data.get('category')=='outdoor')
bed = (app_data.get('category')=='bed')
samp = (app_data[outdoor|bed].groupby('category').count().reset_index().sample(2500, replace=True))
samp[samp.get('category')=='outdoor'].shape[0]/2500 - samp[samp.get('category')=='bed'].shape[0]/2500
```

Way 1

Way 2

Way 3

Way 4

Way 5

Way 6

Suppose we generate 10,000 simulated values of the test statistic
according to the null model and store them in an array called
`simulated_diffs`. Complete the code below to calculate the
p-value for the hypothesis test.

`np.count_nonzero(simulated_diffs _________ obs_diff) / 10000`

What goes in the blank?

`<`

`<=`

`>`

`>=`

Let’s suppose there are 4 different types of shots a basketball player can take – layups, midrange shots, threes, and free throws.

The DataFrame `breakdown` has 4 rows and 50 columns – one
row for each of the 4 shot types mentioned above, and one column for
each of 50 different players. Each column of `breakdown`
describes the distribution of shot types for a single player.

The first few columns of `breakdown` are shown below.

For instance, 30% of Kelsey Plum’s shots are layups, 30% of her shots are midrange shots, 20% of her shots are threes, and 20% of her shots are free throws.

Below, we’ve drawn an overlaid bar chart showing the shot distributions of Kelsey Plum and Chiney Ogwumike, a player on the Los Angeles Sparks.

What is the **total variation distance** (TVD) between
Kelsey Plum’s shot distribution and Chiney Ogwumike’s shot distribution?
Give your answer as a **proportion** between 0 and 1 (not a
percentage) rounded to three decimal places.

Recall, `breakdown` has information for 50 different
players. We want to find the player whose shot distribution is the
**most similar to Kelsey Plum**, i.e. has the lowest TVD
with Kelsey Plum’s shot distribution.

Fill in the blanks below so that `most_sim_player`
evaluates to the name of the player with the most similar shot
distribution to Kelsey Plum. Assume that the column named
`'Kelsey Plum'` is the first column in `breakdown`
(and again that `breakdown` has 50 columns total).

```
most_sim_player = ''
lowest_tvd_so_far = __(a)__
other_players = np.array(breakdown.columns).take(__(b)__)

for player in other_players:
    player_tvd = tvd(breakdown.get('Kelsey Plum'),
                     breakdown.get(player))
    if player_tvd < lowest_tvd_so_far:
        lowest_tvd_so_far = player_tvd
        __(c)__
```
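The loop above calls a helper function `tvd` that isn’t shown. A minimal sketch of such a helper, assuming the standard definition of total variation distance (half the sum of the absolute differences between two categorical distributions):

```python
import numpy as np

def tvd(dist1, dist2):
    # Total variation distance between two categorical distributions:
    # half of the sum of the absolute differences in proportions.
    return np.abs(np.array(dist1) - np.array(dist2)).sum() / 2

# Toy example: two made-up shot distributions over four shot types.
plum_dist = np.array([0.3, 0.3, 0.2, 0.2])
other_dist = np.array([0.4, 0.2, 0.2, 0.2])
distance = tvd(plum_dist, other_dist)
```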

What goes in blank (a)?

-1

-0.5

0

0.5

1

`np.array([])`

`''`

What goes in blank (b)?

What goes in blank (c)?

You survey 100 DSC majors and 140 CSE majors to ask them which video streaming service they use most. The resulting distributions are given in the table below. Note that each column sums to 1.

| Service | DSC Majors | CSE Majors |
|---|---|---|
| Netflix | 0.4 | 0.35 |
| Hulu | 0.25 | 0.2 |
| Disney+ | 0.1 | 0.1 |
| Amazon Prime Video | 0.15 | 0.3 |
| Other | 0.1 | 0.05 |

For example, 20% of CSE majors said that Hulu is their most used video streaming service. Note that if a student doesn’t use video streaming services, their response is counted as Other.

What is the total variation distance (TVD) between the distribution for DSC majors and the distribution for CSE majors? Give your answer as an exact decimal.

Suppose we only break down video streaming services into four categories: Netflix, Hulu, Disney+, and Other (which now includes Amazon Prime Video). Now we recalculate the TVD between the two distributions. How does the TVD now compare to your answer to part (a)?

less than (a)

equal to (a)

greater than (a)

In some cities, the number of sunshine hours per month is relatively consistent throughout the year. São Paulo, Brazil is one such city; in all months of the year, the number of sunshine hours per month is somewhere between 139 and 173. New York City’s monthly sunshine hours, on the other hand, range from 139 to 268.

Gina and Abel, both San Diego natives, are interested in assessing how “consistent” the number of sunshine hours per month in San Diego appears to be. Specifically, they’d like to test the following hypotheses:

**Null Hypothesis**: The number of sunshine hours per month in San Diego is drawn from the uniform distribution, \left[\frac{1}{12}, \frac{1}{12}, ..., \frac{1}{12}\right]. (In other words, the number of sunshine hours per month in San Diego is equal in all 12 months of the year.)

**Alternative Hypothesis**: The number of sunshine hours per month in San Diego is not drawn from the uniform distribution.

As their test statistic, Gina and Abel choose the total variation distance. To simulate samples under the null, they will sample from a categorical distribution with 12 categories — January, February, and so on, through December — each of which have an equal probability of being chosen.
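The categorical sampling described above is what `np.random.multinomial` does: it distributes a fixed number of trials among categories with the given probabilities. A minimal sketch, where the total of 1800 hours is a made-up number for illustration only:

```python
import numpy as np

# 12 equally likely categories, one per month.
uniform_dist = np.ones(12) / 12

# Distribute 1800 total sunshine hours across the 12 months at random.
counts = np.random.multinomial(1800, uniform_dist)

# counts has 12 entries that always sum to 1800; dividing by the
# total converts the simulated counts into proportions.
proportions = counts / 1800
```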

In order to run their hypothesis test, Gina and Abel need a way to calculate their test statistic. Below is an incomplete implementation of a function that computes the TVD between two arrays of length 12, each of which represent a categorical distribution.

```
def calculate_tvd(dist1, dist2):
    return np.mean(np.abs(dist1 - dist2)) * ____
```

Fill in the blank so that `calculate_tvd` works as intended.

`1 / 6`

`1 / 3`

`1 / 2`

`2`

`3`

`6`

**Moving forward, assume that `calculate_tvd` works
correctly.**

Now, complete the implementation of the function
`uniform_test`, which takes in an array
`observed_counts` of length 12 containing the number of
sunshine hours each month in a city and returns the p-value for the
hypothesis test stated at the start of the question.

```
def uniform_test(observed_counts):
    # The values in observed_counts are counts, not proportions!
    total_count = observed_counts.sum()
    uniform_dist = __(b)__
    tvds = np.array([])
    for i in np.arange(10000):
        simulated = __(c)__
        tvd = calculate_tvd(simulated, __(d)__)
        tvds = np.append(tvds, tvd)
    return np.mean(tvds __(e)__ calculate_tvd(uniform_dist, __(f)__))
```

What goes in blank (b)? *(Hint: The function
`np.ones(k)` returns an array of length `k` in
which all elements are 1.)*

What goes in blank (c)?

`np.random.multinomial(12, uniform_dist)`

`np.random.multinomial(12, uniform_dist) / 12`

`np.random.multinomial(12, uniform_dist) / total_count`

`np.random.multinomial(total_count, uniform_dist)`

`np.random.multinomial(total_count, uniform_dist) / 12`

`np.random.multinomial(total_count, uniform_dist) / total_count`

What goes in blank (d)?

What goes in blank (e)?

`>`

`>=`

`<`

`<=`

`==`

`!=`

What goes in blank (f)?

We will use data from the 2021 Women’s National Basketball Association (WNBA) season for the next several problems. In basketball, players score points by shooting the ball into a hoop. The team that scores the most points wins the game.

Kelsey Plum, a WNBA player, attended La Jolla Country Day School,
which is adjacent to UCSD’s campus. Her current team is the Las Vegas
Aces (three-letter code `'LVA'`). **In 2021, the Las
Vegas Aces played 31 games, and Kelsey Plum played in all
31.**

The DataFrame `plum` contains her stats for all games the
Las Vegas Aces played in 2021. The first few rows of `plum`
are shown below (though the full DataFrame has 31 rows, not 5):

Each row in `plum` corresponds to a single game. For each
game, we have:

- `'Date'` (`str`), the date on which the game was played
- `'Opp'` (`str`), the three-letter code of the opponent team
- `'Home'` (`bool`), `True` if the game was played in Las Vegas (“home”) and `False` if it was played at the opponent’s arena (“away”)
- `'Won'` (`bool`), `True` if the Las Vegas Aces won the game and `False` if they lost
- `'PTS'` (`int`), the number of points Kelsey Plum scored in the game
- `'AST'` (`int`), the number of assists (passes) Kelsey Plum made in the game
- `'TOV'` (`int`), the number of turnovers Kelsey Plum made in the game (a turnover is when you lose the ball – turnovers are bad!)

Consider the definition of the function
`diff_in_group_means`:

```
def diff_in_group_means(df, group_col, num_col):
    s = df.groupby(group_col).mean().get(num_col)
    return s.loc[False] - s.loc[True]
```
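To see what this function computes, here is a quick check on a made-up DataFrame. (Plain pandas is used here for illustration; `.groupby`, `.get`, and `.loc` behave the same way in this subset of the API.)

```python
import pandas as pd

def diff_in_group_means(df, group_col, num_col):
    s = df.groupby(group_col).mean().get(num_col)
    return s.loc[False] - s.loc[True]

# Made-up games: two wins with 5 and 7 assists, two losses with 2 and 4.
toy = pd.DataFrame({'Won': [True, True, False, False],
                    'AST': [5, 7, 2, 4]})

# Mean assists in losses (3.0) minus mean assists in wins (6.0).
diff_in_group_means(toy, 'Won', 'AST')  # -3.0
```

Note the order of subtraction: the mean for the `False` group minus the mean for the `True` group, which is why a player who does better in wins produces a negative result.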

It turns out that Kelsey Plum averages 0.61 more assists in games
that she wins (“winning games”) than in games that she loses (“losing
games”). Fill in the blanks below so that `observed_diff`
evaluates to -0.61.

`observed_diff = diff_in_group_means(plum, __(a)__, __(b)__)`

What goes in blank (a)?

What goes in blank (b)?

After observing that Kelsey Plum averages more assists in winning games than in losing games, we become interested in conducting a permutation test for the following hypotheses:

**Null Hypothesis:** The number of assists Kelsey Plum makes in winning games and in losing games come from the same distribution.

**Alternative Hypothesis:** The number of assists Kelsey Plum makes in winning games is higher on average than the number of assists that she makes in losing games.

To conduct our permutation test, we place the following code in a
`for`-loop.

```
won = plum.get('Won')
ast = plum.get('AST')
shuffled = plum.assign(Won_shuffled=np.random.permutation(won)) \
               .assign(AST_shuffled=np.random.permutation(ast))
```
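As a reminder of the key ingredient here, `np.random.permutation` returns a shuffled copy of its input without modifying the original, so a shuffled column contains exactly the same values but with any association to the other columns broken. A small sketch with made-up values:

```python
import numpy as np

ast = np.array([3, 6, 1, 8, 5])
shuffled_ast = np.random.permutation(ast)

# The shuffled copy has the same five values (possibly reordered),
# and the original array is left unchanged.
same_values = sorted(shuffled_ast) == sorted(ast)
```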

Which of the following options **does not** compute a
valid simulated test statistic for this permutation test?

`diff_in_group_means(shuffled, 'Won', 'AST')`

`diff_in_group_means(shuffled, 'Won', 'AST_shuffled')`

`diff_in_group_means(shuffled, 'Won_shuffled', 'AST')`

`diff_in_group_means(shuffled, 'Won_shuffled', 'AST_shuffled')`

More than one of these options do not compute a valid simulated test statistic for this permutation test

Suppose we generate 10,000 simulated test statistics, using one of
the valid options from part 1. The empirical distribution of test
statistics, with a red line at `observed_diff`, is shown
below.

Roughly one-quarter of the area of the histogram above is to the left of the red line. What is the correct interpretation of this result?

There is roughly a one quarter probability that Kelsey Plum’s number of assists in winning games and in losing games come from the same distribution.

The significance level of this hypothesis test is roughly a quarter.

Under the assumption that Kelsey Plum’s number of assists in winning games and in losing games come from the same distribution, and that she wins 22 of the 31 games she plays, the chance of her averaging **at least** 0.61 more assists in wins than losses is roughly a quarter.

Under the assumption that Kelsey Plum’s number of assists in winning games and in losing games come from the same distribution, and that she wins 22 of the 31 games she plays, the chance of her averaging 0.61 more assists in wins than losses is roughly a quarter.

An IKEA fan created an app where people can log the amount of time it
took them to assemble their IKEA furniture. The DataFrame
`app_data` has a row for each product build that was logged
on the app. The column `'product'` contains the name of the
product, and the column `'minutes'` contains integer values
representing the number of minutes it took to assemble each product.

You are browsing the IKEA showroom, deciding whether to purchase the
BILLY bookcase or the LOMMARP bookcase. You are concerned about the
amount of time it will take to assemble your new bookcase, so you look
up the assembly times reported in `app_data`. Thinking of the
data in `app_data` as a random sample of all IKEA purchases,
you want to perform a permutation test to test the following
hypotheses.

**Null Hypothesis**: The assembly time for the BILLY
bookcase and the assembly time for the LOMMARP bookcase come from the
same distribution.

**Alternative Hypothesis**: The assembly time for the
BILLY bookcase and the assembly time for the LOMMARP bookcase come from
different distributions.

Suppose we query `app_data` to keep only the BILLY
bookcases, then average the `'minutes'` column. In addition,
we separately query `app_data` to keep only the LOMMARP
bookcases, then average the `'minutes'` column. If the null
hypothesis is true, which of the following statements about these two
averages is correct?

These two averages are the same.

Any difference between these two averages is due to random chance.

Any difference between these two averages cannot be ascribed to random chance alone.

The difference between these averages is statistically significant.

For the permutation test, we’ll use as our test statistic the average assembly time for BILLY bookcases minus the average assembly time for LOMMARP bookcases, in minutes.

Complete the code below to generate one simulated value of the test
statistic in a new way, without using
`np.random.permutation`.

```
billy = (app_data.get('product') ==
         'BILLY Bookcase, white, 31 1/2x11x79 1/2')
lommarp = (app_data.get('product') ==
           'LOMMARP Bookcase, dark blue-green, 25 5/8x78 3/8')
billy_lommarp = app_data[billy|lommarp]
billy_mean = np.random.choice(billy_lommarp.get('minutes'), billy.sum(), replace=False).mean()
lommarp_mean = _________
billy_mean - lommarp_mean
```

What goes in the blank?

`billy_lommarp[lommarp].get('minutes').mean()`

`np.random.choice(billy_lommarp.get('minutes'), lommarp.sum(), replace=False).mean()`

`billy_lommarp.get('minutes').mean() - billy_mean`

`(billy_lommarp.get('minutes').sum() - billy_mean * billy.sum())/lommarp.sum()`

The DataFrame `apps` contains application data for a
random sample of 1,000 applicants for a particular credit card from the
1990s. The columns are:

- `"status"` (`str`): Whether the credit card application was approved: `"approved"` or `"denied"` values only.
- `"age"` (`float`): The applicant’s age, in years, to the nearest twelfth of a year.
- `"income"` (`float`): The applicant’s annual income, in tens of thousands of dollars.
- `"homeowner"` (`str`): Whether the credit card applicant owns their own home: `"yes"` or `"no"` values only.
- `"dependents"` (`int`): The number of dependents, or individuals that rely on the applicant as a primary source of income, such as children.

The first few rows of `apps` are shown below, though
remember that `apps` has 1,000 rows.

In `apps`, our sample of 1,000 credit card applications,
applicants who were approved for the credit card have fewer dependents,
on average, than applicants who were denied. The mean number of
dependents for approved applicants is 0.98, versus 1.07 for denied
applicants.

To test whether this difference is purely due to random chance, or whether the distributions of the number of dependents for approved and denied applicants are truly different in the population of all credit card applications, we decide to perform a permutation test.

Consider the incomplete code block below.

```
def shuffle_status(df):
    shuffled_status = np.random.permutation(df.get("status"))
    return df.assign(status=shuffled_status).get(["status", "dependents"])

def test_stat(df):
    grouped = df.groupby("status").mean().get("dependents")
    approved = grouped.loc["approved"]
    denied = grouped.loc["denied"]
    return __(a)__

stats = np.array([])
for i in np.arange(10000):
    shuffled_apps = shuffle_status(apps)
    stat = test_stat(shuffled_apps)
    stats = np.append(stats, stat)

p_value = np.count_nonzero(__(b)__) / 10000
```

Below are six options for filling in blanks (a) and (b) in the code above.

| | Blank (a) | Blank (b) |
|---|---|---|
| Option 1 | `denied - approved` | `stats >= test_stat(apps)` |
| Option 2 | `denied - approved` | `stats <= test_stat(apps)` |
| Option 3 | `approved - denied` | `stats >= test_stat(apps)` |
| Option 4 | `np.abs(denied - approved)` | `stats >= test_stat(apps)` |
| Option 5 | `np.abs(denied - approved)` | `stats <= test_stat(apps)` |
| Option 6 | `np.abs(approved - denied)` | `stats >= test_stat(apps)` |

The correct way to fill in the blanks depends on how we choose our null and alternative hypotheses.

Suppose we choose the following pair of hypotheses.

**Null Hypothesis**: In the population, the number of dependents of approved and denied applicants come from the same distribution.

**Alternative Hypothesis**: In the population, the number of dependents of approved applicants and denied applicants do not come from the same distribution.

Which of the six presented options could correctly fill in blanks (a) and (b) for this pair of hypotheses? Select all that apply.

Option 1

Option 2

Option 3

Option 4

Option 5

Option 6

None of the above.

Now, suppose we choose the following pair of hypotheses.

**Null Hypothesis**: In the population, the number of dependents of approved and denied applicants come from the same distribution.

**Alternative Hypothesis**: In the population, the number of dependents of approved applicants is smaller on average than the number of dependents of denied applicants.

Which of the six presented options could correctly fill in blanks (a) and (b) for this pair of hypotheses? Select all that apply.

Option 6 from the start of this question is repeated below.

| | Blank (a) | Blank (b) |
|---|---|---|
| Option 6 | `np.abs(approved - denied)` | `stats >= test_stat(apps)` |

We want to create a new option, Option 7, that replicates the behavior of Option 6, but with blank (a) filled in as shown:

| | Blank (a) | Blank (b) |
|---|---|---|
| Option 7 | `approved - denied` | |

Which expression below could go in blank (b) so that Option 7 is equivalent to Option 6?

`np.abs(stats) >= test_stat(apps)`

`stats >= np.abs(test_stat(apps))`

`np.abs(stats) >= np.abs(test_stat(apps))`

`np.abs(stats >= test_stat(apps))`

In our implementation of this permutation test, we followed the
procedure outlined in lecture to draw new pairs of samples under the
null hypothesis and compute test statistics — that is, we randomly
assigned each row to a group (approved or denied) by shuffling one of
the columns in `apps`, then computed the test statistic on
this random pair of samples.

Let’s now explore an alternative solution to drawing pairs of samples under the null hypothesis and computing test statistics. Here’s the approach:

- Shuffle, i.e. re-order, the rows of the DataFrame.
- Use the values at the top of the resulting `"dependents"` column as the new “denied” sample, and the values at the bottom of the resulting `"dependents"` column as the new “approved” sample. Note that we don’t necessarily split the DataFrame exactly in half — the sizes of these new samples depend on the number of “denied” and “approved” values in the original DataFrame!

Once we generate our pair of random samples in this way, we’ll
compute the test statistic on the random pair, as usual. Here, we’ll use
as our test statistic the difference between the mean number of
dependents for denied and approved applicants, in the order
**denied minus approved**.

**Fill in the blanks to complete the simulation
below.**

*Hint:* `np.random.permutation` shouldn’t appear
anywhere in your code.

```
def shuffle_all(df):
    '''Returns a DataFrame with the same rows as df, but reordered.'''
    return __(a)__

def fast_stat(df):
    # This function does not and should not contain any randomness.
    denied = np.count_nonzero(df.get("status") == "denied")
    mean_denied = __(b)__.get("dependents").mean()
    mean_approved = __(c)__.get("dependents").mean()
    return mean_denied - mean_approved

stats = np.array([])
for i in np.arange(10000):
    stat = fast_stat(shuffle_all(apps))
    stats = np.append(stats, stat)
```

Researchers from the San Diego Zoo, located within Balboa Park, collected physical measurements of three species of penguins (Adelie, Chinstrap, or Gentoo) in a region of Antarctica. One piece of information they tracked for each of 330 penguins was its mass in grams. The average penguin mass is 4200 grams, and the standard deviation is 840 grams.

We’re interested in investigating the differences between the masses of Adelie penguins and Chinstrap penguins. Specifically, our null hypothesis is that their masses are drawn from the same population distribution, and any observed differences are due to chance only.

Below, we have a snippet of working code for this hypothesis test,
for a specific test statistic. Assume that `adelie_chinstrap`
is a DataFrame of only Adelie and Chinstrap penguins, with just two
columns – `'species'` and `'mass'`.

```
stats = np.array([])
num_reps = 500
for i in np.arange(num_reps):
    # --- line (a) starts ---
    shuffled = np.random.permutation(adelie_chinstrap.get('species'))
    # --- line (a) ends ---
    # --- line (b) starts ---
    with_shuffled = adelie_chinstrap.assign(species=shuffled)
    # --- line (b) ends ---
    grouped = with_shuffled.groupby('species').mean()
    # --- line (c) starts ---
    stat = grouped.get('mass').iloc[0] - grouped.get('mass').iloc[1]
    # --- line (c) ends ---
    stats = np.append(stats, stat)
```

Which of the following statements best describes the procedure above?

This is a standard hypothesis test, and our test statistic is the total variation distance between the distribution of Adelie masses and Chinstrap masses

This is a standard hypothesis test, and our test statistic is the difference between the expected proportion of Adelie penguins and the proportion of Adelie penguins in our resample

This is a permutation test, and our test statistic is the total variation distance between the distribution of Adelie masses and Chinstrap masses

This is a permutation test, and our test statistic is the difference in the mean Adelie mass and mean Chinstrap mass

For your convenience, we copy the code for the hypothesis test below.

```
stats = np.array([])
num_reps = 500
for i in np.arange(num_reps):
    # --- line (a) starts ---
    shuffled = np.random.permutation(adelie_chinstrap.get('species'))
    # --- line (a) ends ---
    # --- line (b) starts ---
    with_shuffled = adelie_chinstrap.assign(species=shuffled)
    # --- line (b) ends ---
    grouped = with_shuffled.groupby('species').mean()
    # --- line (c) starts ---
    stat = grouped.get('mass').iloc[0] - grouped.get('mass').iloc[1]
    # --- line (c) ends ---
    stats = np.append(stats, stat)
```

What would happen if we removed `line (a)`, and replaced
`line (b)` with

`with_shuffled = adelie_chinstrap.sample(adelie_chinstrap.shape[0], replace=False)`

Select the best answer.

This would still run a valid hypothesis test

This would not run a valid hypothesis test, as all values in the `stats` array would be exactly the same

This would not run a valid hypothesis test, even though there would be several different values in the `stats` array

This would not run a valid hypothesis test, as it would incorporate information about Gentoo penguins

For your convenience, we copy the code for the hypothesis test below.

```
stats = np.array([])
num_reps = 500
for i in np.arange(num_reps):
    # --- line (a) starts ---
    shuffled = np.random.permutation(adelie_chinstrap.get('species'))
    # --- line (a) ends ---
    # --- line (b) starts ---
    with_shuffled = adelie_chinstrap.assign(species=shuffled)
    # --- line (b) ends ---
    grouped = with_shuffled.groupby('species').mean()
    # --- line (c) starts ---
    stat = grouped.get('mass').iloc[0] - grouped.get('mass').iloc[1]
    # --- line (c) ends ---
    stats = np.append(stats, stat)
```

What would happen if we removed `line (a)`, and replaced
`line (b)` with

`with_shuffled = adelie_chinstrap.sample(adelie_chinstrap.shape[0], replace=True)`

Select the best answer.

This would still run a valid hypothesis test

This would not run a valid hypothesis test, as all values in the `stats` array would be exactly the same

This would not run a valid hypothesis test, even though there would be several different values in the `stats` array

This would not run a valid hypothesis test, as it would incorporate information about Gentoo penguins

For your convenience, we copy the code for the hypothesis test below.

```
stats = np.array([])
num_reps = 500
for i in np.arange(num_reps):
    # --- line (a) starts ---
    shuffled = np.random.permutation(adelie_chinstrap.get('species'))
    # --- line (a) ends ---
    # --- line (b) starts ---
    with_shuffled = adelie_chinstrap.assign(species=shuffled)
    # --- line (b) ends ---
    grouped = with_shuffled.groupby('species').mean()
    # --- line (c) starts ---
    stat = grouped.get('mass').iloc[0] - grouped.get('mass').iloc[1]
    # --- line (c) ends ---
    stats = np.append(stats, stat)
```

What would happen if we replaced `line (a)` with

```
with_shuffled = adelie_chinstrap.assign(
    species=np.random.permutation(adelie_chinstrap.get('species'))
)
```

and replaced `line (b)` with

```
with_shuffled = with_shuffled.assign(
    mass=np.random.permutation(adelie_chinstrap.get('mass'))
)
```

Select the best answer.

This would still run a valid hypothesis test

This would not run a valid hypothesis test, as all values in the `stats` array would be exactly the same

This would not run a valid hypothesis test, even though there would be several different values in the `stats` array

This would not run a valid hypothesis test, as it would incorporate information about Gentoo penguins

Suppose we run the code for the hypothesis test and see the following empirical distribution for the test statistic. In red is the observed statistic.

Suppose our alternative hypothesis is that Chinstrap penguins weigh more on average than Adelie penguins. Which of the following is closest to the p-value for our hypothesis test?

0

\frac{1}{4}

\frac{1}{3}

\frac{2}{3}

\frac{3}{4}

1

Choose the best tool to answer each of the following questions. Note the following:

- By “hypothesis testing”, we mean “standard” hypothesis testing, i.e. hypothesis testing that **doesn’t** involve permutation testing or bootstrapping.
- By “bootstrapping”, we mean bootstrapping that **doesn’t** involve hypothesis testing.

Are incomes of applicants with 2 or fewer dependents drawn randomly from the distribution of incomes of all applicants?

Hypothesis Testing

Permutation Testing

Bootstrapping

What is the median income of credit card applicants with 2 or fewer dependents?

Hypothesis Testing

Permutation Testing

Bootstrapping

Are credit card applications approved through a random process in which 50% of applications are approved?

Hypothesis Testing

Permutation Testing

Bootstrapping

Is the median income of applicants with 2 or fewer dependents less than the median income of applicants with 3 or more dependents?

Hypothesis Testing

Permutation Testing

Bootstrapping

What is the difference in median income of applicants with 2 or fewer dependents and applicants with 3 or more dependents?

Hypothesis Testing

Permutation Testing

Bootstrapping