← return to practice.dsc10.com

The problems in this worksheet are taken from past exams. Work on
them **on paper**, since the exams you take in this course
will also be on paper.

We encourage you to complete this
worksheet in a live discussion section. Solutions will be made available
after all discussion sections have concluded. You don’t need to submit
your answers anywhere.**Note: We do not plan to cover all
problems here in the live discussion section**; the problems we don’t
cover can be used for extra practice.

For this question, we will use data from the 2021 Women’s National Basketball Association (WNBA) season for the next several problems. In basketball, players score points by shooting the ball into a hoop. The team that scores the most points wins the game.

Kelsey Plum, a WNBA player, attended La Jolla Country Day School,
which is adjacent to UCSD’s campus. Her current team is the Las Vegas
Aces (three-letter code `'LVA'`

). **In 2021, the Las
Vegas Aces played 31 games, and Kelsey Plum played in all
31.**

The DataFrame `plum`

contains her stats for all games the
Las Vegas Aces played in 2021. The first few rows of `plum`

are shown below (though the full DataFrame has 31 rows, not 5):

Each row in `plum`

corresponds to a single game. For each
game, we have:

`'Date'`

(`str`

), the date on which the game was played`'Opp'`

(`str`

), the three-letter code of the opponent team`'Home'`

(`bool`

),`True`

if the game was played in Las Vegas (“home”) and`False`

if it was played at the opponent’s arena (“away”)`'Won'`

(`bool`

),`True`

if the Las Vegas Aces won the game and`False`

if they lost`'PTS'`

(`int`

), the number of points Kelsey Plum scored in the game`'AST'`

(`int`

), the number of assists (passes) Kelsey Plum made in the game`'TOV'`

(`int`

), the number of turnovers Kelsey Plum made in the game (a turnover is when you lose the ball – turnovers are bad!)

Consider the definition of the function
`diff_in_group_means`

:

```
def diff_in_group_means(df, group_col, num_col):
= df.groupby(group_col).mean().get(num_col)
s return s.loc[False] - s.loc[True]
```

It turns out that Kelsey Plum averages 0.61 more assists in games that she wins than in games that she loses. After observing that Kelsey Plum averages more assists in winning games than in losing games, we become interested in conducting a permutation test for the following hypotheses:

**Null Hypothesis:**The number of assists Kelsey Plum makes in winning games and in losing games come from the same distribution.**Alternative Hypothesis:**The number of assists Kelsey Plum makes in winning games is higher on average than the number of assists that she makes in losing games.

To conduct our permutation test, we place the following code in a
`for`

-loop.

```
= plum.get('Won')
won = plum.get('AST')
ast = plum.assign(Won_shuffled=np.random.permutation(won)) \
shuffled =np.random.permutation(ast)) .assign(AST_shuffled
```

Which of the following options **does not** compute a
valid simulated test statistic for this permutation test?

`diff_in_group_means(shuffled, 'Won', 'AST')`

`diff_in_group_means(shuffled, 'Won', 'AST_shuffled')`

`diff_in_group_means(shuffled, 'Won_shuffled, 'AST')`

`diff_in_group_means(shuffled, 'Won_shuffled, 'AST_shuffled')`

More than one of these options do not compute a valid simulated test statistic for this permutation test

**Answer:**
`diff_in_group_means(shuffled, 'Won', 'AST')`

`diff_in_group_means(shuffled, 'Won', 'AST')`

computes the
observed test statistic, which is -0.61. There is no randomness involved
in the observed test statistic; each time we run the line
`diff_in_group_means(shuffled, 'Won', 'AST')`

we will see the
same result, so this cannot be used for simulation.

To perform a permutation test here, we need to simulate under the null by randomly assigning assist counts to groups; here, the groups are “win” and “loss”.

**Option 2:**Here, assist counts are shuffled and the group names are kept in the same order. The end result is a random pairing of assists to groups.**Option 3:**Here, the group names are shuffled and the assist counts are kept in the same order. The end result is a random pairing of assist counts to groups.**Option 4:**Here, both the group names and assist counts are shuffled, but the end result is still the same as in the previous two options.

As such, Options 2 through 4 are all valid, and Option 1 is the only invalid one.

The average score on this problem was 68%.

Suppose we generate 10,000 simulated test statistics, using one of
the valid options from Question 1.1. The empirical distribution of test
statistics, with a red line at `observed_diff`

, is shown
below.

Roughly one-quarter of the area of the histogram above is to the left of the red line. What is the correct interpretation of this result?

There is roughly a one quarter probability that Kelsey Plum’s number of assists in winning games and in losing games come from the same distribution.

The significance level of this hypothesis test is roughly a quarter.

Under the assumption that Kelsey Plum’s number of assists in winning games and in losing games come from the same distribution, and that she wins 22 of the 31 games she plays, the chance of her averaging

**at least**0.61 more assists in wins than losses is roughly a quarter.Under the assumption that Kelsey Plum’s number of assists in winning games and in losing games come from the same distribution, and that she wins 22 of the 31 games she plays, the chance of her averaging 0.61 more assists in wins than losses is roughly a quarter.

**Answer:** Under the assumption that Kelsey Plum’s
number of assists in winning games and in losing games come from the
same distribution, and that she wins 22 of the 31 games she plays, the
chance of her averaging **at least** 0.61 more assists in
wins than losses is roughly a quarter. (Option 3)

First, we should note that the area to the left of the red line (a quarter) is the p-value of our hypothesis test. Generally, the p-value is the probability of observing an outcome as or more extreme than the observed, under the assumption that the null hypothesis is true. The direction to look in depends on the alternate hypothesis; here, since our alternative hypothesis is that the number of assists Kelsey Plum makes in winning games is higher on average than in losing games, a “more extreme” outcome is where the assists in winning games are higher than in losing games, i.e. where \text{(assists in wins)} - \text{(assists in losses)} is positive or where \text{(assists in losses)} - \text{(assists in wins)} is negative. As mentioned in the solution to the first subpart, our test statistic is \text{(assists in losses)} - \text{(assists in wins)}, so a more extreme outcome is one where this is negative, i.e. to the left of the observed statistic.

Let’s first rule out the first two options.

**Option 1:**This option states that the probability that the null hypothesis (the number of assists she makes in winning and losing games comes from the same distribution) is true is roughly a quarter. However, the p-value**is not**the probability that the null hypothesis is true.**Option 2:**The significance level is the formal name for the p-value “cutoff” that we specify in our hypothesis test. There is no cutoff mentioned in the problem. The*observed*significance level is another name for the p-value, but Option 2 did not contain the word*observed*.

Now, the only difference between Options 3 and 4 is the inclusion of
“at least” in Option 3. Remember, to compute a p-value we must compute
the probability of observing something as **or more**
extreme than the observed, under the null. The “or more” corresponds to
“at least” in Option 3. As such, Option 3 is the correct choice.

The average score on this problem was 70%.

Let’s suppose there are 4 different types of shots a basketball player can take – layups, midrange shots, threes, and free throws.

The DataFrame `breakdown`

has 4 rows and 50 columns – one
row for each of the 4 shot types mentioned above, and one column for
each of 50 different players. Each column of `breakdown`

describes the distribution of shot types for a single player.

The first few columns of `breakdown`

are shown below.

For instance, 30% of Kelsey Plum’s shots are layups, 30% of her shots are midrange shots, 20% of her shots are threes, and 20% of her shots are free throws.

Below, we’ve drawn an overlaid bar chart showing the shot distributions of Kelsey Plum and Chiney Ogwumike, a player on the Los Angeles Sparks.

What is the **total variation distance** (TVD) between
Kelsey Plum’s shot distribution and Chiney Ogwumike’s shot distribution?
Give your answer as a **proportion** between 0 and 1 (not a
percentage) rounded to three decimal places.

**Answer:** 0.2

Recall, the TVD is the sum of the absolute differences in proportions, divided by 2. The absolute differences in proportions for each category are as follows:

- Free Throws: |0.05 - 0.2| = 0.15
- Threes: |0.35 - 0.2| = 0.15
- Midrange: |0.35 - 0.3| = 0.05
- Layups: |0.25 - 0.3| = 0.05

Then, we have

\text{TVD} = \frac{1}{2} (0.15 + 0.15 + 0.05 + 0.05) = 0.2

The average score on this problem was 84%.

Recall, `breakdown`

has information for 50 different
players. We want to find the player whose shot distribution is the
**most similar to Kelsey Plum**, i.e. has the lowest TVD
with Kelsey Plum’s shot distribution.

Fill in the blanks below so that `most_sim_player`

evaluates to the name of the player with the most similar shot
distribution to Kelsey Plum. Assume that the column named
`'Kelsey Plum'`

is the first column in `breakdown`

(and again that `breakdown`

has 50 columns total).

```
= ''
most_sim_player = __(a)__
lowest_tvd_so_far = np.array(breakdown.columns).take(__(b)__)
other_players for player in other_players:
= tvd(breakdown.get('Kelsey Plum'),
player_tvd
breakdown.get(player))if player_tvd < lowest_tvd_so_far:
= player_tvd
lowest_tvd_so_far __(c)__
```

- What goes in blank (a)?

-1

-0.5

0

0.5

1

`np.array([])`

`''`

What goes in blank (b)?

What goes in blank (c)?

**Answers:** 1, `np.arange(1, 50)`

,
`most_sim_player = player`

Let’s try and understand the code provided to us. It appears that
we’re looping over the names of all other players, each time computing
the TVD between Kelsey Plum’s shot distribution and that player’s shot
distribution. If the TVD calculated in an iteration of the
`for`

-loop (`player_tvd`

) is less than the
previous lowest TVD (`lowest_tvd_so_far`

), the current player
(`player`

) is now the most “similar” to Kelsey Plum, and so
we store their TVD and name (in `most_sim_player`

).

Before the `for`

-loop, we haven’t looked at any other
players, so we don’t have values to store in
`most_sim_player`

and `lowest_tvd_so_far`

. On the
first iteration of the `for`

-loop, both of these values need
to be updated to reflect Kelsey Plum’s similarity with the first player
in `other_players`

. This is because, if we’ve only looked at
one player, that player is the most similar to Kelsey Plum.
`most_sim_player`

is already initialized as an empty string,
and we will specify how to “update” `most_sim_player`

in
blank (c). For blank (a), we need to pick a value of
`lowest_tvd_so_far`

that we can **guarantee**
will be updated on the first iteration of the `for`

-loop.
Recall, TVDs range from 0 to 1, with 0 meaning “most similar” and 1
meaning “most different”. This means that no matter what, the TVD
between Kelsey Plum’s distribution and the first player’s distribution
will be less than 1*, and so if we initialize
`lowest_tvd_so_far`

to 1 before the `for`

-loop, we
know it will be updated on the first iteration.

- It’s possible that the TVD between Kelsey Plum’s shot distribution
and the first other player’s shot distribution is equal to 1, rather
than being less than 1. If that were to happen, our code would still
generate the correct answer, but
`lowest_tvd_so_far`

and`most_sim_player`

wouldn’t be updated on the first iteration. Rather, they’d be updated on the first iteration where`player_tvd`

is strictly less than 1. (We’d expect that the TVDs between all pairs of players are neither exactly 0 nor exactly 1, so this is not a practical issue.) To avoid this issue entirely, we could change`if player_tvd < lowest_tvd_so_far`

to`if player_tvd <= lowest_tvd_so_far`

, which would make sure that even if the first TVD is 1, both`lowest_tvd_so_far`

and`most_sim_player`

are updated on the first iteration. - Note that we could have initialized
`lowest_tvd_so_far`

to a value larger than 1 as well. Suppose we initialized it to 55 (an arbitrary positive integer). On the first iteration of the`for`

-loop,`player_tvd`

will be less than 55, and so`lowest_tvd_so_far`

will be updated.

Then, we need `other_players`

to be an array containing
the names of all players other than Kelsey Plum, whose name is stored at
position 0 in `breakdown.columns`

. We are told that there are
50 players total, i.e. that there are 50 columns in
`breakdown`

. We want to `take`

the elements in
`breakdown.columns`

at positions 1, 2, 3, …, 49 (the last
element), and the call to `np.arange`

that generates this
sequence of positions is `np.arange(1, 50)`

. (Remember,
`np.arange(a, b)`

does not include the second integer!)

In blank (c), as mentioned in the explanation for blank (a), we need
to update the value of `most_sim_player`

. (Note that we only
arrive at this line if `player_tvd`

is the lowest pairwise
TVD we’ve seen so far.) All this requires is
`most_sim_player = player`

, since `player`

contains the name of the player who we are looking at in the current
iteration of the `for`

-loop.

The average score on this problem was 70%.

Let’s again consider the shot distributions of Kelsey Plum and Cheney Ogwumike.

We define the **maximum squared distance** (MSD) between
two categorical distributions as the **largest squared difference
between the proportions of any category**.

What is the MSD between Kelsey Plum’s shot distribution and Chiney
Ogwumike’s shot distribution? Give your answer as a
**proportion** between 0 and 1 (not a percentage) rounded
to three decimal places.

**Answer:** 0.023

Recall, in the solution to the first subpart of this problem, we calculated the absolute differences between the proportions of each category.

- Free Throws: |0.05 - 0.2| = 0.15
- Threes: |0.35 - 0.2| = 0.15
- Midrange: |0.35 - 0.3| = 0.05
- Layups: |0.25 - 0.3| = 0.05

The squared differences between the proportions of each category are computed by squaring the results in the list above (e.g. for Free Throws we’d have (0.05 - 0.2)^2 = 0.15^2). To find the maximum squared difference, then, all we need to do is find the largest of 0.15^2, 0.15^2, 0.05^2, and 0.05^2. Since 0.15 > 0.05, we have that the maximum squared distance is 0.15^2 = 0.0225, which rounds to 0.023.

The average score on this problem was 85%.

For your convenience, we show the first few columns of
`breakdown`

again below.

In basketball:

- layups are worth 2 points,
- midrange shots are worth 2 points,
- threes are worth 3 points, and
- free throws are worth 1 point

Suppose that Kelsey Plum is guaranteed to shoot exactly 10 shots a
game. The type of each shot is drawn from the `'Kelsey Plum'`

column of `breakdown`

(meaning that, for example, there is a
30% chance each shot is a layup).

Fill in the blanks below to complete the definition of the function
`simulate_points`

, which simulates the number of points
Kelsey Plum scores in a single game. (`simulate_points`

should return a single number.)

```
def simulate_points():
= np.random.multinomial(__(a)__, breakdown.get('Kelsey Plum'))
shots = np.array([2, 2, 3, 1])
possible_points return __(b)__
```

- What goes in blank (a)?
- What goes in blank (b)?

**Answers:** `10`

,
`(shots * possible_points).sum()`

To simulate the number of points Kelsey Plum scores in a single game, we need to:

- Simulate the number of shots she takes of each type (layups, midranges, threes, free throws).
- Using the simulated distribution in step 1, find the total number of points she scores – specifically, add 2 for every layup, 2 for every midrange, 3 for every three, and 1 for every free throw.

To simulate the number of shots she takes of each type, we use
`np.random.multinomial`

. This is because each shot,
independently of all other shots, has a 30% chance of being a layup, a
30% chance of being a midrange, and so on. What goes in blank (a) is the
number of shots she is taking in total; here, that is 10.
`shots`

will be an array of length 4 containing the number of
shots of each type - for instance, `shots`

may be
`np.array([3, 4, 2, 1])`

, which would mean she took 3 layups,
4 midranges, 2 threes, and 1 free throw.

Now that we have `shots`

, we need to factor in how many
points each type of shot is worth. This can be accomplished by
multiplying `shots`

with `possible_points`

, which
was already defined for us. Using the example where `shots`

is `np.array([3, 4, 2, 1])`

,
`shots * possible_points`

evaluates to
`np.array([6, 8, 6, 1])`

, which would mean she scored 6
points from layups, 8 points from midranges, and so on. Then, to find
the total number of points she scored, we need to compute the sum of
this array, either using the `np.sum`

function or
`.sum()`

method. As such, the two correct answers for blank
(b) are `(shots * possible_points).sum()`

and
`np.sum(shots * possible_points)`

.

The average score on this problem was 84%.

IKEA is a Swedish furniture company that designs and sells ready-to-assemble furniture and other home furnishings.

An IKEA fan created an app where people can log the amount of time it
took them to assemble their IKEA furniture. The DataFrame
`app_data`

has a row for each product build that was logged
on the app. The columns are:

`'product'`

(`str`

): the name of the product, which includes the product line as the first word, followed by a description of the product`'category'`

(`str`

): a categorical description of the type of product`'assembly_time'`

(`str`

): the amount of time to assemble the product, formatted as`'x hr, y min'`

where`x`

and`y`

represent integers, possibly zero

The first few rows of `app_data`

are shown below, though
`app_data`

has many more rows than pictured (5000 rows
total).

Assume that we have already run `import babypandas as bpd`

and `import numpy as np`

.

You are browsing the IKEA showroom, deciding whether to purchase the
BILLY bookcase or the LOMMARP bookcase. You are concerned about the
amount of time it will take to assemble your new bookcase, so you look
up the assembly times reported in `app_data`

. Thinking of the
data in `app_data`

as a random sample of all IKEA purchases,
you want to perform a permutation test to test the following
hypotheses.

**Null Hypothesis**: The assembly time for the BILLY
bookcase and the assembly time for the LOMMARP bookcase come from the
same distribution.

**Alternative Hypothesis**: The assembly time for the
BILLY bookcase and the assembly time for the LOMMARP bookcase come from
different distributions.

Suppose we added a column to `app_data`

called
`'minutes'`

, containing the `'assembly_time'`

value for each entry converted to an integer amount of minutes. Then, we
query `app_data`

to keep only the BILLY bookcases, then
average the `'minutes'`

column. In addition, we separately
query `app_data`

to keep only the LOMMARP bookcases, then
average the `'minutes'`

column. If the null hypothesis is
true, which of the following statements about these two averages is
correct?

These two averages are the same.

Any difference between these two averages is due to random chance.

Any difference between these two averages cannot be ascribed to random chance alone.

The difference between these averages is statistically significant.

**Answer: ** Any difference between these two averages
is due to random chance.

If the null hypothesis is true, this means that the time recorded in
`app_data`

for each BILLY bookcase is a random number that
comes from some distribution, and the time recorded in
`app_data`

for each LOMMARP bookcase is a random number that
comes from *the same* distribution. Each assembly time is a
random number, so even if the null hypothesis is true, if we take one
person who assembles a BILLY bookcase and one person who assembles a
LOMMARP bookcase, there is no guarantee that their assembly times will
match. Their assembly times might match, or they might be different,
because assembly time is random. Randomness is the only reason that
their assembly times might be different, as the null hypothesis says
there is no systematic difference in assembly times between the two
bookcases. Specifically, it’s not the case that one typically takes
longer to assemble than the other.

With those points in mind, let’s go through the answer choices.

The first answer choice is incorrect. Just because two sets of
numbers are drawn from the same distribution, the numbers themselves
might be different due to randomness, and the averages might also be
different. Maybe just by chance, the people who assembled the BILLY
bookcases and recorded their times in `app_data`

were slower
on average than the people who assembled LOMMARP bookcases. If the null
hypothesis is true, this difference in average assembly time should be
small, but it very likely exists to some degree.

The second answer choice is correct. If the null hypothesis is true, the only reason for the difference is random chance alone.

The third answer choice is incorrect for the same reason that the second answer choice is correct. If the null hypothesis is true, any difference must be explained by random chance.

The fourth answer choice is incorrect. If there is a difference between the averages, it should be very small and not statistically significant. In other words, if we did a hypothesis test and the null hypothesis was true, we should fail to reject the null.

The average score on this problem was 77%.

For the permutation test, we’ll use as our test statistic the average assembly time for BILLY bookcases minus the average assembly time for LOMMARP bookcases, in minutes.

Complete the code below to generate one simulated value of the test
statistic in a new way, without using
`np.random.permutation`

.

```
= (app_data.get('product') ==
billy 'BILLY Bookcase, white, 31 1/2x11x79 1/2')
= (app_data.get('product') ==
lommarp 'LOMMARP Bookcase, dark blue-green, 25 5/8x78 3/8')
= app_data[billy|lommarp]
billy_lommarp = np.random.choice(billy_lommarp.get('minutes'), billy.sum()).mean()
billy_mean = _________
lommarp_mean - lommarp_mean billy_mean
```

What goes in the blank?

`billy_lommarp[lommarp].get('minutes').mean()`

`np.random.choice(billy_lommarp.get('minutes'), lommarp.sum()).mean()`

`billy_lommarp.get('minutes').mean() - billy_mean`

`(billy_lommarp.get('minutes').sum() - billy_mean * billy.sum())/lommarp.sum()`

**Answer: **
`(billy_lommarp.get('minutes').sum() - billy_mean * billy.sum())/lommarp.sum()`

The first line of code creates a boolean Series with a True value for
every BILLY bookcase, and the second line of code creates the analogous
Series for the LOMMARP bookcase. The third line queries to define a
DataFrame called `billy_lommarp`

containing all products that
are BILLY or LOMMARP bookcases. In other words, this DataFrame contains
a mix of BILLY and LOMMARP bookcases.

From this point, the way we would normally proceed in a permutation
test would be to use `np.random.permutation`

to shuffle one
of the two relevant columns (either `'product'`

or
`'minutes'`

) to create a random pairing of assembly times
with products. Then we would calculate the average of all assembly times
that were randomly assigned to the label BILLY. Similarly, we’d
calculate the average of all assembly times that were randomly assigned
to the label LOMMARP. Then we’d subtract these averages to get one
simulated value of the test statistic. To run the permutation test, we’d
have to repeat this process many times.

In this problem, we need to generate a simulated value of the test
statistic, without randomly shuffling one of the columns. The code
starts us off by defining a variable called `billy_mean`

that
comes from using `np.random.choice`

. There’s a lot going on
here, so let’s break it down. Remember that the first argument to
`np.random.choice`

is a sequence of values to choose from,
and the second is the number of random choices to make. The default
behavior is to make the random choices without replacement, so that no
element that has already been chosen can be chosen again. Here, we’re
making our random choices from the `'minutes'`

column of
`billy_lommarp`

. The number of choices to make from this
collection of values is `billy.sum()`

, which is the sum of
all values in the `billy`

Series defined in the first line of
code. The `billy`

Series contains True/False values, but in
Python, True counts as 1 and False counts as 0, so
`billy.sum()`

evaluates to the number of True entries in
`billy`

, which is the number of BILLY bookcases recorded in
`app_data`

. It helps to think of the random process like
this:

- Collect all the assembly times of any BILLY or LOMMARP bookcase in a large bag.
- Pull out a random assembly time from this bag.
- Repeat step 2, drawing as many times as there are BILLY bookcases, without replacement.

If we think of the random times we draw as being labeled BILLY, then the remaining assembly times still leftover in the bag represent the assembly times randomly labeled LOMMARP. In other words, this is a random association of assembly times to labels (BILLY or LOMMARP), which is the same thing we usually accomplish by shuffling in a permutation test.

From here, we can proceed the same way as usual. First, we need to
calculate the average of all assembly times that were randomly assigned
to the label BILLY. This is done for us and stored in
`billy_mean`

. We also need to calculate the average of all
assembly times that were randomly assigned the label LOMMARP. We’ll call
that `lommarp_mean`

. Thinking of picking times out of a large
bag, this is the average of all the assembly times left in the bag. The
problem is there is no easy way to access the assembly times that were
not picked. We can take advantage of the fact that we can easily
calculate the total assembly time of all BILLY and LOMMARP bookcases
together with `billy_lommarp.get('minutes').sum()`

. Then if
we subtract the total assembly time of all bookcases randomly labeled
BILLY, we’ll be left with the total assembly time of all bookcases
randomly labeled LOMMARP. That is,
`billy_lommarp.get('minutes').sum() - billy_mean * billy.sum()`

represents the total assembly time of all bookcases randomly labeled
LOMMARP. The count of the number of LOMMARP bookcases is given by
`lommarp.sum()`

so the average is
`(billy_lommarp.get('minutes').sum() - billy_mean * billy.sum())/lommarp.sum()`

.

A common wrong answer for this question was the second answer choice,
`np.random.choice(billy_lommarp.get('minutes'), lommarp.sum()).mean()`

.
This mimics the structure of how `billy_mean`

was defined so
it’s a natural guess. However, this corresponds to the following random
process, which doesn’t associate each assembly with a unique label
(BILLY or LOMMARP):

- Collect all the assembly times of any BILLY or LOMMARP bookcase in a large bag.
- Pull out a random assembly time from this bag.
- Repeat, drawing as many times as there are BILLY bookcases, without replacement.
- Collect all the assembly times of any BILLY or LOMMARP bookcase in a large bag.
- Pull out a random assembly time from this bag.
- Repeat step 5, drawing as many times as there are LOMMARP bookcases, without replacement.

We could easily get the same assembly time once for BILLY and once for LOMMARP, while other assembly times could get picked for neither. This process doesn’t split the data into two random groups as desired.

The average score on this problem was 12%.

Researchers from the San Diego Zoo, located within Balboa Park, collected physical measurements of three species of penguins (Adelie, Chinstrap, or Gentoo) in a region of Antarctica. One piece of information they tracked for each of 330 penguins was its mass in grams. The average penguin mass is 4200 grams, and the standard deviation is 840 grams.

We’re interested in investigating the differences between the masses of Adelie penguins and Chinstrap penguins. Specifically, our null hypothesis is that their masses are drawn from the same population distribution, and any observed differences are due to chance only.

Below, we have a snippet of working code for this hypothesis test,
for a specific test statistic. Assume that `adelie_chinstrap`

is a DataFrame of only Adelie and Chinstrap penguins, with just two
columns – `'species'`

and `'mass'`

.

```
= np.array([])
stats = 500
num_reps for i in np.arange(num_reps):
# --- line (a) starts ---
= np.random.permutation(adelie_chinstrap.get('species'))
shuffled # --- line (a) ends ---
# --- line (b) starts ---
= adelie_chinstrap.assign(species=shuffled)
with_shuffled # --- line (b) ends ---
= with_shuffled.groupby('species').mean()
grouped
# --- line (c) starts ---
= grouped.get('mass').iloc[0] - grouped.get('mass').iloc[1]
stat # --- line (c) ends ---
= np.append(stats, stat) stats
```

Which of the following statements best describe the procedure above?

This is a standard hypothesis test, and our test statistic is the total variation distance between the distribution of Adelie masses and Chinstrap masses

This is a standard hypothesis test, and our test statistic is the difference between the expected proportion of Adelie penguins and the proportion of Adelie penguins in our resample

This is a permutation test, and our test statistic is the total variation distance between the distribution of Adelie masses and Chinstrap masses

This is a permutation test, and our test statistic is the difference in the mean Adelie mass and mean Chinstrap mass

**Answer:** This is a permutation test, and our test
statistic is the difference in the mean Adelie mass and mean Chinstrap
mass (Option 4)

Recall, a permutation test helps us decide whether two random samples
come from the same distribution. This test matches our goal of testing
whether the masses of Adelie penguins and Chinstrap penguins are drawn
from the same population distribution. The code above are also doing
steps of a permutation test. In part (a), it shuffles
`'species'`

and stores the shuffled series to
`shuffled`

. In part (b), it assign the shuffled series of
values to `'species'`

column. Then, it uses
`grouped = with_shuffled.groupby('species').mean()`

to
calculate the mean of each species. In part (c), it computes the
difference between mean mass of the two species by first getting the
`'mass'`

column and then accessing mean mass of each group
(Adelie and Chinstrap) with positional index `0`

and
`1`

.

The average score on this problem was 98%.

Currently, line (c) (marked with a comment) uses .iloc. Which of the following options compute the exact same statistic as line (c) currently does?

Option 1:

`= grouped.get('mass').loc['Adelie'] - grouped.get('mass').loc['Chinstrap'] stat `

Option 2:

`= grouped.get('mass').loc['Chinstrap'] - grouped.get('mass').loc['Adelie'] stat `

Option 1 only

Option 2 only

Both options

Neither option

**Answer:** Option 1 only

We use `df.get(column_name).iloc[positional_index]`

to
access the value in a column with `positional_index`

.
Similarly, we use `df.get(column_name).loc[index]`

to access
value in a column with its `index`

. Remember
`grouped`

is a DataFrame that
`groupby('species')`

, so we have species name
`'Adelie'`

and `'Chinstrap'`

as index for
`grouped`

.

Option 2 is incorrect since it does subtraction in the reverse order
which results in a different `stat`

compared to
`line(c)`

. Its output will be -1
\cdot `stat`

. Recall, in
`grouped = with_shuffled.groupby('species').mean()`

, we use
`groupby()`

and since `'species'`

is a column with
string values, our index will be sorted in alphabetical order. So,
`.iloc[0]`

is `'Adelie'`

and `.iloc[1]`

is `'Chinstrap'`

.

The average score on this problem was 81%.

Is it possible to re-write `line (c)`

in a way that uses
`.iloc[0]`

twice, without any other uses of `.loc`

or `.iloc`

?

Yes, it’s possible

No, it’s not possible

**Answer:** Yes, it’s possible

There are multiple ways to achieve this. For instance
`stat = grouped.get('mass').iloc[0] - grouped.sort_index(ascending = False).get('mass').iloc[0]`

.

The average score on this problem was 64%.

For your convenience, we copy the code for the hypothesis test below.

```
= np.array([])
stats = 500
num_reps for i in np.arange(num_reps):
# --- line (a) starts ---
= np.random.permutation(adelie_chinstrap.get('species'))
shuffled # --- line (a) ends ---
# --- line (b) starts ---
= adelie_chinstrap.assign(species=shuffled)
with_shuffled # --- line (b) ends ---
= with_shuffled.groupby('species').mean()
grouped
# --- line (c) starts ---
= grouped.get('mass').iloc[0] - grouped.get('mass').iloc[1]
stat # --- line (c) ends ---
= np.append(stats, stat) stats
```

What would happen if we removed `line (a)`

, and replaced
`line (b)`

with

`= adelie_chinstrap.sample(adelie_chinstrap.shape[0], replace=False) with_shuffled `

Select the best answer.

This would still run a valid hypothesis test

This would not run a valid hypothesis test, as all values in the

`stats`

array would be exactly the sameThis would not run a valid hypothesis test, even though there would be several different values in the

`stats`

arrayThis would not run a valid hypothesis test, as it would incorporate information about Gentoo penguins

**Answer:** This would not run a valid hypothesis test,
as all values in the `stats`

array would be exactly the same
(Option 2)

Recall, `DataFrame.sample(n, replace = False)`

(or
`DataFrame.sample(n)`

since `replace = False`

is
by default) returns a DataFrame by randomly sampling `n`

rows
from the DataFrame, without replacement. Since our `n`

is
`adelie_chinstrap.shape[0]`

, and we are sampling without
replacement, we will get the exactly same Dataframe (though the order of
rows may be different but the `stats`

array would be exactly
the same).

The average score on this problem was 87%.

For your convenience, we copy the code for the hypothesis test below.

```
= np.array([])
stats = 500
num_reps for i in np.arange(num_reps):
# --- line (a) starts ---
= np.random.permutation(adelie_chinstrap.get('species'))
shuffled # --- line (a) ends ---
# --- line (b) starts ---
= adelie_chinstrap.assign(species=shuffled)
with_shuffled # --- line (b) ends ---
= with_shuffled.groupby('species').mean()
grouped
# --- line (c) starts ---
= grouped.get('mass').iloc[0] - grouped.get('mass').iloc[1]
stat # --- line (c) ends ---
= np.append(stats, stat) stats
```

What would happen if we removed `line (a)`

, and replaced
`line (b)`

with

`= adelie_chinstrap.sample(adelie_chinstrap.shape[0], replace=True) with_shuffled `

Select the best answer.

This would still run a valid hypothesis test

This would not run a valid hypothesis test, as all values in the

`stats`

array would be exactly the sameThis would not run a valid hypothesis test, even though there would be several different values in the

`stats`

arrayThis would not run a valid hypothesis test, as it would incorporate information about Gentoo penguins

**Answer:** This would not run a valid hypothesis test,
even though there would be several different values in the
`stats`

array (Option 3)

Recall, `DataFrame.sample(n, replace = True)`

returns a
new DataFrame by randomly sampling `n`

rows from the
DataFrame, with replacement. Since we are sampling with replacement, we
will have a DataFrame which produces a `stats`

array with
some different values. However, recall, the key idea behind a
permutation test is to shuffle the group labels. So, the above code does
not meet this key requirement since we only want to shuffle the
`"species"`

column without changing the size of the two
species. However, the code may change the size of the two species.

The average score on this problem was 66%.

For your convenience, we copy the code for the hypothesis test below.

```
= np.array([])
stats = 500
num_reps for i in np.arange(num_reps):
# --- line (a) starts ---
= np.random.permutation(adelie_chinstrap.get('species'))
shuffled # --- line (a) ends ---
# --- line (b) starts ---
= adelie_chinstrap.assign(species=shuffled)
with_shuffled # --- line (b) ends ---
= with_shuffled.groupby('species').mean()
grouped
# --- line (c) starts ---
= grouped.get('mass').iloc[0] - grouped.get('mass').iloc[1]
stat # --- line (c) ends ---
= np.append(stats, stat) stats
```

What would happen if we replaced `line (a)`

with

```
= adelie_chinstrap.assign(
with_shuffled =np.random.permutation(adelie_chinstrap.get('species')
species )
```

and replaced line (b) with

```
= with_shuffled.assign(
with_shuffled =np.random.permutation(adelie_chinstrap.get('mass')
mass )
```

Select the best answer.

This would still run a valid hypothesis test

This would not run a valid hypothesis test, as all values in the

`stats`

array would be exactly the sameThis would not run a valid hypothesis test, even though there would be several different values in the

`stats`

arrayThis would not run a valid hypothesis test, as it would incorporate information about Gentoo penguins

**Answer:** This would still run a valid hypothesis test
(Option 1)

Our goal for the permutation test is to randomly assign birth weights
to groups, without changing group sizes. The above code shuffles
`'species'`

and `'mass'`

columns and assigns them
back to the DataFrame. This fulfills our goal.

The average score on this problem was 81%.

Suppose we run the code for the hypothesis test and see the following empirical distribution for the test statistic. In red is the observed statistic.

Suppose our alternative hypothesis is that Chinstrap penguins weigh more on average than Adelie penguins. Which of the following is closest to the p-value for our hypothesis test?

0

\frac{1}{4}

\frac{1}{3}

\frac{2}{3}

\frac{3}{4}

1

**Answer:** \frac{1}{3}

Recall, the p-value is the chance, under the null hypothesis, that the test statistic is equal to the value that was observed in the data or is even further in the direction of the alternative. Thus, we compute the proportion of the test statistic that is equal or less than the observed statistic. (It is less than because less than corresponds to the alternative hypothesis “Chinstrap penguins weigh more on average than Adelie penguins”. Recall, when computing the statistic, we use Adelie’s mean mass minus Chinstrap’s mean mass. If Chinstrap’s mean mass is larger, the statistic will be negative, the direction of less than the observed statistic).

Thus, we look at the proportion of area less than or on the red line (which represents observed statistic), it is around \frac{1}{3}.

The average score on this problem was 80%.