
The problems in this worksheet are taken from past exams. Work on
them **on paper**, since the exams you take in this course
will also be on paper.

We encourage you to complete this worksheet in a live discussion section. Solutions will be made available after all discussion sections have concluded. You don’t need to submit your answers anywhere.

**Note: We do not plan to cover all problems here in the live discussion section**; the problems we don’t cover can be used for extra practice.

For the next several problems, we will use data from the 2021 Women’s National Basketball Association (WNBA) season. In basketball, players score points by shooting the ball into a hoop. The team that scores the most points wins the game.

Kelsey Plum, a WNBA player, attended La Jolla Country Day School, which is adjacent to UCSD’s campus. Her current team is the Las Vegas Aces (three-letter code `'LVA'`). **In 2021, the Las Vegas Aces played 31 games, and Kelsey Plum played in all 31.**

The DataFrame `plum` contains her stats for all games the Las Vegas Aces played in 2021. The first few rows of `plum` are shown below (though the full DataFrame has 31 rows, not 5):

Each row in `plum` corresponds to a single game. For each game, we have:

- `'Date'` (`str`), the date on which the game was played
- `'Opp'` (`str`), the three-letter code of the opponent team
- `'Home'` (`bool`), `True` if the game was played in Las Vegas (“home”) and `False` if it was played at the opponent’s arena (“away”)
- `'Won'` (`bool`), `True` if the Las Vegas Aces won the game and `False` if they lost
- `'PTS'` (`int`), the number of points Kelsey Plum scored in the game
- `'AST'` (`int`), the number of assists (passes) Kelsey Plum made in the game
- `'TOV'` (`int`), the number of turnovers Kelsey Plum made in the game (a turnover is when you lose the ball – turnovers are bad!)

Consider the definition of the function `diff_in_group_means`:

```
def diff_in_group_means(df, group_col, num_col):
    s = df.groupby(group_col).mean().get(num_col)
    return s.loc[False] - s.loc[True]
```
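To see what `diff_in_group_means` computes, here is a toy example with a made-up DataFrame (using pandas, of which babypandas is a subset). Notice that the function subtracts the `True` group’s mean from the `False` group’s mean:

```python
import pandas as pd

def diff_in_group_means(df, group_col, num_col):
    # Mean of num_col within each group, indexed by the group labels.
    s = df.groupby(group_col).mean().get(num_col)
    # False group's mean minus True group's mean.
    return s.loc[False] - s.loc[True]

# Made-up data: two wins (5 and 7 assists) and two losses (2 and 4 assists).
toy = pd.DataFrame({'Won': [True, True, False, False],
                    'AST': [5, 7, 2, 4]})

# Losses average 3 assists, wins average 6, so this returns 3 - 6 = -3.0.
diff_in_group_means(toy, 'Won', 'AST')
```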

It turns out that Kelsey Plum averages 0.61 more assists in games that she wins than in games that she loses. After observing that Kelsey Plum averages more assists in winning games than in losing games, we become interested in conducting a permutation test for the following hypotheses:

**Null Hypothesis:** The number of assists Kelsey Plum makes in winning games and in losing games come from the same distribution.

**Alternative Hypothesis:** The number of assists Kelsey Plum makes in winning games is higher on average than the number of assists that she makes in losing games.

To conduct our permutation test, we place the following code in a `for`-loop.

```
won = plum.get('Won')
ast = plum.get('AST')
shuffled = plum.assign(Won_shuffled=np.random.permutation(won)) \
               .assign(AST_shuffled=np.random.permutation(ast))
```
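For reference, `np.random.permutation` returns a shuffled copy of its input without modifying the original (illustrated here with a made-up array):

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])
shuffled_arr = np.random.permutation(arr)
# shuffled_arr contains the same five values, in a random order;
# arr itself is left unchanged.
```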

Which of the following options **does not** compute a
valid simulated test statistic for this permutation test?

`diff_in_group_means(shuffled, 'Won', 'AST')`

`diff_in_group_means(shuffled, 'Won', 'AST_shuffled')`

`diff_in_group_means(shuffled, 'Won_shuffled', 'AST')`

`diff_in_group_means(shuffled, 'Won_shuffled', 'AST_shuffled')`

More than one of these options does not compute a valid simulated test statistic for this permutation test

Suppose we generate 10,000 simulated test statistics, using one of the valid options from Question 1.1. The empirical distribution of test statistics, with a red line at `observed_diff`, is shown below.

Roughly one-quarter of the area of the histogram above is to the left of the red line. What is the correct interpretation of this result?

There is roughly a one-quarter probability that Kelsey Plum’s number of assists in winning games and in losing games come from the same distribution.

The significance level of this hypothesis test is roughly a quarter.

Under the assumption that Kelsey Plum’s number of assists in winning games and in losing games come from the same distribution, and that she wins 22 of the 31 games she plays, the chance of her averaging **at least** 0.61 more assists in wins than losses is roughly a quarter.

Under the assumption that Kelsey Plum’s number of assists in winning games and in losing games come from the same distribution, and that she wins 22 of the 31 games she plays, the chance of her averaging 0.61 more assists in wins than losses is roughly a quarter.

Let’s suppose there are 4 different types of shots a basketball player can take – layups, midrange shots, threes, and free throws.

The DataFrame `breakdown` has 4 rows and 50 columns – one row for each of the 4 shot types mentioned above, and one column for each of 50 different players. Each column of `breakdown` describes the distribution of shot types for a single player.

The first few columns of `breakdown` are shown below.

For instance, 30% of Kelsey Plum’s shots are layups, 30% of her shots are midrange shots, 20% of her shots are threes, and 20% of her shots are free throws.

Below, we’ve drawn an overlaid bar chart showing the shot distributions of Kelsey Plum and Chiney Ogwumike, a player on the Los Angeles Sparks.

What is the **total variation distance** (TVD) between
Kelsey Plum’s shot distribution and Chiney Ogwumike’s shot distribution?
Give your answer as a **proportion** between 0 and 1 (not a
percentage) rounded to three decimal places.

Recall, `breakdown` has information for 50 different players. We want to find the player whose shot distribution is **most similar to Kelsey Plum’s**, i.e. the one with the lowest TVD from Kelsey Plum’s shot distribution.

Fill in the blanks below so that `most_sim_player` evaluates to the name of the player with the most similar shot distribution to Kelsey Plum. Assume that the column named `'Kelsey Plum'` is the first column in `breakdown` (and again that `breakdown` has 50 columns total).

```
most_sim_player = ''
lowest_tvd_so_far = __(a)__
other_players = np.array(breakdown.columns).take(__(b)__)
for player in other_players:
    player_tvd = tvd(breakdown.get('Kelsey Plum'),
                     breakdown.get(player))
    if player_tvd < lowest_tvd_so_far:
        lowest_tvd_so_far = player_tvd
        __(c)__
```
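The code above calls a helper `tvd`, which is not defined in this worksheet. One way such a helper could be written (the example distributions below are made up):

```python
import numpy as np

def tvd(dist1, dist2):
    # Total variation distance between two categorical distributions:
    # half the sum of the absolute differences in proportions.
    return np.abs(np.asarray(dist1) - np.asarray(dist2)).sum() / 2

# Two made-up shot distributions over 4 categories; their TVD is 0.2.
tvd(np.array([0.3, 0.3, 0.2, 0.2]), np.array([0.5, 0.3, 0.1, 0.1]))
```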

What goes in blank (a)?

-1

-0.5

0

0.5

1

`np.array([])`

`''`

What goes in blank (b)?

What goes in blank (c)?

Let’s again consider the shot distributions of Kelsey Plum and Chiney Ogwumike.

We define the **maximum squared distance** (MSD) between
two categorical distributions as the **largest squared difference
between the proportions of any category**.
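This definition translates directly into code. The helper below is hypothetical (no `msd` function appears in this worksheet), and the example distributions are made up:

```python
import numpy as np

def msd(dist1, dist2):
    # Maximum squared distance: the largest squared difference
    # between the proportions of any single category.
    return ((np.asarray(dist1) - np.asarray(dist2)) ** 2).max()

# The largest difference in any category below is 0.2, so the MSD ≈ 0.04.
msd(np.array([0.3, 0.3, 0.2, 0.2]), np.array([0.5, 0.3, 0.1, 0.1]))
```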

What is the MSD between Kelsey Plum’s shot distribution and Chiney
Ogwumike’s shot distribution? Give your answer as a
**proportion** between 0 and 1 (not a percentage) rounded
to three decimal places.

For your convenience, we show the first few columns of `breakdown` again below.

In basketball:

- layups are worth 2 points,
- midrange shots are worth 2 points,
- threes are worth 3 points, and
- free throws are worth 1 point.

Suppose that Kelsey Plum is guaranteed to shoot exactly 10 shots a game. The type of each shot is drawn from the `'Kelsey Plum'` column of `breakdown` (meaning that, for example, there is a 30% chance each shot is a layup).

Fill in the blanks below to complete the definition of the function `simulate_points`, which simulates the number of points Kelsey Plum scores in a single game. (`simulate_points` should return a single number.)

```
def simulate_points():
    shots = np.random.multinomial(__(a)__, breakdown.get('Kelsey Plum'))
    possible_points = np.array([2, 2, 3, 1])
    return __(b)__
```
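As a reminder of the building block used above: `np.random.multinomial(n, p)` returns an array of category counts summing to `n`. The example below uses a made-up 5-draw experiment, not the answer to the blanks:

```python
import numpy as np

# 5 draws from a hypothetical distribution over 4 shot types.
counts = np.random.multinomial(5, [0.3, 0.3, 0.2, 0.2])
# counts has one entry per shot type, and its entries sum to 5,
# e.g. array([2, 1, 1, 1]).
```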

- What goes in blank (a)?
- What goes in blank (b)?

IKEA is a Swedish furniture company that designs and sells ready-to-assemble furniture and other home furnishings.

An IKEA fan created an app where people can log the amount of time it took them to assemble their IKEA furniture. The DataFrame `app_data` has a row for each product build that was logged on the app. The columns are:

- `'product'` (`str`): the name of the product, which includes the product line as the first word, followed by a description of the product
- `'category'` (`str`): a categorical description of the type of product
- `'assembly_time'` (`str`): the amount of time to assemble the product, formatted as `'x hr, y min'` where `x` and `y` represent integers, possibly zero
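A string in this `'x hr, y min'` format can be converted to a number of minutes with a small helper. This is a hypothetical sketch (no such function appears in the worksheet):

```python
def to_minutes(assembly_time):
    # Parse a string like '1 hr, 30 min' into a total number of minutes.
    hr_part, min_part = assembly_time.split(', ')
    hours = int(hr_part.split(' ')[0])
    minutes = int(min_part.split(' ')[0])
    return 60 * hours + minutes

to_minutes('1 hr, 30 min')   # 90
to_minutes('0 hr, 45 min')   # 45
```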

The first few rows of `app_data` are shown below, though `app_data` has many more rows than pictured (5000 rows total).

Assume that we have already run `import babypandas as bpd` and `import numpy as np`.

You are browsing the IKEA showroom, deciding whether to purchase the BILLY bookcase or the LOMMARP bookcase. You are concerned about the amount of time it will take to assemble your new bookcase, so you look up the assembly times reported in `app_data`. Thinking of the data in `app_data` as a random sample of all IKEA purchases, you want to perform a permutation test to test the following hypotheses.

**Null Hypothesis**: The assembly time for the BILLY
bookcase and the assembly time for the LOMMARP bookcase come from the
same distribution.

**Alternative Hypothesis**: The assembly time for the
BILLY bookcase and the assembly time for the LOMMARP bookcase come from
different distributions.

Suppose we added a column to `app_data` called `'minutes'`, containing the `'assembly_time'` value for each entry converted to an integer amount of minutes. Then, we query `app_data` to keep only the BILLY bookcases, then average the `'minutes'` column. In addition, we separately query `app_data` to keep only the LOMMARP bookcases, then average the `'minutes'` column. If the null hypothesis is true, which of the following statements about these two averages is correct?

These two averages are the same.

Any difference between these two averages is due to random chance.

Any difference between these two averages cannot be ascribed to random chance alone.

The difference between these averages is statistically significant.

For the permutation test, we’ll use as our test statistic the average assembly time for BILLY bookcases minus the average assembly time for LOMMARP bookcases, in minutes.

Complete the code below to generate one simulated value of the test statistic in a new way, without using `np.random.permutation`.

```
billy = (app_data.get('product') ==
         'BILLY Bookcase, white, 31 1/2x11x79 1/2')
lommarp = (app_data.get('product') ==
           'LOMMARP Bookcase, dark blue-green, 25 5/8x78 3/8')
billy_lommarp = app_data[billy | lommarp]
billy_mean = np.random.choice(billy_lommarp.get('minutes'), billy.sum()).mean()
lommarp_mean = _________
billy_mean - lommarp_mean
```

What goes in the blank?

`billy_lommarp[lommarp].get('minutes').mean()`

`np.random.choice(billy_lommarp.get('minutes'), lommarp.sum()).mean()`

`billy_lommarp.get('minutes').mean() - billy_mean`

`(billy_lommarp.get('minutes').sum() - billy_mean * billy.sum())/lommarp.sum()`

Researchers from the San Diego Zoo, located within Balboa Park, collected physical measurements of three species of penguins (Adelie, Chinstrap, or Gentoo) in a region of Antarctica. One piece of information they tracked for each of 330 penguins was its mass in grams. The average penguin mass is 4200 grams, and the standard deviation is 840 grams.

We’re interested in investigating the differences between the masses of Adelie penguins and Chinstrap penguins. Specifically, our null hypothesis is that their masses are drawn from the same population distribution, and any observed differences are due to chance only.

Below, we have a snippet of working code for this hypothesis test, for a specific test statistic. Assume that `adelie_chinstrap` is a DataFrame of only Adelie and Chinstrap penguins, with just two columns – `'species'` and `'mass'`.

```
stats = np.array([])
num_reps = 500
for i in np.arange(num_reps):
    # --- line (a) starts ---
    shuffled = np.random.permutation(adelie_chinstrap.get('species'))
    # --- line (a) ends ---
    # --- line (b) starts ---
    with_shuffled = adelie_chinstrap.assign(species=shuffled)
    # --- line (b) ends ---
    grouped = with_shuffled.groupby('species').mean()
    # --- line (c) starts ---
    stat = grouped.get('mass').iloc[0] - grouped.get('mass').iloc[1]
    # --- line (c) ends ---
    stats = np.append(stats, stat)
```

Which of the following statements best describes the procedure above?

This is a standard hypothesis test, and our test statistic is the total variation distance between the distribution of Adelie masses and Chinstrap masses

This is a standard hypothesis test, and our test statistic is the difference between the expected proportion of Adelie penguins and the proportion of Adelie penguins in our resample

This is a permutation test, and our test statistic is the total variation distance between the distribution of Adelie masses and Chinstrap masses

This is a permutation test, and our test statistic is the difference in the mean Adelie mass and mean Chinstrap mass

Currently, line (c) (marked with a comment) uses `.iloc`. Which of the following options compute the exact same statistic as line (c) currently does?

Option 1:

`stat = grouped.get('mass').loc['Adelie'] - grouped.get('mass').loc['Chinstrap']`

Option 2:

`stat = grouped.get('mass').loc['Chinstrap'] - grouped.get('mass').loc['Adelie']`

Option 1 only

Option 2 only

Both options

Neither option

Is it possible to re-write `line (c)` in a way that uses `.iloc[0]` twice, without any other uses of `.loc` or `.iloc`?

Yes, it’s possible

No, it’s not possible

For your convenience, we copy the code for the hypothesis test below.

```
stats = np.array([])
num_reps = 500
for i in np.arange(num_reps):
    # --- line (a) starts ---
    shuffled = np.random.permutation(adelie_chinstrap.get('species'))
    # --- line (a) ends ---
    # --- line (b) starts ---
    with_shuffled = adelie_chinstrap.assign(species=shuffled)
    # --- line (b) ends ---
    grouped = with_shuffled.groupby('species').mean()
    # --- line (c) starts ---
    stat = grouped.get('mass').iloc[0] - grouped.get('mass').iloc[1]
    # --- line (c) ends ---
    stats = np.append(stats, stat)
```

What would happen if we removed `line (a)`, and replaced `line (b)` with

`with_shuffled = adelie_chinstrap.sample(adelie_chinstrap.shape[0], replace=False)`

Select the best answer.

This would still run a valid hypothesis test

This would not run a valid hypothesis test, as all values in the `stats` array would be exactly the same

This would not run a valid hypothesis test, even though there would be several different values in the `stats` array

This would not run a valid hypothesis test, as it would incorporate information about Gentoo penguins

For your convenience, we copy the code for the hypothesis test below.

```
stats = np.array([])
num_reps = 500
for i in np.arange(num_reps):
    # --- line (a) starts ---
    shuffled = np.random.permutation(adelie_chinstrap.get('species'))
    # --- line (a) ends ---
    # --- line (b) starts ---
    with_shuffled = adelie_chinstrap.assign(species=shuffled)
    # --- line (b) ends ---
    grouped = with_shuffled.groupby('species').mean()
    # --- line (c) starts ---
    stat = grouped.get('mass').iloc[0] - grouped.get('mass').iloc[1]
    # --- line (c) ends ---
    stats = np.append(stats, stat)
```

What would happen if we removed `line (a)`, and replaced `line (b)` with

`with_shuffled = adelie_chinstrap.sample(adelie_chinstrap.shape[0], replace=True)`

Select the best answer.

This would still run a valid hypothesis test

This would not run a valid hypothesis test, as all values in the `stats` array would be exactly the same

This would not run a valid hypothesis test, even though there would be several different values in the `stats` array

This would not run a valid hypothesis test, as it would incorporate information about Gentoo penguins

For your convenience, we copy the code for the hypothesis test below.

```
stats = np.array([])
num_reps = 500
for i in np.arange(num_reps):
    # --- line (a) starts ---
    shuffled = np.random.permutation(adelie_chinstrap.get('species'))
    # --- line (a) ends ---
    # --- line (b) starts ---
    with_shuffled = adelie_chinstrap.assign(species=shuffled)
    # --- line (b) ends ---
    grouped = with_shuffled.groupby('species').mean()
    # --- line (c) starts ---
    stat = grouped.get('mass').iloc[0] - grouped.get('mass').iloc[1]
    # --- line (c) ends ---
    stats = np.append(stats, stat)
```

What would happen if we replaced `line (a)` with

```
with_shuffled = adelie_chinstrap.assign(
    species=np.random.permutation(adelie_chinstrap.get('species'))
)
```

and replaced `line (b)` with

```
with_shuffled = with_shuffled.assign(
    mass=np.random.permutation(adelie_chinstrap.get('mass'))
)
```

Select the best answer.

This would still run a valid hypothesis test

This would not run a valid hypothesis test, as all values in the `stats` array would be exactly the same

This would not run a valid hypothesis test, even though there would be several different values in the `stats` array

This would not run a valid hypothesis test, as it would incorporate information about Gentoo penguins

Suppose we run the code for the hypothesis test and see the following empirical distribution for the test statistic. In red is the observed statistic.

Suppose our alternative hypothesis is that Chinstrap penguins weigh more on average than Adelie penguins. Which of the following is closest to the p-value for our hypothesis test?

0

$\frac{1}{4}$

$\frac{1}{3}$

$\frac{2}{3}$

$\frac{3}{4}$

1