
The problems in this worksheet are taken from past exams. Work on
them **on paper**, since the exams you take in this course
will also be on paper.

We encourage you to complete this
worksheet in a live discussion section. Solutions will be made available
after all discussion sections have concluded. You don’t need to submit
your answers anywhere. **Note: We do not plan to cover all
problems here in the live discussion section**; the problems we don’t
cover can be used for extra practice.

Researchers from the San Diego Zoo, located within Balboa Park, collected physical measurements of three species of penguins (Adelie, Chinstrap, or Gentoo) in a region of Antarctica. One piece of information they tracked for each of 330 penguins was its mass in grams. The average penguin mass is 4200 grams, and the standard deviation is 840 grams.

We’re interested in investigating the differences between the masses of Adelie penguins and Chinstrap penguins. Specifically, our null hypothesis is that their masses are drawn from the same population distribution, and any observed differences are due to chance only.

Below, we have a snippet of working code for this hypothesis test, for a specific test statistic. Assume that `adelie_chinstrap` is a DataFrame of only Adelie and Chinstrap penguins, with just two columns – `'species'` and `'mass'`.

```
stats = np.array([])
num_reps = 500
for i in np.arange(num_reps):
    # --- line (a) starts ---
    shuffled = np.random.permutation(adelie_chinstrap.get('species'))
    # --- line (a) ends ---
    # --- line (b) starts ---
    with_shuffled = adelie_chinstrap.assign(species=shuffled)
    # --- line (b) ends ---
    grouped = with_shuffled.groupby('species').mean()
    # --- line (c) starts ---
    stat = grouped.get('mass').iloc[0] - grouped.get('mass').iloc[1]
    # --- line (c) ends ---
    stats = np.append(stats, stat)
```

Which of the following statements best describes the procedure above?

This is a standard hypothesis test, and our test statistic is the total variation distance between the distribution of Adelie masses and Chinstrap masses

This is a standard hypothesis test, and our test statistic is the difference between the expected proportion of Adelie penguins and the proportion of Adelie penguins in our resample

This is a permutation test, and our test statistic is the total variation distance between the distribution of Adelie masses and Chinstrap masses

This is a permutation test, and our test statistic is the difference in the mean Adelie mass and mean Chinstrap mass

**Answer:** This is a permutation test, and our test
statistic is the difference in the mean Adelie mass and mean Chinstrap
mass (Option 4)

Recall, a permutation test helps us decide whether two random samples come from the same distribution. This test matches our goal of testing whether the masses of Adelie penguins and Chinstrap penguins are drawn from the same population distribution. The code above also carries out the steps of a permutation test. In part (a), it shuffles `'species'` and stores the shuffled series in `shuffled`. In part (b), it assigns the shuffled series of values to the `'species'` column. Then, it uses `grouped = with_shuffled.groupby('species').mean()` to calculate the mean of each species. In part (c), it computes the difference between the mean masses of the two species by first getting the `'mass'` column and then accessing the mean mass of each group (Adelie and Chinstrap) with positional indices `0` and `1`.

The average score on this problem was 98%.
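To make the mechanics concrete, here is a minimal sketch of one iteration of this shuffle, using pandas as a stand-in for babypandas and a hypothetical four-row version of `adelie_chinstrap` with made-up masses:

```python
import numpy as np
import pandas as pd  # stands in for babypandas here

# Hypothetical mini-version of adelie_chinstrap with made-up masses.
adelie_chinstrap = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Chinstrap', 'Chinstrap'],
    'mass': [3700, 3800, 3500, 3900],
})

# One iteration of the permutation test: shuffle the labels,
# then compare the group means under the shuffled assignment.
shuffled = np.random.permutation(adelie_chinstrap.get('species'))
with_shuffled = adelie_chinstrap.assign(species=shuffled)
grouped = with_shuffled.groupby('species').mean()
stat = grouped.get('mass').iloc[0] - grouped.get('mass').iloc[1]

# The shuffle never changes the group sizes, only the label assignment.
print((shuffled == 'Adelie').sum())  # always 2
```

Note that permuting the labels preserves how many rows carry each label, which is exactly the property the test relies on.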

Currently, line (c) (marked with a comment) uses `.iloc`. Which of the following options compute the exact same statistic as line (c) currently does?

Option 1:

`stat = grouped.get('mass').loc['Adelie'] - grouped.get('mass').loc['Chinstrap']`

Option 2:

`stat = grouped.get('mass').loc['Chinstrap'] - grouped.get('mass').loc['Adelie']`

Option 1 only

Option 2 only

Both options

Neither option

**Answer:** Option 1 only

We use `df.get(column_name).iloc[positional_index]` to access the value in a column at `positional_index`. Similarly, we use `df.get(column_name).loc[index]` to access the value in a column with index label `index`. Remember that `grouped` is a DataFrame produced by `groupby('species')`, so its index consists of the species names `'Adelie'` and `'Chinstrap'`.

Option 2 is incorrect since it does the subtraction in the reverse order, which results in a different `stat` compared to line (c). Its output will be -1 \cdot `stat`. Recall, in `grouped = with_shuffled.groupby('species').mean()`, we use `groupby()`, and since `'species'` is a column with string values, the index will be sorted in alphabetical order. So, `.iloc[0]` is `'Adelie'` and `.iloc[1]` is `'Chinstrap'`.

The average score on this problem was 81%.
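To see the label/position correspondence concretely, here is a small sketch (pandas standing in for babypandas, with made-up mean masses) showing that `.iloc` by position and `.loc` by label retrieve the same values after grouping:

```python
import pandas as pd  # stands in for babypandas

# Hypothetical grouped result: groupby sorts the string index alphabetically.
grouped = pd.DataFrame(
    {'mass': [3750.0, 3700.0]},
    index=pd.Index(['Adelie', 'Chinstrap'], name='species'),
)

by_position = grouped.get('mass').iloc[0] - grouped.get('mass').iloc[1]
by_label = grouped.get('mass').loc['Adelie'] - grouped.get('mass').loc['Chinstrap']
print(by_position == by_label)  # True
```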

Is it possible to re-write line (c) in a way that uses `.iloc[0]` twice, without any other uses of `.loc` or `.iloc`?

Yes, it’s possible

No, it’s not possible

**Answer:** Yes, it’s possible

There are multiple ways to achieve this. For instance, `stat = grouped.get('mass').iloc[0] - grouped.sort_index(ascending=False).get('mass').iloc[0]`.

The average score on this problem was 64%.
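A quick sketch (pandas standing in for babypandas, made-up mean masses) verifying that the `sort_index` rewrite computes the same statistic:

```python
import pandas as pd  # stands in for babypandas

# Hypothetical grouped result with the alphabetically sorted index
# that groupby('species') produces.
grouped = pd.DataFrame(
    {'mass': [3750.0, 3700.0]},
    index=pd.Index(['Adelie', 'Chinstrap'], name='species'),
)

original = grouped.get('mass').iloc[0] - grouped.get('mass').iloc[1]
# Reversing the index order makes 'Chinstrap' the first row, so
# .iloc[0] on the reversed DataFrame grabs the Chinstrap mean.
rewritten = (grouped.get('mass').iloc[0]
             - grouped.sort_index(ascending=False).get('mass').iloc[0])
print(original == rewritten)  # True
```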

For your convenience, we copy the code for the hypothesis test below.

```
stats = np.array([])
num_reps = 500
for i in np.arange(num_reps):
    # --- line (a) starts ---
    shuffled = np.random.permutation(adelie_chinstrap.get('species'))
    # --- line (a) ends ---
    # --- line (b) starts ---
    with_shuffled = adelie_chinstrap.assign(species=shuffled)
    # --- line (b) ends ---
    grouped = with_shuffled.groupby('species').mean()
    # --- line (c) starts ---
    stat = grouped.get('mass').iloc[0] - grouped.get('mass').iloc[1]
    # --- line (c) ends ---
    stats = np.append(stats, stat)
```

What would happen if we removed line (a), and replaced line (b) with

`with_shuffled = adelie_chinstrap.sample(adelie_chinstrap.shape[0], replace=False)`

Select the best answer.

This would still run a valid hypothesis test

This would not run a valid hypothesis test, as all values in the `stats` array would be exactly the same

This would not run a valid hypothesis test, even though there would be several different values in the `stats` array

This would not run a valid hypothesis test, as it would incorporate information about Gentoo penguins

**Answer:** This would not run a valid hypothesis test, as all values in the `stats` array would be exactly the same (Option 2)

Recall, `DataFrame.sample(n, replace=False)` (or `DataFrame.sample(n)`, since `replace=False` is the default) returns a DataFrame formed by randomly sampling `n` rows from the DataFrame, without replacement. Since our `n` is `adelie_chinstrap.shape[0]` and we are sampling without replacement, we will get back exactly the same DataFrame every time (the order of the rows may differ, but the group means do not depend on row order). As a result, all values in the `stats` array would be exactly the same.

The average score on this problem was 87%.
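A minimal sketch of why this fails, using pandas as a stand-in for babypandas and a hypothetical four-row `adelie_chinstrap`: sampling all rows without replacement merely reorders them, so the statistic never changes:

```python
import numpy as np
import pandas as pd  # stands in for babypandas

# Hypothetical mini-version of adelie_chinstrap.
adelie_chinstrap = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Chinstrap', 'Chinstrap'],
    'mass': [3700, 3800, 3500, 3900],
})

def difference_in_means(df):
    grouped = df.groupby('species').mean()
    return grouped.get('mass').iloc[0] - grouped.get('mass').iloc[1]

# Sampling all rows without replacement just reorders the rows,
# so the statistic is identical in every repetition.
stats = np.array([])
for i in np.arange(5):
    with_shuffled = adelie_chinstrap.sample(
        adelie_chinstrap.shape[0], replace=False)
    stats = np.append(stats, difference_in_means(with_shuffled))

print(np.unique(stats))  # a single repeated value
```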

For your convenience, we copy the code for the hypothesis test below.

```
stats = np.array([])
num_reps = 500
for i in np.arange(num_reps):
    # --- line (a) starts ---
    shuffled = np.random.permutation(adelie_chinstrap.get('species'))
    # --- line (a) ends ---
    # --- line (b) starts ---
    with_shuffled = adelie_chinstrap.assign(species=shuffled)
    # --- line (b) ends ---
    grouped = with_shuffled.groupby('species').mean()
    # --- line (c) starts ---
    stat = grouped.get('mass').iloc[0] - grouped.get('mass').iloc[1]
    # --- line (c) ends ---
    stats = np.append(stats, stat)
```

What would happen if we removed line (a), and replaced line (b) with

`with_shuffled = adelie_chinstrap.sample(adelie_chinstrap.shape[0], replace=True)`

Select the best answer.

This would still run a valid hypothesis test

This would not run a valid hypothesis test, as all values in the `stats` array would be exactly the same

This would not run a valid hypothesis test, even though there would be several different values in the `stats` array

This would not run a valid hypothesis test, as it would incorporate information about Gentoo penguins

**Answer:** This would not run a valid hypothesis test, even though there would be several different values in the `stats` array (Option 3)

Recall, `DataFrame.sample(n, replace=True)` returns a new DataFrame by randomly sampling `n` rows from the DataFrame, with replacement. Since we are sampling with replacement, the resample will generally differ from repetition to repetition, so the `stats` array will contain several different values. However, recall that the key idea behind a permutation test is to shuffle the group labels: we only want to shuffle the `'species'` column, without changing the sizes of the two groups. Sampling with replacement can change the number of penguins of each species, so this code does not run a valid permutation test.

The average score on this problem was 66%.
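A sketch of the problem (pandas standing in for babypandas, hypothetical six-row `adelie_chinstrap`): sampling with replacement makes the group sizes themselves random, which a permutation test must never do:

```python
import numpy as np
import pandas as pd  # stands in for babypandas

np.random.seed(0)  # for a reproducible illustration

# Hypothetical mini-version of adelie_chinstrap: 3 of each species.
adelie_chinstrap = pd.DataFrame({
    'species': ['Adelie'] * 3 + ['Chinstrap'] * 3,
    'mass': [3700, 3800, 3750, 3500, 3900, 3600],
})

# Sampling with replacement can duplicate some rows and drop others,
# so the number of Adelie rows in each resample is random.
adelie_counts = np.array([])
for i in np.arange(1000):
    with_shuffled = adelie_chinstrap.sample(
        adelie_chinstrap.shape[0], replace=True)
    adelie_counts = np.append(
        adelie_counts, (with_shuffled.get('species') == 'Adelie').sum())

print(len(np.unique(adelie_counts)) > 1)  # True: group sizes vary
```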

For your convenience, we copy the code for the hypothesis test below.

```
stats = np.array([])
num_reps = 500
for i in np.arange(num_reps):
    # --- line (a) starts ---
    shuffled = np.random.permutation(adelie_chinstrap.get('species'))
    # --- line (a) ends ---
    # --- line (b) starts ---
    with_shuffled = adelie_chinstrap.assign(species=shuffled)
    # --- line (b) ends ---
    grouped = with_shuffled.groupby('species').mean()
    # --- line (c) starts ---
    stat = grouped.get('mass').iloc[0] - grouped.get('mass').iloc[1]
    # --- line (c) ends ---
    stats = np.append(stats, stat)
```

What would happen if we replaced line (a) with

```
with_shuffled = adelie_chinstrap.assign(
    species=np.random.permutation(adelie_chinstrap.get('species'))
)
```

and replaced line (b) with

```
with_shuffled = with_shuffled.assign(
    mass=np.random.permutation(adelie_chinstrap.get('mass'))
)
```

Select the best answer.

This would still run a valid hypothesis test

This would not run a valid hypothesis test, as all values in the

`stats`

array would be exactly the sameThis would not run a valid hypothesis test, even though there would be several different values in the

`stats`

arrayThis would not run a valid hypothesis test, as it would incorporate information about Gentoo penguins

**Answer:** This would still run a valid hypothesis test
(Option 1)

Our goal for the permutation test is to randomly pair masses with species labels, without changing the group sizes. The above code shuffles both the `'species'` and `'mass'` columns and assigns them back to the DataFrame, which randomizes the pairing of labels and masses while keeping the number of penguins of each species fixed. This fulfills our goal.

The average score on this problem was 81%.

Suppose we run the code for the hypothesis test and see the following empirical distribution for the test statistic. In red is the observed statistic.

Suppose our alternative hypothesis is that Chinstrap penguins weigh more on average than Adelie penguins. Which of the following is closest to the p-value for our hypothesis test?

0

\frac{1}{4}

\frac{1}{3}

\frac{2}{3}

\frac{3}{4}

1

**Answer:** \frac{1}{3}

Recall, the p-value is the chance, under the null hypothesis, that the test statistic is equal to the value that was observed in the data or is even further in the direction of the alternative. Here, we compute the proportion of simulated test statistics that are less than or equal to the observed statistic. (It is *less than* because smaller values correspond to the alternative hypothesis “Chinstrap penguins weigh more on average than Adelie penguins”: the statistic is Adelie’s mean mass minus Chinstrap’s mean mass, so if Chinstrap’s mean mass is larger, the statistic is negative.)

Thus, we look at the proportion of area at or to the left of the red line (which represents the observed statistic); it is around \frac{1}{3}.

The average score on this problem was 80%.
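Once the simulation has filled the `stats` array, this one-sided p-value is the fraction of simulated statistics at or below the observed one. A sketch with made-up numbers (the real test would use the 500 simulated statistics and the actual observed difference):

```python
import numpy as np

# Made-up simulated statistics and observed statistic, for illustration only.
stats = np.array([-300, -150, -50, 20, 80, 200])
observed_stat = -100

# One-sided p-value: proportion of simulated statistics that are
# as small as or smaller than the observed (Adelie minus Chinstrap).
p_value = np.count_nonzero(stats <= observed_stat) / len(stats)
print(p_value)  # 2 of the 6 simulated stats are <= -100
```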

We will use data from the 2021 Women’s National Basketball Association (WNBA) season for the next several problems. In basketball, players score points by shooting the ball into a hoop. The team that scores the most points wins the game.

We have access to the `season` DataFrame, which contains statistics on all players in the WNBA in the 2021 season. The first few rows of `season` are shown below.

Each row in `season` corresponds to a single player. For each player, we have:

- `'Player'` (`str`): their name
- `'Team'` (`str`): the three-letter code of the team they play on
- `'G'` (`int`): the number of games they played in the 2021 season
- `'PPG'` (`float`): the number of points they scored per game played
- `'APG'` (`float`): the number of assists (passes) they made per game played
- `'TPG'` (`float`): the number of turnovers they made per game played

Note that all of the numerical columns in `season` must contain values that are greater than or equal to 0.

Suppose we only have access to the DataFrame `small_season`, which is a random sample of **size 36** from `season`. We’re interested in learning about the true mean points per game of all players in `season` given just the information in `small_season`.

To start, we want to bootstrap `small_season` 10,000 times and compute the mean of the resample each time. We want to store these 10,000 bootstrapped means in the array `boot_means`.

Here is a broken implementation of this procedure.

```
boot_means = np.array([])
for i in np.arange(10000):
    resample = small_season.sample(season.shape[0], replace=False) # Line 1
    resample_mean = small_season.get('PPG').mean() # Line 2
    np.append(boot_means, new_mean) # Line 3
```

For each of the 3 lines of code above (marked by comments), specify what is incorrect about the line by selecting one or more of the corresponding options below. Or, select “Line _ is correct as-is” if you believe there’s nothing that needs to be changed about the line in order for the above code to run properly.

What is incorrect about Line 1? Select all that apply.

Currently the procedure samples from `small_season`, when it should be sampling from `season`

The sample size is `season.shape[0]`, when it should be `small_season.shape[0]`

Sampling is currently being done without replacement, when it should be done with replacement

Line 1 is correct as-is

**Answers:**

- The sample size is `season.shape[0]`, when it should be `small_season.shape[0]`
- Sampling is currently being done without replacement, when it should be done with replacement

Here, our goal is to bootstrap from `small_season`. When bootstrapping, we **sample with replacement** from our original sample, with a sample size that’s equal to the original sample’s size. Here, our original sample is `small_season`, so we should be taking samples of size `small_season.shape[0]` from it.

Option 1 is incorrect; `season` has nothing to do with this problem, as we are bootstrapping from `small_season`.

The average score on this problem was 95%.

What is incorrect about Line 2? Select all that apply.

Currently it is taking the mean of the `'PPG'` column in `small_season`, when it should be taking the mean of the `'PPG'` column in `season`

Currently it is taking the mean of the `'PPG'` column in `small_season`, when it should be taking the mean of the `'PPG'` column in `resample`

`.mean()` is not a valid Series method, and should be replaced with a call to the function `np.mean`

Line 2 is correct as-is

**Answer:** Currently it is taking the mean of the `'PPG'` column in `small_season`, when it should be taking the mean of the `'PPG'` column in `resample`

The current implementation of Line 2 doesn’t use `resample` at all, when it should. If we were to leave Line 2 as-is, all of the values in `boot_means` would be identical (and equal to the mean of the `'PPG'` column in `small_season`).

Option 1 is incorrect since our bootstrapping procedure is independent of `season`. Option 3 is incorrect because `.mean()` is a valid Series method.

The average score on this problem was 98%.

What is incorrect about Line 3? Select all that apply.

The result of calling `np.append` is not being reassigned to `boot_means`, so `boot_means` will be an empty array after running this procedure

The indentation level of the line is incorrect – `np.append` should be outside of the `for`-loop (and aligned with `for i`)

`new_mean` is not a defined variable name, and should be replaced with `resample_mean`

Line 3 is correct as-is

**Answers:**

- The result of calling `np.append` is not being reassigned to `boot_means`, so `boot_means` will be an empty array after running this procedure
- `new_mean` is not a defined variable name, and should be replaced with `resample_mean`

`np.append` returns a new array and does not modify the array it is called on (`boot_means`, in this case), so Option 1 is a necessary fix. Furthermore, Option 3 is a necessary fix since `new_mean` wasn’t defined anywhere.

Option 2 is incorrect; if `np.append` were outside of the `for`-loop, none of the 10,000 resampled means would be saved in `boot_means`.

The average score on this problem was 94%.
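Putting the three fixes together, a corrected version of the procedure looks like the following sketch (pandas stands in for babypandas, and `small_season` is simulated with made-up `'PPG'` values since we don’t have the real data):

```python
import numpy as np
import pandas as pd  # stands in for babypandas

# Hypothetical stand-in for small_season, with made-up 'PPG' values.
small_season = pd.DataFrame({'PPG': np.random.uniform(0, 25, size=36)})

boot_means = np.array([])
for i in np.arange(1000):  # 10,000 in the original; fewer here for speed
    # Fix 1: resample small_season, with replacement, at its own size.
    resample = small_season.sample(small_season.shape[0], replace=True)
    # Fix 2: take the mean of the resample, not of small_season.
    resample_mean = resample.get('PPG').mean()
    # Fix 3: reassign the result of np.append, using the right name.
    boot_means = np.append(boot_means, resample_mean)

print(len(boot_means))  # 1000
```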

IKEA is a Swedish furniture company that designs and sells ready-to-assemble furniture and other home furnishings.

An IKEA fan created an app where people can log the amount of time it took them to assemble their IKEA furniture. The DataFrame `app_data` has a row for each product build that was logged on the app. The columns are:

- `'product'` (`str`): the name of the product, which includes the product line as the first word, followed by a description of the product
- `'category'` (`str`): a categorical description of the type of product
- `'assembly_time'` (`str`): the amount of time to assemble the product, formatted as `'x hr, y min'` where `x` and `y` represent integers, possibly zero

The first few rows of `app_data` are shown below, though `app_data` has many more rows than pictured (5000 rows total).

Assume that we have already run `import babypandas as bpd` and `import numpy as np`.

We want to use `app_data` to estimate the average amount of time it takes to build an IKEA bed (any product in the ‘bed’ category). Which of the following strategies would be an appropriate way to estimate this quantity? Select all that apply.

Query to keep only the beds. Then resample with replacement many times. For each resample, take the mean of the `'minutes'` column. Compute a 95% confidence interval based on those means.

Query to keep only the beds. Group by `'product'` using the mean aggregation function. Then resample with replacement many times. For each resample, take the mean of the `'minutes'` column. Compute a 95% confidence interval based on those means.

Resample with replacement many times. For each resample, first query to keep only the beds and then take the mean of the `'minutes'` column. Compute a 95% confidence interval based on those means.

Resample with replacement many times. For each resample, first query to keep only the beds. Then group by `'product'` using the mean aggregation function, and finally take the mean of the `'minutes'` column. Compute a 95% confidence interval based on those means.

**Answer:**

Only the first answer is correct. This is a question of parameter estimation, so our approach is to use bootstrapping to create many resamples of our original sample, computing the average of each resample. Each resample should always be the same size as the original sample. The first answer choice accomplishes this by querying first to keep only the beds, then resampling from the DataFrame of beds only. This means resamples will have the same size as the original sample. Each resample’s mean will be computed, so we will have many resample means from which to construct our 95% confidence interval.

In the second answer choice, we are actually taking the mean twice. We first average the build times for all builds of the same product when grouping by product. This produces a DataFrame of different products with the average build time for each. We then resample from this DataFrame, computing the average of each resample. But this is a resample of products, not of product builds. The size of the resample is the number of unique products in `app_data`, not the number of reported product builds in `app_data`. Further, we get incorrect results by averaging numbers that are already averages. For example, if 5 people build bed A and it takes them each 1 hour, and 1 person builds bed B and it takes them 10 hours, the average amount of time to build a bed is \frac{5 \cdot 1 + 10}{6} = 2.5. But if we average the times for bed A (1 hour) and the time for bed B (10 hours), then average those, we get \frac{1 + 10}{2} = 5.5, which is not the same. More generally, grouping is not a part of the bootstrapping process because we want each data value to be weighted equally.

The last two answer choices are incorrect because they involve resampling from the full `app_data` DataFrame before querying to keep only the beds. This is incorrect because it does not preserve the sample size. For example, if `app_data` contains 1000 reported bed builds and 4000 other product builds, then the only relevant data is the 1000 bed build times, so when we resample, we want to consider another set of 1000 beds. If we resample from the full `app_data` DataFrame, our resample will contain 5000 rows, but the number of beds will be random, not necessarily 1000. If we query first to keep only the beds, then resample, our resample will contain exactly 1000 beds every time. As an added bonus, since we only care about beds, it’s much faster to resample from a smaller DataFrame of beds only than it is to resample from all of `app_data`, with plenty of rows we don’t care about.

The average score on this problem was 71%.
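The arithmetic in the bed example above is easy to verify directly (the build times are the made-up ones from the explanation):

```python
import numpy as np

# Made-up build times: 5 builds of bed A at 1 hour each,
# and 1 build of bed B at 10 hours.
bed_a_times = np.array([1, 1, 1, 1, 1])
bed_b_times = np.array([10])

# Correct: average over all 6 reported builds.
all_times = np.append(bed_a_times, bed_b_times)
print(all_times.mean())  # 2.5

# Incorrect: averaging the per-product averages weights bed B's single
# build as much as all 5 of bed A's builds combined.
print(np.mean([bed_a_times.mean(), bed_b_times.mean()]))  # 5.5
```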

**True or False**: Suppose that from a sample, you
compute a 95% bootstrapped confidence interval for a population
parameter to be the interval [L, R]. Then the average of L and R is the
mean of the original sample.

**Answer:** False

A 95% confidence interval indicates that we are 95% confident the true population parameter falls within the interval [L, R]. Note that the problem specifies that the confidence interval is bootstrapped: it is created by re-sampling the data with replacement over and over again (as noted on the reference sheet) and taking percentiles of the resulting distribution of resample means. Because the bootstrapped distribution is random and generally not perfectly symmetric, the endpoints L and R are not determined directly by the original sample mean, so the average of L and R will typically be close to, but not exactly equal to, the mean of the original sample.

The average score on this problem was 87%.

Suppose Tiffany has a random sample of dogs. Select the most appropriate technique to answer each of the following questions using Tiffany’s dog sample.

Do small dogs typically live longer than medium and large dogs?

Standard hypothesis test

Permutation test

Bootstrapping

**Answer:** Option 2: Permutation test.

We are comparing one variable (life expectancy) across groups (small dogs versus medium and large dogs). If there were no real difference between the life expectancies of the different dog sizes, randomly reassigning the sampled life expectancies to the dogs should produce statistics similar to the observed one, so using a permutation test to compare the groups makes the most sense. We’re not trying to estimate a specific value, so bootstrapping isn’t a good idea here. Also, there’s no good way to randomly generate life expectancies under the null, so a standard hypothesis test is not a good idea here.

The average score on this problem was 77%.

Does Tiffany’s sample have an even distribution of dog kinds?

Standard hypothesis test

Permutation test

Bootstrapping

**Answer:** Option 1: Standard hypothesis test.

We’re not comparing a variable between two groups, but rather looking at the overall distribution of dog kinds, so a permutation test wouldn’t work well here. Again, we’re not trying to estimate anything, so bootstrapping isn’t a good idea. This leaves us with the standard hypothesis test, which makes sense if we use the total variation distance as our test statistic.

The average score on this problem was 51%.

What’s the median weight for herding dogs?

Standard hypothesis test

Permutation test

Bootstrapping

**Answer:** Option 3: Bootstrapping

Here we’re trying to determine a specific value, which immediately leads us to bootstrapping. The other two tests wouldn’t really make sense in this context.

The average score on this problem was 83%.

Do dogs live longer than 12 years on average?

Standard hypothesis test

Permutation test

Bootstrapping

**Answer:** Option 3: Bootstrapping

While the wording here might throw us off, we’re really just trying to determine the average life expectancy of dogs, and then see how that compares to 12. This leads us to bootstrapping since we’re trying to determine a specific value. The other two tests wouldn’t really make sense in this context.

The average score on this problem was 43%.