The problems in this worksheet are taken from past exams. Work on
them **on paper**, since the exams you take in this course
will also be on paper.

We encourage you to complete this
worksheet in a live discussion section. Solutions will be made available
after all discussion sections have concluded. You don’t need to submit
your answers anywhere. **Note: We do not plan to cover all
problems here in the live discussion section**; the problems we don’t
cover can be used for extra practice.

Given below is the `season` DataFrame, which contains statistics on all players in the WNBA in the 2021 season. The first few rows of `season` are shown below:

Each row in `season` corresponds to a single player. In this problem, we’ll be looking at the `'PPG'` column, which records the number of points scored per game played.

Now, suppose we only have access to the DataFrame `small_season`, which is a random sample of **size 36** from `season`. We’re interested in learning about the true mean points per game of all players in `season` given just the information in `small_season`.

To start, we want to bootstrap `small_season` 10,000 times and compute the mean of the resample each time. We want to store these 10,000 bootstrapped means in the array `boot_means`.

Here is a broken implementation of this procedure.

```
boot_means = np.array([])
for i in np.arange(10000):
    resample = small_season.sample(season.shape[0], replace=False)  # Line 1
    resample_mean = small_season.get('PPG').mean()                  # Line 2
    np.append(boot_means, new_mean)                                 # Line 3
```

For each of the 3 lines of code above (marked by comments), specify what is incorrect about the line by selecting one or more of the corresponding options below. Or, select “Line _ is correct as-is” if you believe there’s nothing that needs to be changed about the line in order for the above code to run properly.

What is incorrect about Line 1? Select all that apply.

- Currently the procedure samples from `small_season`, when it should be sampling from `season`
- The sample size is `season.shape[0]`, when it should be `small_season.shape[0]`
- Sampling is currently being done without replacement, when it should be done with replacement
- Line 1 is correct as-is

**Answers:**

- The sample size is `season.shape[0]`, when it should be `small_season.shape[0]`
- Sampling is currently being done without replacement, when it should be done with replacement

Here, our goal is to bootstrap from `small_season`. When bootstrapping, we **sample with replacement** from our original sample, with a sample size that’s equal to the original sample’s size. Here, our original sample is `small_season`, so we should be taking samples of size `small_season.shape[0]` from it.

Option 1 is incorrect; `season` has nothing to do with this problem, as we are bootstrapping from `small_season`.

The average score on this problem was 95%.

What is incorrect about Line 2? Select all that apply.

- Currently it is taking the mean of the `'PPG'` column in `small_season`, when it should be taking the mean of the `'PPG'` column in `season`
- Currently it is taking the mean of the `'PPG'` column in `small_season`, when it should be taking the mean of the `'PPG'` column in `resample`
- `.mean()` is not a valid Series method, and should be replaced with a call to the function `np.mean`
- Line 2 is correct as-is

**Answer:** Currently it is taking the mean of the `'PPG'` column in `small_season`, when it should be taking the mean of the `'PPG'` column in `resample`

The current implementation of Line 2 doesn’t use `resample` at all, when it should. If we were to leave Line 2 as it is, all of the values in `boot_means` would be identical (and equal to the mean of the `'PPG'` column in `small_season`).

Option 1 is incorrect since our bootstrapping procedure is independent of `season`. Option 3 is incorrect because `.mean()` is a valid Series method.

The average score on this problem was 98%.

What is incorrect about Line 3? Select all that apply.

- The result of calling `np.append` is not being reassigned to `boot_means`, so `boot_means` will be an empty array after running this procedure
- The indentation level of the line is incorrect: `np.append` should be outside of the `for`-loop (and aligned with `for i`)
- `new_mean` is not a defined variable name, and should be replaced with `resample_mean`
- Line 3 is correct as-is

**Answers:**

- The result of calling `np.append` is not being reassigned to `boot_means`, so `boot_means` will be an empty array after running this procedure
- `new_mean` is not a defined variable name, and should be replaced with `resample_mean`

`np.append` returns a new array and does not modify the array passed to it (`boot_means`, in this case), so Option 1 is a necessary fix. Furthermore, Option 3 is a necessary fix since `new_mean` wasn’t defined anywhere.

Option 2 is incorrect; if `np.append` were outside of the `for`-loop, only the last of the 10,000 resampled means would be appended, and the rest would never be saved in `boot_means`.

The average score on this problem was 94%.
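Putting the three sets of fixes together, one possible corrected version is sketched below. The `small_season` DataFrame built here is synthetic stand-in data (36 made-up `'PPG'` values), since the real sample is not reproduced in this worksheet, and plain pandas is assumed as the DataFrame library.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for small_season: 36 made-up PPG values.
rng = np.random.default_rng(0)
small_season = pd.DataFrame({'PPG': rng.uniform(0, 20, 36).round(1)})

boot_means = np.array([])
for i in np.arange(10000):
    # Line 1: same size as the original sample, drawn with replacement.
    resample = small_season.sample(small_season.shape[0], replace=True)
    # Line 2: take the mean of the resample, not of small_season.
    resample_mean = resample.get('PPG').mean()
    # Line 3: np.append returns a new array, so reassign the result.
    boot_means = np.append(boot_means, resample_mean)
```

After the loop, `boot_means` holds 10,000 values that vary from one resample to the next, rather than staying empty or repeating a single number.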

We construct a 95% confidence interval for the true mean points per game for all players by taking the middle 95% of the bootstrapped sample means.

```
left_b = np.percentile(boot_means, 2.5)
right_b = np.percentile(boot_means, 97.5)
boot_ci = [left_b, right_b]
```

We find that `boot_ci` is the interval [7.7, 10.3]. However, the mean points per game in `season` is 7, which is not in the interval we found. Which of the following statements are true? Select all that apply.

- 95% of games in `season` have a number of points between 7.7 and 10.3.
- 95% of values in `boot_means` fall between 7.7 and 10.3.
- There is a 95% chance that the true mean points per game is between 7.7 and 10.3.
- The interval we created did not contain the true mean points per game, but if we collected many original samples and constructed many 95% confidence intervals, then exactly 95% of them would contain the true mean points per game.
- The interval we created did not contain the true mean points per game, but if we collected many original samples and constructed many 95% confidence intervals, then roughly 95% of them would contain the true mean points per game.

**Answers:**

- 95% of values in `boot_means` fall between the endpoints of the interval we found.
- The interval we created did not contain the true mean points per game, but if we collected many original samples and constructed many 95% confidence intervals, then roughly 95% of them would contain the true mean points per game.

The first option is incorrect because the confidence interval describes what we think the *mean* points per game could be. Individual games likely vary widely in the number of points scored. Probably very few have between 7.7 and 10.3 points.

The second option is correct because this is precisely how we calculated the endpoints of our interval, by taking the middle 95% of values in `boot_means`.

The third option is incorrect because we know the true mean points per game: it is 7. The value 7 does not fall in the interval [7.7, 10.3], and we can say that with certainty. This is not a probability statement because the interval and the parameter are both fixed.

The fourth option is incorrect because of the word *exactly*.
We generally can’t make guarantees like this when working with
randomness.

The fifth option is correct, as this is the meaning of confidence. We have confidence in the process of generating 95% confidence intervals, because roughly 95% of such intervals we create will capture the parameter of interest.
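The "roughly 95%" claim can be checked with a small simulation. The sketch below uses a made-up population (not the WNBA data), draws many original samples of size 36, builds a bootstrapped 95% confidence interval from each, and counts how often the interval captures the population mean. The numbers of samples and resamples are arbitrary choices to keep it fast.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up population; its mean plays the role of the true parameter.
population = rng.uniform(0, 20, 5000)
true_mean = population.mean()

num_samples = 500      # number of original samples
num_resamples = 1000   # bootstrap resamples per original sample
captured = 0

for _ in range(num_samples):
    sample = rng.choice(population, size=36, replace=False)
    # Draw all resamples at once: each row is one resample of size 36.
    resamples = rng.choice(sample, size=(num_resamples, 36), replace=True)
    boot_means = resamples.mean(axis=1)
    left_b = np.percentile(boot_means, 2.5)
    right_b = np.percentile(boot_means, 97.5)
    if left_b <= true_mean <= right_b:
        captured += 1

coverage = captured / num_samples
print(coverage)
```

The printed proportion should land close to 0.95, but nothing forces it to equal 0.95, which is why "roughly" is the right word and "exactly" is not.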

**True or False**: Suppose that from a sample, you
compute a 95% bootstrapped confidence interval for a population
parameter to be the interval [L, R]. Then the average of L and R is the
mean of the original sample.

**Answer:** False

A 95% confidence interval indicates we are 95% confident that the true population parameter falls within the interval [L, R]. Note that the problem specifies that the confidence interval is bootstrapped. The endpoints of a bootstrapped interval are not computed from the mean of the original sample. Instead, they come from resampling the data with replacement over and over again, so the interval varies from one run of the procedure to the next and need not be centered on the sample mean. Specifically, L is the 2.5th percentile of the distribution of bootstrapped means and R is the 97.5th percentile, and these are not necessarily the same distance away from the mean of the sample.

The average score on this problem was 87%.
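As a quick illustration, the sketch below bootstraps a made-up, right-skewed sample (an assumption chosen so the asymmetry is easier to see) and compares the midpoint of the percentile interval with the sample mean.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up, right-skewed sample of size 36.
sample = rng.exponential(scale=5, size=36)
sample_mean = sample.mean()

# 10,000 resamples, each the same size as the original sample.
resamples = rng.choice(sample, size=(10000, sample.size), replace=True)
boot_means = resamples.mean(axis=1)

L = np.percentile(boot_means, 2.5)
R = np.percentile(boot_means, 97.5)
midpoint = (L + R) / 2

print(sample_mean, midpoint)
```

The two printed values are typically close to each other but not equal.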

```
results = np.array([])
for i in np.arange(10):
    result = np.random.choice(np.arange(1000), replace=False)
    results = np.append(results, result)
```

After this code executes, `results` contains:

- a simple random sample of size 9, chosen from a set of size 999 with replacement
- a simple random sample of size 9, chosen from a set of size 999 without replacement
- a simple random sample of size 10, chosen from a set of size 1000 with replacement
- a simple random sample of size 10, chosen from a set of size 1000 without replacement

**Answer:** a simple random sample of size 10, chosen from a set of size 1000 with replacement

Let’s see what the code is doing. The first line initializes an empty array called `results`. The `for` loop runs 10 times. Each time, it creates a value called `result` by some process we’ll inspect shortly and appends this value to the end of the `results` array. At the end of the code snippet, `results` will be an array containing 10 elements.

Now, let’s look at the process by which each element `result` is generated. Each `result` is a random element chosen from `np.arange(1000)`, which is the numbers from 0 to 999, inclusive. That’s 1000 possible numbers. Each time `np.random.choice` is called, just one value is chosen from this set of 1000 possible numbers.

When we sample just one element from a set of values, sampling with replacement is the same as sampling without replacement, because sampling with or without replacement concerns whether subsequent draws can be the same as previous ones. When we’re just sampling one element, it really doesn’t matter whether our process involves putting that element back, as we’re not going to draw again!

Therefore, `result` is just one random number chosen from the 1000 possible numbers. Each time the `for` loop executes, `result` gets set to a random number chosen from the 1000 possible numbers. It is possible (though unlikely) that the random `result` of the first execution of the loop matches the `result` of the second execution of the loop. More generally, there can be repeated values in the `results` array since each entry of this array is independently drawn from the same set of possibilities. Since repetitions are possible, this means the sample is drawn with replacement.

Therefore, the `results` array contains a sample of size 10 chosen from a set of size 1000 with replacement. This is called a “simple random sample” because each possible sample of 10 values is equally likely, which comes from the fact that `np.random.choice` chooses each possible value with equal probability by default.

The average score on this problem was 11%.
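To see the effect directly, the sketch below runs the same loop on a deliberately tiny population of 3 values (a hypothetical change from the 1000 in the question). Ten draws from three values cannot all be distinct, so repeats must appear even though every call passes `replace=False`.

```python
import numpy as np

np.random.seed(0)

# Hypothetical tiny population so that repeats are guaranteed.
population = np.arange(3)

results = np.array([])
for i in np.arange(10):
    # replace=False only applies within this single call, which draws one value.
    result = np.random.choice(population, replace=False)
    results = np.append(results, result)

print(results)
print(len(np.unique(results)))
```

The array has 10 entries but at most 3 distinct values, so across iterations the draws behave like sampling with replacement.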

An IKEA fan created an app where people can log the amount of time it took them to assemble their IKEA furniture. The DataFrame `app_data` has a row for each product build that was logged on the app. The columns are:

- `'product'` (`str`): the name of the product, which includes the product line as the first word, followed by a description of the product
- `'category'` (`str`): a categorical description of the type of product
- `'assembly_time'` (`str`): the amount of time to assemble the product, formatted as `'x hr, y min'` where `x` and `y` represent integers, possibly zero
- `'minutes'` (`int`): integer values representing the number of minutes it took to assemble each product

We want to use `app_data` to estimate the average amount of time it takes to build an IKEA bed (any product in the `'bed'` category). Which of the following strategies would be an appropriate way to estimate this quantity? Select all that apply.

- Query to keep only the beds. Then resample with replacement many times. For each resample, take the mean of the `'minutes'` column. Compute a 95% confidence interval based on those means.
- Query to keep only the beds. Group by `'product'` using the mean aggregation function. Then resample with replacement many times. For each resample, take the mean of the `'minutes'` column. Compute a 95% confidence interval based on those means.
- Resample with replacement many times. For each resample, first query to keep only the beds and then take the mean of the `'minutes'` column. Compute a 95% confidence interval based on those means.
- Resample with replacement many times. For each resample, first query to keep only the beds. Then group by `'product'` using the mean aggregation function, and finally take the mean of the `'minutes'` column. Compute a 95% confidence interval based on those means.

**Answer:** Only the first answer is correct.

This is a question of parameter estimation, so our approach is to use bootstrapping to create many resamples of our original sample, computing the average of each resample. Each resample should always be the same size as the original sample. The first answer choice accomplishes this by querying first to keep only the beds, then resampling from the DataFrame of beds only. This means resamples will have the same size as the original sample. Each resample’s mean will be computed, so we will have many resample means from which to construct our 95% confidence interval.

In the second answer choice, we are actually taking the mean twice. We first average the build times for all builds of the same product when grouping by product. This produces a DataFrame of different products with the average build time for each. We then resample from this DataFrame, computing the average of each resample. But this is a resample of products, not of product builds. The size of the resample is the number of unique bed products in `app_data`, not the number of reported bed builds in `app_data`. Further, we get incorrect results by averaging numbers that are already averages. For example, if 5 people build bed A and it takes them each 1 hour, and 1 person builds bed B and it takes them 10 hours, the average amount of time to build a bed is $\frac{5 \cdot 1 + 10}{6} = 2.5$ hours. But if we average the times for bed A (1 hour) and average the times for bed B (10 hours), then average those, we get $\frac{1 + 10}{2} = 5.5$ hours, which is not the same. More generally, grouping is not a part of the bootstrapping process because we want each data value to be weighted equally.
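The arithmetic in this example can be checked in a few lines. The sketch below uses the build times from the example: five builds of bed A at 1 hour each, and a single build of bed B at 10 hours, so bed B's per-product average is also 10 hours. The array names are made up for illustration.

```python
import numpy as np

# Build times in hours from the example: five builds of bed A, one of bed B.
bed_a = np.array([1, 1, 1, 1, 1])
bed_b = np.array([10])

# Every build weighted equally.
overall_mean = np.concatenate([bed_a, bed_b]).mean()

# Every product weighted equally (what grouping by product first would do).
mean_of_means = np.mean([bed_a.mean(), bed_b.mean()])

print(overall_mean)   # 2.5
print(mean_of_means)  # 5.5
```

Grouping first gives bed B's single build as much weight as all five builds of bed A combined, which is what pulls the second number upward.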

The last two answer choices are incorrect because they involve resampling from the full `app_data` DataFrame before querying to keep only the beds. This is incorrect because it does not preserve the sample size. For example, if `app_data` contains 1000 reported bed builds and 4000 other product builds, then the only relevant data is the 1000 bed build times, so when we resample, we want to consider another set of 1000 beds. If we resample from the full `app_data` DataFrame, our resample will contain 5000 rows, but the number of beds will be random, not necessarily 1000. If we query first to keep only the beds, then resample, our resample will contain exactly 1000 beds every time. As an added bonus, since we only care about beds, it’s much faster to resample from a smaller DataFrame of beds only than it is to resample from all of `app_data` with plenty of rows we don’t care about.

The average score on this problem was 71%.
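The sample-size point can be illustrated without the real data. The sketch below stands in for the `'category'` column with a made-up array of 1000 `'bed'` labels and 4000 `'other'` labels, matching the example above, and counts the beds in a few resamples under each ordering.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical app_data categories: 1000 bed builds and 4000 other builds.
category = np.array(['bed'] * 1000 + ['other'] * 4000)

# Resample all rows first, then keep the beds: the bed count is random.
resample_then_query = [
    np.count_nonzero(
        rng.choice(category, size=category.size, replace=True) == 'bed'
    )
    for _ in range(5)
]

# Keep the beds first, then resample: the bed count is always 1000.
beds = category[category == 'bed']
query_then_resample = [
    rng.choice(beds, size=beds.size, replace=True).size
    for _ in range(5)
]

print(resample_then_query)
print(query_then_resample)
```

The first list fluctuates around 1000, while the second is 1000 every time, which is the property a bootstrap resample needs.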