Discussion 6: Sampling, Bootstrapping, and Confidence Intervals



The problems in this worksheet are taken from past exams. Work on them on paper, since the exams you take in this course will also be on paper.

We encourage you to complete this worksheet in a live discussion section. Solutions will be made available after all discussion sections have concluded. You don’t need to submit your answers anywhere.

Note: We do not plan to cover all problems here in the live discussion section; the problems we don’t cover can be used for extra practice.


Problem 1

Suppose we take a uniform random sample with replacement from a population, and use the sample mean as an estimate for the population mean. Which of the following is correct?
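For intuition about the setup (not a solution), the sampling process can be simulated. The following is a minimal sketch that assumes a made-up population; the names population, sample_mean, and estimates, the sample size of 50, and the seed are all hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily so the run is repeatable

# Hypothetical population; these values are made up for illustration.
population = rng.exponential(scale=10, size=100_000)

def sample_mean(n):
    """Mean of a uniform random sample of size n drawn with replacement."""
    return rng.choice(population, size=n, replace=True).mean()

# Repeat the sampling process many times to see how the estimate varies.
estimates = np.array([sample_mean(50) for _ in range(1000)])

print("population mean:", population.mean())
print("first sample mean:", estimates[0])
print("standard deviation of the sample means:", estimates.std())
```

Each run of sample_mean produces a different estimate, which is why the sample mean is treated as a random quantity rather than a fixed number.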


Problem 2

The season DataFrame contains statistics on all players in the WNBA in the 2021 season. The first few rows of season are shown below:

Each row in season corresponds to a single player. In this problem, we’ll be looking at the 'PPG' column, which records the number of points scored per game played.

Now, suppose we only have access to the DataFrame small_season, which is a random sample of size 36 from season. We’re interested in learning about the true mean points per game of all players in season given just the information in small_season.

To start, we want to bootstrap small_season 10,000 times and compute the mean of the resample each time. We want to store these 10,000 bootstrapped means in the array boot_means.

Here is a broken implementation of this procedure.

boot_means = np.array([])                                           
for i in np.arange(10000):                                          
    resample = small_season.sample(season.shape[0], replace=False)  # Line 1
    resample_mean = small_season.get('PPG').mean()                  # Line 2
    np.append(boot_means, new_mean)                                 # Line 3

For each of the 3 lines of code above (marked by comments), specify what is incorrect about the line by selecting one or more of the corresponding options below. Or, select “Line _ is correct as-is” if you believe there’s nothing that needs to be changed about the line in order for the above code to run properly.


Problem 2.1

What is incorrect about Line 1? Select all that apply.


Problem 2.2

What is incorrect about Line 2? Select all that apply.


Problem 2.3

What is incorrect about Line 3? Select all that apply.


Problem 2.4

We construct a 95% confidence interval for the true mean points per game for all players by taking the middle 95% of the bootstrapped sample means.

left_b = np.percentile(boot_means, 2.5)
right_b = np.percentile(boot_means, 97.5)
boot_ci = [left_b, right_b]         

We find that boot_ci is the interval [7.7, 10.3]. However, the mean points per game in season is 7, which is not in the interval we found. Which of the following statements is true? (Select all that apply.)
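To experiment with the percentile method after attempting the problem, here is a minimal self-contained sketch. It assumes a made-up NumPy array in place of small_season (the values are not taken from season), and it uses NumPy's random generator rather than the DataFrame methods used in this course.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed chosen arbitrarily so the run is repeatable

# Hypothetical sample of 36 values; not taken from season.
sample = rng.normal(loc=9, scale=4, size=36)

n_resamples = 10_000
boot_means = np.empty(n_resamples)
for i in range(n_resamples):
    # Resample with replacement, using the same size as the original sample.
    resample = rng.choice(sample, size=len(sample), replace=True)
    boot_means[i] = resample.mean()

# The middle 95% of the bootstrapped means.
left = np.percentile(boot_means, 2.5)
right = np.percentile(boot_means, 97.5)
print([left, right])
```

Changing the seed changes both the sample and the resulting interval, which is a useful way to see that an interval built from one sample need not contain the population mean.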



Problem 3

True or False: Suppose that from a sample, you compute a 95% bootstrapped confidence interval for a population parameter to be the interval [L, R]. Then the average of L and R is the mean of the original sample.


Problem 4

results = np.array([])
for i in np.arange(10):
    result = np.random.choice(np.arange(1000), replace=False)
    results = np.append(results, result)

After this code executes, results contains:


Problem 5

An IKEA fan created an app where people can log the amount of time it took them to assemble their IKEA furniture. The DataFrame app_data has a row for each product build that was logged on the app. The columns are:

We want to use app_data to estimate the average amount of time it takes to build an IKEA bed (any product in the 'bed' category). Which of the following strategies would be an appropriate way to estimate this quantity? Select all that apply.


Problem 6

Suppose we have access to a simple random sample of all US Costco members of size 145. Our sample is stored in a DataFrame named us_sample, in which the "Spend" column contains the October 2023 spending of each sampled member in dollars.


Problem 6.1

Fill in the blanks below so that us_left and us_right are the left and right endpoints of a 46% confidence interval for the average October 2023 spending of all US members.

costco_means = np.array([])
for i in np.arange(5000):
    resampled_spends = __(x)__
    costco_means = np.append(costco_means, resampled_spends.mean())
us_left = np.percentile(costco_means, __(y)__)
us_right = np.percentile(costco_means, __(z)__)

Which of the following could go in blank (x)? Select all that apply.

What goes in blanks (y) and (z)? Give your answers as integers.


Problem 6.2

True or False: 46% of all US members in us_sample spent between us_left and us_right in October 2023.


Problem 6.3

True or False: If we repeat the code from Problem 6.1 200 times, each time bootstrapping from a new random sample of 145 members drawn from all US members, then about 92 of the intervals we create will contain the average October 2023 spending of all US members.


Problem 6.4

True or False: If we repeat the code from Problem 6.1 200 times, each time bootstrapping from us_sample, then about 92 of the intervals we create will contain the average October 2023 spending of all US members.


