These problems are taken from past quizzes and exams. Work on them
on paper, since the quizzes and exams you take in this
course will also be on paper.
We encourage you to complete these
problems during discussion section. Solutions will be made available
after all discussion sections have concluded. You don’t need to submit
your answers anywhere.
Note: We do not plan to cover all of
these problems during the discussion section; the problems we don’t
cover can be used for extra practice.
Suppose we take a uniform random sample with replacement from a population, and use the sample mean as an estimate for the population mean. Which of the following is correct?
If we take a larger sample, our sample mean will be closer to the population mean.
If we take a smaller sample, our sample mean will be closer to the population mean.
If we take a larger sample, our sample mean is more likely to be close to the population mean than if we take a smaller sample.
If we take a smaller sample, our sample mean is more likely to be close to the population mean than if we take a larger sample.
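As an aside (and not an answer key for the question above), the behavior of the sample mean at different sample sizes is easy to explore by simulation. The sketch below uses a made-up exponential population; the population, sample sizes, and trial counts are illustrative choices, not part of the problem.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical population of 100,000 values (not from any problem above).
population = rng.exponential(scale=10, size=100_000)

def spread_of_sample_means(n, trials=2_000):
    # Draw `trials` uniform random samples of size n with replacement,
    # and return the standard deviation of their sample means.
    means = [rng.choice(population, size=n, replace=True).mean()
             for _ in range(trials)]
    return np.std(means)

print(spread_of_sample_means(10))
print(spread_of_sample_means(100))
```

Running this a few times and comparing the two printed spreads is a quick way to build intuition about how sample size affects the variability of the sample mean.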
Given below is the season
DataFrame, which contains
statistics on all players in the WNBA in the 2021 season. The first few
rows of season
are shown below:
Each row in season corresponds to a single player. In this problem,
we’ll be looking at the 'PPG'
column, which records the
number of points scored per game played.
Now, suppose we only have access to the DataFrame
small_season
, which is a random sample of size
36 from season
. We’re interested in learning about
the true mean points per game of all players in season
given just the information in small_season
.
To start, we want to bootstrap small_season
10,000 times
and compute the mean of the resample each time. We want to store these
10,000 bootstrapped means in the array boot_means
.
Here is a broken implementation of this procedure.
boot_means = np.array([])
for i in np.arange(10000):
    resample = small_season.sample(season.shape[0], replace=False)  # Line 1
    resample_mean = small_season.get('PPG').mean()                  # Line 2
    np.append(boot_means, new_mean)                                 # Line 3
For each of the 3 lines of code above (marked by comments), specify what is incorrect about the line by selecting one or more of the corresponding options below. Or, select “Line _ is correct as-is” if you believe there’s nothing that needs to be changed about the line in order for the above code to run properly.
What is incorrect about Line 1? Select all that apply.
Currently the procedure samples from small_season
, when
it should be sampling from season
The sample size is season.shape[0]
, when it should be
small_season.shape[0]
Sampling is currently being done without replacement, when it should be done with replacement
Line 1 is correct as-is
What is incorrect about Line 2? Select all that apply.
Currently it is taking the mean of the 'PPG'
column in
small_season
, when it should be taking the mean of the
'PPG'
column in season
Currently it is taking the mean of the 'PPG'
column in
small_season
, when it should be taking the mean of the
'PPG'
column in resample
.mean()
is not a valid Series method, and should be
replaced with a call to the function np.mean
Line 2 is correct as-is
What is incorrect about Line 3? Select all that apply.
The result of calling np.append
is not being reassigned
to boot_means
, so boot_means
will be an empty
array after running this procedure
The indentation level of the line is incorrect –
np.append
should be outside of the for
-loop
(and aligned with for i
)
new_mean
is not a defined variable name, and should be
replaced with resample_mean
Line 3 is correct as-is
We construct a 95% confidence interval for the true mean points per game for all players by taking the middle 95% of the bootstrapped sample means.
left_b = np.percentile(boot_means, 2.5)
right_b = np.percentile(boot_means, 97.5)
boot_ci = [left_b, right_b]
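The same percentile recipe applies to any array of bootstrapped statistics. As a self-contained sketch of the whole pipeline (using made-up normal data as a stand-in for the 'PPG' column, not the actual WNBA sample), it looks like:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for small_season.get('PPG'): 36 made-up points-per-game values.
sample = rng.normal(loc=9, scale=3, size=36)

# Bootstrap: resample from the sample, with replacement, at the sample's size.
boot_means = np.array([])
for i in range(10_000):
    resample = rng.choice(sample, size=sample.size, replace=True)
    boot_means = np.append(boot_means, resample.mean())

# Middle 95% of the bootstrapped means gives the confidence interval.
left_b = np.percentile(boot_means, 2.5)
right_b = np.percentile(boot_means, 97.5)
print([left_b, right_b])
```

Because each resample is drawn from the original sample, the bootstrap distribution is centered near the original sample mean, not necessarily near the population mean.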
We find that boot_ci
is the interval [7.7, 10.3].
However, the mean points per game in season
is 7, which is
not in the interval we found. Which of the following statements is true?
(Select all that apply.)
95% of games in season
have a number of points between
7.7 and 10.3.
95% of values in boot_means
fall between 7.7 and
10.3.
There is a 95% chance that the true mean points per game is between 7.7 and 10.3.
The interval we created did not contain the true mean points per game, but if we collected many original samples and constructed many 95% confidence intervals, then exactly 95% of them would contain the true mean points per game.
The interval we created did not contain the true mean points per game, but if we collected many original samples and constructed many 95% confidence intervals, then roughly 95% of them would contain the true mean points per game.
True or False: Suppose that from a sample, you compute a 95% bootstrapped confidence interval for a population parameter to be the interval [L, R]. Then the average of L and R is the mean of the original sample.
results = np.array([])
for i in np.arange(10):
    result = np.random.choice(np.arange(1000), replace=False)
    results = np.append(results, result)
After this code executes, results
contains:
a simple random sample of size 9, chosen from a set of size 999 with replacement
a simple random sample of size 9, chosen from a set of size 999 without replacement
a simple random sample of size 10, chosen from a set of size 1000 with replacement
a simple random sample of size 10, chosen from a set of size 1000 without replacement
An IKEA fan created an app where people can log the amount of time it
took them to assemble their IKEA furniture. The DataFrame
app_data
has a row for each product build that was logged
on the app. The columns are:
'product' (str): the name of the product, which includes the product line as the first word, followed by a description of the product
'category' (str): a categorical description of the type of product
'assembly_time' (str): the amount of time to assemble the product, formatted as 'x hr, y min', where x and y represent integers, possibly zero
'minutes' (int): integer values representing the number of minutes it took to assemble each product
We want to use app_data
to estimate the average amount
of time it takes to build an IKEA bed (any product in the
'bed'
category). Which of the following strategies would be
an appropriate way to estimate this quantity? Select all that apply.
Query to keep only the beds. Then resample with replacement many
times. For each resample, take the mean of the 'minutes'
column. Compute a 95% confidence interval based on those means.
Query to keep only the beds. Group by 'product'
using
the mean aggregation function. Then resample with replacement many
times. For each resample, take the mean of the 'minutes'
column. Compute a 95% confidence interval based on those means.
Resample with replacement many times. For each resample, first query
to keep only the beds and then take the mean of the
'minutes'
column. Compute a 95% confidence interval based
on those means.
Resample with replacement many times. For each resample, first query
to keep only the beds. Then group by 'product'
using the
mean aggregation function, and finally take the mean of the
'minutes'
column. Compute a 95% confidence interval based
on those means.
Suppose we have access to a simple random sample of all US Costco
members of size 145. Our sample is stored in a
DataFrame named us_sample
, in which the
"Spend"
column contains the October 2023 spending of each
sampled member in dollars.
Fill in the blanks below so that us_left
and
us_right
are the left and right endpoints of a
46% confidence interval for the average October 2023
spending of all US members.
costco_means = np.array([])
for i in np.arange(5000):
    resampled_spends = __(x)__
    costco_means = np.append(costco_means, resampled_spends.mean())
left = np.percentile(costco_means, __(y)__)
right = np.percentile(costco_means, __(z)__)
Which of the following could go in blank (x)? Select all that apply.
us_sample.sample(145, replace=True).get("Spend")
us_sample.sample(145, replace=False).get("Spend")
np.random.choice(us_sample.get("Spend"), 145)
np.random.choice(us_sample.get("Spend"), 145, replace=True)
np.random.choice(us_sample.get("Spend"), 145, replace=False)
None of the above.
What goes in blanks (y) and (z)? Give your answers as integers.
True or False: 46% of all US members in
us_sample
spent between us_left
and
us_right
in October 2023.
True
False
True or False: If we repeat the code from part (b) 200 times, each time bootstrapping from a new random sample of 145 members drawn from all US members, then about 92 of the intervals we create will contain the average October 2023 spending of all US members.
True
False
True or False: If we repeat the code from part (b) 200 times, each
time bootstrapping from us_sample
, then about
92 of the intervals we create will contain the average
October 2023 spending of all US members.
True
False
Rank these three students in ascending order of their exam performance relative to their classmates.
Hector, Clara, Vivek
Vivek, Hector, Clara
Clara, Hector, Vivek
Vivek, Clara, Hector
The data visualization below shows all Olympic gold medals for women’s gymnastics, broken down by the age of the gymnast.
Based on this data, rank the following three quantities in ascending order: the median age at which gold medals are earned, the mean age at which gold medals are earned, the standard deviation of the age at which gold medals are earned.
mean, median, SD
median, mean, SD
SD, mean, median
SD, median, mean
Among all Costco members in San Diego, the average monthly spending in October 2023 was $350 with a standard deviation of $40.
The amount Ciro spent at Costco in October 2023 was -1.5 in standard units. What is this amount in dollars? Give your answer as an integer.
What is the minimum possible percentage of San Diego members that spent between $250 and $450 in October 2023?
16%
22%
36%
60%
78%
84%
Now, suppose we’re given that the distribution of monthly spending in October 2023 for all San Diego members is roughly normal. Given this fact, fill in the blanks:
What are m and n? Give your answers as integers rounded to the nearest multiple of 10.