Below are practice problems tagged for Lecture 14 (rendered directly from the original exam/quiz sources).
In apps, our sample of 1,000 credit card applications,
500 of the applications come from homeowners and 500 come from people
who don’t own their own home. In this sample, homeowner ages have a mean
of 40 and standard deviation of 10. We want to use the bootstrap method
to compute a confidence interval for the mean age of a homeowner in the
population of all credit card applicants.
Note: This problem is out of scope; it covers material no longer included in the course.
Suppose our computer is too slow to bootstrap 10,000 times, and instead can only bootstrap 20 times. Here are the 20 resample means, sorted in ascending order: \begin{align*} &37, 38, 39, 39, 40, 40, 40, 40, 41 , 41, \\ &42, 42, 42, 42, 42, 42, 43, 43, 43 , 44 \end{align*} What are the left and right endpoints of a bootstrapped 80% confidence interval for the population mean? Use the mathematical definition of percentile.
Answer: Left endpoint = 38, Right endpoint = 43
To find an 80% confidence interval, we need to find the 10th and 90th percentiles of the resample means. Using the mathematical definition of percentile, the 10th percentile is at position 0.1*20 = 2 when we count starting with 1. Since 38 is the second element of the sorted data, that is the left endpoint of our confidence interval.
The average score on this problem was 63%.
Similarly, the 90th percentile is at position 0.9*20 = 18 when we count starting with 1. Since 43 is the 18th element of the sorted data, that is the right endpoint of our confidence interval.
The average score on this problem was 65%.
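As a quick check, here is a minimal sketch (not part of the original solution) that computes both endpoints from the 20 resample means; percentile_math_def is a helper name made up for this illustration, not a library function.

import numpy as np

# The 20 sorted resample means from the problem.
resample_means = np.array([37, 38, 39, 39, 40, 40, 40, 40, 41, 41,
                           42, 42, 42, 42, 42, 42, 43, 43, 43, 44])

def percentile_math_def(data, p):
    # Mathematical definition: take the element at position ceil(p% of n),
    # counting positions starting from 1, in the sorted data.
    data = np.sort(data)
    position = int(np.ceil(p * len(data) / 100))
    return data[position - 1]

percentile_math_def(resample_means, 10)   # 38, the left endpoint
percentile_math_def(resample_means, 90)   # 43, the right endpoint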
Note: This problem is out of scope; it covers material no longer included in the course.
True or False: Using the mathematical definition of percentile, the 50th percentile of the bootstrapped distribution above equals its median.
True
False
Answer: False
The 50th percentile according to the mathematical definition is the element at position 0.5*20 = 10 when we count starting with 1. The 10th element is 41. However, the median of a data set with 20 elements is halfway between the 10th and 11th values. So the median in this case is 41.5.
The average score on this problem was 79%.
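The contrast is easy to verify directly; the short sketch below is an illustration, not part of the original solution.

import numpy as np

data = np.array([37, 38, 39, 39, 40, 40, 40, 40, 41, 41,
                 42, 42, 42, 42, 42, 42, 43, 43, 43, 44])

# Mathematical definition: 0.5 * 20 = 10, so take the 10th element (counting from 1).
data[10 - 1]      # 41
np.median(data)   # 41.5, the average of the 10th and 11th elements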
Consider the following three quantities:
pop_mean, the unknown mean age of homeowners in the
population of all credit card applicants.
sample_mean, the mean age of homeowners in our
sample of 500 applications in apps. We know this is
40.
resample_mean, the mean age of homeowners in one
particular resample of the applications in apps.
Which of the following statements about the relationship between these three quantities are guaranteed to be true? Select all that apply.
If sample_mean is less than pop_mean, then
resample_mean is also less than pop_mean.
The mean of sample_mean and resample_mean
is closer to pop_mean than either of the two values
individually.
resample_mean is closer than sample_mean to
pop_mean.
resample_mean is further than sample_mean
from pop_mean.
None of the above.
Answer: None of the above.
Whenever we take a sample from a population, there is no guaranteed
relationship between the mean of the sample and the mean of the
population. Sometimes the mean of the sample comes out larger than the
population mean, sometimes smaller. We know this from the CLT which says
that the distribution of the sample mean is centered at the
population mean. Similarly, when we resample from an original sample, the
resample mean could be larger or smaller than the original sample’s
mean. The three quantities pop_mean,
sample_mean, and resample_mean can be in any
relative order. This means none of the statements listed here are
necessarily true.
The average score on this problem was 37%.
As a senior suffering from senioritis, Weiyue has plenty of time on his hands. 1,000 times, he repeats the following process, creating 1,000 confidence intervals:
Collect a simple random sample of 100 rows from
txn.
Resample from his sample 10,000 times, computing the mean transaction amount in each resample.
Create a 95% confidence interval by taking the middle 95% of resample means.
He then computes the width of each confidence interval by subtracting its left endpoint from its right endpoint; e.g. if [2, 5] is a confidence interval, its width is 3. This gives him 1,000 widths. Let M be the mean of these 1,000 widths.
Select the true statement below.
About 950 of Weiyue’s intervals will contain the mean transaction amount of all transactions ever.
About 950 of Weiyue’s intervals will contain the mean transaction
amount of all transactions in txn.
About 950 of Weiyue’s intervals will contain the mean transaction
amount of all transactions in the first random sample of 100 rows of
txn Weiyue took.
About 950 of Weiyue’s intervals will contain M.
Answer: About 950 of Weiyue’s intervals will contain
the mean transaction amount of all transactions in txn.
By the definition of a 95% confidence interval, 95% of our 1000
confidence intervals will contain the true mean transaction amount in
the population from which our samples were drawn. In this case, the
population is the txn DataFrame. So, 950 of the confidence
intervals will contain the mean transaction amount of all transactions
in txn, which is what the second answer choice
says.
We can’t conclude that the first answer choice is correct because our
original sample was taken from txn, not from all
transactions ever. We don’t know whether our resamples will be
representative of all transactions ever. The third option is incorrect
because we have no way of knowing what the first random sample looks
like from a statistical standpoint. The last statement is not true
because M concerns the width of the
confidence interval, and therefore is unrelated to the statistics
computed in each resample. For example, if the mean of each resample is
around 100, but the width of each confidence interval is around 5, we
shouldn’t expect M to be in any of the confidence intervals.
The average score on this problem was 55%.
Weiyue repeats his entire process, except this time, he changes his sample size in step 1 from 100 to 400. Let B be the mean of the widths of the 1,000 new confidence intervals he creates.
What is the relationship between M and B?
M < B
M \approx B
M > B
Answer: M > B
As the sample size increases, the width of the confidence intervals will decrease, so M > B.
The average score on this problem was 70%.
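Although the problem only asks for the direction of the change, a rough CLT-based calculation (an aside, not part of the original solution, with \sigma denoting the population standard deviation of transaction amounts) suggests how much narrower the intervals should get:

\text{width of a 95\% CI} \approx 2 \cdot 1.96 \cdot \frac{\sigma}{\sqrt{n}}, \qquad \text{so} \qquad \frac{M}{B} \approx \frac{\sigma / \sqrt{100}}{\sigma / \sqrt{400}} = 2

That is, M should be roughly twice B, which is consistent with M > B.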
Weiyue repeats his entire process once again. This time, he still uses a sample size of 100 in step 1, but instead of creating 95% confidence intervals in step 3, he creates 99% confidence intervals. Let C be the mean of the widths of the 1,000 new confidence intervals he generates.
What is the relationship between M and C?
M < C
M \approx C
M > C
Answer: M < C
All else equal (note that the sample size is still 100, as in the original process), 99% confidence intervals will always be wider than 95% confidence intervals on the same data, so M < C.
The average score on this problem was 85%.
Weiyue repeats his entire process one last time. This time, he still uses a sample size of 100 in step 1, and creates 95% confidence intervals in step 3, but instead of bootstrapping, he uses the Central Limit Theorem to generate his confidence intervals. Let D be the mean of the widths of the 1,000 new confidence intervals he creates.
What is the relationship between M and D?
M < D
M \approx D
M > D
Answer: M \approx D
Confidence intervals generated from the Central Limit Theorem will be approximately the same as those generated from bootstrapping, so M is approximately equal to D.
The average score on this problem was 90%.
On Reddit, Keenan also read that 22% of all online transactions are fraudulent. He decides to test the following hypotheses at the 0.16 significance level:
Null Hypothesis: The proportion of online transactions that are fraudulent is 0.22.
Alternative Hypothesis: The proportion of online transactions that are fraudulent is not 0.22.
Keenan has access to a simple random sample of txn of
size 500. In his sample, the proportion of transactions
that are fraudulent is 0.23.
Below is an incomplete implementation of the function
reject_null, which creates a bootstrap-based confidence
interval and returns True if the conclusion of Keenan’s
test is to reject the null hypothesis, and
False if the conclusion is to fail to
reject the null hypothesis, all at the 0.16
significance level.
def reject_null():
    fraud_counts = np.array([])
    for i in np.arange(10000):
        fraud_count = np.random.multinomial(500, __(a)__)[0]
        fraud_counts = np.append(fraud_counts, fraud_count)
    L = np.percentile(fraud_counts, __(b)__)
    R = np.percentile(fraud_counts, __(c)__)
    if __(d)__ < L or __(d)__ > R:
        # Return True if we REJECT the null.
        return True
    else:
        # Return False if we FAIL to reject the null.
        return False

Fill in the blanks so that reject_null works as intended.
Hint: Your answer to (d) should be an integer greater than 50.
Answer: (a): [0.23, 0.77], (b): 8, (c): 92, (d): 110

(a): Because we’re bootstrapping, we’re using the data from the original sample. This is not a “regular” hypothesis test where we simulate under the assumptions of the null. It’s more like the human body temperature example from lecture, where we construct a confidence interval, then determine which hypothesis to side with based on whether some value falls in the interval or not. Here, they
tell us to make a bootstrapped confidence interval. Normally we’d use
the .sample method for this, but we’re forced here to use
np.random.multinomial, which also works because that
samples with replacement from a categorical distribution, and since
we’re working with a dataset of just two values for whether a
transaction is fraudulent or not, we can think of resampling from our
original sample as drawing from a categorical distribution.
We know that the proportion of fraudulent transactions in the sample
is 0.23 (and therefore the non-fraudulent proportion is 0.77), so we use
these as the probabilities for np.random.multinomial in our
bootstrapping simulation. The syntax for this function requires us to
pass in the probabilities as a list, so the answer is
[0.23, 0.77].
The average score on this problem was 23%.
(b): Since we’re testing at the 0.16 significance level, we know that
the proportion of data lying outside either of our endpoints is 0.16, or
0.08 on each side. So, the left endpoint is given by the 8th percentile,
which means that the argument to np.percentile must be
8.
The average score on this problem was 67%.
(c): Similar to part B, we know that 0.08 of the data must lie to the
right of the right endpoint, so the argument to
np.percentile here is (1 - 0.08)
\cdot 100 = 92.
The average score on this problem was 67%.
(d): To test our hypothesis, we must compare the left and right endpoints to the observed value. If the observed value is less than the left endpoint or greater than the right endpoint, we will reject the null hypothesis. Otherwise we fail to reject it. Since the left and right endpoints give the count of fraudulent transactions (not the proportion), we must convert our null hypothesis to similar terms. We can simply multiply the sample size by the proportion of fraudulent transactions to obtain the count that the null hypothesis would suggest given the sample size of 500, which gives us 500 * 0.22 = 110.
The average score on this problem was 26%.
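Putting the four answers together, the completed function would look like the sketch below (shown for reference; it assumes numpy has been imported as np, and uses the sample proportions 0.23 and 0.77 and the null count of 110 described above).

def reject_null():
    fraud_counts = np.array([])
    for i in np.arange(10000):
        # Resample 500 transactions from the original sample, in which 23% were fraudulent.
        fraud_count = np.random.multinomial(500, [0.23, 0.77])[0]
        fraud_counts = np.append(fraud_counts, fraud_count)
    # Middle 84% of the bootstrapped counts, for the 0.16 significance level.
    L = np.percentile(fraud_counts, 8)
    R = np.percentile(fraud_counts, 92)
    # Compare to the count the null hypothesis predicts: 0.22 * 500 = 110.
    if 110 < L or 110 > R:
        return True   # Reject the null.
    else:
        return False  # Fail to reject the null.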
You want to use the data in apts to test both of the
following pairs of hypotheses:
Pair 1:
Null Hypothesis: Half of the one bedroom and two bedroom apartments are one bedroom apartments.
Alternative Hypothesis: More than half of the one bedroom and two bedroom apartments are one bedroom apartments.
Pair 2:
Null Hypothesis: Half of the one bedroom and two bedroom apartments are one bedroom apartments.
Alternative Hypothesis: The proportion of one bedroom apartments among one and two bedroom apartments is not one half.
In apts, there are 467 apartments that are either one
bedroom or two bedroom apartments. You perform the following simulation
under the assumption of the null hypothesis.
prop_1br = np.array([])
abs_diff = np.array([])
for i in np.arange(10000):
    prop = np.random.multinomial(467, [0.5, 0.5])[0] / 467
    prop_1br = np.append(prop_1br, prop)
    abs_diff = np.append(abs_diff, np.abs(prop - 0.5))

You then calculate some percentiles of prop_1br. The following four expressions all evaluate to True.

np.percentile(prop_1br, 2.5) == 0.4
np.percentile(prop_1br, 5) == 0.42
np.percentile(prop_1br, 95) == 0.58
np.percentile(prop_1br, 97.5) == 0.6

What is prop_1br.mean() to two decimal places?
Answer: 0.5
From the given percentiles, we can notice that since the distribution is symmetric around the mean, the mean should be around the 50th percentile. Given the symmetry and the percentiles around 0.5, we can infer that the mean should be very close to 0.5.
Another way we can look at it is by noticing that prop is drawn from a [0.5, 0.5] distribution (because we are simulating under the null hypothesis) in np.random.multinomial(). This means we expect the distribution to be centered around 0.5.
The average score on this problem was 84%.
What is np.std(prop_1br) to two decimal places?
Answer: 0.05
If we look again at the percentiles, we notice that the distribution resembles a normal distribution. So by taking the mean and the 97.5th percentile, we can solve for the standard deviation. Since [2.5, 97.5] spans the middle 95% of the distribution, the 97.5th percentile is about two standard deviations above the mean (and the 2.5th percentile is about two standard deviations below it). Thus,
0.5 + 2 \cdot \text{SD} = 0.6
Solving for SD, we get \text{SD} = 0.05.
The average score on this problem was 45%.
What is np.percentile(abs_diff, 95) to two decimal
places?
Answer: 0.1
Each time through our for-loop, we execute the following lines of code:
prop_1br = np.append(prop_1br, prop)
abs_diff = np.append(abs_diff, np.abs(prop-0.5))
Additionally, we’re told the following statements evaluate to True:
np.percentile(prop_1br, 2.5) == 0.4
np.percentile(prop_1br, 5) == 0.42
np.percentile(prop_1br, 95) == 0.58
np.percentile(prop_1br, 97.5) == 0.6
We can combine these pieces of information to find the answer to this question.
First, consider the shape of the distribution of
prop_1br. We know it’s symmetrical around 0.5, and beyond
that, we can infer that it’s a normal distribution.
Now, think about how this relates to the distribution of
abs_diff. abs_diff is generated by finding the
absolute difference between prop_1br and 0.5. Because of
this, abs_diff is an array of distances (which are nonnegative by
definition) from 0.5.
We know that prop_1br is normal, and symmetrical about
0.5. So, the distribution of how far away prop_1br is from
0.5 will look like we took the distribution of prop_1br,
moved it to be centered at 0, and folded it in half so that all negative
values become positive. This is because the previous center at 0.5
represents a distance of 0 from 0.5. Similarly, a value of 0.6 would
represent a distance of 0.1 from 0.5, and a value of 0.4 would also
represent a distance of 0.1 from 0.5.
Now, the problem becomes much simpler to solve. We were told that 95% of the data in prop_1br lies between 0.4 and 0.6 (thanks to the expressions that evaluate to True). This is the same as telling us that 95% of the data in prop_1br lies within a distance of 0.1 from 0.5 (because 0.4 and 0.6 are both 0.1 away from 0.5).
Because of this, the 95th percentile of abs_diff is 0.1, since 95% of the data in prop_1br lies within a distance of 0.1 from 0.5 (meaning that 95% of the data in abs_diff is between 0 and 0.1).
The average score on this problem was 10%.
Which simulated test statistics should be used to test the first pair of hypotheses?
prop_1br
abs_diff
Answer: prop_1br
Our first pair of hypotheses’ alternative hypothesis asks if one number is greater than the other. Because of this, we can’t use an absolute value test statistic to answer the question, since all absolute value cares about is the distance the simulation is from the null assumption, not whether one value is greater than the other.
The average score on this problem was 82%.
Which simulated test statistics should be used to test the second pair of hypotheses?
prop_1br
abs_diff
Answer: abs_diff
Our second pair of hypotheses’ alternative hypothesis asks if one number is not equal to the other. Because of this, we have to use a test statistic that measures distance in both directions, not just one. Therefore, we use the absolute difference.
The average score on this problem was 83%.
Your observed data in apts is such that you reject the
null for the first pair of hypotheses at the 5% significance level, but
fail to reject the null for the second pair at the 5% significance
level. What could the value of the following proportion have been?
\frac{\text{\# of one bedroom apartments in \texttt{apts}}}{\text{\# of one bedroom apartments in \texttt{apts}+ \# of two bedroom apartments in \texttt{apts}}}
Give your answer as a number to two decimal places.
Answer: 0.59
To reject the null for the first pair at the 5% significance level, the observed proportion (using prop_1br as the test statistic) must be at least 0.58, the 95th percentile of prop_1br. To fail to reject the null for the second pair at the 5% significance level, the observed distance from 0.5 must be at most 0.1, the 95th percentile of abs_diff, so the observed proportion must be at most 0.6. A proportion between 0.58 and 0.6, such as 0.59, satisfies both conditions.
The average score on this problem was 20%.
You want to know how much extra it costs, on average, to have a
washer and dryer in your apartment. Since this cost is built into the
monthly rent, it isn’t clear how much of your rent will be going towards
this convenience. You decide to bootstrap the data in apts
to estimate the average monthly cost of having in-unit laundry.
Fill in the blanks to generate 10,000 bootstrapped estimates for the average monthly cost of in-unit laundry.
yes = apts[apts.get("Laundry")]
no = apts[apts.get("Laundry") == False]
laundry_stats = np.array([])
for i in np.arange(10000):
    yes_resample = yes.sample(__(a)__, __(b)__)
    no_resample = no.sample(__(c)__, __(d)__)
    one_stat = __(e)__
    laundry_stats = np.append(laundry_stats, one_stat)

Answer:
(a): yes.shape[0], (b): replace=True, (c): no.shape[0], (d): replace=True, (e): yes_resample.get("Rent").mean() - no_resample.get("Rent").mean()

For both yes_resample and no_resample, we need to use their respective DataFrames to create a bootstrapped estimate. Therefore, we randomly sample from their respective DataFrames with replacement (the key requirement of bootstrapping). Then, to calculate the test
statistic, we need to look back at what the question asks of us: to
estimate the average monthly cost of having in-unit
laundry, so we subtract the mean of the bootstrapped estimate
for no (no_resample) from the mean of the
bootstrapped estimate for yes
(yes_resample).
The average score on this problem was 79%.
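For reference, here is what the completed loop looks like with the blanks filled in as above (a sketch that assumes the same setup as the problem: numpy imported as np and an apts DataFrame with a boolean "Laundry" column and a numerical "Rent" column).

yes = apts[apts.get("Laundry")]
no = apts[apts.get("Laundry") == False]
laundry_stats = np.array([])
for i in np.arange(10000):
    # Resample each group with replacement, keeping each group's original size.
    yes_resample = yes.sample(yes.shape[0], replace=True)
    no_resample = no.sample(no.shape[0], replace=True)
    # One bootstrapped estimate of the average monthly cost of in-unit laundry.
    one_stat = yes_resample.get("Rent").mean() - no_resample.get("Rent").mean()
    laundry_stats = np.append(laundry_stats, one_stat)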
What if you wanted to instead estimate the average yearly cost of having in-unit laundry?
Below, change the blank (e), such that the procedure now generates 10,000 bootstrapped estimates for the average yearly cost of in-unit laundry.
Suppose you ran your original code from part (a) and used the results to calculate a confidence interval for the average monthly cost of in-unit laundry, which came out to be
[L_M, R_M].
Then, you changed blank (e) as you described above, and ran the code again to calculate a different confidence interval for the average yearly cost of in-unit laundry, which came out to be
[L_Y, R_Y].
Which of the following is the best description of the relationship between the endpoints of these confidence intervals? Note that the symbol \approx means “approximately equal.”
L_Y = 12 \cdot L_M and R_Y = 12 \cdot R_M
L_Y \approx 12 \cdot L_M and R_Y \approx 12 \cdot R_M
L_M = 12 \cdot L_Y and R_M = 12 \cdot R_Y
L_M \approx 12 \cdot L_Y and R_M \approx 12 \cdot R_Y
None of these.
Answer: (e): 12 * (yes_resample.get("Rent").mean() - no_resample.get("Rent").mean()); L_Y \approx 12 \cdot L_M and R_Y \approx 12 \cdot R_M

To estimate a yearly cost instead of a monthly cost, we multiply each bootstrapped monthly estimate by 12. Since the two confidence intervals come from separate runs of the bootstrap, each based on its own random resamples, we cannot say that L_Y and R_Y will be exactly 12 times L_M and R_M; because of this randomness, they will only be approximately 12 times as large.
The bottom two choices flip the relationship and state that the average monthly cost is 12 times the average yearly cost, which would be a vastly different quantity.
The average score on this problem was 85%.
The average score on this problem was 79%.
You’re concerned about the validity of your estimates because you think bigger apartments are more likely to have in-unit laundry, so you decide to repeat your analysis using one bedroom apartments only.
If your concern is valid and it is true that bigger apartments are
more likely to have in-unit laundry, how will your bootstrapped
estimates for the average monthly cost of in-unit laundry for one
bedroom apartments only compare to the values you computed in part (a)
based on all the apts?
The estimates will be about the same.
The estimates will be generally larger than those you computed in part (a).
The estimates will be generally smaller than those you computed in part (a).
Answer: The estimates will be generally smaller than those you computed in part (a).
If we query the yes and no DataFrames to
contain only one bedroom apartments, the average "Rent" of
these two DataFrames will probably be smaller than the original
DataFrames. Because these two DataFrames now have a smaller mean, their bootstrapped estimates are also likely to be smaller than they originally were.
Another way to think of it is to call our original yes and no DataFrames yes_population and no_population, respectively. If we plot yes_population and no_population as histograms, we’ll likely see high-magnitude "Rent" outliers (the bigger apartments). Removing these outliers puts us in a scenario similar to the one the question asks about. Bootstrapping this smaller subset without the outliers will most likely give a smaller estimate than the yes_population and no_population bootstraps did.
The average score on this problem was 80%.
Consider the distribution of laundry_stats as computed in part (a). How would this distribution change if we:

increased the number of repetitions?
The distribution would be wider.
The distribution would be narrower.
The distribution would not change significantly.

only used half of the rows of apts?
The distribution would be wider.
The distribution would be narrower.
The distribution would not change significantly.
Answer: The distribution would not change significantly (if we increased the number of repetitions); the distribution would be wider (if we only used half of the rows of apts).
When the number of repetitions is increased, the overall distribution ends up looking about the same. If anything, increasing the number of repetitions makes the empirical distribution of bootstrapped statistics a smoother approximation of the same underlying distribution.
If only half of the rows of apts are used, each resample is based on less data, so there is more variability in the bootstrapped statistics, leading to a wider distribution.
The average score on this problem was 36%.
The average score on this problem was 72%.
We want to estimate the mean distance of Tour de France stages by
bootstrapping 10,000 times and constructing a 90% confidence interval
for the mean. In this question, suppose random_stages is a
random sample of size 500 drawn with replacement from
stages. Identify the line numbers with errors in the code
below. In the adjacent box, point out the error by describing the
mistake in less than 10 words or writing a code snippet (correct only
the part you think is wrong). You may or may not need all the spaces
provided below to identify errors.
line 1: means = np.array([])
line 2:
line 3: for i in 10000:
line 4:     resample = random_stages.sample(10000)
line 5:     resample_mean = resample.get("Distance").mean()
line 6:     np.append(means, resample_mean)
line 7:
line 8: left_bound = np.percentile(means, 0)
line 9: right_bound = np.percentile(means, 90)

Answer:
line 3: for i in np.arange(10000):
The for loop syntax is incorrect. 10000 is an integer, not an iterable. To iterate 10,000 times, np.arange(10000) must be used.
line 4: resample = random_stages.sample(500, replace=True)
Each resample should be the same size as the original sample (500, not 10,000), and replace=True is required for sampling with replacement.
line 6: means = np.append(means, resample_mean)
np.append does not modify the array in place. The means array must be reassigned to include the new value.
lines 8 and 9: left_bound = np.percentile(means, 5) and right_bound = np.percentile(means, 95)
A 90% confidence interval captures the middle 90% of the bootstrapped means, so its endpoints are the 5th and 95th percentiles, not the 0th and 90th.
The average score on this problem was 88%.
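For reference, the fully corrected code would look like the sketch below (assuming, as in the problem, that numpy is imported as np and random_stages is the random sample of 500 stages).

means = np.array([])

for i in np.arange(10000):
    # Resample 500 stages (the original sample size) with replacement.
    resample = random_stages.sample(500, replace=True)
    resample_mean = resample.get("Distance").mean()
    means = np.append(means, resample_mean)

# Middle 90% of the bootstrapped means.
left_bound = np.percentile(means, 5)
right_bound = np.percentile(means, 95)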
True or False: Suppose that from a sample, you compute a 95% bootstrapped confidence interval for a population parameter to be the interval [L, R]. Then the average of L and R is the mean of the original sample.
Answer: False
A 95% confidence interval indicates we are 95% confident that the true population parameter falls within the interval [L, R]. Note that the problem specifies that the confidence interval is bootstrapped: it is created by resampling the data with replacement over and over again, and L and R are the 2.5th and 97.5th percentiles of the resulting distribution of bootstrapped means. While the interval is typically centered near the sample mean due to the nature of bootstrapping, the average of L and R may not exactly equal the sample mean, though it should be close to it. Additionally, the 2.5th and 97.5th percentiles are not necessarily the same distance away from the mean of the original sample.
The average score on this problem was 87%.
Suppose you do an experiment in which you do some random process 500 times and calculate the value of some statistic, which is a count of how many times a certain phenomenon occurred out of the 500 trials. You repeat the experiment 10,000 times and draw a histogram of the 10,000 statistics.
Is this histogram a probability histogram or an empirical histogram?
probability histogram
empirical histogram
Answer: empirical histogram
Empirical histograms refer to distributions of observed data. Since the question at hand is conducting an experiment and creating a histogram of observed data from these trials, the correct answer is an empirical histogram.
The average score on this problem was 90%.
If you instead repeat the experiment 100,000 times, how will the histogram change?
it will become wider
it will become narrower
it will barely change at all
Answer: it will barely change at all
Repeating the experiment more times will barely change the histogram. Each experiment still consists of 500 trials of the random process, so the distribution of the statistic itself does not change; with 100,000 repetitions, the empirical histogram is simply an even better approximation of the same underlying probability histogram.
The average score on this problem was 57%.
For each experiment, if you instead do the random process 5,000 times, how will the histogram change?
it will become wider
it will become narrower
it will barely change at all
Answer: it will become wider
By increasing the number of trials in the random process from 500 to 5,000, we increase the possible range of values of the statistic. The statistic being calculated is the count of how many times a phenomenon occurs, so when the number of trials increases tenfold, the statistic can take values in [0, 5000] instead of [0, 500], which makes the histogram wider (due to the wider range of values it can take).
The average score on this problem was 39%.
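As an aside (not part of the original solution): if each trial independently produces the phenomenon with some probability p, the count over n trials has standard deviation \sqrt{n \cdot p(1-p)}, so multiplying the number of trials by 10 multiplies the spread of the histogram by about \sqrt{10}:

\text{SD of the count} = \sqrt{n \cdot p(1-p)}, \qquad \frac{\sqrt{5000 \cdot p(1-p)}}{\sqrt{500 \cdot p(1-p)}} = \sqrt{10} \approx 3.2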
In our sample, we have data on 163 medals for the sport of table tennis. Based on our data, China seems to really dominate this sport, earning 81 of these medals.
That’s nearly half of the medals for just one country! We want to do a hypothesis test with the following hypotheses to see if this pattern is true in general, or just happens to be true in our sample.
Null: China wins half of Olympic table tennis medals.
Alternative: China does not win half of Olympic table tennis medals.
Why can these hypotheses be tested by constructing a confidence interval?
Since proportions are means, so we can use the CLT.
Since the test aims to determine whether a parameter is equal to a fixed value.
Since we need to get a sense of how other samples would come out by bootstrapping.
Since the test aims to determine if our sample came from a known population distribution.
Answer: Since the test aims to determine whether a parameter is equal to a fixed value
The goal of a confidence interval is to provide a range of values that, given the data, are considered plausible for the parameter in question. If the null hypothesis’ fixed value does not fall within this interval, it suggests that the observed data is not very compatible with the null hypothesis. Thus in our case, if a 95% confidence interval for the proportion of medals won by China does not include ~0.5, then there’s statistical evidence at the 5% significance level to suggest that China does not win exactly half of the medals. So again in our case, confidence intervals work to test this hypothesis because we are attempting to find out whether or not a proportion of one half (0.5) lies within our interval at the 95% confidence level.
The average score on this problem was 44%.
Suppose we construct a 95% bootstrapped CI for the proportion of Olympic table tennis medals won by China. Select all true statements.
The true proportion of Olympic table tennis medals won by China has a 95% chance of falling within the bounds of our interval.
If we resampled our original sample and calculated the proportion of Olympic table tennis medals won by China in our resample, there is approximately a 95% chance our interval would contain this number.
95% of Olympic table tennis medals are won by China.
None of the above.
Answer: If we resampled our original sample and calculated the proportion of Olympic table tennis medals won by China in our resample, there is approximately a 95% chance our interval would contain this number.
The second option is the only correct answer because it accurately describes the process and interpretation of a bootstrap confidence interval. A 95% bootstrapped confidence interval means that if we repeatedly sampled from our original sample and constructed the interval each time, approximately 95% of those intervals would contain the true parameter. This statement does not imply that the true proportion has a 95% chance of falling within any single interval we construct; instead, it reflects the long-run proportion of such intervals that would contain the true proportion if we could repeat the process indefinitely. Thus, the confidence interval gives us a method to estimate the parameter with a specified level of confidence based on the resampling procedure.
The average score on this problem was 73%.
True or False: In this scenario, it would also be appropriate to create a 95% CLT-based confidence interval.
True
False
Answer: True
The statement is true because the Central Limit Theorem (CLT) applies to the sampling distribution of the proportion, given that the sample size is large enough, which in our case, with 163 medals, it is. The CLT asserts that the distribution of the sample mean (or proportion, in our case) will approximate a normal distribution as the sample size grows, allowing the use of standard methods to create confidence intervals. Therefore, a CLT-based confidence interval is appropriate for estimating the true proportion of Olympic table tennis medals won by China.
The average score on this problem was 71%.
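As an aside, here is a sketch (not part of the original solution) of how such a CLT-based interval could be computed from this sample; it uses the fact that the standard deviation of a column of 0s and 1s with proportion p of 1s is \sqrt{p(1-p)}, and the numbers are only illustrative.

import numpy as np

# Sample: 81 of 163 table tennis medals won by China.
n = 163
p_hat = 81 / n                              # observed proportion, about 0.497
sample_sd = np.sqrt(p_hat * (1 - p_hat))    # SD of a column of 0s and 1s
sd_of_proportion = sample_sd / np.sqrt(n)   # CLT: SD of the sample proportion

# CLT-based 95% confidence interval: sample proportion plus or minus 2 SDs.
left = p_hat - 2 * sd_of_proportion
right = p_hat + 2 * sd_of_proportion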
True or False: If our 95% bootstrapped CI came out to be [0.479, 0.518], we would reject the null hypothesis at the 0.05 significance level.
True
False
Answer: False
This is false; we would fail to reject the null hypothesis because the interval [0.479, 0.518] includes the value 0.5, which corresponds to the null hypothesis that China wins half of the Olympic table tennis medals. If the confidence interval contains the hypothesized value, there is not enough statistical evidence to reject the null hypothesis at the specified significance level. In this case, the data does not provide sufficient evidence to conclude that the proportion of medals won by China is different from 0.5 at the 0.05 significance level.
The average score on this problem was 92%.
True or False: If we instead chose to test these hypotheses at the 0.01 significance level, the confidence interval we’d create would be wider.
True
False
Answer: True
Lowering the significance level means that you require more evidence to reject the null hypothesis, thus seeking a higher confidence in your interval estimate. A higher confidence level corresponds to a wider interval because it must encompass a larger range of values to ensure that it contains the true population parameter with the increased probability. Thus as we lower the significance level, the interval we create will be wider, making this statement true.
The average score on this problem was 79%.
True or False: If we instead chose to test these hypotheses at the 0.01 significance level, we would be more likely to conclude a statistically significant result.
True
False
Answer: False
This statement is false. A smaller significance level lowers the chance of getting a statistically significant result; at the 0.01 significance level, the hypothesized value has to fall outside a 99% confidence interval to be statistically significant. In addition, the hypothesized value (0.5) was already contained within the narrower 95% confidence interval, so we failed to reject the null hypothesis at the 0.05 significance level. This guarantees failing to reject the null hypothesis at the 0.01 significance level, since whatever is contained in a 95% confidence interval must also be contained in the wider 99% confidence interval built from the same bootstrapped distribution. Thus, this answer is false.
The average score on this problem was 62%.
The Death Eaters are a powerful group of dark wizards who oppose
Harry Potter and his allies. Each Death Eater receives a unique
identification number based on their order of initiation, ranging from
1 to N, where N represents the
total number of Death Eaters.
Your task is to estimate the value of N so you can
understand how many enemies you face. You have a random sample of
identification numbers in a DataFrame named death_eaters
containing a single column called "ID".
Which of the options below would be an appropriate estimate for the total number of Death Eaters? Select all that apply.
death_eaters.get("ID").max()
death_eaters.get("ID").sum()
death_eaters.groupby("ID").count()
int(death_eaters.get("ID").mean() * 2)
death_eaters.shape[0]
None of the above.
Answer: death_eaters.get("ID").max()
and int(death_eaters.get("ID").mean() * 2)
Option 1: death_eaters.get("ID").max() returns the
maximum ID from the sample. This is an appropriate estimate since the
population size must be at least the size of the largest ID in our
sample. For instance, if the maximum ID observed is 250, then the total
number of Death Eaters must be at least 250.
Option 2: death_eaters.get("ID").sum() returns the
sum of all ID numbers in the sample. The total sum of IDs has no
meaningful connection to the population size, which makes this an
inappropriate estimate.
Option 3: death_eaters.groupby("ID").count() groups
the data by ID and counts occurrences. Since each ID is unique and
death_eaters only includes the "ID" column,
grouping simply shows that each ID appears once. This is not an
appropriate estimate for N.
Option 4: int(death_eaters.get("ID").mean() * 2)
returns twice the mean of the sample IDs as an integer. The mean of a
random sample of the numbers 1 through N usually falls
about halfway between 1 and N. So we can appropriately
estimate N by doubling this mean.
Option 5: death_eaters.shape[0] returns the number
of rows in death_eaters (i.e., the sample size). The sample
size does not reflect the total population size, making it an
inappropriate estimate.
The average score on this problem was 66%.
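To see why the two selected estimates are reasonable, here is a small simulation sketch (an illustration only; the true value N = 1000 and the sample size of 50 are made up for this example).

import numpy as np

N = 1000  # hypothetical true number of Death Eaters
sample_ids = np.random.choice(np.arange(1, N + 1), 50, replace=False)

# The largest observed ID can never exceed N and tends to be close to it.
sample_ids.max()

# The mean of a random sample of 1, 2, ..., N is around N / 2, so doubling it estimates N.
int(sample_ids.mean() * 2)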
Each box that you selected in part (a) is an example of what?
a distribution
a statistic
a parameter
a resample
Answer: a statistic
The options in part (a) calculate a numerical value from the random
sample death_eaters. This fits the definition of a
statistic.
The average score on this problem was 82%.
Suppose you have access to a function called estimate,
which takes in a Series of Death Eater ID numbers and returns an
estimate for N. Fill in the blanks below to do the
following:
Create an array named boot_estimates, containing
10,000 of these bootstrapped estimates of N, based on the
data in death_eaters.
Set left_72 to the left endpoint of
a 72% confidence interval for N.
boot_estimates = np.array([])
for i in np.arange(10000):
    boot_estimates = np.append(boot_estimates, __(a)__)

left_72 = __(b)__
What goes in blank (a)?
Answer:
estimate(death_eaters.sample(death_eaters.shape[0], replace=True).get("ID"))
In the given code, we use a for loop to generate 10,000 bootstrapped
estimates of N and append them to the array
boot_estimates. Blank (a) specifically computes one
bootstrapped estimate of N. Here’s how key parts of the
solution work:
death_eaters.sample(death_eaters.shape[0], replace=True):
To bootstrap, we need to resample the data with replacement. The sample() method takes as arguments the sample size (death_eaters.shape[0]) and whether to sample with replacement (replace=True).
.get("ID"): Since estimate() takes a
Series as input, we need to extract the ID column from the
resample.
estimate(): The resampled ID column is passed into
the estimate() function to generate one bootstrapped
estimate of N.
The average score on this problem was 62%.
What goes in blank (b)?
Answer:
np.percentile(boot_estimates, 14)
A 72% confidence interval captures the middle 72% of our
distribution. This leaves 28% of the data outside the interval, with 14%
from the lower tail and 14% from the upper tail. Thus, the left endpoint
corresponds to the 14th percentile of boot_estimates. The
np.percentile() function takes as arguments the array to compute the percentile over (boot_estimates) and the desired percentile (14).
The average score on this problem was 91%.
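Putting blanks (a) and (b) together, the completed code would look like the sketch below (a reference sketch that assumes numpy is imported as np and that estimate and death_eaters are as described in the problem).

boot_estimates = np.array([])
for i in np.arange(10000):
    # Resample the IDs with replacement and compute one bootstrapped estimate of N.
    resample_ids = death_eaters.sample(death_eaters.shape[0], replace=True).get("ID")
    boot_estimates = np.append(boot_estimates, estimate(resample_ids))

# Left endpoint of a 72% confidence interval: 14% of the estimates lie below it.
left_72 = np.percentile(boot_estimates, 14)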