Lecture 14 — Practice



Lecture 14 — Collected Practice Questions

Below are practice problems tagged for Lecture 14 (rendered directly from the original exam/quiz sources).


Source: fa22-final — Q3

Problem 1

In apps, our sample of 1,000 credit card applications, 500 of the applications come from homeowners and 500 come from people who don’t own their own home. In this sample, homeowner ages have a mean of 40 and standard deviation of 10. We want to use the bootstrap method to compute a confidence interval for the mean age of a homeowner in the population of all credit card applicants.


Problem 1.1

Note: This problem is out of scope; it covers material no longer included in the course.

Suppose our computer is too slow to bootstrap 10,000 times, and instead can only bootstrap 20 times. Here are the 20 resample means, sorted in ascending order: \begin{align*} &37, 38, 39, 39, 40, 40, 40, 40, 41 , 41, \\ &42, 42, 42, 42, 42, 42, 43, 43, 43 , 44 \end{align*} What are the left and right endpoints of a bootstrapped 80% confidence interval for the population mean? Use the mathematical definition of percentile.

Answer: Left endpoint = 38, Right endpoint = 43

To find an 80% confidence interval, we need to find the 10th and 90th percentiles of the resample means. Using the mathematical definition of percentile, the 10th percentile is at position 0.1*20 = 2 when we count starting with 1. Since 38 is the second element of the sorted data, that is the left endpoint of our confidence interval.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 63%.

Similarly, the 90th percentile is at position 0.9*20 = 18 when we count starting with 1. Since 43 is the 18th element of the sorted data, that is the right endpoint of our confidence interval.
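As a quick check, here is a small sketch (not part of the original problem) that applies the mathematical definition of percentile used above, where the pth percentile sits at position (p/100) · n, rounded up and counted starting from 1:

    import numpy as np

    # The 20 sorted resample means from the problem.
    means = np.array([37, 38, 39, 39, 40, 40, 40, 40, 41, 41,
                      42, 42, 42, 42, 42, 42, 43, 43, 43, 44])

    def percentile_math_def(sorted_data, p):
        # Position (p/100) * n, rounded up, counting from 1.
        position = int(np.ceil(p / 100 * len(sorted_data)))
        return sorted_data[position - 1]  # subtract 1 for 0-indexing

    percentile_math_def(means, 10)   # 38, the left endpoint
    percentile_math_def(means, 90)   # 43, the right endpoint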


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 65%.


Problem 1.2

Note: This problem is out of scope; it covers material no longer included in the course.

True or False: Using the mathematical definition of percentile, the 50th percentile of the bootstrapped distribution above equals its median.

Answer: False

The 50th percentile according to the mathematical definition is the element at position 0.5*20 = 10 when we count starting with 1. The 10th element is 41. However, the median of a data set with 20 elements is halfway between the 10th and 11th values. So the median in this case is 41.5.
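Reusing the array of resample means from the sketch above makes the distinction concrete:

    import numpy as np

    means = np.array([37, 38, 39, 39, 40, 40, 40, 40, 41, 41,
                      42, 42, 42, 42, 42, 42, 43, 43, 43, 44])

    # 50th percentile by the mathematical definition: the element at
    # position 0.5 * 20 = 10 (counting from 1).
    means[10 - 1]      # 41

    # The median of 20 values averages the 10th and 11th values.
    np.median(means)   # 41.5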


Difficulty: ⭐️⭐️

The average score on this problem was 79%.


Problem 1.3

Consider the following three quantities:

  1. pop_mean, the unknown mean age of homeowners in the population of all credit card applicants.

  2. sample_mean, the mean age of homeowners in our sample of 500 applications in apps. We know this is 40.

  3. resample_mean, the mean age of homeowners in one particular resample of the applications in apps.

Which of the following statements about the relationship between these three quantities are guaranteed to be true? Select all that apply.

Answer: None of the above.

Whenever we take a sample from a population, there is no guaranteed relationship between the mean of the sample and the mean of the population. Sometimes the sample mean comes out larger than the population mean, sometimes smaller; the CLT only tells us that the distribution of the sample mean is centered at the population mean, not how any single sample mean compares to it. Similarly, when we resample from the original sample, the resample mean could be larger or smaller than the original sample’s mean. The three quantities pop_mean, sample_mean, and resample_mean can therefore be in any relative order, which means none of the statements listed here are guaranteed to be true.


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 37%.



Source: fa23-final — Q10

Problem 2

As a senior suffering from senioritis, Weiyue has plenty of time on his hands. 1,000 times, he repeats the following process, creating 1,000 confidence intervals:

  1. Collect a simple random sample of 100 rows from txn.

  2. Resample from his sample 10,000 times, computing the mean transaction amount in each resample.

  3. Create a 95% confidence interval by taking the middle 95% of resample means.

He then computes the width of each confidence interval by subtracting its left endpoint from its right endpoint; e.g. if [2, 5] is a confidence interval, its width is 3. This gives him 1,000 widths. Let M be the mean of these 1,000 widths.
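As a rough sketch of a single repetition of this process (the name of the transaction amount column in txn isn't shown in this excerpt, so "amount" below is just an assumed placeholder):

    import numpy as np

    def one_interval_width(txn):
        # Step 1: a simple random sample of 100 rows from txn.
        sample = txn.sample(100)

        # Step 2: bootstrap the sample 10,000 times, recording each resample mean.
        resample_means = np.array([])
        for i in np.arange(10000):
            resample = sample.sample(100, replace=True)
            resample_means = np.append(resample_means, resample.get("amount").mean())

        # Step 3: take the middle 95% of the resample means.
        left = np.percentile(resample_means, 2.5)
        right = np.percentile(resample_means, 97.5)

        # The width of this confidence interval; M is the mean of 1,000 such widths.
        return right - left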


Problem 2.1

Select the true statement below.

Answer: About 950 of Weiyue’s intervals will contain the mean transaction amount of all transactions in txn.

By the definition of a 95% confidence interval, about 95% of our 1,000 confidence intervals will contain the true mean transaction amount in the population from which our samples were drawn. In this case, the population is the txn DataFrame. So, about 950 of the confidence intervals will contain the mean transaction amount of all transactions in txn, which is what the second answer choice says.

We can’t conclude that the first answer choice is correct because our original sample was taken from txn, not from all transactions ever. We don’t know whether our resamples will be representative of all transactions ever. The third option is incorrect because we have no way of knowing what the first random sample looks like from a statistical standpoint. The last statement is not true because M concerns the width of the confidence interval, and therefore is unrelated to the statistics computed in each resample. For example, if the mean of each resample is around 100, but the width of each confidence interval is around 5, we shouldn’t expect M to be in any of the confidence intervals.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 55%.


Problem 2.2

Weiyue repeats his entire process, except this time, he changes his sample size in step 1 from 100 to 400. Let B be the mean of the widths of the 1,000 new confidence intervals he creates.

What is the relationship between M and B?

Answer: M > B

As the sample size increases, the width of the confidence intervals will decrease, so M > B.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 70%.


Problem 2.3

Weiyue repeats his entire process once again. This time, he still uses a sample size of 100 in step 1, but instead of creating 95% confidence intervals in step 3, he creates 99% confidence intervals. Let C be the mean of the widths of the 1,000 new confidence intervals he generates.

What is the relationship between M and C?

Answer: M < C

All else equal (note that the sample size is the same as it was in question 10.1), 99% confidence intervals will always be wider than 95% confidence intervals on the same data, so M < C.


Difficulty: ⭐️⭐️

The average score on this problem was 85%.


Problem 2.4

Weiyue repeats his entire process one last time. This time, he still uses a sample size of 100 in step 1, and creates 95% confidence intervals in step 3, but instead of bootstrapping, he uses the Central Limit Theorem to generate his confidence intervals. Let D be the mean of the widths of the 1,000 new confidence intervals he creates.

What is the relationship between M and D?

Answer: M \approx D

Confidence intervals generated from the Central Limit Theorem will be approximately the same as those generated from bootstrapping, so M is approximately equal to D.


Difficulty: ⭐️

The average score on this problem was 90%.



Source: fa23-final — Q12

Problem 3

On Reddit, Keenan also read that 22% of all online transactions are fraudulent. He decides to test the following hypotheses at the 0.16 significance level:

Keenan has access to a simple random sample of txn of size 500. In his sample, the proportion of transactions that are fraudulent is 0.23.

Below is an incomplete implementation of the function reject_null, which creates a bootstrap-based confidence interval and returns True if the conclusion of Keenan’s test is to reject the null hypothesis, and False if the conclusion is to fail to reject the null hypothesis, all at the 0.16 significance level.

    def reject_null():
        fraud_counts = np.array([])
        for i in np.arange(10000):
            fraud_count = np.random.multinomial(500, __(a)__)[0] 
            fraud_counts = np.append(fraud_counts, fraud_count)
            
        L = np.percentile(fraud_counts, __(b)__)
        R = np.percentile(fraud_counts, __(c)__)

        if __(d)__ < L or __(d)__ > R:
            # Return True if we REJECT the null.
            return True
        else:
            # Return False if we FAIL to reject the null.
            return False

Fill in the blanks so that reject_null works as intended.

Hint: Your answer to (d) should be an integer greater than 50.

Answer:

  • (a): [0.23, 0.77]
  • (b): 8
  • (c): 92
  • (d): 110

(a): Because we’re bootstrapping, we’re using the data from the original sample. This is not a “regular” hypothesis test where we simulate under the assumptions of the null. It’s more like the human body temperature example from lecture, where we construct a confidence interval and then decide which hypothesis to side with based on whether some value falls in the interval. Here, we’re told to make a bootstrapped confidence interval. Normally we’d use the .sample method for this, but here we’re forced to use np.random.multinomial, which also works: it samples with replacement from a categorical distribution, and since each transaction takes one of just two values (fraudulent or not), resampling from our original sample is the same as drawing from a categorical distribution.

We know that the proportion of fraudulent transactions in the sample is 0.23 (and therefore the non-fraudulent proportion is 0.77), so we use these as the probabilities for np.random.multinomial in our bootstrapping simulation. The syntax for this function requires us to pass in the probabilities as a list, so the answer is [0.23, 0.77].


Difficulty: ⭐️⭐️⭐️⭐️⭐️

The average score on this problem was 23%.

(b): Since we’re testing at the 0.16 significance level, we know that the proportion of data lying outside either of our endpoints is 0.16, or 0.08 on each side. So, the left endpoint is given by the 8th percentile, which means that the argument to np.percentile must be 8.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 67%.

(c): Similar to part B, we know that 0.08 of the data must lie to the right of the right endpoint, so the argument to np.percentile here is (1 - 0.08) \cdot 100 = 92.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 67%.

(d): To test our hypothesis, we must compare the left and right endpoints to the observed value. If the observed value is less than the left endpoint or greater than the right endpoint, we will reject the null hypothesis. Otherwise we fail to reject it. Since the left and right endpoints give the count of fraudulent transactions (not the proportion), we must convert our null hypothesis to similar terms. We can simply multiply the sample size by the proportion of fraudulent transactions to obtain the count that the null hypothesis would suggest given the sample size of 500, which gives us 500 * 0.22 = 110.


Difficulty: ⭐️⭐️⭐️⭐️⭐️

The average score on this problem was 26%.
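Putting the four blanks together, the completed function would read as follows (null count 500 * 0.22 = 110 and sample proportions 0.23 and 0.77, as given above):

    import numpy as np

    def reject_null():
        fraud_counts = np.array([])
        for i in np.arange(10000):
            # Resample 500 transactions from the original sample, in which 23%
            # were fraudulent; keep the count of fraudulent transactions.
            fraud_count = np.random.multinomial(500, [0.23, 0.77])[0]
            fraud_counts = np.append(fraud_counts, fraud_count)

        # Middle 84% of the bootstrapped counts (0.16 significance level).
        L = np.percentile(fraud_counts, 8)
        R = np.percentile(fraud_counts, 92)

        if 110 < L or 110 > R:
            # 110 = 500 * 0.22, the count the null hypothesis predicts.
            return True   # reject the null
        else:
            return False  # fail to reject the null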


Source: sp24-final — Q6

Problem 4

You want to use the data in apts to test both of the following pairs of hypotheses:

Pair 1:

Pair 2:

In apts, there are 467 apartments that are either one bedroom or two bedroom apartments. You perform the following simulation under the assumption of the null hypothesis.

prop_1br = np.array([])
abs_diff = np.array([])
for i in np.arange(10000):
    prop = np.random.multinomial(467, [0.5, 0.5])[0]/467
    prop_1br = np.append(prop_1br, prop)
    abs_diff = np.append(abs_diff, np.abs(prop-0.5))

You then calculate some percentiles of prop_1br. The following four expressions all evaluate to True.

np.percentile(prop_1br, 2.5) == 0.4
np.percentile(prop_1br, 5) == 0.42
np.percentile(prop_1br, 95) == 0.58
np.percentile(prop_1br, 97.5) == 0.6


Problem 4.1

What is prop_1br.mean() to two decimal places?

Answer: 0.5

From the given percentiles, we can see that the distribution is symmetric around 0.5: the 2.5th and 97.5th percentiles (0.4 and 0.6) are equally far from 0.5, and so are the 5th and 95th percentiles (0.42 and 0.58). Given this symmetry, the mean should be very close to 0.5.

Another way we can look at it is by noticing that prop is drawn using np.random.multinomial with probabilities [0.5, 0.5] (because we are simulating under the null hypothesis). This means we expect the simulated proportions to be centered around 0.5.


Difficulty: ⭐️⭐️

The average score on this problem was 84%.


Problem 4.2

What is np.std(prop_1br) to two decimal places?

Answer: 0.05

If we look again at the percentiles, we notice that the distribution seems to resemble a normal distribution. So, using the mean and the 97.5th percentile, we can solve for the standard deviation. Since the interval from the 2.5th to the 97.5th percentile covers the middle 95% of the data, the 97.5th percentile is about two standard deviations above the mean (and the 2.5th percentile is about two standard deviations below). Thus,

0.5 + 2 \cdot \text{SD} = 0.6

Solving for SD, we get \text{SD} = 0.05.


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 45%.


Problem 4.3

What is np.percentile(abs_diff, 95) to two decimal places?

Answer: 0.1

Each time through our for-loop, we execute the following lines of code:

prop_1br = np.append(prop_1br, prop)

abs_diff = np.append(abs_diff, np.abs(prop-0.5))

Additionally, we’re told the following statements evaluate to True:

np.percentile(prop_1br, 2.5) == 0.4

np.percentile(prop_1br, 5) == 0.42

np.percentile(prop_1br, 95) == 0.58

np.percentile(prop_1br, 97.5) == 0.6

We can combine these pieces of information to find the answer to this question.

First, consider the shape of the distribution of prop_1br. We know it’s symmetrical around 0.5, and beyond that, we can infer that it’s a normal distribution.

Now, think about how this relates to the distribution of abs_diff. abs_diff is generated by finding the absolute difference between prop_1br and 0.5. Because of this, abs_diff is an array of distances (which are nonnegative by definition) from 0.5.

We know that prop_1br is normal, and symmetrical about 0.5. So, the distribution of how far away prop_1br is from 0.5 will look like we took the distribution of prop_1br, moved it to be centered at 0, and folded it in half so that all negative values become positive. This is because the previous center at 0.5 represents a distance of 0 from 0.5. Similarly, a value of 0.6 would represent a distance of 0.1 from 0.5, and a value of 0.4 would also represent a distance of 0.1 from 0.5.

Now, the problem becomes much simpler to solve. Earlier, we were told that 95% of the data in prop_1br lies between 0.4 and 0.6 (thanks to the expressions that evaluate to True). This is the same as telling us that 95% of the data in prop_1br lies within a distance of 0.1 from 0.5 (because 0.4 and 0.6 are both 0.1 away from 0.5).

Because of this, the 95th percentile of abs_diff is 0.1: since 95% of the data in prop_1br lies within a distance of 0.1 from 0.5, 95% of the data in abs_diff is between 0 and 0.1.


Difficulty: ⭐️⭐️⭐️⭐️⭐️

The average score on this problem was 10%.


Problem 4.4

Which simulated test statistics should be used to test the first pair of hypotheses?

Answer: prop_1br

Our first pair of hypotheses has an alternative hypothesis that asks whether one number is greater than the other. Because of this, we can’t use an absolute-value test statistic to answer the question: the absolute value only captures how far the simulated proportion is from the null assumption, not whether one value is greater than the other.


Difficulty: ⭐️⭐️

The average score on this problem was 82%.


Problem 4.5

Which simulated test statistics should be used to test the second pair of hypotheses?

Answer: abs_diff

Our second pair of hypotheses has an alternative hypothesis that asks whether one number is not equal to the other. Because of this, we have to use a test statistic that measures distance in both directions, not just one. Therefore, we use the absolute difference.


Difficulty: ⭐️⭐️

The average score on this problem was 83%.


Problem 4.6

Your observed data in apts is such that you reject the null for the first pair of hypotheses at the 5% significance level, but fail to reject the null for the second pair at the 5% significance level. What could the value of the following proportion have been?

\frac{\text{\# of one bedroom apartments in \texttt{apts}}}{\text{\# of one bedroom apartments in \texttt{apts}+ \# of two bedroom apartments in \texttt{apts}}}

Give your answer as a number to two decimal places.

Answer: 0.59

For the first pair of hypotheses, whose alternative is one-sided, rejecting the null at the 5% significance level requires the observed proportion of one bedroom apartments to exceed the 95th percentile of prop_1br, which is 0.58. For the second pair, whose alternative is two-sided, failing to reject the null at the 5% significance level requires the observed absolute difference from 0.5 to be at most the 95th percentile of abs_diff, which is 0.1, so the observed proportion must be less than 0.6. Any value strictly between 0.58 and 0.6 works; to two decimal places, that gives 0.59.


Difficulty: ⭐️⭐️⭐️⭐️⭐️

The average score on this problem was 20%.



Source: sp24-final — Q7

Problem 5

You want to know how much extra it costs, on average, to have a washer and dryer in your apartment. Since this cost is built into the monthly rent, it isn’t clear how much of your rent will be going towards this convenience. You decide to bootstrap the data in apts to estimate the average monthly cost of having in-unit laundry.


Problem 5.1

Fill in the blanks to generate 10,000 bootstrapped estimates for the average monthly cost of in-unit laundry.

yes = apts[apts.get("Laundry")]
no = apts[apts.get("Laundry") == False]
laundry_stats = np.array([])
for i in np.arange(10000):
    yes_resample = yes.sample(__(a)__, __(b)__)
    no_resample = no.sample(__(c)__, __(d)__)
    one_stat = __(e)__
    laundry_stats = np.append(laundry_stats, one_stat)

Answer:

  • (a): yes.shape[0]
  • (b): replace=True
  • (c): no.shape[0]
  • (d): replace=True
  • (e): yes_resample.get("Rent").mean() - no_resample.get("Rent").mean()

For both yes_resample and no_resample, we need to use their respective DataFrames to create a bootstrapped estimate. Therefore, we randomly sample from each DataFrame with replacement, keeping the same size as the original (sampling with replacement is the defining feature of the bootstrap). Then, to calculate the statistic, we look back at what the question asks of us: to estimate the average monthly cost of having in-unit laundry. So we subtract the mean of the bootstrapped estimate for no (no_resample) from the mean of the bootstrapped estimate for yes (yes_resample).
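With those blanks filled in, the loop would read as follows (assuming apts is the DataFrame described in the problem):

    import numpy as np

    yes = apts[apts.get("Laundry")]
    no = apts[apts.get("Laundry") == False]
    laundry_stats = np.array([])
    for i in np.arange(10000):
        # Resample each group with replacement, keeping each group's original size.
        yes_resample = yes.sample(yes.shape[0], replace=True)
        no_resample = no.sample(no.shape[0], replace=True)
        # One bootstrapped estimate of the average monthly cost of in-unit laundry.
        one_stat = yes_resample.get("Rent").mean() - no_resample.get("Rent").mean()
        laundry_stats = np.append(laundry_stats, one_stat)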


Difficulty: ⭐️⭐️

The average score on this problem was 79%.


Problem 5.2

What if you wanted to instead estimate the average yearly cost of having in-unit laundry?

  1. Below, change the blank (e), such that the procedure now generates 10,000 bootstrapped estimates for the average yearly cost of in-unit laundry.

  2. Suppose you ran your original code from part (a) and used the results to calculate a confidence interval for the average monthly cost of in-unit laundry, which came out to be

[L_M, R_M].

Then, you changed blank (e) as you described above, and ran the code again to calculate a different confidence interval for the average yearly cost of in-unit laundry, which came out to be

[L_Y, R_Y].

Which of the following is the best description of the relationship between the endpoints of these confidence intervals? Note that the symbol \approx means “approximately equal.”

Answer:

  • (i) 12 * (yes_resample.get("Rent").mean() - no_resample.get("Rent").mean())
  • (ii) L_Y \approx 12 \cdot L_M and R_Y \approx 12 \cdot R_M

For both L_Y and R_Y, we cannot say with certainty that they will be precisely 12 times the corresponding monthly endpoints. Each yearly estimate is exactly 12 times a monthly-style estimate, but the second run of the code draws its own random resamples, so the endpoints of the new interval will only be approximately 12 times those of the first interval.

The bottom two choices flip the relationship, stating that the monthly endpoints are about 12 times the yearly endpoints, which would make the two intervals vastly different from one another.


Difficulty: ⭐️⭐️

The average score on this problem was 85%.


Difficulty: ⭐️⭐️

The average score on this problem was 79%.


Problem 5.3

You’re concerned about the validity of your estimates because you think bigger apartments are more likely to have in-unit laundry, so you decide to repeat your analysis using one bedroom apartments only.

If your concern is valid and it is true that bigger apartments are more likely to have in-unit laundry, how will your bootstrapped estimates for the average monthly cost of in-unit laundry for one bedroom apartments only compare to the values you computed in part (a) based on all the apts?

Answer: The estimates will be generally smaller than those you computed in part (a).

If we query the yes and no DataFrames to contain only one bedroom apartments, the average "Rent" in each of these DataFrames will probably be smaller than in the original DataFrames. Because these two groups now have smaller means, the bootstrapped estimates are also likely to be smaller than they originally were.

Another way to think about it: call the original yes and no DataFrames yes_population and no_population, respectively. If we plot the "Rent" values of yes_population and no_population in a histogram, we’ll likely see some high-rent outliers coming from larger apartments. Restricting to one bedroom apartments removes these outliers, which is essentially the scenario the question describes. Bootstrapping this smaller, outlier-free subset will most likely produce smaller estimates than bootstrapping yes_population and no_population.


Difficulty: ⭐️⭐️

The average score on this problem was 80%.


Problem 5.4

Consider the distribution of laundry_stats as computed in part (a). How would this distribution change if we:

  1. Increased the number of repetitions to 100,000?
  2. Started with only half of the rows in apts?

Answer:

    1. The distribution would not change significantly.
    2. The distribution would be wider.

  1. When the number of repetitions is increased, the overall distribution will end up looking about the same. If anything, increasing the number of repetitions makes the histogram of bootstrapped statistics a smoother, more stable picture of the same distribution.

  2. If only half of the rows are used, there would be more variability in each bootstrapped statistic, leading to a wider distribution.


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 36%.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 72%.



Source: su24-final — Q3

Problem 6

We want to estimate the mean distance of Tour de France stages by bootstrapping 10,000 times and constructing a 90% confidence interval for the mean. In this question, suppose random_stages is a random sample of size 500 drawn with replacement from stages. Identify the line numbers with errors in the code below. In the adjacent box, point out the error by describing the mistake in less than 10 words or writing a code snippet (correct only the part you think is wrong). You may or may not need all the spaces provided below to identify errors.


    line 1:      means = np.array([])
    line 2: 
    line 3:      for i in 10000:
    line 4:          resample = random_stages.sample(10000)
    line 5:          resample_mean = resample.get("Distance").mean()
    line 6:          np.append(means, resample_mean)
    line 7:    
    line 8:      left_bound = np.percentile(means, 0)
    line 9:      right_bound = np.percentile(means, 90)

Answer:

  • a): 3: for i in np.arange(10000):
    • The for loop syntax is incorrect. 10000 is an integer, not an iterable. To iterate 10,000 times, np.arange(10000) must be used.
  • b): 4: random_stages.sample(500, replace=True)
    • The bootstrap sample size should be 500 (matching the original sample size). Additionally, replace=True is required for sampling with replacement.
  • c): 6: means = np.append(means, resample_mean)
    • np.append does not modify the array in place. The means array must be reassigned to include the new value.
  • d): 8: np.percentile(means, 5)
    • A 90% confidence interval captures the middle 90% of the data or distribution. This means we exclude 10% of the data: 5% from the lower tail and 5% from the upper tail. The 0th percentile is incorrect for a 90% confidence interval. The lower bound should be the 5th percentile.
  • e): 9: np.percentile(means, 95)
    • The 90th percentile is incorrect for a 90% confidence interval. The upper bound should be the 95th percentile.
  • f): N/A: No more errors.
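Applying all of the corrections above, the fixed code would be (assuming random_stages is the sample described in the problem):

    import numpy as np

    means = np.array([])

    for i in np.arange(10000):                               # line 3 fixed
        resample = random_stages.sample(500, replace=True)   # line 4 fixed
        resample_mean = resample.get("Distance").mean()
        means = np.append(means, resample_mean)              # line 6 fixed

    left_bound = np.percentile(means, 5)                     # line 8 fixed
    right_bound = np.percentile(means, 95)                   # line 9 fixed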

Difficulty: ⭐️⭐️

The average score on this problem was 88%.


Problem 7

True or False: Suppose that from a sample, you compute a 95% bootstrapped confidence interval for a population parameter to be the interval [L, R]. Then the average of L and R is the mean of the original sample.

Answer: False

A 95% confidence interval indicates we are 95% confident that the true population parameter falls within the interval [L, R]. Note that the problem specifies that the confidence interval is bootstrapped: it is created by resampling the data with replacement over and over again and taking the middle 95% of the resulting resample means, so the mean of the original sample is not directly used in computing L and R. While the interval is typically centered near the sample mean, L (the 2.5th percentile of the bootstrapped means) and R (the 97.5th percentile) are not necessarily the same distance from the sample mean, so their average should be close to the sample mean but is not guaranteed to equal it. The statement is therefore false.


Difficulty: ⭐️⭐️

The average score on this problem was 87%.


Problem 8

Suppose you do an experiment in which you do some random process 500 times and calculate the value of some statistic, which is a count of how many times a certain phenomenon occurred out of the 500 trials. You repeat the experiment 10,000 times and draw a histogram of the 10,000 statistics.


Problem 8.1

Is this histogram a probability histogram or an empirical histogram?

Answer: empirical histogram

Empirical histograms refer to distributions of observed data. Since the question at hand involves conducting an experiment and creating a histogram of observed data from those trials, the correct answer is an empirical histogram.


Difficulty: ⭐️

The average score on this problem was 90%.


Problem 8.2

If you instead repeat the experiment 100,000 times, how will the histogram change?

Answer: it will barely change at all

Repeating the experiment more times will barely change the histogram. Each repetition of the experiment produces one more value of the statistic, drawn from the same distribution, so going from 10,000 to 100,000 repetitions just gives a smoother empirical picture of that same distribution; the statistic itself does not become any more or less variable.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 57%.


Problem 8.3

For each experiment, if you instead do the random process 5,000 times, how will the histogram change?

Answer: it will become wider

By increasing the number of trials in the random process from 500 to 5,000, we increase the possible range of values of the statistic. The statistic being calculated is the count of how many times a phenomenon occurs, so if the number of trials increases tenfold, the statistic can take values in [0, 5000] instead of [0, 500], and its values spread out over a wider range, making the histogram wider.
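A quick simulation illustrates the idea; the probability of the phenomenon isn't given in the problem, so p = 0.3 below is just an assumed placeholder:

    import numpy as np

    p = 0.3  # hypothetical probability of the phenomenon on one trial

    # 10,000 statistics when each experiment uses 500 trials vs. 5,000 trials.
    counts_500 = np.random.binomial(500, p, 10000)
    counts_5000 = np.random.binomial(5000, p, 10000)

    # The counts spread out over a much wider range with more trials,
    # so the histogram of the 5,000-trial statistics is wider.
    print(np.std(counts_500))    # roughly 10
    print(np.std(counts_5000))   # roughly 32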


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 39%.



Source: wi24-final — Q13

Problem 9

In our sample, we have data on 163 medals for the sport of table tennis. Based on our data, China seems to really dominate this sport, earning 81 of these medals.

That’s nearly half of the medals for just one country! We want to do a hypothesis test with the following hypotheses to see if this pattern is true in general, or just happens to be true in our sample.

Null: China wins half of Olympic table tennis medals.

Alternative: China does not win half of Olympic table tennis medals.


Problem 9.1

Why can these hypotheses be tested by constructing a confidence interval?

Answer: Since the test aims to determine whether a parameter is equal to a fixed value

The goal of a confidence interval is to provide a range of values that, given the data, are considered plausible for the parameter in question. If the null hypothesis’ fixed value does not fall within this interval, it suggests that the observed data is not very compatible with the null hypothesis. Thus in our case, if a 95% confidence interval for the proportion of medals won by China does not include 0.5, then there’s statistical evidence at the 5% significance level to suggest that China does not win exactly half of the medals. So again in our case, confidence intervals work to test this hypothesis because we are attempting to find out whether or not half of the medals (0.5) lies within our interval at the 95% confidence level.


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 44%.



Problem 9.2

Suppose we construct a 95% bootstrapped CI for the proportion of Olympic table tennis medals won by China. Select all true statements.

Answer: If we resampled our original sample and calculated the proportion of Olympic table tennis medals won by China in our resample, there is approximately a 95% chance our interval would contain this number.

The second option is the only correct answer because it follows directly from how the bootstrapped interval is constructed: the interval consists of the middle 95% of the resample proportions, so if we took one more resample and computed the proportion of table tennis medals won by China, there is approximately a 95% chance that this number would fall inside the interval. The other statements misinterpret the confidence level: a 95% confidence interval does not mean the true proportion has a 95% chance of falling within any single interval we construct; rather, the 95% reflects the long-run fraction of such intervals that would contain the true proportion if we could repeat the whole process of sampling and building intervals indefinitely.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 73%.


Problem 9.3

True or False: In this scenario, it would also be appropriate to create a 95% CLT-based confidence interval.

Answer: True

The statement is true because the Central Limit Theorem (CLT) applies to the sampling distribution of the proportion when the sample size is large enough, which it is in our case with 163 medals. The CLT asserts that the distribution of the sample mean (or proportion, in our case) will approximate a normal distribution as the sample size grows, allowing the use of standard methods to create confidence intervals. Therefore, a CLT-based confidence interval is appropriate for estimating the true proportion of Olympic table tennis medals won by China.
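As a sketch of what that CLT-based interval could look like, using the sample data given above (81 of 163 medals) and the rule of thumb that the middle 95% of a normal distribution lies within about two standard deviations of the mean:

    import numpy as np

    n = 163
    sample_prop = 81 / n   # observed proportion of medals won by China

    # Estimated standard deviation of the sample proportion.
    sd_of_prop = np.sqrt(sample_prop * (1 - sample_prop)) / np.sqrt(n)

    # CLT-based 95% confidence interval: roughly sample proportion ± 2 SDs.
    left = sample_prop - 2 * sd_of_prop
    right = sample_prop + 2 * sd_of_prop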


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 71%.



Problem 9.4

True or False: If our 95% bootstrapped CI came out to be [0.479, 0.518], we would reject the null hypothesis at the 0.05 significance level.

Answer: False

This is false, we would fail to reject the null hypothesis because the interval [0.479, 0.518] includes the value of 0.5, which corresponds to the null hypothesis that China wins half of the Olympic table tennis medals. If the confidence interval contains the hypothesized value, there is not enough statistical evidence to reject the null hypothesis at the specified significance level. In this case, the data does not provide sufficient evidence to conclude that the proportion of medals won by China is different from 0.5 at the 0.05 significance level.


Difficulty: ⭐️

The average score on this problem was 92%.



Problem 9.5

True or False: If we instead chose to test these hypotheses at the 0.01 significance level, the confidence interval we’d create would be wider.

Answer: True

Lowering the significance level means that you require more evidence to reject the null hypothesis, thus seeking a higher confidence in your interval estimate. A higher confidence level corresponds to a wider interval because it must encompass a larger range of values to ensure that it contains the true population parameter with the increased probability. Thus as we lower the significance level, the interval we create will be wider, making this statement true.


Difficulty: ⭐️⭐️

The average score on this problem was 79%.



Problem 9.6

True or False: If we instead chose to test these hypotheses at the 0.01 significance level, we would be more likely to conclude a statistically significant result.

Answer: False

This statement is false. A smaller significance level lowers the chance of getting a statistically significant result; at the 0.01 significance level, the hypothesized value has to fall outside a 99% confidence interval for the result to be statistically significant. In addition, the hypothesized value of 0.5 was already contained within the narrower 95% confidence interval, so we failed to reject the null hypothesis at the 0.05 significance level. This guarantees failing to reject the null hypothesis at the 0.01 significance level, since anything contained in a 95% confidence interval is also contained in the corresponding 99% confidence interval. Thus, this answer is false.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 62%.



Source: wi25-final — Q2

Problem 10

The Death Eaters are a powerful group of dark wizards who oppose Harry Potter and his allies. Each Death Eater receives a unique identification number based on their order of initiation, ranging from 1 to N, where N represents the total number of Death Eaters.

Your task is to estimate the value of N so you can understand how many enemies you face. You have a random sample of identification numbers in a DataFrame named death_eaters containing a single column called "ID".


Problem 10.1

Which of the options below would be an appropriate estimate for the total number of Death Eaters? Select all that apply.

Answer: death_eaters.get("ID").max() and int(death_eaters.get("ID").mean() * 2)

  • Option 1: death_eaters.get("ID").max() returns the maximum ID from the sample. This is an appropriate estimate since the population size must be at least the size of the largest ID in our sample. For instance, if the maximum ID observed is 250, then the total number of Death Eaters must be at least 250.

  • Option 2: death_eaters.get("ID").sum() returns the sum of all ID numbers in the sample. The total sum of IDs has no meaningful connection to the population size, which makes this an inappropriate estimate.

  • Option 3: death_eaters.groupby("ID").count() groups the data by ID and counts occurrences. Since each ID is unique and death_eaters only includes the "ID" column, grouping simply shows that each ID appears once. This is not an appropriate estimate for N.

  • Option 4: int(death_eaters.get("ID").mean() * 2) returns twice the mean of the sample IDs as an integer. The mean of a random sample of the numbers 1 through N usually falls about halfway between 1 and N. So we can appropriately estimate N by doubling this mean.

  • Option 5: death_eaters.shape[0] returns the number of rows in death_eaters (i.e., the sample size). The sample size does not reflect the total population size, making it an inappropriate estimate.
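To see why the two selected statistics are sensible, here is a small simulation sketch with a made-up population size (N = 300 and a sample of 50 IDs are purely illustrative and not part of the problem):

    import numpy as np

    # Hypothetical setup: suppose there are really N = 300 Death Eaters
    # and we observe a random sample of 50 distinct ID numbers.
    N = 300
    sample_ids = np.random.choice(np.arange(1, N + 1), 50, replace=False)

    # Estimate 1: the largest observed ID (it can only underestimate N).
    sample_ids.max()

    # Estimate 2: twice the sample mean, since the IDs 1 through N
    # average to about N / 2.
    int(sample_ids.mean() * 2)

    # Both values should land reasonably close to 300.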


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 66%.


Problem 10.2

Each box that you selected in part (a) is an example of what?

Answer: a statistic

The options in part (a) calculate a numerical value from the random sample death_eaters. This fits the definition of a statistic.


Difficulty: ⭐️⭐️

The average score on this problem was 82%.


Problem 10.3

Suppose you have access to a function called estimate, which takes in a Series of Death Eater ID numbers and returns an estimate for N. Fill in the blanks below to do the following:

    boot_estimates = np.array([])
    
    for i in np.arange(10000):
        boot_estimates = np.append(boot_estimates, __(a)__)

    left_72 = __(b)__
    

What goes in blank (a)?

Answer: estimate(death_eaters.sample(death_eaters.shape[0], replace=True).get("ID"))

In the given code, we use a for loop to generate 10,000 bootstrapped estimates of N and append them to the array boot_estimates. Blank (a) specifically computes one bootstrapped estimate of N. Here’s how key parts of the solution work:

  • death_eaters.sample(death_eaters.shape[0], replace=True): To bootstrap, we need to resample the data with replacement. The sample() method takes as arguments the number of rows to sample (death_eaters.shape[0]) and whether to sample with replacement (replace=True).

  • .get("ID"): Since estimate() takes a Series as input, we need to extract the ID column from the resample.

  • estimate(): The resampled ID column is passed into the estimate() function to generate one bootstrapped estimate of N.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 62%.


Problem 10.4

What goes in blank (b)?

Answer: np.percentile(boot_estimates, 14)

A 72% confidence interval captures the middle 72% of our distribution. This leaves 28% of the data outside the interval: 14% in the lower tail and 14% in the upper tail. Thus, the left endpoint corresponds to the 14th percentile of boot_estimates. The np.percentile() function takes as arguments the array to compute the percentile from (boot_estimates) and the desired percentile (14).
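Putting parts (a) and (b) together, the completed code would read (assuming death_eaters and estimate are defined as described in the problem):

    import numpy as np

    boot_estimates = np.array([])

    for i in np.arange(10000):
        # One bootstrapped estimate: resample the IDs with replacement,
        # then pass the resampled Series to estimate.
        resample = death_eaters.sample(death_eaters.shape[0], replace=True)
        boot_estimates = np.append(boot_estimates, estimate(resample.get("ID")))

    # Left endpoint of a 72% confidence interval: the 14th percentile.
    left_72 = np.percentile(boot_estimates, 14)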


Difficulty: ⭐️

The average score on this problem was 91%.