Lecture 17 — Practice



Lecture 17 — Collected Practice Questions

Below are practice problems tagged for Lecture 17 (rendered directly from the original exam/quiz sources).


Problem 1


Problem 1.1

The Museum of Natural History has a large collection of dinosaur bones, and they know the approximate year each bone is from. They want to use this sample of dinosaur bones to estimate the total number of years that dinosaurs lived on Earth. We’ll make the assumption that the sample is a uniform random sample from the population of all dinosaur bones. Which statistic below will give the best estimate of the population parameter?

Answer: sample max - sample min

Our goal is to estimate the total number of years that dinosaurs lived on Earth. In other words, we want to know the range of time that dinosaurs lived on Earth, and by definition range = biggest value - smallest value. By using “sample max - sample min”, we calculate the difference in years between the latest and earliest dinosaur bones in this uniform random sample, which estimates the population range.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 52%.


Problem 1.2

The curator at the Museum of Natural History, who happens to have taken a data science course in college, points out that the estimate of the parameter obtained from this sample could certainly have come out differently, if the museum had started with a different sample of bones. The curator suggests trying to understand the distribution of the sample statistic. Which of the following would be an appropriate way to create that distribution?

Answer: neither bootstrapping nor the Central Limit Theorem

Recall that the Central Limit Theorem (CLT) says that the probability distribution of the sum or average of a large random sample drawn with replacement will be roughly normal, regardless of the distribution of the population from which the sample is drawn. Thus, the theorem only applies when our statistic is a sum or an average; in this question, our statistic is the range, so the CLT does not apply.

Bootstrapping is a valid technique for estimating parameters like the population mean, median, or standard deviation, which don’t depend heavily on a few extreme values in the sample. It doesn’t work well for estimating extreme or sensitive values, like the population maximum or minimum. Since the statistic we’re trying to estimate is the difference between the population maximum and population minimum, bootstrapping is not appropriate.


Difficulty: ⭐️⭐️⭐️⭐️⭐️

The average score on this problem was 20%.



Problem 2

The DataFrame apps contains application data for a random sample of 1,000 applicants for a particular credit card from the 1990s. The "age" column contains the applicants’ ages, in years, to the nearest twelfth of a year.

The credit card company that owns the data in apps, BruinCard, has decided not to give us access to the entire apps DataFrame, but instead just a random sample of 100 rows of apps called hundred_apps.

We are interested in estimating the mean age of all applicants in apps given only the data in hundred_apps. The ages in hundred_apps have a mean of 35 and a standard deviation of 10.


Problem 2.1

Give the endpoints of the CLT-based 95% confidence interval for the mean age of all applicants in apps, based on the data in hundred_apps.

Answer: Left endpoint = 33, Right endpoint = 37

According to the Central Limit Theorem, the standard deviation of the distribution of the sample mean is \frac{\text{sample SD}}{\sqrt{\text{sample size}}} = \frac{10}{\sqrt{100}} = 1. Then using the fact that the distribution of the sample mean is roughly normal, since 95% of the area of a normal curve falls within two standard deviations of the mean, we can find the endpoints of the 95% CLT-based confidence interval as 35 - 2 = 33 and 35 + 2 = 37.

We can think of this as using the formula below: \left[\text{sample mean} - 2\cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}}, \: \text{sample mean} + 2\cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}} \right]. Plugging in the appropriate quantities yields [35 - 2\cdot\frac{10}{\sqrt{100}}, 35 + 2\cdot\frac{10}{\sqrt{100}}] = [33, 37].
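
In code, a quick check of these endpoints might look like the following sketch (the values 35, 10, and 100 come from the problem statement):

    import numpy as np
    sample_mean, sample_sd, sample_size = 35, 10, 100
    sd_of_sample_mean = sample_sd / np.sqrt(sample_size)  # 1.0
    left = sample_mean - 2 * sd_of_sample_mean            # 33.0
    right = sample_mean + 2 * sd_of_sample_mean           # 37.0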


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 67%.


Problem 2.2

BruinCard reinstates our access to apps so that we can now easily extract information about the ages of all applicants. We determine that, just like in hundred_apps, the ages in apps have a mean of 35 and a standard deviation of 10. This raises the question of how other samples of 100 rows of apps would have turned out, so we compute 10,000 sample means as follows.

    sample_means = np.array([])
    for i in np.arange(10000):
        sample_mean = apps.sample(100, replace=True).get("age").mean()
        sample_means = np.append(sample_means, sample_mean)

Which of the following three visualizations best depicts the distribution of sample_means?


Answer: Option 1

As we found in the previous part, the distribution of the sample mean should have a standard deviation of 1. We also know it should be centered at the mean of our sample, at 35, but since all the options are centered here, that’s not too helpful. Only Option 1, however, has a standard deviation of 1. Remember, we can approximate the standard deviation of a normal curve as the distance between the mean and either of the inflection points. Only Option 1 looks like it has inflection points at 34 and 36, a distance of 1 from the mean of 35.

If you chose Option 2, you probably confused the standard deviation of our original sample, 10, with the standard deviation of the distribution of the sample mean, which comes from dividing that value by the square root of the sample size.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 57%.


Problem 2.3

Which of the following statements are guaranteed to be true? Select all that apply.

Answer: A CLT-based 90% confidence interval for the mean age of credit card applicants, based on the data in hundred_apps, would be narrower than the interval you gave in part (a).

Let’s analyze each of the options:

  • Option 1: We are not using bootstrapping to compute sample means since we are sampling from the apps DataFrame, which is our population here. If we were bootstrapping, we’d need to sample from our first sample, which is hundred_apps.

  • Option 2: We can’t be sure what the distribution of the ages of credit card applicants is. The Central Limit Theorem says that the distribution of sample_means is roughly normal, but we know nothing about the population distribution.

  • Option 3: The CLT-based 95% confidence interval that we calculated in part (a) was computed as follows: \left[\text{sample mean} - 2\cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}}, \text{sample mean} + 2\cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}} \right] A CLT-based 90% confidence interval would be computed as \left[\text{sample mean} - z\cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}}, \text{sample mean} + z\cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}} \right] for some value of z less than 2. We know that 95% of the area of a normal curve is within two standard deviations of the mean, so to only pick up 90% of the area, we’d have to go slightly less than 2 standard deviations away. This means the 90% confidence interval will be narrower than the 95% confidence interval.

  • Option 4: The left endpoint of the interval from part (a) was calculated using the Central Limit Theorem, whereas np.percentile(sample_means, 2.5) is calculated empirically, using the data in sample_means. Empirically calculating a confidence interval doesn’t necessarily give the exact same endpoints as using the Central Limit Theorem, but it should give you values close to those endpoints. These values are likely very similar but they are not guaranteed to be the same. One way to see this is that if we ran the code to generate sample_means again, we’d probably get a different value for np.percentile(sample_means, 2.5). See the sketch after this list.

  • Option 5: The key observation is that if we used the data in hundred_apps to create 1,000 CLT-based 95% confidence intervals for the mean age of applicants in apps, all of these intervals would be exactly the same. Given a sample, there is only one CLT-based 95% confidence interval associated with it. In our case, given the sample hundred_apps, the one and only CLT-based 95% confidence interval based on this sample is the one we found in part (a). Therefore if we generated 1,000 of these intervals, either they would all contain the parameter or none of them would. In order for a statement like the one here to be true, we would need to collect 1,000 different samples, and calculate a confidence interval from each one.
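
To make the contrast in Option 4 concrete, here is a sketch comparing the two left endpoints (sample_means is the array computed in the code above):

    clt_left = 35 - 2 * 10 / np.sqrt(100)        # exactly 33, fixed by the formula
    emp_left = np.percentile(sample_means, 2.5)  # close to 33, but varies from run to run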


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 49%.



Source: fa24-final — Q4

Problem 4

Matilda has been working at Bill’s Book Bonanza since it first opened and her shifts for each week are always randomly scheduled (i.e., she does not work the same shifts each week). Due to the system restrictions at the bookstore, when Matilda logs in with her employee ID, she only has access to the history of sales made during her shifts. Suppose these transactions are stored in a DataFrame called matilda, which has the same columns as the sales DataFrame that stores all transactions.

Matilda wants to use her random sample to estimate the mean price of books purchased with cash at Bill’s Book Bonanza. For the purposes of this question, assume Matilda only has access to matilda, and not all of sales.


Problem 4.1

Complete the code below so that cash_left and cash_right store the endpoints of an 86\% bootstrapped confidence interval for the mean price of books purchased with cash at Bill’s Book Bonanza.

    cash_means = np.array([])
    original = __(a)__
    for i in np.arange(10000):
        resample = original.sample(__(b)__)
        cash_means = np.append(cash_means, __(c)__)

    cash_left = __(d)__
    cash_right = __(e)__

  (a) matilda[matilda.get("cash")]

We first filter the matilda DataFrame to only include rows where the payment method was cash, since these are the only rows that are relevant to our question of the mean price of books purchased with cash.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 65%.

  (b) original.shape[0], replace = True

We repeatedly resample from this filtered data. We always bootstrap using the same sample size as the original sample, and with replacement.


Difficulty: ⭐️⭐️

The average score on this problem was 79%.

  (c) resample.get("price").mean()

Our statistic is the mean price, which we need to calculate from the resample DataFrame.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 71%.

  (d) np.percentile(cash_means, 7)

Since our goal is to construct an 86% confidence interval, we take the 7^{th} percentile as the left endpoint and the 93^{rd} percentile as the right endpoint. These numbers come from 100 - 86 = 14, which means we want 14\% of the area excluded from our interval, and we want to split that up evenly with 7\% on each side.


Difficulty: ⭐️⭐️

The average score on this problem was 84%.

  (e) np.percentile(cash_means, 93)

Difficulty: ⭐️⭐️

The average score on this problem was 84%.
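
Putting the five answers together, the completed bootstrap procedure reads:

    cash_means = np.array([])
    original = matilda[matilda.get("cash")]
    for i in np.arange(10000):
        resample = original.sample(original.shape[0], replace=True)
        cash_means = np.append(cash_means, resample.get("price").mean())

    cash_left = np.percentile(cash_means, 7)
    cash_right = np.percentile(cash_means, 93)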


Problem 4.2

Next, Matilda uses the data in matilda to construct a 95\% CLT-based confidence interval for the same parameter.

Given that there are 400 cash transactions in matilda and her confidence interval comes out as [19.58, 21.58], what is the standard deviation of the prices of all cash transactions at Bill’s Book Bonanza?

Answer: 10

The width of the interval is 21.58 - 19.58 = 2. Using the formula for the width of a 95\% CLT-based confidence interval and solving for the standard deviation gives the answer.

Width = 4 * \frac{SD}{\sqrt{400}}

SD = 2 * \frac{\sqrt{400}}{4} = 10


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 61%.


Problem 4.3

Knowing the endpoints of Matilda’s 95% CLT-based confidence interval can actually help us to determine the endpoints of her 86% bootstrapped confidence interval. You may also need to know the following facts:

Estimate the value of cash_left to one decimal place.

Answer: 19.8

The useful fact here is the second fact. It says that if you step 1.5 standard deviations to the right of the mean in a normal distribution, 93% of the area will be to the left. Since that leaves 7% of the area to the right, this also means that the area to the left of -1.5 standard units is 7%, by symmetry. So the area between -1.5 and 1.5 standard units is 86% of the area, which gives the bounds for an 86% confidence interval.

The middle of our interval is 20.58, so we calculate the left endpoint as:

\begin{align*} \texttt{cash\_left} &= 20.58 - 1.5 * \frac{SD}{\sqrt{n}}\\ &= 20.58 - 1.5 * \frac{10}{\sqrt{400}}\\ &= 20.58 - 0.75 \\ &= 19.83\\ &\approx 19.8 \end{align*}


Difficulty: ⭐️⭐️⭐️⭐️⭐️

The average score on this problem was 22%.


Problem 4.4

Which of the following are valid conclusions? Select all that apply.

Answer: Approximately 95% of the values in cash_means fall within the interval [19.58, 21.58].

Choice 1 is correct because of the way bootstrapped confidence intervals are created. If we made a 95% bootstrapped confidence interval from cash_means, it would capture 95% of the values in cash_means, by design. This is not how the interval [19.58, 21.58] was created, however; this interval comes from the CLT. But bootstrapped and CLT-based confidence intervals give very nearly the same results; they are two methods for solving the same problem, and they produce nearly identical intervals. So it is correct to say that approximately 95% of the values in cash_means fall within a 95% CLT-based confidence interval, which is [19.58, 21.58].

Choice 2 is incorrect because the interval estimates the mean price, not individual book prices.

Choice 3 is incorrect because confidence intervals are about the reliability of the method: if we repeated the whole bootstrapping process many times, about 95% of the intervals would contain the true mean. It does not indicate the probability of the true mean being within this specific interval because this interval is fixed and the true mean is also fixed. It doesn’t make sense to talk about the probability of a fixed number falling in a fixed interval; that’s like asking if there’s a 95% chance that 5 is between 2 and 10.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 63%.


Problem 4.5

Matilda has been wondering whether the mean price of books purchased with cash at Bill’s Book Bonanza is \$20. What can she conclude about this?

Answer: The mean price of books purchased with cash at Bill’s Book Bonanza could plausibly be $20.

This question is referring to a hypothesis test in which we test whether a parameter (in this case, the mean price of books purchased with cash) is equal to a specific value ($20). The hypothesis test can be conducted by constructing a confidence interval for the parameter and checking whether the specific value falls in the interval. Since $20 is within the confidence interval, it is a plausible value for the parameter, though this is never guaranteed.


Difficulty: ⭐️

The average score on this problem was 90%.



Source: fa24-final — Q9

Problem 5

As in the previous problem, suppose we are told that sales contains 1000 rows, 500 of which represent cash transactions and 500 of which represent non-cash transactions.

This time, instead of being given histograms, we are told that the distribution of "price" for cash transactions is roughly normal, with a mean of \$14 and a standard deviation of \$2. We’ll call this distribution the cash curve.

Additionally, the distribution of "price" for non-cash transactions is roughly normal, with a mean of \$22 and a standard deviation of \$4. We’ll call this distribution the non-cash curve.

We want to draw a curve representing the approximate distribution of "price" for all transactions combined. We’ll call this distribution the combined curve.


Problem 5.1

What is the approximate proportion of area under the cash curve between \$10 and \$12? Your answer should be a number between 0 and 1.

Answer: 0.135

We know that for a normal distribution, approximately

  • 68% of the data is found within one standard deviation of the mean (mean ± 1 SD).
  • 95% of the data falls within two standard deviations (mean ± 2 SD).
  • 99.7% of the data lies within three standard deviations (mean ± 3 SD).

For the cash curve, the mean is $14 and the standard deviation is $2, so using this rule:

  • The range from $12 to $16 (14 ± 2) covers one standard deviation from the mean, containing approximately 68% of the data. Since normal distributions are symmetric, the range from $12 to $14 covers half of this, or 34% of the total area.
  • The range from $10 to $18 (14 ± 4) covers two standard deviations from the mean, containing approximately 95% of the data. By symmetry, the range from $10 to $14 covers half of this, or 47.5% of the total area.

Area from $10 to $12 = Area from $10 to $14 - Area from $12 to $14 = 47.5% - 34% = 13.5% = 0.135
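
As a sanity check, the same area can be computed with scipy (which appears later in this problem); in standard units, $10 and $12 are -2 and -1:

    import scipy.stats
    scipy.stats.norm.cdf(-1) - scipy.stats.norm.cdf(-2)  # about 0.1359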


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 46%.


Problem 5.2

Fill in the blanks in the code below so that the expression evaluates to the approximate proportion of area under the cash curve between \$14.50 and \$17.50. Each answer should be a single number.

scipy.stats.norm.cdf(a) - scipy.stats.norm.cdf(b)

(a). Answer: 1.75

(b). Answer: 0.25

The expression scipy.stats.norm.cdf(a) - scipy.stats.norm.cdf(b) calculates the area under the normal distribution curve between two points, where a is the right endpoint and b is the left endpoint. Here, we’ll let a and b represent the standardized values corresponding to $17.50 and $14.50, respectively.

We’ll use the formula for standard units:

x_{i \: \text{(su)}} = \frac{x_i - \text{mean of $x$}}{\text{SD of $x$}}

For the standard units value corresponding to $17.50:

a = \frac{(17.50 - 14)}{2} = 1.75

For the standard units value corresponding to $14.50:

b = \frac{(14.50 - 14)}{2} = 0.25
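
Evaluating the completed expression confirms the area (a quick check):

    import scipy.stats
    scipy.stats.norm.cdf(1.75) - scipy.stats.norm.cdf(0.25)  # about 0.361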


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 54%.


Problem 5.3

Will the combined curve be roughly normal?

Answer: No, not necessarily.

Although both distributions are roughly normal, their means are far apart: one is centered at 14 (with a standard deviation of 2) and the other at 22 (with a standard deviation of 4). The resulting distribution is likely bimodal (having two peaks), with a peak near each of the original means, reflecting the two separate normal distributions.

The CLT says that “the distribution of possible sample means is approximately normal, no matter the distribution of the population,” but it doesn’t say anything about combining two distributions.


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 38%.


Problem 5.4

Fill in the blanks in the code below so that the expression evaluates to the approximate proportion of area under the combined curve between \$14 and \$22. Each answer should be a single number.

(scipy.stats.norm.cdf(a) - scipy.stats.norm.cdf(b)) / 2

(a). Answer: 4

(b). Answer: -2

Since the combined curve is a combination of two normal distributions, each representing an equal number of transactions (500 each), we can approximate the total area by averaging the areas under the individual cash and non-cash curves between \$14 and \$22.

  1. For cash:

    P(Z ≤ \frac{22-14}{2}) - P(Z ≤ \frac{14-14}{2})

    P(Z ≤ 4) - P(Z ≤ 0) = scipy.stats.norm.cdf(4) - scipy.stats.norm.cdf(0)

  2. Non-cash:

    P(Z ≤ \frac{22-22}{4}) - P(Z ≤ \frac{14-22}{4})

    P(Z ≤ 0) - P(Z ≤ -2) = scipy.stats.norm.cdf(0) - scipy.stats.norm.cdf(-2)

Averaging area under the two distributions:

\frac{1}{2} (scipy.stats.norm.cdf(4) - scipy.stats.norm.cdf(0) + scipy.stats.norm.cdf(0) - scipy.stats.norm.cdf(-2)) = (scipy.stats.norm.cdf(4) - scipy.stats.norm.cdf(-2)) / 2
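
Evaluating this expression gives the area asked about in the next part:

    import scipy.stats
    (scipy.stats.norm.cdf(4) - scipy.stats.norm.cdf(-2)) / 2  # about 0.489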


Difficulty: ⭐️⭐️⭐️⭐️⭐️

The average score on this problem was 15%.


Problem 5.5

What is the approximate proportion of area under the combined curve between \$14 and \$22? Choose the closest answer below.

Answer: 0.49

We can approximate the total area by averaging the areas under the individual cash and non-cash curves between \$14 and \$22.

  1. For cash: mean = $14, SD = $2.
    • $22 is 4 standard deviations (SDs) above the mean of $14.
    • The area from the mean to 4 SDs above the mean is approximately 50%, since essentially the entire right half of the curve lies in this range.
  2. Non-cash: mean = $22, SD = $4.
    • $14 is 2 SDs below the mean of $22.
    • The area within 2 SDs of the mean is 95%, so the area from $14 up to the mean ($22) is half of 95%: 47.5%.

Averaging the two individual areas: \frac{0.5 + 0.475}{2} = 0.4875, which is closest to 0.49.


Difficulty: ⭐️⭐️⭐️⭐️⭐️

The average score on this problem was 24%.



Problem 6

An IKEA employee has access to a data set of the purchase amounts for 40,000 customer transactions. This data set is roughly normally distributed with mean 150 dollars and standard deviation 25 dollars.


Problem 6.1

Why is the distribution of purchase amounts roughly normal?

Answer: for some other reason

The data that we have is a sample of purchase amounts. It is not a sample mean or sample sum, so the Central Limit Theorem does not apply. The data just naturally happens to be roughly normally distributed, like human heights, for example.


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 42%.


Problem 6.2

Shiv spends 300 dollars at IKEA. How would we describe Shiv’s purchase in standard units?

Answer: 6 standard units

To standardize a data value, we subtract the mean of the distribution and divide by the standard deviation:

\begin{aligned} \text{standard units} &= \frac{300 - 150}{25} \\ &= \frac{150}{25} \\ &= 6 \end{aligned}

A more intuitive way to think about standard units is the number of standard deviations above the mean (where negative means below the mean). Here, Shiv spent 150 dollars more than average. One standard deviation is 25 dollars, so 150 dollars is six standard deviations.


Difficulty: ⭐️

The average score on this problem was 97%.


Problem 6.3

Give the endpoints of the CLT-based 95% confidence interval for the mean IKEA purchase amount, based on this data.

Answer: 149.75 and 150.25 dollars

The Central Limit Theorem tells us about the distribution of the sample mean purchase amount. That’s not the distribution the employee has, but rather a distribution that shows how the mean of a different sample of 40,000 purchases might have turned out. Specifically, we know the following information about the distribution of the sample mean.

  1. It is roughly normally distributed.
  2. Its mean is about 150 dollars, the same as the mean of the employee’s sample.
  3. Its standard deviation is about \frac{\text{sample standard deviation}}{\sqrt{\text{sample size}}}=\frac{25}{\sqrt{40000}} = \frac{25}{200} = \frac{1}{8}.

Since the distribution of the sample mean is roughly normal, we can find a 95% confidence interval for the population mean by stepping out two standard deviations from the center, using the fact that 95% of the area of a normal distribution falls within 2 standard deviations of the mean. Therefore the endpoints of the CLT-based 95% confidence interval for the mean IKEA purchase amount are

  • 150 - 2*\frac{1}{8} = 149.75 dollars, and
  • 150 + 2*\frac{1}{8} = 150.25 dollars.
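
A quick sketch of the same computation in code:

    import numpy as np
    sd_of_sample_mean = 25 / np.sqrt(40000)  # 1/8 = 0.125
    left = 150 - 2 * sd_of_sample_mean       # 149.75
    right = 150 + 2 * sd_of_sample_mean      # 150.25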

Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 36%.



Problem 7

Oren has a random sample of 200 dog prices in an array called oren. He has also bootstrapped his sample 1,000 times and stored the mean of each resample in an array called boots.

In this question, assume that the following code has run:

a = np.mean(oren)
b = np.std(oren)
c = len(oren)


Problem 7.1

What expression best estimates the population’s standard deviation?

Answer: b

The function np.std directly calculates the standard deviation of the array oren. Even though oren is a sample of the population, its standard deviation is still a pretty good estimate of the population’s standard deviation because it is a random sample. The other options don’t really make sense in this context.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 57%.


Problem 7.2

Which expression best estimates the mean of boots?

Answer: a

Note that a is equal to the mean of oren, which is a pretty good estimator of the mean of the overall population as well as the mean of the distribution of sample means. The other options don’t really make sense in this context.


Difficulty: ⭐️⭐️

The average score on this problem was 89%.


Problem 7.3

What expression best estimates the standard deviation of boots?

Answer: b / np.sqrt(c)

Note that we can use the Central Limit Theorem for this problem, which states that the standard deviation (SD) of the distribution of sample means is equal to (population SD) / np.sqrt(sample size). Since the sample SD is our best estimate of the population SD, we can plug our variables in to see that b / np.sqrt(c) is the answer.


Difficulty: ⭐️

The average score on this problem was 91%.


Problem 7.4

What is the dog price of $560 in standard units?

Answer: (560 - a) / b

To convert a value to standard units, we take the value, subtract the mean from it, and divide by SD. In this case that is (560 - a) / b, because a is the mean of our dog prices sample array and b is the SD of the dog prices sample array.


Difficulty: ⭐️⭐️

The average score on this problem was 80%.


Problem 7.5

The distribution of boots is normal because of the Central Limit Theorem.

Answer: True

True. The Central Limit Theorem states that if you take sufficiently large random samples from a population, the distribution of the sample means will be approximately normal, regardless of the population’s distribution.


Difficulty: ⭐️

The average score on this problem was 91%.


Problem 7.6

If Oren’s sample was 400 dogs instead of 200, the standard deviation of boots will…

Answer: Decrease by a factor of \sqrt{2}

Recall that the Central Limit Theorem states that the SD of the distribution of sample means is equal to (population SD) / np.sqrt(sample size). So if we increase the sample size by a factor of 2, the SD of the distribution of sample means will decrease by a factor of \sqrt{2}.


Difficulty: ⭐️⭐️

The average score on this problem was 80%.


Problem 7.7

If Oren took 4000 bootstrap resamples instead of 1000, the standard deviation of boots will…

Answer: None of the above

Again, from the formula given by the Central Limit Theorem, the SD of the distribution of sample means doesn’t depend on the number of bootstrap resamples, so long as that number is “sufficiently large.” Thus, increasing the number of bootstrap resamples from 1,000 to 4,000 will have no effect on the SD of boots.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 74%.


Problem 7.8

Write one line of code that evaluates to the right endpoint of a 92% CLT-based confidence interval for the mean dog price. The following expressions may help:

stats.norm.cdf(1.75) # => 0.96
stats.norm.cdf(1.4)  # => 0.92

Answer: a + 1.75 * b / np.sqrt(c)

Recall that a 92% confidence interval consists of the middle 92% of the distribution. In other words, we want to “chop off” 4% from either end of the distribution. Thus, to get the right endpoint, we want the value corresponding to the 96th percentile of the distribution of the sample mean dog price: mean + 1.75 * (SD of population / np.sqrt(sample size)), or a + 1.75 * b / np.sqrt(c) (we divide by np.sqrt(c) due to the Central Limit Theorem). Since stats.norm.cdf(1.75) = 0.96, stepping 1.75 standard deviations above the mean leaves 96% of the area to the left. Note that the second given fact, stats.norm.cdf(1.4) = 0.92, is irrelevant to this particular problem.


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 48%.



Source: su24-final — Q5

Problem 8

In this question, suppose random_stages is a random sample of undetermined size drawn with replacement from stages. We want to estimate the proportion of stage wins won by each country.


Problem 8.1

Suppose we extract the winning countries and store the resulting Series. Consider the variable winners defined below, which you may use throughout this question:

winners = random_stages.get("Winner Country")

Write a single line of code that evaluates to the proportion of stages in random_stages won by France (country code "FRA").

Answer: np.mean(winners == "FRA") or np.count_nonzero(winners == "FRA") / len(winners)

winners == "FRA" creates a Boolean array where each element is True if the corresponding value in the winners Series equals "FRA", and False otherwise. In Python, True is equivalent to 1 and False is equivalent to 0 when used in numerical operations. np.mean(winners == "FRA") computes the average of this Boolean array, which is equivalent to the proportion of True values (i.e., the proportion of stages won by "FRA").

Alternatively, you can use np.count_nonzero(winners == "FRA") / len(winners), which counts the number of True values and divides by the total number of entries to compute the proportion.


Difficulty: ⭐️⭐️

The average score on this problem was 81%.


Problem 8.2

We want to generate a 95% confidence interval for the true proportion of wins by France in stages by using our random sample random_stages. How many rows need to be in random_stages for our confidence interval to have width of at most 0.03? Recall that the maximum standard deviation for any series of zeros and ones is 0.5. Do not simplify your answer.

Answer: 4 * \frac{0.5}{\sqrt{n}} \leq 0.03

For a 95% confidence interval for the population mean: \left[ \text{sample mean} - 2\cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}}, \ \text{sample mean} + 2\cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}} \right]

Note that the width of our CI is the right endpoint minus the left endpoint:

\text{width} = 4 \cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}}

Substitute the maximum standard deviation (sample SD = 0.5) for any series of zeros and ones and set the width to be at most 0.03: 4 \cdot \frac{0.5}{\sqrt{n}} \leq 0.03.
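
For reference, solving this inequality (the problem says not to simplify, but it’s a useful check): \sqrt{n} \geq \frac{4 \cdot 0.5}{0.03} \approx 66.67, so n \geq 66.67^2 \approx 4444.5, meaning any sample of at least 4445 rows works.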


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 59%.


Problem 8.3

Suppose we now want to test the hypothesis that the true proportion of stages won by Italy ("ITA") is 0.2 using a confidence interval and the Central Limit Theorem. We want to conduct our hypothesis test at a significance level of 0.01. Fill in the blanks to construct the confidence interval [interval_left, interval_right]. Your answer must use the Central Limit Theorem, not bootstrapping. Assume an integer variable sample_size = len(winners) has been defined, regardless of your answer to part 2.

Hint:

stats.norm.cdf(2.576) - stats.norm.cdf(-2.576) = 0.99

    interval_center = __(i)__
    mystery = __(ii)__ * np.std(__(iii)__) / __(iv)__
    interval_left = interval_center - mystery
    interval_right = interval_center + mystery

The confidence interval for the true proportion is given by: \left[ \text{sample mean} - z\cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}}, \ \text{sample mean} + z\cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}} \right]

  • (i): np.mean(winners == "ITA") or (winners == "ITA").mean()
    • The center of the confidence interval is the sample mean proportion of stages won by Italy. This is calculated by taking the mean of the Boolean array where winners equals "ITA". The Boolean array has values 1 (if true) or 0 (if false), so the mean directly represents the proportion.
  • (ii): 2.576
    • This is the critical value corresponding to a 99% confidence level. To have 99% of the data between -z and z, the area under the curve outside of this range is: 1 - 0.99 = 0.01, split equally between the two tails (0.005 in each tail). This means: P(-z \leq Z \leq z) = 0.99.

    • The CDF at z = 2.576 captures 99.5% of the area to the left of 2.576: \text{stats.norm.cdf}(2.576) \approx 0.995.

    • Similarly, the CDF at z = -2.576 captures 0.5% of the area to the left: \text{stats.norm.cdf}(-2.576) \approx 0.005.

    • The total area between -2.576 and 2.576 is: \text{stats.norm.cdf}(2.576) - \text{stats.norm.cdf}(-2.576) = 0.995 - 0.005 = 0.99.

      This confirms that z = 2.576 is the correct critical value for a 99% confidence level.

  • (iii): winners == "ITA"
    • The standard deviation is calculated using the Boolean array where winners equals "ITA".
  • (iv): np.sqrt(sample_size)
    • The denominator is the square root of the sample size. This is consistent with the Central Limit Theorem.
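
Filling in the blanks, the completed interval construction reads:

    interval_center = np.mean(winners == "ITA")
    mystery = 2.576 * np.std(winners == "ITA") / np.sqrt(sample_size)
    interval_left = interval_center - mystery
    interval_right = interval_center + mystery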

Difficulty: ⭐️⭐️⭐️

The average score on this problem was 60%.


Problem 8.4

What is our null hypothesis?

Answer: The true proportion of stages won by Italy is 0.2.

The null hypothesis assumes that there is no difference or effect. Here, it states that the true proportion of stages won by Italy equals 0.2.


Difficulty: ⭐️

The average score on this problem was 90%.


Problem 8.5

What is our alternative hypothesis?

Answer: The true proportion of stages won by Italy is not 0.2.

The alternative hypothesis is the opposite of the null hypothesis. Here, we test whether the true proportion of stages won by Italy is different from 0.2.


Difficulty: ⭐️⭐️

The average score on this problem was 85%.


Problem 8.6

Suppose we calculated the interval [0.195, 0.253] using the above process. Should we reject or fail to reject our null hypothesis?

Answer: Fail to reject.

The confidence interval calculated is [0.195, 0.253], and the null hypothesis value (0.2) lies within this interval. This means that 0.2 is a plausible value for the true proportion at the 99% confidence level. Therefore, we do not have sufficient evidence to reject the null hypothesis.


Difficulty: ⭐️⭐️

The average score on this problem was 85%.



Problem 9

In a board game, whenever it is your turn, you roll a six-sided die and move that number of spaces. You get 10 turns, and you win the game if you’ve moved 50 spaces in those 10 turns. Suppose you create a simulation, based on 10,000 trials, to show the distribution of the number of spaces moved in 10 turns. Let’s call this distribution Dist10. You also wonder how the game would be different if you were allowed 15 turns instead of 10, so you create another simulation, based on 10,000 trials, to show the distribution of the number of spaces moved in 15 turns, which we’ll call Dist15.

Problem 9.1

What can we say about the shapes of Dist10 and Dist15?

Answer: both will be roughly normally distributed

Each simulated value is the sum of 10 (or 15) die rolls, so by the Central Limit Theorem, both distributions will appear roughly normal.
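
A sketch of how such a simulation might be written with the course’s usual numpy tools (the variable names here are hypothetical):

    import numpy as np
    dist10 = np.array([])
    for i in np.arange(10000):
        rolls = np.random.choice(np.arange(1, 7), 10)  # ten fair die rolls
        dist10 = np.append(dist10, rolls.sum())
    # Dist15 is built the same way, with 15 rolls per trial.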


Difficulty: ⭐️

The average score on this problem was 90%.


Problem 9.2

What can we say about the centers of Dist10 and Dist15?

Answer: the mean of Dist10 will be smaller than the mean of Dist15

Dist10 will have a smaller mean because there are fewer turns in which to move spaces. The total number of spaces moved will naturally be larger, on average, for the distribution with more turns.


Difficulty: ⭐️⭐️

The average score on this problem was 83%.


Problem 9.3

What can we say about the spread of Dist10 and Dist15?

Answer: the standard deviation of Dist10 will be smaller than the standard deviation of Dist15

Since taking more turns allows for a wider range of possible totals, the standard deviation of Dist10 will be smaller than the standard deviation of Dist15 (i.e., consider the range of values attainable in each case). More precisely, the SD of the sum of n die rolls grows in proportion to \sqrt{n}.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 65%.



Problem 10

From a population with mean 500 and standard deviation 50, you collect a sample of size 100. The sample has mean 400 and standard deviation 40. You bootstrap this sample 10,000 times, collecting 10,000 resample means.


Problem 10.1

Which of the following is the most accurate description of the mean of the distribution of the 10,000 bootstrapped means?

Answer: The mean will be approximately equal to 400.

The mean of the distribution of bootstrapped means will be approximately 400, since that is the mean of the sample, and bootstrapping repeatedly resamples from the original sample. The mean will not be exactly 400 due to randomness, though it will be very close.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 54%.


Problem 10.2

Which of the following is closest to the standard deviation of the distribution of the 10,000 bootstrapped means?

Answer: 4

To find the standard deviation of the distribution, we divide the sample standard deviation by the square root of the sample size. Plugging in, we get 40 / \sqrt{100} = 40 / 10 = 4.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 51%.



Problem 11

True or False: Suppose that from a sample, you compute a 95% normal confidence interval for a population parameter to be the interval [L, R]. Then the average of L and R is the mean of the original sample.

Answer: True

True. A 95% normal confidence interval is computed by adding and subtracting the same margin of error (a z value times the standard error) to and from the sample mean. Since the two endpoints differ from the sample mean by the same amount, their average is exactly the sample mean. (For more information, refer to the reference sheet.)


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 68%.


Problem 12

The season DataFrame contains statistics on all players in the WNBA in the 2021 season. The first few rows of season are shown below.

Each row in season corresponds to a single player. In this problem, we’ll be working with the 'PPG' column, which contains the number of points scored per game played.

Suppose we only have access to the DataFrame small_season, which is a random sample of size 36 from season. We’re interested in learning about the true mean points per game of all players in season given just the information in small_season.

To start, we want to bootstrap small_season 10,000 times and compute the mean of the resample each time. We want to store these 10,000 bootstrapped means in the array boot_means.

Here is a broken implementation of this procedure.

boot_means = np.array([])                                           
for i in np.arange(10000):                                          
    resample = small_season.sample(season.shape[0], replace=False)  # Line 1
    resample_mean = small_season.get('PPG').mean()                  # Line 2
    np.append(boot_means, new_mean)                                 # Line 3

For each of the 3 lines of code above (marked by comments), specify what is incorrect about the line by selecting one or more of the corresponding options below. Or, select “Line _ is correct as-is” if you believe there’s nothing that needs to be changed about the line in order for the above code to run properly.


Problem 12.1

What is incorrect about Line 1? Select all that apply.

Answers:

  • The sample size is season.shape[0], when it should be small_season.shape[0]
  • Sampling is currently being done without replacement, when it should be done with replacement

Here, our goal is to bootstrap from small_season. When bootstrapping, we sample with replacement from our original sample, with a sample size that’s equal to the original sample’s size. Here, our original sample is small_season, so we should be taking samples of size small_season.shape[0] from it.

Option 1 is incorrect; season has nothing to do with this problem, as we are bootstrapping from small_season.


Difficulty: ⭐️

The average score on this problem was 95%.


Problem 12.2

What is incorrect about Line 2? Select all that apply.

Answer: Currently it is taking the mean of the 'PPG' column in small_season, when it should be taking the mean of the 'PPG' column in resample

The current implementation of Line 2 doesn’t use the resample at all, when it should. If we were to leave Line 2 as it is, all of the values in boot_means would be identical (and equal to the mean of the 'PPG' column in small_season).

Option 1 is incorrect since our bootstrapping procedure is independent of season. Option 3 is incorrect because .mean() is a valid Series method.


Difficulty: ⭐️

The average score on this problem was 98%.


Problem 12.3

What is incorrect about Line 3? Select all that apply.

Answers:

  • The result of calling np.append is not being reassigned to boot_means, so boot_means will be an empty array after running this procedure
  • new_mean is not a defined variable name, and should be replaced with resample_mean

np.append returns a new array and does not modify the array it is called on (boot_means, in this case), so Option 1 is a necessary fix. Furthermore, Option 3 is a necessary fix since new_mean wasn’t defined anywhere.

Option 2 is incorrect; if np.append were outside of the for-loop, none of the 10,000 resampled means would be saved in boot_means.


Difficulty: ⭐️

The average score on this problem was 94%.
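
Applying all three fixes, the repaired procedure is:

boot_means = np.array([])
for i in np.arange(10000):
    resample = small_season.sample(small_season.shape[0], replace=True)
    resample_mean = resample.get('PPG').mean()
    boot_means = np.append(boot_means, resample_mean)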


Problem 12.4

We construct a 95% confidence interval for the true mean points per game for all players by taking the middle 95% of the bootstrapped sample means.

left_b = np.percentile(boot_means, 2.5)
right_b = np.percentile(boot_means, 97.5)
boot_ci = [left_b, right_b]

Select the most correct statement below.

Answer: (left_b + right_b) / 2 is not necessarily equal to the mean points per game in small_season, but is close.

Normal-based confidence intervals are of the form [\text{mean} - \text{something}, \text{mean} + \text{something}]. In such confidence intervals, it is the case that the average of the left and right endpoints is exactly the mean of the distribution used to compute the interval.

However, the confidence interval we’ve created is not normal-based, rather it is bootstrap-based! As such, we can’t say that anything is exactly true; this rules out Options 1 and 3.

Our 95% confidence interval was created by taking the middle 95% of bootstrapped sample means. The distribution of bootstrapped sample means is roughly normal, and the normal distribution is symmetric (the mean and median are both equal, and represent the “center” of the distribution). This means that the middle of our 95% confidence interval should be roughly equal to the mean of the distribution of bootstrapped sample means. This implies that Option 4 is correct; the difference between Options 2 and 4 is that Option 4 uses small_season, which is the sample we bootstrapped from, while Option 2 uses season, which was not accessed at all in our bootstrapping procedure.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 62%.




Problem 14

For your convenience, we show the first few rows of season again below.

In the past three problems, we presumed that we had access to the entire season DataFrame. Now, suppose we only have access to the DataFrame small_season, which is a random sample of size 36 from season. We’re interested in learning about the true mean points per game of all players in season given just the information in small_season.

To start, we want to bootstrap small_season 10,000 times and compute the mean of the resample each time. We want to store these 10,000 bootstrapped means in the array boot_means.

Here is a broken implementation of this procedure.

boot_means = np.array([])                                           
for i in np.arange(10000):                                          
    resample = small_season.sample(season.shape[0], replace=False)  # Line 1
    resample_mean = small_season.get('PPG').mean()                  # Line 2
    np.append(boot_means, new_mean)                                 # Line 3

For each of the 3 lines of code above (marked by comments), specify what is incorrect about the line by selecting one or more of the corresponding options below. Or, select “Line _ is correct as-is” if you believe there’s nothing that needs to be changed about the line in order for the above code to run properly.


Problem 14.1

What is incorrect about Line 1? Select all that apply.

Answers:

  • The sample size is season.shape[0], when it should be small_season.shape[0]
  • Sampling is currently being done without replacement, when it should be done with replacement

Here, our goal is to bootstrap from small_season. When bootstrapping, we sample with replacement from our original sample, with a sample size that’s equal to the original sample’s size. Here, our original sample is small_season, so we should be taking samples of size small_season.shape[0] from it.

Option 1 is incorrect; season has nothing to do with this problem, as we are bootstrapping from small_season.


Difficulty: ⭐️

The average score on this problem was 95%.


Problem 14.2

What is incorrect about Line 2? Select all that apply.

Answer: Currently it is taking the mean of the 'PPG' column in small_season, when it should be taking the mean of the 'PPG' column in resample

The current implementation of Line 2 doesn’t use the resample at all, when it should. If we were to leave Line 2 as it is, all of the values in boot_means would be identical (and equal to the mean of the 'PPG' column in small_season).

Option 1 is incorrect since our bootstrapping procedure is independent of season. Option 3 is incorrect because .mean() is a valid Series method.


Difficulty: ⭐️

The average score on this problem was 98%.


Problem 14.3

What is incorrect about Line 3? Select all that apply.

Answers:

  • The result of calling np.append is not being reassigned to boot_means, so boot_means will be an empty array after running this procedure
  • new_mean is not a defined variable name, and should be replaced with resample_mean

np.append returns a new array and does not modify the array it is called on (boot_means, in this case), so Option 1 is a necessary fix. Furthermore, Option 3 is a necessary fix since new_mean wasn’t defined anywhere.

Option 2 is incorrect; if np.append were outside of the for-loop, none of the 10,000 resampled means would be saved in boot_means.


Difficulty: ⭐️

The average score on this problem was 94%.


Problem 14.4

Suppose we’ve now fixed everything that was incorrect about our bootstrapping implementation.

Recall from earlier in the exam that, in season, the mean number of points per game is 7, with a standard deviation of 5.

It turns out that when looking at just the players in small_season, the mean number of points per game is 9, with a standard deviation of 4. Remember that small_season is a random sample of size 36 taken from season.

Which of the following histograms visualizes the empirical distribution of the sample mean, computed using the bootstrapping procedure above?

Answer: Option 3

The key to this problem is knowing to use the Central Limit Theorem. Specifically, we know that if we collect many samples from a population with replacement, then the distribution of the sample means will be roughly normal with:

  • a mean that is equal to the mean of the population
  • a standard deviation that is \frac{\text{SD of population}}{\sqrt{\text{sample size}}}

Here, the “population” is small_season, because that is the sample we’re repeatedly (re)sampling from. While season is actually the population, it is not seen at all in the bootstrapping process, so it doesn’t directly influence the distribution of the bootstrapped sample means.

The mean of small_season is 9, so the mean of the distribution of bootstrapped sample means is also 9. The standard deviation of small_season is 4, so by the square root law, the standard deviation of the distribution of bootstrapped sample means is \frac{4}{\sqrt{36}} = \frac{4}{6} = \frac{2}{3}.

The answer now boils down to choosing the histogram that looks roughly normally distributed with a mean of 9 and a standard deviation of \frac{2}{3}. Options 1 and 4 can be ruled out right away since their means appear to be smaller than 9. To decide between Options 2 and 3, we can use the inflection point rule, which states that in a normal distribution, the inflection points occur at one standard deviation above and one standard deviation below the mean. (An inflection point is where a curve changes from opening upwards to opening downwards.)

Option 3 is the only distribution that appears to be centered at 9 with a standard deviation of \frac{2}{3} (0.7 is close to \frac{2}{3}), so it must be the empirical distribution of the bootstrapped sample means.
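Since the actual small_season data isn’t available here, we can sanity-check the square root law with a synthetic stand-in (hypothetical data; sample is just any 36 values with mean roughly 9 and SD roughly 4):

    import numpy as np

    np.random.seed(42)
    sample = np.random.normal(9, 4, size=36)  # hypothetical stand-in for 'PPG'

    boot_means = np.array([])
    for i in np.arange(10000):
        resample = np.random.choice(sample, size=36, replace=True)
        boot_means = np.append(boot_means, resample.mean())

    # Should be close to sample.mean() and np.std(sample) / np.sqrt(36),
    # per the square root law.
    print(boot_means.mean(), np.std(boot_means))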


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 42%.


Problem 14.5

We construct a 95% confidence interval for the true mean points per game for all players by taking the middle 95% of the bootstrapped sample means.

    left_b = np.percentile(boot_means, 2.5)
    right_b = np.percentile(boot_means, 97.5)
    boot_ci = [left_b, right_b]

Select the most correct statement below.

Answer: (left_b + right_b) / 2 is not necessarily equal to the mean points per game in small_season, but is close.

Normal-based confidence intervals are of the form [\text{mean} - \text{something}, \text{mean} + \text{something}]. In such confidence intervals, it is the case that the average of the left and right endpoints is exactly the mean of the distribution used to compute the interval.

However, the confidence interval we’ve created is not normal-based; rather, it is bootstrap-based! As such, we have no guarantee that either equality holds exactly; this rules out Options 1 and 3.

Our 95% confidence interval was created by taking the middle 95% of bootstrapped sample means. The distribution of bootstrapped sample means is roughly normal, and the normal distribution is symmetric (the mean and median are both equal, and represent the “center” of the distribution). This means that the middle of our 95% confidence interval should be roughly equal to the mean of the distribution of bootstrapped sample means. This implies that Option 4 is correct; the difference between Options 2 and 4 is that Option 4 uses small_season, which is the sample we bootstrapped from, while Option 2 uses season, which was not accessed at all in our bootstrapping procedure.
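As a quick check, assuming boot_means and small_season from the (corrected) bootstrapping code are available, the midpoint of the interval lands near, but generally not exactly on, the sample mean:

    midpoint = (left_b + right_b) / 2
    sample_mean = small_season.get('PPG').mean()
    print(midpoint - sample_mean)  # small, but typically not exactly zero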


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 62%.


Problem 14.6

Instead of bootstrapping, we could also construct a 95% confidence interval for the true mean points per game by using the Central Limit Theorem.

Recall that, when looking at just the players in small_season, the mean number of points per game is 9, with a standard deviation of 4. Also remember that small_season is a random sample of size 36 taken from season.

Using only the information that we have about small_season (i.e. without using any facts about season), compute a 95% confidence interval for the true mean points per game.

What are the left and right endpoints of your interval? Give your answers as numbers rounded to 3 decimal places.

Answer: [7.667, 10.333]

In a normal distribution, roughly 95% of values are within 2 standard deviations of the mean. The CLT tells us that the distribution of sample means is roughly normal, and in subpart 4 of this problem we already computed the SD of the distribution of sample means to be \frac{2}{3}.

So, our normal-based 95% confidence interval is computed as follows:

\begin{aligned} &[\text{mean of sample} - 2 \cdot \text{SD of distribution of sample means}, \text{mean of sample} + 2 \cdot \text{SD of distribution of sample means}] \\ &= [9 - 2 \cdot \frac{4}{\sqrt{36}}, 9 + 2 \cdot \frac{4}{\sqrt{36}}] \\ &= [9 - \frac{4}{3}, 9 + \frac{4}{3}] \\ &\approx \boxed{[7.667, 10.333]} \end{aligned}
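The same computation, as a short Python sketch (all numbers come from the problem statement):

    import numpy as np

    sample_mean = 9
    sample_sd = 4
    n = 36

    sd_of_means = sample_sd / np.sqrt(n)      # 2/3, by the square root law
    left = sample_mean - 2 * sd_of_means
    right = sample_mean + 2 * sd_of_means
    print([round(left, 3), round(right, 3)])  # [7.667, 10.333]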


Difficulty: ⭐️⭐️

The average score on this problem was 87%.


Problem 14.7

Recall that the mean points per game in season is 7, which is not in the interval you found above (if it is, check your work!).

Select the true statement below.

Answer: The 95% confidence interval we created in the previous subpart did not contain the true mean points per game, but if we collected many original samples and constructed many 95% confidence intervals, then roughly 95% of them would contain the true mean points per game.

In a confidence interval, the confidence level gives us a level of confidence in the process used to create the confidence interval. If we repeat the process of collecting a sample from the population and using the sample to construct a c% confidence interval for the population mean, then roughly c% of the intervals we create should contain the population mean. Option 4 is the only option that corresponds to this interpretation; the others are all incorrect in different ways.


Difficulty: ⭐️⭐️

The average score on this problem was 87%.



Problem 15

Answer the following questions about a basketball dataset.


Problem 15.1

Suppose you have a random sample of 36 games in a basketball season. In your sample, the mean number of points per game is 9, with a standard deviation of 4 points per game. Using only this information, compute a 95% confidence interval for the true mean points per game in that season. What are the left and right endpoints of your interval?

Answer: [7.667, 10.333]

In a normal distribution, roughly 95% of values are within 2 standard deviations of the mean. The CLT tells us that the distribution of sample means is roughly normal, with a standard deviation of \frac{\text{sample SD}}{\sqrt{\text{sample size}}} = \frac{4}{\sqrt{36}} = \frac{2}{3}.

So, our normal-based 95% confidence interval is computed as follows:

\begin{aligned} &[\text{mean of sample} - 2 \cdot \text{SD of distribution of sample means}, \text{mean of sample} + 2 \cdot \text{SD of distribution of sample means}] \\ &= [9 - 2 \cdot \frac{4}{\sqrt{36}}, 9 + 2 \cdot \frac{4}{\sqrt{36}}] \\ &= [9 - \frac{4}{3}, 9 + \frac{4}{3}] \\ &\approx \boxed{[7.667, 10.333]} \end{aligned}


Difficulty: ⭐️⭐️

The average score on this problem was 87%.


Problem 15.2

It turns out that the true mean number of points per game in that season was 7, which is not in the interval you found above (if it is, check your work!). Select the true statement below.

Answer: The 95% confidence interval we created in the previous subpart did not contain the true mean points per game, but if we collected many original samples and constructed many 95% confidence intervals, then roughly 95% of them would contain the true mean points per game.

In a confidence interval, the confidence level gives us a level of confidence in the process used to create the confidence interval. If we repeat the process of collecting a sample from the population and using the sample to construct a c% confidence interval for the population mean, then roughly c% of the intervals we create should contain the population mean. Option 4 is the only option that corresponds to this interpretation; the others are all incorrect in different ways.


Difficulty: ⭐️⭐️

The average score on this problem was 87%.



Source: wi24-final — Q6

Problem 16

Suppose we sample 400 rows of olympians at random without replacement, then generate a 95% CLT-based confidence interval for the mean age of Olympic medalists based on this sample.


Problem 16.1

The CLT is stated for samples drawn with replacement, but in practice, we can often use it for samples drawn without replacement. What is it about this situation that makes it reasonable to still use the CLT despite the sample being drawn without replacement?

Answer: The sample is much smaller than the population.

The Central Limit Theorem (CLT) states that, regardless of the shape of the population distribution, the sampling distribution of the sample mean will be approximately normal if the sample size is sufficiently large. The key factor that makes it reasonable to still use the CLT is the sample size relative to the population size. When the sample is much smaller than the population, as in this case where 400 rows are sampled from a population of Olympians that is likely much larger, the effect of sampling without replacement becomes negligible.


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 47%.


Problem 16.2

Suppose our 95% CLT-based confidence interval for the mean age of Olympic medalists is [24.9, 26.1]. What was the mean of ages in our sample?

Answer: Mean = 25.5

We calculate the mean by first determining the width of our interval: 26.1 - 24.9 = 1.2. Dividing this width in half gives 0.6, the distance from the mean to each endpoint of the confidence interval. From here, we can find the mean in two ways: 24.9 + 0.6 = 25.5 or 26.1 - 0.6 = 25.5.


Difficulty: ⭐️

The average score on this problem was 90%.


Problem 16.3

Suppose our 95% CLT-based confidence interval for the mean age of Olympic medalists is [24.9, 26.1]. What was the standard deviation of ages in our sample?

Answer: Standard deviation = 6

We can calculate the sample standard deviation (sample SD) using the formula for a 95% CLT-based confidence interval:

\left[\text{sample mean} - 2 \cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}}, \: \text{sample mean} + 2 \cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}}\right].

Choose one of the endpoints and plug in the information we have:

25.5 - 2 \cdot \frac{\text{sample SD}}{\sqrt{400}} = 24.9 \implies \text{sample SD} = \frac{25.5 - 24.9}{2} \cdot \sqrt{400} = 6.
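Both recoveries (the mean from the previous subpart and the SD here) can be sanity-checked in a few lines of Python, using only the given endpoints and sample size:

    import numpy as np

    left, right = 24.9, 26.1  # given 95% CI endpoints
    n = 400                   # sample size

    sample_mean = (left + right) / 2          # 25.5
    half_width = (right - left) / 2           # 0.6, which is 2 * SD of sample means
    sample_sd = half_width / 2 * np.sqrt(n)   # 6.0, undoing the square root law
    print(sample_mean, sample_sd)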


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 55%.



Source: wi24-final — Q10

Problem 17

We want to use the sample of data in olympians to estimate the mean age of Olympic beach volleyball players.


Problem 17.1

Which of the following distributions must be normally distributed in order to use the Central Limit Theorem to estimate this parameter?

Answer: None of the Above

The Central Limit Theorem states that the distribution of possible sample means and sample sums is approximately normal, no matter the distribution of the population. Options A, B, and C are not probability distributions of the sum or mean of a large random sample drawn with replacement.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 72%.


Problem 17.2

Next we want to use bootstrapping to estimate this parameter. Which of the following code implementations correctly generates an array called sample_means containing 10,000 bootstrapped sample means?

Way 1:

    sample_means = np.array([])
    for i in np.arange(10000):
        bv = olympians[olympians.get("Sport") == "Beach Volleyball"]
        one_mean = (bv.sample(bv.shape[0], replace=True)
                      .get("Age").mean())
        sample_means = np.append(sample_means, one_mean)

Way 2:

    sample_means = np.array([])
    for i in np.arange(10000):
        bv = olympians[olympians.get("Sport") == "Beach Volleyball"]
        one_mean = (olympians.sample(olympians.shape[0], replace=True)
                             .get("Age").mean())
        sample_means = np.append(sample_means, one_mean)

Way 3:

    sample_means = np.array([])
    for i in np.arange(10000):
        resample = olympians.sample(olympians.shape[0], replace=True)
        bv = resample[resample.get("Sport") == "Beach Volleyball"]
        one_mean = bv.get("Age").mean()
        sample_means = np.append(sample_means, one_mean)

Way 4:

    sample_means = np.array([])
    bv = olympians[olympians.get("Sport") == "Beach Volleyball"]
    for i in np.arange(10000):
        one_mean = (bv.sample(bv.shape[0], replace=True)
                      .get("Age").mean())
        sample_means = np.append(sample_means, one_mean)

Way 5:

    sample_means = np.array([])
    bv = olympians[olympians.get("Sport") == "Beach Volleyball"]
    one_mean = (bv.sample(bv.shape[0], replace=True)
                  .get("Age").mean())
    for i in np.arange(10000):
        sample_means = np.append(sample_means, one_mean)

Answer: Way 1 and Way 4

  • Way 1 first creates a DataFrame bv, a subset of olympians filtered to contain only "Beach Volleyball" players. It then samples from bv with replacement, computes the mean age of that resample, and stores it in the variable one_mean. The for-loop repeats this 10,000 times, each time rebuilding bv, resampling from it, calculating a mean, and appending one_mean to the array sample_means. This is a correct way to bootstrap.
  • Way 2 is incorrect because it calculates one_mean using a sample from the entire olympians DataFrame, instead of from the bv DataFrame containing only the "Beach Volleyball" players. Each resample would contain players of all sports, and the mean age of those players would be calculated.
  • Way 3 is incorrect because it filters for rows where the "Sport" column equals "Beach Volleyball" after resampling from the DataFrame instead of before. Each resample is drawn from rows of all sports, so the number of "Beach Volleyball" rows would be inconsistent from resample to resample rather than fixed at the size of the original beach volleyball sample; this is not a bootstrap of that sample.
  • Way 4 is essentially the same as Way 1, except that it creates the DataFrame bv once, before the for-loop. Since bv is identical on every iteration, it doesn’t matter whether we build it before or inside the loop (building it once is simply more efficient).
  • Way 5 is incorrect because one_mean is calculated only once, before the loop, but is appended to the sample_means array 10,000 times. As a result, the same mean is appended repeatedly, instead of a different mean being calculated and appended on each iteration.

Difficulty: ⭐️⭐️

The average score on this problem was 88%.


Problem 17.3

For most of the answer choices in part (b), we do not have enough information to predict how the standard deviation of sample_means would come out. There is one answer choice, however, where we do have enough information to compute the standard deviation of sample_means. Which answer choice is this, and what is the standard deviation of sample_means for this answer choice?

Answer: Way 5

Way 5 results in a sample_means array containing the same mean repeated 10,000 times. As a result, the standard deviation would be 0, because every value in the array is identical.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 57%.


Problem 17.4

There are 68 rows of olympians corresponding to beach volleyball players. Assume that in part (b), we correctly generated an array called sample_means containing 10,000 bootstrapped sample mean ages based on this original sample of 68 ages. The standard deviation of the original sample of 68 ages is approximately how many times larger than the standard deviation of sample_means? Give your answer to the nearest integer.

Answer: 8

Recall that the SD of sample_means is approximately \frac{\text{sample SD}}{\sqrt{\text{sample size}}}, since in bootstrapping, the original sample plays the role of the population. The sample size here is 68, so the SD of the original sample is \sqrt{68} \approx 8.25 times larger than the SD of the distribution of possible sample means, which rounds to 8.


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 46%.



Source: wi24-final — Q13

Problem 18

In our sample, we have data on 163 medals for the sport of table tennis. Based on our data, China seems to really dominate this sport, earning 81 of these medals.

That’s nearly half of the medals for just one country! We want to do a hypothesis test with the following hypotheses to see if this pattern is true in general, or just happens to be true in our sample.

Null: China wins half of Olympic table tennis medals.

Alternative: China does not win half of Olympic table tennis medals.


Problem 18.1

Why can these hypotheses be tested by constructing a confidence interval?

Answer: Since the test aims to determine whether a parameter is equal to a fixed value

The goal of a confidence interval is to provide a range of values that, given the data, are considered plausible for the parameter in question. If the null hypothesis’ fixed value does not fall within this interval, it suggests that the observed data is not very compatible with the null hypothesis. In our case, if a 95% confidence interval for the proportion of medals won by China does not include 0.5, there is statistical evidence at the 5% significance level to suggest that China does not win exactly half of the medals. In other words, a confidence interval can test these hypotheses because we are checking whether or not the hypothesized value of 0.5 lies within our 95% confidence interval.


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 44%.



Problem 18.2

Suppose we construct a 95% bootstrapped CI for the proportion of Olympic table tennis medals won by China. Select all true statements.

Answer: If we resampled our original sample and calculated the proportion of Olympic table tennis medals won by China in our resample, there is approximately a 95% chance our interval would contain this number.

The second option is the only correct answer because it accurately describes how the interval was constructed: it consists of the middle 95% of the bootstrapped proportions, so if we took one more resample of our original sample and computed the proportion of Olympic table tennis medals won by China in that resample, the result would fall inside the interval roughly 95% of the time. The confidence level does not mean that the true proportion has a 95% chance of falling within any single interval we construct; rather, it reflects the long-run proportion of intervals that would contain the true proportion if we could repeat the whole process of sampling and building an interval indefinitely. Thus, the confidence interval gives us a method to estimate the parameter with a specified level of confidence based on the resampling procedure.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 73%.


Problem 18.3

True or False: In this scenario, it would also be appropriate to create a 95% CLT-based confidence interval.

Answer: True

The statement is true because the Central Limit Theorem (CLT) applies to the sample proportion: a proportion is the mean of a collection of 0s and 1s, and our sample size of 163 medals is large enough. The CLT asserts that the distribution of the sample mean (or proportion, in our case) approximates a normal distribution as the sample size grows, allowing the use of standard methods to create confidence intervals. Therefore, a CLT-based confidence interval is appropriate for estimating the true proportion of Olympic table tennis medals won by China.
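To illustrate the mechanics (this computation is not part of the original problem), treat each medal as a 1 if China won it and a 0 otherwise. The proportion is then just the mean of that 0/1 data, so the usual CLT-based interval applies:

    import numpy as np

    n = 163
    p_hat = 81 / n                            # sample proportion, about 0.497
    sample_sd = np.sqrt(p_hat * (1 - p_hat))  # SD of 0/1 data, about 0.5
    sd_of_means = sample_sd / np.sqrt(n)      # square root law
    ci = [p_hat - 2 * sd_of_means, p_hat + 2 * sd_of_means]
    print(ci)                                 # roughly [0.42, 0.58]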


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 71%.



Problem 18.4

True or False: If our 95% bootstrapped CI came out to be [0.479, 0.518], we would reject the null hypothesis at the 0.05 significance level.

Answer: False

This is false; we would fail to reject the null hypothesis because the interval [0.479, 0.518] includes the value 0.5, which corresponds to the null hypothesis that China wins half of the Olympic table tennis medals. If the confidence interval contains the hypothesized value, there is not enough statistical evidence to reject the null hypothesis at the specified significance level. In this case, the data does not provide sufficient evidence to conclude that the proportion of medals won by China is different from 0.5 at the 0.05 significance level.


Difficulty: ⭐️

The average score on this problem was 92%.



Problem 18.5

True or False: If we instead chose to test these hypotheses at the 0.01 significance level, the confidence interval we’d create would be wider.

Answer: True

Lowering the significance level means requiring more evidence to reject the null hypothesis, which corresponds to seeking a higher level of confidence in our interval estimate: a 0.01 significance level corresponds to a 99% confidence interval. A higher confidence level produces a wider interval, since the interval must encompass a larger range of values to contain the true population parameter with the increased probability. Thus, as we lower the significance level, the interval we create becomes wider, making this statement true.


Difficulty: ⭐️⭐️

The average score on this problem was 79%.



Problem 18.6

True or False: If we instead chose to test these hypotheses at the 0.01 significance level, we would be more likely to conclude a statistically significant result.

Answer: False

This statement is false. A smaller significance level lowers the chance of getting a statistically significant result: at the 0.01 significance level, the hypothesized value has to fall outside a 99% confidence interval for the result to be statistically significant. In addition, the hypothesized value of 0.5 was already contained in the narrower 95% confidence interval, so we failed to reject the null hypothesis at the 0.05 significance level. This guarantees failing to reject the null hypothesis at the 0.01 significance level, since anything contained in a 95% confidence interval must also be contained in the corresponding 99% confidence interval.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 62%.



Source: wi25-final — Q5

Problem 19

Among Hogwarts students, Chocolate Frogs are a popular enchanted treat. Chocolate Frogs are individually packaged, and every Chocolate Frog comes with a collectible card of a famous wizard (e.g. "Albus Dumbledore"). There are 80 unique cards, and each package contains one card selected uniformly at random from these 80.

Neville would love to get a complete collection with all 80 cards, and he wants to know how many Chocolate Frogs he should expect to buy to make this happen.

Suppose we have access to a function called frog_experiment that takes no inputs and simulates the act of buying Chocolate Frogs until a complete collection of cards is obtained. The function returns the number of Chocolate Frogs that were purchased. Fill in the blanks below to run 10,000 simulations and set avg_frog_count to the average number of Chocolate Frogs purchased across these experiments.

    frog_counts = np.array([])  
    
    for i in np.arange(10000):
        frog_counts = np.append(__(a)__)
    
    avg_frog_count = __(b)__
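The problem only tells us that frog_experiment exists; for concreteness, here is a minimal sketch of what such a function might do (hypothetical, not the official implementation):

    import numpy as np

    def frog_experiment():
        # Buy frogs one at a time until all 80 distinct cards have appeared.
        cards_seen = np.array([])
        num_frogs = 0
        while len(np.unique(cards_seen)) < 80:
            card = np.random.choice(np.arange(80))  # one card, chosen uniformly
            cards_seen = np.append(cards_seen, card)
            num_frogs = num_frogs + 1
        return num_frogs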


Problem 19.1

What goes in blank (a)?

Answer: frog_counts, frog_experiment()

Each call to frog_experiment() simulates purchasing Chocolate Frogs until a complete set of 80 unique cards is obtained, returning the total number of frogs purchased in that simulation. The result of each simulation is then appended to the frog_counts array.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 65%.


Problem 19.2

What goes in blank (b)?

Answer: frog_counts.mean()

After running the loop 10,000 times, the frog_counts array holds all the simulated totals. Taking the mean of that array (frog_counts.mean()) gives the average number of frogs needed to complete the set of 80 unique cards.
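Putting both answers into the skeleton, the completed simulation reads (assuming frog_experiment is defined as described in the problem):

    import numpy as np

    frog_counts = np.array([])

    for i in np.arange(10000):
        frog_counts = np.append(frog_counts, frog_experiment())

    avg_frog_count = frog_counts.mean()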


Difficulty: ⭐️⭐️

The average score on this problem was 89%.


Problem 19.3

Realistically, Neville can only afford to buy 300 Chocolate Frog cards. Using the simulated data in frog_counts, write a Python expression that evaluates to an approximation of the probability that Neville will be able to complete his collection.

Answer: np.count_nonzero(frog_counts <= 300) / len(frog_counts), or an equivalent expression such as np.count_nonzero(frog_counts <= 300) / 10000 or (frog_counts <= 300).mean()

In the simulated data, each entry of frog_counts is the number of Chocolate Frogs purchased in one simulation before collecting all 80 unique cards. We want to estimate the probability that Neville completes his collection with at most 300 cards.

frog_counts <= 300 creates a boolean array of the same length as frog_counts, where each element is True if the number of frogs used in that simulation was 300 or fewer, and False otherwise.

np.count_nonzero(frog_counts <= 300) counts how many simulations (out of all the simulations) met the condition, since True entries count as nonzero and False entries count as zero.

Dividing by the total number of simulations, len(frog_counts), converts that count to a probability.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 59%.


Problem 19.4

True or False: The Central Limit Theorem states that the data in frog_counts is roughly normally distributed.

Answer: False

The Central Limit Theorem (CLT) says that the probability distribution of the sum or mean of a large random sample drawn with replacement will be roughly normal, regardless of the distribution of the population from which the sample is drawn.

However, the CLT does not claim that individual observations are normally distributed. In this problem, each entry of frog_counts is a single observation: the number of frogs purchased in one simulation to complete the collection. There is no requirement that these individual data points themselves follow a normal distribution.

If we instead repeatedly took large samples of such observations and computed each sample’s mean, then the distribution of those means would tend toward normal as the sample size grows, exactly as the CLT describes.


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 38%.