# Discussion 9: The Central Limit Theorem

The problems in this worksheet are taken from past exams. Work on them on paper, since the exams you take in this course will also be on paper.

We encourage you to complete this worksheet in a live discussion section. Solutions will be made available after all discussion sections have concluded. You don’t need to submit your answers anywhere.

Note: We do not plan to cover all problems here in the live discussion section; the problems we don’t cover can be used for extra practice.

## Problem 1

You need to estimate the proportion of American adults who want to be vaccinated against Covid-19. You plan to survey a random sample of American adults, and use the proportion of adults in your sample who want to be vaccinated as your estimate for the true proportion in the population. Your estimate must be within 0.04 of the true proportion, 95% of the time. Using the fact that the standard deviation of any dataset of 0’s and 1’s is no more than 0.5, calculate the minimum number of people you would need to survey. Input your answer below, as an integer.

Note: Before reviewing these solutions, it’s highly recommended to revisit the lecture on “Choosing Sample Sizes,” since this problem follows the main example from that lecture almost exactly.

While this solution is long, keep in mind from the start that our goal is to solve for the smallest sample size necessary to create a confidence interval that achieves certain criteria.

The Central Limit Theorem tells us that the distribution of the sample mean is roughly normal, regardless of the distribution of the population from which the samples are drawn. At first, it may not be clear how the Central Limit Theorem is relevant, but remember that proportions are means too – for instance, the proportion of adults who want to be vaccinated is equal to the mean of a collection of 1s and 0s, where we have a 1 for each adult that wants to be vaccinated and a 0 for each adult who doesn’t want to be vaccinated. What this means (😉) is that the Central Limit Theorem applies to the distribution of the sample proportion, so we can use it here too.
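To make the "proportions are means" idea concrete, here is a minimal Python sketch. The list `responses` is made-up data for illustration only, not real survey results.

```python
import statistics

# Hypothetical survey responses: 1 = wants to be vaccinated, 0 = does not.
responses = [1, 0, 1, 1, 0, 1, 0, 1]

# Proportion of 1s, computed by counting.
proportion = responses.count(1) / len(responses)

# Mean of the same collection of 0s and 1s.
mean = statistics.mean(responses)

print(proportion, mean)  # both are 0.625
```

Since the two quantities are always equal, anything the Central Limit Theorem says about sample means also holds for sample proportions.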

Not only do we know that the distribution of sample proportions is roughly normal, but we know its mean and standard deviation, too:

\begin{align*} \text{Mean of Distribution of Possible Sample Means} &= \text{Population Mean} = \text{Population Proportion} \\ \text{SD of Distribution of Possible Sample Means} &= \frac{\text{Population SD}}{\sqrt{\text{Sample Size}}} \end{align*}
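These two facts can be checked with a short simulation. The sketch below assumes a population proportion of 0.7 and a sample size of 400; both numbers are hypothetical and chosen only for illustration. It models each surveyed adult as an independent draw, which is a reasonable approximation when the population is much larger than the sample.

```python
import math
import random
import statistics

random.seed(0)

population_proportion = 0.7  # hypothetical; unknown in a real survey
sample_size = 400            # hypothetical
repetitions = 2000

def sample_proportion(p, n):
    """Proportion of 1s in one random sample of n draws, each 1 with probability p."""
    return sum(random.random() < p for _ in range(n)) / n

sample_proportions = [
    sample_proportion(population_proportion, sample_size)
    for _ in range(repetitions)
]

empirical_mean = statistics.mean(sample_proportions)
empirical_sd = statistics.pstdev(sample_proportions)

# The SD of a population of 0s and 1s with proportion p is sqrt(p * (1 - p)).
population_sd = math.sqrt(population_proportion * (1 - population_proportion))
predicted_sd = population_sd / math.sqrt(sample_size)

print("mean of sample proportions:", empirical_mean, "vs", population_proportion)
print("SD of sample proportions:  ", empirical_sd, "vs", predicted_sd)
```

The simulated mean and SD should land close to the values the formulas predict, though not exactly on them, since the simulation itself is random.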

Using this information, we can build toward a 95% confidence interval for the population proportion. In a normal distribution, roughly 95% of values are within 2 standard deviations of the mean, so roughly 95% of possible sample proportions fall in this interval:

\left[ \text{Population Proportion} - 2 \cdot \frac{\text{Population SD}}{\sqrt{\text{Sample Size}}}, \: \text{Population Proportion} + 2 \cdot \frac{\text{Population SD}}{\sqrt{\text{Sample Size}}} \right]

However, this interval depends on the population proportion (mean) and SD, which we don’t know. (If we did know these parameters, there would be no need to collect a sample!) Instead, we’ll use the sample proportion and SD as rough estimates:

\left[ \text{Sample Proportion} - 2 \cdot \frac{\text{Sample SD}}{\sqrt{\text{Sample Size}}}, \: \text{Sample Proportion} + 2 \cdot \frac{\text{Sample SD}}{\sqrt{\text{Sample Size}}} \right]

Note that the width of this interval – that is, its right endpoint minus its left endpoint – is: \text{width} = 4 \cdot \frac{\text{Sample SD}}{\sqrt{\text{Sample Size}}}

In the problem, we’re told that we want our interval to be accurate to within 0.04, which is equivalent to wanting the width of our interval to be less than or equal to 0.08 (since the interval extends the same amount above and below the sample proportion). As such, we need to pick the smallest sample size necessary such that:

\text{width} = 4 \cdot \frac{\text{Sample SD}}{\sqrt{\text{Sample Size}}} \leq 0.08

We can re-arrange the inequality above to solve for our sample’s size:

\begin{align*} 4 \cdot \frac{\text{Sample SD}}{\sqrt{\text{Sample Size}}} &\leq 0.08 \\ \frac{\text{Sample SD}}{\sqrt{\text{Sample Size}}} &\leq 0.02 \\ \frac{1}{\sqrt{\text{Sample Size}}} &\leq \frac{0.02}{\text{Sample SD}} \\ \frac{\text{Sample SD}}{0.02} &\leq \sqrt{\text{Sample Size}} \\ \left( \frac{\text{Sample SD}}{0.02} \right)^2 &\leq \text{Sample Size} \end{align*}

All we now need to do is pick the smallest sample size that satisfies the above inequality. But there’s an issue – we don’t know what our sample SD is, because we haven’t collected our sample! Notice that in the inequality above, as the sample SD increases, so does the minimum necessary sample size. In order to ensure we don’t collect too small of a sample (which would result in the width of our confidence interval being larger than desired), we can use an upper bound for the SD of our sample. In the problem, we’re told that the largest possible SD of a sample of 0s and 1s is 0.5 – this means that if we replace our sample SD with 0.5, we will find a sample size such that the width of our confidence interval is guaranteed to be less than or equal to 0.08. This sample size may be larger than necessary, but that’s better than it being smaller than necessary.
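To see why 0.5 is a safe upper bound, note that a dataset of 0s and 1s in which a proportion p of the values are 1 has SD equal to the square root of p(1 - p). The sketch below evaluates this over a grid of proportions; the function name `sd_of_zeros_and_ones` is just an illustrative choice.

```python
import math

def sd_of_zeros_and_ones(p):
    """SD of a dataset of 0s and 1s in which a proportion p of the values are 1."""
    return math.sqrt(p * (1 - p))

# Try every proportion from 0.00 to 1.00 in steps of 0.01.
proportions = [i / 100 for i in range(101)]
sds = [sd_of_zeros_and_ones(p) for p in proportions]

largest_sd = max(sds)
p_at_largest = proportions[sds.index(largest_sd)]

print(largest_sd, p_at_largest)  # 0.5, reached when p = 0.5
```

The SD peaks when the sample is split evenly between 0s and 1s, and shrinks toward 0 as the sample becomes more lopsided.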

By substituting 0.5 for the sample SD in the last inequality above, we get

\begin{align*} \left( \frac{\text{Sample SD}}{0.02} \right)^2 &\leq \text{Sample Size} \\ \left( \frac{0.5}{0.02} \right)^2 &\leq \text{Sample Size} \\ 25^2 &\leq \text{Sample Size} \implies \text{Sample Size} \geq 625 \end{align*}

We need to pick the smallest possible sample size that is greater than or equal to 625; that’s just 625.
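The whole calculation can be packaged as a small function. This is a sketch rather than a course-provided tool: the name `min_sample_size` and its parameters are our own, and it uses exact fractions so that floating-point rounding cannot push the answer up or down by one.

```python
import math
from fractions import Fraction

def min_sample_size(margin, sd_bound, num_sds=2):
    """Smallest sample size n such that num_sds * sd_bound / sqrt(n) <= margin.

    margin   -- how close the estimate must be to the truth (0.04 here)
    sd_bound -- an upper bound on the sample SD (0.5 for 0s and 1s)
    num_sds  -- SDs to step out from the center (2 for roughly 95%)
    """
    margin = Fraction(str(margin))
    sd_bound = Fraction(str(sd_bound))
    # Rearranged inequality: n >= (num_sds * sd_bound / margin) ** 2
    return math.ceil((num_sds * sd_bound / margin) ** 2)

n = min_sample_size(margin=0.04, sd_bound=0.5)
print(n)  # 625
```

Halving the margin to 0.02 quadruples the requirement to 2500, since the sample size grows with the square of the desired precision.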

##### Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 40%.

## Problem 2

### Problem 2.1

For a given sample, a 90% confidence interval is narrower than a 95% confidence interval.

• True

• False

Answer: True

For a given sample, lowering the confidence level narrows the interval. The narrower an interval is, the less confident we can be that intervals created this way will contain the true population parameter.
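For normal-based intervals, the comparison can be made numerically. The sketch below uses Python's `statistics.NormalDist` to find how many SDs to step out at each confidence level; the sample SD of 4 and sample size of 36 are placeholder values. Note that it uses the more precise multiplier of about 1.96 for 95%, rather than the rounded value of 2 used elsewhere in this worksheet.

```python
import math
from statistics import NormalDist

def interval_width(sample_sd, sample_size, confidence):
    """Width of a normal-based confidence interval at the given confidence level."""
    # Number of SDs to step out so that the middle `confidence` area is covered.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return 2 * z * sample_sd / math.sqrt(sample_size)

width_90 = interval_width(sample_sd=4, sample_size=36, confidence=0.90)
width_95 = interval_width(sample_sd=4, sample_size=36, confidence=0.95)

print(width_90, width_95)  # the 90% interval is the narrower one
```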

##### Difficulty: ⭐️

The average score on this problem was 91%.

### Problem 2.2

The distribution of sample proportions is roughly normal for large samples because of the Central Limit Theorem.

• True

• False

Answer: True

This follows directly from the Central Limit Theorem. Remember that a proportion is a mean of 0s and 1s.

##### Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 43%.

### Problem 2.3

Chebyshev’s inequality implies that we can always create at least a 96% confidence interval from a bootstrap distribution using the mean of the distribution plus or minus 5 standard deviations.

• True

• False

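Answer: True

By Chebyshev’s inequality, at least 1 - 1 / z^2 of the data is within z standard deviations of the mean, no matter the shape of the distribution. With z = 5, at least 1 - 1 / 5^2 = 0.96 of the bootstrap distribution is within 5 standard deviations of its mean.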
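The bound is easy to tabulate and to check against data. In the sketch below, `chebyshev_lower_bound` is our own helper name, and the skewed dataset is randomly generated purely for illustration.

```python
import random
import statistics

def chebyshev_lower_bound(z):
    """Minimum fraction of any dataset within z SDs of its mean."""
    return 1 - 1 / z ** 2

for z in [2, 3, 4, 5]:
    print(z, chebyshev_lower_bound(z))

# Check the z = 5 case on a hypothetical, heavily skewed dataset.
random.seed(1)
data = [random.expovariate(1) for _ in range(10_000)]
mean = statistics.mean(data)
sd = statistics.pstdev(data)
fraction_within = sum(abs(x - mean) <= 5 * sd for x in data) / len(data)

print(fraction_within)  # at least 0.96, as the inequality guarantees
```

Chebyshev's bound is usually far from tight: for most datasets, much more than 96% of the values fall within 5 standard deviations of the mean.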

##### Difficulty: ⭐️⭐️⭐️

The average score on this problem was 51%.

## Problem 3

An IKEA employee has access to a data set of the purchase amounts for 40,000 customer transactions. This data set is roughly normally distributed with mean 150 dollars and standard deviation 25 dollars.

### Problem 3.1

Why is the distribution of purchase amounts roughly normal?

• because of the Central Limit Theorem

• for some other reason

Answer: for some other reason

The data that we have is a sample of purchase amounts. It is not a distribution of sample means or sample sums, so the Central Limit Theorem does not apply. The data just happens to be roughly normally distributed, as human heights are, for example.

##### Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 42%.

### Problem 3.2

Shiv spends 300 dollars at IKEA. How would we describe Shiv’s purchase in standard units?

• 0 standard units

• 2 standard units

• 4 standard units

• 6 standard units

Answer: 6 standard units

To standardize a data value, we subtract the mean of the distribution and divide by the standard deviation:

\begin{aligned} \text{standard units} &= \frac{300 - 150}{25} \\ &= \frac{150}{25} \\ &= 6 \end{aligned}

A more intuitive way to think about standard units is the number of standard deviations above the mean (where negative means below the mean). Here, Shiv spent 150 dollars more than average. One standard deviation is 25 dollars, so 150 dollars is six standard deviations.
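The same conversion as a one-line function, written as a minimal sketch (the name `standard_units` is our own choice here):

```python
def standard_units(value, mean, sd):
    """Number of standard deviations that value lies above the mean."""
    return (value - mean) / sd

shiv = standard_units(300, mean=150, sd=25)
print(shiv)  # 6.0
```

A purchase below the mean gives a negative result; for instance, a 100 dollar purchase is at -2 standard units.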

##### Difficulty: ⭐️

The average score on this problem was 97%.

### Problem 3.3

Give the endpoints of the CLT-based 95% confidence interval for the mean IKEA purchase amount, based on this data.

The Central Limit Theorem tells us about the distribution of the sample mean purchase amount. That’s not the distribution the employee has, but rather a distribution that shows how the mean of a different sample of 40,000 purchases might have turned out. Specifically, we know the following information about the distribution of the sample mean.

1. It is roughly normally distributed.
2. Its mean is about 150 dollars, the same as the mean of the employee’s sample.
3. Its standard deviation is about \frac{\text{sample standard deviation}}{\sqrt{\text{sample size}}}=\frac{25}{\sqrt{40000}} = \frac{25}{200} = \frac{1}{8}.

Since the distribution of the sample mean is roughly normal, we can find a 95% confidence interval for the population mean by stepping out two standard deviations from the center, using the fact that 95% of the area of a normal distribution falls within 2 standard deviations of the mean. Therefore the endpoints of the CLT-based 95% confidence interval for the mean IKEA purchase amount are

• 150 - 2 \cdot \frac{1}{8} = 149.75 dollars, and
• 150 + 2 \cdot \frac{1}{8} = 150.25 dollars.
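The arithmetic above can be reproduced with a short helper. This is a sketch under the worksheet's convention of stepping out 2 standard deviations for 95% confidence; the function name `clt_confidence_interval` is hypothetical.

```python
import math

def clt_confidence_interval(sample_mean, sample_sd, sample_size, num_sds=2):
    """Endpoints of a CLT-based confidence interval for the population mean."""
    # SD of the distribution of the sample mean.
    sd_of_sample_mean = sample_sd / math.sqrt(sample_size)
    left = sample_mean - num_sds * sd_of_sample_mean
    right = sample_mean + num_sds * sd_of_sample_mean
    return left, right

left, right = clt_confidence_interval(sample_mean=150, sample_sd=25, sample_size=40_000)
print(left, right)  # 149.75 150.25
```

The interval is very narrow because the sample is so large: dividing by the square root of 40,000 shrinks the 25 dollar SD to just 12.5 cents.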

##### Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 36%.

## Problem 4

### Problem 4.1

Suppose you have a random sample of 36 games in a basketball season. In your sample, the mean number of points per game is 9, with a standard deviation of 4 points per game. Using only this information, compute a 95% confidence interval for the true mean points per game in that season. What are the left and right endpoints of your interval?

In a normal distribution, roughly 95% of values are within 2 standard deviations of the mean. The CLT tells us that the distribution of sample means is roughly normal, and its SD is approximately \frac{\text{sample SD}}{\sqrt{\text{sample size}}} = \frac{4}{\sqrt{36}} = \frac{2}{3}.

So, our normal-based 95% confidence interval is computed as follows:

\begin{aligned} &[\text{mean of sample} - 2 \cdot \text{SD of distribution of sample means}, \text{mean of sample} + 2 \cdot \text{SD of distribution of sample means}] \\ &= [9 - 2 \cdot \frac{4}{\sqrt{36}}, 9 + 2 \cdot \frac{4}{\sqrt{36}}] \\ &= [9 - \frac{4}{3}, 9 + \frac{4}{3}] \\ &\approx \boxed{[7.667, 10.333]} \end{aligned}

##### Difficulty: ⭐️⭐️

The average score on this problem was 87%.

### Problem 4.2

It turns out that the true mean number of points per game in that season was 7, which is not in the interval you found above (if it is, check your work!). Select the true statement below.

• The 95% confidence interval we created in the previous subpart did not contain the true mean points per game, which means that the distribution of the sample mean is not normal.

• The 95% confidence interval we created in the previous subpart did not contain the true mean points per game, which means that the distribution of points per game in small_season is not normal.

• The 95% confidence interval we created in the previous subpart did not contain the true mean points per game. This is to be expected, because the Central Limit Theorem is only correct 95% of the time.

• The 95% confidence interval we created in the previous subpart did not contain the true mean points per game, but if we collected many original samples and constructed many 95% confidence intervals, then roughly 95% of them would contain the true mean points per game.

• The 95% confidence interval we created in the previous subpart did not contain the true mean points per game, but if we collected many original samples and constructed many 95% confidence intervals, then exactly 95% of them would contain the true mean points per game.

Answer: The 95% confidence interval we created in the previous subpart did not contain the true mean points per game, but if we collected many original samples and constructed many 95% confidence intervals, then roughly 95% of them would contain the true mean points per game.

In a confidence interval, the confidence level gives us a level of confidence in the process used to create the confidence interval. If we repeat the process of collecting a sample from the population and using the sample to construct a c% confidence interval for the population mean, then roughly c% of the intervals we create should contain the population mean. Option 4 is the only option that corresponds to this interpretation; the others are all incorrect in different ways.
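This interpretation can be illustrated by simulation. The sketch below invents a population of points-per-game values (normally distributed with mean 7 and SD 4, an assumption made only so there is something to sample from), repeatedly draws samples of 36 games, and counts how often the resulting interval contains the population mean.

```python
import math
import random
import statistics

random.seed(42)

# Hypothetical population; in practice the population is not available.
population = [random.gauss(7, 4) for _ in range(10_000)]
true_mean = statistics.mean(population)

def confidence_interval(sample, num_sds=2):
    """CLT-based interval: sample mean plus or minus num_sds * SD / sqrt(n)."""
    center = statistics.mean(sample)
    spread = statistics.pstdev(sample) / math.sqrt(len(sample))
    return center - num_sds * spread, center + num_sds * spread

repetitions = 2000
hits = 0
for _ in range(repetitions):
    sample = random.sample(population, 36)  # one random sample of 36 games
    left, right = confidence_interval(sample)
    if left <= true_mean <= right:
        hits += 1

coverage = hits / repetitions
print(coverage)  # roughly 0.95, but not exactly
```

Any single interval either contains the true mean or it doesn't. The 95% describes how often the procedure succeeds across many samples, which is why an individual interval missing the true mean is not a contradiction.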

##### Difficulty: ⭐️⭐️

The average score on this problem was 87%.