
The problems in this worksheet are taken from past exams. Work on
them **on paper**, since the exams you take in this course
will also be on paper.

We encourage you to complete this
worksheet in a live discussion section. Solutions will be made available
after all discussion sections have concluded. You don’t need to submit
your answers anywhere. **Note: We do not plan to cover all
problems here in the live discussion section**; the problems we don’t
cover can be used for extra practice.

You need to estimate the proportion of American adults who want to be
vaccinated against Covid-19. You plan to survey a random sample of
American adults, and use the proportion of adults in your sample who
want to be vaccinated as your estimate for the true proportion in the
population. Your estimate must be within 0.04 of the true proportion,
95% of the time. Using the fact that the standard deviation of any
dataset of 0’s and 1’s is no more than 0.5, calculate the minimum number
of people you would need to survey. Input your answer below, as an
**integer**.

**Answer:** 625

*Note: Before reviewing these solutions, it’s highly recommended
to revisit the lecture on “Choosing Sample Sizes,” since this problem
follows the main example from that lecture almost exactly.*

While this solution is long, keep in mind from the start that our
goal is to solve for the **smallest sample size necessary**
to create a confidence interval that achieves certain criteria.

The Central Limit Theorem tells us that the distribution of the
sample mean is roughly normal, regardless of the distribution of the
population from which the samples are drawn. At first, it may not be
clear how the Central Limit Theorem is relevant, but remember that
proportions are means too – for instance, the proportion of adults who
want to be vaccinated is equal to the mean of a collection of 1s and 0s,
where we have a 1 for each adult that wants to be vaccinated and a 0 for
each adult who doesn’t want to be vaccinated. What this means (😉) is
that **the Central Limit Theorem applies to the distribution of
the sample proportion, so we can use it here too**.
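To see concretely why a proportion is a mean, here is a small sketch. The list `responses` is made-up illustration data, not data from the problem.

```python
from statistics import mean

# Hypothetical survey responses: 1 = wants to be vaccinated, 0 = does not.
responses = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]

# Proportion computed by counting the 1s directly.
proportion_by_count = responses.count(1) / len(responses)

# Proportion computed as the mean of the 0s and 1s.
proportion_by_mean = mean(responses)

print(proportion_by_count, proportion_by_mean)  # both are 0.7
```

Both computations give the same number, which is why results about sample means carry over to sample proportions.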

Not only do we know that the distribution of sample proportions is roughly normal, but we know its mean and standard deviation, too:

\begin{align*} \text{Mean of Distribution of Possible Sample Means} &= \text{Population Mean} = \text{Population Proportion} \\ \text{SD of Distribution of Possible Sample Means} &= \frac{\text{Population SD}}{\sqrt{\text{Sample Size}}} \end{align*}

Using this information, we can create a 95% confidence interval for the population proportion, using the fact that in a normal distribution, roughly 95% of values are within 2 standard deviations of the mean:

\left[ \text{Population Proportion} - 2 \cdot \frac{\text{Population SD}}{\sqrt{\text{Sample Size}}}, \: \text{Population Proportion} + 2 \cdot \frac{\text{Population SD}}{\sqrt{\text{Sample Size}}} \right]

However, this interval depends on the population proportion (mean) and SD, which we don’t know. (If we did know these parameters, there would be no need to collect a sample!) Instead, we’ll use the sample proportion and SD as rough estimates:

\left[ \text{Sample Proportion} - 2 \cdot \frac{\text{Sample SD}}{\sqrt{\text{Sample Size}}}, \: \text{Sample Proportion} + 2 \cdot \frac{\text{Sample SD}}{\sqrt{\text{Sample Size}}} \right]

Note that the width of this interval – that is, its right endpoint minus its left endpoint – is: \text{width} = 4 \cdot \frac{\text{Sample SD}}{\sqrt{\text{Sample Size}}}

In the problem, we’re told that we want our interval to be accurate
to within 0.04, which is equivalent to wanting the width of our interval
to be less than or equal to 0.08 (since the interval extends the same
amount above and below the sample proportion). As such, we need to pick
the **smallest sample size necessary** such that:

\text{width} = 4 \cdot \frac{\text{Sample SD}}{\sqrt{\text{Sample Size}}} \leq 0.08

We can re-arrange the inequality above to solve for our sample’s size:

\begin{align*} 4 \cdot \frac{\text{Sample SD}}{\sqrt{\text{Sample Size}}} &\leq 0.08 \\ \frac{\text{Sample SD}}{\sqrt{\text{Sample Size}}} &\leq 0.02 \\ \frac{1}{\sqrt{\text{Sample Size}}} &\leq \frac{0.02}{\text{Sample SD}} \\ \frac{\text{Sample SD}}{0.02} &\leq \sqrt{\text{Sample Size}} \\ \left( \frac{\text{Sample SD}}{0.02} \right)^2 &\leq \text{Sample Size} \end{align*}

All we now need to do is pick the smallest sample size that satisfies
the above inequality. But there’s an issue – **we don’t know what
our sample SD is, because we haven’t collected our sample**!
Notice that in the inequality above, as the sample SD increases, so does
the minimum necessary sample size. In order to ensure we don’t collect
too small of a sample (which would result in the width of our confidence
interval being *larger* than desired), we can use an upper bound
for the SD of our sample. In the problem, we’re told that the largest
possible SD of a sample of 0s and 1s is 0.5 – this means that if we
replace our sample SD with 0.5, we will find a sample size such that the
width of our confidence interval is guaranteed to be less than or equal
to 0.08. This sample size may be larger than necessary, but that’s
better than it being smaller than necessary.
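The 0.5 upper bound can be checked numerically. The sketch below relies on a standard fact that the problem does not state: a dataset of 0s and 1s in which a proportion `p` of the values are 1 has SD equal to the square root of `p * (1 - p)`. The function name and the grid of proportions are illustrative choices.

```python
from math import sqrt

def sd_of_zeros_and_ones(p):
    """SD of a dataset of 0s and 1s in which a proportion p of the values are 1."""
    return sqrt(p * (1 - p))

# Try every proportion from 0.00 to 1.00 in steps of 0.01.
proportions = [i / 100 for i in range(101)]
largest_sd = max(sd_of_zeros_and_ones(p) for p in proportions)

print(largest_sd)  # 0.5, attained when p = 0.5
```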

By substituting 0.5 for the sample SD in the last inequality above, we get

\begin{align*} \left( \frac{\text{Sample SD}}{0.02} \right)^2 &\leq \text{Sample Size} \\ \left( \frac{0.5}{0.02} \right)^2 &\leq \text{Sample Size} \\ 25^2 &\leq \text{Sample Size} \implies \text{Sample Size} \geq 625 \end{align*}

We need to pick the smallest possible sample size that is greater than or equal to 625; that’s just 625.
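The calculation above can be sketched as a short function. The name `min_sample_size` and its parameters are hypothetical, chosen for illustration. It uses exact fractions so that floating-point rounding cannot nudge the result just past a whole number before rounding up.

```python
from fractions import Fraction
import math

def min_sample_size(max_sd, margin, num_sds=2):
    """Smallest sample size for which num_sds * max_sd / sqrt(n) <= margin."""
    max_sd = Fraction(str(max_sd))
    margin = Fraction(str(margin))
    # Rearranged inequality: n >= (num_sds * max_sd / margin) ** 2
    return math.ceil((num_sds * max_sd / margin) ** 2)

print(min_sample_size(0.5, 0.04))  # 625
```

Here `margin` is the 0.04 from the problem, which is half of the interval's width of 0.08.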

The average score on this problem was 40%.

Answer the following true/false questions.

For a given sample, a 90% confidence interval is narrower than a 95% confidence interval.

True

False

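**Answer:** True

The narrower an interval is, the less confident we can be that intervals created this way will contain the true population parameter. A 90% confidence interval trades some confidence for a narrower range, so it is narrower than a 95% confidence interval built from the same sample.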
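As a rough numerical check, the sketch below builds CLT-based intervals at both confidence levels from the same sample summary. The sample mean, SD, and size are made-up values, and the function uses the exact normal multiplier from `statistics.NormalDist` rather than the "2 standard deviations" rule of thumb.

```python
from math import sqrt
from statistics import NormalDist

def clt_interval(sample_mean, sample_sd, sample_size, confidence):
    """CLT-based confidence interval for a population mean."""
    tail = (1 - confidence) / 2
    z = NormalDist().inv_cdf(1 - tail)  # number of SDs to step out
    margin = z * sample_sd / sqrt(sample_size)
    return sample_mean - margin, sample_mean + margin

# Hypothetical sample: mean 50, SD 10, size 100.
left_90, right_90 = clt_interval(50, 10, 100, 0.90)
left_95, right_95 = clt_interval(50, 10, 100, 0.95)

width_90 = right_90 - left_90
width_95 = right_95 - left_95
print(width_90, width_95)  # about 3.29 and about 3.92
```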

The average score on this problem was 91%.

The distribution of sample proportions is roughly normal for large samples because of the Central Limit Theorem.

True

False

**Answer:** True

This is a direct application of the Central Limit Theorem. Remember that a proportion is a mean of 0s and 1s, so the distribution of sample proportions is a distribution of sample means.

The average score on this problem was 43%.

Chebyshev’s inequality implies that we can always create at least a 96% confidence interval from a bootstrap distribution using the mean of the distribution plus or minus 5 standard deviations.

True

False

**Answer:** True

By Chebyshev’s inequality, at least `1 - 1 / z^2` of the data in any distribution is within `z` SDs of the mean. Thus at least `1 - 1 / 5^2 = 0.96` of the data is within 5 SDs of the mean.
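Here is a minimal sketch of the bound, along with a check on a deliberately skewed, made-up dataset to illustrate that the guarantee does not depend on the data being normal. The function name and the data are illustrative assumptions.

```python
from statistics import mean, pstdev

def chebyshev_lower_bound(z):
    """Minimum proportion of any dataset within z SDs of its mean."""
    return 1 - 1 / z ** 2

# Hypothetical, heavily skewed data: 99 small values and one large outlier.
data = [1] * 99 + [1000]
center = mean(data)
spread = pstdev(data)  # population SD, dividing by n

# Proportion of values within 5 SDs of the mean.
within = sum(abs(x - center) <= 5 * spread for x in data) / len(data)

bound = chebyshev_lower_bound(5)
print(bound)   # about 0.96
print(within)  # 0.99, which is at least the bound
```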

The average score on this problem was 51%.

An IKEA employee has access to a data set of the purchase amounts for 40,000 customer transactions. This data set is roughly normally distributed with mean 150 dollars and standard deviation 25 dollars.

Why is the distribution of purchase amounts roughly normal?

because of the Central Limit Theorem

for some other reason

**Answer:** for some other reason

The data that we have is a sample of purchase amounts. It is not a sample mean or sample sum, so the Central Limit Theorem does not apply. The data just naturally happens to be roughly normally distributed, like human heights, for example.

The average score on this problem was 42%.

Shiv spends 300 dollars at IKEA. How would we describe Shiv’s purchase in standard units?

0 standard units

2 standard units

4 standard units

6 standard units

**Answer:** 6 standard units

To standardize a data value, we subtract the mean of the distribution and divide by the standard deviation:

\begin{aligned} \text{standard units} &= \frac{300 - 150}{25} \\ &= \frac{150}{25} \\ &= 6 \end{aligned}

A more intuitive way to think about standard units is the number of standard deviations above the mean (where negative means below the mean). Here, Shiv spent 150 dollars more than average. One standard deviation is 25 dollars, so 150 dollars is six standard deviations.
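The same conversion as a short sketch; `standard_units` is a hypothetical helper name, and the 100 dollar purchase is a made-up extra example.

```python
def standard_units(value, mean, sd):
    """Number of SDs that value lies above (positive) or below (negative) the mean."""
    return (value - mean) / sd

print(standard_units(300, 150, 25))  # 6.0
print(standard_units(100, 150, 25))  # -2.0, two SDs below the mean
```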

The average score on this problem was 97%.

Give the endpoints of the CLT-based 95% confidence interval for the mean IKEA purchase amount, based on this data.

**Answer:** 149.75 and 150.25 dollars

The Central Limit Theorem tells us about the distribution of the
sample mean purchase amount. That’s not the distribution the employee
has, but rather a distribution that shows how the mean of a
*different* sample of 40,000 purchases might have turned out.
Specifically, we know the following information about the distribution
of the sample mean.

- It is roughly normally distributed.
- Its mean is about 150 dollars, the same as the mean of the employee’s sample.
- Its standard deviation is about \frac{\text{sample standard deviation}}{\sqrt{\text{sample size}}}=\frac{25}{\sqrt{40000}} = \frac{25}{200} = \frac{1}{8}.

Since the distribution of the sample mean is roughly normal, we can find a 95% confidence interval for the population mean by stepping out two standard deviations from the center, using the fact that 95% of the area of a normal distribution falls within 2 standard deviations of the mean. Therefore the endpoints of the CLT-based 95% confidence interval for the mean IKEA purchase amount are

- 150 - 2 \cdot \frac{1}{8} = 149.75 dollars, and
- 150 + 2 \cdot \frac{1}{8} = 150.25 dollars.
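The steps above can be sketched as a small function. The name `clt_95_interval` is hypothetical, and the function uses the "2 standard deviations" rule of thumb from this solution rather than a more precise multiplier.

```python
from math import sqrt

def clt_95_interval(sample_mean, sample_sd, sample_size):
    """Approximate 95% CI: sample mean plus or minus 2 SDs of the sample mean."""
    sd_of_sample_mean = sample_sd / sqrt(sample_size)
    margin = 2 * sd_of_sample_mean
    return sample_mean - margin, sample_mean + margin

print(clt_95_interval(150, 25, 40000))  # (149.75, 150.25)
```

Because the sample is so large, the SD of the sample mean is small, and the interval is narrow even though individual purchase amounts vary widely.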

The average score on this problem was 36%.

Answer the following questions about a basketball dataset.

Suppose you have a random sample of 36 games in a basketball season. In your sample, the mean number of points per game is 9, with a standard deviation of 4 points per game. Using only this information, compute a 95% confidence interval for the true mean points per game in that season. What are the left and right endpoints of your interval?

**Answer:** [7.667, 10.333]

In a normal distribution, roughly 95% of values are within 2 standard deviations of the mean. The CLT tells us that the distribution of sample means is roughly normal, and that its SD is approximately \frac{\text{sample SD}}{\sqrt{\text{sample size}}} = \frac{4}{\sqrt{36}} = \frac{2}{3}.

So, our normal-based 95% confidence interval is computed as follows:

\begin{aligned} &[\text{mean of sample} - 2 \cdot \text{SD of distribution of sample means}, \text{mean of sample} + 2 \cdot \text{SD of distribution of sample means}] \\ &= [9 - 2 \cdot \frac{4}{\sqrt{36}}, 9 + 2 \cdot \frac{4}{\sqrt{36}}] \\ &= [9 - \frac{4}{3}, 9 + \frac{4}{3}] \\ &\approx \boxed{[7.667, 10.333]} \end{aligned}

The average score on this problem was 87%.

It turns out that the true mean number of points per game in that season was 7, which is not in the interval you found above (if it is, check your work!). Select the true statement below.

The 95% confidence interval we created in the previous subpart did not contain the true mean points per game, which means that the distribution of the sample mean is not normal.

The 95% confidence interval we created in the previous subpart did not contain the true mean points per game, which means that the distribution of points per game in `small_season` is not normal.

The 95% confidence interval we created in the previous subpart did not contain the true mean points per game. This is to be expected, because the Central Limit Theorem is only correct 95% of the time.

The 95% confidence interval we created in the previous subpart did not contain the true mean points per game, but if we collected many original samples and constructed many 95% confidence intervals, then roughly 95% of them would contain the true mean points per game.

The 95% confidence interval we created in the previous subpart did not contain the true mean points per game, but if we collected many original samples and constructed many 95% confidence intervals, then exactly 95% of them would contain the true mean points per game.

**Answer:** The 95% confidence interval we created in
the previous subpart did not contain the true mean points per game, but
if we collected many original samples and constructed many 95%
confidence intervals, then roughly 95% of them would contain the true
mean points per game.

In a confidence interval, the confidence level gives us a level of
confidence in **the process** used to create the confidence
interval. If we repeat the process of collecting a sample from the
population and using the sample to construct a c% confidence interval
for the population mean, then **roughly** c% of the
intervals we create should contain the population mean. Option 4 is the
only option that corresponds to this interpretation; the others are all
incorrect in different ways.
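This interpretation can be illustrated with a simulation. The sketch below assumes a hypothetical population of game scores that is normal with mean 7 and SD 4; the problem does not specify the population's shape, so these are illustrative choices, as are the seed and the number of repetitions.

```python
import random
from math import sqrt
from statistics import mean, pstdev

rng = random.Random(0)       # fixed seed so the run is repeatable
TRUE_MEAN, TRUE_SD = 7, 4    # hypothetical population parameters
SAMPLE_SIZE = 36
NUM_SAMPLES = 2000

def interval_contains_true_mean():
    """Draw one sample, build its 95% CI, and report whether it covers TRUE_MEAN."""
    sample = [rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(SAMPLE_SIZE)]
    center = mean(sample)
    margin = 2 * pstdev(sample) / sqrt(SAMPLE_SIZE)
    return center - margin <= TRUE_MEAN <= center + margin

# Repeat the whole process many times and record how often the interval succeeds.
coverage = sum(interval_contains_true_mean() for _ in range(NUM_SAMPLES)) / NUM_SAMPLES
print(coverage)
```

The printed proportion should land near 0.95 but not exactly on it, which is why the correct option says "roughly" rather than "exactly". Any single interval, like the one in this problem, can still miss.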

The average score on this problem was 87%.