Discussion 8: The Central Limit Theorem and Hypothesis Testing



The problems in this worksheet are taken from past exams. Work on them on paper, since the exams you take in this course will also be on paper.

We encourage you to complete this worksheet in a live discussion section. Solutions will be made available after all discussion sections have concluded. You don’t need to submit your answers anywhere.

Note: We do not plan to cover all problems here in the live discussion section; the problems we don’t cover can be used for extra practice.


Problem 1

Oren has a random sample of 200 dog prices in an array called oren. He has also bootstrapped his sample 1,000 times and stored the mean of each resample in an array called boots.

In this question, assume that the following code has run:

a = np.mean(oren)
b = np.std(oren)
c = len(oren)
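For context, here is a minimal sketch of how arrays like oren and boots might come about. The prices are made up (drawn uniformly between 200 and 2000), and the seed and the use of NumPy's default_rng are assumptions for illustration only; the real data and resampling code are not shown in this problem.

```python
import numpy as np

rng = np.random.default_rng(10)  # arbitrary seed, only for reproducibility

# Hypothetical stand-in for Oren's sample: 200 made-up dog prices.
oren = rng.uniform(200, 2000, size=200)

# Bootstrap: resample with replacement, same size as the original sample,
# and record the mean of each resample.
boots = np.array([])
for i in np.arange(1000):
    resample = rng.choice(oren, size=len(oren), replace=True)
    boots = np.append(boots, resample.mean())

# The three quantities defined in the problem.
a = np.mean(oren)
b = np.std(oren)
c = len(oren)
```

Each entry of boots is the mean of one resample, so boots has one value per resample regardless of how large the original sample is.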


Problem 1.1

What expression best estimates the population’s standard deviation?


Problem 1.2

Which expression best estimates the mean of boots?


Problem 1.3

What expression best estimates the standard deviation of boots?


Problem 1.4

What is the dog price of $560 in standard units?


Problem 1.5

The distribution of boots is normal because of the Central Limit Theorem.


Problem 1.6

If Oren’s sample was 400 dogs instead of 200, the standard deviation of boots will…


Problem 1.7

If Oren took 4000 bootstrap resamples instead of 1000, the standard deviation of boots will…


Problem 1.8

Write one line of code that evaluates to the right endpoint of a 92% CLT-based confidence interval for the mean dog price. The following expressions may help:

stats.norm.cdf(1.75) # => 0.96
stats.norm.cdf(1.4)  # => 0.92
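The two values above are areas under the standard normal curve to the left of the given number of standard units. As a rough check, the sketch below reproduces them with Python's standard-library statistics.NormalDist, used here as a stand-in for stats.norm.cdf (an assumption for illustration; the course code uses scipy).

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Area to the left of z under the standard normal curve.
left_of_175 = standard_normal.cdf(1.75)
left_of_14 = standard_normal.cdf(1.4)

print(round(left_of_175, 2), round(left_of_14, 2))
```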



Problem 2

From a population with mean 500 and standard deviation 50, you collect a sample of size 100. The sample has mean 400 and standard deviation 40. You bootstrap this sample 10,000 times, collecting 10,000 resample means.


Problem 2.1

Which of the following is the most accurate description of the mean of the distribution of the 10,000 bootstrapped means?


Problem 2.2

Which of the following is closest to the standard deviation of the distribution of the 10,000 bootstrapped means?



Problem 3

Suppose you draw a sample of size 100 from a population with mean 50 and standard deviation 15. What is the probability that your sample has a mean between 50 and 53? Input the probability below, as a number between 0 and 1, rounded to two decimal places.


Problem 4

The DataFrame apps contains application data for a random sample of 1,000 applicants for a particular credit card from the 1990s. The "age" column contains the applicants’ ages, in years, to the nearest twelfth of a year.

The credit card company that owns the data in apps, BruinCard, has decided not to give us access to the entire apps DataFrame, but instead just a random sample of 100 rows of apps called hundred_apps.

We are interested in estimating the mean age of all applicants in apps given only the data in hundred_apps. The ages in hundred_apps have a mean of 35 and a standard deviation of 10.


Problem 4.1

Give the endpoints of the CLT-based 95% confidence interval for the mean age of all applicants in apps, based on the data in hundred_apps.


Problem 4.2

BruinCard reinstates our access to apps so that we can now easily extract information about the ages of all applicants. We determine that, just like in hundred_apps, the ages in apps have a mean of 35 and a standard deviation of 10. This raises the question of how other samples of 100 rows of apps would have turned out, so we compute 10,000 sample means as follows.

    sample_means = np.array([])
    for i in np.arange(10000):
        sample_mean = apps.sample(100, replace=True).get("age").mean()
        sample_means = np.append(sample_means, sample_mean)

Which of the following three visualizations best depicts the distribution of sample_means?



Problem 4.3

Which of the following statements are guaranteed to be true? Select all that apply.



Problem 5

You need to estimate the proportion of American adults who want to be vaccinated against COVID-19. You plan to survey a random sample of American adults, and use the proportion of adults in your sample who want to be vaccinated as your estimate for the true proportion in the population. Your estimate must be within 0.04 of the true proportion, 95% of the time. Using the fact that the standard deviation of any dataset of 0s and 1s is no more than 0.5, calculate the minimum number of people you would need to survey. Input your answer below, as an integer.


Problem 6

It’s your first time playing a new game called Brunch Menu. The deck contains 96 cards, and each player will be dealt a hand of 9 cards. The goal of the game is to avoid having certain cards, called Rotten Egg cards, which come with a penalty at the end of the game. But you’re not sure how many of the 96 cards in the game are Rotten Egg cards. So you decide to use the Central Limit Theorem to estimate the proportion of Rotten Egg cards in the deck based on the 9 random cards you are dealt in your hand.


Problem 6.1

You are dealt 3 Rotten Egg cards in your hand of 9 cards. You then construct a CLT-based 95% confidence interval for the proportion of Rotten Egg cards in the deck based on this sample. Approximately, how wide is your confidence interval?

Choose the closest answer, and use the following facts:


Problem 6.2

Which of the following are limitations of trying to use the Central Limit Theorem for this particular application? Select all that apply.



Problem 7

You want to estimate the proportion of DSC majors who have a Netflix subscription. To do so, you will survey a random sample of DSC majors and ask them whether they have a Netflix subscription. You will then create a 95% confidence interval for the proportion of “yes” answers in the population, based on the responses in your sample. You decide that your confidence interval should have a width of at most 0.10.


Problem 7.1

In order for your confidence interval to have a width of at most 0.10, the standard deviation of the distribution of the sample proportion must be at most T. What is T? Give your answer as an exact decimal.


Problem 7.2

Using the fact that the standard deviation of any dataset of 0s and 1s is no more than 0.5, calculate the minimum number of people you would need to survey so that the width of your confidence interval is at most 0.10. Give your answer as an integer.



Problem 8

Arya was curious how many UCSD students used Hulu over Thanksgiving break. He surveys 250 students and finds that 130 of them did use Hulu over break and 120 did not.

Using this data, Arya decides to test the following hypotheses:


Problem 8.1

Which of the following could be used as a test statistic for the hypothesis test?


Problem 8.2

For the test statistic that you chose in Problem 8.1, what is the observed value of the statistic? Give your answer either as an exact decimal or a simplified fraction.


Problem 8.3

If the p-value of the hypothesis test is 0.053, what can we conclude, at the standard 0.05 significance level?



Problem 9

At the San Diego Model Railroad Museum, there are different admission prices for children, adults, and seniors. Over a period of time, as tickets are sold, employees keep track of how many of each type of ticket are sold. These ticket counts (in the order child, adult, senior) are stored as follows.

admissions_data = np.array([550, 1550, 400])


Problem 9.1

Complete the code below so that it creates an array admissions_proportions with the proportions of tickets sold to each group (in the order child, adult, senior).

def as_proportion(data):
    return __(a)__

admissions_proportions = as_proportion(admissions_data)

What goes in blank (a)?


Problem 9.2

The museum employees have a model in mind for the proportions in which they sell tickets to children, adults, and seniors. This model is stored as follows.

model = np.array([0.25, 0.6, 0.15])

We want to conduct a hypothesis test to determine whether the admissions data we have is consistent with this model. Which of the following is the null hypothesis for this test?


Problem 9.3

Which of the following test statistics could we use to test our hypotheses? Select all that could work.


Problem 9.4

Below, we’ll perform the hypothesis test with a different test statistic, the mean of the absolute differences in proportions.

Recall that the ticket counts we observed for children, adults, and seniors are stored in the array admissions_data = np.array([550, 1550, 400]), and that our model is model = np.array([0.25, 0.6, 0.15]).

For our hypothesis test to determine whether the admissions data is consistent with our model, what is the observed value of the test statistic? Input your answer as a decimal between 0 and 1. Round to three decimal places. (Suppose that the value you calculated is assigned to the variable observed_stat, which you will use in later questions.)


Problem 9.5

Now, we want to simulate the test statistic 10,000 times under the assumptions of the null hypothesis. Fill in the blanks below to complete this simulation and calculate the p-value for our hypothesis test. Assume that the variables admissions_data, admissions_proportions, model, and observed_stat are already defined as specified earlier in the question.

simulated_stats = np.array([]) 
for i in np.arange(10000):
    simulated_proportions = as_proportion(np.random.multinomial(__(a)__, __(b)__))
    simulated_stat = __(c)__
    simulated_stats = np.append(simulated_stats, simulated_stat)

p_value = __(d)__

What goes in blank (a)? What goes in blank (b)? What goes in blank (c)? What goes in blank (d)?
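As a reminder of what np.random.multinomial returns, here is a small sketch with made-up inputs that are unrelated to the museum data. The names n_draws and probabilities are hypothetical and chosen only for this illustration.

```python
import numpy as np

# Made-up inputs: 10 draws spread across 3 categories.
n_draws = 10
probabilities = [0.2, 0.3, 0.5]

# One call simulates n_draws independent draws and returns the count
# that landed in each category, in the same order as probabilities.
counts = np.random.multinomial(n_draws, probabilities)

print(counts, counts.sum())
```

The result is an array of counts, not proportions, and the counts always add up to the number of draws.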


Problem 9.6

True or False: the p-value represents the probability that the null hypothesis is true.


Problem 9.7

The new statistic that we used for this hypothesis test, the mean of the absolute differences in proportions, is in fact closely related to the total variation distance. Given two arrays of length three, array_1 and array_2, suppose we compute the mean of the absolute differences in proportions between array_1 and array_2 and store the result as madp. What value would we have to multiply madp by to obtain the total variation distance between array_1 and array_2? Input your answer below, rounding to three decimal places.



Problem 10

For this question, let’s think of the data in app_data as a random sample of all IKEA purchases and use it to test the following hypotheses.

Null Hypothesis: IKEA sells an equal number of beds (category 'bed') and outdoor furniture items (category 'outdoor').

Alternative Hypothesis: IKEA sells more beds than outdoor furniture.

The DataFrame app_data contains 5000 rows, which form our sample. Of these 5000 products,


Problem 10.1

Which of the following could be used as the test statistic for this hypothesis test? Select all that apply.


Problem 10.2

Let’s do a hypothesis test with the following test statistic: among 2500 beds and outdoor furniture items, the proportion of outdoor furniture minus the proportion of beds.

Complete the code below to calculate the observed value of the test statistic and save the result as obs_diff.

    outdoor = (app_data.get('category')=='outdoor') 
    bed = (app_data.get('category')=='bed')
    obs_diff = ( ___(a)___ - ___(b)___ ) / ___(c)___

The table below contains several Python expressions. Choose the correct expression to fill in each of the three blanks. Three expressions will be used, and two will be unused.


Problem 10.3

Which of the following is a valid way to generate one value of the test statistic according to the null model? Select all that apply.

Way 1:

multi = np.random.multinomial(2500, [0.5,0.5]) 
(multi[0] - multi[1])/2500

Way 2:

outdoor = np.random.multinomial(2500, [0.5,0.5])[0]/2500 
bed = np.random.multinomial(2500, [0.5,0.5])[1]/2500 
outdoor - bed 

Way 3:

choice = np.random.choice([0, 1], 2500, replace=True) 
choice_sum = choice.sum( ) 
(choice_sum - (2500 - choice_sum))/2500

Way 4:

choice = np.random.choice(['bed', 'outdoor'], 2500, replace=True) 
bed = np.count_nonzero(choice=='bed')
outdoor = np.count_nonzero(choice=='outdoor')
outdoor/2500 - bed/2500

Way 5:

outdoor = (app_data.get('category')=='outdoor') 
bed = (app_data.get('category')=='bed')
samp = app_data[outdoor|bed].sample(2500, replace=True) 
samp[samp.get('category')=='outdoor'].shape[0]/2500 - samp[samp.get('category')=='bed'].shape[0]/2500

Way 6:

outdoor = (app_data.get('category')=='outdoor') 
bed = (app_data.get('category')=='bed')
samp = (app_data[outdoor|bed].groupby('category').count( ).reset_index( ).sample(2500, replace=True))    
samp[samp.get('category')=='outdoor'].shape[0]/2500 - samp[samp.get('category')=='bed'].shape[0]/2500


Problem 10.4

Suppose we generate 10,000 simulated values of the test statistic according to the null model and store them in an array called simulated_diffs. Complete the code below to calculate the p-value for the hypothesis test.

    np.count_nonzero(simulated_diffs _________ obs_diff)/10000

What goes in the blank?
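For reference, the pattern np.count_nonzero(condition) / total turns a Boolean comparison into a proportion. The sketch below uses a made-up array and an equality comparison purely to show the mechanics; it is not the comparison this problem asks for.

```python
import numpy as np

values = np.array([0.1, -0.2, 0.3, 0.0, 0.3])  # made-up numbers

# Comparing an array to a number gives an array of True/False values.
matches = values == 0.3

# count_nonzero counts the True entries; dividing gives a proportion.
count = np.count_nonzero(matches)
proportion = count / len(values)

print(count, proportion)
```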


