The problems in this worksheet are taken from past exams. Work on
them on paper, since the exams you take in this course
will also be on paper.
We encourage you to complete this
worksheet in a live discussion section. Solutions will be made available
after all discussion sections have concluded. You don’t need to submit
your answers anywhere.
Note: We do not plan to cover all
problems here in the live discussion section; the problems we don’t
cover can be used for extra practice.
Oren has a random sample of 200 dog prices in an array called oren. He has also bootstrapped his sample 1,000 times and stored the mean of each resample in an array called boots.
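For reference, the setup described above might be sketched as follows. The prices here are made up (the real contents of oren are not given), and the seed is an assumption used only to make the sketch reproducible.

```python
import numpy as np

rng = np.random.default_rng(10)  # seeded only so the sketch is reproducible

# Hypothetical stand-in for Oren's sample: 200 made-up dog prices.
oren = rng.uniform(200, 2000, size=200)

# Resample 1,000 times, with replacement, at the original sample size,
# and store the mean of each resample.
boots = np.array([])
for i in np.arange(1000):
    resample = rng.choice(oren, size=len(oren), replace=True)
    boots = np.append(boots, resample.mean())
```

Each resample has the same size as the original sample and is drawn with replacement, which is what makes this a bootstrap.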
In this question, assume that the following code has run:
a = np.mean(oren)
b = np.std(oren)
c = len(oren)
What expression best estimates the population’s standard deviation?
b
b / c
b / np.sqrt(c)
b * np.sqrt(c)
Which expression best estimates the mean of boots?
0
a
(oren - a).mean()
(oren - a) / b
What expression best estimates the standard deviation of boots?
b
b / c
b / np.sqrt(c)
(a - b) / np.sqrt(c)
What is the dog price of $560 in standard units?
(560 - a) / b
(560 - a) / (b / np.sqrt(c))
(a - 560) / (b / np.sqrt(c))
abs(560 - a) / b
abs(560 - a) / (b / np.sqrt(c))
The distribution of boots is normal because of the Central Limit Theorem.
True
False
If Oren’s sample was 400 dogs instead of 200, the standard deviation of boots will…
Increase by a factor of 2
Increase by a factor of \sqrt{2}
Decrease by a factor of 2
Decrease by a factor of \sqrt{2}
None of the above
If Oren took 4000 bootstrap resamples instead of 1000, the standard deviation of boots will…
Increase by a factor of 4
Increase by a factor of 2
Decrease by a factor of 2
Decrease by a factor of 4
None of the above
Write one line of code that evaluates to the right endpoint of a 92% CLT-Based confidence interval for the mean dog price. The following expressions may help:
stats.norm.cdf(1.75) # => 0.96
stats.norm.cdf(1.4) # => 0.92
From a population with mean 500 and standard deviation 50, you collect a sample of size 100. The sample has mean 400 and standard deviation 40. You bootstrap this sample 10,000 times, collecting 10,000 resample means.
Which of the following is the most accurate description of the mean of the distribution of the 10,000 bootstrapped means?
The mean will be exactly equal to 400.
The mean will be exactly equal to 500.
The mean will be approximately equal to 400.
The mean will be approximately equal to 500.
Which of the following is closest to the standard deviation of the distribution of the 10,000 bootstrapped means?
400
40
4
0.4
Suppose you draw a sample of size 100 from a population with mean 50 and standard deviation 15. What is the probability that your sample has a mean between 50 and 53? Input the probability below, as a number between 0 and 1, rounded to two decimal places.
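The general technique for this kind of question can be sketched with numbers deliberately different from the ones above. This is a minimal illustration, assuming a hypothetical population with mean 100 and standard deviation 20 and a sample of size 25; it relies on the CLT approximation, so the result is approximate.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical numbers, deliberately different from the question above.
pop_mean, pop_sd, n = 100, 20, 25

# By the CLT, the sample mean is roughly normal, centered at the population
# mean, with SD equal to the population SD divided by sqrt(n).
sd_of_sample_mean = pop_sd / sqrt(n)  # 20 / 5 = 4
sample_mean_dist = NormalDist(mu=pop_mean, sigma=sd_of_sample_mean)

# P(100 <= sample mean <= 104) as a difference of two CDF values.
prob = sample_mean_dist.cdf(104) - sample_mean_dist.cdf(100)
print(round(prob, 2))  # 0.34
```

Here 104 is exactly one standard deviation of the sample mean above the center, so the probability is the area under the standard normal curve between 0 and 1.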
The DataFrame apps contains application data for a random sample of 1,000 applicants for a particular credit card from the 1990s. The "age" column contains the applicants’ ages, in years, to the nearest twelfth of a year.
The credit card company that owns the data in apps, BruinCard, has decided not to give us access to the entire apps DataFrame, but instead just a random sample of 100 rows of apps called hundred_apps.
We are interested in estimating the mean age of all applicants in apps given only the data in hundred_apps. The ages in hundred_apps have a mean of 35 and a standard deviation of 10.
Give the endpoints of the CLT-based 95% confidence interval for the mean age of all applicants in apps, based on the data in hundred_apps.
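As a general illustration of how a CLT-based interval for a mean is built, here is a minimal sketch. The helper clt_interval and the sample values (mean 70, SD 12, size 144) are hypothetical and chosen to differ from the question. The multiplier is computed exactly from the normal curve, which gives about 1.96 for a 95% interval; some introductory treatments round this to 2.

```python
from math import sqrt
from statistics import NormalDist

def clt_interval(sample_mean, sample_sd, n, level=0.95):
    """Return (left, right) endpoints of a CLT-based interval for the mean."""
    # The middle `level` of the normal curve leaves (1 - level) / 2 in each tail.
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    # The sample SD stands in for the unknown population SD.
    margin = z * sample_sd / sqrt(n)
    return sample_mean - margin, sample_mean + margin

# Hypothetical sample: mean 70, SD 12, size 144, so the SD of the sample mean is 1.
left, right = clt_interval(70, 12, 144)
print(round(left, 2), round(right, 2))  # 68.04 71.96
```

Lowering the confidence level shrinks the multiplier and therefore the width, while a larger sample shrinks the SD of the sample mean.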
BruinCard reinstates our access to apps so that we can now easily extract information about the ages of all applicants. We determine that, just like in hundred_apps, the ages in apps have a mean of 35 and a standard deviation of 10. This raises the question of how other samples of 100 rows of apps would have turned out, so we compute 10,000 sample means as follows.
sample_means = np.array([])
for i in np.arange(10000):
    sample_mean = apps.sample(100, replace=True).get("age").mean()
    sample_means = np.append(sample_means, sample_mean)
Which of the following three visualizations best depicts the distribution of sample_means?
Which of the following statements are guaranteed to be true? Select all that apply.
We used bootstrapping to compute sample_means.
The ages of credit card applicants are roughly normally distributed.
A CLT-based 90% confidence interval for the mean age of credit card applicants, based on the data in hundred_apps, would be narrower than the interval you gave in part (a).
The expression np.percentile(sample_means, 2.5) evaluates to the left endpoint of the interval you gave in part (a).
If we used the data in hundred_apps to create 1,000 CLT-based 95% confidence intervals for the mean age of applicants in apps, approximately 950 of them would contain the true mean age of applicants in apps.
None of the above.
You need to estimate the proportion of American adults who want to be vaccinated against Covid-19. You plan to survey a random sample of American adults, and use the proportion of adults in your sample who want to be vaccinated as your estimate for the true proportion in the population. Your estimate must be within 0.04 of the true proportion, 95% of the time. Using the fact that the standard deviation of any dataset of 0’s and 1’s is no more than 0.5, calculate the minimum number of people you would need to survey. Input your answer below, as an integer.
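The reasoning behind this kind of sample-size calculation can be sketched in code. This is a minimal illustration under stated assumptions: the helper min_sample_size is hypothetical, the default multiplier of 2 standard deviations is the common rounded stand-in for a 95% interval, and the example margin of 0.1 is deliberately different from the question.

```python
import math

def min_sample_size(margin, z=2.0, max_sd=0.5):
    """Smallest n with z * max_sd / sqrt(n) <= margin."""
    # Rearranging the inequality gives sqrt(n) >= z * max_sd / margin.
    root_n = z * max_sd / margin
    # Subtract a tiny tolerance so floating-point noise cannot push an
    # exact answer up to the next integer.
    return math.ceil(root_n ** 2 - 1e-9)

# Hypothetical target: estimate within 0.1 of the true proportion, 95% of the time.
n_needed = min_sample_size(0.1)
print(n_needed)  # 100
```

Using the upper bound of 0.5 for the standard deviation makes the result conservative: it is large enough no matter what the true proportion turns out to be.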
It’s your first time playing a new game called Brunch Menu. The deck contains 96 cards, and each player will be dealt a hand of 9 cards. The goal of the game is to avoid having certain cards, called Rotten Egg cards, which come with a penalty at the end of the game. But you’re not sure how many of the 96 cards in the game are Rotten Egg cards. So you decide to use the Central Limit Theorem to estimate the proportion of Rotten Egg cards in the deck based on the 9 random cards you are dealt in your hand.
You are dealt 3 Rotten Egg cards in your hand of 9 cards. You then construct a CLT-based 95% confidence interval for the proportion of Rotten Egg cards in the deck based on this sample. Approximately, how wide is your confidence interval?
Choose the closest answer, and use the following facts:
The standard deviation of a collection of 0s and 1s is \sqrt{(\text{Prop. of 0s}) \cdot (\text{Prop. of 1s})}.
\sqrt{18} is about \frac{17}{4}.
\frac{17}{9}
\frac{17}{27}
\frac{17}{81}
\frac{17}{96}
Which of the following are limitations of trying to use the Central Limit Theorem for this particular application? Select all that apply.
The CLT is for large random samples, and our sample was not very large.
The CLT is for random samples drawn with replacement, and our sample was drawn without replacement.
The CLT is for normally distributed data, and our data may not have been normally distributed.
The CLT is for sample means and sums, not sample proportions.
You want to estimate the proportion of DSC majors who have a Netflix subscription. To do so, you will survey a random sample of DSC majors and ask them whether they have a Netflix subscription. You will then create a 95% confidence interval for the proportion of “yes” answers in the population, based on the responses in your sample. You decide that your confidence interval should have a width of at most 0.10.
In order for your confidence interval to have a width of at most 0.10, the standard deviation of the distribution of the sample proportion must be at most T. What is T? Give your answer as an exact decimal.
Using the fact that the standard deviation of any dataset of 0s and 1s is no more than 0.5, calculate the minimum number of people you would need to survey so that the width of your confidence interval is at most 0.10. Give your answer as an integer.
Arya was curious how many UCSD students used Hulu over Thanksgiving break. He surveys 250 students and finds that 130 of them did use Hulu over break and 120 did not.
Using this data, Arya decides to test the following hypotheses:
Null Hypothesis: Over Thanksgiving break, an equal number of UCSD students did use Hulu and did not use Hulu.
Alternative Hypothesis: Over Thanksgiving break, more UCSD students did use Hulu than did not use Hulu.
Which of the following could be used as a test statistic for the hypothesis test?
The proportion of students who did use Hulu minus the proportion of students who did not use Hulu.
The absolute value of the proportion of students who did use Hulu minus the proportion of students who did not use Hulu.
The proportion of students who did use Hulu plus the proportion of students who did not use Hulu.
The absolute value of the proportion of students who did use Hulu plus the proportion of students who did not use Hulu.
For the test statistic that you chose in part (a), what is the observed value of the statistic? Give your answer either as an exact decimal or a simplified fraction.
If the p-value of the hypothesis test is 0.053, what can we conclude, at the standard 0.05 significance level?
We reject the null hypothesis.
We fail to reject the null hypothesis.
We accept the null hypothesis.
At the San Diego Model Railroad Museum, there are different admission prices for children, adults, and seniors. Over a period of time, as tickets are sold, employees keep track of how many of each type of ticket are sold. These ticket counts (in the order child, adult, senior) are stored as follows.
admissions_data = np.array([550, 1550, 400])
Complete the code below so that it creates an array admissions_proportions with the proportions of tickets sold to each group (in the order child, adult, senior).
def as_proportion(data):
    return __(a)__

admissions_proportions = as_proportion(admissions_data)
What goes in blank (a)?
The museum employees have a model in mind for the proportions in which they sell tickets to children, adults, and seniors. This model is stored as follows.
model = np.array([0.25, 0.6, 0.15])
We want to conduct a hypothesis test to determine whether the admissions data we have is consistent with this model. Which of the following is the null hypothesis for this test?
Child, adult, and senior tickets might plausibly be purchased in proportions 0.25, 0.6, and 0.15.
Child, adult, and senior tickets are purchased in proportions 0.25, 0.6, and 0.15.
Child, adult, and senior tickets might plausibly be purchased in proportions other than 0.25, 0.6, and 0.15.
Child, adult, and senior tickets are purchased in proportions other than 0.25, 0.6, and 0.15.
Which of the following test statistics could we use to test our hypotheses? Select all that could work.
sum of differences in proportions
sum of squared differences in proportions
mean of differences in proportions
mean of squared differences in proportions
none of the above
Below, we’ll perform the hypothesis test with a different test statistic, the mean of the absolute differences in proportions.
Recall that the ticket counts we observed for children, adults, and seniors are stored in the array admissions_data = np.array([550, 1550, 400]), and that our model is model = np.array([0.25, 0.6, 0.15]).
For our hypothesis test to determine whether the admissions data is consistent with our model, what is the observed value of the test statistic? Input your answer as a decimal between 0 and 1. Round to three decimal places. (Suppose that the value you calculated is assigned to the variable observed_stat, which you will use in later questions.)
Now, we want to simulate the test statistic 10,000 times under the
assumptions of the null hypothesis. Fill in the blanks below to complete
this simulation and calculate the p-value for our hypothesis test.
Assume that the variables admissions_data, admissions_proportions, model, and observed_stat are already defined as specified earlier in the question.
simulated_stats = np.array([])
for i in np.arange(10000):
    simulated_proportions = as_proportion(np.random.multinomial(__(a)__, __(b)__))
    simulated_stat = __(c)__
    simulated_stats = np.append(simulated_stats, simulated_stat)

p_value = __(d)__
What goes in blank (a)? What goes in blank (b)? What goes in blank (c)? What goes in blank (d)?
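The overall shape of a simulation like this can be seen in a separate, made-up scenario. The sketch below is not the answer to the blanks above: the null model, the observed counts, and the choice of total variation distance as the statistic are all hypothetical, and it uses NumPy's seeded Generator rather than np.random directly so that runs are reproducible.

```python
import numpy as np

rng = np.random.default_rng(10)  # seeded only so the sketch is reproducible

# Hypothetical null model and observed counts for three categories.
null_model = np.array([0.5, 0.3, 0.2])
observed_counts = np.array([230, 160, 110])
n = observed_counts.sum()

def tvd(dist_1, dist_2):
    """Total variation distance between two categorical distributions."""
    return np.abs(dist_1 - dist_2).sum() / 2

observed_tvd = tvd(observed_counts / n, null_model)

# Draw samples of the same size from the null model and recompute the statistic.
simulated = np.array([])
for i in np.arange(2000):
    simulated_counts = rng.multinomial(n, null_model)
    simulated = np.append(simulated, tvd(simulated_counts / n, null_model))

# Large distances favor the alternative, so count the simulated values that
# are at least as large as the observed one.
p_value = np.count_nonzero(simulated >= observed_tvd) / len(simulated)
```

The p-value here is the proportion of simulated statistics at least as extreme as the observed one, computed under the assumption that the null model is true.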
True or False: the p-value represents the probability that the null hypothesis is true.
True
False
The new statistic that we used for this hypothesis test, the mean of the absolute differences in proportions, is in fact closely related to the total variation distance. Given two arrays of length three, array_1 and array_2, suppose we compute the mean of the absolute differences in proportions between array_1 and array_2 and store the result as madp. What value would we have to multiply madp by to obtain the total variation distance between array_1 and array_2? Input your answer below, rounding to three decimal places.
For this question, let’s think of the data in app_data as a random sample of all IKEA purchases and use it to test the following hypotheses.
Null Hypothesis: IKEA sells an equal amount of beds (category 'bed') and outdoor furniture (category 'outdoor').
Alternative Hypothesis: IKEA sells more beds than outdoor furniture.
The DataFrame app_data contains 5000 rows, which form
our sample. Of these 5000 products,
Which of the following could be used as the test statistic for this hypothesis test? Select all that apply.
Among 2500 beds and outdoor furniture items, the absolute difference between the proportion of beds and the proportion of outdoor furniture.
Among 2500 beds and outdoor furniture items, the proportion of beds.
Among 2500 beds and outdoor furniture items, the number of beds.
Among 2500 beds and outdoor furniture items, the number of beds plus the number of outdoor furniture items.
Let’s do a hypothesis test with the following test statistic: among 2500 beds and outdoor furniture items, the proportion of outdoor furniture minus the proportion of beds.
Complete the code below to calculate the observed value of the test statistic and save the result as obs_diff.
outdoor = (app_data.get('category')=='outdoor')
bed = (app_data.get('category')=='bed')
obs_diff = ( ___(a)___ - ___(b)___ ) / ___(c)___
The table below contains several Python expressions. Choose the correct expression to fill in each of the three blanks. Three expressions will be used, and two will be unused.
Which of the following is a valid way to generate one value of the test statistic according to the null model? Select all that apply.
Way 1:
multi = np.random.multinomial(2500, [0.5,0.5])
(multi[0] - multi[1])/2500
Way 2:
outdoor = np.random.multinomial(2500, [0.5,0.5])[0]/2500
bed = np.random.multinomial(2500, [0.5,0.5])[1]/2500
outdoor - bed
Way 3:
choice = np.random.choice([0, 1], 2500, replace=True)
choice_sum = choice.sum()
(choice_sum - (2500 - choice_sum))/2500
Way 4:
choice = np.random.choice(['bed', 'outdoor'], 2500, replace=True)
bed = np.count_nonzero(choice=='bed')
outdoor = np.count_nonzero(choice=='outdoor')
outdoor/2500 - bed/2500
Way 5:
outdoor = (app_data.get('category')=='outdoor')
bed = (app_data.get('category')=='bed')
samp = app_data[outdoor|bed].sample(2500, replace=True)
samp[samp.get('category')=='outdoor'].shape[0]/2500 - samp[samp.get('category')=='bed'].shape[0]/2500
Way 6:
outdoor = (app_data.get('category')=='outdoor')
bed = (app_data.get('category')=='bed')
samp = (app_data[outdoor|bed].groupby('category').count().reset_index().sample(2500, replace=True))
samp[samp.get('category')=='outdoor'].shape[0]/2500 - samp[samp.get('category')=='bed'].shape[0]/2500
Way 1
Way 2
Way 3
Way 4
Way 5
Way 6
Suppose we generate 10,000 simulated values of the test statistic according to the null model and store them in an array called simulated_diffs. Complete the code below to calculate the p-value for the hypothesis test.
np.count_nonzero(simulated_diffs _________ obs_diff)/10000
What goes in the blank?
<
<=
>
>=