This quiz was administered in-person. It was closed-book and
closed-note; students were not allowed to use the DSC
10 Reference Sheet. Students had 20 minutes to work on
the quiz.
This quiz covered Lectures 20-23 of the Spring 2024 offering
of DSC 10.
The DataFrame bikes contains a sample of 500 bikes for
sale locally. Columns are:
"serial number" (int): An integer that uniquely identifies the bicycle.
"price" (int): The list price of the bicycle.
"style" (str): Values are "electric", "mountain", "road", "hybrid", and "recumbent".
You want to know if there is a significant difference in the sale
prices of "road" and "hybrid" bikes using a
permutation test. The hypotheses are:
Null: The prices of "road" and
"hybrid" bikes come from the same distribution.
Alt: On average, "hybrid" bikes are
more expensive than "road" bikes.
Using the bikes DataFrame and the difference in group
means (in the order "road" minus "hybrid") as
your test statistic, fill in the blanks so the code below generates
10,000 simulated statistics for the permutation test.
def find_diff(df):
    group_means = df.groupby("shuffled").mean().get("price")
    return group_means.loc["road"] - group_means.loc["hybrid"]

some_bikes = __(x)__
diffs = np.array([])
for i in np.arange(10000):
    shuffled_df = some_bikes.assign(shuffled = __(y)__)
    diffs = np.append(diffs, find_diff(shuffled_df))
Answer:
(x):
bikes[(bikes.get("style") == "road") | (bikes.get("style") == "hybrid")]
(y): np.random.permutation(some_bikes.get("style"))
The average score on this problem was 54%.
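The shuffling step can be tried end to end with a minimal sketch. It assumes made-up prices (the real bikes data is not available here) and uses plain NumPy arrays in place of the DataFrame; the name diff_in_means and the smaller repetition count are illustrative choices, not part of the quiz.

```python
import numpy as np

np.random.seed(0)  # fixed seed so the sketch is repeatable

# Made-up data for illustration only: 60 "road" and 40 "hybrid" prices.
styles = np.array(["road"] * 60 + ["hybrid"] * 40)
prices = np.concatenate([
    np.random.normal(900, 150, 60),   # hypothetical "road" prices
    np.random.normal(1000, 150, 40),  # hypothetical "hybrid" prices
])

def diff_in_means(prices, labels):
    """Mean "road" price minus mean "hybrid" price (hypothetical helper)."""
    return prices[labels == "road"].mean() - prices[labels == "hybrid"].mean()

observed_diff = diff_in_means(prices, styles)

n_reps = 2000  # fewer than the quiz's 10,000 to keep the sketch fast
diffs = np.array([])
for i in np.arange(n_reps):
    # Shuffle the style labels while keeping the prices fixed.
    shuffled = np.random.permutation(styles)
    diffs = np.append(diffs, diff_in_means(prices, shuffled))

# The alternative says "hybrid" is more expensive, so evidence for it is a
# small (very negative) road-minus-hybrid difference: count the left tail.
p_value = np.count_nonzero(diffs <= observed_diff) / len(diffs)
print(observed_diff, p_value)
```

Shuffling only the labels is what makes the two groups exchangeable under the null, so the simulated differences are centered near zero.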
Do large values of the observed statistic make us lean towards the null or alternative hypothesis?
Null Hypothesis
Alternative Hypothesis
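Answer: Null Hypothesis
The test statistic is the "road" mean minus the "hybrid" mean, and the alternative says "hybrid" bikes are more expensive, so the alternative predicts small (negative) values of the statistic. Large values do not support the alternative, so they make us lean towards the null.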
The average score on this problem was 63%.
Suppose the p-value for this test evaluates to 0.04. What can you conclude based on this? Select all that apply.
Reject the null hypothesis at a significance level of 0.01.
Fail to reject the null hypothesis at a significance level of 0.01.
Reject the null hypothesis at a significance level of 0.05.
Fail to reject the null hypothesis at a significance level of 0.05.
Answer: Fail to reject the null hypothesis at a significance level of 0.01, Reject the null hypothesis at a significance level of 0.05.
The average score on this problem was 48%.
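The reasoning behind this answer is a comparison of the p-value against each significance level. A small sketch, assuming the common convention of rejecting when the p-value is below the level (the function name decide is hypothetical):

```python
def decide(p_value, alpha):
    """Decision rule assumed here: reject the null when p_value < alpha."""
    return "reject" if p_value < alpha else "fail to reject"

p_value = 0.04
for alpha in [0.01, 0.05]:
    print(alpha, decide(p_value, alpha))
# prints:
# 0.01 fail to reject
# 0.05 reject
```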
You want to determine if the bikes for sale locally have the
following distribution of "style": "electric"
(15%), "mountain" (20%), "road" (40%),
"hybrid" (20%), and "recumbent" (5%). You want
to use the 500 rows of bikes to test the following
hypotheses:
Null: Bikes for sale locally are randomly drawn from the proposed distribution.
Alt: Bikes for sale locally are not randomly drawn from the proposed distribution.
Suppose that in bikes, the "style" column
is distributed as follows: "electric" (20%),
"mountain" (20%), "road" (30%),
"hybrid" (20%), and "recumbent" (10%).
Let’s do a hypothesis test with total variation distance (TVD) as the test statistic. Fill in the blanks below to complete a simulation that calculates 10,000 TVDs under the null hypothesis.
def total_variation_distance(distr1, distr2):
    return __(a)__

proposed = np.array([0.15, 0.20, 0.40, 0.20, 0.05])
observed = np.array([0.20, 0.20, 0.30, 0.20, 0.10])
tvds = np.array([])
for i in np.arange(10000):
    simulated = np.random.multinomial(__(b)__, __(c)__) / 500
    new_tvd = total_variation_distance(__(d)__, __(e)__)
    tvds = np.append(tvds, new_tvd)
Answer:
(a): np.abs((distr1 - distr2)).sum() / 2
(b): 500
(c): proposed
(d): simulated
(e): proposed
The average score on this problem was 78%.
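To see why (b) is the sample size and (c) is the proposed distribution, it may help to look at a single draw from np.random.multinomial. The sketch below is illustrative only; the seed and the variable name counts are assumptions.

```python
import numpy as np

np.random.seed(1)  # fixed seed so the sketch is repeatable
proposed = np.array([0.15, 0.20, 0.40, 0.20, 0.05])

# One simulated sample of 500 bikes under the null:
# an array of 5 counts, one per style, that add up to 500.
counts = np.random.multinomial(500, proposed)

# Dividing by the sample size turns counts into proportions,
# which is the form total_variation_distance expects.
simulated = counts / 500

print(counts, counts.sum())
print(simulated, simulated.sum())
```

The simulation draws from proposed, not observed, because the null hypothesis is the claim that the proposed distribution generated the data.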
Fill in the blanks below so that observed_tvd
corresponds to the observed value of the test statistic for this
hypothesis test.
observed_tvd = total_variation_distance(__(f)__, __(g)__)
Answer:
(f): observed
(g): proposed
The average score on this problem was 83%.
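With the two distributions given in the problem, the observed statistic can be worked out directly. A short sketch of the arithmetic, using the answer to (a) for the TVD:

```python
import numpy as np

proposed = np.array([0.15, 0.20, 0.40, 0.20, 0.05])
observed = np.array([0.20, 0.20, 0.30, 0.20, 0.10])

# Absolute differences per style: 0.05, 0, 0.10, 0, 0.05
abs_diffs = np.abs(observed - proposed)

# Half the sum of absolute differences: 0.20 / 2 = 0.10
observed_tvd = abs_diffs.sum() / 2
print(round(observed_tvd, 2))  # 0.1
```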
Which of the following correctly calculates the p-value for this hypothesis test?
p_value = np.count_nonzero(tvds >= observed_tvd) / len(observed)
p_value = np.count_nonzero(tvds >= observed_tvd) / len(tvds)
p_value = np.count_nonzero(tvds <= observed_tvd) / len(observed)
p_value = np.count_nonzero(tvds <= observed_tvd) / len(tvds)
Answer:
p_value = np.count_nonzero(tvds >= observed_tvd) / len(tvds)
The average score on this problem was 73%.
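The full test can be run as a sketch. To keep it short, this version swaps the quiz's for-loop for the size argument of np.random.multinomial, which draws all 10,000 samples at once; that substitution and the seed are assumptions for illustration, not the quiz's code.

```python
import numpy as np

np.random.seed(2)  # fixed seed so the sketch is repeatable

proposed = np.array([0.15, 0.20, 0.40, 0.20, 0.05])
observed = np.array([0.20, 0.20, 0.30, 0.20, 0.10])

def total_variation_distance(distr1, distr2):
    return np.abs(distr1 - distr2).sum() / 2

observed_tvd = total_variation_distance(observed, proposed)

# size=10000 returns a (10000, 5) array: one row of counts per simulated sample.
simulated = np.random.multinomial(500, proposed, size=10000) / 500

# Row-wise TVD between each simulated distribution and the proposed one.
tvds = np.abs(simulated - proposed).sum(axis=1) / 2

# Large TVDs are evidence against the null, so count the right tail
# and divide by the number of simulations.
p_value = np.count_nonzero(tvds >= observed_tvd) / len(tvds)

print(len(tvds), len(observed))  # 10000 5
print(p_value)
```

The printed lengths show why dividing by len(observed) is wrong: it is the number of styles (5), not the number of simulated statistics.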