This quiz was administered in person. It was closed-book and closed-note; students were not allowed to use the DSC 10 Reference Sheet. Students had 20 minutes to work on the quiz.
This quiz covered Lectures 20-23 of the Spring 2024 offering of DSC 10.
The DataFrame bikes contains a sample of 500 bikes for sale locally. Columns are:
"serial number" (int): An integer that uniquely identifies the bicycle.
"price" (int): The list price of the bicycle.
"style" (str): Values are "electric", "mountain", "road", "hybrid", and "recumbent".
You want to know if there is a significant difference in the sale prices of "road" and "hybrid" bikes using a permutation test. The hypotheses are:
Null: The prices of "road" and "hybrid" bikes come from the same distribution.
Alt: On average, "hybrid" bikes are more expensive than "road" bikes.
Using the bikes DataFrame and the difference in group means (in the order "road" minus "hybrid") as your test statistic, fill in the blanks so the code below generates 10,000 simulated statistics for the permutation test.
def find_diff(df):
    group_means = df.groupby("shuffled").mean().get("price")
    return group_means.loc["road"] - group_means.loc["hybrid"]

some_bikes = __(x)__
diffs = np.array([])
for i in np.arange(10000):
    shuffled_df = some_bikes.assign(shuffled = __(y)__)
    diffs = np.append(diffs, find_diff(shuffled_df))
Answer:
(x): bikes[(bikes.get("style") == "road") | (bikes.get("style") == "hybrid")]
(y): np.random.permutation(some_bikes.get("style"))
The average score on this problem was 54%.
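With both blanks filled in, the simulation can be run end to end. The sketch below is one way to try it, under stated assumptions: it uses pandas, builds a made-up stand-in for bikes (the prices and styles are random, not real data), selects the "price" column before averaging so that text columns are not averaged, and runs 1,000 repetitions (the hypothetical variable reps) instead of 10,000 to keep it quick.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in for the quiz's `bikes` DataFrame (random, made-up values).
styles = np.array(["electric", "mountain", "road", "hybrid", "recumbent"])
bikes = pd.DataFrame({
    "serial number": np.arange(500),
    "price": rng.integers(200, 3000, size=500),
    "style": rng.choice(styles, size=500),
})

def find_diff(df):
    # Mean price for each shuffled label, then "road" minus "hybrid".
    group_means = df.groupby("shuffled")["price"].mean()
    return group_means.loc["road"] - group_means.loc["hybrid"]

# (x): keep only the two styles being compared.
some_bikes = bikes[(bikes.get("style") == "road") | (bikes.get("style") == "hybrid")]

reps = 1000  # assumption: fewer than the quiz's 10,000 to keep the sketch fast
diffs = np.array([])
for i in np.arange(reps):
    # (y): shuffle the style labels while the prices stay in place.
    shuffled_df = some_bikes.assign(
        shuffled=np.random.permutation(some_bikes.get("style"))
    )
    diffs = np.append(diffs, find_diff(shuffled_df))
```

Shuffling the labels rather than the prices is a choice, not a requirement: either one breaks the link between style and price, which is what the null hypothesis describes.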
Do large values of the observed statistic make us lean towards the null or alternative hypothesis?
Null Hypothesis
Alternative Hypothesis
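Answer: Null Hypothesis
The statistic is the "road" mean minus the "hybrid" mean, and the alternative says "hybrid" bikes are more expensive, so the alternative predicts small (negative) values. Large values therefore point towards the null.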
The average score on this problem was 63%.
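The direction of the statistic also determines which tail the p-value is computed in. The quiz does not give this calculation, so the following is only a sketch: the simulated statistics and the observed value are made-up numbers chosen for illustration, and observed_diff is a hypothetical name.

```python
import numpy as np

# Made-up simulated statistics ("road" mean minus "hybrid" mean), for illustration only.
diffs = np.array([-120.0, -40.0, -5.0, 10.0, 35.0, 60.0, 90.0, 150.0])
observed_diff = -100.0  # hypothetical observed statistic

# The alternative predicts negative values, so count the simulated
# statistics that are at least as small as the observed one.
p_value = np.count_nonzero(diffs <= observed_diff) / len(diffs)
print(p_value)
```

Only one of the eight made-up values is at or below -100, so this prints 0.125.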
Suppose the p-value for this test evaluates to 0.04. What can you conclude based on this? Select all that apply.
Reject the null hypothesis at a significance level of 0.01.
Fail to reject the null hypothesis at a significance level of 0.01.
Reject the null hypothesis at a significance level of 0.05.
Fail to reject the null hypothesis at a significance level of 0.05.
Answer: Fail to reject the null hypothesis at a significance level of 0.01, Reject the null hypothesis at a significance level of 0.05. The p-value of 0.04 is above 0.01 but below 0.05.
The average score on this problem was 48%.
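The comparison behind this answer can be written out in a few lines. This is a minimal sketch that assumes the usual convention of rejecting the null when the p-value is below the significance level; the dictionary decisions is a hypothetical name used only here.

```python
p_value = 0.04

decisions = {}
for alpha in [0.01, 0.05]:
    # Reject only when the p-value falls below the chosen significance level.
    decisions[alpha] = "reject" if p_value < alpha else "fail to reject"
    print(f"significance level {alpha}: {decisions[alpha]} the null hypothesis")
```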
You want to determine if the bikes for sale locally have the following distribution of "style": "electric" (15%), "mountain" (20%), "road" (40%), "hybrid" (20%), and "recumbent" (5%). You want to use the 500 rows of bikes to test the following hypotheses:
Null: Bikes for sale locally are randomly drawn from the proposed distribution.
Alt: Bikes for sale locally are not randomly drawn from the proposed distribution.
Suppose that in bikes, the "style" column is distributed as follows: "electric" (20%), "mountain" (20%), "road" (30%), "hybrid" (20%), and "recumbent" (10%).
Let’s do a hypothesis test with total variation distance (TVD) as the test statistic. Fill in the blanks below to complete a simulation that calculates 10,000 TVDs under the null hypothesis.
def total_variation_distance(distr1, distr2):
    return __(a)__

proposed = np.array([0.15, 0.20, 0.40, 0.20, 0.05])
observed = np.array([0.20, 0.20, 0.30, 0.20, 0.10])
tvds = np.array([])
for i in np.arange(10000):
    simulated = np.random.multinomial(__(b)__, __(c)__) / 500
    new_tvd = total_variation_distance(__(d)__, __(e)__)
    tvds = np.append(tvds, new_tvd)
Answer:
(a): np.abs(distr1 - distr2).sum() / 2
(b): 500
(c): proposed
(d): simulated
(e): proposed
The average score on this problem was 78%.
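Putting the five answers back into the template gives a simulation that runs as written. The sketch below only needs NumPy; the comments are added here for explanation and are not part of the quiz.

```python
import numpy as np

def total_variation_distance(distr1, distr2):
    # (a): half the sum of the absolute differences between the two distributions.
    return np.abs(distr1 - distr2).sum() / 2

proposed = np.array([0.15, 0.20, 0.40, 0.20, 0.05])

tvds = np.array([])
for i in np.arange(10000):
    # (b), (c): draw 500 bikes from the proposed distribution,
    # then divide the counts by 500 to get proportions.
    simulated = np.random.multinomial(500, proposed) / 500
    # (d), (e): compare each simulated distribution to the proposed one.
    new_tvd = total_variation_distance(simulated, proposed)
    tvds = np.append(tvds, new_tvd)
```

The simulation draws from proposed rather than observed because the statistics must be generated under the null hypothesis.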
Fill in the blanks below so that observed_tvd corresponds to the observed value of the test statistic for this hypothesis test.
observed_tvd = total_variation_distance(__(f)__, __(g)__)
Answer:
(f): observed
(g): proposed
The average score on this problem was 83%.
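For the two distributions given in this problem, the observed statistic can be worked out directly. The sketch below spells out the arithmetic; abs_diffs is a hypothetical intermediate name introduced for illustration.

```python
import numpy as np

proposed = np.array([0.15, 0.20, 0.40, 0.20, 0.05])
observed = np.array([0.20, 0.20, 0.30, 0.20, 0.10])

# The absolute differences are approximately 0.05, 0, 0.10, 0, 0.05, which sum to 0.20.
abs_diffs = np.abs(observed - proposed)

# Half of that sum is approximately 0.10, up to floating-point rounding.
observed_tvd = abs_diffs.sum() / 2
```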
Which of the following correctly calculates the p-value for this hypothesis test?
p_value = np.count_nonzero(tvds >= observed_tvd) / len(observed)
p_value = np.count_nonzero(tvds >= observed_tvd) / len(tvds)
p_value = np.count_nonzero(tvds <= observed_tvd) / len(observed)
p_value = np.count_nonzero(tvds <= observed_tvd) / len(tvds)
Answer:
p_value = np.count_nonzero(tvds >= observed_tvd) / len(tvds)
The average score on this problem was 73%.
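The p-value is the proportion of simulated TVDs that are at least as large as the observed one, which is why the denominator is len(tvds) and the comparison is >=. As a rough check, the sketch below computes it with a vectorized simulation instead of the quiz's loop; the seeded generator rng and the size argument are choices made here, not part of the quiz.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded only so the sketch is repeatable

proposed = np.array([0.15, 0.20, 0.40, 0.20, 0.05])
observed = np.array([0.20, 0.20, 0.30, 0.20, 0.10])

# Each row is one simulated sample of 500 bikes, converted to proportions.
simulated = rng.multinomial(500, proposed, size=10000) / 500

# One TVD per row, computed against the proposed distribution.
tvds = np.abs(simulated - proposed).sum(axis=1) / 2
observed_tvd = np.abs(observed - proposed).sum() / 2

# Large TVDs favor the alternative, so count the simulated values in the right tail.
p_value = np.count_nonzero(tvds >= observed_tvd) / len(tvds)
print(p_value)
```

Dividing by len(observed) would divide by 5, the number of styles, rather than by the number of simulated statistics.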