Spring 2024 Quiz 5


This quiz was administered in-person. It was closed-book and closed-note; students were not allowed to use the DSC 10 Reference Sheet. Students had 20 minutes to work on the quiz.

This quiz covered Lectures 20-23 of the Spring 2024 offering of DSC 10.

The DataFrame bikes contains a sample of 500 bikes for sale locally. Columns are:

"serial number" (int): An integer that uniquely identifies the bicycle.
"price" (int): The list price of the bicycle.
"style" (str): Values are "electric", "mountain", "road", "hybrid", and "recumbent".

Problem 1

You want to use a permutation test to determine whether there is a significant difference in the sale prices of "road" and "hybrid" bikes. The hypotheses are:

Null: The prices of "road" and "hybrid" bikes come from the same distribution.
Alt: On average, "hybrid" bikes are more expensive than "road" bikes.

Problem 1.1

Using the bikes DataFrame and the difference in group means (in the order "road" minus "hybrid") as your test statistic, fill in the blanks so the code below generates 10,000 simulated statistics for the permutation test.

def find_diff(df):
    group_means = df.groupby("shuffled").mean().get("price")
    return group_means.loc["road"] - group_means.loc["hybrid"]

some_bikes = __(x)__
diffs = np.array([])
for i in np.arange(10000):
    shuffled_df = some_bikes.assign(shuffled = __(y)__)  
    diffs = np.append(diffs, find_diff(shuffled_df))

Answer:

(x): bikes[(bikes.get("style") == "road") | (bikes.get("style") == "hybrid")]
(y): np.random.permutation(some_bikes.get("style"))
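
With both blanks filled in, the simulation can be run end to end. The sketch below is one way to do that, under some assumptions: it uses pandas in place of babypandas, a made-up stand-in for bikes (the real data is not shown here), a fixed seed, and 1,000 repetitions instead of 10,000 to keep it fast. Plain pandas may refuse to average the string column "style", so find_diff selects "price" before calling .mean(); that is a small departure from the quiz code.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # assumption: fixed seed so reruns match

# Made-up stand-in for `bikes`; these prices and styles are not real data.
styles = ["electric", "mountain", "road", "hybrid", "recumbent"]
bikes = pd.DataFrame({
    "serial number": np.arange(500),
    "price": np.random.randint(200, 3000, size=500),
    "style": np.random.choice(styles, size=500),
})

def find_diff(df):
    # Mean price per shuffled label, then "road" minus "hybrid".
    group_means = df.groupby("shuffled")["price"].mean()
    return group_means.loc["road"] - group_means.loc["hybrid"]

# Blank (x): keep only the two styles being compared.
is_road = bikes.get("style") == "road"
is_hybrid = bikes.get("style") == "hybrid"
some_bikes = bikes[is_road | is_hybrid]

diffs = np.array([])
for i in np.arange(1000):  # the quiz uses 10,000
    # Blank (y): shuffle the style labels while the prices stay in place.
    shuffled_df = some_bikes.assign(
        shuffled=np.random.permutation(some_bikes.get("style"))
    )
    diffs = np.append(diffs, find_diff(shuffled_df))
```

Shuffling the "style" labels breaks any link between style and price, which is what the null hypothesis claims, so each entry of diffs is one draw of the statistic under the null.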

Difficulty: ⭐️⭐️⭐️

The average score on this problem was 54%.

Problem 1.2

Do large values of the observed statistic make us lean towards the null or alternative hypothesis?

Null Hypothesis
Alternative Hypothesis

Answer: Null Hypothesis
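
The reasoning: the statistic is "road" minus "hybrid", so if hybrids really are more expensive, the statistic tends to be negative. Large values therefore look more like the null, and the p-value is a left-tail proportion. A small sketch with hypothetical numbers (neither the group means nor the simulated statistics come from bikes):

```python
import numpy as np

# Hypothetical group means in which hybrids cost more, as the alternative claims.
road_mean = 900.0
hybrid_mean = 1100.0

# Statistic in the quiz's order: "road" minus "hybrid".
observed_diff = road_mean - hybrid_mean  # negative when hybrids cost more

# Hypothetical simulated statistics standing in for `diffs`.
diffs = np.array([-250.0, -120.0, -30.0, 0.0, 45.0, 80.0, 150.0, 210.0])

# Evidence for the alternative lies in small (very negative) values,
# so count the simulated statistics at or below the observed one.
p_value = np.count_nonzero(diffs <= observed_diff) / len(diffs)
```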

Difficulty: ⭐️⭐️⭐️

The average score on this problem was 63%.

Problem 1.3

Suppose the p-value for this test evaluates to 0.04. What can you conclude based on this? Select all that apply.

Reject the null hypothesis at a significance level of 0.01.
Fail to reject the null hypothesis at a significance level of 0.01.
Reject the null hypothesis at a significance level of 0.05.
Fail to reject the null hypothesis at a significance level of 0.05.

Answer: Fail to reject the null hypothesis at a significance level of 0.01, Reject the null hypothesis at a significance level of 0.05.
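
The rule being applied can be sketched as follows: reject the null when the p-value is at or below the significance level, and fail to reject otherwise. The helper name decide is made up for illustration, and treating a p-value exactly equal to the cutoff as a rejection is a convention assumed here.

```python
def decide(p_value, alpha):
    # Reject when the p-value is at or below the significance level.
    return "reject" if p_value <= alpha else "fail to reject"

p_value = 0.04
results = {alpha: decide(p_value, alpha) for alpha in [0.01, 0.05]}
print(results)
```

Since 0.04 is above 0.01 but below 0.05, the same p-value leads to different decisions at the two levels.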

Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 48%.

Problem 2

You want to determine if the bikes for sale locally have the following distribution of "style": "electric" (15%), "mountain" (20%), "road" (40%), "hybrid" (20%), and "recumbent" (5%). You want to use the 500 rows of bikes to test the following hypotheses:

Null: Bikes for sale locally are randomly drawn from the proposed distribution.
Alt: Bikes for sale locally are not randomly drawn from the proposed distribution.

Problem 2.1

Suppose that in bikes, the "style" column is distributed as follows: "electric" (20%), "mountain" (20%), "road" (30%), "hybrid" (20%), and "recumbent" (10%).

Let’s do a hypothesis test with total variation distance (TVD) as the test statistic. Fill in the blanks below to complete a simulation that calculates 10,000 TVDs under the null hypothesis.

def total_variation_distance(distr1, distr2):
    return __(a)__

proposed = np.array([0.15, 0.20, 0.40, 0.20, 0.05])
observed = np.array([0.20, 0.20, 0.30, 0.20, 0.10])
tvds = np.array([])
for i in np.arange(10000):
    simulated = np.random.multinomial(__(b)__, __(c)__) / 500
    new_tvd = total_variation_distance(__(d)__, __(e)__)
    tvds = np.append(tvds, new_tvd)

Answer:
(a): np.abs(distr1 - distr2).sum() / 2
(b): 500
(c): proposed
(d): simulated
(e): proposed
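
Filled in, the simulation looks like the sketch below. The only additions are the import and a fixed seed (an assumption, so that reruns give the same numbers).

```python
import numpy as np

np.random.seed(0)  # assumption: fixed seed for reproducibility

def total_variation_distance(distr1, distr2):
    # Half the sum of absolute differences between two distributions.
    return np.abs(distr1 - distr2).sum() / 2

proposed = np.array([0.15, 0.20, 0.40, 0.20, 0.05])
tvds = np.array([])
for i in np.arange(10000):
    # Draw 500 bikes from the proposed distribution, then convert counts to proportions.
    simulated = np.random.multinomial(500, proposed) / 500
    new_tvd = total_variation_distance(simulated, proposed)
    tvds = np.append(tvds, new_tvd)
```

Each simulated sample is compared with proposed, not observed, because the simulation describes how far a sample of 500 typically falls from the proposed distribution when the null is true.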

Difficulty: ⭐️⭐️

The average score on this problem was 78%.

Problem 2.2

Fill in the blanks below so that observed_tvd corresponds to the observed value of the test statistic for this hypothesis test.

observed_tvd = total_variation_distance(__(f)__, __(g)__)

Answer:
(f): observed
(g): proposed
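
As a check on the arithmetic, the observed and proposed distributions differ by 0.05, 0, 0.10, 0, and 0.05 in absolute value. These sum to 0.20, and half of that is 0.10. The same calculation in code:

```python
import numpy as np

def total_variation_distance(distr1, distr2):
    # Half the sum of absolute differences between two distributions.
    return np.abs(distr1 - distr2).sum() / 2

proposed = np.array([0.15, 0.20, 0.40, 0.20, 0.05])
observed = np.array([0.20, 0.20, 0.30, 0.20, 0.10])

observed_tvd = total_variation_distance(observed, proposed)
print(round(observed_tvd, 2))  # approximately 0.1, up to floating-point rounding
```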

Difficulty: ⭐️⭐️

The average score on this problem was 83%.

Problem 2.3

Which of the following correctly calculates the p-value for this hypothesis test?

p_value = np.count_nonzero(tvds >= observed_tvd) / len(observed)
p_value = np.count_nonzero(tvds >= observed_tvd) / len(tvds)
p_value = np.count_nonzero(tvds <= observed_tvd) / len(observed)
p_value = np.count_nonzero(tvds <= observed_tvd) / len(tvds)

Answer: p_value = np.count_nonzero(tvds >= observed_tvd) / len(tvds)
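
Larger TVDs mean a sample is farther from the proposed distribution, so the p-value is the proportion of simulated TVDs at least as large as the observed one, out of all simulations. In the sketch below, the eight simulated TVDs are hypothetical stand-ins for the 10,000 values in tvds, picked so the count is easy to follow.

```python
import numpy as np

# Hypothetical simulated TVDs; the real array would hold 10,000 values.
tvds = np.array([0.02, 0.05, 0.11, 0.03, 0.10, 0.04, 0.01, 0.06])
observed_tvd = 0.10

# Count simulations at least as extreme as the observed statistic,
# then divide by the number of simulations.
p_value = np.count_nonzero(tvds >= observed_tvd) / len(tvds)
print(p_value)
```

Dividing by len(observed) would instead divide by 5, the number of bike styles, which has nothing to do with how many simulations were run.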

Difficulty: ⭐️⭐️⭐️

The average score on this problem was 73%.
