Spring 2024 Quiz 5



This quiz was administered in person. It was closed-book and closed-note; students were not allowed to use the DSC 10 Reference Sheet. Students had 20 minutes to work on the quiz.

This quiz covered Lectures 20-23 of the Spring 2024 offering of DSC 10.


The DataFrame bikes contains a sample of 500 bikes for sale locally. Columns are:


Problem 1

You want to know if there is a significant difference in the sale prices of "road" and "hybrid" bikes using a permutation test. The hypotheses are:


Problem 1.1

Using the bikes DataFrame and the difference in group means (in the order "road" minus "hybrid") as your test statistic, fill in the blanks so the code below generates 10,000 simulated statistics for the permutation test.

def find_diff(df):
    group_means = df.groupby("shuffled").mean().get("price")
    return group_means.loc["road"] - group_means.loc["hybrid"]

some_bikes = __(x)__
diffs = np.array([])
for i in np.arange(10000):
    shuffled_df = some_bikes.assign(shuffled = __(y)__)  
    diffs = np.append(diffs, find_diff(shuffled_df))

Answer:

(x): bikes[(bikes.get("style") == "road") | (bikes.get("style") == "hybrid")]
(y): np.random.permutation(some_bikes.get("style"))
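To see the filled-in blanks run end to end, here is a minimal sketch. It assumes plain pandas rather than babypandas, and it uses a made-up stand-in for bikes (random styles and prices), since the real data is not shown here. It also selects the "price" column before averaging and runs only 1,000 repetitions to stay fast; neither change affects the logic of the test.

```python
import numpy as np
import pandas as pd

np.random.seed(10)

# Hypothetical stand-in for the bikes DataFrame; the values are synthetic.
bikes = pd.DataFrame({
    "style": np.random.choice(["road", "hybrid", "mountain"], size=500),
    "price": np.random.normal(800, 200, size=500).round(2),
})

def find_diff(df):
    # Mean price within each shuffled label, in the order "road" minus "hybrid".
    group_means = df.groupby("shuffled")["price"].mean()
    return group_means.loc["road"] - group_means.loc["hybrid"]

# (x): keep only the two styles being compared.
some_bikes = bikes[(bikes.get("style") == "road") | (bikes.get("style") == "hybrid")]

diffs = np.array([])
for i in np.arange(1000):  # the quiz uses 10,000 repetitions
    # (y): shuffle the style labels while the prices stay in place.
    shuffled_df = some_bikes.assign(
        shuffled=np.random.permutation(some_bikes.get("style"))
    )
    diffs = np.append(diffs, find_diff(shuffled_df))
```

Filtering first matters: if the other styles stayed in the DataFrame, their labels would be shuffled into the "road" and "hybrid" groups and the simulated differences would no longer reflect the two groups under comparison.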


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 54%.


Problem 1.2

Do large values of the observed statistic make us lean towards the null or alternative hypothesis?

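Answer: Alternative Hypothesis

In a permutation test, the null hypothesis says the two groups come from the same distribution, so the difference in group means should be close to 0. A large observed difference is evidence against that.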


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 63%.


Problem 1.3

Suppose the p-value for this test evaluates to 0.04. What can you conclude based on this? Select all that apply.

Answer: Fail to reject the null hypothesis at a significance level of 0.01, Reject the null hypothesis at a significance level of 0.05.
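The comparison behind this answer can be written out as a short sketch. The helper name decide is hypothetical, and the rule "reject when the p-value is below the significance level" is the usual convention; how a p-value exactly equal to the level is handled varies by source.

```python
def decide(p_value, alpha):
    # Reject the null when the p-value falls below the significance level.
    return "reject" if p_value < alpha else "fail to reject"

p_value = 0.04
for alpha in [0.01, 0.05]:
    print(f"significance level {alpha}: {decide(p_value, alpha)} the null hypothesis")
```

Since 0.04 is above 0.01 but below 0.05, the same p-value leads to different conclusions at the two levels.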


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 48%.



Problem 2

You want to determine if the bikes for sale locally have the following distribution of "style": "electric" (15%), "mountain" (20%), "road" (40%), "hybrid" (20%), and "recumbent" (5%). You want to use the 500 rows of bikes to test the following hypotheses:


Problem 2.1

Suppose that in bikes, the "style" column is distributed as follows: "electric" (20%), "mountain" (20%), "road" (30%), "hybrid" (20%), and "recumbent" (10%).

Let’s do a hypothesis test with total variation distance (TVD) as the test statistic. Fill in the blanks below to complete a simulation that calculates 10,000 TVDs under the null hypothesis.

def total_variation_distance(distr1, distr2):
    return __(a)__

proposed = np.array([0.15, 0.20, 0.40, 0.20, 0.05])
observed = np.array([0.20, 0.20, 0.30, 0.20, 0.10])
tvds = np.array([])
for i in np.arange(10000):
    simulated = np.random.multinomial(__(b)__, __(c)__) / 500
    new_tvd = total_variation_distance(__(d)__, __(e)__)
    tvds = np.append(tvds, new_tvd)

Answer:
(a): np.abs(distr1 - distr2).sum() / 2
(b): 500
(c): proposed
(d): simulated
(e): proposed
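With the blanks filled in, the simulation can be run as a self-contained sketch. The only addition not in the quiz is the call to np.random.seed, included here so repeated runs give the same results.

```python
import numpy as np

np.random.seed(10)  # assumption: added for reproducibility, not part of the quiz

def total_variation_distance(distr1, distr2):
    # Half the sum of absolute differences between two categorical distributions.
    return np.abs(distr1 - distr2).sum() / 2

proposed = np.array([0.15, 0.20, 0.40, 0.20, 0.05])
tvds = np.array([])
for i in np.arange(10000):
    # Draw 500 bikes from the proposed distribution, then convert counts to proportions.
    simulated = np.random.multinomial(500, proposed) / 500
    new_tvd = total_variation_distance(simulated, proposed)
    tvds = np.append(tvds, new_tvd)
```

Each simulated sample is compared to proposed, not to observed, because the simulation describes what TVDs look like when the null hypothesis is true.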


Difficulty: ⭐️⭐️

The average score on this problem was 78%.


Problem 2.2

Fill in the blanks below so that observed_tvd corresponds to the observed value of the test statistic for this hypothesis test.

observed_tvd = total_variation_distance(__(f)__, __(g)__)

Answer:
(f): observed
(g): proposed
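As a check on the arithmetic, the observed TVD can be computed directly from the two distributions given in the problem. The absolute differences are 0.05, 0, 0.10, 0, and 0.05, which sum to 0.20, so the TVD is about 0.10.

```python
import numpy as np

def total_variation_distance(distr1, distr2):
    return np.abs(distr1 - distr2).sum() / 2

proposed = np.array([0.15, 0.20, 0.40, 0.20, 0.05])
observed = np.array([0.20, 0.20, 0.30, 0.20, 0.10])

observed_tvd = total_variation_distance(observed, proposed)
print(round(observed_tvd, 2))
```

Because TVD uses absolute differences, swapping the order of the two arguments gives the same value.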


Difficulty: ⭐️⭐️

The average score on this problem was 83%.


Problem 2.3

Which of the following correctly calculates the p-value for this hypothesis test?

Answer: p_value = np.count_nonzero(tvds >= observed_tvd) / len(tvds)
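To illustrate how the counting works, here is a toy sketch. The five values in tvds are made up for illustration and stand in for the 10,000 simulated TVDs; observed_tvd is set to 0.10 as an assumed example value.

```python
import numpy as np

# Hypothetical simulated TVDs; a real run would have 10,000 of these.
tvds = np.array([0.02, 0.05, 0.10, 0.03, 0.12])
observed_tvd = 0.10

# Proportion of simulated TVDs at least as large as the observed one.
p_value = np.count_nonzero(tvds >= observed_tvd) / len(tvds)
print(p_value)
```

The comparison uses >= because larger TVDs indicate a bigger departure from the proposed distribution, so only the right tail counts as "at least as extreme".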


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 73%.


