This quiz was administered in-person. It was closed-book and
closed-note; students **were not** allowed to use the DSC
10 Reference Sheet. Students had **20 minutes** to work on
the quiz.

This quiz covered Lectures 20-23 of the Spring 2024 offering
of DSC 10.

You want to know if there is a significant difference in the sale
prices of `"road"` and `"hybrid"` bikes using a permutation test. The
hypotheses are:

**Null**: The prices of `"road"` and `"hybrid"` bikes come from the same distribution.

**Alt**: On average, `"hybrid"` bikes are more expensive than `"road"` bikes.

Using the `bikes` DataFrame and the difference in group means (in the
order `"road"` minus `"hybrid"`) as your test statistic, fill in the
blanks so the code below generates 10,000 simulated statistics for the
permutation test.

```
def find_diff(df):
    group_means = df.groupby("shuffled").mean().get("price")
    return group_means.loc["road"] - group_means.loc["hybrid"]

some_bikes = __(x)__
diffs = np.array([])
for i in np.arange(10000):
    shuffled_df = some_bikes.assign(shuffled = __(y)__)
    diffs = np.append(diffs, find_diff(shuffled_df))
```

**Answer**:

(x):
`bikes[(bikes.get("style") == "road") | (bikes.get("style") == "hybrid")]`

(y): `np.random.permutation(some_bikes.get("style"))`

The average score on this problem was 54%.
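To see the filled-in answer run end to end, here is a minimal sketch that uses plain pandas in place of the course's babypandas, on a made-up `bikes` DataFrame. The prices, group sizes, and the `rng` generator are assumptions for illustration, not the quiz data. It selects the price column before averaging so the non-numeric `"style"` column does not interfere in recent pandas versions, and it runs 1,000 repetitions rather than 10,000 to stay fast.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the quiz's `bikes` DataFrame (made-up values).
rng = np.random.default_rng(0)
bikes = pd.DataFrame({
    "style": ["road"] * 30 + ["hybrid"] * 30 + ["mountain"] * 20,
    "price": rng.normal(500, 100, size=80).round(2),
})

def find_diff(df):
    # Mean price for each shuffled label, then road minus hybrid.
    group_means = df.groupby("shuffled")["price"].mean()
    return group_means.loc["road"] - group_means.loc["hybrid"]

# (x): keep only the two styles being compared.
some_bikes = bikes[(bikes["style"] == "road") | (bikes["style"] == "hybrid")]

diffs = np.array([])
for i in np.arange(1000):
    # (y): shuffle the style labels while the prices stay in place.
    shuffled_labels = rng.permutation(some_bikes["style"].to_numpy())
    shuffled_df = some_bikes.assign(shuffled=shuffled_labels)
    diffs = np.append(diffs, find_diff(shuffled_df))
```

Shuffling the labels rather than the prices is an arbitrary choice; either one breaks the association between style and price, which is what the null hypothesis assumes.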

Do large values of the observed statistic make us lean towards the null or alternative hypothesis?

Null Hypothesis

Alternative Hypothesis

**Answer**: Null Hypothesis

The average score on this problem was 63%.
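The direction follows from the order of subtraction: if hybrid bikes cost more, `"road"` minus `"hybrid"` is negative, so small (very negative) statistics support the alternative and large ones do not. A p-value for this test would therefore count simulated statistics at or below the observed one. A small sketch with made-up numbers (both `diffs` and `observed_diff` are hypothetical):

```python
import numpy as np

# Hypothetical simulated statistics (road mean minus hybrid mean).
diffs = np.array([-30.0, -10.0, 0.0, 5.0, 20.0])
# Hypothetical observed statistic.
observed_diff = -10.0

# The alternative predicts negative values, so "as extreme or more extreme"
# means less than or equal to the observed statistic.
p_value = np.count_nonzero(diffs <= observed_diff) / len(diffs)
print(p_value)  # 0.4
```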

Suppose the p-value for this test evaluates to 0.04. What can you conclude based on this? Select all that apply.

Reject the null hypothesis at a significance level of 0.01.

Fail to reject the null hypothesis at a significance level of 0.01.

Reject the null hypothesis at a significance level of 0.05.

Fail to reject the null hypothesis at a significance level of 0.05.

**Answer**: Fail to reject the null hypothesis at a
significance level of 0.01, Reject the null hypothesis at a significance
level of 0.05.

The average score on this problem was 48%.
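The rule behind this answer is a comparison of the p-value with each cutoff. A minimal sketch, assuming the convention that the null is rejected when the p-value falls below the significance level (the `decide` helper is hypothetical):

```python
def decide(p_value, alpha):
    # Reject the null when the p-value is below the significance level.
    return "reject" if p_value < alpha else "fail to reject"

p_value = 0.04
for alpha in [0.01, 0.05]:
    print(alpha, decide(p_value, alpha))
# 0.01 fail to reject
# 0.05 reject
```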

You want to determine if the bikes for sale locally have the
following distribution of `"style"`: `"electric"` (15%), `"mountain"`
(20%), `"road"` (40%), `"hybrid"` (20%), and `"recumbent"` (5%). You want
to use the 500 rows of `bikes` to test the following hypotheses:

**Null**: Bikes for sale locally are randomly drawn from the proposed distribution.

**Alt**: Bikes for sale locally are not randomly drawn from the proposed distribution.

Suppose that in `bikes`, the `"style"` column is distributed as
follows: `"electric"` (20%), `"mountain"` (20%), `"road"` (30%),
`"hybrid"` (20%), and `"recumbent"` (10%).

Let’s do a hypothesis test with total variation distance (TVD) as the test statistic. Fill in the blanks below to complete a simulation that calculates 10,000 TVDs under the null hypothesis.

```
def total_variation_distance(distr1, distr2):
    return __(a)__

proposed = np.array([0.15, 0.20, 0.40, 0.20, 0.05])
observed = np.array([0.20, 0.20, 0.30, 0.20, 0.10])
tvds = np.array([])
for i in np.arange(10000):
    simulated = np.random.multinomial(__(b)__, __(c)__) / 500
    new_tvd = total_variation_distance(__(d)__, __(e)__)
    tvds = np.append(tvds, new_tvd)
```

**Answer**:

(a): `np.abs((distr1 - distr2)).sum() / 2`

(b): `500`

(c): `proposed`

(d): `simulated`

(e): `proposed`

The average score on this problem was 78%.
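With the blanks filled in, the simulation can be run as written. The sketch below adds only the NumPy import, a seed for reproducibility (an assumption, not part of the quiz), and comments.

```python
import numpy as np

np.random.seed(10)  # assumption: seeded only so that reruns match

def total_variation_distance(distr1, distr2):
    # Half the sum of absolute differences between two distributions.
    return np.abs(distr1 - distr2).sum() / 2

proposed = np.array([0.15, 0.20, 0.40, 0.20, 0.05])

tvds = np.array([])
for i in np.arange(10000):
    # Draw 500 bikes from the proposed distribution, then turn counts into proportions.
    simulated = np.random.multinomial(500, proposed) / 500
    new_tvd = total_variation_distance(simulated, proposed)
    tvds = np.append(tvds, new_tvd)
```

Each simulated sample is compared against `proposed`, not `observed`, because the simulation describes what TVDs look like when the null hypothesis is true.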

Fill in the blanks below so that `observed_tvd` corresponds to the
observed value of the test statistic for this hypothesis test.

`observed_tvd = total_variation_distance(__(f)__, __(g)__)`

**Answer**:

(f): `observed`

(g): `proposed`

The average score on this problem was 83%.
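Plugging in the two distributions from the problem gives the observed statistic directly: the absolute differences are 0.05, 0, 0.10, 0, and 0.05, which sum to 0.20, so the TVD is 0.10. The same arithmetic as a runnable check:

```python
import numpy as np

def total_variation_distance(distr1, distr2):
    # Half the sum of absolute differences between two distributions.
    return np.abs(distr1 - distr2).sum() / 2

proposed = np.array([0.15, 0.20, 0.40, 0.20, 0.05])
observed = np.array([0.20, 0.20, 0.30, 0.20, 0.10])

observed_tvd = total_variation_distance(observed, proposed)
print(round(observed_tvd, 2))  # 0.1
```

Because the TVD takes absolute differences, swapping the two arguments gives the same value.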

Which of the following correctly calculates the p-value for this hypothesis test?

`p_value = np.count_nonzero(tvds >= observed_tvd) / len(observed)`

`p_value = np.count_nonzero(tvds >= observed_tvd) / len(tvds)`

`p_value = np.count_nonzero(tvds <= observed_tvd) / len(observed)`

`p_value = np.count_nonzero(tvds <= observed_tvd) / len(tvds)`

**Answer**:
`p_value = np.count_nonzero(tvds >= observed_tvd) / len(tvds)`

The average score on this problem was 73%.
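Large TVDs mean a sample is far from the proposed distribution, so the p-value is the proportion of simulated TVDs at least as large as the observed one, and the denominator must be the number of simulations rather than the number of categories. A toy illustration with made-up `tvds` (hypothetical values, not simulation output):

```python
import numpy as np

observed = np.array([0.20, 0.20, 0.30, 0.20, 0.10])
observed_tvd = 0.10
# Hypothetical simulated TVDs; a real run would have 10,000 of them.
tvds = np.array([0.02, 0.03, 0.05, 0.04, 0.10, 0.01, 0.12, 0.03])

# Proportion of simulated TVDs at least as large as the observed TVD.
p_value = np.count_nonzero(tvds >= observed_tvd) / len(tvds)
print(p_value)  # 0.25

# len(observed) is the number of categories (5), so dividing by it
# would not give a proportion of simulations.
print(len(observed))  # 5
```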