Lecture 20 — Practice



Lecture 20 — Collected Practice Questions

Below are practice problems tagged for Lecture 20 (rendered directly from the original exam/quiz sources).


Problem 1

The table below shows the average amount of revenue from different sources for art museums in 2003 and 2013.


What is the total variation distance between the distribution of revenue sources in 2003 and the distribution of revenue sources in 2013? Give your answer as a proportion (i.e. a decimal between 0 and 1), not a percentage. Round your answer to three decimal places.

Answer: 0.19

Recall, the total variation distance (TVD) is the sum of the absolute differences in proportions, divided by 2. The absolute differences in proportions for each source are as follows:

  • Admissions: |0.15 - 0.24| = 0.09
  • Restaurants and Catering: |0.09 - 0.12| = 0.03
  • Store: |0.52 - 0.33| = 0.19
  • Other: |0.24 - 0.31| = 0.07

Then, we have

\text{TVD} = \frac{1}{2} (0.09 + 0.03 + 0.19 + 0.07) = 0.19
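As a quick check, here is the same calculation in code. (Which column of the table corresponds to which year is assumed below; since the TVD is symmetric, the result is the same either way.)

    import numpy as np

    # Proportions in the order: Admissions, Restaurants and Catering, Store, Other.
    dist_2003 = np.array([0.15, 0.09, 0.52, 0.24])
    dist_2013 = np.array([0.24, 0.12, 0.33, 0.31])

    tvd = np.abs(dist_2003 - dist_2013).sum() / 2    # 0.19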


Difficulty: ⭐️

The average score on this problem was 95%.


Problem 2

The table below shows the average amount of revenue from different sources for art museums in 2003 and 2013.


Problem 2.1

What is the total variation distance between the distribution of revenue sources in 2003 and the distribution of revenue sources in 2013? Give your answer as a proportion (i.e. a decimal between 0 and 1), not a percentage. Round your answer to three decimal places.

Answer: 0.19

Recall, the total variation distance (TVD) is the sum of the absolute differences in proportions, divided by 2. The absolute differences in proportions for each source are as follows:

  • Admissions: |0.15 - 0.24| = 0.09
  • Restaurants and Catering: |0.09 - 0.12| = 0.03
  • Store: |0.52 - 0.33| = 0.19
  • Other: |0.24 - 0.31| = 0.07

Then, we have

\text{TVD} = \frac{1}{2} (0.09 + 0.03 + 0.19 + 0.07) = 0.19


Difficulty: ⭐️

The average score on this problem was 95%.



Problem 2.2

Which type of visualization would be best suited for comparing the two distributions in the table?

Answer: Overlaid bar chart

A scatter plot visualizes the relationship between two numerical variables. In this problem, we only have to visualize the distribution of a categorical variable.

A line plot shows trends in numerical variables over time. In this problem, we only have categorical variables. Moreover, a line plot is best suited for showing change across many points in time (e.g. every year from 2001 through 2013, or even daily data), whereas here we only want to compare the distributions for 2003 and 2013, so a line plot is not useful. In addition, if you try to draw a line plot for this question, you will find that it fails to convey the idea of a distribution (e.g. that the proportions 15%, 9%, 52%, and 24% add up to 100%).

An overlaid graph is useful here since it allows a direct comparison between the two distributions.

However, an overlaid histogram is not useful in this problem. The key reason lies in the differences between a histogram and a bar chart:

  • Bar chart: space between the bars; one categorical axis and one numerical axis; the order of the bars does not matter.
  • Histogram: no space between the bars; intervals on the axis; two numerical axes; order matters.

In the question, we are plotting 2003 and 2013 distributions of four categories (Admissions, Restaurants and Catering, Store, and Other). Thus, an overlaid bar chart is more appropriate.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 74%.


Problem 2.3

Note: This problem is out of scope; it covers material no longer included in the course.

Notably, there was an economic recession in 2008-2009. Which of the following can we conclude was an effect of the recession?

Answer: None of the above

Since we are only given the distribution of the revenue, and have no information about the amount of revenue in 2003 and 2013, we cannot conclude how the revenue has changed from 2003 to 2013 after the recession.

For instance, suppose the total revenue in 2003 was 100 billion USD and the total revenue in 2013 was 50 billion USD. Then revenue from admissions in 2003 was 100 * 15% = 15 billion USD, while revenue from admissions in 2013 was 50 * 24% = 12 billion USD. Since 15 > 12, revenue from admissions would have declined rather than increased, contradicting ‘The increase in revenue from admissions, as more people were visiting museums.’ Similarly, since we don’t know the total revenue in 2003 and 2013, we cannot conclude ‘The decline in revenue from museum stores, as people had less money to spend.’ or ‘The decline in total revenue, as fewer people were visiting museums.’


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 72%.



Source: fa22-final — Q2

Problem 3

For each application in apps, we want to assign an age category based on the value in the "age" column, according to the table below.

"age" age category
under 25 "young adult"
at least 25, but less than 50 "middle aged"
at least 50, but less than 75 "older adult"
75 or over "elderly"
cat_names = ["young adult", "middle aged", "older adult", "elderly"]

def age_to_bin(one_age):
    '''Returns the age category corresponding to one_age.'''
    one_age = __(a)__
    bin_pos = __(b)__
    return cat_names[bin_pos]

binned_ages = __(c)__
apps_cat = apps.assign(age_category = binned_ages)


Problem 3.1

Which of the following is a correct way to fill in blanks (a) and (b)?

         Blank (a)           Blank (b)
Option 1 75 - one_age        round(one_age / 25)
Option 2 min(75, one_age)    one_age / 25
Option 3 75 - one_age        int(one_age / 25)
Option 4 min(75, one_age)    int(one_age / 25)
Option 5 min(74, one_age)    round(one_age / 25)

Answer: Option 4

The line one_age = min(75, one_age) either leaves one_age alone or sets it equal to 75 if the age was higher than 75, which means anyone over age 75 is considered to be 75 years old for the purposes of classifying them into age categories. From the return statement, we know we need our value for bin_pos to be either 0, 1, 2, or 3 since cat_names has a length of 4. When we divide one_age by 25, we get a decimal number that represents how many times 25 fits into one_age. We want to round this number down to get the number of whole copies of 25 that fit into one_age. If that value is 0, it means the person is a "young adult", if that value is 1, it means they are "middle aged", and so on. The rounding-down behavior that we want is accomplished by int(one_age / 25).
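Putting Option 4 into the given skeleton, the completed function looks like this:

    cat_names = ["young adult", "middle aged", "older adult", "elderly"]

    def age_to_bin(one_age):
        '''Returns the age category corresponding to one_age.'''
        one_age = min(75, one_age)     # treat any age above 75 as 75
        bin_pos = int(one_age / 25)    # 0, 1, 2, or 3
        return cat_names[bin_pos]

For example, age_to_bin(80) caps the age at 75, computes int(75 / 25) = 3, and returns "elderly", while age_to_bin(42) computes int(42 / 25) = 1 and returns "middle aged".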


Difficulty: ⭐️⭐️

The average score on this problem was 76%.



Problem 3.2

Which of the following is a correct way to fill in blank (c)?

Answer: apps.get("age").apply(age_to_bin)

We want our result to be a Series because the next line in the code assigns it to a column of a DataFrame. We also need to use the .apply() method to apply our function to the entirety of the "age" column. The .apply() method takes in only the name of a function, without parentheses or arguments, because it calls that function on each entry of the column directly.
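With blank (c) filled in, the last two lines of the original code become:

    binned_ages = apps.get("age").apply(age_to_bin)
    apps_cat = apps.assign(age_category = binned_ages)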


Difficulty: ⭐️

The average score on this problem was 96%.



Problem 3.3

Which of the following is a correct alternate implementation of the age_to_bin function? Select all that apply.

Option 1:

def age_to_bin(one_age):
    bin_pos = 3
    if one_age < 25:
        bin_pos = 0
    if one_age < 50:
        bin_pos = 1
    if one_age < 75:
        bin_pos = 2
    return cat_names[bin_pos]

Option 2:

def age_to_bin(one_age):
    bin_pos = 3
    if one_age < 75:
        bin_pos = 2
    if one_age < 50:
        bin_pos = 1
    if one_age < 25:
        bin_pos = 0
    return cat_names[bin_pos]

Option 3:

def age_to_bin(one_age):
    bin_pos = 0
    for cutoff in np.arange(25, 100, 25):
        if one_age >= cutoff:
            bin_pos = bin_pos + 1 
    return cat_names[bin_pos]

Option 4:

def age_to_bin(one_age):
    bin_pos = -1
    for cutoff in np.arange(0, 100, 25):
        if one_age >= cutoff:
            bin_pos = bin_pos + 1 
        return cat_names[bin_pos]

Answer: Option 2 and Option 3

Option 1 doesn’t work for inputs less than 25. For example, on an input of 10, every condition is satisfied, so bin_pos will be set to 0, then 1, then 2, meaning the function will return "older adult" instead of "young adult".

Option 2 reverses the order of the conditions, which ensures that even when a number satisfies many conditions, the last one it satisfies determines the correct bin_pos. For example, 27 would satisfy the first 2 conditions but not the last one, and the function would return "middle aged" as expected.

In option 3, np.arange(25, 100, 25) produces np.array([25, 50, 75]). The if condition checks whether the age is at least 25, then 50, then 75. Each time it is, bin_pos increases by 1; otherwise bin_pos is left unchanged. At the end, bin_pos represents the number of these cutoffs that the age is greater than or equal to, which correctly determines the age category.

Option 4 is equivalent to option 3 except for two things. First, bin_pos starts at -1, but since 0 is included in the set of cutoff values, the first time through the loop will set bin_pos to 0, as in Option 3. This change doesn’t affect the behavior of the function. The other change, however, is that the return statement is inside the for-loop, which does change the behavior of the function dramatically. Now the for-loop will only run once, checking whether the age is at least 0 and then returning immediately. Since ages are always at least 0, this function will return "young adult" on every input, which is clearly incorrect.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 62%.


Problem 3.4

We want to determine the number of "middle aged" applicants whose applications were denied. Fill in the blank below so that count evaluates to that number.

df = apps_cat.________.reset_index()
count = df[(df.get("age_category") == "middle aged") & 
           (df.get("status") == "denied")].get("income").iloc[0]

What goes in the blank?

Answer: groupby(["age_category", "status"]).count()

We can tell by the line in which count is defined that df needs to have columns called "age_category", "status", and "income", with one row such that the values in these columns are "middle aged", "denied", and the number of such applicants, respectively. Since there is one row corresponding to each possible combination of values for "age_category" and "status", this suggests we need to group by the pair of columns, since .groupby produces a DataFrame with one row for each possible combination of values in the columns we are grouping by. Since we want to know how many individuals have this combination of values for "age_category" and "status", we should use .count() as the aggregation method. Another clue to use .groupby is the presence of .reset_index(), which is needed to query based on columns called "age_category" and "status".
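Filling in the blank, the completed code reads:

    df = apps_cat.groupby(["age_category", "status"]).count().reset_index()
    count = df[(df.get("age_category") == "middle aged") & 
               (df.get("status") == "denied")].get("income").iloc[0]

After the .groupby and .reset_index(), each row of df corresponds to one combination of age category and status, and every remaining column (including "income") holds the count of applications with that combination, which is why .get("income").iloc[0] extracts the desired number.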


Difficulty: ⭐️⭐️

The average score on this problem was 78%.


Problem 3.5

The total variation distance between the distributions of "age category" for approved applications and denied applications is 0.4.

One of the visualizations below shows the distributions of "age category" for approved applications and denied applications. Which visualization is it?


Answer: Option 2

TVD represents the total overrepresentation of one distribution relative to the other, summed across all categories. To find the TVD visually, we can estimate how much each bar for approved applications extends beyond the corresponding bar for denied applications in each bar chart.

In Option 1, the approved bar extends beyond the denied bar only in the "young adult" category, and by 0.2, so the TVD for Option 1 is 0.2. In Option 2, the approved bar extends beyond the denied bar only in the "older adult" category, and by 0.4, so the TVD for Option 2 is 0.4. In Option 3, the approved bar extends beyond the denied bar in "elderly" by 0.2 and in "young adult" by 0.4, for a TVD of 0.6. In Option 4, the approved bar extends beyond the denied bar only in "young adult", by 0.2, for a TVD of 0.2.

Note that even without knowing the exact lengths of the bars in Option 2, we can still conclude that Option 2 is correct by process of elimination, since it’s the only one whose TVD appears close to 0.4.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 60%.



Source: fa23-final — Q6

Problem 4

Aaron wants to explore the discrepancy in fraud rates between "discover" transactions and "mastercard" transactions. To do so, he creates the DataFrame ds_mc, which only contains the rows in txn corresponding to "mastercard" or "discover" transactions.

After he creates ds_mc, Aaron groups ds_mc on the "card" column using two different aggregation methods. The relevant columns in the resulting DataFrames are shown below.


Aaron decides to perform a test of the following pair of hypotheses:

As his test statistic, Aaron chooses the difference in proportion of transactions that are fraudulent, in the order "mastercard" minus "discover".


Problem 4.1

What type of statistical test is Aaron performing?

Answer: Permutation test

Permutation tests are used to ascertain whether two samples were drawn from the same population. Hypothesis testing is used when we have a single sample and a known population, and want to determine whether the sample appears to have been drawn from that population. Here, we have two samples (“mastercard” and “discover”) and no known population distribution, so a permutation test is the appropriate test.


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 49%.


Problem 4.2

What is the value of the observed statistic? Give your answer either as an exact decimal or simplified fraction.

Answer: 0.02

We simply take the difference between the proportion of "mastercard" transactions that are fraudulent and the proportion of "discover" transactions that are fraudulent. There are 4,000 fraudulent "mastercard" transactions and 40,000 total "mastercard" transactions, making this proportion \frac{4000}{40000} = \frac{1}{10} for "mastercard". Similarly, the proportion of fraudulent "discover" transactions is \frac{160}{2000}. Simplifying these fractions, the difference between them is \frac{1}{10} - \frac{8}{100} = 0.1 - 0.08 = 0.02.


Difficulty: ⭐️⭐️

The average score on this problem was 86%.


The empirical distribution of Aaron’s chosen test statistic is shown below.


Problem 4.3

Which of the following is closest to the p-value of Aaron’s test?

Answer: 0.999

Informally, the p-value is the area of the histogram at or past the observed statistic, further in the direction of the alternative hypothesis. In this case, the alternative hypothesis is that the "mastercard" proportion is less than the discover proportion, and our test statistic is computed in the order "mastercard" minus "discover", so low (negative) values correspond to the alternative. This means when calculating the p-value, we look at the area to the left of 0.02 (the observed value). We see that essentially all of the test statistics fall to the left of this value, so the p-value should be closest to 0.999.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 54%.


Problem 4.4

What is the conclusion of Aaron’s test?

Answer: None of the above

  • Option A: Since the p-value was so high, it’s unlikely that the proportion of fraudulent "mastercard" transactions is less than the proportion of fraudulent "discover" transactions, so we cannot conclude A.
  • Option B: The test does not allow us to conclude this, because it was not one of the hypotheses. All we can say is that we don’t think the alternative hypothesis is true - we can’t say whether any other statement is true.
  • Option C: The test did give us valuable information about the difference in fraud rates: we failed to reject the null hypothesis. So, the test is conclusive, making option C incorrect. Therefore, option D (none of the above) is correct.

Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 44%.


Aaron now decides to test a slightly different pair of hypotheses.

He uses the same test statistic as before.


Problem 4.5

Which of the following is closest to the p-value of Aaron’s new test?

Answer: 0.001

Now, we have switched the alternative hypothesis to “ "mastercard" fraud rate is greater than "discover" fraud rate”, whereas before our alternative hypothesis was that the "mastercard" fraud rate was less than "discover"’s fraud rate. We have not changed the way we calculate the test statistic ("mastercard" minus "discover"), so now large values of the test statistic correspond to the alternative hypothesis. So, the area of interest is the area to the right of 0.02, which is very small, close to 0.001. Note that this is one minus the p-value we computed before.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 65%.



Source: fa23-final — Q7

Problem 5

Jason is interested in exploring the relationship between the browser and payment method used for a transaction. To do so, he uses txn to create three tables, each of which contains the distribution of browsers used for credit card transactions and the distribution of browsers used for debit card transactions, but with different combinations of browsers combined into a single category in each table.


Jason calculates the total variation distance (TVD) between the two distributions in each of his three tables, but he does not record which TVD goes with which table. He computed TVDs of 0.14, 0.36, and 0.38.


Problem 5.1

In which table do the two distributions have a TVD of 0.14?

Answer: Table 3

Without values in any of the tables, there’s no way to do this problem computationally. We are told that the three TVDs come out to 0.14, 0.36, and 0.38. The exact numbers are not important but their relative order is. The key to this problem is noticing that when we combine two categories into one, the TVD can only decrease; it cannot increase. One way to see this is to think about combining categories repeatedly until there’s just one category. Then both distributions must have a value of 1 in that category, so they are identical distributions with the smallest possible TVD of 0. As we collapse categories, we can only decrease the TVD. This tells us that Table 1 has the largest TVD, then Table 2 has the middle TVD, and Table 3 has the smallest, since each time we are combining categories and shrinking the TVD.
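To see the effect of combining categories concretely, here is a small sketch with made-up browser distributions (the numbers are hypothetical, not taken from txn):

    import numpy as np

    # Hypothetical distributions of browsers for credit and debit card transactions.
    credit = np.array([0.5, 0.1, 0.4])
    debit  = np.array([0.3, 0.4, 0.3])
    tvd_original = np.abs(credit - debit).sum() / 2    # 0.3

    # Combine the last two browser categories into a single category.
    credit_combined = np.array([0.5, 0.1 + 0.4])
    debit_combined  = np.array([0.3, 0.4 + 0.3])
    tvd_combined = np.abs(credit_combined - debit_combined).sum() / 2    # 0.2

Combining categories can never increase the TVD; here it drops from 0.3 to 0.2 because the two combined categories’ differences have opposite signs and partially cancel.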


Difficulty: ⭐️⭐️

The average score on this problem was 77%.


Problem 5.2

In which table do the two distributions have a TVD of 0.36?

Answer: Table 2

See the solution to Problem 5.1.


Difficulty: ⭐️

The average score on this problem was 97%.


Problem 5.3

In which table do the two distributions have a TVD of 0.38?

Answer: Table 1

See the solution to Problem 5.1.


Difficulty: ⭐️⭐️

The average score on this problem was 77%.



Source: fa24-final — Q7

Problem 6

Hargen is an employee at Bill’s Book Bonanza who tends to work weekend shifts. He thinks that Fridays, Saturdays, and Sundays are busier than other days, and he proposes the following probability distribution of sales by day:

Sunday Monday Tuesday Wednesday Thursday Friday Saturday
0.2 0.1 0.1 0.1 0.1 0.2 0.2

Let’s use the data in sales to determine whether Hargen’s proposed model could be correct by doing a hypothesis test. The hypotheses are:


Problem 6.1

Which of the following test statistics could be used to test the given hypothesis? Select all that apply.

Answer:

  • The sum of the squared differences in proportions between the distribution of books sold by day and Hargen’s proposed distribution.
  • One half of the sum of the absolute differences in proportions between the distribution of books sold by day and Hargen’s proposed distribution.

Let’s look at each of the options:

  • Option 1: Incorrect. It is only comparing one day instead of the entire distribution. This is insufficient to tell if two distributions are similar. For example, they could have similar Saturday proportions, but very different proportions for all other days of the week.
  • Option 2: Incorrect. Positive and negative differences will cancel out. Since both distributions sum to 1, their differences will sum to 0. This means this statistic tells us nothing about how similar the two distributions are, since the result will be 0 every time (see the sketch after this list).
  • Option 3: Correct. Squaring the difference avoids the issue of positive and negative differences canceling out.
  • Option 4: Correct. Absolute difference also avoids the issue of positive and negative differences canceling out. Furthermore, this is the definition of Total Variation Distance (TVD), which is a common statistic for comparing categorical distributions.
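The sketch below, using Hargen’s proposed distribution and the observed distribution from Problem 6.2, illustrates why Option 2 fails while Options 3 and 4 work:

    import numpy as np

    # Hargen's proposed distribution and the observed distribution from Problem 6.2
    # (Sunday through Saturday).
    proposed = np.array([0.2, 0.1, 0.1, 0.1, 0.1, 0.2, 0.2])
    observed = np.array([0.34, 0.13, 0.06, 0.07, 0.08, 0.08, 0.24])

    signed_sum = (observed - proposed).sum()            # 0 (up to rounding): differences cancel out
    squared_sum = ((observed - proposed) ** 2).sum()    # positive whenever the distributions differ
    tvd = np.abs(observed - proposed).sum() / 2         # total variation distance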

Difficulty: ⭐️⭐️⭐️

The average score on this problem was 74%.


We will use as our test statistic the mean of the absolute differences in proportions between the distribution of books sold by day and Hargen’s proposed distribution.


Problem 6.2

Suppose the observed distribution of books sold by day was as follows. Calculate the observed statistic in this case.

Sunday Monday Tuesday Wednesday Thursday Friday Saturday
0.34 0.13 0.06 0.07 0.08 0.08 0.24

Answer: 0.06

\begin{align*} \text{mean abs diff} &= \frac{|0.34 - 0.2| + |0.13 - 0.1| + |0.06 - 0.1| + |0.07 - 0.1| + |0.08 - 0.1| + |0.08 - 0.2| + |0.24 - 0.2|}{7}\\ &= \frac{0.14 + 0.03 + 0.04 + 0.03 + 0.02 + 0.12 + 0.04}{7}\\ &= \frac{0.42}{7} \\ &= 0.06 \end{align*}


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 69%.


Problem 6.3

Let’s determine the actual value of the observed statistic based on the data in sales. Assume that we have already defined a function called find_day that returns the day of the week for a given "date". For example, find_day("Saturday, December 7, 2024") evaluates to "Saturday". Fill in the blanks below so that the variable obs evaluates to the observed statistic.

# in alphabetical order: Fri, Mon, Sat, Sun, Thurs, Tues, Wed
hargen = np.array([0.2, 0.1, 0.2, 0.2, 0.1, 0.1, 0.1]) 
prop = (sales.assign(day_of_week = __(a)__)
             .groupby(__(b)__).__(c)__.get("ISBN") / sales.shape[0])
obs = __(d)__

(a). Answer: sales.get("date").apply(find_day)

In this blank, we want to create a Series that contains days of the week, such as "Saturday", to be assigned to a column named day_of_week in sales. We take the "date" column in sales and apply the function find_day to each date in the column.


Difficulty: ⭐️⭐️

The average score on this problem was 77%.

(b). Answer: "day_of_week"

We want to group the sales DataFrame by the day_of_week column that was created in blank (a) to collect together all rows corresponding to the same day of the week.


Difficulty: ⭐️⭐️

The average score on this problem was 84%.

(c). Answer: count()

We want to count how many sales occurred on each day of the week, or, equivalently, how many rows of sales belong to each day, so we use count() after grouping by day_of_week.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 71%.

(d). Answer: (np.abs(prop - hargen)).mean()

After the groupby, prop contains, for each day of the week (indexed by day in alphabetical order), the number of rows of sales for that day divided by the total number of rows, i.e. the observed proportion of sales on each day. We want to compute the statistic we’ve chosen: the mean of the absolute differences in proportions between the observed distribution and Hargen’s proposed distribution. Since the days appear in the same (alphabetical) order in both prop and hargen, we can simply subtract one from the other to get the differences in proportions. We then turn these into absolute differences with np.abs and take the mean using .mean().
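Putting the four blanks together, the completed code is:

    # in alphabetical order: Fri, Mon, Sat, Sun, Thurs, Tues, Wed
    hargen = np.array([0.2, 0.1, 0.2, 0.2, 0.1, 0.1, 0.1])
    prop = (sales.assign(day_of_week = sales.get("date").apply(find_day))
                 .groupby("day_of_week").count().get("ISBN") / sales.shape[0])
    obs = (np.abs(prop - hargen)).mean()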


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 59%.


Problem 6.4

To conduct the hypothesis test, we’ll need to generate thousands of simulated day-of-the-week distributions. What will we do with all these simulated distributions?

Answer: Use them to determine whether the observed distribution of books sold by day looks like a typical simulated distribution.

For hypothesis testing, we simulate based on the null distribution, which is the distribution that Hargen proposed. For each simulated distribution of proportions, we calculate the statistic we chose. After many simulations, we have thousands of these statistics, each comparing one simulated distribution to Hargen’s proposed distribution. Lastly, we compare the observed statistic with the statistics from the simulations to see whether the observed distribution of books sold by day looks like a typical simulated distribution.


Difficulty: ⭐️⭐️⭐️⭐️⭐️

The average score on this problem was 28%.


Problem 6.5

In each iteration of the simulation, we need to collect a sample of size sales.shape[0] from which distribution?

Answer: Hargen’s proposed distribution.

For hypothesis testing, we simulate based on the null distribution, which is the distribution that Hargen proposed.


Difficulty: ⭐️⭐️⭐️⭐️⭐️

The average score on this problem was 26%.


Problem 6.6

Suppose that obs comes out to be in the 98th percentile of simulated test statistics. Using the standard p-value cutoff of 0.05, what would Hargen conclude about his original model?

Answer: It is likely wrong.

Using the standard p-value cutoff of 0.05, we say that the observed distribution is not like a typical simulated distribution under Hargen’s proposed distribution if it falls below the 2.5th percentile or above the 97.5th percentile. In this case, the 98th percentile is above the 97.5th percentile, so we say that Hargen’s proposed distribution is likely wrong.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 63%.



Source: fa24-final — Q10

Problem 7


Problem 7.1

Suppose the bookstore DataFrame has 10 unique genres, and we are given a sample of 350 books from that DataFrame. Determine the maximum possible total variation distance (TVD) that could occur between our sample’s genre distribution and the uniform distribution where each genre occurs with equal probability. Your answer should be a single number.

Answer: 0.9

To determine the maximum possible TVD, we consider the scenario where all books belong to a single genre. This represents the maximum deviation from the uniform distribution:

  • Uniform distribution: 0.1 probability for all genres
  • Single genre distribution: 1 probability for one genre and 0 probability for all others

TVD = \frac{1}{2} \left( |1 - 0.1| + 9 \times |0 - 0.1| \right) = \frac{1}{2} (0.9 + 0.9) = 0.9


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 31%.


Problem 7.2

True or False: If the sample instead had 700 books, then the maximum possible TVD would increase.

Answer: False

The maximum possible TVD is based on proportions and not absolute counts. Even if the sample size is increased, the TVD would remain the same.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 71%.


Problem 7.3

True or False: If the bookstore DataFrame had 11 genres instead of 10, the maximum possible TVD would increase.

Answer: True

With 11 genres, the uniform probability per genre decreases to \frac{1}{11} instead of \frac{1}{10} with 10 genres. In the extreme scenario where one genre dominates, the TVD is now bigger.

TVD = \frac{1}{2} \left( |1 - \frac{1}{11}| + 10 \times |0 - \frac{1}{11}| \right) = \frac{1}{2} (\frac{10}{11} + \frac{10}{11}) = \frac{10}{11}


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 66%.



Problem 8

For this question, let’s think of the data in app_data as a random sample of all IKEA purchases and use it to test the following hypotheses.

Null Hypothesis: IKEA sells an equal amount of beds (category 'bed') and outdoor furniture (category 'outdoor').

Alternative Hypothesis: IKEA sells more beds than outdoor furniture.

The DataFrame app_data contains 5000 rows, which form our sample. Of these 5000 products, 1000 are beds and 1500 are outdoor furniture items.


Problem 8.1

Which of the following could be used as the test statistic for this hypothesis test? Select all that apply.

Answer:

  • Among 2500 beds and outdoor furniture items, the proportion of beds.
  • Among 2500 beds and outdoor furniture items, the number of beds.

Our test statistic needs to be able to distinguish between the two hypotheses. The first option does not do this, because it includes an absolute value. If the absolute difference between the proportion of beds and the proportion of outdoor furniture were large, it could be because IKEA sells more beds than outdoor furniture, but it could also be because IKEA sells more outdoor furniture than beds.

The second option is a valid test statistic, because if the proportion of beds is large, that suggests that the alternative hypothesis may be true.

Similarly, the third option works because if the number of beds (out of 2500) is large, that suggests that the alternative hypothesis may be true.

The fourth option is invalid because out of 2500 beds and outdoor furniture items, the number of beds plus the number of outdoor furniture items is always 2500. So the value of this statistic is constant regardless of whether the alternative hypothesis is true, which means it does not help you distinguish between the two hypotheses.


Difficulty: ⭐️⭐️

The average score on this problem was 78%.


Problem 8.2

Let’s do a hypothesis test with the following test statistic: among 2500 beds and outdoor furniture items, the proportion of outdoor furniture minus the proportion of beds.

Complete the code below to calculate the observed value of the test statistic and save the result as obs_diff.

    outdoor = (app_data.get('category')=='outdoor') 
    bed = (app_data.get('category')=='bed')
    obs_diff = ( ___(a)___ - ___(b)___ ) / ___(c)___

The table below contains several Python expressions. Choose the correct expression to fill in each of the three blanks. Three expressions will be used, and two will be unused.

Answer: Reading the table from top to bottom, the five expressions should be used in the following blanks: None, (b), (a), (c), None.

The correct way to define obs_diff is

    outdoor = (app_data.get('category')=='outdoor') 
    bed = (app_data.get('category')=='bed')
    obs_diff = (app_data[outdoor].shape[0] - app_data[bed].shape[0]) / app_data[outdoor | bed].shape[0]

The first provided line of code defines a boolean Series called outdoor with a value of True corresponding to each outdoor furniture item in app_data. Using this as the condition in a query results in a DataFrame of outdoor furniture items, and using .shape[0] on this DataFrame gives the number of outdoor furniture items. So app_data[outdoor].shape[0] represents the number of outdoor furniture items in app_data. Similarly, app_data[bed].shape[0] represents the number of beds in app_data. Likewise, app_data[outdoor | bed].shape[0] represents the total number of outdoor furniture items and beds in app_data. Notice that we need to use an or condition (|) to get a DataFrame that contains both outdoor furniture and beds.

We are told that the test statistic should be the proportion of outdoor furniture minus the proportion of beds. Translating this directly into code, this means the test statistic should be calculated as

    obs_diff = app_data[outdoor].shape[0] / app_data[outdoor | bed].shape[0] - app_data[bed].shape[0] / app_data[outdoor | bed].shape[0]

Since this is a difference of two fractions with the same denominator, we can equivalently subtract the numerators first, then divide by the common denominator, using the mathematical fact \frac{a}{c} - \frac{b}{c} = \frac{a-b}{c}.

This yields the answer

    obs_diff = (app_data[outdoor].shape[0] - app_data[bed].shape[0]) / app_data[outdoor | bed].shape[0]

Notice that this is the observed value of the test statistic because it’s based on the real-life data in the app_data DataFrame, not simulated data.


Difficulty: ⭐️

The average score on this problem was 90%.


Problem 8.3

Suppose we generate 10,000 simulated values of the test statistic according to the null model and store them in an array called simulated_diffs. Complete the code below to calculate the p-value for the hypothesis test.

    np.count_nonzero(simulated_diffs _________ obs_diff)/10000

What goes in the blank?

Answer: <=

To answer this question, we need to know whether small values or large values of the test statistic indicate the alternative hypothesis. The alternative hypothesis is that IKEA sells more beds than outdoor furniture. Since we’re calculating the proportion of outdoor furniture minus the proportion of beds, this difference will be small (negative) if the alternative hypothesis is true. Larger (positive) values of the test statistic mean that IKEA sells more outdoor furniture than beds. A value near 0 means they sell beds and outdoor furniture equally.

The p-value is defined as the proportion of simulated test statistics that are equal to the observed value or more extreme, where extreme means in the direction of the alternative. In this case, since small values of the test statistic indicate the alternative hypothesis, the correct answer is <=.


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 43%.



Problem 9

For this question, let’s think of the data in app_data as a random sample of all IKEA purchases and use it to test the following hypotheses.

Null Hypothesis: IKEA sells an equal amount of beds (category 'bed') and outdoor furniture (category 'outdoor').

Alternative Hypothesis: IKEA sells more beds than outdoor furniture.

The DataFrame app_data contains 5000 rows, which form our sample. Of these 5000 products, 1000 are beds and 1500 are outdoor furniture items.


Problem 9.1

Which of the following could be used as the test statistic for this hypothesis test? Select all that apply.

Answer:

  • Among 2500 beds and outdoor furniture items, the proportion of beds.
  • Among 2500 beds and outdoor furniture items, the number of beds.

Our test statistic needs to be able to distinguish between the two hypotheses. The first option does not do this, because it includes an absolute value. If the absolute difference between the proportion of beds and the proportion of outdoor furniture were large, it could be because IKEA sells more beds than outdoor furniture, but it could also be because IKEA sells more outdoor furniture than beds.

The second option is a valid test statistic, because if the proportion of beds is large, that suggests that the alternative hypothesis may be true.

Similarly, the third option works because if the number of beds (out of 2500) is large, that suggests that the alternative hypothesis may be true.

The fourth option is invalid because out of 2500 beds and outdoor furniture items, the number of beds plus the number of outdoor furniture items is always 2500. So the value of this statistic is constant regardless of whether the alternative hypothesis is true, which means it does not help you distinguish between the two hypotheses.


Difficulty: ⭐️⭐️

The average score on this problem was 78%.


Problem 9.2

Let’s do a hypothesis test with the following test statistic: among 2500 beds and outdoor furniture items, the proportion of outdoor furniture minus the proportion of beds.

Complete the code below to calculate the observed value of the test statistic and save the result as obs_diff.

    outdoor = (app_data.get('category')=='outdoor') 
    bed = (app_data.get('category')=='bed')
    obs_diff = ( ___(a)___ - ___(b)___ ) / ___(c)___

The table below contains several Python expressions. Choose the correct expression to fill in each of the three blanks. Three expressions will be used, and two will be unused.

Answer: Reading the table from top to bottom, the five expressions should be used in the following blanks: None, (b), (a), (c), None.

The correct way to define obs_diff is

    outdoor = (app_data.get('category')=='outdoor') 
    bed = (app_data.get('category')=='bed')
    obs_diff = (app_data[outdoor].shape[0] - app_data[bed].shape[0]) / app_data[outdoor | bed].shape[0]

The first provided line of code defines a boolean Series called outdoor with a value of True corresponding to each outdoor furniture item in app_data. Using this as the condition in a query results in a DataFrame of outdoor furniture items, and using .shape[0] on this DataFrame gives the number of outdoor furniture items. So app_data[outdoor].shape[0] represents the number of outdoor furniture items in app_data. Similarly, app_data[bed].shape[0] represents the number of beds in app_data. Likewise, app_data[outdoor | bed].shape[0] represents the total number of outdoor furniture items and beds in app_data. Notice that we need to use an or condition (|) to get a DataFrame that contains both outdoor furniture and beds.

We are told that the test statistic should be the proportion of outdoor furniture minus the proportion of beds. Translating this directly into code, this means the test statistic should be calculated as

    obs_diff = app_data[outdoor].shape[0] / app_data[outdoor | bed].shape[0] - app_data[bed].shape[0] / app_data[outdoor | bed].shape[0]

Since this is a difference of two fractions with the same denominator, we can equivalently subtract the numerators first, then divide by the common denominator, using the mathematical fact \frac{a}{c} - \frac{b}{c} = \frac{a-b}{c}.

This yields the answer

    obs_diff = (app_data[outdoor].shape[0] - app_data[bed].shape[0]) / app_data[outdoor | bed].shape[0]

Notice that this is the observed value of the test statistic because it’s based on the real-life data in the app_data DataFrame, not simulated data.


Difficulty: ⭐️

The average score on this problem was 90%.


Problem 9.3

Which of the following is a valid way to generate one value of the test statistic according to the null model? Select all that apply.

Way 1:

multi = np.random.multinomial(2500, [0.5,0.5]) 
(multi[0] - multi[1])/2500

Way 2:

outdoor = np.random.multinomial(2500, [0.5,0.5])[0]/2500 
bed = np.random.multinomial(2500, [0.5,0.5])[1]/2500 
outdoor - bed 

Way 3:

choice = np.random.choice([0, 1], 2500, replace=True) 
choice_sum = choice.sum() 
(choice_sum - (2500 - choice_sum))/2500

Way 4:

choice = np.random.choice(['bed', 'outdoor'], 2500, replace=True) 
bed = np.count_nonzero(choice=='bed')
outdoor = np.count_nonzero(choice=='outdoor')
outdoor/2500 - bed/2500

Way 5:

outdoor = (app_data.get('category')=='outdoor') 
bed = (app_data.get('category')=='bed')
samp = app_data[outdoor|bed].sample(2500, replace=True) 
samp[samp.get('category')=='outdoor'].shape[0]/2500 - samp[samp.get('category')=='bed'].shape[0]/2500

Way 6:

outdoor = (app_data.get('category')=='outdoor') 
bed = (app_data.get('category')=='bed')
samp = (app_data[outdoor|bed].groupby('category').count().reset_index().sample(2500, replace=True))    
samp[samp.get('category')=='outdoor'].shape[0]/2500 - samp[samp.get('category')=='bed'].shape[0]/2500

Answer: Way 1, Way 3, Way 4, Way 6

Let’s consider each way in order.

Way 1 is a correct solution. This code begins by defining a variable multi which will evaluate to an array with two elements representing the number of items in each of the two categories, after 2500 items are drawn randomly from the two categories, with each category being equally likely. In this case, our categories are beds and outdoor furniture, and the null hypothesis says that each category is equally likely, so this describes our scenario accurately. We can interpret multi[0] as the number of outdoor furniture items and multi[1] as the number of beds when we draw 2500 of these items with equal probability. Using the same mathematical fact from the solution to Problem 8.2, we can calculate the difference in proportions as the difference in number divided by the total, so it is correct to calculate the test statistic as (multi[0] - multi[1])/2500.

Way 2 is an incorrect solution. Way 2 is based on a similar idea as Way 1, except it calls np.random.multinomial twice, which corresponds to two separate random processes of selecting 2500 items, each of which is equally likely to be a bed or an outdoor furniture item. However, it is not guaranteed that the number of outdoor furniture items in the first random selection plus the number of beds in the second random selection totals 2500. Way 2 calculates the proportion of outdoor furniture items in one random selection minus the proportion of beds in another. What we want to do instead is calculate the difference between the proportions of outdoor furniture and beds in a single random draw.

Way 3 is a correct solution. Way 3 does the random selection of items in a different way, using np.random.choice. Way 3 creates a variable called choice which is an array of 2500 values. Each value is chosen from the list [0,1] with each of the two list elements being equally likely to be chosen. Of course, since we are choosing 2500 items from a list of size 2, we must allow replacements. We can interpret the elements of choice by thinking of each 1 as an outdoor furniture item and each 0 as a bed. By doing so, this random selection process matches up with the assumptions of the null hypothesis. Then the sum of the elements of choice represents the total number of outdoor furniture items, which the code saves as the variable choice_sum. Since there are 2500 beds and outdoor furniture items in total, 2500 - choice_sum represents the total number of beds. Therefore, the test statistic here is correctly calculated as the number of outdoor furniture items minus the number of beds, all divided by the total number of items, which is 2500.

Way 4 is a correct solution. Way 4 is similar to Way 3, except instead of using 0s and 1s, it uses the strings 'bed' and 'outdoor' in the choice array, so the interpretation is even more direct. Another difference is the way the number of beds and number of outdoor furniture items is calculated. It uses np.count_nonzero instead of sum, which wouldn’t make sense with strings. This solution calculates the proportion of outdoor furniture minus the proportion of beds directly.

Way 5 is an incorrect solution. As described in the solution to Problem 8.2, app_data[outdoor|bed] is a DataFrame containing just the outdoor furniture items and the beds from app_data. Based on the given information, we know app_data[outdoor|bed] has 2500 rows, 1000 of which correspond to beds and 1500 of which correspond to furniture items. This code defines a variable samp that comes from sampling this DataFrame 2500 times with replacement. This means that each row of samp is equally likely to be any of the 2500 rows of app_data[outdoor|bed]. The fraction of these rows that are beds is 1000/2500 = 2/5 and the fraction of these rows that are outdoor furniture items is 1500/2500 = 3/5. This means the random process of selecting rows randomly such that each row is equally likely does not make each item equally likely to be a bed or outdoor furniture item. Therefore, this approach does not align with the assumptions of the null hypothesis.

Way 6 is a correct solution. Way 6 essentially modifies Way 5 to make beds and outdoor furniture items equally likely to be selected in the random sample. As in Way 5, the code involves the DataFrame app_data[outdoor|bed] which contains 1000 beds and 1500 outdoor furniture items. Then this DataFrame is grouped by 'category' which results in a DataFrame indexed by 'category', which will have only two rows, since there are only two values of 'category', either 'outdoor' or 'bed'. The aggregation function .count() is irrelevant here. When the index is reset, 'category' becomes a column. Now, randomly sampling from this two-row grouped DataFrame such that each row is equally likely to be selected does correspond to choosing items such that each item is equally likely to be a bed or outdoor furniture item. The last line simply calculates the proportion of outdoor furniture items minus the proportion of beds in our random sample drawn according to the null model.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 59%.


Problem 9.4

Suppose we generate 10,000 simulated values of the test statistic according to the null model and store them in an array called simulated_diffs. Complete the code below to calculate the p-value for the hypothesis test.

    np.count_nonzero(simulated_diffs _________ obs_diff)/10000

What goes in the blank?

Answer: <=

To answer this question, we need to know whether small values or large values of the test statistic indicate the alternative hypothesis. The alternative hypothesis is that IKEA sells more beds than outdoor furniture. Since we’re calculating the proportion of outdoor furniture minus the proportion of beds, this difference will be small (negative) if the alternative hypothesis is true. Larger (positive) values of the test statistic mean that IKEA sells more outdoor furniture than beds. A value near 0 means they sell beds and outdoor furniture equally.

The p-value is defined as the proportion of simulated test statistics that are equal to the observed value or more extreme, where extreme means in the direction of the alternative. In this case, since small values of the test statistic indicate the alternative hypothesis, the correct answer is <=.


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 43%.



Problem 10

There are 52 IKEA locations in the United States, and there are 50 states.

Which of the following describes how to calculate the total variation distance between the distribution of IKEA locations by state and the uniform distribution?

Answer: For each state, take the absolute difference between 1/50 and the number of IKEAs in that state divided by the total number of IKEA locations. Sum these values across all states and divide by two.

We’re looking at the distribution across states. Since there are 50 states, the uniform distribution would correspond to having a fraction of 1/50 of the IKEA locations in each state. We can picture this as a sequence with 50 entries that are all the same: (1/50, 1/50, 1/50, 1/50, \dots)

We want to compare this to the actual distribution of IKEAs across states, which we can think of as a sequence with 50 entries, representing the 50 states, but where each entry is the proportion of IKEA locations in a given state. For example, maybe the distribution starts out like this: (3/52, 1/52, 0/52, 1/52, \dots) We can interpret each entry as the number of IKEAs in a state divided by the total number of IKEA locations. Note that this has nothing to do with the average number of IKEA locations in each state, which is 52/50.

The way we take the TVD of two distributions is to subtract the distributions entry by entry, take the absolute value of each difference, sum up these values, and divide by 2. Since the entries represent states, this process aligns with the given answer.
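As a concrete sketch, here is the calculation in code with made-up counts (not the real locations) that sum to 52 across the 50 states:

    import numpy as np

    # Hypothetical number of IKEA locations in each of the 50 states; sums to 52.
    counts = np.ones(50)
    counts[:2] = 2

    observed_dist = counts / counts.sum()    # proportion of all IKEA locations in each state
    uniform_dist = np.ones(50) / 50          # uniform distribution: 1/50 per state

    tvd = np.abs(observed_dist - uniform_dist).sum() / 2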


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 30%.


Source: sp23-final — Q5

Problem 11

In some cities, the number of sunshine hours per month is relatively consistent throughout the year. São Paulo, Brazil is one such city; in all months of the year, the number of sunshine hours per month is somewhere between 139 and 173. New York City’s monthly sunshine hours, on the other hand, range from 139 to 268.

Gina and Abel, both San Diego natives, are interested in assessing how “consistent" the number of sunshine hours per month in San Diego appear to be. Specifically, they’d like to test the following hypotheses:

As their test statistic, Gina and Abel choose the total variation distance. To simulate samples under the null, they will sample from a categorical distribution with 12 categories — January, February, and so on, through December — each of which has an equal probability of being chosen.


Problem 11.1

In order to run their hypothesis test, Gina and Abel need a way to calculate their test statistic. Below is an incomplete implementation of a function that computes the TVD between two arrays of length 12, each of which represent a categorical distribution.

    def calculate_tvd(dist1, dist2):
        return np.mean(np.abs(dist1 - dist2)) * ____

Fill in the blank so that calculate_tvd works as intended.

Answer: 6

The TVD is the sum of the absolute differences in proportions, divided by 2. In the code to the left of the blank, we’ve computed the mean of the absolute differences in proportions, which is the same as the sum of the absolute differences in proportions, divided by 12 (since len(dist1) is 12). To correct the fact that we divided by 12, we multiply by 6, so that we’re only dividing by 2.
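To see why multiplying by 6 works, here is a quick check comparing the definition of the TVD to the expression in calculate_tvd, using the uniform distribution and a hypothetical 12-month distribution:

    import numpy as np

    dist1 = np.ones(12) / 12    # uniform distribution over the 12 months
    dist2 = np.array([0.20, 0.10, 0.10, 0.05, 0.05, 0.10,
                      0.05, 0.05, 0.10, 0.10, 0.05, 0.05])    # hypothetical distribution

    tvd_by_definition = np.abs(dist1 - dist2).sum() / 2    # sum of absolute differences, over 2
    tvd_via_mean = np.mean(np.abs(dist1 - dist2)) * 6      # mean is sum / 12, so * 6 gives sum / 2

Both expressions give the same value, since dividing by 12 and then multiplying by 6 is the same as dividing by 2.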


Difficulty: ⭐️⭐️⭐️⭐️⭐️

The average score on this problem was 17%.


Moving forward, assume that calculate_tvd works correctly.

Now, complete the implementation of the function uniform_test, which takes in an array observed_counts of length 12 containing the number of sunshine hours each month in a city and returns the p-value for the hypothesis test stated at the start of the question.

    def uniform_test(observed_counts):
        # The values in observed_counts are counts, not proportions!
        total_count = observed_counts.sum()
        uniform_dist = __(b)__
        tvds = np.array([])
        for i in np.arange(10000):
            simulated = __(c)__
            tvd = calculate_tvd(simulated, __(d)__)
            tvds = np.append(tvds, tvd)
        return np.mean(tvds __(e)__ calculate_tvd(uniform_dist, __(f)__))


Problem 11.2

What goes in blank (b)? (Hint: The function np.ones(k) returns an array of length k in which all elements are 1.)

Answer: np.ones(12) / 12

uniform_dist needs to be the same as the uniform distribution provided in the null hypothesis, \left[\frac{1}{12}, \frac{1}{12}, ..., \frac{1}{12}\right].

In code, this is an array of length 12 in which each element is equal to 1 / 12. np.ones(12) creates an array of length 12 in which each value is 1; for each value to be 1 / 12, we divide np.ones(12) by 12.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 66%.



Problem 11.3

What goes in blank (c)?

Answer: np.random.multinomial(total_count, uniform_dist) / total_count

The idea here is to repeatedly generate an array of proportions that results from distributing total_count hours across the 12 months in a way that each month is equally likely to be chosen. Each time we generate such an array, we’ll determine its TVD from the uniform distribution; doing this repeatedly gives us an empirical distribution of the TVD under the assumption the null hypothesis is true.


Difficulty: ⭐️⭐️⭐️⭐️⭐️

The average score on this problem was 21%.



Problem 11.4

What goes in blank (d)?

Answer: uniform_dist

As mentioned above:

Each time we generate such an array, we’ll determine its TVD from the uniform distribution; doing this repeatedly gives us an empirical distribution of the TVD under the assumption the null hypothesis is true.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 54%.



Problem 11.5

What goes in blank (e)?

Answer: >=

The purpose of the last line of code is to compute the p-value for the hypothesis test. Recall, the p-value of a hypothesis test is the proportion of simulated test statistics that are as or more extreme than the observed test statistic, under the assumption the null hypothesis is true. In this context, “as extreme or more extreme” means the simulated TVD is greater than or equal to the observed TVD (since larger TVDs mean “more different”).


Difficulty: ⭐️⭐️

The average score on this problem was 77%.


Problem 11.6

What goes in blank (f)?

Answer: observed_counts / total_count or observed_counts / observed_counts.sum()

Blank (f) needs to contain the observed distribution of sunshine hours (as an array of proportions) that we compare against the uniform distribution to calculate the observed TVD. This observed TVD is then compared with the distribution of simulated TVDs to calculate the p-value. The observed counts are converted to proportions by dividing by the total count so that the observed distribution is on the same scale as the simulated and expected uniform distributions, which are also in proportions.
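Putting all of the blanks together, the completed function is:

    def uniform_test(observed_counts):
        # The values in observed_counts are counts, not proportions!
        total_count = observed_counts.sum()
        uniform_dist = np.ones(12) / 12
        tvds = np.array([])
        for i in np.arange(10000):
            simulated = np.random.multinomial(total_count, uniform_dist) / total_count
            tvd = calculate_tvd(simulated, uniform_dist)
            tvds = np.append(tvds, tvd)
        return np.mean(tvds >= calculate_tvd(uniform_dist, observed_counts / total_count))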


Difficulty: ⭐️⭐️⭐️⭐️⭐️

The average score on this problem was 27%.



Source: sp23-final — Q10

Problem 12

Costin, a San Francisco native, will be back in San Francisco over the summer, and is curious as to whether it is true that about \frac{3}{4} of days in San Francisco are sunny.

Fast forward to the end of September: Costin counted that of the 30 days in September, 27 were sunny in San Francisco. To test his theory, Costin came up with two pairs of hypotheses.

Pair 1:

Null Hypothesis: The probability that any given day in San Francisco is sunny is \frac{3}{4}.

Alternative Hypothesis: The probability that any given day in San Francisco is sunny is not \frac{3}{4}.

Pair 2:

Null Hypothesis: The probability that any given day in San Francisco is sunny is \frac{3}{4}.

Alternative Hypothesis: The probability that any given day in San Francisco is sunny is greater than \frac{3}{4}.

For each test statistic below, choose whether the test statistic could be used to test Pair 1, Pair 2, both, or neither. Assume that all days are either sunny or cloudy, and that we cannot perform two-tailed hypothesis tests. (If you don’t know what those are, you don’t need to!)


Problem 12.1

The difference between the number of sunny days and number of cloudy days

Answer: Pair 2

The test statistic provided is the difference between the number of sunny days and cloudy days in a sample of 30 days. Since each day is either sunny or cloudy, the number of cloudy days is just 30 - the number of sunny days. This means we can re-write our test statistic as follows:

\begin{align*} &\text{number of sunny days} - \text{number of cloudy days} \\ &= \text{number of sunny days} - (30 - \text{number of sunny days}) \\ &= 2 \cdot \text{number of sunny days} - 30 \\ &= 2 \cdot (\text{number of sunny days} - 15) \end{align*}

The more sunny days there are in our sample of 30 days, the larger this test statistic will be. (Specifically, if there are more sunny days than cloudy days, this will be positive; if there’s an equal number of sunny and cloudy days, this will be 0, and if there are more cloudy days, this will be negative.)

Now, let’s look at each pair of hypotheses.

Pair 1:

Pair 1’s alternative hypothesis is that the probability of a sunny day is not \frac{3}{4}, which includes both greater than and less than \frac{3}{4}.

To test this pair of hypotheses, we need a test statistic that is large when the number of sunny days is far from \frac{3}{4} (evidence for the alternative hypothesis) and small when the number of sunny days is close to \frac{3}{4} (evidence for the null hypothesis). (It would also be acceptable to design a test statistic that is small when the number of sunny days is far from \frac{3}{4} and large when it’s close to \frac{3}{4}, but the first option we’ve outlined is a bit more natural.)

Our chosen test statistic, 2 \cdot (\text{number of sunny days} - 15), doesn’t work this way; both very large values and very small values indicate that the proportion of sunny days is far from \frac{3}{4}, and since we can’t use two-tailed tests, we can’t use our test statistic for this pair.

Pair 2:

Pair 2’s alternative hypothesis is that the probability of a sunny day is greater than \frac{3}{4}.

Since our test statistic is large when the number of sunny days is large (evidence for the alternative hypothesis) and is small when the number of sunny days is small (evidence for the null hypothesis), we can use our test statistic to test this pair of hypotheses. The key difference between Pair 1 and Pair 2 is that Pair 2’s alternative hypothesis has a direction – it says that the probability that it is sunny on any given day is greater than \frac{3}{4}, rather than just “not” \frac{3}{4}.

Thus, we can use this test statistic to test Pair 2, but not Pair 1.


Difficulty: ⭐️⭐️⭐️⭐️⭐️

The average score on this problem was 28%.



Problem 12.2

The absolute difference between the number of sunny days and number of cloudy days

Answer: Neither

The test statistic here is the absolute value of the test statistic in the first part. Since we were able to re-write the test statistic in the first part as 2 \cdot (\text{number of sunny days} - 15), our test statistic here is |2 \cdot (\text{number of sunny days} - 15)|, or, since 2 is already non-negative,

2 \cdot | \text{number of sunny days} - 15 |

This test statistic is large when the number of sunny days is far from 15, i.e. when the number of sunny days and number of cloudy days are far apart, or when the proportion of sunny days is far from \frac{1}{2}. However, the null hypothesis we’re testing here is not that the proportion of sunny days is \frac{1}{2}, but that the proportion of sunny days is \frac{3}{4}.

A large value of this test statistic will tell us the proportion of sunny days is far from \frac{1}{2}, but it may or may not be far from \frac{3}{4}. For instance, when \text{number of sunny days} = 7, then our test statistic is 2 \cdot | 7 - 15 | = 16. When \text{number of sunny days} = 23, our test statistic is also 16. However, in the first case, the proportion of sunny days is just under \frac{1}{4} (far from \frac{3}{4}), while in the second case the proportion of sunny days is just above \frac{3}{4}.

In both pairs of hypotheses, this test statistic isn’t set up such that large values point to one hypothesis and small values point to the other, so it can’t be used to test either pair.


Difficulty: ⭐️⭐️⭐️⭐️⭐️

The average score on this problem was 25%.



Problem 12.3

The difference between the proportion of sunny days and \frac{1}{4}

Answer: Pair 2

The test statistic here is the difference between the proportion of sunny days and \frac{1}{4}. This means if p is the proportion of sunny days, the test statistic is p - \frac{1}{4}. This test statistic is large when the proportion of sunny days is large and small when the proportion of sunny days is small. (The fact that we’re subtracting \frac{1}{4} doesn’t change this pattern – all it does is shift both the empirical distribution of the test statistic and our observed statistic \frac{1}{4} of a unit to the left on the x-axis.)

As such, this test statistic behaves the same as the test statistic from the first part – both test statistics are large when the number of sunny days is large (evidence for the alternative hypothesis) and small when the number of sunny days is small (evidence for the null hypothesis). This means that, like in the first part, we can use this test statistic to test Pair 2, but not Pair 1.


Difficulty: ⭐️⭐️⭐️⭐️⭐️

The average score on this problem was 24%.



Problem 12.4

The absolute difference between the proportion of cloudy days and \frac{1}{4}

Answer: Pair 1

The test statistic here is the absolute difference between the proportion of cloudy days and \frac{1}{4}. Let q be the proportion of cloudy days. The test statistic is |q - \frac{1}{4}|. The null hypothesis for both pairs states that the probability of a sunny day is \frac{3}{4}, which implies the probability of a cloudy day is \frac{1}{4} (since all days are either sunny or cloudy).

This test statistic is large when the proportion of cloudy days is far from \frac{1}{4} and small when the proportion of cloudy days is close to \frac{1}{4}.

Since Pair 1’s alternative hypothesis is just that the proportion of cloudy days is not \frac{1}{4}, we can use this test statistic to test it! Large values of this test statistic point to the alternative hypothesis and small values point to the null.

On the other hand, Pair 2’s alternative hypothesis is that the proportion of sunny days is greater than \frac{3}{4}, which is the same as the proportion of cloudy days being less than \frac{1}{4}. The issue here is that our test statistic doesn’t involve a direction – a large value implies that the proportion of cloudy days is far from \frac{1}{4}, but we don’t know if that means that there were fewer cloudy days than \frac{1}{4} (evidence for Pair 2’s alternative hypothesis) or more cloudy days than \frac{1}{4} (evidence for Pair 2’s null hypothesis). Since, for Pair 2, this test statistic isn’t set up such that large values point to one hypothesis and small values point to the other, we can’t use this test statistic to test Pair 2.

Therefore, we can use this test statistic to test Pair 1, but not Pair 2.

Aside: This test statistic is equivalent to the absolute difference between the proportion of sunny days and \frac{3}{4}. Try and prove this fact!


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 46%.



Source: sp24-final — Q16

Problem 13

The table below shows the proportion of apartments of each type in each of three neighborhoods. Note that each column sums to 1.

Type          | North Park | Chula Vista | La Jolla
Studio        | 0.30       | 0.15        | 0.40
One bedroom   | 0.40       | 0.35        | 0.30
Two bedroom   | 0.20       | 0.25        | 0.15
Three bedroom | 0.10       | 0.25        | 0.15


Problem 13.1

Find the total variation distance (TVD) between North Park and Chula Vista. Give your answer as an exact decimal.

Answer: 0.2

To find the TVD, we take the absolute differences between North Park and Chula Vista for all rows, sum them, then cut the result in half.

\dfrac{|0.3 - 0.15| + |0.4 - 0.35| + |0.2 - 0.25| + |0.1 - 0.25|}{2} = \dfrac{0.15 + 0.05 + 0.05 + 0.15}{2} = \dfrac{0.4}{2} = 0.2


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 72%.


Problem 13.2

Which pair of neighborhoods is most similar in terms of types of housing, as measured by TVD?

Answer: North Park and La Jolla

The TVD between North Park and La Jolla is the lowest between all pairs of two of these three neighborhoods:


Pair                       | TVD
North Park and Chula Vista | 0.2
North Park and La Jolla    | 0.15
Chula Vista and La Jolla   | 0.25


This implies that the distributions of apartment types for North Park and La Jolla are the most similar.
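The table above can be verified with a short snippet (a sketch; the arrays below are just the columns of the apartment-type table):

import numpy as np

def tvd(dist_1, dist_2):
    # total variation distance: half the sum of the absolute differences in proportions
    return np.abs(np.array(dist_1) - np.array(dist_2)).sum() / 2

north_park  = [0.30, 0.40, 0.20, 0.10]
chula_vista = [0.15, 0.35, 0.25, 0.25]
la_jolla    = [0.40, 0.30, 0.15, 0.15]

tvd(north_park, chula_vista)   # 0.2
tvd(north_park, la_jolla)      # 0.15
tvd(chula_vista, la_jolla)     # 0.25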


Difficulty: ⭐️

The average score on this problem was 94%.


Problem 13.3

25% of apartments in Little Italy are one bedroom apartments. Based on this information, what is the minimum and maximum possible TVD between North Park and Little Italy? Give your answers as exact decimals.

Minimum: ______ Maximum: ______


Answer:

  • Minimum: 0.15
  • Maximum: 0.65

The minimum TVD is 0.15 because:

  • One-Bedroom Apartments for North Park and Little Italy already have a gap of |0.4 - 0.25| = 0.15
  • A best-possible configuration of the remaining 0.75 of the Little Italy distribution (Studio: 0.3, Two Bed: 0.2, Three Bed: 0.25) produces an additional |0.3 - 0.3| + |0.2 - 0.2| + |0.1 - 0.25| = 0.15 error against North Park.
  • The TVD of this optimal scenario is \frac{0.15 + 0.15}{2} = 0.15.

The maximum TVD is 0.65 because:

  • One-Bedroom Apartments for North Park and Little Italy already have a gap of |0.4 - 0.25| = 0.15
  • The worst-possible configuration of the remaining 0.75 of the Little Italy distribution (Studio: 0.0, Two Bed: 0.0, Three Bed: 0.75) produces an additional |0.3 - 0| + |0.2 - 0| + |0.1 - 0.75| = 1.15 error against North Park.
  • The TVD of this worst scenario is \frac{0.15 + 1.15}{2} = 0.65.
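The two extreme Little Italy configurations described above can be checked directly (a sketch; the categories are listed in the order Studio, One bedroom, Two bedroom, Three bedroom):

import numpy as np

def tvd(dist_1, dist_2):
    return np.abs(np.array(dist_1) - np.array(dist_2)).sum() / 2

north_park = [0.30, 0.40, 0.20, 0.10]

# closest case: the remaining 0.75 matches North Park wherever possible
little_italy_closest  = [0.30, 0.25, 0.20, 0.25]
# farthest case: the remaining 0.75 all goes to the category North Park has the least of
little_italy_farthest = [0.00, 0.25, 0.00, 0.75]

tvd(north_park, little_italy_closest)    # 0.15
tvd(north_park, little_italy_farthest)   # 0.65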

Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 49%.


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 33%.



Problem 14

Every year, the American Kennel Club holds a Photo Contest for dogs. Eric wants to know whether toy dogs win disproportionately more often than other kinds of dogs. He has collected a sample of 500 dogs that have won the Photo Contest. In his sample, 200 dogs were toy dogs.

Eric also knows the distribution of dog kinds in the population:


Problem 14.1

Select all correct statements of the null hypothesis.

Answer: Options 4 & 5

A null hypothesis is the hypothesis that there is no significant difference between specified populations, any observed difference being due to sampling or experimental error. Let’s consider what a potential null hypothesis might look like here: there is no difference between the proportion of Photo Contest winners that are toy dogs and the proportion of toy dogs in the population.

  • Option 1: We’re not really looking at the distribution of dogs in our sample vs. dogs in our population; rather, we’re looking at whether toy dogs win more than other dogs. In other words, the only factors we’re really considering are the proportion of toy dogs relative to non-toy dogs and the win percentage of toy dogs relative to non-toy dogs, so the full distribution of the population doesn’t really matter. Furthermore, this option makes no reference to the win rate of toy dogs.

  • Option 2: This isn’t really even a null hypothesis, but rather more of a description of a test procedure. It also makes no reference to the win rate of toy dogs.

  • Option 3: This statement doesn’t really make sense: it is illogical to compare the raw number of toy dog wins to the number of toy dogs in the population, because the number of toy dogs in the population is always at least the number of toy dogs that win.

  • Option 4: This statement is in line with the null hypothesis.

  • Option 5: This statement is another potential null hypothesis since the proportion of toy dogs in the population is 0.3.

  • Option 6: This statement, although similar to Option 5, would not be a null hypothesis because 0.5 has no relevance to any of the relevant proportions. If the proportion of winners that are toy dogs were over 0.5, we could infer that toy dogs win the majority of the time; however, the question is not whether toy dogs win most of the time, but whether toy dogs win a disproportionately high number of times relative to their share of the population.


Difficulty: ⭐️⭐️

The average score on this problem was 83%.


Problem 14.2

Select the correct statement of the alternative hypothesis.

Answer: Option 1

The alternative hypothesis is the hypothesis we’re trying to support, which in this case is that toy dogs happen to win more than other dogs.

  • Option 1: This is in line with our alternative hypothesis, since proving that the null hypothesis underestimates how often toy dogs win means that toy dogs win more than other dogs.

  • Option 2: This is the opposite of what we’re trying to prove.

  • Option 3: We don’t really care too much about the distribution of dog kinds, since that doesn’t help us determine toy dog win rates compared to other dogs.

  • Option 4: Again, we don’t care whether all dogs are chosen according to the probabilities in the null model, instead we care specifically about toy dogs.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 67%.


Problem 14.3

Select all the test statistics that Eric can use to conduct his hypothesis.

Answer: Option 1 and Option 2

  • Option 1: This option is correct. According to our null hypothesis, we’re comparing the proportion of winners that are toy dogs to the proportion of toy dogs in the population. Thus, the proportion of toy dogs in Eric’s sample is a perfectly valid test statistic.

  • Option 2: This option is correct. Since the sample size is fixed at 500, knowing the count of toy dogs is equivalent to knowing the proportion.

  • Option 3: This option is incorrect. The absolute difference of the sample proportion of toy dogs and 0.3 doesn’t help us because the absolute difference won’t tell us whether or not the sample proportion of toy dogs is lower than 0.3 or higher than 0.3.

  • Option 4: This option is incorrect for the same reasoning as above, but also 0.5 isn’t a relevant number anyways.

  • Option 5: This option is incorrect because TVD measures distance between two categorical distributions, and here we only care about one particular category (not all categories) being the same.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 70%.


Problem 14.4

Eric decides on this test statistic: the proportion of toy dogs minus the proportion of non-toy dogs. What is the observed value of the test statistic?

Answer: -0.2

For our given sample, the proportion of toy dogs is \frac{200}{500}=0.4 and the proportion of non-toy dogs is \frac{500-200}{500}=0.6, so 0.4 - 0.6 = -0.2.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 74%.


Problem 14.5

Which snippets of code correctly compute Eric’s test statistic on one simulated sample under the null hypothesis? Select all that apply. The result must be stored in the variable stat. Below are the 5 snippets:

Snippet 1:

a = np.random.choice([0.3, 0.7])
b = np.random.choice([0.3, 0.7])
stat = a - b

Snippet 2:

a = np.random.choice([0.1, 0.2, 0.3, 0.2, 0.15, 0.05])
stat = a - (1 - a)

Snippet 3:

a = np.random.multinomial(500, [0.1, 0.2, 0.3, 0.2, 0.15, 0.05]) / 500
stat = a[2] - (1 - a[2])

Snippet 4:

a = np.random.multinomial(500, [0.3, 0.7]) / 500
stat = a[0] - (1 - a[0])

Snippet 5:

a = df.sample(500, replace=True)
b = a[a.get("kind") == "toy"].shape[0] / 500
stat = b - (1 - b)

Answer: Snippet 3 & Snippet 4

  • Snippet 1: This is incorrect because np.random.choice() simply picks one of the two numbers 0.3 or 0.7, which doesn’t simulate a sample of 500 dogs at all.

  • Snippet 2: This is wrong because np.random.choice() only picks one of the six values in the list. A quick sanity check shows that a simulated proportion of toy dogs should be able to take on far more values than just these six.

  • Snippet 3: This option is correct. Recall, in np.random.multinomial(n, [p_1, ..., p_k]), n is the number of experiments, and [p_1, ..., p_k] is a sequence of probabilities. The method returns an array of length k in which each element contains the number of occurrences of an event, where the probability of the ith event is p_i. In this snippet, np.random.multinomial(500, [0.1, 0.2, 0.3, 0.2, 0.15, 0.05]) generates an array of length 6 (len([0.1, 0.2, 0.3, 0.2, 0.15, 0.05])) that contains the number of occurrences of each kind of dog according to the given (population) distribution. We divide the result by 500 to convert the counts in the array into proportions. To access the proportion of toy dogs in our sample, we take the entry whose probability is 0.3, which is the third entry in the array, or a[2]. To calculate our test statistic we take the proportion of toy dogs minus the proportion of non-toy dogs, or a[2] - (1 - a[2]).

  • Snippet 4: This option is correct. This approach is similar to the one above except we’re only considering the probability distribution of toy dogs vs non-toy dogs, which is what we wanted in the first place. The rest of the steps are similar to the ones above.

  • Snippet 5: This is incorrect. Note that df is simply a DataFrame containing information about the dogs, and it may or may not reflect the population distribution of dogs that participate in the Photo Contest, so sampling from it doesn’t simulate under the null hypothesis.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 72%.


Problem 14.6

After simulating, Eric has an array called sim that stores his simulated test statistics, and a variable called obs that stores his observed test statistic.

What should go in the blank to compute the p-value?

np.mean(sim _______ obs)

Answer: Option 4: >=

Note that to calculate the p-value, we look for simulated test statistics that are equal to the observed statistic or even further in the direction of the alternative. Here, the alternative hypothesis says that toy dogs win more often than their share of the population would suggest, which corresponds to larger values of this test statistic (values at or above the observed -0.2), and thus we use >=.
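Putting the last few subparts together, a minimal sketch of the full simulation under the null (using Snippet 4’s simulation strategy; the 10,000 repetitions are an assumption, matching the convention used elsewhere in these problems):

import numpy as np

obs = 200 / 500 - 300 / 500            # observed statistic from the sample, -0.2

sim = np.array([])
for i in np.arange(10000):
    # under the null, each of the 500 winners is a toy dog with probability 0.3
    props = np.random.multinomial(500, [0.3, 0.7]) / 500
    sim = np.append(sim, props[0] - (1 - props[0]))

p_value = np.mean(sim >= obs)          # larger statistics favor the alternative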


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 66%.


Problem 14.7

Eric’s p-value is 0.03. If his p-value cutoff is 0.01, what does he conclude?

Answer: Option 4: He fails to reject the null

  • Option 1: Note that since our p-value was greater than 0.01, we fail to reject the null.

  • Option 2: We can never “accept” the null hypothesis.

  • Option 3: We didn’t accept the alternative since we failed to reject the null.

  • Option 4: This option is correct because our p-value was larger than our cutoff.


Difficulty: ⭐️⭐️

The average score on this problem was 86%.



Source: su24-final — Q7

Problem 15

You are analyzing the data in stages to see which countries winners come from.

You categorize the countries into four groups: France, Italy, Belgium, and Other. After performing some analysis, you find that the observed distribution of countries of origin for Tour de France stage winners is [0.3, 0.2, 0.1, 0.4]; i.e. 30% of stage winners are French, 20% Italian, etc. Based on census information, the expected population distribution is [0.07, 0.06, 0.01, 0.86]; that is, France’s total population is 7% of the sum of the populations of all countries participating in the Tour, Italy’s is 6%, etc.

You conduct a hypothesis test with the following hypotheses:


Problem 15.1

Which of the following test statistics are appropriate for this hypothesis test? Select all that apply.

Answer: Option 1 and Option 4

  • Option 1 is correct. The absolute difference between the expected proportion of French stage winners and the observed proportion of French stage winners measures how far apart the two distributions are in that category, with larger values indicating a bigger difference, making this a valid test statistic.
  • Option 2 is incorrect. Since both distributions sum to 1, the signed differences always cancel out to exactly 0, so the sum of the differences tells us nothing about whether the distributions are different.
  • Option 3 is incorrect. The absolute difference between the number of French stage winners and the number of Italian stage winners only compares two categories of the observed distribution to each other and never references the expected population distribution; moreover, since France and Italy have different population sizes, we’d expect different numbers of winners even under the null, so the raw difference isn’t a fair comparison.
  • Option 4 is correct. The sum of the absolute differences between the expected population distribution and the observed distribution of stage winners measures the magnitude of the difference without a direction, making this a valid test statistic.

Difficulty: ⭐️⭐️

The average score on this problem was 80%.


For the rest of this question, assume that we will be using the Total Variation Distance as our test statistic.


Problem 15.2

Complete the implementation of the simulate and calculate_test_stat functions so that the code below successfully simulates 10,000 test statistics.


expected_dist = [0.07, 0.06, 0.01, 0.86]
observed_dist = [0.3, 0.2, 0.1, 0.4]

def simulate(__(i)__):
    simulated_winners = np.random.__(ii)__(100, __(iii)__)
    return simulated_winners / 100

def calculate_test_stat(__(iv)__, __(v)__):
    return __(vi)__


observed_stat = calculate_test_stat(observed_dist, expected_dist)
simulated_stats = np.array([])
for i in np.arange(10000):
    simulated_dist = simulate(expected_dist)
    simulated_stat = calculate_test_stat(simulated_dist, expected_dist)
    simulated_stats = np.append(simulated_stats, simulated_stat)

Answer:

  • (i): expected_dist
  • (ii): multinomial
  • (iii): expected_dist
  • (iv): simulated_dist
  • (v): expected_dist (or swapped with above)
  • (vi): np.abs(simulated_dist - expected_dist).sum() / 2

When performing a simulation, we simulate under the null, i.e. based on the expected distribution. Thus, the argument to the simulate function in (i) should be the expected_dist array.

In this function, we simulate winners based on the expected distribution. So, we want to use np.random.multinomial in (ii), which takes in the number of experiments and the expected distribution, i.e. expected_dist in (iii), a list of the probabilities of each of the outcomes.

We are using the total variation distance as the test statistic. The total variation distance (TVD) of two categorical distributions is the sum of the absolute differences of their proportions, all divided by 2. Thus, the arguments of the calculate_test_stat function should be simulated_dist in (iv) and expected_dist in (v) (or the two swapped).

In this function, we need to return the TVD, which can be calculated as np.abs(simulated_dist - expected_dist).sum() / 2 in (vi).
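Putting the blanks together, one valid completion looks roughly like this (a sketch; the np.array calls are an addition so that the test-statistic function also works when it is handed plain Python lists, as in the observed_stat line):

import numpy as np

expected_dist = [0.07, 0.06, 0.01, 0.86]
observed_dist = [0.3, 0.2, 0.1, 0.4]

def simulate(expected_dist):
    # simulate 100 stage winners under the expected (null) distribution
    simulated_winners = np.random.multinomial(100, expected_dist)
    return simulated_winners / 100

def calculate_test_stat(simulated_dist, expected_dist):
    # total variation distance between the two distributions
    return np.abs(np.array(simulated_dist) - np.array(expected_dist)).sum() / 2

observed_stat = calculate_test_stat(observed_dist, expected_dist)   # 0.46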


Difficulty: ⭐️⭐️

The average score on this problem was 89%.


Problem 15.3

Fill in the blank in the following code so that p_value evaluates to the correct p-value for this hypothesis test:

    p_value = np.mean(simulated_stats ___ observed_statistic)

Answer: >=

Recall that the p-value is the probability of seeing a result equal to or more extreme than the observed value under the null hypothesis. Since the TVD is our test statistic and greater values indicate a more extreme result, we want to use >= in the blank to check whether each simulated statistic is equal to or more extreme than the observed statistic.


Difficulty: ⭐️⭐️

The average score on this problem was 80%.



Problem 16

ESPN (a large sports news network) states that the Las Vegas Aces have a 60% chance of winning their upcoming game. You’re curious as to how they came up with this estimate, and you decide to conduct a hypothesis test for the following hypotheses:

In both hypotheses, we are assuming that each game is independent of all other games.

In the 2021 season, the Las Vegas Aces won 22 of their games and lost 9 of their games.


Problem 16.1

Below, we have provided the code necessary to conduct the hypothesis test described above.

stats = np.array([])
for i in np.arange(10000):
    sim = np.random.multinomial(31, [0.6, 0.4])
    stat = fn(sim)
    stats = np.append(stats, stat)
win_p_value = np.count_nonzero(stats >= fn([22, 9])) / 10000

fn is a function that computes a test statistic, given a list or array arr of two elements (the first of which is the number of wins, and the second of which is the number of losses). You can assume that neither element of arr is equal to 0.

Below, we define 5 possible test statistics fn.

Option 1:

def fn(arr):
    return arr[0] / arr[1]

Option 2:

def fn(arr):
    return arr[0]

Option 3:

def fn(arr):
    return np.abs(arr[0] - arr[1])

Option 4:

def fn(arr):
    return arr[0] - arr[1]

Option 5:

def fn(arr):
    return arr[1] - arr[0]

Which of the above functions fn would be valid test statistics for this hypothesis test and p-value calculation? Select all that apply.

Answer: Options 1, 2, and 4

In the code provided to us, stats is an array containing 10,000 simulated test statistics computed by the function fn (note that we are appending stat to stats, and in the line before that we have stat = fn(sim)). In the very last line of the code provided, we have:

win_p_value = np.count_nonzero(stats >= fn([22, 9])) / 10000

If we look closely, we see that we are computing the p-value by computing the proportion of simulated test statistics that were greater than or equal to (>=) the observed statistic. Since a p-value is computed as the proportion of simulated test statistics that were as or more extreme than the observed statistic, here it must mean that “big” test statistics are more extreme.

Remember, the direction that is “extreme” is determined by our alternative hypothesis. Here, the alternative hypothesis is that the Las Vegas Aces win each game with a probability above 60%. As such, the test statistic(s) we choose must be large when the probability that the Aces win a game is high, and small when the probability that the Aces win a game is low. With this all in mind, we can take a look at the 5 options, remembering that arr[0] is the number of simulated wins and arr[1] is the number of simulated losses in a season of 31 games. This means that when the Aces win more than they lose, arr[0] > arr[1], and when they lose more than they win, arr[0] < arr[1].

  • Option 1: Here, our test statistic is the ratio of wins to losses, i.e. arr[0] / arr[1]. If the Aces win a lot, the numerator will be larger than the denominator, so this ratio will be large. If the Aces lose a lot, the numerator will be smaller than the denominator, and so this ratio will be small. This is what we want!
  • Option 2: Here, our test statistic is the number of wins, i.e. arr[0]. If the Aces win a lot, this number will be large, and if the Aces lose a lot, this number will be small. This is what we want!
  • Option 3: Here, our test statistic is the absolute value of the number of wins minus the number of losses, i.e. np.abs(arr[0] - arr[1]). If the Aces win a lot, then arr[0] - arr[1] will be large, and so will np.abs(arr[0] - arr[1]). This seems fine. However, if the Aces lose a lot, then arr[0] - arr[1] will be small (negative), but np.abs(arr[0] - arr[1]) will still be large and positive. This test statistic doesn’t allow us to differentiate when the Aces win a lot or lose a lot, so we can’t use it as a test statistic for our alternative hypothesis.
  • Option 4: From the explanation of Option 3, we know that when the Aces win a lot, arr[0] - arr[1] is large. Furthermore, when the Aces lose a lot, arr[0] - arr[1] is small (negative numbers are small in this context). This works!
  • Option 5: arr[1] - arr[0] is the opposite of arr[0] - arr[1] in Option 4. When the Aces win a lot, arr[1] - arr[0] is small (negative), and when the Aces lose a lot, arr[1] - arr[0] is large (positive). This is the opposite of what we want, so Option 5 does not work.

Difficulty: ⭐️⭐️

The average score on this problem was 77%.


Problem 16.2

The empirical distribution of one of the 5 test statistics presented in the previous subpart is shown below. To draw the histogram, we used the argument bins=np.arange(-10, 25).

Which test statistic does the above empirical distribution belong to?

Answer: Option 4

The distribution visualized in the histogram has the following unique values: -9, -7, -5, -3, …, 17, 19, 21, 23. Crucially, the test statistic whose distribution we’ve visualized can both be positive and negative. Right off the bat, we can eliminate Options 1, 2, and 3:

  • Option 1: Invalid. Option 1 is computed by dividing the number of wins (arr[0]) by the number of losses (arr[1]), and that quotient will always be a non-negative number.
  • Option 2: Invalid, since the number of wins (arr[0]) will always be a non-negative number.
  • Option 3: Invalid, since the absolute value of any real number (np.abs(arr[0] - arr[1]), in this case) will always be a non-negative number.

Now, we must decide between Option 4, whose test statistic is “wins minus losses” (arr[0] - arr[1]), and Option 5, whose test statistic is “losses minus wins” (arr[1] - arr[0]).

First, let’s recap how we’re simulating. In the code provided in the previous subpart, we have the line sim = np.random.multinomial(31, [0.6, 0.4]). Each time we run this line, sim will be set to an array with two elements, the first of which we interpret as the number of simulated wins and the second of which we interpret as the number of simulated losses in a 31 game season. The first number in sim will usually be larger than the second number in sim, since the chance of a win (0.6) is larger than the chance of a loss (0.4). As such, when we compute fn(sim) in the following line, the difference between the wins and losses should typically be positive.

Back to our distribution. Note that the distribution provided in this subpart is centered at a positive number, around 7. Since the difference between wins and losses will typically be positive, it appears that we’ve visualized the distribution of the difference between wins and losses (Option 4). If we instead visualized the difference between losses and wins, the distribution should be centered at a negative number, but that’s not the case.

As such, the correct answer is Option 4.
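As a quick check of that center (a sketch; 6.2 is just 31 * (0.6 - 0.4), the expected difference under the null):

import numpy as np

seasons = np.random.multinomial(31, [0.6, 0.4], size=10000)
diffs = seasons[:, 0] - seasons[:, 1]   # wins minus losses for each simulated season
diffs.mean()                            # close to 31 * (0.6 - 0.4) = 6.2, a positive center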


Difficulty: ⭐️⭐️

The average score on this problem was 86%.


Problem 16.3

Consider the function fn_plus defined below.

def fn_plus(arr):
    return fn(arr) + 31

True or False: If fn is a valid test statistic for the hypothesis test and p-value calculation presented at the start of the problem, then fn_plus is also a valid test statistic for the hypothesis test and p-value calculation presented at the start of the problem.

Answer: True

All fn_plus is doing is adding 31 to the output of fn. If we think in terms of pictures, the shape of the distribution of fn_plus looks the same as the distribution of fn, just moved to the right by 31 units. The observed statistic fn_plus([22, 9]) is also exactly 31 more than fn([22, 9]), so the whole picture shifts together. Since the distribution’s shape is no different, the proportion of simulated test statistics that are greater than or equal to the observed test statistic is no different either, and so the p-value we calculate with fn_plus is the same as the one we calculate with fn.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 73%.


Problem 16.4

Below, we present the same code that is given at the start of the problem.

stats = np.array([])
for i in np.arange(10000):
    sim = np.random.multinomial(31, [0.6, 0.4])
    stat = fn(sim)
    stats = np.append(stats, stat)

win_p_value = np.count_nonzero(stats >= fn([22, 9])) / 10000

Below are four possible replacements for the line sim = np.random.multinomial(31, [0.6, 0.4]).

Option 1:

def with_rep():
    won = plum.get('Won')
    return np.count_nonzero(np.random.choice(won, 31, replace=True))

sim = [with_rep(), 31 - with_rep()]

Option 2:

def with_rep():
    won = plum.get('Won')
    return np.count_nonzero(np.random.choice(won, 31, replace=True))

w = with_rep()
sim = [w, 31 - w]

Option 3:

def without_rep():
    won = plum.get('Won')
    return np.count_nonzero(np.random.choice(won, 31, replace=False))

sim = [without_rep(), 31 - without_rep()]

Option 4:

def perm():
    won = plum.get('Won')
    return np.count_nonzero(np.random.permutation(won))

w = perm()
sim = [w, 31 - w]

Which of the above four options could we replace the line sim = np.random.multinomial(plum.shape[0], [0.6, 0.4]) with and still perform a valid hypothesis test for the hypotheses stated at the start of the problem?

Answer: Option 2

The line sim = np.random.multinomial(plum.shape[0], [0.6, 0.4]) (note that plum.shape[0] is 31, since plum has one row per game) assigns sim to an array containing two numbers such that:

  • The numbers are randomly chosen each time the line is run
  • The numbers always add up to 31

We need to select an option that also creates such an array (or list, in this case). Note that won = plum.get('Won'), a line that is common to all four options, assigns won to a Series with 31 elements, each of which is either True or False (corresponding to the wins and losses that the Las Vegas Aces earned in their season).

Let’s take a look at the line np.count_nonzero(np.random.choice(won, 31, replace=True)), common to the first two options. Here, we are randomly selecting 31 elements from the Series won, with replacement, and counting the number of Trues (since with np.count_nonzero, False is counted as 0). Since we are making our selections with replacement, each selected element has a \frac{22}{31} chance of being True and a \frac{9}{31} chance of being False (since won has 22 Trues and 9 Falses). As such, np.count_nonzero(np.random.choice(won, 31, replace=True)) can be any integer between 0 and 31, inclusive.

Note that if we select without replacement (replace=False) as Option 3 would like us to, then all 31 selected elements would be the same as the 31 elements in won. As a result, np.random.choice(won, 31, replace=False) will always have 22 Trues, just like won, and np.count_nonzero(np.random.choice(won, 31, replace=False)) will always return 22. That’s not random, and so that’s not quite what we’re looking for.

With this all in mind, let’s look at the four options.

  • Option 1: Here, each time we call with_rep(), we get a random number between 0 and 31 (inclusive), corresponding to the (random) number of simulated wins. Then, we are assigning sim to be [with_rep(), 31 - with_rep()]. However, it’s not guaranteed that the two calls to with_rep return the same number of wins, so it’s not guaranteed that sum(sim) is 31. Option 1, then, is invalid.
  • Option 2: Correct, as we’ll explain below.
  • Option 3: As mentioned above, Option 3 uses replace=False, and so without_rep() is always 22 and sim is always [22, 9]. The outcome is not random.
  • Option 4: Here, perm() always returns the same number, 22. This is because all we are doing is shuffling the entries in the won Series, but we aren’t changing the number of wins (Trues) and losses (Falses). As a result, w is always 22 and sim is always [22, 9], making this non-random, just like in Option 3.

By the process of elimination, Option 2 must be the correct choice. It is similar to Option 1, but it only calls with_rep once and “saves” the result to the name w. As a result, w is random, and w and 31 - w are guaranteed to sum to 31.
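To see the difference concretely, here is a small sketch in which won is a stand-in array of 22 Trues and 9 Falses (the actual plum DataFrame isn’t shown in this problem):

import numpy as np

won = np.array([True] * 22 + [False] * 9)    # stand-in for plum.get('Won')

def with_rep():
    return np.count_nonzero(np.random.choice(won, 31, replace=True))

# Option 1: two independent calls, so the two entries need not add up to 31
sim_option_1 = [with_rep(), 31 - with_rep()]

# Option 2: a single call saved to w, so the entries always add up to 31
w = with_rep()
sim_option_2 = [w, 31 - w]

sum(sim_option_2)   # always 31; sum(sim_option_1) usually is not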

⚠️ Note: It turns out that none of these options run a valid hypothesis test, since the null hypothesis was that the Las Vegas Aces win 60% of their games but none of these simulation strategies use 60% anywhere (instead, they use the observation that the Aces actually won 22 games). However, this subpart was about the sampling strategies themselves, so this mistake from our end doesn’t invalidate the problem.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 70%.


Problem 16.5

Consider again the four options presented in the previous subpart.

In which of the four options is it guaranteed that sum(sim) evaluates to 31? Select all that apply.

Answers: Options 2, 3, and 4

  • Option 1: As explained in the solution to the previous subpart, if the two calls to with_rep evaluate to different numbers (entirely possible, since it is random), then sum(sim) will not be 31.
  • Option 2: Here, sim is defined in terms of some w. Specifically, w is some number between 0 and 31 and sim is [w, 31 - w], so sum(sim) is the same as w + 31 - w, which is always 31.
  • Option 3: In Option 3, sim is always [22, 9], and sum(sim) is always 31.
  • Option 4: Same as Option 3.

Difficulty: ⭐️⭐️⭐️

The average score on this problem was 72%.



Problem 17

ESPN (a large sports news network) states that the Las Vegas Aces have a 60% chance of winning their upcoming game. You’re curious as to how they came up with this estimate, and you decide to conduct a hypothesis test for the following hypotheses:

In both hypotheses, we are assuming that each game is independent of all other games.

In the 2021 season, the Las Vegas Aces won 22 of their games and lost 9 of their games.


Problem 17.1

Below, we have provided the code necessary to conduct the hypothesis test described above.

stats = np.array([])
for i in np.arange(10000):
    sim = np.random.multinomial(31, [0.6, 0.4])
    stat = fn(sim)
    stats = np.append(stats, stat)
win_p_value = np.count_nonzero(stats >= fn([22, 9])) / 10000

fn is a function that computes a test statistic, given a list or array arr of two elements (the first of which is the number of wins, and the second of which is the number of losses). You can assume that neither element of arr is equal to 0.

Below, we define 5 possible test statistics fn.

Option 1:

def fn(arr):
    return arr[0] / arr[1]

Option 2:

def fn(arr):
    return arr[0]

Option 3:

def fn(arr):
    return np.abs(arr[0] - arr[1])

Option 4:

def fn(arr):
    return arr[0] - arr[1]

Option 5:

def fn(arr):
    return arr[1] - arr[0]

Which of the above functions fn would be valid test statistics for this hypothesis test and p-value calculation? Select all that apply.

Answer: Options 1, 2, and 4

In the code provided to us, stats is an array containing 10,000 simulated test statistics computed by the function fn (note that we are appending stat to stats, and in the line before that we have stat = fn(sim)). In the very last line of the code provided, we have:

win_p_value = np.count_nonzero(stats >= fn([22, 9])) / 10000

If we look closely, we see that we are computing the p-value by computing the proportion of simulated test statistics that were greater than or equal to (>=) the observed statistic. Since a p-value is computed as the proportion of simulated test statistics that were as or more extreme than the observed statistic, here it must mean that “big” test statistics are more extreme.

Remember, the direction that is “extreme” is determined by our alternative hypothesis. Here, the alternative hypothesis is that the Las Vegas Aces win each game with a probability above 60%. As such, the test statistic(s) we choose must be large when the probability that the Aces win a game is high, and small when the probability that the Aces win a game is low. With this all in mind, we can take a look at the 5 options, remembering that arr[0] is the number of simulated wins and arr[1] is the number of simulated losses in a season of 31 games. This means that when the Aces win more than they lose, arr[0] > arr[1], and when they lose more than they win, arr[0] < arr[1].

  • Option 1: Here, our test statistic is the ratio of wins to losses, i.e. arr[0] / arr[1]. If the Aces win a lot, the numerator will be larger than the denominator, so this ratio will be large. If the Aces lose a lot, the numerator will be smaller than the denominator, and so this ratio will be small. This is what we want!
  • Option 2: Here, our test statistic is the number of wins, i.e. arr[0]. If the Aces win a lot, this number will be large, and if the Aces lose a lot, this number will be small. This is what we want!
  • Option 3: Here, our test statistic is the absolute value of the number of wins minus the number of losses, i.e. np.abs(arr[0] - arr[1]). If the Aces win a lot, then arr[0] - arr[1] will be large, and so will np.abs(arr[0] - arr[1]). This seems fine. However, if the Aces lose a lot, then arr[0] - arr[1] will be small (negative), but np.abs(arr[0] - arr[1]) will still be large and positive. This test statistic doesn’t allow us to differentiate when the Aces win a lot or lose a lot, so we can’t use it as a test statistic for our alternative hypothesis.
  • Option 4: From the explanation of Option 3, we know that when the Aces win a lot, arr[0] - arr[1] is large. Furthermore, when the Aces lose a lot, arr[0] - arr[1] is small (negative numbers are small in this context). This works!
  • Option 5: arr[1] - arr[0] is the opposite of arr[0] - arr[1] in Option 4. When the Aces win a lot, arr[1] - arr[0] is small (negative), and when the Aces lose a lot, arr[1] - arr[0] is large (positive). This is the opposite of what we want, so Option 5 does not work.

Difficulty: ⭐️⭐️

The average score on this problem was 77%.


Problem 17.2

The empirical distribution of one of the 5 test statistics presented in the previous subpart is shown below. To draw the histogram, we used the argument bins=np.arange(-10, 25).

Which test statistic does the above empirical distribution belong to?

Answer: Option 4

The distribution visualized in the histogram has the following unique values: -9, -7, -5, -3, …, 17, 19, 21, 23. Crucially, the test statistic whose distribution we’ve visualized can both be positive and negative. Right off the bat, we can eliminate Options 1, 2, and 3:

  • Option 1: Invalid. Option 1 is computed by dividing the number of wins (arr[0]) by the number of losses (arr[1]), and that quotient will always be a non-negative number.
  • Option 2: Invalid, since the number of wins (arr[0]) will always be a non-negative number.
  • Option 3: Invalid, since the absolute value of any real number (np.abs(arr[0] - arr[1]), in this case) will always be a non-negative number.

Now, we must decide between Option 4, whose test statistic is “wins minus losses” (arr[0] - arr[1]), and Option 5, whose test statistic is “losses minus wins” (arr[1] - arr[0]).

First, let’s recap how we’re simulating. In the code provided in the previous subpart, we have the line sim = np.random.multinomial(31, [0.6, 0.4]). Each time we run this line, sim will be set to an array with two elements, the first of which we interpret as the number of simulated wins and the second of which we interpret as the number of simulated losses in a 31 game season. The first number in sim will usually be larger than the second number in sim, since the chance of a win (0.6) is larger than the chance of a loss (0.4). As such, when we compute fn(sim) in the following line, the difference between the wins and losses should typically be positive.

Back to our distribution. Note that the distribution provided in this subpart is centered at a positive number, around 7. Since the difference between wins and losses will typically be positive, it appears that we’ve visualized the distribution of the difference between wins and losses (Option 4). If we instead visualized the difference between losses and wins, the distribution should be centered at a negative number, but that’s not the case.

As such, the correct answer is Option 4.


Difficulty: ⭐️⭐️

The average score on this problem was 86%.


Problem 17.3

Consider the function fn_plus defined below.

def fn_plus(arr):
    return fn(arr) + 31

True or False: If fn is a valid test statistic for the hypothesis test and p-value calculation presented at the start of the problem, then fn_plus is also a valid test statistic for the hypothesis test and p-value calculation presented at the start of the problem.

Answer: True

All fn_plus is doing is adding 31 to the output of fn. If we think in terms of pictures, the shape of the distribution of fn_plus looks the same as the distribution of fn, just moved to the right by 31 units. The observed statistic fn_plus([22, 9]) is also exactly 31 more than fn([22, 9]), so the whole picture shifts together. Since the distribution’s shape is no different, the proportion of simulated test statistics that are greater than or equal to the observed test statistic is no different either, and so the p-value we calculate with fn_plus is the same as the one we calculate with fn.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 73%.


Problem 17.4

Below, we present the same code that is given at the start of the problem.

stats = np.array([])
for i in np.arange(10000):
    sim = np.random.multinomial(31, [0.6, 0.4])
    stat = fn(sim)
    stats = np.append(stats, stat)

win_p_value = np.count_nonzero(stats >= fn([22, 9])) / 10000

Below are four possible replacements for the line sim = np.random.multinomial(31, [0.6, 0.4]).

Option 1:

def with_rep():
    won = plum.get('Won')
    return np.count_nonzero(np.random.choice(won, 31, replace=True))

sim = [with_rep(), 31 - with_rep()]

Option 2:

def with_rep():
    won = plum.get('Won')
    return np.count_nonzero(np.random.choice(won, 31, replace=True))

w = with_rep()
sim = [w, 31 - w]

Option 3:

def without_rep():
    won = plum.get('Won')
    return np.count_nonzero(np.random.choice(won, 31, replace=False))

sim = [without_rep(), 31 - without_rep()]

Option 4:

def perm():
    won = plum.get('Won')
    return np.count_nonzero(np.random.permutation(won))

w = perm()
sim = [w, 31 - w]

Which of the above four options could we replace the line sim = np.random.multinomial(plum.shape[0], [0.6, 0.4]) with and still perform a valid hypothesis test for the hypotheses stated at the start of the problem?

Answer: Option 2

The line sim = np.random.multinomial(plum.shape[0], [0.6, 0.4]) (note that plum.shape[0] is 31, since plum has one row per game) assigns sim to an array containing two numbers such that:

  • The numbers are randomly chosen each time the line is run
  • The numbers always add up to 31

We need to select an option that also creates such an array (or list, in this case). Note that won = plum.get('Won'), a line that is common to all four options, assigns won to a Series with 31 elements, each of which is either True or False (corresponding to the wins and losses that the Las Vegas Aces earned in their season).

Let’s take a look at the line np.count_nonzero(np.random.choice(won, 31, replace=True)), common to the first two options. Here, we are randomly selecting 31 elements from the Series won, with replacement, and counting the number of Trues (since with np.count_nonzero, False is counted as 0). Since we are making our selections with replacement, each selected element has a \frac{22}{31} chance of being True and a \frac{9}{31} chance of being False (since won has 22 Trues and 9 Falses). As such, np.count_nonzero(np.random.choice(won, 31, replace=True)) can be any integer between 0 and 31, inclusive.

Note that if we select without replacement (replace=False) as Option 3 would like us to, then all 31 selected elements would be the same as the 31 elements in won. As a result, np.random.choice(won, 31, replace=False) will always have 22 Trues, just like won, and np.count_nonzero(np.random.choice(won, 31, replace=False)) will always return 22. That’s not random, and so that’s not quite what we’re looking for.

With this all in mind, let’s look at the four options.

  • Option 1: Here, each time we call with_rep(), we get a random number between 0 and 31 (inclusive), corresponding to the (random) number of simulated wins. Then, we are assigning sim to be [with_rep(), 31 - with_rep()]. However, it’s not guaranteed that the two calls to with_rep return the same number of wins, so it’s not guaranteed that sum(sim) is 31. Option 1, then, is invalid.
  • Option 2: Correct, as we’ll explain below.
  • Option 3: As mentioned above, Option 3 uses replace=False, and so without_rep() is always 22 and sim is always [22, 9]. The outcome is not random.
  • Option 4: Here, perm() always returns the same number, 22. This is because all we are doing is shuffling the entries in the won Series, but we aren’t changing the number of wins (Trues) and losses (Falses). As a result, w is always 22 and sim is always [22, 9], making this non-random, just like in Option 3.

By the process of elimination, Option 2 must be the correct choice. It is similar to Option 1, but it only calls with_rep once and “saves” the result to the name w. As a result, w is random, and w and 31 - w are guaranteed to sum to 31.

⚠️ Note: It turns out that none of these options run a valid hypothesis test, since the null hypothesis was that the Las Vegas Aces win 60% of their games but none of these simulation strategies use 60% anywhere (instead, they use the observation that the Aces actually won 22 games). However, this subpart was about the sampling strategies themselves, so this mistake from our end doesn’t invalidate the problem.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 70%.


Problem 17.5

Consider again the four options presented in the previous subpart.

In which of the four options is it guaranteed that sum(sim) evaluates to 31? Select all that apply.

Answers: Options 2, 3, and 4

  • Option 1: As explained in the solution to the previous subpart, if the two calls to with_rep evaluate to different numbers (entirely possible, since it is random), then sum(sim) will not be 31.
  • Option 2: Here, sim is defined in terms of some w. Specifically, w is some number between 0 and 31 and sim is [w, 31 - w], so sum(sim) is the same as w + 31 - w, which is always 31.
  • Option 3: In Option 3, sim is always [22, 9], and sum(sim) is always 31.
  • Option 4: Same as Option 3.

Difficulty: ⭐️⭐️⭐️

The average score on this problem was 72%.



Problem 18

We collect data on the play times of 100 games of Chutes and Ladders (sometimes known as Snakes and Ladders).


Problem 18.1

We use our collected data to construct a 95% CLT-based confidence interval for the average play time of a game of Chutes and Ladders. This 95% confidence interval is [26.47, 28.47]. For the 100 games for which we collected data, what is the mean and standard deviation of the play times?

Answer: mean = 27.47 and SD = 5

One of the key properties of the normal distribution is that about 95% of values lie within 2 standard deviations of the mean. The Central Limit Theorem states that the distribution of the sample mean is roughly normal, which means that to create this CLT-based 95% confidence interval, we used the 2 standard deviations rule.

What we’re given, then, is the following:

\begin{align*} \text{Sample Mean} + 2 \cdot \text{SD of Distribution of Possible Sample Means} &= 28.47 \\ \text{Sample Mean} - 2 \cdot \text{SD of Distribution of Possible Sample Means} &= 26.47 \end{align*}

The sample mean is halfway between 26.47 and 28.47, which is 27.47. Substituting this into the first equation gives us

\begin{align*}27.47 + 2 \cdot \text{SD of Distribution of Possible Sample Means} &= 28.47\\2 \cdot \text{SD of Distribution of Possible Sample Means} &= 1 \\ \text{SD of Distribution of Possible Sample Means} &= 0.5\end{align*}

It can be tempting to conclude that the sample standard deviation is 0.5, but it’s not – the SD of the sample mean’s distribution is 0.5. Remember, the SD of the sample mean’s distribution is given by the square root law:

\text{SD of Distribution of Possible Sample Means} = \frac{\text{Population SD}}{\sqrt{\text{Sample Size}}} \approx \frac{\text{Sample SD}}{\sqrt{\text{Sample Size}}}

We don’t know the population SD, so we’ve used the sample SD as an estimate. As such, we have that

\text{SD of Distribution of Possible Sample Means} = 0.5 = \frac{\text{Sample SD}}{\sqrt{\text{Sample Size}}} = \frac{\text{Sample SD}}{\sqrt{100}}

So, \text{Sample SD} = 0.5 \cdot \sqrt{100} = 0.5 \cdot 10 = 5.
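To double-check these numbers (a minimal sketch, assuming the sample mean and SD just derived):

import numpy as np

sample_mean = 27.47
sample_sd = 5
n = 100

sd_of_sample_mean = sample_sd / np.sqrt(n)            # 0.5, by the square root law
[sample_mean - 2 * sd_of_sample_mean,
 sample_mean + 2 * sd_of_sample_mean]                  # [26.47, 28.47], the given interval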


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 64%.


Problem 18.2

Does the CLT say that the distribution of play times of the 100 games is roughly normal?

Answer: No

The Central Limit Theorem states that the distribution of the sample mean or the sample sum is roughly normal. The distribution of play times is a sample of size 100 drawn from the population of play times; the Central Limit Theorem doesn’t say anything about a population or any one sample.


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 45%.



Source: wi23-final — Q16

Problem 19

We collect data on the play times of 100 games of Chutes and Ladders (sometimes known as Snakes and Ladders) and want to use this data to perform a hypothesis test.


Problem 19.1

Which of the following pairs of hypotheses can we test using this data?

Option 1: Null Hypothesis: In a random sample of Chutes and Ladders games, the average play time is 30 minutes. Alternative Hypothesis: In a random sample of Chutes and Ladders games, the average play time is not 30 minutes.

Option 2: Null Hypothesis: In a random sample of Chutes and Ladders games, the average play time is not 30 minutes. Alternative Hypothesis: In a random sample of Chutes and Ladders games, the average play time is 30 minutes

Option 3: Null Hypothesis: A game of Chutes and Ladders takes, on average, 30 minutes to play. Alternative Hypothesis: A game of Chutes and Ladders does not take, on average, 30 minutes to play.

Option 4: Null Hypothesis: A game of Chutes and Ladders does not take, on average, 30 minutes to play. Alternative Hypothesis: A game of Chutes and Ladders takes, on average, 30 minutes to play.

Answer: Option 3

Option 3 is the correct answer because its null hypothesis is a concrete statement about the real world (a game of Chutes and Ladders takes, on average, 30 minutes to play), which we can simulate under, and its alternative hypothesis correctly says the opposite – that the average play time is not 30 minutes.

Option 1: Hypotheses should be statements about the world (or the population), not about a random sample, so a pair of hypotheses that starts with “In a random sample of Chutes and Ladders games…” is not something we can test here.

Option 2: Like Option 1, both hypotheses here are statements about a random sample rather than about Chutes and Ladders games in general, so this pair can’t be tested either.

Option 4: This answer is wrong because it reverses the roles of the hypotheses. If we let \mu be the average time to play Chutes and Ladders and \mu_0 be 30 minutes, the null hypothesis should be the specific, simulatable statement \mu = \mu_0, and the alternative hypothesis should be something different, in this case \mu \neq \mu_0. A null hypothesis of “the average is not 30 minutes” gives us nothing concrete to simulate under.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 65%.


Problem 19.2

We use our collected data to construct a 95% CLT-based confidence interval for the average play time of a game of Chutes and Ladders. This 95% confidence interval is [26.47, 28.47]. For the 100 games for which we collected data, what is the mean and standard deviation of the play times?

Answer: mean = 27.47 and SD = 5

One of the key properties of the normal distribution is that about 95% of values lie within 2 standard deviations of the mean. The Central Limit Theorem states that the distribution of the sample mean is roughly normal, which means that to create this CLT-based 95% confidence interval, we used the 2 standard deviations rule.

What we’re given, then, is the following:

\begin{align*} \text{Sample Mean} + 2 \cdot \text{SD of Distribution of Possible Sample Means} &= 28.47 \\ \text{Sample Mean} - 2 \cdot \text{SD of Distribution of Possible Sample Means} &= 26.47 \end{align*}

The sample mean is halfway between 26.47 and 28.47, which is 27.47. Substituting this into the first equation gives us

\begin{align*}27.47 + 2 \cdot \text{SD of Distribution of Possible Sample Means} &= 28.47\\2 \cdot \text{SD of Distribution of Possible Sample Means} &= 1 \\ \text{SD of Distribution of Possible Sample Means} &= 0.5\end{align*}

It can be tempting to conclude that the sample standard deviation is 0.5, but it’s not – the SD of the sample mean’s distribution is 0.5. Remember, the SD of the sample mean’s distribution is given by the square root law:

\text{SD of Distribution of Possible Sample Means} = \frac{\text{Population SD}}{\sqrt{\text{Sample Size}}} \approx \frac{\text{Sample SD}}{\sqrt{\text{Sample Size}}}

We don’t know the population SD, so we’ve used the sample SD as an estimate. As such, we have that

\text{SD of Distribution of Possible Sample Means} = 0.5 = \frac{\text{Sample SD}}{\sqrt{\text{Sample Size}}} = \frac{\text{Sample SD}}{\sqrt{100}}

So, \text{Sample SD} = 0.5 \cdot \sqrt{100} = 0.5 \cdot 10 = 5.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 64%.


Problem 19.3

Does the CLT say that the distribution of play times of the 100 games is roughly normal?

Answer: No

The Central Limit Theorem states that the distribution of the sample mean or the sample sum is roughly normal. The distribution of play times is a sample of size 100 drawn from the population of play times; the Central Limit Theorem doesn’t say anything about a population or any one sample.


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 45%.


Problem 19.4

Of the two hypotheses you selected in part (a), which one is better supported by the data?

Answer: Alternative Hypothesis

To test the null hypothesis, we check whether 30 is in the confidence interval we constructed. 30 is not between 26.47 and 28.47, so we reject the null hypothesis that the average play time is 30 minutes.


Difficulty: ⭐️⭐️

The average score on this problem was 87%.



Source: wi24-final — Q11

Problem 20

Aladár Gerevich is a Hungarian fencer who is one of only two men to win Olympic medals 28 years apart. He earned 10 Olympic medals in total throughout his career: 7 gold, 1 silver, and 2 bronze. The table below shows the distribution of medal types for Aladár Gerevich, as well as a few other athletes who also earned 10 Olympic medals.


Problem 20.1

Which type of data visualization is most appropriate to compare two athletes’ medal distributions?

Answer: overlaid bar chart

Here, we are comparing the medal distributions of two athletes. Medal type (gold, silver, or bronze) is a categorical variable, while the proportion of medals of each type is numerical. A bar chart is the kind of plot that displays a categorical variable against a numerical one, and since there are two athletes, the most appropriate plot is an overlaid bar chart: the side-by-side bars make it easy to compare the difference in their distributions.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 73%.


Problem 20.2

Among the other athletes in the table above, whose medal distribution has the largest total variation distance (TVD) to Aladár Gerevich’s distribution?

Answer: Franziska van Almsick

The total variation distance (TVD) of two categorical distributions is the sum of the absolute differences of their proportions, all divided by 2. Applying the TVD formula to each pair:

  • Katie Ledecky vs. Aladár Gerevich: \frac{1}{2} \cdot (|0.7 - 0.7| + |0.1 - 0.3| + |0.2 - 0|) = \frac{0.4}{2} = 0.2.
  • Alexander Dityatin vs. Aladár Gerevich: \frac{1}{2} \cdot (|0.7 - 0.3| + |0.1 - 0.6| + |0.2 - 0.1|) = \frac{1.0}{2} = 0.5.
  • Franziska van Almsick vs. Aladár Gerevich: \frac{1}{2} \cdot (|0.7 - 0| + |0.1 - 0.4| + |0.2 - 0.6|) = \frac{1.4}{2} = 0.7.

So, Franziska van Almsick has the largest TVD to Gerevich’s distribution.


Difficulty: ⭐️

The average score on this problem was 92%.


Problem 20.3

Suppose Pallavi earns 10 Olympic medals in such a way that the TVD between Pallavi’s medal distribution and Aladár Gerevich’s medal distribution is as large as possible. What is Pallavi’s medal distribution?

Answer: x=0, y=1, z=0

Intuitively, we can maximize the TVD between the distributions by putting all of Pallavi’s medals in the category Gerevich won the fewest of, so x = 0, y = 1, z = 0. Moving any of these medals to another category would decrease the TVD, since every one of Pallavi’s medal proportions would get closer to Gerevich’s (silver decreases toward Gerevich’s proportion, while gold and bronze increase toward theirs).


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 72%.



Problem 20.4

More generally, suppose medal_dist is an array of length three representing an athlete’s medal distribution. Which of the following expressions gives the maximum possible TVD between medal_dist and any other distribution?

Answer: 1 - medal_dist.min()

Similar to the previous part, we know that the TVD is maximized by placing all of competitor A’s medals into the category in which competitor B has the lowest proportion of medals. If we place all of competitor A’s medals into this bin, the difference between the two distributions in this category will be 1 - medal_dist.min(). In the other bins, competitor A has no medals (making all their values 0), and competitor B has the remainder of their medals, which also totals 1 - medal_dist.min(). So, in total, the TVD is given by \frac{1}{2} \cdot 2 \cdot (1 - medal_dist.min()) = 1 - medal_dist.min().
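A short sketch checking this formula against a direct TVD computation, using Aladár Gerevich’s distribution (gold, silver, bronze) from the problem:

import numpy as np

def tvd(dist_1, dist_2):
    return np.abs(np.array(dist_1) - np.array(dist_2)).sum() / 2

medal_dist = np.array([0.7, 0.1, 0.2])   # Gerevich: gold, silver, bronze

# put all of the other athlete's medals in Gerevich's smallest category (silver)
farthest = np.array([0.0, 1.0, 0.0])

tvd(medal_dist, farthest)    # 0.9
1 - medal_dist.min()         # also 0.9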


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 56%.



Source: wi25-final — Q7

Problem 21

Professor Minerva McGonagall, head of Gryffindor, may also be awarding house points unfairly. For this question, we’ll assume that all four of the houses contain the same number of students, and we’ll investigate whether McGonagall awards points equally to all four houses.

Below is the distribution of points that Professor McGonagall awarded during the last academic year (from the bar chart: 40% of the points went to Gryffindor, 20% to Hufflepuff, 30% to Ravenclaw, and 10% to Slytherin).

You want to test the following hypotheses:

- Null Hypothesis: Professor McGonagall awards points equally to all four houses; any differences in her observed distribution are due to random chance.
- Alternative Hypothesis: Professor McGonagall's distribution of points is not uniform across all of the houses.


Problem 21.1

Which of the following test statistics is appropriate for this hypothesis test? Select all that apply.

Answer: Option 2 and 5.

This problem asks us to identify test statistics that can distinguish data generated under the null from data that supports the alternative. Since the alternative hypothesis simply states that the distribution "is not uniform across all of the houses", we don't need to look for differences between specific houses; we need a statistic that captures the overall distribution of points across all four houses.

Option 1: This option only gives us information about Gryffindor and tells us nothing about inequalities among the other houses. For example, if Gryffindor received 25% of the points and Slytherin received the other 75%, we would not be able to tell this apart from the case where all houses received 25% of the points.

Option 2: Under the null, when points are awarded uniformly, we would expect all the houses to receive about the same amount, so the difference in points between the house with the most and the house with the fewest would be near 0. However, if one house is gaining more points than the rest, this difference will be greater than 0. Therefore, this statistic does measure departures from the null in the direction of the alternative.

Option 3: This test statistic measures the sum of the squared differences in proportions between McGonagall’s distribution and [0.5, 0.5, 0.5, 0.5]. However, [0.5, 0.5, 0.5, 0.5] does not represent a valid probability distribution because the total sum exceeds 1. As a result, this test statistic is not meaningful in assessing whether the point distribution is uniform across houses.

Option 4: When we take the sum of the (signed) differences, information about how far the data is from the null can cancel out. For example, a distribution of [0.25, 0.25, 0.25, 0.25] is exactly what we'd expect under the null, while a distribution of [0.0, 0.5, 0.0, 0.5] would support the alternative. However, this test statistic gives a result of 0 for both, so it cannot differentiate them (see the short check after Option 5's explanation).

Option 5: Under the null hypothesis, [0.25, 0.25, 0.25, 0.25] is the "expected" distribution. Since the alternative hypothesis states that McGonagall's point assignments are not uniform (i.e., the distribution of points is non-uniform), a house provides evidence for the alternative whether its proportion of points is noticeably less than or greater than 0.25. Squaring the differences makes the test statistic larger in either case, so this is a valid test statistic.
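To see the contrast between Option 4 and Option 5 concretely, here is a short check. The skewed distribution is the example from Option 4's explanation; everything else is just NumPy arithmetic.

    import numpy as np

    null = np.array([0.25, 0.25, 0.25, 0.25])
    skewed = np.array([0.0, 0.5, 0.0, 0.5])   # clearly non-uniform

    # Option 4's statistic: the signed differences cancel out, because
    # both distributions sum to 1, so the result is always 0.
    print(np.sum(skewed - null))           # 0.0
    # Option 5's statistic: squared differences detect the skew.
    print(np.sum((skewed - null) ** 2))    # 0.25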


Difficulty: ⭐️⭐️

The average score on this problem was 78%.


For the rest of this problem, we will use the following test statistic:

The sum of the absolute differences in proportions between McGonagall’s distribution and [0.25, 0.25, 0.25, 0.25].


Problem 21.2

Choose the correct way to implement the function calculate_test_stat, which takes in two distributions as arrays and returns the value of this test statistic.

    def calculate_test_stat(dist_1, dist_2):
        return _____

Answer: sum(np.abs(dist_1 - dist_2))

A valid test statistic in this problem measures how far dist_1 is from dist_2. Since we're looking at how different the distributions are, we take the absolute value of each difference (a measure of distance) and then add them all up. np.abs() is used because dist_1 - dist_2 is an array, and np.abs computes the absolute value of every element at once.
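As a quick sanity check, applying the completed function to the observed proportions from the bar chart (0.4, 0.2, 0.3, 0.1; see the next part) and the uniform null gives:

    import numpy as np

    def calculate_test_stat(dist_1, dist_2):
        return sum(np.abs(dist_1 - dist_2))

    mc_gon = np.array([0.4, 0.2, 0.3, 0.1])
    null = np.array([0.25, 0.25, 0.25, 0.25])
    print(round(calculate_test_stat(mc_gon, null), 3))   # 0.4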


Difficulty: ⭐️⭐️

The average score on this problem was 84%.


Problem 21.3

Fill in the blanks in the code below so that simulated_ts is an array containing 10,000 simulated values of the test statistic under the null. Note that your answer to blank (c) is used in more than one place in the code.

    mc_gon = np.arange(__(a)__) # Careful: np.arange, not np.array!
    null = np.array([0.25, 0.25, 0.25, 0.25])
    observed_ts = calculate_test_stat(__(b)__)

    simulated_ts = np.array([])

    for i in np.arange(10000):
        sim = np.random.multinomial(__(c)__, __(d)__) / __(c)__
        one_simulated_ts = calculate_test_stat(__(e)__)
        simulated_ts = np.append(simulated_ts, one_simulated_ts)

What goes in blank (a)?

Answer: 0.1, 0.5, 0.1 (or equivalent)

Based on the distribution shown in the bar chart above, we want our resulting array to contain the proportions [0.4, 0.2, 0.3, 0.1] (40/100 for Gryffindor, 20/100 for Hufflepuff, 30/100 for Ravenclaw, and 10/100 for Slytherin). Note that the order of these proportions does not matter because: 1) we are calculating the absolute difference between each value and the null proportion (0.25), and 2) we will sum all the differences together.

Since these proportions are evenly spaced (0.1 apart), we can use np.arange() to construct mc_gon; np.arange(0.1, 0.5, 0.1) produces the array [0.1, 0.2, 0.3, 0.4], which contains the same four proportions. There are multiple correct approaches to this problem, as long as the resulting array contains all four proportions. Some alternative correct answers include:

- np.arange(0.4, 0.0, -0.1)
- np.arange(0.1, 0.41, 0.1) (the middle argument can be any value greater than 0.4 and less than or equal to 0.5)
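A quick check of these calls (printed values may show slight floating-point noise):

    import numpy as np

    print(np.arange(0.1, 0.5, 0.1))    # [0.1 0.2 0.3 0.4]
    print(np.arange(0.4, 0.0, -0.1))   # [0.4 0.3 0.2 0.1]
    print(np.arange(0.1, 0.41, 0.1))   # [0.1 0.2 0.3 0.4]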


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 57%.


Problem 21.4

What goes in blank (b)?

Answer: null, mc_gon

Note that the order of mc_gon and null does not matter, as calculate_test_stat calculates the absolute difference between the two.


Difficulty: ⭐️⭐️

The average score on this problem was 78%.


Problem 21.5

What goes in blank (c)?

Answer: 100

Blank (c) is the number of points awarded in each simulated sample. To simulate faithfully under the null, each sample should contain the same number of points that McGonagall actually awarded, which is 100 based on the bar chart. Blank (c) is also used to divide the simulated counts, converting them into proportions so they can be compared to the null distribution.
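For intuition, here is what one simulated sample looks like. This is a sketch; the counts shown in the comment are just one possible outcome, since the draw is random.

    import numpy as np

    null = np.array([0.25, 0.25, 0.25, 0.25])

    # Draw how 100 points would be split among the four houses under the null...
    counts = np.random.multinomial(100, null)   # e.g., array([23, 28, 25, 24])
    # ...then divide by the same 100 to turn counts into proportions.
    print(counts / 100)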


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 48%.


Problem 21.6

What goes in blank (d)?

Answer: null

Blank (d) is null because each simulated sample is generated under the null hypothesis. This means the probabilities used in np.random.multinomial should match the expected proportions from the null distribution.


Difficulty: ⭐️⭐️

The average score on this problem was 77%.


Problem 21.7

What goes in blank (e)?

Answer: sim, null

Note that the order of null and sim does not matter, as calculate_test_stat calculates the absolute difference between the two.
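Putting the answers to blanks (a) through (e) together, the completed simulation reads as follows. This simply assembles the code from the problem with the blanks filled in.

    import numpy as np

    def calculate_test_stat(dist_1, dist_2):
        return sum(np.abs(dist_1 - dist_2))

    mc_gon = np.arange(0.1, 0.5, 0.1)                  # (a): [0.1, 0.2, 0.3, 0.4]
    null = np.array([0.25, 0.25, 0.25, 0.25])
    observed_ts = calculate_test_stat(null, mc_gon)    # (b)

    simulated_ts = np.array([])

    for i in np.arange(10000):
        # (c) = 100 and (d) = null: simulate 100 points awarded under the null,
        # then convert the counts into proportions.
        sim = np.random.multinomial(100, null) / 100
        one_simulated_ts = calculate_test_stat(sim, null)   # (e)
        simulated_ts = np.append(simulated_ts, one_simulated_ts)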


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 67%.


Problem 21.8

Fill in the blank so that reject_null evaluates to True if we reject the null hypothesis at the 0.05 significance level, and False otherwise.

    reject_null = __(f)__

Answer: (simulated_ts >= observed_ts).mean() <= 0.05

reject_null should evaluate to a Boolean, so we must test whether our p-value is less than or equal to 0.05. Taking the mean of (simulated_ts >= observed_ts) gives the proportion of simulated test statistics that are equal to the value observed in the data or even further in the direction of the alternative, which is exactly the p-value.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 64%.


Problem 21.9

Your friend performs the same hypothesis test as you, but uses the total variation distance (TVD) as their test statistic instead of the one described in the problem. Which of the following statements is true?

Answer: Option 4.

The TVD is calculated by taking the sum of the absolute differences of two proportions, all divided by 2. Therefore, the only difference between the TVD and our test statistic is that the TVD is divided by 2, which makes it smaller. It is smaller purely because of this constant scaling, not because of any difference in the magnitude or direction of what the two statistics measure.
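A small numeric check of this relationship, reusing the observed proportions from earlier (a sketch, not part of the original solution):

    import numpy as np

    mc_gon = np.array([0.4, 0.2, 0.3, 0.1])
    null = np.array([0.25, 0.25, 0.25, 0.25])

    our_stat = np.sum(np.abs(mc_gon - null))    # sum of absolute differences
    tvd = our_stat / 2                          # the TVD is the same sum, halved
    print(round(our_stat, 3), round(tvd, 3))    # 0.4 0.2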


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 63%.