# Discussion 6: Hypothesis Testing

The problems in this worksheet are taken from past exams. Work on them on paper, since the exams you take in this course will also be on paper.

We encourage you to complete this worksheet in a live discussion section. Solutions will be made available after all discussion sections have concluded. You don’t need to submit your answers anywhere.

Note: We do not plan to cover all problems here in the live discussion section; the problems we don’t cover can be used for extra practice.

## Problem 1

For Problem 1 we will be using this DataFrame:

The American Kennel Club (AKC) organizes information about dog breeds. We’ve loaded their dataset into a DataFrame called df. The index of df contains the dog breed names as str values.

The columns are:

• 'kind' (str): the kind of dog (herding, hound, toy, etc.). There are six total kinds.
• 'size' (str): small, medium, or large.
• 'longevity' (float): typical lifetime (years).
• 'price' (float): average purchase price (dollars).
• 'kids' (int): suitability for children. A value of 1 means high suitability, 2 means medium, and 3 means low.
• 'weight' (float): typical weight (kg).
• 'height' (float): typical height (cm).

The rows of df are arranged in no particular order. The first five rows of df are shown below (though df has many more rows than pictured here).

Assume we have already run import babypandas as bpd and import numpy as np.

Every year, the American Kennel Club holds a Photo Contest for dogs. Eric wants to know whether toy dogs win disproportionately more often than other kinds of dogs. He has collected a sample of 500 dogs that have won the Photo Contest. In his sample, 200 dogs were toy dogs.

Eric also knows the distribution of dog kinds in the population:

### Problem 1.1

Select all correct statements of the null hypothesis.

• The distribution of dogs in the sample is the same as the distribution in the population. Any difference is due to chance.

• Every dog in the sample was drawn uniformly at random without replacement from the population.

• The number of toy dogs that win is the same as the number of toy dogs in the population.

• The proportion of toy dogs that win is the same as the proportion of toy dogs in the population.

• The proportion of toy dogs that win is 0.3.

• The proportion of toy dogs that win is 0.5.

A null hypothesis states that there is no real difference between specified populations, and that any observed difference is due to sampling or experimental error. Here, a suitable null hypothesis is that the proportion of toy dogs among winners is the same as the proportion of toy dogs in the population.

• Option 1: We’re not comparing the full distribution of dog kinds in the sample to the distribution in the population; we’re asking whether toy dogs win more often than other dogs. The only quantities we’re considering are the proportion of toy dogs versus non-toy dogs in the population and among the winners, so the distribution across all six kinds is not what is being tested. Furthermore, this option makes no reference to the win rate of toy dogs.

• Option 2: This isn’t a null hypothesis; it is a description of a sampling procedure. This option also makes no reference to the win rate of toy dogs.

• Option 3: This statement doesn’t make sense, because it is illogical to compare the raw number of toy dogs that win to the number of toy dogs in the population: the number of toy dogs in the population is always at least the number of toy dogs that win. Rejecting this null hypothesis would only reject an extreme case within the subset of what we’re trying to prove.

• Option 4: This statement is in line with the null hypothesis.

• Option 5: This statement is another potential null hypothesis since the proportion of toy dogs in the population is 0.3.

• Option 6: This statement, although similar to Option 5, would not be a null hypothesis because 0.5 has no connection to any of the relevant proportions. If the proportion of toy dogs that win were over 0.5, we could infer that toy dogs win the majority of the time. However, the question is not whether toy dogs win most of the time, but whether toy dogs win disproportionately often relative to their share of the population.

##### Difficulty: ⭐️⭐️

The average score on this problem was 83%.

### Problem 1.2

Select the correct statement of the alternative hypothesis.

• The model in the null hypothesis underestimates how often toy dogs win.

• The model in the null hypothesis overestimates how often toy dogs win.

• The distribution of dog kinds in the sample is not the same as the population.

• The data were not drawn at random from the population.

The alternative hypothesis is the hypothesis we’re trying to support, which in this case is that toy dogs happen to win more than other dogs.

• Option 1: This is in line with our alternative hypothesis, since proving that the null hypothesis underestimates how often toy dogs win means that toy dogs win more than other dogs.

• Option 2: This is the opposite of what we’re trying to prove.

• Option 3: We don’t really care too much about the distribution of dog kinds, since that doesn’t help us determine toy dog win rates compared to other dogs.

• Option 4: This isn’t a hypothesis, rather, it’s more of a description of a procedure.

##### Difficulty: ⭐️⭐️⭐️

The average score on this problem was 67%.

### Problem 1.3

Select all the test statistics that Eric can use to conduct his hypothesis test.

• The proportion of toy dogs in his sample.

• The number of toy dogs in his sample.

• The absolute difference of the sample proportion of toy dogs and 0.3.

• The absolute difference of the sample proportion of toy dogs and 0.5.

• The TVD between his sample and the population.

• Option 1: This option is correct. According to our null hypothesis, we’re comparing the proportion of toy dogs among winners to the proportion of toy dogs in the population. Thus the proportion of toy dogs in Eric’s sample is a perfectly valid test statistic.

• Option 2: This option is incorrect. The raw number of toy dogs in his sample doesn’t really tell us how much toy dogs are winning compared to the rest of the population. Looking back at our null hypothesis, we’re trying to compare two proportions.

• Option 3: This option is incorrect. The absolute difference of the sample proportion of toy dogs and 0.3 doesn’t help us because the absolute difference won’t tell us whether or not the sample proportion of toy dogs is lower than 0.3 or higher than 0.3.

• Option 4: This option is incorrect for the same reason as above; in addition, 0.5 isn’t a relevant number.

• Option 5: This option is incorrect. Again, total variation distance won’t help us tell whether or not the toy dogs have a disproportionately higher or lower win rate.
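To make the point about direction concrete, here is a minimal sketch comparing three of these statistics on two made-up samples of winners. The population distribution and the position of toy dogs (index 2) are taken from Snippet 3 in Problem 1.5; both sample distributions and all function names are hypothetical, chosen only for illustration.

```python
import numpy as np

# Population distribution of the six dog kinds, as used in Snippet 3 of
# Problem 1.5; index 2 is assumed to be the toy kind.
population = np.array([0.1, 0.2, 0.3, 0.2, 0.15, 0.05])

# Two hypothetical sample distributions (made up for illustration):
# toy dogs win more than expected in one and less than expected in the other.
toy_high = np.array([0.1, 0.1, 0.4, 0.2, 0.15, 0.05])
toy_low = np.array([0.1, 0.3, 0.2, 0.2, 0.15, 0.05])

def toy_proportion(sample):
    # Signed statistic: larger values point toward "toy dogs win more".
    return sample[2]

def abs_diff_from_null(sample):
    # Unsigned statistic: distance of the toy proportion from 0.3.
    return abs(sample[2] - 0.3)

def tvd(sample):
    # Total variation distance between the sample and the population.
    return np.abs(sample - population).sum() / 2

for name, sample in [("toy_high", toy_high), ("toy_low", toy_low)]:
    print(name,
          round(toy_proportion(sample), 3),
          round(abs_diff_from_null(sample), 3),
          round(tvd(sample), 3))
```

Both samples give the same absolute difference and the same TVD, so those statistics cannot separate "toy dogs win more" from "toy dogs win less"; only the signed proportion does.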

##### Difficulty: ⭐️⭐️⭐️

The average score on this problem was 70%.

### Problem 1.4

Eric decides on this test statistic: the proportion of toy dogs minus the proportion of non-toy dogs. What is the observed value of the test statistic?

• -0.4

• -0.2

• 0

• 0.2

• 0.4

For our given sample, the proportion of toy dogs is $\frac{200}{500}=0.4$ and the proportion of non-toy dogs is $\frac{500-200}{500}=0.6$, so the observed value of the test statistic is $0.4 - 0.6 = -0.2$.
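The same arithmetic can be sketched in a few lines of Python. The variable names are illustrative, not part of the original problem.

```python
# Counts from Eric's sample of Photo Contest winners.
sample_size = 500
toy_count = 200

toy_prop = toy_count / sample_size                      # 0.4
non_toy_prop = (sample_size - toy_count) / sample_size  # 0.6

# Test statistic: proportion of toy dogs minus proportion of non-toy dogs.
obs = toy_prop - non_toy_prop
print(round(obs, 2))  # -0.2
```

Because the two proportions sum to 1, the statistic is equivalently 2 * toy_prop - 1.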

##### Difficulty: ⭐️⭐️⭐️

The average score on this problem was 74%.

### Problem 1.5

Which snippets of code correctly compute Eric’s test statistic on one simulated sample under the null hypothesis? Select all that apply. The result must be stored in the variable stat. The five snippets are shown below.

Snippet 1:

    a = np.random.choice([0.3, 0.7])
    b = np.random.choice([0.3, 0.7])
    stat = a - b

Snippet 2:

    a = np.random.choice([0.1, 0.2, 0.3, 0.2, 0.15, 0.05])
    stat = a - (1 - a)

Snippet 3:

    a = np.random.multinomial(500, [0.1, 0.2, 0.3, 0.2, 0.15, 0.05]) / 500
    stat = a[2] - (1 - a[2])

Snippet 4:

    a = np.random.multinomial(500, [0.3, 0.7]) / 500
    stat = a[0] - (1 - a[0])

Snippet 5:

    a = df.sample(500, replace=True)
    b = a[a.get("kind") == "toy"].shape[0] / 500
    stat = b - (1 - b)

• Snippet 1

• Snippet 2

• Snippet 3

• Snippet 4

• Snippet 5

Answer: Snippet 3 & Snippet 4

• Snippet 1: This is incorrect because np.random.choice() can only return 0.3 or 0.7 for each of a and b, so stat can only be -0.4, 0, or 0.4. No sample of 500 dogs is simulated.

• Snippet 2: This is incorrect because np.random.choice() only chooses from the values within the list. As a sanity check, the proportion of toy dogs in a simulated sample should be able to take on many more values than the six in the list.

• Snippet 3: This option is correct. Recall that in np.random.multinomial(n, [p_1, ..., p_k]), n is the number of experiments and [p_1, ..., p_k] is a sequence of probabilities. The function returns an array of length k in which the ith element is the number of occurrences of the ith event, where the probability of the ith event is p_i. In this snippet, np.random.multinomial(500, [0.1, 0.2, 0.3, 0.2, 0.15, 0.05]) generates an array of length 6 (len([0.1, 0.2, 0.3, 0.2, 0.15, 0.05])) that contains the number of dogs of each kind in a sample of 500, drawn according to the given distribution (the population distribution). Dividing this array by 500 converts the counts into proportions. To access the proportion of toy dogs in our sample, we take the entry whose probability is 0.3, which is the third entry in the array, a[2]. To calculate our test statistic, we take the proportion of toy dogs minus the proportion of non-toy dogs, a[2] - (1 - a[2]).

• Snippet 4: This option is correct. This approach is similar to the one above, except we’re only considering the probability distribution of toy dogs vs. non-toy dogs, which is what we wanted in the first place. The rest of the steps are similar to the ones above.

• Snippet 5: This option is incorrect. df is simply a DataFrame containing information about dogs, and it may or may not reflect the population distribution of dogs that participate in the Photo Contest.
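As a rough sketch, Snippet 4 could be repeated many times to build up a distribution of simulated statistics. The function name simulate_stat, the repetition count of 10,000, and the seed are illustrative choices, not part of the original problem.

```python
import numpy as np

np.random.seed(0)  # arbitrary fixed seed so the sketch is reproducible

def simulate_stat():
    # Draw 500 winners under the null: toy with probability 0.3,
    # non-toy with probability 0.7, then convert counts to proportions.
    a = np.random.multinomial(500, [0.3, 0.7]) / 500
    # Proportion of toy dogs minus proportion of non-toy dogs.
    return a[0] - (1 - a[0])

# Repeat the one-sample simulation to get many simulated statistics.
sim = np.array([simulate_stat() for _ in range(10_000)])
print(sim.mean())
```

Under the null, the simulated statistics should be centered near 0.3 - 0.7 = -0.4, which is a quick way to check that the simulation matches the null model.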

##### Difficulty: ⭐️⭐️⭐️

The average score on this problem was 72%.

### Problem 1.6

After simulating, Eric has an array called sim that stores his simulated test statistics, and a variable called obs that stores his observed test statistic.

What should go in the blank to compute the p-value?

    np.mean(sim _______ obs)

• <

• <=

• ==

• >=

• >

Answer: Option 4: >=

Note that to calculate the p-value, we look for simulated test statistics that are equal to the observed statistic or even further in the direction of the alternative. Under the alternative, toy dogs win more often, which makes the proportion of toy dogs minus the proportion of non-toy dogs larger. The simulated statistics that count toward the p-value are therefore those greater than or equal to the observed value of -0.2, and thus we use >=.
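Here is a minimal sketch of how the comparison works, using a tiny made-up array in place of real simulated statistics. The five values in sim are hypothetical and chosen only so the result is easy to check by hand.

```python
import numpy as np

# Hypothetical simulated statistics, made up to show how the comparison works.
sim = np.array([-0.5, -0.4, -0.3, -0.2, -0.1])
obs = -0.2  # observed statistic from Problem 1.4

# sim >= obs is a boolean array; its mean is the fraction of True values,
# i.e. the fraction of simulated statistics at least as large as obs.
p_value = np.mean(sim >= obs)
print(p_value)  # 0.4
```

Two of the five values (-0.2 and -0.1) are at least as large as the observed statistic, so the proportion is 2/5.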

##### Difficulty: ⭐️⭐️⭐️

The average score on this problem was 66%.

### Problem 1.7

Eric’s p-value is 0.03. If his p-value cutoff is 0.01, what does he conclude?

• He rejects the null in favor of the alternative.

• He accepts the null.

• He accepts the alternative.

• He fails to reject the null.

Answer: Option 4: He fails to reject the null

• Option 1: Note that since our p-value was greater than 0.01, we fail to reject the null.

• Option 2: We can never “accept” the null hypothesis.

• Option 3: We didn’t accept the alternative since we failed to reject the null.

• Option 4: This option is correct because our p-value was larger than our cutoff.
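The decision rule can be sketched as a small helper. The function name conclude is hypothetical, and treating "below the cutoff" as a strict inequality is an assumption about convention.

```python
def conclude(p_value, cutoff):
    # Reject the null only when the p-value falls below the cutoff
    # (strict inequality assumed here); otherwise fail to reject.
    # Neither outcome "accepts" a hypothesis.
    if p_value < cutoff:
        return "reject the null"
    return "fail to reject the null"

print(conclude(0.03, 0.01))  # fail to reject the null
```

With a cutoff of 0.05 instead, the same p-value of 0.03 would lead Eric to reject the null, which shows that the conclusion depends on the cutoff chosen in advance.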

##### Difficulty: ⭐️⭐️

The average score on this problem was 86%.

## Problem 2

For this question, let’s think of the data in app_data as a random sample of all IKEA purchases and use it to test the following hypotheses.

Null Hypothesis: IKEA sells an equal amount of beds (category 'bed') and outdoor furniture (category 'outdoor').

Alternative Hypothesis: IKEA sells more beds than outdoor furniture.

The DataFrame app_data contains 5000 rows, which form our sample. Of these 5000 products,

• 1000 are beds,
• 1500 are outdoor furniture, and
• 2500 are in another category.

### Problem 2.1

Which of the following could be used as the test statistic for this hypothesis test? Select all that apply.

• Among 2500 beds and outdoor furniture items, the absolute difference between the proportion of beds and the proportion of outdoor furniture.

• Among 2500 beds and outdoor furniture items, the proportion of beds.

• Among 2500 beds and outdoor furniture items, the number of beds.

• Among 2500 beds and outdoor furniture items, the number of beds plus the number of outdoor furniture items.

Answer:

• Among 2500 beds and outdoor furniture items, the proportion of beds.

• Among 2500 beds and outdoor furniture items, the number of beds.

Our test statistic needs to be able to distinguish between the two hypotheses. The first option does not do this, because it includes an absolute value. If the absolute difference between the proportion of beds and the proportion of outdoor furniture were large, it could be because IKEA sells more beds than outdoor furniture, but it could also be because IKEA sells more outdoor furniture than beds.

The second option is a valid test statistic, because if the proportion of beds is large, that suggests that the alternative hypothesis may be true.

Similarly, the third option works because if the number of beds (out of 2500) is large, that suggests that the alternative hypothesis may be true.

The fourth option is invalid because out of 2500 beds and outdoor furniture items, the number of beds plus the number of outdoor furniture items is always 2500. So the value of this statistic is constant regardless of whether the alternative hypothesis is true, which means it does not help you distinguish between the two hypotheses.
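As an illustrative sketch, the four candidate statistics can be evaluated on two made-up samples of 2500 items, one with more beds and one with more outdoor furniture. The counts and the function name candidate_statistics are hypothetical.

```python
# Two hypothetical samples of 2500 bed/outdoor items (counts are made up).
samples = {"more beds": (1500, 1000), "more outdoor": (1000, 1500)}

def candidate_statistics(beds, outdoor):
    # Compute each of the four candidate test statistics for one sample.
    total = beds + outdoor
    return {
        "abs_diff_of_props": abs(beds / total - outdoor / total),
        "prop_beds": beds / total,
        "num_beds": beds,
        "beds_plus_outdoor": beds + outdoor,
    }

for label, (beds, outdoor) in samples.items():
    print(label, candidate_statistics(beds, outdoor))
```

The absolute difference and the sum come out identical for both samples, while the proportion of beds and the number of beds change with the direction of the imbalance.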

##### Difficulty: ⭐️⭐️

The average score on this problem was 78%.

### Problem 2.2

Let’s do a hypothesis test with the following test statistic: among 2500 beds and outdoor furniture items, the proportion of outdoor furniture minus the proportion of beds.

Complete the code below to calculate the observed value of the test statistic and save the result as obs_diff.

    outdoor = (app_data.get('category')=='outdoor')
    bed = (app_data.get('category')=='bed')
    obs_diff = ( ___(a)___ - ___(b)___ ) / ___(c)___

The table below contains several Python expressions. Choose the correct expression to fill in each of the three blanks. Three expressions will be used, and two will be unused.

Answer: Reading the table from top to bottom, the five expressions should be used in the following blanks: None, (b), (a), (c), None.

The correct way to define obs_diff is

    outdoor = (app_data.get('category')=='outdoor')
    bed = (app_data.get('category')=='bed')
    obs_diff = (app_data[outdoor].shape[0] - app_data[bed].shape[0]) / app_data[outdoor | bed].shape[0]

The first provided line of code defines a boolean Series called outdoor with a value of True corresponding to each outdoor furniture item in app_data. Using this as the condition in a query results in a DataFrame of outdoor furniture items, and using .shape[0] on this DataFrame gives the number of outdoor furniture items. So app_data[outdoor].shape[0] represents the number of outdoor furniture items in app_data. Similarly, app_data[bed].shape[0] represents the number of beds in app_data. Likewise, app_data[outdoor | bed].shape[0] represents the total number of outdoor furniture items and beds in app_data. Notice that we need to use an or condition (|) to get a DataFrame that contains both outdoor furniture and beds.

We are told that the test statistic should be the proportion of outdoor furniture minus the proportion of beds. Translating this directly into code, this means the test statistic should be calculated as

    obs_diff = app_data[outdoor].shape[0] / app_data[outdoor | bed].shape[0] - app_data[bed].shape[0] / app_data[outdoor | bed].shape[0]

Since this is a difference of two fractions with the same denominator, we can equivalently subtract the numerators first, then divide by the common denominator, using the mathematical fact $\frac{a}{c} - \frac{b}{c} = \frac{a-b}{c}$.

    obs_diff = (app_data[outdoor].shape[0] - app_data[bed].shape[0]) / app_data[outdoor | bed].shape[0]

Notice that this is the observed value of the test statistic because it’s based on the real-life data in the app_data DataFrame, not simulated data.
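To check the calculation with the counts given in the problem, here is a minimal sketch that uses a plain NumPy array as a stand-in for app_data.get('category'). The array category is an assumption made for illustration; the real app_data is a babypandas DataFrame.

```python
import numpy as np

# Stand-in for app_data.get('category'): a plain NumPy array built from the
# counts given in the problem (1000 beds, 1500 outdoor, 2500 other).
category = np.array(["bed"] * 1000 + ["outdoor"] * 1500 + ["other"] * 2500)

outdoor = (category == "outdoor")
bed = (category == "bed")

# mask.sum() counts the True values, playing the role of
# app_data[mask].shape[0] in the DataFrame version.
obs_diff = (outdoor.sum() - bed.sum()) / (outdoor | bed).sum()
print(obs_diff)  # 0.2
```

The result is (1500 - 1000) / 2500 = 0.2, a positive value because the sample contains more outdoor furniture than beds.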

##### Difficulty: ⭐️

The average score on this problem was 90%.

### Problem 2.3

Suppose we generate 10,000 simulated values of the test statistic according to the null model and store them in an array called simulated_diffs. Complete the code below to calculate the p-value for the hypothesis test.

    np.count_nonzero(simulated_diffs _________ obs_diff)/10000

What goes in the blank?

• <

• <=

• >

• >=

Answer: <=

To answer this question, we need to know whether small values or large values of the test statistic indicate the alternative hypothesis. The alternative hypothesis is that IKEA sells more beds than outdoor furniture. Since we’re calculating the proportion of outdoor furniture minus the proportion of beds, this difference will be small (negative) if the alternative hypothesis is true. Larger (positive) values of the test statistic mean that IKEA sells more outdoor furniture than beds. A value near 0 means they sell beds and outdoor furniture equally.

The p-value is defined as the proportion of simulated test statistics that are equal to the observed value or more extreme, where extreme means in the direction of the alternative. In this case, since small values of the test statistic indicate the alternative hypothesis, the correct answer is <=.
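Putting the pieces together, a sketch of the full simulation might look like the following. The null model used here (each of the 2500 items is equally likely to be a bed or outdoor furniture), the function name simulate_diff, and the seed are assumptions made for illustration.

```python
import numpy as np

np.random.seed(0)  # arbitrary fixed seed so the sketch is reproducible

# Observed statistic: outdoor proportion minus bed proportion.
obs_diff = (1500 - 1000) / 2500

def simulate_diff():
    # Assumed null model: each of the 2500 items is equally likely to be
    # outdoor furniture or a bed.
    outdoor_count, bed_count = np.random.multinomial(2500, [0.5, 0.5])
    return (outdoor_count - bed_count) / 2500

simulated_diffs = np.array([simulate_diff() for _ in range(10_000)])

# Small (negative) values favor the alternative, so count simulated
# statistics that are less than or equal to the observed one.
p_value = np.count_nonzero(simulated_diffs <= obs_diff) / 10000
print(p_value)
```

Because the sample actually contains more outdoor furniture than beds, the observed statistic lies on the opposite side from the alternative, and the p-value from this sketch should come out close to 1.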

##### Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 43%.