Summer 2024 Final Exam



Instructor(s): Nishant Kheterpal

This exam was administered in-person. The exam was closed-notes, except students were provided a copy of the DSC 10 Reference Sheet. No calculators were allowed. Students had 3 hours to take this exam.


In this exam, you’ll work with a data set representing the results of the Tour de France, a multi-stage, weeks-long cycling race. The Tour de France takes place over many days each year, and on each day, the riders compete in individual races called stages. Each stage is a standalone race, and the winner of the entire tour is determined by who performs the best across all of the individual stages combined. Each row represents one stage of the Tour (or equivalently, one day of racing). This dataset will be called stages.

The columns of stages are as follows:

The first few rows of stages are shown below, though stages has many more rows than pictured.


Throughout this exam, we will refer to stages repeatedly. Assume that we have already run import babypandas as bpd and import numpy as np.


Problem 1


Problem 1.1

Fill in the blanks so that the expression below evaluates to the proportion of stages won by the country with the most stage wins.

    stages.groupby(__(i)__).__(ii)__.get("Type").__(iii)__ / stages.shape[0]

Answer:

  • (i): "Winner Country"
    To calculate the number of stages won by each country, we need to group the data by the Winner Country. This will allow us to compute the counts for each group.

  • (ii): count()
    Once the data is grouped, we use the .count() method to calculate the number of stages won by each country.

  • (iii): max()
    Finds the maximum number of stages won by a single country. Finally, we divide the maximum stage wins by the total number of stages (stages.shape[0]) to calculate the proportion of stages won by the top country.


Difficulty: ⭐️

The average score on this problem was 90%.


Problem 1.2

The distance of a stage alone does not encapsulate its difficulty, as riders feel more tired as the tour goes on. Because of this, we want to consider “real distance,” a measurement of the length of a stage that takes into account how far into the tour the riders are. The “real distance” is calculated with the following process:

  1. Add one to the stage number.

  2. Take the square root of the result of (i).

  3. Multiply the result of (ii) by the raw distance of the stage.

Complete the implementation of the function real_distance, which takes in stages (a DataFrame), stage (a string, the name of the column containing stage numbers), and distance (a string, the name of the column containing stage distances). real_distance returns a Series containing all of the “real distances” of the stages, as calculated above.

    def real_distance(stages, stage, distance):
         ________

Answer: return stages.get(distance) * np.sqrt(stages.get(stage) + 1)

  • (i): First, we need to add one to the stage number. The stage parameter specifies the name of the column containing the stage numbers. stages.get(stage) retrieves this column as a Series, and we can add 1 to every element of the Series at once with stages.get(stage) + 1.

  • (ii): Then, to take the square root of the result of (i), we can use np.sqrt(stages.get(stage) + 1)

  • (iii): Finally, we want to multiply the result of (ii) by the raw distance of the stage. The distance parameter specifies the name of the column containing the raw distances of each stage. stages.get(distance) retrieves this column as a pandas Series, and we can directly multiply it by np.sqrt(stages.get(stage) + 1).
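As a quick check of the one-line answer, the sketch below runs real_distance on a small made-up table. It assumes plain pandas as a stand-in for babypandas (both provide .get on a DataFrame); the three toy rows are invented for illustration and do not come from stages.

```python
import numpy as np
import pandas as pd  # assumption: pandas stands in for babypandas here

def real_distance(stages, stage, distance):
    # Steps (i)-(iii): add one, take the square root, scale the raw distance.
    return stages.get(distance) * np.sqrt(stages.get(stage) + 1)

# Hypothetical toy data, not taken from the real stages table.
toy = pd.DataFrame({"Stage": [3, 8, 15], "Distance": [100.0, 150.0, 200.0]})
real = real_distance(toy, "Stage", "Distance")
print(real.to_numpy())  # stage 3 gives 100 * sqrt(4) = 200, and so on
```

The stage numbers were picked so that stage + 1 is a perfect square, which makes the results easy to verify by hand.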


Difficulty: ⭐️⭐️

The average score on this problem was 89%.


Problem 1.3

Sometimes, stages are repeated in different editions of the Tour de France, meaning that there are some pairs of "Origin" and "Destination" that appear more than once in stages. Fill in the blanks so that the expression below evaluates to the number of times the most common "Origin" and "Destination" pair appears in the stages DataFrame.

    stages.groupby(__(i)__).__(ii)__.sort_values(by = "Date").get("Type").iloc[__(iii)__]

Answer:

  • (i): ["Origin", "Destination"]
    To analyze the frequency of stages with the same origin and destination, we need to group the data by the columns ["Origin", "Destination"]. This groups the stages into unique pairs of origin and destination.

  • (ii): count()
    After grouping, we apply the .count() method to calculate how many times each pair of ["Origin", "Destination"] appears in the dataset. The result is the frequency of each pair.

  • (iii): -1
    After obtaining the frequencies, we sort the resulting groups by their counts in ascending order (this is the default behavior of .sort_values()). The most common pair will then be the last entry in the sorted result. Using .get("Type") extracts the series of counts, and .iloc[-1] retrieves the count of the most common pair, which is at the last position of the sorted series.
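The filled-in expression can be tried on a tiny example. This is a minimal sketch that assumes pandas as a stand-in for babypandas; the place names "A" through "D" and all other values are hypothetical.

```python
import pandas as pd  # assumption: pandas stands in for babypandas here

# Hypothetical toy rows; only the repeated Origin/Destination pairs matter.
toy = pd.DataFrame({
    "Origin":      ["A", "A", "C", "A", "C"],
    "Destination": ["B", "B", "D", "B", "D"],
    "Date":        ["d1", "d2", "d3", "d4", "d5"],
    "Type":        ["Mountain", "Mountain", "Flat", "Mountain", "Flat"],
})

# After .count(), every remaining column holds the group size,
# so sorting by "Date" sorts the pairs by how often they occur.
most_common_count = (
    toy.groupby(["Origin", "Destination"])
       .count()
       .sort_values(by="Date")
       .get("Type")
       .iloc[-1]
)
print(most_common_count)  # 3, since the pair (A, B) appears three times
```

This also shows why sorting by "Date" is harmless here: once the groups are counted, "Date" and "Type" contain the same numbers.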


Difficulty: ⭐️⭐️

The average score on this problem was 84%.


Problem 1.4

Fill in the blanks so that the value of mystery_three is the "Destination" of the longest stage before Stage 12.

    mystery = stages[stages.get(__(i)__) < 12]
    mystery_two = mystery.sort_values(by = "Distance", ascending = __(ii)__)
    mystery_three = mystery_two.get(__(iii)__).iloc[-1]

Answer:

  • (i): "Stage"
    To filter the DataFrame to include only rows corresponding to stages before Stage 12, we use the "Stage" column. The condition stages.get("Stage") < 12 creates a boolean mask that selects only the rows where the stage number is less than 12.

  • (ii): True
    To find the longest stage, the rows need to be sorted by the "Distance" column. Setting ascending=True ensures that shorter stages come first and the longest stage appears last in the sorted DataFrame.

  • (iii): "Destination"
    After sorting, we want to retrieve the "Destination" of the longest stage. Using .get("Destination") retrieves the "Destination" column, and .iloc[-1] accesses the last row in the sorted DataFrame, corresponding to the longest stage before Stage 12.


Difficulty: ⭐️

The average score on this problem was 92%.



Problem 2

Suppose we run the following code to simulate the winners of the Tour de France.

    evenepoel_wins = 0
    vingegaard_wins = 0
    pogacar_wins = 0
    for i in np.arange(4):
        result = np.random.multinomial(1, [0.3, 0.3, 0.4])
        if result[0] == 1:
            evenepoel_wins = evenepoel_wins + 1
        elif result[1] == 1:
            vingegaard_wins = vingegaard_wins + 1
        elif result[2] == 1:
            pogacar_wins = pogacar_wins + 1


Problem 2.1

What is the probability that pogacar_wins is equal to 4 when the code finishes running? Do not simplify your answer.

Answer: 0.4 ^ 4

  • The code runs a loop 4 times, simulating the outcome of each iteration using np.random.multinomial(1, [0.3, 0.3, 0.4]).
  • The probability that pogacar wins in a single iteration is 0.4 (the third entry in the probability vector [0.3, 0.3, 0.4]).
  • To win all 4 iterations, pogacar must win independently in each iteration.
  • Since the trials are independent, the probability is calculated as: 0.4 * 0.4 * 0.4 * 0.4 = 0.4 ^ 4

Difficulty: ⭐️⭐️

The average score on this problem was 88%.


Problem 2.2

What is the probability that evenepoel_wins is at least 1 when the code finishes running? Do not simplify your answer.

Answer: 1 - 0.7 ^ 4

  • The probability that evenepoel wins in a single iteration is 0.3 (the first entry in the probability vector [0.3, 0.3, 0.4]).
  • The complement of “at least 1 win” is “no wins at all.” To calculate the complement:
    • The probability that evenepoel does not win in a single iteration is: 1 - 0.3 = 0.7
    • For evenepoel to win no iterations across all 4 loops, they must fail to win independently in each iteration: 0.7 * 0.7 * 0.7 * 0.7 = 0.7 ^ 4
  • The probability that evenepoel_wins is at least 1 is then: 1 - 0.7 ^ 4
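Both answers to Problem 2 can be checked by simulation. The sketch below is one way to do so; it uses NumPy's default_rng generator rather than the np.random.multinomial call in the exam code, and the seed and number of repetitions are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
trials = 100_000                # number of simulated 4-race runs (an assumption)

# Each row holds (evenepoel_wins, vingegaard_wins, pogacar_wins) after 4 races.
# One multinomial draw with n=4 has the same distribution as four draws with n=1.
wins = rng.multinomial(4, [0.3, 0.3, 0.4], size=trials)

est_pogacar_all = np.mean(wins[:, 2] == 4)
est_evenepoel_any = np.mean(wins[:, 0] >= 1)

print(est_pogacar_all, 0.4 ** 4)        # simulated vs. exact
print(est_evenepoel_any, 1 - 0.7 ** 4)  # simulated vs. exact
```

The simulated proportions should land close to the exact values, with small differences due to randomness.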

Difficulty: ⭐️⭐️

The average score on this problem was 83%.



Problem 3

We want to estimate the mean distance of Tour de France stages by bootstrapping 10,000 times and constructing a 90% confidence interval for the mean. In this question, suppose random_stages is a random sample of size 500 drawn with replacement from stages. Identify the line numbers with errors in the code below. In the adjacent box, point out the error by describing the mistake in less than 10 words or writing a code snippet (correct only the part you think is wrong). You may or may not need all the spaces provided below to identify errors.


    line 1:      means = np.array([])
    line 2: 
    line 3:      for i in 10000:
    line 4:          resample = random_stages.sample(10000)
    line 5:          resample_mean = resample.get("Distance").mean()
    line 6:          np.append(means, resample_mean)
    line 7:    
    line 8:      left_bound = np.percentile(means, 0)
    line 9:      right_bound = np.percentile(means, 90)

Answer:

  • a): 3: for i in np.arange(10000):
    • The for loop syntax is incorrect. 10000 is an integer, not an iterable. To iterate 10,000 times, np.arange(10000) must be used.
  • b): 4: random_stages.sample(500, replace=True)
    • The bootstrap sample size should be 500 (matching the original sample size). Additionally, replace=True is required for sampling with replacement.
  • c): 6: means = np.append(means, resample_mean)
    • np.append does not modify the array in place. The means array must be reassigned to include the new value.
  • d): 8: np.percentile(means, 5)
    • A 90% confidence interval captures the middle 90% of the data or distribution. This means we exclude 10% of the data: 5% from the lower tail and 5% from the upper tail. The 0th percentile is incorrect for a 90% confidence interval. The lower bound should be the 5th percentile.
  • e): 9: np.percentile(means, 95)
    • The 90th percentile is incorrect for a 90% confidence interval. The upper bound should be the 95th percentile.
  • f): N/A: No more errors.
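Putting fixes a) through e) together gives a working bootstrap. The sketch below assumes a NumPy array of 500 made-up distances in place of random_stages.get("Distance"), and uses rng.choice with replace=True as the array equivalent of .sample(500, replace=True).

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical stand-in for random_stages.get("Distance"): 500 made-up distances.
sample_distances = rng.normal(200, 50, size=500)

means = np.array([])
for i in np.arange(10000):                                      # fix a)
    resample = rng.choice(sample_distances, 500, replace=True)  # fix b)
    resample_mean = resample.mean()
    means = np.append(means, resample_mean)                     # fix c)

left_bound = np.percentile(means, 5)                            # fix d)
right_bound = np.percentile(means, 95)                          # fix e)
print(left_bound, right_bound)
```

The interval should bracket the mean of the original sample, since the resampled means are centered there.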

Difficulty: ⭐️⭐️

The average score on this problem was 88%.


Problem 4

Below is a density histogram representing the distribution of randomly sampled stage distances.


Problem 4.1

Which statement below correctly describes the relationship between the mean and the median of the sampled stage distances?

Answer: The mean is approximately equal to the median.

  • The histogram appears to be approximately symmetric, with the peak near the center of the distribution.
  • For symmetric distributions, the mean and the median are approximately equal because the data is evenly distributed around the central point.
  • If the distribution were skewed:
    • A right-skewed distribution would have the mean significantly larger than the median.
    • A left-skewed distribution would have the mean significantly smaller than the median.
  • In this case, there is no visible skew, so the correct answer is that the mean is approximately equal to the median.

Difficulty: ⭐️⭐️⭐️

The average score on this problem was 55%.



Problem 4.2

Assume there are 100 stages in the random sample that generated this plot. If there are 5 stages in the bin [275, 300), approximately how many stages are in the bin [200, 225)?

Answer: 35 = 5\cdot7

  • The height of the bin [200, 225) on the density histogram is approximately 7 times the height of the bin [275, 300).
  • Since the number of stages in a bin is proportional to the bin’s height, the number of stages in [200, 225) is 35 = 5\cdot7.

Difficulty: ⭐️⭐️

The average score on this problem was 78%.


Problem 4.3

Assume the mean distance is 200 km and the standard deviation is 50 km. At least what proportion of stage distances are guaranteed to lie between 0 km and 400 km? Do not simplify your answer.

Answer: \frac{15}{16}

Using Chebyshev’s inequality, we know that at least 1 - \frac{1}{z^2} of the data lies within z SDs of the mean. Here, 0 km and 400 km are each \frac{200}{50} = 4 SDs from the mean of 200 km, so z = 4 and at least 1 - \frac{1}{16} = \frac{15}{16} of the data lie in that range.
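The arithmetic can be spelled out in a few lines. This is only a sketch of the calculation; the variable names are chosen for illustration.

```python
mean, sd = 200, 50
lower, upper = 0, 400

# Both endpoints are the same number of SDs away from the mean.
z = (upper - mean) / sd            # 4.0
assert (mean - lower) / sd == z

chebyshev_bound = 1 - 1 / z ** 2   # 1 - 1/16 = 15/16
print(chebyshev_bound)             # 0.9375
```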


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 73%.


Problem 4.4

Again, assume the mean stage distance is 200 km and the standard deviation is 50 km. Now, suppose we take a random sample of size 25 from the stage distances, calculate the mean stage distance of this sample, and repeat this process 500 times. What proportion of the means that we calculate will fall between 190 km and 210 km? Do not simplify your answer.

Answer: 68%

We know about 68% of values lie within 1 standard deviation of the mean of any normal distribution. By the Central Limit Theorem, the distribution of means of samples of size 25 from this dataset is approximately normal with mean 200 km and SD \frac{50}{\sqrt{25}} = 10 km, so the range from 190 km to 210 km (one SD on either side of the mean) contains about 68% of the sample means.
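The 68% figure can be confirmed from the normal curve itself. The sketch below uses Python's standard-library NormalDist and treats the sample means as exactly normal, which is an approximation justified by the Central Limit Theorem.

```python
from math import sqrt
from statistics import NormalDist

pop_mean, pop_sd, n = 200, 50, 25
se = pop_sd / sqrt(n)   # SD of the sample mean: 50 / 5 = 10

# CLT approximation: sample means are roughly Normal(200, 10).
sample_means = NormalDist(mu=pop_mean, sigma=se)
proportion = sample_means.cdf(210) - sample_means.cdf(190)
print(round(proportion, 4))  # 0.6827
```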


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 55%.


Problem 4.5

Assume the mean distance is 200 km and the standard deviation is 50 km. Suppose we use the Central Limit Theorem to generate a 95% confidence interval for the true mean distance of all Tour de France stages, and get the interval [190\text{ km}, 210\text{ km}]. Which of the following interpretations of this confidence interval are correct?

Answer: Option 3, Option 4, and Option 7

Option 1:
Incorrect. Confidence intervals describe the uncertainty in estimating the population mean, not the proportion of data points. A 95% confidence interval does not imply that 95% of individual stage distances fall between 190 km and 210 km.

Option 2:
Incorrect. The 95% describes how often the interval-building process captures the true mean across repeated samples, not a probability about this one interval. Once the interval is calculated, the true mean is a fixed value that is either inside or outside the interval, so we cannot assign a probability to this.

Option 3:
Correct. This is the standard interpretation of confidence intervals: “We are 95% confident that the true mean lies within the interval.”

Option 4:
Correct. Given a sample size of 100 and a standard deviation of 50, the confidence interval [190, 210] is consistent with the rule of thumb that a 95% confidence interval extends approximately 2 standard deviations of the distribution of the sample mean on either side of the sample mean.

For a 95% confidence interval, the range can be approximated as:

\left[\text{sample mean} - 2\cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}}, \text{sample mean} + 2\cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}} \right]

Substituting the given values:

  • Sample mean = 200
  • Sample standard deviation = 50
  • Sample size = 100

\text{CI} = \left[ 200 - 2\cdot \frac{50}{\sqrt{100}}, 200 + 2\cdot \frac{50}{\sqrt{100}} \right]

Simplify:

\text{CI} = \left[ 200 - 2\cdot 5, 200 + 2\cdot 5 \right]

\text{CI} = [190, 210]

Option 5:
Incorrect. Refer to the explanation for Option 4.

Option 6:
Incorrect. The wording “exactly 95%” is overly precise. In practice, confidence intervals are based on the sampling process, and we use “approximately” or “roughly” 95%.

Option 7:
Correct. By definition of a confidence interval, if we repeatedly sampled and constructed 95% confidence intervals, roughly 95% of them would contain the true mean.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 72%.


Problem 4.6

Suppose we take 500 random samples of size 100 from the stage distances, calculate their means, and draw a histogram of the distribution of these sample means. We label this Histogram A. Then, we take 500 random samples of size 1000 from the stage distances, calculate their means, and draw a histogram of the distribution of these sample means. We label this Histogram B. Fill in the blanks so that the sentence below correctly describes how Histogram B looks in comparison to Histogram A.

“Relative to Histogram A, Histogram B would appear __(i)__ and shifted __(ii)__ due to the __(iii)__ mean and the __(iv)__ standard deviation.”

(i):

(ii):

(iii):

(iv):

Answer:

  • (i): thinner
    • Histogram B would appear thinner because larger sample sizes reduce the variability of the sample means. With a sample size of 1000 (compared to 100 for Histogram A), the standard error decreases, leading to a narrower distribution.
  • (ii): not at all
    • Histogram B would not shift left or right because the sample mean does not depend on the sample size. Both histograms have the same mean, as they are based on the same population.
  • (iii): unchanged
    • The mean remains unchanged because the mean of the sampling distribution (the population mean) does not depend on sample size.
  • (iv): smaller
    • The standard deviation of Histogram B is smaller because \text{SD of Distribution of Possible Sample Means} = \frac{\text{Population SD}}{\sqrt{\text{sample size}}}
    Increasing the sample size from 100 to 1000 makes the denominator larger while the population standard deviation stays the same, so the standard deviation of the sampling distribution gets smaller.
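A short simulation illustrates the comparison between the two histograms. The sketch below assumes a made-up population of 10,000 distances with mean near 200 and SD near 50; the helper name sample_means is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical population of stage distances (made-up values).
population = rng.normal(200, 50, size=10_000)

def sample_means(sample_size, repetitions=500):
    # Each row is one random sample; take the mean of every row.
    samples = rng.choice(population, size=(repetitions, sample_size))
    return samples.mean(axis=1)

means_a = sample_means(100)    # Histogram A
means_b = sample_means(1000)   # Histogram B

print(means_a.mean(), means_b.mean())  # centers nearly equal
print(means_a.std(), means_b.std())    # B's spread is smaller
```

Because the sample size grows by a factor of 10, the spread of Histogram B should be smaller by a factor of roughly \sqrt{10}, while both centers stay near the population mean.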

Difficulty: ⭐️⭐️

The average score on this problem was 79%.



Problem 5

In this question, suppose random_stages is a random sample of undetermined size drawn with replacement from stages. We want to estimate the proportion of stage wins won by each country.


Problem 5.1

Suppose we extract the winning countries and store the resulting Series. Consider the variable winners defined below, which you may use throughout this question:

winners = random_stages.get("Winner Country")

Write a single line of code that evaluates to the proportion of stages in random_stages won by France (country code "FRA").

Answer: np.mean(winners == "FRA") or np.count_nonzero(winners == "FRA") / len(winners)

winners == "FRA" creates a Boolean array where each element is True if the corresponding value in the winners Series equals "FRA", and False otherwise. In Python, True is equivalent to 1 and False is equivalent to 0 when used in numerical operations. np.mean(winners == "FRA") computes the average of this Boolean array, which is equivalent to the proportion of True values (i.e., the proportion of stages won by "FRA").

Alternatively, you can use np.count_nonzero(winners == "FRA") / len(winners), which counts the number of True values and divides by the total number of entries to compute the proportion.
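Both expressions can be compared on a small example. The sketch below assumes a five-element NumPy array in place of the winners Series; the country codes in it are made up.

```python
import numpy as np

# Hypothetical stand-in for random_stages.get("Winner Country").
winners = np.array(["FRA", "ITA", "FRA", "BEL", "ESP"])

is_france = winners == "FRA"   # array of True/False values
prop_mean = np.mean(is_france)
prop_count = np.count_nonzero(is_france) / len(winners)
print(prop_mean, prop_count)   # both 0.4, since 2 of the 5 winners are "FRA"
```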


Difficulty: ⭐️⭐️

The average score on this problem was 81%.


Problem 5.2

We want to generate a 95% confidence interval for the true proportion of wins by France in stages by using our random sample random_stages. How many rows need to be in random_stages for our confidence interval to have width of at most 0.03? Recall that the maximum standard deviation for any series of zeros and ones is 0.5. Do not simplify your answer.

Answer: 4 * \frac{0.5}{\sqrt{n}} \leq 0.03

For a 95% confidence interval for the population mean: \left[ \text{sample mean} - 2\cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}}, \ \text{sample mean} + 2\cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}} \right]

Note that the width of our CI is the right endpoint minus the left endpoint:

\text{width} = 4 \cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}}

Substitute the maximum standard deviation (sample SD = 0.5) for any series of zeros and ones and set the width to be at most 0.03: 4 \cdot \frac{0.5}{\sqrt{n}} \leq 0.03.
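Although the exam asks for the unsimplified inequality, it can be solved for n as a check. This is a sketch of that step, rearranging the inequality to n \geq \left(\frac{4 \cdot 0.5}{0.03}\right)^2 and rounding up to a whole number of rows.

```python
from math import ceil, sqrt

max_sd = 0.5
max_width = 0.03

# 4 * max_sd / sqrt(n) <= max_width  is equivalent to  n >= (4 * max_sd / max_width) ** 2
n_min = ceil((4 * max_sd / max_width) ** 2)
print(n_min)  # 4445

# The width meets the target at n_min but not one row earlier.
print(4 * max_sd / sqrt(n_min) <= max_width)      # True
print(4 * max_sd / sqrt(n_min - 1) <= max_width)  # False
```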


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 59%.


Problem 5.3

Suppose we now want to test the hypothesis that the true proportion of stages won by Italy ("ITA") is 0.2 using a confidence interval and the Central Limit Theorem. We want to conduct our hypothesis test at a significance level of 0.01. Fill in the blanks to construct the confidence interval [interval_left, interval_right]. Your answer must use the Central Limit Theorem, not bootstrapping. Assume an integer variable sample_size = len(winners) has been defined, regardless of your answer to part 2.

Hint:

stats.norm.cdf(2.576) - stats.norm.cdf(-2.576) = 0.99
    interval_center = __(i)__
    mystery = __(ii)__ * np.std(__(iii)__ ) / __(iv)__
    interval_left = interval_center - mystery
    interval_right = interval_center + mystery

Answer:

The confidence interval for the true proportion is given by: \left[ \text{sample mean} - z\cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}}, \ \text{sample mean} + z\cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}} \right]

  • (i): np.mean(winners == "ITA") or (winners == "ITA").mean()
    • The center of the confidence interval is the sample mean proportion of stages won by Italy. This is calculated by taking the mean of the Boolean array where winners equals "ITA". The Boolean array has values 1 (if true) or 0 (if false), so the mean directly represents the proportion.
  • (ii): 2.576
    • This is the critical value corresponding to a 99% confidence level, which matches a significance level of 0.01. To have 99% of the area under the standard normal curve between -z and z, the area outside of this range is 1 - 0.99 = 0.01, split equally between the two tails (0.005 in each tail). This means: P(-z \leq Z \leq z) = 0.99.

    • The CDF at z = 2.576 captures 99.5% of the area to the left of 2.576: \text{stats.norm.cdf}(2.576) \approx 0.995.

    • Similarly, the CDF at z = -2.576 captures 0.5% of the area to the left: \text{stats.norm.cdf}(-2.576) \approx 0.005.

    • The total area between -2.576 and 2.576 is: \text{stats.norm.cdf}(2.576) - \text{stats.norm.cdf}(-2.576) = 0.995 - 0.005 = 0.99.

      This confirms that z = 2.576 is the correct critical value for a 99% confidence level.

  • (iii): winners == "ITA"
    • The standard deviation is calculated using the Boolean array where winners equals "ITA".
  • (iv): np.sqrt(sample_size)
    • The denominator is the square root of the sample size. This is consistent with the Central Limit Theorem.
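With the blanks filled in, the interval can be computed end to end. The sketch below assumes a made-up winners array of 400 country codes in place of the real Series, and recovers the critical value from the standard library's NormalDist instead of typing 2.576 directly.

```python
import numpy as np
from statistics import NormalDist

# Hypothetical stand-in for winners; the real Series comes from random_stages.
rng = np.random.default_rng(7)
winners = rng.choice(["FRA", "ITA", "BEL", "OTHER"], size=400, p=[0.3, 0.2, 0.1, 0.4])
sample_size = len(winners)

z = NormalDist().inv_cdf(0.995)   # about 2.576: leaves 0.005 in each tail

interval_center = np.mean(winners == "ITA")
mystery = z * np.std(winners == "ITA") / np.sqrt(sample_size)
interval_left = interval_center - mystery
interval_right = interval_center + mystery
print(interval_left, interval_right)
```

The variable mystery is the half-width of the interval, so the full width is twice its value.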

Difficulty: ⭐️⭐️⭐️

The average score on this problem was 60%.


Problem 5.4

What is our null hypothesis?

Answer: The true proportion of stages won by Italy is 0.2.

The null hypothesis assumes that there is no difference or effect. Here, it states that the true proportion of stages won by Italy equals 0.2.


Difficulty: ⭐️

The average score on this problem was 90%.


Problem 5.5

What is our alternative hypothesis?

Answer: The true proportion of stages won by Italy is not 0.2.

The alternative hypothesis is the opposite of the null hypothesis. Here, we test whether the true proportion of stages won by Italy is different from 0.2.


Difficulty: ⭐️⭐️

The average score on this problem was 85%.


Problem 5.6

Suppose we calculated the interval [0.195, 0.253] using the above process. Should we reject or fail to reject our null hypothesis?

Answer: Fail to reject.

The confidence interval calculated is [0.195, 0.253], and the null hypothesis value (0.2) lies within this interval. This means that 0.2 is a plausible value for the true proportion at the 99% confidence level. Therefore, we do not have sufficient evidence to reject the null hypothesis.


Difficulty: ⭐️⭐️

The average score on this problem was 85%.



Problem 6

You want to use the data in stages to test the following hypotheses:

For the rest of this problem, assume you have assigned a new column to stages called class, which categorizes stages into either flat or mountain stages.


Problem 6.1

Which of the following test statistics could be used to test the given hypothesis? Select all that apply.

Answer: Option 1, Option 2, and Option 4

A test statistic is a single number we use to test which viewpoint the data better supports. During hypothesis testing, we check whether our observed statistic is a “typical value” in the distribution of the test statistic. The alternative hypothesis indicates “less than” so our test statistic needs to summarize both the magnitude and direction of the difference in the categories.

  • Option 1 is correct. The mean number of flat stages divided by the mean distance of flat stages is essentially a ratio which is a valid test statistic.
  • Option 2 is correct. The difference between the mean distance of mountain stages and the mean distance of flat stages gives the direction and magnitude of the difference between the categories, so it's a valid test statistic.
  • Option 3 is incorrect. Taking the absolute value of the difference between the mean distance of flat stages and the mean distance of mountain stages removes the direction of the difference making this an invalid test statistic.
  • Option 4 is correct. One half of the difference between the mean distance of flat stages and the mean distance of mountain stages gives both magnitude and direction of the difference between the categories so this is a valid test statistic.
  • Option 5 is incorrect. Squaring the difference between the mean distance of flat stages and the mean distance of mountain stages removes the direction of the difference between the categories so this is an invalid test statistic.

Difficulty: ⭐️⭐️

The average score on this problem was 79%.


Assume that for the rest of the question, we will be using the following test statistic: The difference between the mean distance of flat stages and the mean distance of mountain stages.


Problem 6.2

Fill in the blanks in the code below so that it correctly conducts a hypothesis test of the given hypotheses and returns the p-value.


def hypothesis_test(stages):
    means = stages.groupby("class").mean().get("Distance")
    observed_stat = means.loc["flat"] - means.loc["mountain"]
    
    simulated_stats = np.array([])
    for i in np.arange(10000):
        shuffled = stages.assign(shuffled = np.random.__(i)__(stages.get("Distance")))
        shuffled_means = shuffled.groupby("class").mean().get("shuffled")
        simulated_stat = (shuffled_means.loc["flat"] - shuffled_means.loc["mountain"])
        simulated_stats = __(ii)__(simulated_stats, simulated_stat)
    
    p_value = np.__(iii)__(simulated_stats <= observed_stat)
    return p_value

Answer:

  • (i): permutation
  • (ii): np.append
  • (iii): mean

The first step in each repetition of a permutation test is to shuffle either the labels or the values. Since the first line in the for loop assigns a column called ‘shuffled’, we know we need to call np.random.permutation() on the "Distance" column. The next line computes the mean of each group after shuffling, and simulated_stat is the simulated difference in means. We want to save this simulated statistic in the simulated_stats array, so (ii) is np.append. Finally, after the simulation is complete, we calculate the p-value from the array of simulated statistics. The p-value is the probability, under the null hypothesis, of seeing a result at least as extreme as the observed one. simulated_stats <= observed_stat returns an array of Boolean values (True counts as 1 and False as 0) indicating whether each simulated statistic is less than or equal to the observed statistic. Taking the mean of this array gives the proportion of True values, which is our estimate of that probability.
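The same procedure can be run on plain arrays. The sketch below is a minimal version under stated assumptions: the class labels and distances are made up, NumPy arrays replace the DataFrame, and the helper diff_in_means is a hypothetical name introduced here.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical data: 60 flat and 40 mountain stages with made-up distances.
classes = np.array(["flat"] * 60 + ["mountain"] * 40)
distances = np.concatenate([rng.normal(170, 25, 60), rng.normal(220, 25, 40)])

def diff_in_means(dist, labels):
    # Mean distance of flat stages minus mean distance of mountain stages.
    return dist[labels == "flat"].mean() - dist[labels == "mountain"].mean()

observed_stat = diff_in_means(distances, classes)

simulated_stats = np.array([])
for i in np.arange(10000):
    shuffled = rng.permutation(distances)   # break any link to class
    simulated_stat = diff_in_means(shuffled, classes)
    simulated_stats = np.append(simulated_stats, simulated_stat)

# Proportion of shuffles at least as negative as the observed difference.
p_value = np.mean(simulated_stats <= observed_stat)
print(observed_stat, p_value)
```

Because the toy flat stages were generated to be much shorter on average, the p-value here comes out very small; with real data it could be anywhere between 0 and 1.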


Difficulty: ⭐️⭐️

The average score on this problem was 77%.


Problem 6.3

Indicate whether each of the following code snippets would correctly calculate simulated_stat inside the for-loop without errors. Where present, assume the blank (i) has been filled in correctly.

shuffled = stages.assign(shuffled = np.random.__(i)__(stages.get("Distance")))
shuffled_flat = (shuffled[shuffled.get("class") == "flat"].get("shuffled"))
shuffled_mountain = (shuffled[shuffled.get("class") == "mountain"].get("shuffled"))
simulated_stat = shuffled_flat.mean() - shuffled_mountain.mean()

(i):

shuffled = stages.assign(shuffled = np.random.__(i)__(stages.get("class")))
shuffled_flat = (shuffled[shuffled.get("shuffled") == "flat"].get("Distance"))
shuffled_mountain = (shuffled[shuffled.get("shuffled") == "mountain"].get("Distance"))
simulated_stat = shuffled_flat.mean() - shuffled_mountain.mean()

(ii):

shuffled = stages.assign(shuffled = np.random.__(i)__(stages.get("Distance")))
shuffled_means = shuffled.groupby("class").mean()
simulated_stat = (shuffled_means.get("Distance").iloc["flat"] -
                  shuffled_means.get("Distance").iloc["mountain"])

(iii):

Answer:

  • (i): This code is correct
    • shuffled shuffles the distances. shuffled_flat gets the series of flats with the shuffled distances and shuffled_mountain gets the series of the mountains with the shuffled distances. Finally simulated_stat calculates the mean difference between the two categories.
  • (ii): The code is correct.
    • shuffled shuffles the labels. shuffled_flat gets the series of the distances with the shuffled label of “flat” and shuffled_mountain gets the series of the distances with the shuffled label of “mountain”. Finally, simulated_stat calculates the mean difference between the two categories.
  • (iii): The code is incorrect.
    • shuffled shuffles the distances and assigns these shuffled distances to the column ‘shuffled’. shuffled_means groups by the label and calculates the means for each column. However, simulated_stat uses the original "Distance" column when calculating the difference in means rather than the shuffled distances, which are located in the ‘shuffled’ column, making this answer incorrect. In addition, .iloc accepts only integer positions, so .iloc["flat"] would raise an error; label-based access requires .loc.

Difficulty: ⭐️⭐️

The average score on this problem was 85%.


Problem 6.4

Assume that the observed statistic for this hypothesis test was equal to -22.5 km. Given that there are 10,000 simulated test statistics generated in the code above, at least how many of those must be equal to -22.5 km in order for us to reject the null hypothesis at an 0.05 significance level?

Answer: 0

In order to reject the null hypothesis at the 0.05 significance level, the p-value needs to be below 0.05. In order to calculate the p-value, we find the proportion of simulated test statistics that are equal to or less than the observed value. Note the usage of “must be” in the problem. Since these simulated test statistics can be even less than the observed value, none of them have to be equal to the observed value. Thus, the answer is 0.


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 30%.



Problem 6.5

Assume that the code above generated a p-value of 0.03. In the space below, please write your interpretation of this p-value. Your answer should include more than simply “we reject/fail to reject the null hypothesis.”

Answer: There is a 3% chance, assuming the null hypothesis is true, of seeing an observed difference in means less than or equal to -22.5 km.

The p-value is the probability of seeing the observed value or something more extreme under the null hypothesis. Knowing this, in this context, since the p-value is 0.03, this means that there is a 3% chance under the null hypothesis of seeing an observed difference in means equal to or less than -22.5km.


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 44%.



Problem 7

You are analyzing the data in stages to see which countries winners come from.

You categorize the countries into four groups: France, Italy, Belgium, and Other. After performing some analysis, you find that the observed distribution of countries of origin for Tour de France stage winners is [0.3, 0.2, 0.1, 0.4]; i.e. 30% of stage winners are French, 20% Italian, etc. Based on census information, the expected population distribution is [0.07, 0.06, 0.01, 0.86]; that is, France’s total population is 7% of the sum of the populations of all countries participating in the Tour, Italy’s is 6%, etc.

You conduct a hypothesis test with the following hypotheses:


Problem 7.1

Which of the following test statistics are appropriate for this hypothesis test? Select all that apply.

Answer: Option 1 and Option 4

  • Option 1 is correct. The absolute difference between the expected proportion of French stage winners and the observed proportion of French stage winners gives the magnitude of the difference in distributions making this a valid test statistic.
  • Option 2 is incorrect. The sum of the differences between the expected population distribution and the observed distribution of stage winners is not a valid test statistic since it indicates a direction to the difference when we only want to know whether these distributions are different.
  • Option 3 is incorrect. The absolute difference between the number of French stage winners and the number of Italian stage winners is not a valid test statistic since the numbers in each population can be different and thus the difference in numbers is not a fair comparison.
  • Option 4 is correct. The sum of the absolute differences between the expected population distribution and the observed distribution of stage winners gives magnitude but does not indicate direction making this a valid test statistic.
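
As a rough numerical check of the options above, the sketch below computes each candidate quantity for the two distributions given in the problem. The variable names are illustrative and not part of the exam. It also shows that the signed differences cancel, while the absolute differences do not.

```python
import numpy as np

# Distributions from the problem statement: France, Italy, Belgium, Other
expected = np.array([0.07, 0.06, 0.01, 0.86])
observed = np.array([0.3, 0.2, 0.1, 0.4])

# Option 1: absolute difference in the French proportion only
france_diff = np.abs(observed[0] - expected[0])

# Option 2: signed differences cancel because both arrays sum to 1
signed_sum = np.sum(observed - expected)

# Option 4: sum of absolute differences (twice the TVD)
abs_sum = np.sum(np.abs(observed - expected))
tvd = abs_sum / 2

print(france_diff, signed_sum, abs_sum, tvd)
```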

Difficulty: ⭐️⭐️

The average score on this problem was 80%.


For the rest of this question, assume that we will be using the Total Variation Distance as our test statistic.


Problem 7.2

Complete the implementation of the simulate and calculate_test_stat functions so that the code below successfully simulates 10,000 test statistics.


expected_dist = [0.07, 0.06, 0.01, 0.86]
observed_dist = [0.3, 0.2, 0.1, 0.4]

def simulate(__(i)__):
    simulated_winners = np.random.__(ii)__(100, __(iii)__)
    return simulated_winners / 100

def calculate_test_stat(__(iv)__, __(v)__):
    return __(vi)__


observed_stat = calculate_test_stat(observed_dist, expected_dist)
simulated_stats = np.array([])
for i in np.arange(10000):
    simulated_dist = simulate(expected_dist)
    simulated_stat = calculate_test_stat(simulated_dist, expected_dist)
    simulated_stats = np.append(simulated_stats, simulated_stat)

Answer:

  • (i): expected_dist
  • (ii): multinomial
  • (iii): expected_dist
  • (iv): simulated_dist
  • (v): expected_dist (or swapped with above)
  • (vi): np.abs(simulated_dist - expected_dist).sum() / 2

When performing a simulation for a hypothesis test, we simulate under the assumption that the null hypothesis is true, which here means drawing from the expected distribution. Thus, the argument for the simulate function (i) should be the expected_dist array.

In this function, we simulate winners based on the expected distribution. So, we want to use np.random.multinomial in (ii), which takes in the number of experiments and the expected distribution, i.e. expected_dist in (iii), which is an array of the probabilities for each of the outcomes.

We are using the total variation distance as the test statistic. The Total Variation Distance (TVD) of two categorical distributions is the sum of the absolute differences of their proportions, all divided by 2. Thus, the arguments of the calculate_test_stat function should be simulated_dist in (iv) and expected_dist in (v) (or swapped).

In this function, we need to return the TVD, which can be calculated as follows: np.abs(simulated_dist - expected_dist).sum() / 2 in (vi).
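
For reference, here is one way the completed code might look with the blanks filled in. Two details are assumptions added for this sketch and are not part of the exam: the distributions are stored as NumPy arrays rather than lists, so that subtraction works elementwise even when both arguments are fixed distributions, and a seed is set only to make the run reproducible.

```python
import numpy as np

np.random.seed(10)  # assumption: added only for reproducibility

# Assumption: arrays instead of lists, so that `-` is elementwise
expected_dist = np.array([0.07, 0.06, 0.01, 0.86])
observed_dist = np.array([0.3, 0.2, 0.1, 0.4])

def simulate(expected_dist):
    # Draw 100 winners from the expected distribution (counts per category)
    simulated_winners = np.random.multinomial(100, expected_dist)
    # Convert counts to proportions
    return simulated_winners / 100

def calculate_test_stat(simulated_dist, expected_dist):
    # Total variation distance between two categorical distributions
    return np.abs(simulated_dist - expected_dist).sum() / 2

observed_stat = calculate_test_stat(observed_dist, expected_dist)
simulated_stats = np.array([])
for i in np.arange(10000):
    simulated_dist = simulate(expected_dist)
    simulated_stat = calculate_test_stat(simulated_dist, expected_dist)
    simulated_stats = np.append(simulated_stats, simulated_stat)
```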


Difficulty: ⭐️⭐️

The average score on this problem was 89%.


Problem 7.3

Fill in the blank in the following code so that p_value evaluates to the correct p-value for this hypothesis test:

    p_value = np.mean(simulated_stats ___ observed_statistic)

Answer: >=

Recall that the p-value is the probability, under the null hypothesis, of seeing a result equal to or more extreme than the observed value. Our test statistic is the TVD, and larger TVDs indicate more extreme results. So we use >= in the blank to check whether each simulated statistic is equal to or more extreme than the observed statistic.
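
A minimal, self-contained sketch of the p-value calculation follows. It assumes the distributions from Problem 7.2, draws all 10,000 samples in one vectorized call instead of a loop, and uses the name observed_stat from Problem 7.2. The helper name tvd and the seed are choices made for this sketch.

```python
import numpy as np

np.random.seed(10)  # assumption: added only for reproducibility

expected_dist = np.array([0.07, 0.06, 0.01, 0.86])
observed_dist = np.array([0.3, 0.2, 0.1, 0.4])

def tvd(dist1, dist2):
    # Sum absolute differences over the last axis, then halve
    return np.abs(dist1 - dist2).sum(axis=-1) / 2

observed_stat = tvd(observed_dist, expected_dist)

# Each row is one simulated sample of 100 winners, as proportions
simulated_dists = np.random.multinomial(100, expected_dist, size=10000) / 100
simulated_stats = tvd(simulated_dists, expected_dist)

# Proportion of simulated TVDs at least as large as the observed TVD
p_value = np.mean(simulated_stats >= observed_stat)
print(p_value)
```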


Difficulty: ⭐️⭐️

The average score on this problem was 80%.



Problem 8

Choose the best tool to answer each of the following questions. Note the following:


Problem 8.1

What is the median distance of all Tour de France stages?

Answer: Bootstrapping

Since we want the median distance of all the Tour de France stages, we are not testing anything against a hypothesis at all, which rules out hypothesis testing and permutation testing. We use bootstrapping to get samples in lieu of a population. Bootstrapping Tour de France distances will give resamples of distances, from which we can estimate the median distance of Tour de France stages.
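
The idea can be sketched with NumPy alone. The distances below are hypothetical values invented for illustration, not data from stages, and the 95% confidence level is likewise an assumption of this sketch.

```python
import numpy as np

np.random.seed(10)  # assumption: added only for reproducibility

# Hypothetical sample of stage distances in km (not real data)
sample_distances = np.array([152, 180, 195, 201, 210, 176, 233, 164,
                             188, 207, 221, 143, 199, 215, 170])

boot_medians = np.array([])
for i in np.arange(1000):
    # Resample with replacement, same size as the original sample
    resample = np.random.choice(sample_distances, len(sample_distances), replace=True)
    boot_medians = np.append(boot_medians, np.median(resample))

# Middle 95% of the bootstrapped medians
left = np.percentile(boot_medians, 2.5)
right = np.percentile(boot_medians, 97.5)
print(left, right)
```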


Difficulty: ⭐️⭐️

The average score on this problem was 75%.


Problem 8.2

Is the distribution of Tour de France stage types from before 1960 the same as after 1960?

Answer: Permutation Testing

We are asking whether the distributions before 1960 and after 1960 are different. This calls for permutation testing, which tests whether two samples come from the same population distribution.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 50%.


Problem 8.3

Are there an equal number of destinations that start with letters from the first half of the alphabet and destinations that start with letters from the second half of the alphabet?

Answer: Hypothesis Testing

We are testing whether two values are equal to each other: the number of destinations that start with letters from the first half of the alphabet and the number that start with letters from the second half. This is an indicator to use a hypothesis test.
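
One possible way to set up such a test is sketched below. The counts are hypothetical, and both the null model (each destination is equally likely to fall in either half of the alphabet) and the test statistic are assumptions of this sketch, not something the exam specifies.

```python
import numpy as np

np.random.seed(10)  # assumption: added only for reproducibility

# Hypothetical counts (not real data)
n_destinations = 240
observed_first_half = 130

# Test statistic: distance of the first-half proportion from 0.5
observed_stat = abs(observed_first_half / n_destinations - 0.5)

# Simulate under the null: each destination is first-half with probability 0.5
simulated_first = np.random.multinomial(n_destinations, [0.5, 0.5], size=10000)[:, 0]
simulated_stats = np.abs(simulated_first / n_destinations - 0.5)

p_value = np.mean(simulated_stats >= observed_stat)
print(p_value)
```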


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 50%.


Problem 8.4

Are mountain stages with destinations in France from before 1970 longer than flat stages with destinations in Belgium from after 2000?

Answer: Permutation Testing

We are comparing two groups (mountain stages with destinations in France from before 1970 and flat stages with destinations in Belgium from after 2000) and asking whether stages in the first group tend to be longer than stages in the second. Since we are comparing two samples, we want to perform a permutation test.
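
A minimal sketch of such a permutation test follows. The two groups of distances are hypothetical values, and the difference in group means is one reasonable choice of test statistic assumed here.

```python
import numpy as np

np.random.seed(10)  # assumption: added only for reproducibility

# Hypothetical distances in km (not real data)
group_a = np.array([210, 245, 198, 260, 230, 225, 240, 215])  # mountain, France, pre-1970
group_b = np.array([180, 195, 170, 205, 160, 190])            # flat, Belgium, post-2000

all_distances = np.concatenate([group_a, group_b])
n_a = len(group_a)

# Test statistic: mean of group A minus mean of group B
observed_diff = group_a.mean() - group_b.mean()

diffs = np.array([])
for i in np.arange(2000):
    # Shuffle the distances, which breaks any link between group and distance
    shuffled = np.random.permutation(all_distances)
    diff = shuffled[:n_a].mean() - shuffled[n_a:].mean()
    diffs = np.append(diffs, diff)

# "Longer" corresponds to large positive differences, so use >=
p_value = np.mean(diffs >= observed_diff)
print(p_value)
```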


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 40%.



Problem 9

Suppose the distance of a Tour de France stage and the time it takes to complete it are linearly associated with correlation coefficient r = \frac{2}{3}. Assume distances have a mean of 200 km and a standard deviation of 80 km. Times have a mean of 6 hours.


Problem 9.1

Suppose the regression line to predict the time a stage will take (in hours) based on its length (in km) predicts that a 160 km long stage will take 5 hours. What is the standard deviation of the time it takes to complete a stage?

Answer: 3 hours

We know that \bar{x} = 200, \sigma_x = 80, r = \frac{2}{3}, and \bar{y} = 6. In this problem we are given x = 160 and y = 5. In order to find the standard deviation of time here, we can start by standardizing our values:

x_{\text{su}} = \frac{160-200}{80} = -\frac{1}{2}

Then according to the formula: \text{Predicted} \: y_{\text{su}} = r \cdot x_{\text{su}}

\text{Predicted} \: y_{\text{su}} = \frac{2}{3} \cdot \left(-\frac{1}{2}\right) = -\frac{1}{3}

Now that we have the predicted y in standard units, we can plug it into this formula to solve for the standard deviation of y:

y_{\text{su}} = \frac{y - \bar{y}}{\sigma_y}

-\frac{1}{3} = \frac{5-6}{\sigma_y}

\sigma_y = 3
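
The arithmetic above can be double-checked with a few lines of Python. The variable names are illustrative only.

```python
# Values given in the problem
mean_x, sd_x = 200, 80      # distance, in km
mean_y = 6                  # time, in hours
r = 2 / 3

x, predicted_y = 160, 5

# Convert x to standard units, then predict y in standard units
x_su = (x - mean_x) / sd_x
predicted_y_su = r * x_su

# Rearranging predicted_y_su = (predicted_y - mean_y) / sd_y
sd_y = (predicted_y - mean_y) / predicted_y_su
print(sd_y)  # approximately 3, up to floating-point rounding
```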


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 61%.


Problem 9.2

Suppose, regardless of your answer to part 1, that stage completion times have a standard deviation of 1.5 hours. The other means, SD, and r are unchanged.

Stages in the middle of the Tour tend to be longer than those at the ends. Stage 14 is 60 km longer than stage 20, so we would expect it to take longer based on our linear association. How large will the difference in our predictions of stage completion times be?

Answer: \frac{3}{4} hours longer.

Since we are interested in how much longer stage 14 is predicted to take than stage 20, we first calculate the slope of the regression line. Note that r=\frac{2}{3}, \text{SD}_y=1.5, and \text{SD}_x=80:

m = r \cdot \frac{\text{SD}_y}{\text{SD}_x}

m = \frac{2}{3} \cdot \frac{1.5}{80} = \frac{1}{80}

This means that for every additional 1 km, the predicted time increases by \frac{1}{80} hours.

Since Stage 14 is 60 km longer than Stage 20, we simply multiply our slope by 60, giving \frac{60}{80} = \frac{3}{4}. Thus we expect Stage 14 to take \frac{3}{4} hours longer.
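
As a quick check of the calculation above, the slope and the difference in predictions can be computed directly. The variable names are illustrative only.

```python
# Values given in the problem
r = 2 / 3
sd_x = 80     # km
sd_y = 1.5    # hours

# Slope of the regression line, in hours per km
slope = r * sd_y / sd_x

# The intercept cancels when subtracting two predictions,
# so the difference depends only on the slope
extra_distance = 60  # km
prediction_difference = slope * extra_distance
print(slope, prediction_difference)
```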


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 48%.


Problem 9.3

Suppose a mandatory rest break of 30 minutes (0.5 hours) is implemented for all Tour de France stages. How would the slope of the regression line change?

Answer: It would stay the same.

Adding a 30 minute break to all the stages simply increases each stage’s time by an additional 30 minutes. This would not change the slope, since time is the variable being predicted: adding 0.5 hours to every time shifts all the data points up by the same amount, which raises the intercept by 0.5 but doesn’t change the relationship between distance and time.
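
This can be illustrated on synthetic data. The distances and times below are randomly generated for illustration and are not taken from stages; the helper function name is also an assumption of this sketch.

```python
import numpy as np

np.random.seed(0)  # assumption: added only for reproducibility

# Synthetic data (not real stages): distance in km, time in hours
distances = np.random.uniform(50, 350, size=200)
times = 3.5 + distances / 80 + np.random.normal(0, 0.5, size=200)

def slope_and_intercept(x, y):
    # Regression line via r, the SDs, and the means
    r = np.corrcoef(x, y)[0, 1]
    slope = r * np.std(y) / np.std(x)
    intercept = np.mean(y) - slope * np.mean(x)
    return slope, intercept

slope_before, intercept_before = slope_and_intercept(distances, times)
# Add a 0.5 hour rest break to every stage
slope_after, intercept_after = slope_and_intercept(distances, times + 0.5)

print(slope_before, slope_after)          # slopes match
print(intercept_after - intercept_before) # intercept rises by about 0.5
```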


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 55%.


Problem 9.4

Suppose the means and standard deviations above do not change (continue to assume times have a standard deviation of 1.5 hours), but the correlation coefficient r is different. If we predict a 360 km stage will take 9 hours, what is the value of r? Write a single number for r or “N/A” if it is not possible to answer.

Answer: r=1

We can follow a similar process to part 1, but instead solve for r now. First, we calculate x in standard units:

x_{\text{su}} = \frac{x - \bar{x}}{\text{SD}_x}

x_{\text{su}} = \frac{360-200}{80} = \frac{160}{80} = 2

Now find y in standard units:

y_{\text{su}} = \frac{y - \bar{y}}{\text{SD}_y}

y_{\text{su}} = \frac{9-6}{1.5} = \frac{3}{1.5} = 2

Now we can solve for r using y_{\text{su}} = r \cdot x_{\text{su}}:

2 = r \cdot 2

r=1
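
The same steps in Python, as a rough check. The final comparison reflects that a correlation coefficient must lie between -1 and 1; a value outside that range would be a case for answering “N/A”. Variable names are illustrative only.

```python
# Values given in the problem
mean_x, sd_x = 200, 80    # distance, in km
mean_y, sd_y = 6, 1.5     # time, in hours

x, predicted_y = 360, 9

# Standardize both the input and the prediction
x_su = (x - mean_x) / sd_x
y_su = (predicted_y - mean_y) / sd_y

# predicted y_su = r * x_su, so r is their ratio
r = y_su / x_su
is_valid_r = -1 <= r <= 1
print(r, is_valid_r)
```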


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 35%.


