Winter 2024 Final Exam



Instructor(s): Janine Tiefenbruck

This exam was administered in-person. The exam was closed-notes, except students were provided a copy of the DSC 10 Reference Sheet. No calculators were allowed. Students had 3 hours to take this exam.


The Olympic Games are the world’s leading international sporting event, dating back to ancient Greece. Today, we’ll explore data on modern Olympic medalists. Each row in the DataFrame olympians corresponds to a type of medal earned by one Olympic athlete in one year.

The columns of olympians are as follows:

The first few rows of olympians are shown below, though olympians has many more rows than pictured. The data in olympians is only a sample from the much larger population of all Olympic medalists.


Throughout this exam, assume that we have already run import babypandas as bpd and import numpy as np.


Problem 1

Which of the following columns would be an appropriate index for the olympians DataFrame?

Answer: None of these.

To decide what an appropriate index would be, we need to keep in mind that in each row, the index should have a unique value – that is, we want the index to uniquely identify rows of the DataFrame. In this case, there will obviously be repeats in "team" and "sport", since these will appear multiple times for each Olympic event. For the "name" column, we first must consider that there could be multiple people with the same name, which means we can’t assume that all the rows have unique values. Additionally, it’s even more likely that the same athlete could compete in multiple Olympics (for example, Michael Phelps competed in both 2008 and 2004). So, none of these options is a valid index.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 74%.


Problem 2

Frank X. Kugler has Olympic medals in three sports (wrestling, weightlifting, and tug of war), which is more sports than any other Olympic medalist. Furthermore, his medals for all three of these sports are included in the olympians DataFrame. Fill in the blanks below so that the expression below evaluates to "Frank X. Kugler".

                (olympians.groupby(__(a)__).__(b)__
                          .reset_index()
                          .groupby(__(c)__).__(d)__
                          .sort_values(by="Age", ascending=False)
                          .index[0])


Problem 2.1

What goes in blank (a)?

Answer: ['Name', 'Sport'] or ['Sport', 'Name']

The question wants us to find the name (Frank X. Kugler) who has records that correspond to three distinct sports. However, we know that the same athlete might have multiple records not because they participated in different sports but because they participated in the same sport for multiple years. Therefore, first, we groupby ‘Name’ and ‘Sport’ to create a dataframe with unique Name-Sport pairs. This is a dataframe that contains the athletes and their sports (for each athlete, their corresponding sports are distinct).


Difficulty: ⭐️⭐️

The average score on this problem was 75%.


Problem 2.2

What goes in blank (b)?

Answer: .sum() or .mean() or .min(), etc.

Any aggregation method that can be applied to a DataFrameGroupBy works here. We don’t care about the aggregated numeric values; we just want to collapse cases in which the same athlete medals in the same sport in multiple years, so that we get unique Name-Sport pairs. Note that .unique() is not correct, because unique is not an aggregation method that can be used on a DataFrame after grouping. If you use .unique(), you will get “AttributeError: ‘DataFrameGroupBy’ object has no attribute ‘unique’”. However, .unique() can be used on a SeriesGroupBy. For more info: https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.SeriesGroupBy.unique.html.


Difficulty: ⭐️

The average score on this problem was 96%.


Problem 2.3

What goes in blank (c)?

Answer: 'Name'

Now after reset_index, we have ‘Name’ and ‘Sport’ columns containing unique name-sport pairs. The objective is to count how many different sports each Olympian has medals in. To do that, we groupby ‘Name’ and later use the .count() method. This would, in effect, give a new DataFrame that has a count of how many times each name shows up in our previous DataFrame with unique Name-Sport pairs.


Difficulty: ⭐️⭐️

The average score on this problem was 81%.


Problem 2.4

What goes in blank (d)?

Answer: .count()

The .count() method is applied to each group. In this context, .count() gives the number of rows for each Olympian. Since the previous steps ensured that each sport appears only once per Olympian (due to the initial groupby on both 'Name' and 'Sport'), this count is the number of distinct sports in which the Olympian has medals. It does not matter which column we sort_values by, because .groupby('Name').count() puts the same count for each 'Name' in every column, regardless of the column name or what value was originally in it.
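The full chain from Problem 2 can be sketched on a toy DataFrame. This is only an illustration: the rows below (including "A. Sample" and all ages) are made up, and plain pandas is used as a stand-in on the assumption that babypandas behaves the same way for these methods.

```python
import pandas as pd  # assumption: stand-in for babypandas, which mirrors these pandas methods

# Hypothetical toy data: one athlete with three sports, one with a single sport in two years.
olympians = pd.DataFrame({
    "Name": ["Frank X. Kugler", "Frank X. Kugler", "Frank X. Kugler",
             "A. Sample", "A. Sample"],
    "Sport": ["Wrestling", "Weightlifting", "Tug-Of-War",
              "Swimming", "Swimming"],
    "Year": [1904, 1904, 1904, 2004, 2008],
    "Age": [25, 25, 25, 19, 23],  # made-up ages
})

top_name = (olympians.groupby(["Name", "Sport"]).count()  # one row per Name-Sport pair
                     .reset_index()                        # bring Name and Sport back as columns
                     .groupby("Name").count()              # number of distinct sports per name
                     .sort_values(by="Age", ascending=False)
                     .index[0])
print(top_name)
```

Even though "A. Sample" has two rows, both are in the same sport, so the first groupby collapses them into one Name-Sport pair before the second groupby counts.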


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 64%.



Problem 3

In olympians, "Weight" is measured in kilograms. There are 2.2 pounds in 1 kilogram. If we converted "Weight" to pounds instead, which of the following quantities would increase?

Answer: Options 1, 2, and 5 are correct.

  1. The mean of the new distribution where weight is in pounds would be given by 2.2 \cdot \text{mean in kilograms}. Since the mean in kilograms is positive, this quantity must then increase if we multiply it by 2.2. Intuitively this should make sense as we are basically scaling all the (positive) values by 2.2 (a positive constant), so we expect the values to increase and thus the mean to increase. So option 1 is correct.
  2. The standard deviation of the new distribution where weight is in pounds would be given by 2.2 \cdot \text{standard deviation in kilograms}. Since the standard deviation in kilograms is positive, this quantity must then increase if we multiply it by 2.2. Intuitively this should make sense as we are scaling all the values by 2.2, so the difference between larger and smaller values will be greater once they are scaled. Thus, we expect the spread of our distribution to be larger, and the standard deviation to increase. So option 2 is correct.
  3. We are interested in the proportion of values x_{i} that satisfy \vert \frac{x_{i} - μ}{σ} \vert<3, with x_{i} being the values of "Weight", μ being the mean in kg and σ being the standard deviation in kg. Once we scale everything by 2.2 to convert from kilograms to pounds, we have that
    \vert \frac{2.2x_{i}-2.2μ}{2.2σ}\vert<3 \implies \frac{2.2}{2.2}\vert \frac{x_{i} - μ}{σ}\vert<3 \implies \vert \frac{x_{i} - μ}{σ}\vert<3.
    Notice that after we scaled it to pounds, the equation still ends up the exact same as the one in kg. Thus, the proportion of "Weight" within 3 standard deviations of the mean stays the same. Intuitively this should make sense because we are scaling everything by the same amount, so the proportion of points that are a specific number of standard deviations away from the mean should be the same, since the standard deviation and mean get scaled as well. So option 3 is incorrect.
  4. Recall that the correlation coefficient, r, is the average value of the product of two variables, when both variables are measured in standard units. In kilograms, "Weight"(kg) in standard units is \frac{x_{i} - μ}{σ}. Similar to option 3, "Weight"(pounds) in standard units is \frac{2.2x_{i}-2.2μ}{2.2σ} = \frac{2.2}{2.2} \cdot \frac{x_{i} - μ}{σ} = \frac{x_{i} - μ}{σ}. Again, notice that the expression in pounds ends up exactly the same as in kilograms. The same applies for "Height" in standard units. Since none of the variables change when measured in standard units, r doesn’t change. So option 4 is incorrect.
  5. Intuitively, this makes sense when we imagine a scatterplot with the y-axis (value being predicted) representing "Weight" and the x-axis representing "Height". We expect that the taller someone is, the heavier they will be. So we can expect a positive regression line slope between Weight and Height. When we convert Weight from kg to pounds, we are scaling every value in "Weight", making their values increase. When we scale the weight values (y-values) to become bigger, we are making the regression slope even steeper, because an increase in Height (x) now corresponds to an even larger increase in Weight (y). So option 5 is correct.
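The five claims above can be checked numerically. The sketch below uses made-up heights and weights (not values from olympians) and hypothetical helper names; it assumes the course's definitions of standard units, correlation, and regression slope.

```python
import numpy as np

# Hypothetical heights (cm) and weights (kg), made up for illustration.
height = np.array([160.0, 170.0, 175.0, 180.0, 190.0])
weight_kg = np.array([55.0, 68.0, 70.0, 80.0, 92.0])
weight_lb = 2.2 * weight_kg

def standard_units(x):
    # Subtract the mean and divide by the SD.
    return (x - x.mean()) / x.std()

def correlation(x, y):
    # Average product of the two variables in standard units.
    return (standard_units(x) * standard_units(y)).mean()

def slope(x, y):
    # Regression slope for predicting y from x.
    return correlation(x, y) * y.std() / x.std()

def prop_within_3_sds(x):
    return np.count_nonzero(np.abs(standard_units(x)) < 3) / len(x)

print(weight_lb.mean() / weight_kg.mean())                        # mean scales by about 2.2
print(weight_lb.std() / weight_kg.std())                          # SD scales by about 2.2
print(prop_within_3_sds(weight_kg), prop_within_3_sds(weight_lb))  # same proportion
print(correlation(height, weight_kg), correlation(height, weight_lb))  # same r
print(slope(height, weight_lb) / slope(height, weight_kg))        # slope scales by about 2.2
```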

Difficulty: ⭐️⭐️

The average score on this problem was 80%.


Problem 4

The Olympics are held every two years, in even-numbered years, alternating between the Summer Olympics and Winter Olympics. Summer Olympics are held in years that are a multiple of 4 (such as 2024), and Winter Olympics are held in years that are not a multiple of 4 (such as 2022 or 2026).

We want to add a column to olympians that contains either "Winter" or "Summer" by applying a function called season as follows:

    olympians.assign(Season=olympians.get("Year").apply(season))

Which of the following definitions of season is correct? Select all that apply.

Notes:

Way 1:

        def season(year):
            if year % 4 == 0:
                return "Summer"
            return "Winter"

Way 2:

        def season(year):
            return "Winter"
            if year % 4 == 0:
                return "Summer"

Way 3:

        def season(year):
            if year % 2 == 0:
                return "Winter"
            return "Summer"

Way 4:

        def season(year):
            if int(year / 4) != year / 4:
                return "Winter"
            return "Summer"

Answer: Way 1 and Way 4

  • Way 1: This function first checks if the year is divisible by 4, and returns “Summer” if it is. If the year isn’t divisible by 4, then the code inside that if statement won’t execute, and we move on to the next line of code, which just returns “Winter”, as we wanted. So Way 1 is correct.
  • Way 2 looks similar to Way 1, but has one key difference: the return "Winter" line is before the if statement. Since nothing after a return statement gets executed (assuming that the return statement gets executed), no matter what the year is, this function will always return “Winter”. So Way 2 is incorrect.
  • Way 3 is not correct because it doesn’t account for the fact that all years which are multiples of 4 are also multiples of 2. So even though it gets the 2022 Winter Olympics correct, it will also return “Winter” for 2020, a Summer Olympics year, since 2020 % 2 == 0 even though 2020 % 4 == 0.
  • Way 4 is correct because it uses similar logic as way 1, just using a different method to check for divisibility. Instead of using the modulo operator, we check if casting year / 4 to an integer using int changes its value. If the two aren’t equal, then we know that year / 4 wasn’t an integer before casting, which means that the year isn’t divisible by 4 and we should return “Winter”. If the code inside the if statement doesn’t execute, then we know that the year is divisible by 4, so we return “Summer”.
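One way to confirm the reasoning above is to run all four definitions on a few Olympic years. In this sketch the functions are renamed way1 through way4 (names chosen here so they can coexist), and the test years are picked for illustration.

```python
def way1(year):
    if year % 4 == 0:
        return "Summer"
    return "Winter"

def way2(year):
    return "Winter"
    if year % 4 == 0:  # never reached
        return "Summer"

def way3(year):
    if year % 2 == 0:
        return "Winter"
    return "Summer"

def way4(year):
    if int(year / 4) != year / 4:
        return "Winter"
    return "Summer"

years = [2018, 2020, 2022, 2024]
expected = ["Winter", "Summer", "Winter", "Summer"]

for name, f in [("Way 1", way1), ("Way 2", way2), ("Way 3", way3), ("Way 4", way4)]:
    results = [f(y) for y in years]
    print(name, results, results == expected)
```

Only Way 1 and Way 4 reproduce the expected seasons; Way 2 and Way 3 return "Winter" for every year in this list.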

Difficulty: ⭐️⭐️

The average score on this problem was 89%.


Problem 5

[(16 pts)]

In figure skating, skaters move around an ice rink performing a series of skills, such as jumps and spins. Ylesia has been training for the Olympics, and she has a set routine that she plans to perform.

Let’s say that Ylesia performs a skill successfully if she does not fall during that skill. Each skill comes with its own probability of success, as some skills are harder and some are easier. Suppose that the probabilities of success for each skill in Ylesia’s Olympic routine are stored in an array called skill_success.

For example, if Ylesia’s Olympic routine happened to only contain three skills, skill_success might be the array with values 0.92, 0.84, 0.92. However, her routine can contain any number of skills.


Problem 5.1

Ylesia wants to simulate one Olympic routine to see how many times she might fall. Fill in the function count_falls below, which takes as input an array skill_success and returns as output the number of times Ylesia falls during her Olympic routine.


        def count_falls(skill_success):
            falls = 0
            for p in skill_success:
                result = np.random.multinomial(1, __(a)__)
                falls = __(b)__
            return falls

Answer: (a): [p, 1-p], (b): falls + result[1] OR (a): [1-p, p], (b): falls + result[0]

First, we should think about what np.random.multinomial is doing here. It returns an array of how many times each scenario happened. There are 2 possible scenarios here: Ylesia succeeds or Ylesia falls. In this code, p is the probability that Ylesia succeeds at a skill, and the probability that Ylesia does not succeed (she falls) is 1-p. So to properly simulate how many times she falls, we should put [p, 1-p] in blank (a). This will store an array in result, with index 0 being how many times she succeeded (corresponds to p), and index 1 being how many times she fell (corresponds to 1-p). Since index 1 corresponds to the scenario in which she falls, in order to correctly increase the number of falls, we add result[1] to falls. Therefore, blank (b) is falls + result[1]. Likewise, you can change the order with (a): [1-p, p] and (b): falls + result[0] and it would still correctly simulate how many times she falls.
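With the first answer filled in, the function looks like the sketch below. The three-skill array is the example from the problem statement; the printed value is random and will vary between runs.

```python
import numpy as np

def count_falls(skill_success):
    falls = 0
    for p in skill_success:
        # One trial per skill: result[0] counts successes, result[1] counts falls.
        result = np.random.multinomial(1, [p, 1 - p])
        falls = falls + result[1]
    return falls

skill_success = np.array([0.92, 0.84, 0.92])
print(count_falls(skill_success))  # some integer from 0 to 3
```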


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 59%.


Problem 5.2

Fill in the blanks below so that prob_no_falls evaluates to the exact probability of Ylesia performing her entire routine without falling.

        prob_no_falls = __(a)__
        for p in skill_success:
            prob_no_falls = __(b)__
        prob_no_falls

Answer: (a): 1, (b): prob_no_falls * p

The code given first asks us to define the variable prob_no_falls outside the for loop, and then assign something to that variable inside the for loop. To fill in the blanks, let’s first look at what we are looping through. We know that skill_success is an array of probabilities of success for each skill. In other words, they are the probabilities that Ylesia does not fall on each skill. So to find the probability of not falling during the whole routine, we want the probability that Ylesia does not fall on the first skill, and the second skill, and so on. To do so, we multiply all of the probabilities in the skill_success array together (treating the skills as independent). In the for loop, the variable p is the probability of not falling on one skill, and we want to multiply it with prob_no_falls on each iteration. This leads to prob_no_falls * p as the answer for blank (b). As for blank (a), prob_no_falls is a running product, so we do not want to start it at 0, since that would make every product 0. Starting at 1 leaves the first probability unchanged, so 1 is the correct answer.
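As a quick worked instance, here is the completed loop run on the three-skill example array from the problem statement (0.92, 0.84, 0.92). A plain list is used here instead of a NumPy array only to keep the sketch dependency-free.

```python
skill_success = [0.92, 0.84, 0.92]  # example values from the problem statement

prob_no_falls = 1
for p in skill_success:
    prob_no_falls = prob_no_falls * p

print(round(prob_no_falls, 6))  # 0.92 * 0.84 * 0.92
```

Starting from 0 instead of 1 would leave prob_no_falls at 0 after every iteration, which is why blank (a) must be 1.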


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 72%.


Problem 5.3

Fill the blanks below so that approx_prob_no_falls evaluates to an estimate of the probability that Ylesia performs her entire routine without falling, based on 10,000 trials. Feel free to use the function you defined in part (a) as part of your solution.

        results = np.array([])
        for i in np.arange(10000):
            results = np.append(results, __(a)__)
        approx_prob_no_falls = __(b)__
        approx_prob_no_falls

Answer: (a): count_falls(skill_success), (b): np.count_nonzero(results == 0) / 10000, though there are many other correct solutions

For this question, we are running a simulation to estimate the probability of Ylesia not falling during her routine, based on 10000 trials. To do so, we want to find the number of trials, out of 10000, in which Ylesia did not fall on any skill during her routine. In the given code, we have an array to which we append something for each trial. We can use the function defined in part (a) to calculate the number of times Ylesia falls during a single trial, so blank (a) will be count_falls(skill_success). After 10000 iterations, we have an array of the number of falls for each trial. Then, we want to count the number of times that we get 0 in that array, which means Ylesia did not fall. Lastly, to get the probability, we need to divide by the total number of trials, which is 10000. This gives us the answer for blank (b): np.count_nonzero(results == 0) / 10000.
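The simulation can be run end to end and compared against the exact product of the probabilities. This sketch assumes the three-skill example array from the problem statement and an arbitrarily chosen seed; count_falls is repeated so the block runs on its own.

```python
import numpy as np

np.random.seed(10)  # arbitrary seed, only so that reruns give the same estimate

skill_success = np.array([0.92, 0.84, 0.92])  # example values from the problem statement

def count_falls(skill_success):
    falls = 0
    for p in skill_success:
        result = np.random.multinomial(1, [p, 1 - p])
        falls = falls + result[1]
    return falls

results = np.array([])
for i in np.arange(10000):
    results = np.append(results, count_falls(skill_success))
approx_prob_no_falls = np.count_nonzero(results == 0) / 10000

exact_prob_no_falls = np.prod(skill_success)
print(approx_prob_no_falls, exact_prob_no_falls)
```

The two printed numbers should be close but not identical, since the first is a simulation-based estimate and the second is an exact calculation.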


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 66%.



Problem 6

Suppose we sample 400 rows of olympians at random without replacement, then generate a 95% CLT-based confidence interval for the mean age of Olympic medalists based on this sample.


Problem 6.1

The CLT is stated for samples drawn with replacement, but in practice, we can often use it for samples drawn without replacement. What is it about this situation that makes it reasonable to still use the CLT despite the sample being drawn without replacement?

Answer: The sample is much smaller than the population.

The Central Limit Theorem (CLT) states that regardless of the shape of the population distribution, the sampling distribution of the sample mean will be approximately normally distributed if the sample size is sufficiently large. It’s often stated in the context of samples drawn with replacement, but in practice, it can still be applicable to samples drawn without replacement under certain conditions. In this context, although the sample is drawn without replacement, the key factor that makes it reasonable to still use the CLT, is the sample size relative to the population size. When the sample size is much smaller than the population size, as in this case where 400 rows of Olympians are sampled from a likely much larger population of Olympians, the effect of sampling without replacement becomes negligible.


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 47%.


Problem 6.2

Suppose our 95% CLT-based confidence interval for the mean age of Olympic medalists is [24.9, 26.1]. What was the mean of ages in our sample?

Answer: Mean = 25.5

We calculate the mean by first determining the width of our interval: 26.1 - 24.9 = 1.2, then we divide this width in half to get 0.6 which represents the distance from the mean to each side of the confidence interval. Using this we can find the mean in two ways: 24.9 + 0.6 = 25.5 OR 26.1 - 0.6 = 25.5


Difficulty: ⭐️

The average score on this problem was 90%.


Problem 6.3

Suppose our 95% CLT-based confidence interval for the mean age of Olympic medalists is [24.9, 26.1]. What was the standard deviation of ages in our sample?

Answer: Standard deviation = 6

We can calculate the sample standard deviation (sample SD) by using the 95% confidence interval equation:

\left[\text{sample mean} - 2 \cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}}, \ \text{sample mean} + 2 \cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}}\right].

Choose one of the end points and start plugging in all the information you have/calculated:

25.5 - 2 \cdot \frac{\text{sample SD}}{\sqrt{400}} = 24.9

\text{sample SD} = \frac{(25.5 - 24.9)}{2} \cdot \sqrt{400} = 6.
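The arithmetic for Problems 6.2 and 6.3 can be reproduced in a few lines. The variable names are chosen here for illustration; the factor of 2 is the one used in the confidence interval formula above.

```python
import math

lower, upper = 24.9, 26.1
sample_size = 400

sample_mean = (lower + upper) / 2   # midpoint of the interval
margin = (upper - lower) / 2        # distance from the mean to either endpoint

# margin = 2 * sample_sd / sqrt(sample_size), solved for sample_sd
sample_sd = margin * math.sqrt(sample_size) / 2

print(round(sample_mean, 2), round(sample_sd, 2))  # 25.5 6.0
```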


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 55%.



Problem 7

In our sample, we have data on 1210 medals for the sport of gymnastics. Of these, 126 were awarded to American gymnasts, 119 were awarded to Romanian gymnasts, and the remaining 965 were awarded to gymnasts from other nations.

We want to do a hypothesis test with the following hypotheses.

Null: American and Romanian gymnasts win an equal share of Olympic gymnastics medals.

Alternative: American gymnasts win more Olympic gymnastics medals than Romanian gymnasts.


Problem 7.1

Which test statistic could we use to test these hypotheses?

Answer: difference in the number of medals won by American gymnasts and the number of medals won by Romanian gymnasts

To test this pair of hypotheses, we need a test statistic that is large when the data suggests that we reject the null hypothesis, and small when we fail to reject the null. Now let’s look at each option:
- Option 1: Say for example Romanians won half the total medals, Americans won the other half, and no other country won any medals. In this situation, the Romanians and Americans won an equal number of medals, but the total variation distance would still be large, suggesting that we reject the null hypothesis, even though them winning an equal number of medals should suggest that we fail to reject the null hypothesis. Likewise, if Romanians won all the medals and no other country won any medals, the test statistic would still be large, suggesting that we reject the null hypothesis for the alternative hypothesis that Americans win more medals than Romanians, even though in our hypothetical situation Americans won no medals and Romanians won all of them.
- Option 2: This test statistic doesn’t take into account the number of medals Romanians won. Imagine a situation where Romanians won half of all the medals and Americans won the other half, and no other country won any medals. In here, they won the same amount of medals and the test statistic would be 1/2. Now imagine if Americans win half the medals, some other country won the other half, and Romanians won no medals. In this case, the Americans won a lot more medals than Romanians but the test statistic is still 1/2. A good test statistic should point to one hypothesis when it’s large and the other hypothesis when it’s small. In this test statistic, 1/2 points to both hypotheses, making it a bad test statistic.
- Option 3: In this test statistic, when Americans win an equal amount of medals as Romanians, the test statistic would be 0, a very small number. When Americans win way more medals than Romanians, the test statistic is large, suggesting that we reject the null hypothesis in favor of the alternative. You might notice that when Romanians win way more medals than Americans, the test statistic would be negative, suggesting that we fail to reject the null hypothesis that they won equal medals. But recall that failing to reject the null doesn’t necessarily mean we think the null is true, it just means that under our null hypothesis and alternative hypothesis, the null is plausible. The important thing is that the test statistic points to the alternative hypothesis when it’s large, and points to the null hypothesis when it’s small. This test statistic does just that, so option 3 is the correct answer.
- Option 4: This statistic is large when American gymnasts win 100% of the Olympic medals (|1-1/2| = 1/2). It gets smaller as American gymnasts win an equal number of medals as Romanian ones (|1/2-1/2|=0), but then it gets large again as American gymnasts win fewer medals than Romanian ones (|1/100-99/100|=98/100). This means that when Americans win far fewer medals than Romanians, this test statistic would still be large, suggesting that the alternative hypothesis is true, even though the alternative hypothesis is that Americans win more medals than Romanians.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 73%.


Problem 7.2

Below are four different ways of testing these hypotheses. In each case, fill in the calculation of the observed statistic in the variable observed, such that p_val represents the p-value of the hypothesis test.

Way 1:

        many_stats = np.array([])
        for i in np.arange(10000):
            result = np.random.multinomial(245, [0.5, 0.5]) / 245
            many_stats = np.append(many_stats, result[0] - result[1])
        observed = __(a)__
        p_val = np.count_nonzero(many_stats >= observed)/len(many_stats)

Way 2:

        many_stats = np.array([])
        for i in np.arange(10000):
            result = np.random.multinomial(245, [0.5, 0.5]) / 245
            many_stats = np.append(many_stats, result[0] - result[1])
        observed = __(b)__
        p_val = np.count_nonzero(many_stats <= observed)/len(many_stats)

Way 3:

        many_stats = np.array([])
        for i in np.arange(10000):
            result = np.random.multinomial(245, [0.5, 0.5]) / 245
            many_stats = np.append(many_stats, result[0])
        observed = __(c)__ 
        p_val = np.count_nonzero(many_stats >= observed)/len(many_stats)

Way 4:

        many_stats = np.array([])
        for i in np.arange(10000):
            result = np.random.multinomial(245, [0.5, 0.5]) / 245
            many_stats = np.append(many_stats, result[0])
        observed = __(d)__ 
        p_val = np.count_nonzero(many_stats <= observed)/len(many_stats)

Answer: Way 1: 126/245 - 119/245 or 7/245

First, let’s look at what this code is doing. The line result = np.random.multinomial(245, [0.5, 0.5]) / 245 simulates splitting the 245 medals between American and Romanian gymnasts under the null hypothesis, producing an array of length 2 whose elements are the numbers of medals won by American gymnasts and Romanian gymnasts, respectively. We then divide this array by 245 (which is the sum 126+119) to turn the counts into proportions. This array of proportions is assigned to result. For example, one of our 10000 repetitions could assign np.array([124/245, 121/245]) to result. The following line, many_stats = np.append(many_stats, result[0] - result[1]), appends the difference between the first proportion in result and the second proportion in result to many_stats. Using our example, this would append 124/245 - 121/245 (which equals 3/245) to many_stats. To determine how we calculate the observed statistic, we have to consider how we are calculating the p-value. To calculate the p-value, we need to determine how often we see a result as extreme as our observed statistic, or more extreme, in the direction of the alternative hypothesis. The alternative hypothesis states that American gymnasts win more medals than Romanian gymnasts, meaning that we are looking for results in many_stats where the difference is greater than or equal to 126/245 - 119/245 (which equals 7/245). That is, the final line of code in Way 1 uses np.count_nonzero to count the differences in many_stats that are greater than or equal to 7/245. Therefore, observed must equal 7/245.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 56%.

Answer: Way 2: 119/245 - 126/245 or -7/245

The only difference between Way 2 and Way 1 is that in Way 2, the >= is switched to a <=. This causes a result as extreme or more extreme than our observed statistic to now be represented as anything less than or equal to our observed statistic. To account for this, we need to consider the first proportion in result as the proportion of medals won by Romanian gymnasts, and the second proportion as the proportion of medals won by American gymnasts. This flips the sign of every difference. So instead of calculating our observed statistic as 126/245-119/245, we now calculate it as 119/245-126/245 (which equals -7/245).


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 53%.

Answer: Way 3: 126/245

The difference between Way 1 and Way 3 is that Way 3 uses result[0] as its test statistic instead of result[0] - result[1]. Here, result[0] represents the proportion of the 245 medals won by American gymnasts. This means that the observed statistic should be the proportion of medals won by American gymnasts in our sample: (# of American medals) / (# of American medals + # of Romanian medals) = 126/245.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 52%.

Answer: Way 4: 119/245

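Since the comparison is swapped from “>=” in Way 3 to “<=” in Way 4, result[0] now represents the proportion of medals won by Romanian gymnasts. This is because the alternative hypothesis states that American gymnasts win more medals than Romanian gymnasts, which is the same as Romanian gymnasts winning fewer, so small values of the Romanian proportion point toward the alternative. Therefore, the observed statistic is 119/245.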


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 50%.


Problem 7.3

The four p-values calculated in Ways 1 through 4 are:

Answer: similar, but not necessarily the same.

The four ways use different test statistics and different directions of comparison, but they all test the same null and alternative hypotheses. Because each p-value is estimated from its own random simulation, the four p-values will not necessarily be exactly the same, but they should be similar.
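To see that the four p-values come out similar, the four ways can be run side by side. The sketch below is not the exam's code: it replaces the for loop with a single vectorized call using the size argument of np.random.multinomial, the helper name simulate is made up here, and the seed is arbitrary.

```python
import numpy as np

np.random.seed(7)  # arbitrary seed, only for repeatability

def simulate(use_difference):
    # 10,000 simulated splits of 245 medals under the null, as proportions.
    props = np.random.multinomial(245, [0.5, 0.5], size=10000) / 245
    if use_difference:
        return props[:, 0] - props[:, 1]
    return props[:, 0]

# Way 1: difference in proportions, large values are extreme.
p1 = np.count_nonzero(simulate(True) >= 126/245 - 119/245) / 10000
# Way 2: difference in proportions with the roles swapped, small values are extreme.
p2 = np.count_nonzero(simulate(True) <= 119/245 - 126/245) / 10000
# Way 3: first proportion only, large values are extreme.
p3 = np.count_nonzero(simulate(False) >= 126/245) / 10000
# Way 4: first proportion only with the roles swapped, small values are extreme.
p4 = np.count_nonzero(simulate(False) <= 119/245) / 10000

print(p1, p2, p3, p4)
```

Each way draws its own simulated statistics, so the four printed values differ slightly from one another, but they estimate the same underlying probability.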


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 71%.



Problem 8


Problem 8.1

In Olympic hockey, the number of goals a team scores is linearly associated with the number of shots they attempt. In addition, the number of goals a team scores has a mean of 10 and a standard deviation of 5, whereas the number of attempted shots has a mean of 30 and a standard deviation of 10.

Suppose the regression line to predict the number of goals based on the number of shots predicts that for a game with 20 attempted shots, 6 goals will be scored. What is the correlation between the number of goals and the number of attempted shots? Give your answer as an exact fraction or decimal.

Answer: \frac{4}{5}

Recall that the formula of the regression line in standard units is y_{su}=r \cdot x_{su}. Since we are predicting # of goals from the # of shots, let x_{su} represent # of shots in standard units and y_{su} represent # of goals in standard units. Using the formula for standard units with information in the problem, we find x_{su}=\frac{20-30}{10}=(-1) and y_{su}=\frac{6-10}{5}=(-\frac{4}{5}). Hence, (-\frac{4}{5})=r \cdot (-1) and r=\frac{4}{5}.
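The same calculation can be written out step by step. The variable names below are chosen for illustration; the numbers come from the problem statement.

```python
mean_shots, sd_shots = 30, 10
mean_goals, sd_goals = 10, 5

# Convert the point (20 shots, 6 predicted goals) to standard units.
x_su = (20 - mean_shots) / sd_shots   # -1.0
y_su = (6 - mean_goals) / sd_goals    # -0.8

# The regression line in standard units is y_su = r * x_su, so r = y_su / x_su.
r = y_su / x_su
print(r)  # 0.8
```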


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 74%.



Problem 8.2

In Olympic baseball, the number of runs a team scores is linearly associated with the number of hits the team gets. The number of runs a team scores has a mean of 8 and a standard deviation of 4, while the number of hits has a mean of 24 and a standard deviation of 6. Consider the regression line that predicts the number of runs scored based on the number of hits.

  1. What is the maximum possible predicted number of runs for a team that gets 27 hits?

  2. What is the correlation coefficient in the case where the predicted number of runs for a team with 25 hits is as large as possible?

Answer: i) 10

Consider the standard unit regression line again, y_{su}=r \cdot x_{su}. Since we are predicting # of runs from the # of hits, let x_{su} represent # of hits in standard units and y_{su} represent # of runs in standard units. Here, we are hoping to find the maximal y_{su} given the x_{su}. Using the formula for standard units, we know x_{su}=\frac{27-24}{6}=\frac{1}{2}. Because x_{su} is positive, we know that to achieve the maximum prediction in y_{su}, the correlation r must be positive and as large as possible. Since the value of r must be between -1 and 1, r=1. Plugging everything back in, we find that y_{su}=1 \cdot \frac{1}{2}=\frac{1}{2}. We reverse our operations to find the actual predicted # of runs y (not in standard units): \frac{1}{2}=\frac{y-8}{4}, and so y=10.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 63%.

Answer: ii) 1

Keep the same definitions as in part i). We find x_{su}=\frac{25-24}{6}=\frac{1}{6}. Because x_{su} is positive, we know that to achieve the maximum prediction in y_{su}, the correlation r must be positive and as large as possible. Again, since -1 \leq r \leq 1, we know that r=1.
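Both parts can be checked by computing the prediction for several candidate values of r. The function name predicted_runs and the grid of r values are assumptions made for this sketch; the means and standard deviations come from the problem statement.

```python
def predicted_runs(hits, r):
    x_su = (hits - 24) / 6   # hits in standard units
    y_su = r * x_su          # regression line in standard units
    return y_su * 4 + 8      # convert back to runs

# Part i): the largest prediction for 27 hits happens at r = 1.
print(predicted_runs(27, 1))  # 10.0

# Part ii): among a few candidate correlations, find the one that
# maximizes the prediction for 25 hits.
candidate_rs = [-1, -0.5, 0, 0.5, 1]
best_r = max(candidate_rs, key=lambda r: predicted_runs(25, r))
print(best_r)  # 1
```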


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 67%.



Problem 9

In 2024, the Olympics will include breaking (also known as breakdancing) for the first time. The breaking competition will include 16 athletes, who will compete in a single-elimination tournament.

In the first round, all 16 athletes will compete against an opponent in a face-to-face “battle”. The 8 winners, as determined by the judges, will move on to the next round. Elimination continues until the final round contains just 2 competitors, and the winner of this final battle wins the tournament.

The table below shows how many competitors participate in each round:

After the 2024 Olympics, suppose we make a DataFrame called breaking containing information about the performance of each athlete during each round. breaking will have one row for each athlete’s performance in each round that they participated. Therefore, there will be 16+8+4+2 = 30 rows in breaking.

In the "name" column of breaking, we will record the athlete’s name (which we’ll assume to be unique), and in the other columns we’ll record the judges’ scores in the categories on which the athletes will be judged (creativity, personality, technique, variety, performativity, and musicality).


Problem 9.1

How many rows of breaking correspond to the winner of the tournament? Give your answer as an integer.

Answer: 4

Since the winner of the tournament must have won during the 1st, 2nd, 3rd, and final rounds, there will be a total of four rows in breaking corresponding to this winner.


Difficulty: ⭐️

The average score on this problem was 94%.


Problem 9.2

How many athletes’ names appear exactly twice in the "name" column of breaking? Give your answer as an integer.

Answer: 4

For an athlete to appear on exactly two rows in breaking, they must get through the 1st round but get eliminated in the 2nd round. There are a total of 8 athletes in the 2nd round, of which 4 are eliminated.


Difficulty: ⭐️⭐️

The average score on this problem was 82%.


Problem 9.3

\bigstar If we merge breaking with itself on the "name" column, how many rows will the resulting DataFrame have? Give your answer as an integer.

Hint: Parts (a) and (b) of this question are relevant to part (c).

Answer: 74

Let’s consider four separate cases, based on when each athlete is eliminated. Merging breaking with itself on "name" pairs every row for an athlete with every row for that same athlete, so an athlete whose name appears k times in breaking appears k \cdot k times in the merged DataFrame.

  • Two athletes reach the final round, and each of their names appears 4 times in breaking. Each appears 4 \cdot 4 = 16 times in the merged DataFrame.
  • Two athletes are eliminated in the 3rd round, and each of their names appears 3 times in breaking. Each appears 3 \cdot 3 = 9 times in the merged DataFrame.
  • Four athletes are eliminated in the 2nd round, and each of their names appears twice in breaking. Each appears 2 \cdot 2 = 4 times in the merged DataFrame.
  • Eight athletes are eliminated in the 1st round, and each of their names appears once in breaking. Each appears 1 \cdot 1 = 1 time in the merged DataFrame.

Adding these up, 16 \cdot 2 + 9 \cdot 2 + 4 \cdot 4 + 1 \cdot 8 = 74.
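The counting argument can be sketched in plain Python, without building the actual DataFrame. The athlete names below are hypothetical placeholders. The sketch relies on the fact that merging on "name" produces k \cdot k rows for a name that appears k times.

```python
from collections import Counter

# Number of rounds each athlete competes in, following the bracket:
# 2 finalists, 2 out in round 3, 4 out in round 2, 8 out in round 1.
rounds_played = [4] * 2 + [3] * 2 + [2] * 4 + [1] * 8

# One entry per row of breaking (hypothetical names).
names = []
for i, n_rounds in enumerate(rounds_played):
    names.extend([f"athlete_{i}"] * n_rounds)

counts = Counter(names)

# A name appearing k times is paired with itself k * k times in the merge.
merged_rows = sum(k * k for k in counts.values())

# Names appearing exactly twice (part b).
appear_twice = sum(1 for k in counts.values() if k == 2)

print(len(names), appear_twice, merged_rows)
```

The three printed values correspond to the total number of rows in breaking, the answer to part (b), and the answer to part (c).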


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 39%.


Problem 9.4

Recall that the number of competitors in each round is 16, 8, 4, 2. Write one line of code that evaluates to the array np.array([16, 8, 4, 2]). You must use np.arange in your solution, and you may not use np.array or the DataFrame breaking.

Answer: 2 ** np.arange(4, 0, -1)

We notice that 16 = 2^4, 8 = 2^3, 4 = 2^2, and 2 = 2^1. We can use this pattern and write an expression in the form of 2 ** (something). [4, 3, 2, 1] can be generated by np.arange as follows: np.arange(4, 0, -1). 4 is the starting number, 0 is the ending number (exclusive), and -1 is the step size.
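As a quick check, the expression can be evaluated step by step. The second expression is one possible alternative that also uses np.arange; it is an illustration, not the exam's official answer.

```python
import numpy as np

exponents = np.arange(4, 0, -1)   # 4, 3, 2, 1
competitors = 2 ** exponents      # 16, 8, 4, 2

# Alternative: start at 16 and divide by 1, 2, 4, 8.
alternative = 16 // 2 ** np.arange(4)

print(competitors, alternative)
```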


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 38%.



Problem 10

We want to use the sample of data in olympians to estimate the mean age of Olympic beach volleyball players.


Problem 10.1

Which of the following distributions must be normally distributed in order to use the Central Limit Theorem to estimate this parameter?

Answer: None of the Above

The Central Limit Theorem states that the distribution of possible sample means and sample sums is approximately normal, no matter the distribution of the population. Options A, B, and C are not probability distributions of the sum or mean of a large random sample drawn with replacement.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 72%.


Problem 10.2

(10 pts) Next we want to use bootstrapping to estimate this parameter. Which of the following code implementations correctly generates an array called sample_means containing 10,000 bootstrapped sample means?

Way 1:

    sample_means = np.array([])
    for i in np.arange(10000):
        bv = olympians[olympians.get("Sport") == "Beach Volleyball"]
        one_mean = (bv.sample(bv.shape[0], replace=True)
                      .get("Age").mean())
        sample_means = np.append(sample_means, one_mean)

Way 2:

    sample_means = np.array([])
    for i in np.arange(10000):
        bv = olympians[olympians.get("Sport") == "Beach Volleyball"]
        one_mean = (olympians.sample(olympians.shape[0], replace=True)
                             .get("Age").mean())
        sample_means = np.append(sample_means, one_mean)

Way 3:

    sample_means = np.array([])
    for i in np.arange(10000):
        resample = olympians.sample(olympians.shape[0], replace=True)
        bv = resample[resample.get("Sport") == "Beach Volleyball"]
        one_mean = bv.get("Age").mean()
        sample_means = np.append(sample_means, one_mean)

Way 4:

    sample_means = np.array([])
    bv = olympians[olympians.get("Sport") == "Beach Volleyball"]
    for i in np.arange(10000):
        one_mean = (bv.sample(bv.shape[0], replace=True)
                      .get("Age").mean())
        sample_means = np.append(sample_means, one_mean)

Way 5:

    sample_means = np.array([])
    bv = olympians[olympians.get("Sport") == "Beach Volleyball"]
    one_mean = (bv.sample(bv.shape[0], replace=True)
                  .get("Age").mean())
    for i in np.arange(10000):
        sample_means = np.append(sample_means, one_mean)

Answer: Way 1 and Way 4

  • Way 1 first creates a DataFrame bv, which is olympians filtered to contain only beach volleyball players. It then samples from bv with replacement, computes the mean age of that resample, and stores it in the variable one_mean. It does this 10,000 times (due to the for loop), each time creating the DataFrame bv, sampling from it, calculating a mean, and then appending one_mean to the array sample_means. This is a correct way to bootstrap.
  • Way 2 is incorrect because it calculates one_mean using a sample from the entire olympians DataFrame, instead of the bv DataFrame with only the “Beach Volleyball” players. This will result in a sample of players of any sport, and the mean of those ages will be calculated.
  • Way 3 is incorrect because it queries for rows where “Sport” equals “Beach Volleyball” only after resampling the entire DataFrame. Each resample then contains a different number of beach volleyball players, rather than the same number as in the original sample, so these are not valid bootstrap resamples of the beach volleyball players’ ages.
  • Way 4 is basically the same as Way 1, except it creates the DataFrame bv before the for loop. The DataFrame bv will always be the same, so it doesn’t matter whether we create bv before the for loop or inside it.
  • Way 5 is incorrect because the one_mean is calculated only once, but is appended to the sample_means array 10,000 times. As a result, the same mean is being appended, instead of a different mean being calculated and appended each iteration.
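To make the contrast between Way 4 and Way 5 concrete, here is a minimal sketch using a NumPy array of made-up ages in place of bv.get("Age"). The ages, the seed, and the use of NumPy's random generator instead of babypandas' .sample are all assumptions made so the example is self-contained.

```python
import numpy as np

rng = np.random.default_rng(10)

# Hypothetical stand-in for bv.get("Age"): 68 made-up ages.
ages = rng.integers(18, 40, size=68)

# Way 4's structure: the resample and its mean are computed inside the loop.
sample_means = np.array([])
for i in np.arange(10000):
    resample = rng.choice(ages, size=len(ages), replace=True)
    sample_means = np.append(sample_means, resample.mean())

# Way 5's structure: the mean is computed once, outside the loop.
one_mean = rng.choice(ages, size=len(ages), replace=True).mean()
repeated = np.array([])
for i in np.arange(10000):
    repeated = np.append(repeated, one_mean)

# Number of distinct values in each array.
print(len(np.unique(sample_means)), len(np.unique(repeated)))
```

The first array contains many distinct means, while the second contains a single value repeated 10,000 times.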

Difficulty: ⭐️⭐️

The average score on this problem was 88%.


Problem 10.3

For most of the answer choices in part (b), we do not have enough information to predict how the standard deviation of sample_means would come out. There is one answer choice, however, where we do have enough information to compute the standard deviation of sample_means. Which answer choice is this, and what is the standard deviation of sample_means for this answer choice?

Answer: Way 5, with a standard deviation of 0

Way 5 results in a sample_means array with the same mean appended 10,000 times. As a result the standard deviation would be 0 because the entire array would be the same value repeated.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 57%.


Problem 10.4

There are 68 rows of olympians corresponding to beach volleyball players. Assume that in part (b), we correctly generated an array called sample_means containing 10,000 bootstrapped sample mean ages based on this original sample of 68 ages. The standard deviation of the original sample of 68 ages is approximately how many times larger than the standard deviation of sample_means? Give your answer to the nearest integer.

Answer: 8

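Recall that \text{SD of distribution of possible sample means} = \frac{\text{Population SD}}{\sqrt{\text{sample size}}}. When bootstrapping, the original sample plays the role of the population, so the SD of sample_means is approximately the SD of the original sample divided by \sqrt{68}. Equivalently, the SD of the original sample is about \sqrt{68} times larger than the SD of sample_means. Since \sqrt{68} \approx 8.25, the answer to the nearest integer is 8.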
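This relationship can be checked empirically. The sketch below uses 68 made-up ages (an assumption, since the real data is not reproduced here) and compares the ratio of the two SDs to \sqrt{68}.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical original sample of 68 ages.
ages = rng.normal(27, 4, size=68)

# 10,000 bootstrapped sample means.
boot_means = np.array([
    rng.choice(ages, size=68, replace=True).mean()
    for _ in range(10_000)
])

ratio = np.std(ages) / np.std(boot_means)
print(ratio, np.sqrt(68))
```

The ratio will not equal \sqrt{68} exactly because the bootstrap is random, but it should be close.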


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 46%.



Problem 11

Aladár Gerevich is a Hungarian fencer who is one of only two men to win Olympic medals 28 years apart. He earned 10 Olympic medals in total throughout his career: 7 gold, 1 silver, and 2 bronze. The table below shows the distribution of medal types for Aladár Gerevich, as well as a few other athletes who also earned 10 Olympic medals.


Problem 11.1

Which type of data visualization is most appropriate to compare two athletes’ medal distributions?

Answer: overlaid bar chart

Here we’re plotting three proportions (gold, silver, and bronze) for each athlete, where medal type is a categorical variable. A line plot is inappropriate here, as line plots are almost always used for a numerical variable plotted over time, not across categories. A histogram also isn’t useful, as histograms display the distribution of a numerical variable using bars. In this case, we already have the proportions we’re interested in plotting, so we can just make one bar for each medal type, for each athlete, using an overlaid bar chart.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 73%.


Problem 11.2

Among the other athletes in the table above, whose medal distribution has the largest total variation distance (TVD) to Aladár Gerevich’s distribution?

Answer: Franziska van Almsick

The Total Variation Distance (TVD) of two categorical distributions is the sum of the absolute differences of their proportions, all divided by 2. We can apply the TVD formula to these distributions. The TVD between Katie Ledecky and Aladár Gerevich is given by \frac{1}{2} \cdot (|0.7 - 0.7| + |0.1 - 0.3| + |0.2 - 0|) = \frac{0.4}{2} = 0.2. The TVD between Alexander Dityatin and Aladár Gerevich is given by \frac{1}{2} \cdot (|0.7 - 0.3| + |0.1 - 0.6| + |0.2 - 0.1|) = \frac{1}{2} = 0.5. And finally, the TVD between Franziska van Almsick and Aladár Gerevich is given by \frac{1}{2} \cdot (|0.7 - 0| + |0.1 - 0.4| + |0.2 - 0.6|) = \frac{1.4}{2} = 0.7. So, Franziska van Almsick has the largest TVD to Gerevich’s distribution.
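The same calculation can be sketched in code. The function name tvd and the dictionary layout are illustrative choices; the proportions are the (gold, silver, bronze) values used in the calculations above.

```python
import numpy as np

def tvd(dist1, dist2):
    """Total variation distance between two categorical distributions."""
    return np.abs(np.array(dist1) - np.array(dist2)).sum() / 2

# Proportions of (gold, silver, bronze) medals.
gerevich = [0.7, 0.1, 0.2]
others = {
    "Katie Ledecky": [0.7, 0.3, 0.0],
    "Alexander Dityatin": [0.3, 0.6, 0.1],
    "Franziska van Almsick": [0.0, 0.4, 0.6],
}

tvds = {name: tvd(gerevich, dist) for name, dist in others.items()}
farthest = max(tvds, key=tvds.get)

print(tvds, farthest)
```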


Difficulty: ⭐️

The average score on this problem was 92%.


Problem 11.3

Suppose Pallavi earns 10 Olympic medals in such a way that the TVD between Pallavi’s medal distribution and Aladár Gerevich’s medal distribution is as large as possible. What is Pallavi’s medal distribution?

Answer: x=0, y=1, z=0

Intuitively, we can maximize the TVD between the distributions by putting all of Pallavi’s medals in the category in which Gerevich won the fewest medals, so x = 0, y = 1, z = 0. Moving any of these medals to another category would necessarily decrease the TVD, since all of Pallavi’s medal proportions would get closer to Gerevich’s (the silver proportion would decrease, getting closer, and the gold or bronze proportion would increase, getting closer as well).


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 72%.



Problem 11.4

More generally, suppose medal_dist is an array of length three representing an athlete’s medal distribution. Which of the following expressions gives the maximum possible TVD between medal_dist and any other distribution?

Answer: 1 - medal_dist.min()

Similar to part c, we know that the TVD is maximized by placing all the medals of competitor A into the category in which competitor B has the lowest proportion of medals. In this category, the absolute difference between the two distributions is 1 - medal_dist.min(). In the other categories, competitor A has no medals (a proportion of 0), and competitor B has the remainder of their medals, which is also 1 - medal_dist.min() in total. So, the TVD is given by \frac{1}{2} \cdot 2 \cdot (1 - medal_dist.min()) = 1 - medal_dist.min().
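As a sanity check rather than a proof, the sketch below brute-forces every way of splitting 10 medals across three categories and compares the largest TVD to 1 - medal_dist.min(). It uses Gerevich's distribution as the example medal_dist; the helper tvd is an illustrative name.

```python
import numpy as np

def tvd(dist1, dist2):
    """Total variation distance between two categorical distributions."""
    return np.abs(np.array(dist1) - np.array(dist2)).sum() / 2

medal_dist = np.array([0.7, 0.1, 0.2])  # Gerevich's (gold, silver, bronze)

# Try every split of 10 medals among gold, silver, and bronze.
best = 0
for gold in range(11):
    for silver in range(11 - gold):
        bronze = 10 - gold - silver
        other = np.array([gold, silver, bronze]) / 10
        best = max(best, tvd(medal_dist, other))

print(best, 1 - medal_dist.min())
```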


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 56%.



Problem 12

Consider the DataFrame olympians.drop(columns=["Medal", "Year", "Count"]).


Problem 12.1

State a question we could answer using the data in this DataFrame and a permutation test.

Answer: There are many possible answers for this question. Some examples: “Are male olympians from team USA significantly taller than male olympians from other countries?”, “Are olympic swimmers heavier than olympic figure skaters?”, “On average, are male athletes heavier than female athletes?”

Recall that a permutation test checks whether two samples come from the same distribution, or whether the difference between the two groups is so large that we can’t reasonably say they come from the same distribution. In general, this means the question would have to compare two groups using the age, height, or weight column.


Difficulty: ⭐️⭐️

The average score on this problem was 80%.



Problem 12.2

State the null and alternative hypotheses for this permutation test.

Answer: Also many possible answers depending on your answer for the first question.

For example, if our question was “Are olympic swimmers heavier than olympic figure skaters?”, then the null hypothesis could be “Olympic swimmers weigh the same as olympic figure skaters” and the alternative could be “Olympic swimmers weigh more than olympic figure skaters”.


Difficulty: ⭐️⭐️

The average score on this problem was 80%.



Problem 13

In our sample, we have data on 163 medals for the sport of table tennis. Based on our data, China seems to really dominate this sport, earning 81 of these medals.

That’s nearly half of the medals for just one country! We want to do a hypothesis test with the following hypotheses to see if this pattern is true in general, or just happens to be true in our sample.

Null: China wins half of Olympic table tennis medals.

Alternative: China does not win half of Olympic table tennis medals.


Problem 13.1

Why can these hypotheses be tested by constructing a confidence interval?

Answer: Since the test aims to determine whether a parameter is equal to a fixed value

The goal of a confidence interval is to provide a range of values that, given the data, are considered plausible for the parameter in question. If the null hypothesis’ fixed value does not fall within this interval, it suggests that the observed data is not very compatible with the null hypothesis. Thus in our case, if a 95% confidence interval for the proportion of medals won by China does not include 0.5, then there’s statistical evidence at the 5% significance level to suggest that China does not win exactly half of the medals. So confidence intervals work to test these hypotheses because we are checking whether the hypothesized value of 0.5 lies within our interval at the 95% confidence level.
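Here is a minimal sketch of this procedure, assuming the sample is represented as an array of 163 zeros and ones (1 for a medal won by China). The seed and variable names are illustrative, and NumPy's random generator stands in for babypandas' .sample.

```python
import numpy as np

rng = np.random.default_rng(13)

# The sample: 163 table tennis medals, 81 of them won by China.
medals = np.array([1] * 81 + [0] * 82)

# Bootstrap the sample proportion.
boot_props = np.array([
    rng.choice(medals, size=len(medals), replace=True).mean()
    for _ in range(10_000)
])

# The middle 95% of the bootstrapped proportions.
left = np.percentile(boot_props, 2.5)
right = np.percentile(boot_props, 97.5)

# Reject the null only if 0.5 falls outside the interval.
reject_null = not (left <= 0.5 <= right)
print(left, right, reject_null)
```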


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 44%.



Problem 13.2

Suppose we construct a 95% bootstrapped CI for the proportion of Olympic table tennis medals won by China. Select all true statements.

Answer: If we resampled our original sample and calculated the proportion of Olympic table tennis medals won by China in our resample, there is approximately a 95% chance our interval would contain this number.

The second option is the only correct answer. A 95% bootstrapped confidence interval is constructed by taking the middle 95% of the bootstrapped sample proportions, each of which comes from a resample of our original sample. So if we drew one more resample and computed its proportion, there is approximately a 95% chance that it would land inside the interval. The interval does not mean that the true proportion has a 95% chance of falling within it. The true proportion is a fixed number, and the 95% describes the long-run fraction of intervals, each built from a different original sample drawn from the population, that would contain it.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 73%.


Problem 13.3

True or False: In this scenario, it would also be appropriate to create a 95% CLT-based confidence interval.

Answer: True

The statement is true because the Central Limit Theorem (CLT) applies to the sampling distribution of the proportion, given that the sample size is large enough, which in our case, with 163 medals, it is. The CLT asserts that the distribution of the sample mean (or proportion, in our case) will approximate a normal distribution as the sample size grows, allowing the use of standard methods to create confidence intervals. Therefore, a CLT-based confidence interval is appropriate for estimating the true proportion of Olympic table tennis medals won by China.
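A CLT-based interval for this scenario could be sketched as follows. It uses the fact that the SD of a collection of 0s and 1s with proportion p is \sqrt{p(1-p)}, and the rule of thumb that about 95% of a normal distribution lies within 2 SDs of its center. The variable names are illustrative.

```python
import math

n = 163
p_hat = 81 / n

# SD of a 0/1 sample with proportion p_hat.
sample_sd = math.sqrt(p_hat * (1 - p_hat))

# SD of the distribution of possible sample proportions.
sd_of_sample_mean = sample_sd / math.sqrt(n)

# Approximate 95% interval: 2 SDs on either side of the sample proportion.
left = p_hat - 2 * sd_of_sample_mean
right = p_hat + 2 * sd_of_sample_mean

print(left, right)
```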


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 71%.



Problem 13.4

True or False: If our 95% bootstrapped CI came out to be [0.479, 0.518], we would reject the null hypothesis at the 0.05 significance level.

Answer: False

This is false. We would fail to reject the null hypothesis because the interval [0.479, 0.518] includes the value of 0.5, which corresponds to the null hypothesis that China wins half of the Olympic table tennis medals. If the confidence interval contains the hypothesized value, there is not enough statistical evidence to reject the null hypothesis at the specified significance level. In this case, the data does not provide sufficient evidence to conclude that the proportion of medals won by China is different from 0.5 at the 0.05 significance level.


Difficulty: ⭐️

The average score on this problem was 92%.



Problem 13.5

True or False: If we instead chose to test these hypotheses at the 0.01 significance level, the confidence interval we’d create would be wider.

Answer: True

Lowering the significance level means that you require more evidence to reject the null hypothesis, thus seeking a higher confidence in your interval estimate. A higher confidence level corresponds to a wider interval because it must encompass a larger range of values to ensure that it contains the true population parameter with the increased probability. Thus as we lower the significance level, the interval we create will be wider, making this statement true.


Difficulty: ⭐️⭐️

The average score on this problem was 79%.



Problem 13.6

True or False: If we instead chose to test these hypotheses at the 0.01 significance level, we would be more likely to conclude a statistically significant result.

Answer: False

Lowering the significance level to 0.01 means we would need stronger evidence to reject the null hypothesis. In terms of confidence intervals, the previous problem showed that the 99% interval is wider than the 95% interval, so it covers everything the 95% interval covers, and possibly more. The hypothesized value of 0.5 is therefore at least as likely to fall inside the interval, which makes us less likely, not more likely, to conclude a statistically significant result.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 62%.



Problem 14


Problem 14.1

Suppose that in Olympic ski jumping, ski jumpers jump off of a ramp that’s shaped like a portion of a normal curve. Drawn from left to right, a full normal curve has an inflection point on the ascent, then a peak, then another inflection point on the descent. A ski jump ramp stops at the point that is one third of the way between the inflection point on the ascent and the peak, measured horizontally. Below is an example ski jump ramp, along with the normal curve that generated it.

Fill in the blank below so that the expression evaluates to the area of a ski jump ramp, if the area under the normal curve that generated it is 1.

    from scipy import stats
    stats.norm.cdf(______)

What goes in the blank?

Answer: -2/3

We know that the normal distribution is symmetric about the mean, and that the mean is the “peak” described in the graph. The inflection points occur one standard deviation above and below the mean (the peak), so a point which is one third of the way in between the first inflection point and the peak is -(1-\frac{1}{3}) = -\frac{2}{3} standard deviations from the mean. We can then use stats.norm.cdf(-2/3) to calculate the area under the curve to the left of this point.
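The calculation can be sketched with Python's standard library, using statistics.NormalDist as a stand-in for stats.norm.cdf so that no extra packages are needed. The variable names are illustrative.

```python
from statistics import NormalDist

# Standard normal curve: peak at 0, inflection points at -1 and 1.
inflection, peak = -1, 0

# The ramp ends one third of the way from the inflection point to the peak.
ramp_end = inflection + (peak - inflection) / 3

# Area under the standard normal curve to the left of that point.
area = NormalDist().cdf(ramp_end)
print(ramp_end, area)
```

The printed area is roughly 0.25, so the ramp covers about a quarter of the area under the curve.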


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 51%.


Problem 14.2

Suppose that in Olympic downhill skiing, skiers compete on mountains shaped like normal distributions with mean 50 and standard deviation 8. Skiers start at the peak and ski down the right side of the mountain, so their x-coordinate increases.

Keenan is an Olympic downhill skier, but he’s only been able to practice on a mountain shaped like a normal distribution with mean 65 and standard deviation 12. In his practice, Keenan always crouches down low when he reaches the point where his x-coordinate is 92, which helps him ski faster. When he competes at the Olympics, at what x-coordinate should he crouch down low, corresponding to the same relative location on the mountain?

Answer: 68

Since we know that both slopes are normal distributions (just scaled and shifted), we can derive this answer by writing Keenan’s crouch point in terms of standard deviations from the mean. He typically crouches at an x-coordinate of 92, whose distance from the mean (in standard deviations) is given by \frac{92 - 65}{12} = 2.25. So, all we need to do is find the x-coordinate that is 2.25 standard deviations above the mean on the Olympic mountain. This is given by 50 + (2.25 \cdot 8) = 68.
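The conversion to standard units and back can be written as two small helper functions. The function names are hypothetical and chosen for illustration.

```python
def to_standard_units(x, mean, sd):
    """Number of SDs that x lies above the mean."""
    return (x - mean) / sd

def from_standard_units(z, mean, sd):
    """The value that lies z SDs above the mean."""
    return mean + z * sd

# Practice mountain: mean 65, SD 12. Olympic mountain: mean 50, SD 8.
z = to_standard_units(92, mean=65, sd=12)
olympic_x = from_standard_units(z, mean=50, sd=8)

print(z, olympic_x)
```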


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 72%.


Problem 14.3

Aaron is another Olympic downhill skier. When he competes on the normal curve mountain with mean 50 and standard deviation 8, he crouches down low when his x-coordinate is 54. If the total area of the mountain is 1, approximately how much of the mountain’s area is ahead of Aaron at the moment he crouches down low?

Answer: 0.3

We know that when Aaron reaches the mean (50), exactly 0.5 of the mountain’s area is behind him, since the mean and median are equal for normal distributions like this one. We also see that 54 is one half of a standard deviation above the mean. So, all we have to do is find the proportion of the area between the mean and half a standard deviation above the mean. This number happens to be approximately 0.19, so by the time Aaron reaches an x-coordinate of 54, 0.5 + 0.19 = 0.69 of the mountain is behind him. From here, we calculate the area in front as 1 - 0.69 = 0.31, so we conclude that approximately 0.3 of the area is in front of Aaron.
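For reference, the areas can be computed directly. This sketch uses statistics.NormalDist from the standard library; on the exam itself, no calculator or code was available, so this is only a check of the approximation.

```python
from statistics import NormalDist

# The Olympic mountain: normal curve with mean 50 and SD 8.
mountain = NormalDist(mu=50, sigma=8)

behind = mountain.cdf(54)   # area to the left of x = 54
ahead = 1 - behind          # area to the right of x = 54

print(behind, ahead)
```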


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 50%.



Problem 15

Birgit Fischer-Schmidt is a German canoe paddler who set many records, including being the only woman to win Olympic medals 24 years apart.

Below is a DataFrame with information about all 12 Olympic medals she has won. There are only 10 rows but 12 medals represented, because there are some years in which she won more than one medal of the same type.


Problem 15.1

Suppose we randomly select one of Birgit’s Olympic medals, and see that it is a gold medal. What is the probability that the medal was earned in the year 2000? Give your answer as a fully simplified fraction.

Answer: \frac{1}{4}

Reading the prompt, we can see that we are solving for a conditional probability. Let A be the given condition that the medal is gold and let B be the event that the medal is from 2000. Looking at the DataFrame, we can see that 8 gold medals were earned in total (make sure you pay attention to the count column). Out of these 8 medals, 2 are from the year 2000. Thus we obtain the probability \frac{2}{8} = \frac{1}{4}.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 70%.


Problem 15.2

Suppose we randomly select one of Birgit’s Olympic medals. What is the probability it is gold or earned while representing East Germany? Give your answer as a fully simplified fraction.

Answer: \frac{3}{4}

Here we can recognize that we are solving for the probability of a union of two events. Let A be the event that the medal is gold. Let B be the event that it is earned while representing East Germany. The probability formula for a union is P(A∪B) = P(A)+P(B)-P(A∩B). Looking at the DataFrame, we know P(A)=8/12, P(B)=4/12, and P(A∩B)=3/12. Plugging all of this into the formula, we get \frac{8}{12}+\frac{4}{12}-\frac{3}{12}=\frac{9}{12}=\frac{3}{4}. Thus, the correct answer is \frac{3}{4}.
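The addition rule can be checked with exact fractions. The counts (8 gold, 4 for East Germany, 3 in both groups, 12 total) are the ones stated in the solution above.

```python
from fractions import Fraction

total = 12
p_gold = Fraction(8, total)
p_east_germany = Fraction(4, total)
p_both = Fraction(3, total)

# P(A or B) = P(A) + P(B) - P(A and B)
p_gold_or_east_germany = p_gold + p_east_germany - p_both
print(p_gold_or_east_germany)
```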


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 73%.


Problem 15.3

Suppose we randomly select two of Birgit’s Olympic medals, without replacement. What is the probability both were earned in 1988? Give your answer as a fully simplified fraction.

Answer: \frac{1}{22}

In this problem, we are sampling 2 medals without replacement. Let A be the event that the first medal is from 1988 and let B be the event that the second medal is from 1988. P(A) is 3/12 since there are 3 medals from 1988 out of the total 12 medals. For the second draw, since we are sampling without replacement, the medal that we just drew is no longer in our pool. Thus, given that the first medal is from 1988, the probability that the second is also from 1988 is 2/11, since 2 of the remaining 11 medals are from 1988. The joint probability P(A∩B) is P(A) \cdot P(B given A), so plugging these values in, we get (3/12) \cdot (2/11) = 6/132 = 1/22. Thus, the answer is 1/22.
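The result can be verified two ways: with the multiplication rule, and by counting pairs directly. In the sketch below, medals from other years are given a placeholder label, since only the 1988 medals matter here.

```python
from fractions import Fraction
from itertools import combinations

# Multiplication rule: P(first is 1988) * P(second is 1988 given first is).
p_rule = Fraction(3, 12) * Fraction(2, 11)

# Counting: 3 medals from 1988, 9 from other years (placeholder label).
medals = [1988] * 3 + ["other"] * 9
pairs = list(combinations(range(len(medals)), 2))
both_1988 = sum(1 for i, j in pairs if medals[i] == 1988 and medals[j] == 1988)
p_count = Fraction(both_1988, len(pairs))

print(p_rule, p_count)
```

Of the 66 possible pairs of medals, 3 consist of two 1988 medals, which gives the same probability.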


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 57%.



Problem 16

Suppose males is a DataFrame of all male Olympic weightlifting medalists with a column called "Weight" containing their body weight. Similarly, females is a DataFrame of all female Olympic weightlifting medalists. It also has a "Weight" column with body weights.

The males DataFrame has 425 rows and the females DataFrame has 105 rows, since women’s weightlifting became an Olympic sport much later than men’s.

Below, density histograms of the distributions of "Weight" in males and females are shown on the same axes:


Problem 16.1

Estimate the number of males included in the third bin (from 80 to 100). Give your answer as an integer, rounded to the nearest multiple of 10.

Answer: 110

We can estimate the number of males included in the third bin (from 80 to 100) by multiplying the area of that particular histogram bar by the total number of males. The bar has a height of around 0.013 and a width of 20, so the bar has an area of 0.013 * 20 = 0.26, which means 26% of males fall in that bin. Since there are 425 total males, 0.26 * 425 ≈ 110 males are in that bin, rounded to the nearest multiple of 10.
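The estimate can be reproduced in a few lines. The bar height of 0.013 is the approximate value read off the histogram in the solution above.

```python
height = 0.013       # approximate bar height read from the histogram
width = 100 - 80     # bin width
n_males = 425

# In a density histogram, a bar's area is the proportion of values in the bin.
proportion_in_bin = height * width
estimate = proportion_in_bin * n_males

# Round to the nearest multiple of 10.
rounded = round(estimate / 10) * 10
print(proportion_in_bin, estimate, rounded)
```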


Difficulty: ⭐️⭐️

The average score on this problem was 81%.



Problem 16.2

Using the males DataFrame, write one line of code that evaluates to the exact number of males included in the third bin (from 80 to 100).

Answer: males[(males.get("Weight")>=80) & (males.get("Weight")<100)].shape[0]

To get the exact number of males in the third bin (from 80 to 100) using code, we can query the males DataFrame to only include rows where "Weight" is greater than or equal to 80 and "Weight" is less than 100. Remember, a histogram bin includes the lower bound (in this case, 80) and excludes the upper bound (in this case, 100). Furthermore, we should use & instead of and, because we are combining two Boolean Series element-wise, which the and keyword cannot do. Finally, remember to put parentheses around each condition, or else the order of operations will mess up your intended conditions. After querying, we use .shape[0] to get the number of rows in the resulting DataFrame, which is the number of males with a weight greater than or equal to 80 and less than 100.
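The boundary behavior and the use of & can be illustrated with a plain NumPy array of made-up weights, standing in for males.get("Weight"). The values are hypothetical and chosen to sit on and around the bin edges.

```python
import numpy as np

# Hypothetical weights, including values exactly at 80 and 100.
weights = np.array([56, 79.9, 80, 85, 99.9, 100, 105])

# Each condition is parenthesized, and & combines them element-wise.
in_bin = (weights >= 80) & (weights < 100)

# Count the True values: 80 is included, 100 is excluded.
count = in_bin.sum()
print(in_bin, count)
```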


Difficulty: ⭐️⭐️

The average score on this problem was 75%.



Problem 16.3

\bigstar Among Olympic weightlifting medalists who weigh less than 60 kilograms, what proportion are male?

Answer: between 0.5 and 0.75

We can answer this question by calculating the approximate number of males and females in the first bin (from 40 to 60). Be careful: we cannot simply compare the areas of the bars, because the number of male weightlifting medalists is different from the number of female weightlifting medalists. The approximate number of male weightlifting medalists in this bin is 0.008 \cdot 20 \cdot 425 = 68, while the approximate number of female weightlifting medalists in this bin is 0.021 \cdot 20 \cdot 105 \approx 44. The proportion of weightlifting medalists under 60 kg who are male is approximately 68 / (68 + 44) = 0.607, which falls in the category of “between 0.5 and 0.75”.
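The same estimate in code, using the approximate bar heights from the solution above (0.008 for males and 0.021 for females):

```python
bin_width = 60 - 40

# Area of each bar times the size of the group gives a count.
males_in_bin = 0.008 * bin_width * 425
females_in_bin = 0.021 * bin_width * 105

proportion_male = males_in_bin / (males_in_bin + females_in_bin)
print(males_in_bin, females_in_bin, proportion_male)
```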


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 54%.


