Lecture 15 — Practice



Lecture 15 — Collected Practice Questions

Below are practice problems tagged for Lecture 15 (rendered directly from the original exam/quiz sources).


Source: sp24-final — Q4

Problem 1

According to Chebyshev’s inequality, at least 80% of San Diego apartments have a monthly parking fee that falls between $30 and $70.


Problem 1.1

What is the average monthly parking fee?

Answer: \$50

We are given that the left and right bounds of the Chebyshev interval are $30 and $70, respectively. Chebyshev's inequality describes an interval centered at the mean, so the mean is the midpoint of the two bounds:

\frac{\text{right} + \text{left}}{2}

\frac{70 + 30}{2} = 50

Therefore, 50 is the average monthly parking fee.


Difficulty: ⭐️

The average score on this problem was 92%.


Problem 1.2

What is the standard deviation of monthly parking fees?

Answer: \frac{20}{\sqrt{5}}

Chebyshev’s inequality states that at least 1 - \frac{1}{z^2} of values are within z standard deviations of the mean. In addition, z can be represented as \frac{\text{bound} - \text{mean of x}}{\text{SD of x}}.

Therefore, we can set up the equation like so: \frac{4}{5} = 1 - \frac{1}{(\frac{\text{bound} - \text{mean of x}}{\text{SD of x}})^2}

Then, we can solve: \frac{1}{5} = \frac{1}{(\frac{\text{bound} - \text{mean of x}}{\text{SD of x}})^2}

Now since we know both bounds, we can plug one of them in. Since the mean was computed in the earlier step, we also plug this in.

\frac{1}{5} = \frac{1}{(\frac{70 - 50}{\text{SD of x}})^2}

5 = (\frac{20}{\text{SD of x}})^2

\sqrt{5} = \frac{20}{\text{SD of x}}

\text{SD of x} = \frac{20}{\sqrt{5}}
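The derivation above can be checked numerically. The following is a minimal sketch that uses only the numbers stated in the problem (bounds of 30 and 70, coverage of 80%); the variable names are chosen here for illustration.

```python
import math

lower, upper = 30, 70   # bounds given in the problem
coverage = 0.80         # "at least 80%"

# The Chebyshev interval is centered at the mean.
mean = (lower + upper) / 2

# Solve coverage = 1 - 1/z^2 for z.
z = math.sqrt(1 / (1 - coverage))

# The upper bound sits z standard deviations above the mean.
sd = (upper - mean) / z

print(mean)  # 50.0
print(sd)    # about 8.94, which is 20 / sqrt(5)
```

Plugging z back into 1 - 1/z^2 recovers the 80% from the problem statement.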


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 70%.



Source: sp25-final — Q1

Problem 2

In an annual ceremony known as the reaping, tributes are selected to represent their district in the Hunger Games. One male and one female tribute from each district are randomly selected via a lottery drawing.

Every child between the ages of 12 and 18 (inclusive) has tickets entered into the drawing for their sex and district (e.g. girls from District 12). The number of tickets entered is dependent on age.

Starting at age 12, each child receives one ticket in the lottery. For each year after that, they receive one additional ticket, added to the total from the previous year. For example, 13-year-olds have two tickets, 14-year-olds have three tickets, and so on.

In this problem, we will consider only tickets corresponding to girls from District 12, and look at the distribution of these tickets according to the age of the person they represent. A density histogram for these tickets is shown below.


Problem 2.1

Which of the following statements about this distribution is correct?

Answer: The mean is less than the median

The histogram shows that most of the tickets belong to older girls, i.e. girls aged 17 and 18, and that there are fewer tickets for younger girls. The distribution is skewed left: the small values in the left tail pull the mean down more than the median, so the mean is less than the median.
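To see the effect of a left tail on the mean, here is a small sketch with made-up data. The ages below are not read off the histogram; they are a hypothetical list with one entry per ticket, chosen so that most entries are large.

```python
import statistics

# Hypothetical ages, one entry per ticket; most tickets belong to older girls.
ticket_ages = [12, 14, 16, 17, 17, 18, 18, 18]

mean_age = statistics.mean(ticket_ages)
median_age = statistics.median(ticket_ages)

print(mean_age)    # 16.25
print(median_age)  # 17.0
```

The few small values (12 and 14) drag the mean below the median, which is the pattern described in the solution.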


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 70%.


The histogram from the previous page is repeated below for your reference.


Problem 2.2

Suppose the rules of the Hunger Games were changed to eliminate 18-year-olds. If we plotted a new density histogram of the distribution of ages for tickets corresponding to girls from District 12 aged 12 to 17, how would the height of the [13, 14) bar change?

Let h be the height of this bar in the original histogram. Give its height in the new histogram in terms of h.

Answer: 4/3 * h

The total area of a density histogram is 1. When the 18-year-olds are removed, we have to rescale the remaining bars so that the total area is still 1. Notice that the height of the removed [18, 19) bar is 0.25, and each bar has width 1, so the remaining data is exactly 3/4 of the original data. To rescale, we divide the height h of each remaining bar by 3/4, which is the same as multiplying h by 4/3.
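The rescaling step can be sketched in a few lines. Only the height 0.25 for the age-18 bar comes from the solution above; the other heights are made up so that all seven bars (each of width 1) have total area 1.

```python
# Bar heights for ages 12 through 18. Only the 0.25 for age 18 is taken
# from the solution; the remaining heights are hypothetical.
heights = {12: 0.03, 13: 0.07, 14: 0.12, 15: 0.18, 16: 0.15, 17: 0.20, 18: 0.25}

removed = heights[18]

# Drop the age-18 bar and divide the rest by the remaining area (3/4).
new_heights = {age: h / (1 - removed) for age, h in heights.items() if age != 18}

total_area = sum(new_heights.values())   # should be 1 again (bar width is 1)
scale_13 = new_heights[13] / heights[13] # factor applied to the [13, 14) bar

print(total_area)
print(scale_13)  # about 1.333, i.e. 4/3
```

Every remaining bar is scaled by the same factor, so the answer does not depend on the made-up heights.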


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 40%.



Problem 2.3

What is the most common age among girls from District 12 aged 12 to 18? Remember, the distribution above is for all tickets, and older girls have more tickets.

Answer: 15

For this problem, we need to determine which bar represents the greatest number of girls, keeping in mind that girls of different ages get different numbers of tickets. To do this, we divide the height of each bar by the number of tickets given to each girl of that age. The result is proportional to the number of girls of that age. The bar for age 15 gives the largest value, 0.045, so the answer is 15.
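Here is a sketch of that calculation. The heights for ages 15 and 18 are consistent with the values quoted in the solutions (0.045 × 4 = 0.18 and 0.25); the other heights are hypothetical, since the histogram itself is not reproduced here.

```python
# Bar heights for ages 12 through 18. The values for 15 and 18 match the
# solutions; the others are made up for illustration.
heights = {12: 0.03, 13: 0.07, 14: 0.12, 15: 0.18, 16: 0.15, 17: 0.20, 18: 0.25}

# A 12-year-old has 1 ticket, a 13-year-old has 2, and so on.
tickets_per_girl = {age: age - 11 for age in heights}

# Height divided by tickets per girl is proportional to the number of girls.
relative_girls = {age: heights[age] / tickets_per_girl[age] for age in heights}

most_common_age = max(relative_girls, key=relative_girls.get)
print(most_common_age)  # 15
```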


Difficulty: ⭐️⭐️

The average score on this problem was 75%.



Source: sp25-final — Q5

Problem 3

The night before the Hunger Games begins, each tribute is interviewed in front of a live audience. During this interview, the host asks each tribute a few personal questions and reveals their overall score from the training camp. These interviews are broadcast across the country, so that the residents of Panem can get to know the tributes better and form opinions about who they want to win.

The Capitol wants to understand public perceptions of the tributes after the interviews for the 74th Hunger Games. They conduct a survey of a sample of residents from all 12 districts, asking them two questions:

  1. “What district do you live in?"

  2. “Who do you think will win this year’s Hunger Games?"

The survey results are in the DataFrame survey, with columns "District" and "Tribute" which contain each person’s answers to the two questions above. The first few rows of survey are shown below.

In this problem, we will try to estimate the proportion of residents from a given district who think a certain tribute will win the Hunger Games.


Problem 3.1

What proportion of residents in District 11 think Peeta will win? Write one line of code that evaluates to this proportion in our sample, based on the data in survey.

Answer: survey[(survey.get("Tribute") == "Peeta") & (survey.get("District") == 11)].shape[0] / survey[survey.get("District") == 11].shape[0]

This question comes down to querying. For the numerator, we want everyone in the survey who is from District 11 and thinks Peeta will win. We get this by querying on both conditions and taking .shape[0] of the result. For the denominator, we want everyone in the survey who is from District 11, so we query on that condition alone and take .shape[0].
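The one-liner can be tried on a toy table. This sketch assumes pandas as a stand-in for babypandas (pandas DataFrames also support .get and .shape), and the six survey rows are made up.

```python
import pandas as pd

# Hypothetical survey responses; the real survey DataFrame is not shown here.
survey = pd.DataFrame({
    "District": [11, 11, 11, 11, 12, 12],
    "Tribute": ["Peeta", "Katniss", "Peeta", "Rue", "Katniss", "Peeta"],
})

# Numerator: District 11 residents who picked Peeta.
peeta_in_11 = survey[(survey.get("Tribute") == "Peeta") & (survey.get("District") == 11)]

# Denominator: all District 11 residents in the sample.
in_11 = survey[survey.get("District") == 11]

prop = peeta_in_11.shape[0] / in_11.shape[0]
print(prop)  # 0.5
```

Note that the Peeta vote from District 12 is excluded from the numerator, which is why both conditions are needed.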


Difficulty: ⭐️⭐️

The average score on this problem was 78%.


Problem 3.2

Next, we want to create a 95% confidence interval for the proportion of all residents from a given district who think a certain tribute will win. Fill in the blanks in the function win_ci below. This function takes the name of a tribute and the number of a district and returns the endpoints of a 95% bootstrapped confidence interval for the proportion of all residents of that district who think that tribute will win, based on the data in survey.

For example, win_ci("Peeta", 11) returns the endpoints of a 95% confidence interval for the proportion of all residents from District 11 who think Peeta will win.

def win_ci(tribute, district):
    only_district = survey[survey.get("District") == district]
    props = np.array([])
    for i in np.arange(10000):
        resample = __(a)__
        tribute_count = __(b)__
        boot_prop = tribute_count / __(c)__
        props = np.append(props, boot_prop)
    return [np.percentile(props, 2.5), np.percentile(props, 97.5)]

(a): only_district.sample(only_district.shape[0], replace=True)

For the first blank, we create a bootstrap resample from just the rows in the given district. Bootstrapping means sampling with replacement, and the resample must have the same number of rows as the data it is drawn from. So we call .sample on only_district, the DataFrame containing the rows for the given district, with replace=True, and set the sample size to only_district.shape[0].

(b): resample[resample.get("Tribute") == tribute].shape[0]

Next, we find how many times the given tribute appears in the bootstrap resample. To do that, we query resample for rows matching the given tribute and count the rows of the result using .shape[0].

(c): resample.shape[0]

The denominator of the bootstrapped proportion is the total number of people in the resample, so we use .shape[0] to get the number of rows in resample.
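Putting the three blanks together gives a function that can be run end to end. The sketch below assumes pandas in place of babypandas, uses a made-up survey table, and adds a hypothetical reps parameter (not in the original function) so the example runs quickly.

```python
import numpy as np
import pandas as pd

# Made-up survey: 40 respondents from District 11 (24 pick Peeta)
# and 10 respondents from District 12.
survey = pd.DataFrame({
    "District": [11] * 40 + [12] * 10,
    "Tribute": ["Peeta"] * 24 + ["Katniss"] * 16 + ["Katniss"] * 10,
})

def win_ci(tribute, district, reps=1000):
    # reps is a hypothetical parameter; the original uses 10000 repetitions.
    only_district = survey[survey.get("District") == district]
    props = np.array([])
    for i in np.arange(reps):
        # (a) resample the district's rows with replacement, same size
        resample = only_district.sample(only_district.shape[0], replace=True)
        # (b) count rows in the resample that name the given tribute
        tribute_count = resample[resample.get("Tribute") == tribute].shape[0]
        # (c) divide by the size of the resample
        boot_prop = tribute_count / resample.shape[0]
        props = np.append(props, boot_prop)
    return [np.percentile(props, 2.5), np.percentile(props, 97.5)]

np.random.seed(0)  # makes the resampling reproducible
interval = win_ci("Peeta", 11)
print(interval)
```

In this toy data the sample proportion for Peeta in District 11 is 24/40 = 0.6, so the printed interval should contain 0.6.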


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 59%.


Problem 3.3

Suppose we were to plot a histogram of props within the function win_ci. Which of the following best describes this histogram?

Answer: The histogram is roughly normal because of the Central Limit Theorem (CLT).

The props histogram shows the distribution of proportions across many random resamples. Per the CLT, the distribution of a sample statistic like a mean or a proportion (which is a mean of 0s and 1s) is roughly normal, regardless of the shape of the original dataset.
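A quick simulation illustrates this. The sketch below assumes a hypothetical district with n = 100 respondents, 40% of whom pick the tribute. Counting matches in a resample of size n drawn with replacement is the same as drawing from a binomial distribution, which is the shortcut used here instead of resampling a DataFrame.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 100, 0.4  # hypothetical sample size and sample proportion

# Each entry is the proportion in one bootstrap resample of size n.
props = rng.binomial(n, p, size=10_000) / n

# SD of the resampled proportions predicted by the CLT.
theory_sd = np.sqrt(p * (1 - p) / n)

# For a normal distribution, about 95% of values fall within 2 SDs of the mean.
within_2_sd = np.mean(np.abs(props - p) <= 2 * theory_sd)

print(props.mean(), props.std(), within_2_sd)
```

The simulated mean and SD should land close to p and theory_sd, and the share within 2 SDs should be near 95%, consistent with a roughly normal shape.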


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 53%.


Problem 3.4

Suppose we now compute the following:

win_ci("Katniss", 4)

[0.25, 0.72]

win_ci("Katniss", 12)

[0.50, 0.70]

Which of the following reasons best explains why the second interval is narrower than the first?

Answer: There are more survey participants from District 12 than District 4.

Confidence intervals get narrower when there is an increase in sample size. This is because the variation present in the bootstrapped estimates is smaller. Therefore, we can say there were more survey participants from District 12 than District 4.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 68%.


Problem 3.5

Suppose we want to redo our survey so that our confidence interval for the proportion of District 12 residents who think Katniss will win has a width of at most 0.10. We will assume that our new sample’s standard deviation will be the same as our original sample’s standard deviation. Which of the following best describes how to achieve this?

Answer: Our new sample should have four times as many people from District 12. It doesn’t matter how many people are from other districts.

The width of a confidence interval is proportional to the standard error, which scales as \frac{1}{\sqrt{n}}. The current interval, [0.50, 0.70], has width 0.20, so we need to halve the width, which requires \sqrt{n} to double. Therefore, we need 4 times as much data, since \sqrt{4} = 2.
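The scaling can be checked with a short calculation. This sketch assumes the usual normal approximation, in which a 95% interval is about 4 standard errors wide, and a hypothetical original sample size of n = 96, chosen only because it makes the starting width come out to 0.20.

```python
import math

def approx_ci_width(p, n):
    # Normal approximation: a 95% interval spans about 2 SEs on each side.
    se = math.sqrt(p * (1 - p) / n)
    return 4 * se

p = 0.6  # midpoint of the interval [0.50, 0.70]
n = 96   # hypothetical sample size that gives a width of about 0.20

width_now = approx_ci_width(p, n)
width_4x = approx_ci_width(p, 4 * n)

print(width_now)             # about 0.20
print(width_4x)              # about 0.10
print(width_4x / width_now)  # about 0.5
```

The ratio of 0.5 holds for any starting n and p, since only the factor of 4 under the square root matters.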


Difficulty: ⭐️⭐️

The average score on this problem was 75%.



Problem 4

The data visualization below shows all Olympic gold medals for women’s gymnastics, broken down by the age of the gymnast.

Based on this data, rank the following three quantities in ascending order: the median age at which gold medals are earned, the mean age at which gold medals are earned, the standard deviation of the age at which gold medals are earned.

Answer: SD, median, mean

The standard deviation is clearly the smallest of the three values, since most of the data falls in the range from 14 to 26. Intuitively, the standard deviation is about a third of this range, or around 4. This is not the exact standard deviation, but it is clearly much less than the mean and median, which are somewhere between 19 and 25. To compare the median and mean, note that the distribution is skewed right. The large values in the right tail pull the mean up more than the median, so the mean is greater than the median, and the ranking is SD, median, mean.
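The same ordering shows up in any small right-skewed dataset. The ages below are made up for illustration and are not the medal data from the visualization.

```python
import statistics

# Hypothetical right-skewed ages: many values in the teens, a few much larger.
ages = [15, 16, 16, 17, 17, 18, 18, 19, 20, 22, 25, 30]

sd = statistics.pstdev(ages)       # population standard deviation
median = statistics.median(ages)
mean = statistics.mean(ages)

print(sd, median, mean)
```

Here the SD is a few years, while the median and mean are both ages in the high teens, with the mean pulled above the median by the values 25 and 30.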


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 72%.


Problem 5

The data visualization below shows all Olympic gold medals for women’s gymnastics, broken down by the age of the gymnast.


Problem 5.1

Based on this data, rank the following three quantities in ascending order: the median age at which gold medals are earned, the mean age at which gold medals are earned, the standard deviation of the age at which gold medals are earned.

Answer: SD, median, mean

The standard deviation is clearly the smallest of the three values, since most of the data falls in the range from 14 to 26. Intuitively, the standard deviation is about a third of this range, or around 4. This is not the exact standard deviation, but it is clearly much less than the mean and median, which are somewhere between 19 and 25. To compare the median and mean, note that the distribution is skewed right. The large values in the right tail pull the mean up more than the median, so the mean is greater than the median, and the ranking is SD, median, mean.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 72%.


Problem 5.2

Which of the following is larger for this dataset?

Answer: the difference between the 75th percentile of ages and the 50th percentile of ages

Since the distribution is right-skewed, values above the 50th percentile are more spread out than values below it. The 75th percentile is therefore farther from the 50th percentile than the 25th percentile is.
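The percentile gaps can be compared directly on a small right-skewed example. The ages are hypothetical, and statistics.quantiles interpolates between data points, which may differ slightly from the percentile definition used in the course.

```python
import statistics

# Hypothetical right-skewed ages, not the actual medal data.
ages = [15, 16, 16, 17, 17, 18, 18, 19, 20, 22, 25, 30]

# Quartiles: the 25th, 50th, and 75th percentiles.
q1, q2, q3 = statistics.quantiles(ages, n=4)

upper_gap = q3 - q2  # 75th percentile minus 50th percentile
lower_gap = q2 - q1  # 50th percentile minus 25th percentile

print(upper_gap, lower_gap)  # 3.5 1.75
```

The upper gap is larger because the values above the median are stretched out by the right tail.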


Difficulty: ⭐️⭐️

The average score on this problem was 78%.