Lecture 16 — Practice



Lecture 16 — Collected Practice Questions

Below are practice problems tagged for Lecture 16 (rendered directly from the original exam/quiz sources).


Problem 1

Researchers from the San Diego Zoo, located within Balboa Park, collected physical measurements of several species of penguins in a region of Antarctica.

One piece of information they tracked for each of 330 penguins was its mass in grams. The average penguin mass is 4200 grams, and the standard deviation is 840 grams.


Problem 1.1

Consider the histogram of mass below.


Select the true statement below.

Answer: The median mass of penguins is less than the average mass of penguins

This distribution is skewed to the right, and the long right tail pulls the mean above the median.


Difficulty: ⭐️⭐️

The average score on this problem was 87%.


Problem 1.2

For your convenience, we show the histogram of mass again below.


Recall, there are 330 penguins in our dataset. Their average mass is 4200 grams, and the standard deviation of mass is 840 grams.

Per Chebyshev’s inequality, at least what percentage of penguins have a mass between 3276 grams and 5124 grams? Input your answer as a percentage between 0 and 100, without the % symbol. Round to three decimal places.

Answer: 17.355

Recall that Chebyshev’s inequality states that, no matter what the shape of the distribution is, the proportion of values in the range “average ± z SDs” is at least 1 - \frac{1}{z^2}.

To approach the problem, we’ll start by converting 3276 grams and 5124 grams to standard units. Doing so yields \frac{3276 - 4200}{840} = -1.1 and, similarly, \frac{5124 - 4200}{840} = 1.1. This means that 3276 is 1.1 standard deviations below the mean, and 5124 is 1.1 standard deviations above the mean. Thus, we are calculating the proportion of values in the range “average ± 1.1 SDs”.

When z = 1.1, we have 1 - \frac{1}{z^2} = 1 - \frac{1}{1.1^2} \approx 0.173553719, which as a percentage rounded to three decimal places is 17.355\%.
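The arithmetic above can be checked with a short sketch. The function name chebyshev_lower_bound is a hypothetical helper introduced here for illustration; it is not part of the original problem.

```python
def chebyshev_lower_bound(z):
    """Minimum proportion of values within z SDs of the mean, for any distribution."""
    if z <= 0:
        raise ValueError("z must be positive")
    # For z <= 1 the formula is zero or negative, so the bound says nothing; clamp at 0.
    return max(0.0, 1 - 1 / z**2)

mean, sd = 4200, 840             # values given in the problem
z = (5124 - mean) / sd           # 5124 grams expressed in standard units
bound_pct = round(100 * chebyshev_lower_bound(z), 3)
print(bound_pct)
```

Clamping at 0 reflects the fact that Chebyshev’s inequality gives no useful guarantee within 1 SD of the mean.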


Difficulty: ⭐️⭐️

The average score on this problem was 76%.


Problem 1.3

Per Chebyshev’s inequality, at least what percentage of penguins have a mass between 1680 grams and 5880 grams?

Answer: 75%

Recall the proportion of values within z SDs of the mean:

Percent in Range | All Distributions (via Chebyshev’s Inequality) | Normal Distributions
\text{average} \pm 1 \ \text{SD} | \geq 0\% | \approx 68\%
\text{average} \pm 2 \ \text{SDs} | \geq 75\% | \approx 95\%
\text{average} \pm 3 \ \text{SDs} | \geq 88\% | \approx 99.73\%

To approach the problem, we’ll start by converting 1680 grams and 5880 grams to standard units. Doing so yields \frac{1680 - 4200}{840} = -3 and, similarly, \frac{5880 - 4200}{840} = 2. This means that 1680 is 3 standard deviations below the mean, and 5880 is 2 standard deviations above the mean.

The interval from -3 to 2 standard units contains the symmetric interval from -2 to 2 standard units, so: Proportion of values in [-3 SUs, 2 SUs] >= Proportion of values in [-2 SUs, 2 SUs] >= 75%. (Since we cannot assume that the distribution is normal, we use the All Distributions (via Chebyshev’s Inequality) column for the proportion.)

Thus, at least 75% of the penguins have a mass between 1680 grams and 5880 grams.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 72%.


Problem 1.4

The distribution of mass in grams is not roughly normal. Is the distribution of mass in standard units roughly normal?

Answer: No

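Converting to standard units subtracts the mean from every value and divides by the SD. This shifts and rescales the x-axis but does not change the shape of the distribution, so a distribution that is not roughly normal in grams is still not roughly normal in standard units.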
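As a rough illustration, the sketch below standardizes a small right-skewed list and compares a shape measure (skewness) before and after. The masses list is made-up data, not the researchers’ dataset, and the helper names are assumptions for this example.

```python
from statistics import mean, median, pstdev

def standard_units(values):
    """Convert each value to standard units: (value - mean) / SD."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def skewness(values):
    """Average cubed standard unit; positive for right-skewed data."""
    mu, sigma = mean(values), pstdev(values)
    return mean([((v - mu) / sigma) ** 3 for v in values])

# Hypothetical right-skewed masses in grams (illustration only).
masses = [3000, 3200, 3400, 3500, 3600, 3800, 4000, 4500, 5500, 6500]
z = standard_units(masses)

# The mean sits above the median both before and after standardizing.
print(mean(masses) > median(masses), mean(z) > median(z))
# Skewness is the same in grams and in standard units (up to rounding).
print(round(skewness(masses), 6), round(skewness(z), 6))
```

Skewness is only one summary of shape, but it is unchanged by shifting and rescaling, which is the point the answer relies on.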


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 60%.


Problem 1.5

Suppose boot_means is an array of the resampled means. Fill in the blanks below so that [left, right] is a 68% confidence interval for the true mean mass of penguins.

left = np.percentile(boot_means, __(a)__)
right = np.percentile(boot_means, __(b)__)
[left, right]

What goes in blank (a)? What goes in blank (b)?

Answer: (a) 16 (b) 84

Recall, np.percentile(array, p) computes the pth percentile of the numbers in array. A 68% confidence interval keeps the middle 68% of boot_means, so we need the percentiles that cut off the left and right tails.

left percentile: (1-0.68)/2 = 0.32/2 = 0.16, so we use the 16th percentile

right percentile: 1-((1-0.68)/2) = 1-(0.32/2) = 1-0.16 = 0.84, so we use the 84th percentile


Difficulty: ⭐️

The average score on this problem was 94%.


Problem 1.6

Which of the following is a correct interpretation of this confidence interval? Select all that apply.

Answer: Option 4 (If we created many confidence intervals using the same method, approximately 68% of them would contain the mean weight of all penguins in Antarctica.)

Recall, what a k% confidence level states is that approximately k% of the time, the intervals you create through this process will contain the true population parameter.

In this question, our population parameter is the mean weight of all penguins in Antarctica. So 68% of the time, the intervals you create through this process will contain the mean weight of all penguins in Antarctica. This is the same as Option 4. However, the statement becomes false if we state it in the reverse order (Option 1), since the population parameter is a fixed number and only the intervals vary from sample to sample.


Difficulty: ⭐️⭐️

The average score on this problem was 81%.



Problem 2

Researchers from the San Diego Zoo, located within Balboa Park, collected physical measurements of several species of penguins in a region of Antarctica.

One piece of information they tracked for each of 330 penguins was its mass in grams. The average penguin mass is 4200 grams, and the standard deviation is 840 grams.


Problem 2.1

Consider the histogram of mass below.


Select the true statement below.

Answer: The median mass of penguins is less than the average mass of penguins

This distribution is skewed to the right, and the long right tail pulls the mean above the median.


Difficulty: ⭐️⭐️

The average score on this problem was 87%.


Problem 2.2

Which of the following is a valid conclusion that we can draw solely from the histogram above?

Answer: The number of penguins with a mass of at most 3500 grams is greater than the number of penguins with a mass of at least 5500 grams.

Recall, a histogram groups values into intervals (bins) on the x-axis, so we cannot know the frequency of any exact value. Thus, we cannot conclude statements 1, 3, or 4. For statement 3 in particular, since the frequency of an exact value is unknown, it is possible that all the values in this distribution are even. Although the graph shows proportions (as bar areas) rather than counts, we can justify statement 2 by comparing the area to the left of 3500 with the area to the right of 5500. You can either compare the two areas visually or estimate each area by summing height times width over the relevant bars.


Difficulty: ⭐️⭐️

The average score on this problem was 89%.


Problem 2.3

For your convenience, we show the histogram of mass again below.


Recall, there are 330 penguins in our dataset. Their average mass is 4200 grams, and the standard deviation of mass is 840 grams.

Per Chebyshev’s inequality, at least what percentage of penguins have a mass between 3276 grams and 5124 grams? Input your answer as a percentage between 0 and 100, without the % symbol. Round to three decimal places.

Answer: 17.355

Recall that Chebyshev’s inequality states that, no matter what the shape of the distribution is, the proportion of values in the range “average ± z SDs” is at least 1 - \frac{1}{z^2}.

To approach the problem, we’ll start by converting 3276 grams and 5124 grams to standard units. Doing so yields \frac{3276 - 4200}{840} = -1.1 and, similarly, \frac{5124 - 4200}{840} = 1.1. This means that 3276 is 1.1 standard deviations below the mean, and 5124 is 1.1 standard deviations above the mean. Thus, we are calculating the proportion of values in the range “average ± 1.1 SDs”.

When z = 1.1, we have 1 - \frac{1}{z^2} = 1 - \frac{1}{1.1^2} \approx 0.173553719, which as a percentage rounded to three decimal places is 17.355\%.


Difficulty: ⭐️⭐️

The average score on this problem was 76%.


Problem 2.4

Per Chebyshev’s inequality, at least what percentage of penguins have a mass between 1680 grams and 5880 grams?

Answer: 75%

Recall the proportion of values within z SDs of the mean:

Percent in Range | All Distributions (via Chebyshev’s Inequality) | Normal Distributions
\text{average} \pm 1 \ \text{SD} | \geq 0\% | \approx 68\%
\text{average} \pm 2 \ \text{SDs} | \geq 75\% | \approx 95\%
\text{average} \pm 3 \ \text{SDs} | \geq 88\% | \approx 99.73\%

To approach the problem, we’ll start by converting 1680 grams and 5880 grams to standard units. Doing so yields \frac{1680 - 4200}{840} = -3 and, similarly, \frac{5880 - 4200}{840} = 2. This means that 1680 is 3 standard deviations below the mean, and 5880 is 2 standard deviations above the mean.

The interval from -3 to 2 standard units contains the symmetric interval from -2 to 2 standard units, so: Proportion of values in [-3 SUs, 2 SUs] >= Proportion of values in [-2 SUs, 2 SUs] >= 75%. (Since we cannot assume that the distribution is normal, we use the All Distributions (via Chebyshev’s Inequality) column for the proportion.)

Thus, at least 75% of the penguins have a mass between 1680 grams and 5880 grams.
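The reasoning for an interval that is not symmetric around the mean can be sketched as follows. The function chebyshev_bound_for_interval is a hypothetical helper, and it assumes low < mean < high.

```python
def chebyshev_bound_for_interval(low, high, mean, sd):
    """Chebyshev lower bound for the proportion of values in [low, high].

    Assumes low < mean < high. The bound comes from the widest interval
    that is symmetric around the mean and still fits inside [low, high].
    """
    z_low = (mean - low) / sd      # SDs below the mean
    z_high = (high - mean) / sd    # SDs above the mean
    z = min(z_low, z_high)         # the shorter side limits the symmetric interval
    if z <= 1:
        return 0.0                 # Chebyshev gives no information within 1 SD
    return 1 - 1 / z**2

bound = chebyshev_bound_for_interval(1680, 5880, mean=4200, sd=840)
print(bound)
```

Taking the smaller of the two distances is what turns [-3 SUs, 2 SUs] into [-2 SUs, 2 SUs] in the argument above.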


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 72%.


Problem 2.5

The distribution of mass in grams is not roughly normal. Is the distribution of mass in standard units roughly normal?

Answer: No

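Converting to standard units subtracts the mean from every value and divides by the SD. This shifts and rescales the x-axis but does not change the shape of the distribution, so a distribution that is not roughly normal in grams is still not roughly normal in standard units.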


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 60%.


Problem 2.6

Suppose all 330 penguin body masses (in grams) that the researchers collected are stored in an array called masses. We’d like to estimate the probability that two different randomly selected penguins from our dataset have body masses within 50 grams of one another (including a difference of exactly 50 grams). Fill in the missing pieces of the simulation below so that the function estimate_prob_within_50g returns an estimate for this probability.

def estimate_prob_within_50g():
    num_reps = 10000
    within_50g_count = 0
    for i in np.arange(num_reps):
        two_penguins = np.random.choice(__(a)__)
        if __(b)__:
            within_50g_count = within_50g_count + 1
    return within_50g_count / num_reps

What goes in blank (a)? What goes in blank (b)?

Answer: (a) masses, 2, replace=False (b) abs(two_penguins[0] - two_penguins[1])<=50

  1. Recall, np.random.choice(array, n, replace=False) returns n elements chosen at random from array, without replacement. We are randomly choosing 2 different penguins from the masses array, so we call np.random.choice with replace=False.
  2. We want to count the pairs of penguins whose body masses differ by at most 50 grams, so we index into two_penguins to access the two masses and compute their absolute difference with abs(). The problem includes a difference of exactly 50 grams, so the condition uses <= rather than <.
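For reference, here is the simulation with both blanks filled in, as a runnable sketch. Two things differ from the problem and are assumptions made for this example: masses is passed in as a parameter rather than read as a global, and the masses array is randomly generated stand-in data, not the real measurements.

```python
import numpy as np

def estimate_prob_within_50g(masses, num_reps=10000):
    """Estimate P(two different random penguins differ by at most 50 grams)."""
    within_50g_count = 0
    for i in np.arange(num_reps):
        # Blank (a): pick 2 different penguins, so sample without replacement.
        two_penguins = np.random.choice(masses, 2, replace=False)
        # Blank (b): absolute difference of at most 50 grams (50 itself counts).
        if abs(two_penguins[0] - two_penguins[1]) <= 50:
            within_50g_count = within_50g_count + 1
    return within_50g_count / num_reps

# Hypothetical stand-in for the 330 measured masses (illustration only).
np.random.seed(0)
masses = np.random.normal(4200, 840, 330)
print(estimate_prob_within_50g(masses))
```

Because the estimate is based on random draws, repeated calls return slightly different values.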

Difficulty: ⭐️⭐️

The average score on this problem was 84%.


Problem 2.7

Recall, there are 330 penguins in our dataset. Their average mass is 4200 grams, and the standard deviation of mass is 840 grams. Assume that the 330 penguins in our dataset are a random sample from the population of all penguins in Antarctica. Our sample gives us one estimate of the population mean.

To better estimate the population mean, we bootstrapped our sample and plotted a histogram of the resample means, then took the middle 68 percent of those values to get a confidence interval. Which option below shows the histogram of the resample means and the confidence interval we found?

Option 1

Option 2

Option 3

Option 4

Answer: Option 2

Recall, according to the Central Limit Theorem (CLT), the probability distribution of the sum or mean of a large random sample drawn with replacement will be roughly normal, regardless of the distribution of the population from which the sample is drawn.

Thus, our graph should have a normal distribution. We eliminate Option 4.

Recall that the standard normal curve has inflection points at z = \pm 1, and about 68% of a normal distribution lies between them. (An inflection point is where a curve goes from “opening down” to “opening up”.) Since we have a 68% confidence interval in this question, its endpoints should sit at the inflection points of the histogram, so we can eliminate Option 3.

To compute the SD of the sample mean’s distribution, when we don’t know the population’s SD, we can use the sample’s SD (840): \text{SD of Distribution of Possible Sample Means} \approx \frac{\text{Sample SD}}{\sqrt{\text{sample size}}} = \frac{840}{\sqrt{330}} \approx 46.24

Recall the proportion of values within z SDs of the mean:

Percent in Range | All Distributions (via Chebyshev’s Inequality) | Normal Distributions
\text{average} \pm 1 \ \text{SD} | \geq 0\% | \approx 68\%
\text{average} \pm 2 \ \text{SDs} | \geq 75\% | \approx 95\%
\text{average} \pm 3 \ \text{SDs} | \geq 88\% | \approx 99.73\%

In this question, we want a 68% confidence interval. Given that the distribution of the sample mean is roughly normal, our CI should be approximately \text{sample mean} \pm 1 \ \text{SD}, where the SD is that of the distribution of possible sample means. Thus, the interval is approximately [4200 - 46.24, 4200 + 46.24] = [4153.76, 4246.24]. Comparing the intervals in Options 1 and 2, we choose Option 2, since its interval is approximately the same.
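The numbers in this solution can be reproduced with a few lines of arithmetic. This is only a check of the CLT-based approximation, using the sample mean, sample SD, and sample size given in the problem.

```python
import math

sample_mean, sample_sd, n = 4200, 840, 330   # values given in the problem

# SD of the distribution of possible sample means, estimated with the sample SD.
se = sample_sd / math.sqrt(n)

# A 68% interval for a roughly normal distribution is the center plus or minus 1 SD.
interval = (sample_mean - se, sample_mean + se)
print(round(se, 2), [round(endpoint, 2) for endpoint in interval])
```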


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 66%.


Problem 2.8

Suppose boot_means is an array of the resampled means. Fill in the blanks below so that [left, right] is a 68% confidence interval for the true mean mass of penguins.

left = np.percentile(boot_means, __(a)__)
right = np.percentile(boot_means, __(b)__)
[left, right]

What goes in blank (a)? What goes in blank (b)?

Answer: (a) 16 (b) 84

Recall, np.percentile(array, p) computes the pth percentile of the numbers in array. A 68% confidence interval keeps the middle 68% of boot_means, so we need the percentiles that cut off the left and right tails.

left percentile: (1-0.68)/2 = 0.32/2 = 0.16, so we use the 16th percentile

right percentile: 1-((1-0.68)/2) = 1-(0.32/2) = 1-0.16 = 0.84, so we use the 84th percentile
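The sketch below shows one way boot_means might be built and how the two percentiles fit together. The sample here is randomly generated stand-in data (an assumption, since the real masses are not given), and the number of resamples is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(4200, 840, 330)   # hypothetical stand-in for the 330 masses

# Bootstrap: resample the sample with replacement and record each resample mean.
boot_means = np.array([
    rng.choice(sample, size=len(sample), replace=True).mean()
    for _ in range(2000)
])

confidence = 68
tail = (100 - confidence) / 2         # 16% cut from each side
left = np.percentile(boot_means, tail)
right = np.percentile(boot_means, 100 - tail)
print([left, right])
```

Writing the tail as (100 - confidence) / 2 makes it easy to switch to another confidence level, such as 95, without recomputing the percentiles by hand.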


Difficulty: ⭐️

The average score on this problem was 94%.


Problem 2.9

Which of the following is a correct interpretation of this confidence interval? Select all that apply.

Answer: Option 4 (If we created many confidence intervals using the same method, approximately 68% of them would contain the mean weight of all penguins in Antarctica.)

Recall, what a k% confidence level states is that approximately k% of the time, the intervals you create through this process will contain the true population parameter.

In this question, our population parameter is the mean weight of all penguins in Antarctica. So 68% of the time, the intervals you create through this process will contain the mean weight of all penguins in Antarctica. This is the same as Option 4. However, the statement becomes false if we state it in the reverse order (Option 1), since the population parameter is a fixed number and only the intervals vary from sample to sample.
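The “many intervals” interpretation can be illustrated with a small simulation. Everything here is an assumption made for the example: the population is randomly generated, and each interval is built as sample mean ± 1 SE (the CLT-based shortcut) rather than by bootstrapping, to keep the sketch fast.

```python
import random
import statistics

random.seed(1)
# Hypothetical population of penguin masses; its mean is the fixed parameter.
population = [random.gauss(4200, 840) for _ in range(20000)]
true_mean = statistics.fmean(population)

def one_interval(sample_size=330):
    """Draw one random sample and return an approximate 68% CI for the mean."""
    sample = random.sample(population, sample_size)
    center = statistics.fmean(sample)
    se = statistics.pstdev(sample) / sample_size ** 0.5
    return center - se, center + se

num_intervals = 1000
intervals = [one_interval() for _ in range(num_intervals)]
# The parameter never moves; what varies is whether each interval captures it.
hits = sum(lo <= true_mean <= hi for lo, hi in intervals)
coverage = hits / num_intervals
print(coverage)
```

The printed proportion should land near 0.68, though it varies with the seed and the number of intervals.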


Difficulty: ⭐️⭐️

The average score on this problem was 81%.



Source: fa22-final — Q11

Problem 3

Suppose variables v1, v2, v3, and v4, have already been initialized to specific numerical values. Right now, we don’t know what values they’ve been set to.

The function f shown below takes in a number, v, and outputs an integer between -2 and 2, depending on the value of v relative to v1, v2, v3, and v4.

def f(v):
    if v <= v1:
        return -2
    elif v <= v2:
        return -1
    elif v <= v3:
        return 0
    elif v <= v4:
        return 1
    else:
        return 2

Recall that in the previous problem, we created an array called sample_means containing 10,000 values, each of which is the mean of a random sample of 100 applicant ages drawn from the DataFrame apps, in which ages have a mean of 35 and a standard deviation of 10.

When we call the function f on every value v in sample_means, we produce a collection of 10,000 values all between -2 and 2. A density histogram of these values is shown below.


The heights of the five bars in this histogram, reading from left to right, are

x, 3x, 12x, 3x, x.


Problem 3.1

What is the value of x (i.e. the height of the shortest bar in the histogram)? Give your answer as a fully simplified fraction.

Answer: \frac{1}{20}

In any density histogram, the total area of all bars is 1. This histogram has five bars, each of which has a width of 1 (e.g. 3 - 2 = 1). Since \text{Area} = \text{Height} \cdot \text{Width}, we have that the area of each bar is equal to its height. So, the total area of the histogram in this case is the sum of the heights of each bar:

\text{Total Area} = x + 3x + 12x + 3x + x = 20x

Since we know that the total area is equal to 1, we have

20x = 1 \implies \boxed{x = \frac{1}{20}}


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 72%.


Problem 3.2

What does the expression below evaluate to? Give your answer as an integer.

np.count_nonzero((sample_means > v2) & (sample_means <= v4))

Hint: Don’t try to find the values of v2 and v4 – you can answer this question without them!

Answer: 7,500

First, it’s a good idea to understand what the integer we’re trying to find actually means in the context of the information provided. In this case, it’s the number of sample means that are greater than v2 and less than or equal to v4. Here’s how to arrive at that conclusion:

  1. First, note that sample_means is an array of length 10,000.
  2. sample_means > v2 and sample_means <= v4 are both Boolean arrays of length 10,000.
  3. (sample_means > v2) & (sample_means <= v4) is also a Boolean array of length 10,000, which contains True for every sample mean that is greater than v2 and less than or equal to v4 and False for every other sample mean.
  4. Finally, np.count_nonzero((sample_means > v2) & (sample_means <= v4)) is a number between 0 and 10,000, corresponding to the number of True elements in the array (sample_means > v2) & (sample_means <= v4).

Remember, the histogram we’re looking at visualizes the distribution of the 10,000 values that result from calling f on every value in sample_means. To proceed, we need to understand how the function f decides what value to return for a given input, v:

  • If the input v is greater than v2, then the first two conditions (v <= v1 and v <= v2) are False, and so the only possible values of f(v) are 0, 1, or 2.
  • If the input v is less than or equal to v4, the only possible values of f(v) are -2, -1, 0, 1.
  • Thus, if the input v is greater than v2 and less than or equal to v4, the only possible values of f(v) are 0 and 1.

Now, our job boils down to finding the number of values in the visualized distribution that are equal to 0 or 1. This is equivalent to finding the number of values that fall in the [0, 1) and [1, 2) bins – since f only returns integer values, the only value in the [0, 1) bin is 0 and the only value in the [1, 2) bin is 1 (remember, histogram bins are left-inclusive and right-exclusive).

To do this, we need to find the proportion of values in those two bins, and multiply that proportion by the total number of values (10,000).

We know that the area of a bar is equal to the proportion of values in that bin. We also know that, in this case, the area of each bar is equal to its height, since the width of each bin is 1. Thus, the proportion of values in a given bin is equal to the height of the corresponding bar. As such, the proportion of values in the [0, 1) bin is 12x, and the proportion of values in the [1, 2) bin is 3x, meaning the proportion of values in the histogram that are equal to either 0 or 1 is 12x + 3x = 15x.

In the previous subpart, we found that x = \frac{1}{20}, so the proportion of values in the histogram that are equal to either 0 or 1 is 15 \cdot \frac{1}{20} = \frac{3}{4}, and since there are 10,000 values being visualized in the histogram in total, \frac{3}{4} \cdot 10,000 = 7,500 of them are equal to either 0 or 1.

Thus, 7,500 of the values in sample_means are greater than v2 and less than or equal to v4, so np.count_nonzero((sample_means > v2) & (sample_means <= v4)) evaluates to 7,500.

Note: It’s possible to answer this subpart without knowing the value of x, i.e. without answering the previous subpart. The area of the [0, 1) and [1, 2) bars is 15x, and the total area of the histogram is 20x. So, the proportion of the area in [0, 1) or [1, 2) is \frac{15x}{20x} = \frac{15}{20} = \frac{3}{4}, which is the same value we found by substituting x = \frac{1}{20} into 15x.
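The bar-area bookkeeping from the last two subparts can be written out with exact fractions. The variable names below are introduced for this sketch only.

```python
from fractions import Fraction

outputs = [-2, -1, 0, 1, 2]            # possible values of f(v)
relative_heights = [1, 3, 12, 3, 1]    # bar heights in units of x
bin_width = 1

# Total area is 1, so x * (sum of relative heights) * width = 1.
x = Fraction(1, sum(relative_heights) * bin_width)

# Area of each bar = proportion of the 10,000 values equal to that output.
proportions = {out: h * x * bin_width for out, h in zip(outputs, relative_heights)}

# v2 < v <= v4 corresponds to f(v) being 0 or 1.
count = (proportions[0] + proportions[1]) * 10_000
print(x, count)
```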


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 40%.


Suppose we have run the code below.

from scipy import stats

def g(u):
    return stats.norm.cdf(u) - stats.norm.cdf(-u)

Several input-output pairs for the function g are shown in the table below. Some of them will be useful to you in answering the questions that follow.

u | g(u)
0.84 | 0.60
1.28 | 0.80
1.65 | 0.90
2.25 | 0.975


Problem 3.3

What is the value of v3, one of the variables used in the function f? Give your answer as a number.

Hint: Use the histogram as well as one of the rows of the table above.

Answer: 35.84

The table provided above tells us the proportion of values within u standard deviations of the mean in a normal distribution, for various values of u. For instance, it tells us that the proportion of values within 1.28 standard deviations of the mean in a normal distribution is 0.8.

Let’s reflect on what we know at the moment:

  • The distribution of the sample mean is roughly normal, by the Central Limit Theorem. Normal distributions are symmetric, and have a “peak” at the center. The histogram above is also symmetric and has its peak at its center.
  • The proportion of values in the histogram that are equal to 0 is 12x = 12 \cdot \frac{1}{20} = 0.6.
  • The function f returns 0 for all inputs that are greater than v2 and less than or equal to v3. This, combined with the fact above, tells us that the proportion of sample means between v2 (exclusive) and v3 (inclusive) is 0.6.
  • From the table provided, we know that in a normal distribution, the proportion of values within 0.84 standard deviations of the mean is 0.6.

Combining the facts above, we have that v2 is 0.84 standard deviations below the mean of the sample mean’s distribution and v3 is 0.84 standard deviations above the mean of the sample mean’s distribution.

The sample mean’s distribution has the following characteristics:

\begin{align*} \text{Mean of Distribution of Possible Sample Means} &= \text{Population Mean} = 35 \\ \text{SD of Distribution of Possible Sample Means} &= \frac{\text{Population SD}}{\sqrt{\text{Sample Size}}} = \frac{10}{\sqrt{100}} = 1 \end{align*}

0.84 standard deviations above the mean of the sample mean’s distribution is:

35 + 0.84 \cdot \frac{10}{\sqrt{100}} = 35 + 0.84 \cdot 1 = \boxed{35.84}

So, the value of v3 is 35.84.
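The calculation can be checked numerically. The sketch below uses the standard library’s statistics.NormalDist as a stand-in for scipy.stats.norm (a substitution made here so the example has no third-party dependencies); g follows the definition given earlier.

```python
from statistics import NormalDist

standard_normal = NormalDist()   # mean 0, SD 1; stand-in for scipy.stats.norm

def g(u):
    """Proportion of a normal distribution within u SDs of the mean."""
    return standard_normal.cdf(u) - standard_normal.cdf(-u)

pop_mean, pop_sd, sample_size = 35, 10, 100
sd_of_sample_means = pop_sd / sample_size ** 0.5

u = 0.84                                  # table row where g(u) is about 0.60
v2 = pop_mean - u * sd_of_sample_means    # 0.84 SDs below the mean
v3 = pop_mean + u * sd_of_sample_means    # 0.84 SDs above the mean
print(round(g(u), 2), round(v2, 2), round(v3, 2))
```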


Difficulty: ⭐️⭐️⭐️⭐️⭐️

The average score on this problem was 9%.


Problem 3.4

Which of the following is closest to the value of the expression below?

np.percentile(sample_means, 5)

Answer: 33.35

The table provided tells us that in a normal distribution, 90% of values are within 1.65 standard deviations of the mean. Since normal distributions are symmetric, it also means that 5% of values are more than 1.65 standard deviations above the mean and, more importantly, 5% of values are more than 1.65 standard deviations below the mean.

The 5th percentile of a distribution is the smallest value that is greater than or equal to 5% of values, so in this case the 5th percentile is 1.65 SDs below the mean. As in the previous subpart, the mean and SD we are referring to are the mean and SD of the distribution of sample means (sample_means), which we found to be 35 and 1, respectively.

1.65 standard deviations below this mean is

35 - 1.65 \cdot 1 = \boxed{33.35}
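A simulation gives a rough check of this value. The population below is an assumption (ages drawn from a normal distribution with mean 35 and SD 10, since apps itself is not available), fewer sample means are drawn than in the problem to keep it quick, and statistics.quantiles is used in place of np.percentile, so the percentile definition differs slightly.

```python
import random
import statistics
from statistics import NormalDist

random.seed(0)
num_samples, sample_size = 2000, 100

# Each entry is the mean of one random sample of 100 hypothetical ages.
sample_means = [
    statistics.fmean([random.gauss(35, 10) for _ in range(sample_size)])
    for _ in range(num_samples)
]

# With n=20, quantiles returns 19 cut points; the first marks the 5th percentile.
simulated_p5 = statistics.quantiles(sample_means, n=20)[0]

# CLT approximation: sample means are roughly normal with mean 35 and SD 1.
theoretical_p5 = NormalDist(35, 1).inv_cdf(0.05)
print(round(simulated_p5, 2), round(theoretical_p5, 2))
```

Both values should be close to the answer of 33.35, with the simulated one varying by seed.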


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 45%.



Source: fa23-final — Q11

Problem 4

On Reddit, Yutian read that 22% of all online transactions are fraudulent. She decides to test the following hypotheses:

To test her hypotheses, she decides to create a 95% confidence interval for the proportion of online transactions that are fraudulent using the Central Limit Theorem.

Unfortunately, she doesn’t have access to the entire txn DataFrame; rather, she has access to a simple random sample of txn of size n. In her sample, the proportion of transactions that are fraudulent is 0.2 (or equivalently, \frac{1}{5}).


Problem 4.1

The width of Yutian’s confidence interval is of the form \frac{c}{5 \sqrt{n}}

where n is the size of her sample and c is some positive integer. What is the value of c? Give your answer as an integer.

Hint: Use the fact that in a collection of 0s and 1s, if the proportion of values that are 1 is p, the standard deviation of the collection is \sqrt{p(1-p)}.

Answer: 8

First, we can calculate the standard deviation of the sample using the given formula: \sqrt{0.2\cdot(1-0.2)} = \sqrt{0.16}= 0.4. Additionally, we know that the width of a 95% confidence interval for a population mean (including a proportion) is approximately \frac{4 \cdot \text{sample SD}}{\sqrt{n}}, since 95% of a normal distribution falls within two standard deviations of the mean on either side, and the SD of the distribution of the sample mean is approximately \frac{\text{sample SD}}{\sqrt{n}}. Now, plugging the sample standard deviation into this formula, we can set this expression equal to the given formula for the width of the confidence interval: \frac{c}{5 \sqrt{n}} = \frac{4 \cdot 0.4}{\sqrt{n}}. We can multiply both sides by \sqrt{n}, and we’re left with \frac{c}{5} = 4 \cdot 0.4. Now, all we have to do is solve for c by multiplying both sides by 5, which gives c = 8.
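A quick numeric check of c is sketched below. The function ci_width is a hypothetical helper, and n = 400 is an arbitrary choice, since c does not depend on n.

```python
import math

p_hat = 0.2
# SD of a collection of 0s and 1s in which a proportion p_hat are 1s.
sample_sd = math.sqrt(p_hat * (1 - p_hat))

def ci_width(n):
    """Approximate width of a 95% CLT-based confidence interval."""
    return 4 * sample_sd / math.sqrt(n)

# width = c / (5 * sqrt(n)), so c = width * 5 * sqrt(n) for any n.
n = 400
c = ci_width(n) * 5 * math.sqrt(n)
print(round(c, 6))
```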


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 59%.


Problem 4.2

There is a positive integer J such that:

What is the value of J? Give your answer as an integer.

Answer: 1600

Here, we have to use the formula for the endpoints of the 95% confidence interval to find the largest value of n such that 0.22 is contained in the interval. The endpoints are given by \text{sample mean} \pm 2 \cdot \frac{\text{sample SD}}{\sqrt{n}}. Since the null hypothesis is that the proportion is 0.22 (which is greater than our sample mean), we only need to work with the right endpoint for this question. Plugging in the values that we have, the right endpoint is given by 0.2 + 2 \cdot \frac{0.4}{\sqrt{n}}. Now we must find the values of n that satisfy the inequality 0.2 + 2 \cdot \frac{0.4}{\sqrt{n}} \geq 0.22. The left side shrinks as n grows, and we’re looking for the largest such n (i.e., the last n for which this inequality holds), so we can simply set the two sides equal to each other and solve for n. From 0.2 + 2 \cdot \frac{0.4}{\sqrt{n}} = 0.22, we can subtract 0.2 from both sides, then multiply both sides by \sqrt{n}, and divide both sides by 0.02 (from 0.22 - 0.2). This yields \sqrt{n} = \frac{2 \cdot 0.4}{0.02} = 40, which implies that n is 1600.
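The boundary case can be verified with exact arithmetic, which avoids floating-point trouble when the endpoint lands exactly on 0.22. The helper right_endpoint_covers is a hypothetical name, and the sample SD of 0.4 is entered directly as the fraction 2/5.

```python
from fractions import Fraction

p_hat = Fraction(1, 5)        # sample proportion, 0.2
p_null = Fraction(22, 100)    # proportion under the null hypothesis, 0.22
sample_sd = Fraction(2, 5)    # sqrt(0.2 * 0.8) = 0.4

# p_hat + 2*sd/sqrt(n) >= p_null  is equivalent to  sqrt(n) <= 2*sd / (p_null - p_hat).
max_sqrt_n = 2 * sample_sd / (p_null - p_hat)
largest_n = max_sqrt_n ** 2

def right_endpoint_covers(n):
    """True if the right endpoint of the interval is at least p_null."""
    # Compare squares so that no square root is needed.
    return (2 * sample_sd) ** 2 >= (p_null - p_hat) ** 2 * n

print(largest_n, right_endpoint_covers(1600), right_endpoint_covers(1601))
```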


Difficulty: ⭐️⭐️⭐️⭐️⭐️

The average score on this problem was 21%.



Problem 5

Suppose you have correctly implemented the function area_between(x, y) so that it returns the area under the armchair curve between x and y, assuming the inputs satisfy -4 <= x <= y <= 12.

Note: You can still do this question, even if you didn’t know how to do the previous one.


Problem 5.1

What is the approximate value of area_between(-2, 10)?

Answer: 1.95

The area we want to find is shown below in two colors. We can find the area in each half of the armchair curve separately and add the results.

For the yellow area, we know that the area within 2 standard deviations of the mean on the standard normal curve is 0.95. The remaining 0.05 is split equally on both sides, so the yellow area is 0.975.

The blue area is the same by symmetry so the total shaded area is 0.975*2 = 1.95.

Equivalently, we can use the fact that the total area under the armchair curve is 2, and the amount of unshaded area on either side is 0.025, so the total shaded area is 2 - (0.025*2) = 1.95.


Difficulty: ⭐️⭐️

The average score on this problem was 76%.


Problem 5.2

What is the approximate value of area_between(0.37, 8.37)?

Answer: 1

The area we want to find is shown below in two colors.

As we saw in Problem 12.2 of the original exam (the blank (b) subpart, Problem 6.2 on this page), the point on the left half of the armchair curve that corresponds to 8.37 is 0.37. This means that if we move the blue area from the right half of the armchair curve to the left half, it will fit perfectly, as shown below.

Therefore the total of the blue and yellow areas is the same as the area under one standard normal curve, which is 1.


Difficulty: ⭐️⭐️

The average score on this problem was 76%.



Problem 6

An IKEA chair designer is experimenting with some new ideas for armchair designs. She has the idea of making the arm rests shaped like bell curves, or normal distributions. A cross-section of the armchair design is shown below.


This was created by taking the portion of the standard normal distribution from z=-4 to z=4 and adjoining two copies of it, one centered at z=0 and the other centered at z=8. Let’s call this shape the armchair curve.

Since the area under the standard normal curve from z=-4 to z=4 is approximately 1, the total area under the armchair curve is approximately 2.

Complete the implementation of the two functions below:

  1. area_left_of(z) should return the area under the armchair curve to the left of z, assuming -4 <= z <= 12, and
  2. area_between(x, y) should return the area under the armchair curve between x and y, assuming -4 <= x <= y <= 12.

import scipy.stats

def area_left_of(z):
    '''Returns the area under the armchair curve to the left of z.
       Assume -4 <= z <= 12'''
    if ___(a)___: 
        return ___(b)___ 
    return scipy.stats.norm.cdf(z)

def area_between(x, y):
    '''Returns the area under the armchair curve between x and y. 
    Assume -4 <= x <= y <= 12.'''
    return ___(c)___


Problem 6.1

What goes in blank (a)?

Answer: z>4 or z>=4

The body of the function contains an if statement followed by a return statement, which executes only when the if condition is false. In that case, the function returns scipy.stats.norm.cdf(z), which is the area under the standard normal curve to the left of z. When z is in the left half of the armchair curve, the area under the armchair curve to the left of z is the area under the standard normal curve to the left of z because the left half of the armchair curve is a standard normal curve, centered at 0. So we want to execute the return statement in that case, but not if z is in the right half of the armchair curve, since in that case the area to the left of z under the armchair curve should be more than 1, and scipy.stats.norm.cdf(z) can never exceed 1. This means the if condition needs to correspond to z being in the right half of the armchair curve, which corresponds to z>4 or z>=4, either of which is a correct solution.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 72%.


Problem 6.2

What goes in blank (b)?

Answer: 1+scipy.stats.norm.cdf(z-8)

This blank should contain the value we want to return when z is in the right half of the armchair curve. In this case, the area under the armchair curve to the left of z is the sum of two areas:

  1. the area under the entire left half of the armchair curve, which is 1, and
  2. the area under the portion of the right half of the armchair curve that falls to the left of z.

Since the right half of the armchair curve is just a standard normal curve that’s been shifted to the right by 8 units, the area under that normal curve to the left of z is the same as the area to the left of z-8 on the standard normal curve that’s centered at 0. Adding the portion from the left half and the right half of the armchair curve gives 1+scipy.stats.norm.cdf(z-8).

For example, if we want to find the area under the armchair curve to the left of 9, we need to total the yellow and blue areas in the image below.

The yellow area is 1 and the blue area is the same as the area under the standard normal curve (or the left half of the armchair curve) to the left of 1 because 1 is the point on the left half of the armchair curve that corresponds to 9 on the right half. In general, we need to subtract 8 from a value on the right half to get the corresponding value on the left half.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 54%.


Problem 6.3

What goes in blank (c)?

Answer: area_left_of(y) - area_left_of(x)

In general, we can find the area under any curve between x and y by taking the area under the curve to the left of y and subtracting the area under the curve to the left of x. Since we have a function to find the area to the left of any given point in the armchair curve, we just need to call that function twice with the appropriate inputs and subtract the result.
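
Putting the three blanks together, the two functions can be sketched roughly as follows. This is an illustration rather than the exam's exact code: `NormalDist().cdf` from Python's standard library stands in for `scipy.stats.norm.cdf` so the sketch runs without extra packages, and the name `area_between` is a hypothetical label for the function that contains blank (c).

```python
from statistics import NormalDist

# Stand-in for scipy.stats.norm.cdf: area under the standard normal curve left of z.
norm_cdf = NormalDist().cdf

def area_left_of(z):
    """Area under the armchair curve to the left of z."""
    if z > 4:                        # blank (a): z is in the right half
        return 1 + norm_cdf(z - 8)   # blank (b): entire left half plus part of the right half
    return norm_cdf(z)               # z is in the left half

def area_between(x, y):
    """Area under the armchair curve between x and y (hypothetical name)."""
    return area_left_of(y) - area_left_of(x)   # blank (c)

print(area_left_of(9))      # equals 1 + norm_cdf(1)
print(area_between(0, 8))   # half of each bump
```

Since each half of the armchair curve has area 1, `area_between(0, 8)` should come out very close to 1, which is a quick sanity check on the three blanks.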


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 60%.



Source: sp23-final — Q3

Problem 7

This summer, Zoe wants to explore parts of the United States that she hasn’t been to yet. In her process of figuring out where to go, she creates a histogram depicting the distribution of the number of sunshine hours in July across all cities in the United States in sun.


Suppose usa is a DataFrame with all of the columns in sun but with only the rows where "Country" is "United States".


Problem 7.1

What is the value of mystery below?

    cond = (usa.get("Jul") >= 370) & (usa.get("Jul") < 430)
    mystery = 100 * np.count_nonzero(cond) / usa.shape[0]

Answer: 12

cond is a Series that contains True for each row in usa where "Jul" is greater than or equal to 370 and less than 430. mystery, then, is the percentage of values in usa in which cond is True. This is because np.count_nonzero(cond) is the number of Trues in cond, np.count_nonzero(cond) / usa.shape[0] is the proportion of values in cond that are True, and 100 * np.count_nonzero(cond) / usa.shape[0] is the percentage of values in cond that are True. Our goal here, then, is to use the histogram to find the percentage of values in the histogram between 370 (inclusive) and 430 (exclusive).

We know that in histograms, the area of each bar is equal to the proportion of data points that fall within its bin’s range. Conveniently, there’s only one bar we need to look at – the one corresponding to the bin [370, 430). That bar has a width of 430 - 370 = 60 and a height of 0.002. Then, the area of that bar – i.e. the proportion of values that are between 370 (inclusive) and 430 (exclusive) is:

\text{proportion} = \text{area} = \text{height} \cdot \text{width} = 0.002 \cdot 60 = 0.12

This means that the proportion of values in [370, 430) is 0.12, which means that the percentage of values in [370, 430) is 12%, and that mystery evaluates to 12.


Difficulty: ⭐️⭐️

The average score on this problem was 83%.


Problem 7.2

There are 5 more cities with between 370 and 430 sunshine hours in July than there are cities with between 270 and 290 sunshine hours in July.

How many cities in the United States are in sun? Give your answer as a positive integer, rounded to the nearest multiple of 10 (that is, your answer should end in a 0).

Answer: 250

In the previous part, we learned that the proportion of cities in the usa DataFrame in the interval [370, 430) (i.e. that have between 370 and 430 sunshine hours in July) is 0.12. To use the fact that there are 5 more cities in the interval [370, 430) than there are in the interval [270, 290), we need to first find the proportion of cities in the interval [270, 290). To do so, we look at the [270, 290) bin, which has a width of 290 - 270 = 20 and a height of 0.005:

\text{proportion} = 0.005 \cdot 20 = 0.10

We are told that there are 5 more cities in the [370, 430) interval than there are in the [270, 290) interval. Given the proportions we’ve computed, we have that:

\text{difference in proportions} = 0.12 - 0.1 = 0.02

If 0.02 \cdot \text{number of cities} is 5, then \text{number of cities} = 5 \cdot \frac{1}{0.02} = 5 \cdot 50 = 250.
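
The arithmetic in Problems 7.1 and 7.2 can be checked with a short sketch. The bar heights (0.002 and 0.005) are the values read off the histogram in the solutions above; the helper name `bar_proportion` is made up for illustration.

```python
def bar_proportion(height, width):
    """Proportion of data in one bar of a density histogram (area = height * width)."""
    return height * width

p_370_430 = bar_proportion(0.002, 430 - 370)   # bin [370, 430)
p_270_290 = bar_proportion(0.005, 290 - 270)   # bin [270, 290)

mystery = 100 * p_370_430                      # percentage, as in Problem 7.1
num_cities = 5 / (p_370_430 - p_270_290)       # 5 cities account for the difference in proportions

print(round(mystery), round(num_cities))       # 12 250
```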


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 49%.


Now, suppose we convert the number of sunshine hours in July for all cities in the United States (i.e., “US cities”) in sun from their original units (hours) to standard units.


Problem 7.3

Let m be the mean number of sunshine hours in July for all US cities in sun, in standard units. Select the true statement below.

Answer: m = 0

When we standardize a dataset, the mean of the resulting values is always 0 and the standard deviation of the resulting values is always 1. This tells us right away that the answer is m = 0. Intuitively, we know that a value in standard units represents the number of standard deviations that value is above or below the mean of the column it came from. m is equal to the mean of the column it came from, so m in standard units is 0.

If we’d like to approach this more algebraically, we can remember the formula for converting a value x_i from a column x to standard units:

x_{i \: \text{(su)}} = \frac{x_i - \text{mean of } x}{\text{SD of } x}

Let x be the column (i.e. Series) containing the number of sunshine hours in July for all US cities in sun. m, by definition, is the mean of x. Then,

m_{\text{(su)}} = \frac{m - \text{mean of } x}{\text{SD of } x} = \frac{m - m}{\text{SD of }x} = 0

Given that m is the mean of column x, the numerator of m_\text{(su)} is 0, and hence m_\text{(su)} is 0.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 62%.


Problem 7.4

Let s be the standard deviation of the number of sunshine hours in July for all US cities in sun, in standard units. Select the true statement below.

Answer: s = 1

As mentioned in the previous solution, when we standardize a dataset, the mean of the resulting values is always 0 and the standard deviation of the resulting values is always 1.
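
To see the "mean 0, SD 1" property in action, here is a small sketch using only Python's standard library. The list `hours` is hypothetical data invented for illustration (it is not taken from sun), and `pstdev` is the population SD, which is assumed to match the SD convention used in the course.

```python
from statistics import mean, pstdev

# Hypothetical right-skewed values, for illustration only (not from sun).
hours = [250, 260, 270, 280, 290, 300, 320, 350, 400, 480]

def standard_units(values):
    """Convert each value to standard units: (value - mean) / SD."""
    m, sd = mean(values), pstdev(values)
    return [(v - m) / sd for v in values]

su = standard_units(hours)
print(round(mean(su), 10), round(pstdev(su), 10))   # mean is (numerically) 0, SD is 1
```

Any other list of numbers with at least two distinct values would give the same two results, which is why the answers to Problems 7.3 and 7.4 do not depend on the histogram.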


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 46%.



Problem 7.5

Let d be the median of the number of sunshine hours in July for all US cities in sun, in standard units. Select the true statement below.

Answer: -1 < d < 0

In the histogram, we see that the distribution of the number of sunshine hours in July for all US cities in sun is skewed right, or has a right tail. This means that this distribution’s mean is dragged in the direction of its tail and is larger than its median. Since the mean in standard units is 0, and the median is less than the mean, the median in standard units must be negative. There’s no property that states that the median is exactly -1, and the median is only slightly less than the mean, which means that it must be the case that -1 < d < 0.


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 42%.


Problem 7.6

True or False: The distribution of the number of sunshine hours in July for all US cities in sun, in standard units, is roughly normal.

Answer: False

The original histogram depicting the distribution of the number of sunshine hours in July for all US cities is right-skewed. When data is converted to standard units, the shape of the distribution does not change. Therefore, if the original data is right-skewed, the standardized data will also be right-skewed.


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 45%.



Source: sp23-final — Q8

Problem 8

Australia is in the southern hemisphere, meaning that its summer season is from December through February, when we have our winter. As a result, January is typically one of the sunniest months in Australia!

Arjun is a big fan of the movie Kangaroo Jack and plans on visiting Australia this January. In doing his research on where to go, he found the number of sunshine hours in January for the 15 Australian cities in sun and sorted them in descending order.

356, 337, 325, 306, 294, 285, 285, 266, 263, 257, 255, 254, 220, 210, 176

Throughout this question, use the mathematical definition of percentiles presented in class.

Note: Parts 1, 2, and 3 of this problem are out of scope; they cover material no longer included in the course. Part 4 is in scope!


Problem 8.1

What is the 80th percentile of the collection of numbers above?

Answer: 306

First, we need to find the position of the 80th percentile using the rule from class:

h = \left(\frac{80}{100}\right) \cdot 15 = \frac{4}{5} \cdot 15 = 12

Since 12 is an integer, we don’t need to round up, so k = 12. Starting from the right-most number, which is the smallest number and hence position 1 here, the 12th number is 306.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 52%.


Problem 8.2

What is the largest positive integer p such that 257 is the pth percentile of the collection of numbers above?

Answer: 40

The first step is to find the position of 257 in the collection when we start at position 1, which is 6. Since there are 15 values total, this means that 257 is the smallest value that is greater than or equal to 40% of the values.

If we set p to be any number larger than 40, say, 41, then 257 won’t be larger than p\% of the values in the collection; thus, the largest positive integer value of p that makes 257 the pth percentile is 40.


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 30%.



Problem 8.3

What is the smallest positive integer p such that 257 is the pth percentile of the collection of numbers above? (Make sure your answer to (c) is smaller than your answer to (b)!)

Answer: 34

Let’s look at the next number down from 257, which is 255. 255 is the 5th number out of 15, so it is the smallest number that is greater than or equal to 33.333% of the values. This means the 33rd percentile is also 255, since 33.333 > 33. However, 255 is not greater than or equal to 34% of the values, which makes the 34th percentile 257. Therefore, 34 is the smallest integer value of p such that the pth percentile is 257.
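
The percentile rule used in Problems 8.1 through 8.3 can be written as a short function and used to check all three answers. This is a sketch of the mathematical definition (not `np.percentile`); the function name `percentile` and the integer-ceiling trick are choices made here for illustration.

```python
def percentile(p, values):
    """p-th percentile under the mathematical definition (p is an integer from 1 to 100)."""
    ordered = sorted(values)
    # k is the smallest integer >= (p / 100) * n, computed with integers to avoid rounding error.
    k = -(-p * len(ordered) // 100)
    return ordered[k - 1]            # positions start at 1, list indices at 0

january = [356, 337, 325, 306, 294, 285, 285, 266, 263, 257, 255, 254, 220, 210, 176]

print(percentile(80, january))                                    # 306
matches = [p for p in range(1, 101) if percentile(p, january) == 257]
print(min(matches), max(matches))                                 # 34 40
```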


Difficulty: ⭐️⭐️⭐️⭐️⭐️

The average score on this problem was 21%.


Teresa also wants to go to Australia, but can’t take time off work in January, and so she plans a trip to The Land Down Under (Australia) in February instead. She finds that the mean number of sunshine hours in February for all 15 Australian cities in sun is 250, with a standard deviation of 15.


Problem 8.4

According to Chebyshev’s inequality, at least what percentage of Australian cities in sun see between 200 and 300 sunshine hours in February?

Answer: 91%

First, we need to find the number of standard deviations above the mean 300 is, and the number of standard deviations below the mean 200 is.

z = \frac{300 - 250}{15} = \frac{50}{15} = \frac{10}{3}

The above equation tells us that 300 is \frac{10}{3} standard deviations above the mean; you can verify that 200 is the same number of standard deviations below the mean. Chebyshev’s inequality tells us the proportion of values within z SDs of the mean is at least 1 - \frac{1}{z^2}, which here is:

1 - \frac{1}{\left(\frac{10}{3}\right)^2} = 1 - \frac{9}{100} = 0.91
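
As a quick check of this calculation, here is a minimal sketch of Chebyshev's bound for an interval that is symmetric about the mean. The function name and its parameters are hypothetical, chosen for this example.

```python
def chebyshev_lower_bound(mean, sd, low, high):
    """Chebyshev's lower bound on the proportion of values in [low, high].

    Assumes the interval is symmetric about the mean, as in Problem 8.4.
    """
    z = (high - mean) / sd
    assert abs((mean - low) / sd - z) < 1e-9, "interval must be symmetric about the mean"
    return 1 - 1 / z ** 2

bound = chebyshev_lower_bound(mean=250, sd=15, low=200, high=300)
print(round(bound, 2))   # 0.91
```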


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 43%.



Source: sp24-final — Q5

Problem 9

You are given the following information about security deposits for a sample of 400 apartments in the Mission Hills neighborhood of San Diego:

Using the fact that scipy.stats.norm.cdf(-0.8) evaluates to about 0.21, construct a 58% confidence interval for the mean security deposit of all Mission Hills apartments. Below, give the endpoints of your confidence interval, both as integers.

Left endpoint: ____(a)____
Right endpoint: ____(b)____

Answer:

  • (a) 2280
  • (b) 2320

scipy.stats.norm.cdf(-0.8) tells us that the area under the standard normal curve over (-\infty, -0.8] is about 0.21. By the symmetry of the normal distribution, the area over [0.8, \infty) is also about 0.21. This means that the area over the interval [-0.8, 0.8] is 1 - 0.21 - 0.21 = 0.58, which matches the 58% confidence level we are aiming for.

In the question, we are given the standard deviation of security deposits in the sample. A confidence interval for the population mean is built from the standard deviation of the distribution of possible sample means, which the Central Limit Theorem lets us estimate as:

\frac{\text{SD of sample}}{\sqrt{\text{sample size}}} = \frac{500}{\sqrt{400}} = \frac{500}{20} = 25.

Now that we have the standard deviation of the distribution of possible sample means, we can calculate the endpoints of the confidence interval, which sit 0.8 = \frac{4}{5} of these standard deviations on either side of the sample mean of 2300.

Left endpoint: 2300 - \frac{4}{5} \cdot 25 = 2280

Right endpoint: 2300 + \frac{4}{5} \cdot 25 = 2320
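
The same computation can be sketched in Python. The sample mean (2300), sample SD (500), and sample size (400) are the values used in the solution above; `NormalDist` from the standard library stands in for `scipy.stats.norm` here, which is a substitution made only so the sketch has no dependencies.

```python
from math import sqrt
from statistics import NormalDist

sample_mean, sample_sd, n = 2300, 500, 400   # values used in the solution above

z = 0.8
# Area within z SDs of the center of a normal curve; close to the 58% target.
level = NormalDist().cdf(z) - NormalDist().cdf(-z)

se = sample_sd / sqrt(n)                     # SD of the distribution of possible sample means
left, right = sample_mean - z * se, sample_mean + z * se

print(round(level, 2), left, right)
```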


Difficulty: ⭐️⭐️⭐️⭐️⭐️

The average score on this problem was 29%.


Source: su24-final — Q4

Problem 10

Below is a density histogram representing the distribution of randomly sampled stage distances.


Problem 10.1

Which statement below correctly describes the relationship between the mean and the median of the sampled stage distances?

Answer: The mean is approximately equal to the median.

  • The histogram appears to be approximately symmetric, with the peak near the center of the distribution.
  • For symmetric distributions, the mean and the median are approximately equal because the data is evenly distributed around the central point.
  • If the distribution were skewed:
    • A right-skewed distribution would have the mean significantly larger than the median.
    • A left-skewed distribution would have the mean significantly smaller than the median.
  • In this case, there is no visible skew, so the correct answer is that the mean is approximately equal to the median.

Difficulty: ⭐️⭐️⭐️

The average score on this problem was 55%.



Problem 10.2

Assume there are 100 stages in the random sample that generated this plot. If there are 5 stages in the bin [275, 300), approximately how many stages are in the bin [200, 225)?

Answer: 35 = 5\cdot7

  • The height of the bin [200, 225) on the density histogram is approximately 7 times the height of the bin [275, 300).
  • Since both bins have the same width (25), the number of stages in a bin is proportional to the bin’s height, so the number of stages in [200, 225) is 35 = 5\cdot7.

Difficulty: ⭐️⭐️

The average score on this problem was 78%.


Problem 10.3

Assume the mean distance is 200 km and the standard deviation is 50 km. At least what proportion of stage distances are guaranteed to lie between 0 km and 400 km? Do not simplify your answer.

Answer: \frac{15}{16}

Using Chebyshev’s inequality, we know that at least 1 - \frac{1}{z^2} of the data lies within z SDs of the mean. Here, 0 km and 400 km are each \frac{200}{50} = 4 SDs from the mean of 200 km, so z = 4 and at least 1 - \frac{1}{16} = \frac{15}{16} of the data lies in that range.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 73%.


Problem 10.4

Again, assume the mean stage distance is 200 km and the standard deviation is 50 km. Now, suppose we take a random sample of size 25 from the stage distances, calculate the mean stage distance of this sample, and repeat this process 500 times. What proportion of the means that we calculate will fall between 190 km and 210 km? Do not simplify your answer.

Answer: 68%

We know that about 68% of values lie within 1 standard deviation of the mean of any normal distribution. By the Central Limit Theorem, the distribution of means of samples of size 25 from this dataset is roughly normal with mean 200 km and SD \frac{50}{\sqrt{25}} = 10 km, so the range from 190 km to 210 km (one SD on either side of 200 km) contains about 68% of the sample means.
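
A short sketch can confirm the 68% figure under the normal approximation. It assumes, per the Central Limit Theorem, that the sample means follow a normal curve exactly; `NormalDist` is from Python's standard library.

```python
from math import sqrt
from statistics import NormalDist

pop_mean, pop_sd, n = 200, 50, 25

# Central Limit Theorem: sample means are roughly normal with this center and spread.
sample_means = NormalDist(mu=pop_mean, sigma=pop_sd / sqrt(n))

proportion = sample_means.cdf(210) - sample_means.cdf(190)
print(round(proportion, 2))   # 0.68
```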


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 55%.


Problem 10.5

Assume the mean distance is 200 km and the standard deviation is 50 km. Suppose we use the Central Limit Theorem to generate a 95% confidence interval for the true mean distance of all Tour de France stages, and get the interval [190\text{ km}, 210\text{ km}]. Which of the following interpretations of this confidence interval are correct?

Answer: Option 3, Option 4, and Option 7

Option 1:
Incorrect. Confidence intervals describe the uncertainty in estimating the population mean, not the proportion of data points. A 95% confidence interval does not imply that 95% of individual stage distances fall between 190 km and 210 km.

Option 2:
Incorrect. Confidence intervals are based on the sampling process, not probability. Once the interval is calculated, the true mean is either inside or outside the interval. We cannot assign a probability to this.

Option 3:
Correct. This is the standard interpretation of confidence intervals: “We are 95% confident that the true mean lies within the interval.”

Option 4:
Correct. Given a sample size of 100 and a standard deviation of 50, the confidence interval [190, 210] is consistent with the rule of thumb that a 95% confidence interval extends approximately 2 standard deviations of the sample mean’s distribution on either side of the sample mean.

For a 95% confidence interval, the range can be approximated as:

\left[\text{sample mean} - 2\cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}}, \text{sample mean} + 2\cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}} \right]

Substituting the given values:

  • Sample mean = 200
  • Sample standard deviation = 50
  • Sample size = 100

\text{CI} = \left[ 200 - 2\cdot \frac{50}{\sqrt{100}}, 200 + 2\cdot \frac{50}{\sqrt{100}} \right]

Simplify:

\text{CI} = \left[ 200 - 2\cdot 5, 200 + 2\cdot 5 \right]

\text{CI} = [190, 210]

Option 5:
Incorrect. Refer to the explanation for Option 4.

Option 6:
Incorrect. The wording “exactly 95%” is overly precise. In practice, confidence intervals are based on the sampling process, and we use “approximately” or “roughly” 95%.

Option 7:
Correct. By definition of a confidence interval, if we repeatedly sampled and constructed 95% confidence intervals, roughly 95% of them would contain the true mean.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 72%.


Problem 10.6

Suppose we take 500 random samples of size 100 from the stage distances, calculate their means, and draw a histogram of the distribution of these sample means. We label this Histogram A. Then, we take 500 random samples of size 1000 from the stage distances, calculate their means, and draw a histogram of the distribution of these sample means. We label this Histogram B. Fill in the blanks so that the sentence below correctly describes how Histogram B looks in comparison to Histogram A.

“Relative to Histogram A, Histogram B would appear __(i)__ and shifted __(ii)__ due to the __(iii)__ mean and the __(iv)__ standard deviation.”

(i):

(ii):

(iii):

(iv):

Answer:

  • (i): thinner
    • Histogram B would appear thinner because larger sample sizes reduce the variability of the sample means. With a sample size of 1000 (compared to 100 for Histogram A), the standard error decreases, leading to a narrower distribution.
  • (ii): not at all
    • Histogram B would not shift left or right because the center of the distribution of sample means does not depend on the sample size. Both histograms are centered at the same value, as they are based on the same population.
  • (iii): unchanged
    • The mean remains unchanged because the mean of the sampling distribution (the population mean) does not depend on sample size.
  • (iv): smaller
    • The standard deviation of Histogram B is smaller because \text{SD of Distribution of Possible Sample Means} = \frac{\text{Population SD}}{\sqrt{\text{sample size}}}
    Increasing the sample size from 100 to 1000 makes the denominator larger while the population standard deviation remains unchanged, leading to a smaller standard deviation for the distribution of sample means.
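
The comparison between the two histograms can also be explored with a small simulation. Everything about the data here is an assumption made for illustration: the population is a hypothetical list of 20,000 simulated distances with mean near 200 and SD near 50, and samples are drawn with replacement.

```python
import random
from statistics import fmean, pstdev

rng = random.Random(10)   # fixed seed so the sketch is repeatable

# Hypothetical population of stage distances (mean near 200, SD near 50).
population = [rng.gauss(200, 50) for _ in range(20_000)]

def sample_means(sample_size, repetitions=500):
    """Means of `repetitions` random samples drawn with replacement."""
    return [fmean(rng.choices(population, k=sample_size)) for _ in range(repetitions)]

means_a = sample_means(100)    # Histogram A
means_b = sample_means(1000)   # Histogram B

print(round(fmean(means_a)), round(fmean(means_b)))           # both near 200
print(round(pstdev(means_a), 2), round(pstdev(means_b), 2))   # B's spread is smaller
```

The two centers should agree closely, while the spread for samples of size 1000 should be roughly \frac{1}{\sqrt{10}} of the spread for samples of size 100.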

Difficulty: ⭐️⭐️

The average score on this problem was 79%.



Problem 11

Rank these three students in ascending order of their exam performance relative to their classmates.

Answer: Vivek, Hector, Clara

To compare Vivek, Hector, and Clara’s relative performance, we convert each score to standard units (a Z score), which measures how many standard deviations a score is above or below the class mean. For Vivek, the Z score is (83-75) / 6 = 4/3. For Hector, it is (77-70) / 5 = 7/5. For Clara, it is (80-75) / 3 = 5/3. Since 4/3 < 7/5 < 5/3, the ascending order is Vivek, Hector, Clara.
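
The ranking can be reproduced with a few lines of Python. The scores, class means, and class SDs below are the ones that appear in the solution above; the dictionary layout and the helper name `z_score` are choices made for this sketch.

```python
# (score, class mean, class SD) for each student, taken from the solution above.
students = {
    "Vivek": (83, 75, 6),
    "Hector": (77, 70, 5),
    "Clara": (80, 75, 3),
}

def z_score(score, mean, sd):
    """Number of SDs the score is above (+) or below (-) the class mean."""
    return (score - mean) / sd

ranking = sorted(students, key=lambda name: z_score(*students[name]))
print(ranking)   # ['Vivek', 'Hector', 'Clara']
```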


Difficulty: ⭐️⭐️

The average score on this problem was 76%.


Problem 12


Problem 12.1

Recall that the mean points per game is 7, with a standard deviation of 5. Also note that for all players, points per game must be greater than or equal to 0.

Using Chebyshev’s inequality, we find that at least p\% of players scored 25 or fewer points per game.

What is the value of p? Give your answer as a number between 0 and 100, rounded to 3 decimal places.

Answer: 92.284\%

Recall, Chebyshev’s inequality states that the proportion of values within z standard deviations of the mean is at least 1 - \frac{1}{z^2}.

To approach the problem, we’ll start by converting 25 points per game to standard units. Doing so yields \frac{25 - 7}{5} = 3.6. This means that 25 is 3.6 standard deviations above the mean. The value 3.6 standard deviations below the mean is 7 - 3.6 \cdot 5 = -11, so when we use Chebyshev’s inequality with z = 3.6, we will get a lower bound on the proportion of values between -11 and 25. However, as the question tells us, points per game must be non-negative, so in this case the proportion of values between -11 and 25 is the same as the proportion of values between 0 and 25 (i.e. the proportion of values less than or equal to 25).

When z = 3.6, we have 1 - \frac{1}{z^2} = 1 - \frac{1}{3.6^2} = 0.922839, which as a percentage rounded to three decimal places is 92.284\%. Thus, at least 92.284\% scored 25 or fewer points per game.


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 46%.


Problem 12.2

Note: This problem is out of scope; it covers material no longer included in the course.

Note: This question uses the mathematical definition of percentile, not np.percentile.

The array aces defined below contains the points per game scored by all members of the Las Vegas Aces. Note that it contains 14 numbers that are in sorted order.

aces = np.array([0, 0, 1.05, 1.47, 1.96, 2, 3.25, 
                 10.53, 11.09, 11.62, 12.19, 
                 14.24, 14.81, 18.25])

As we saw in lab, percentiles are not unique. For instance, the number 1.05 is both the 15th percentile and 16th percentile of aces.

There is a positive integer q, between 0 and 100, such that 14.24 is the qth percentile of aces, but 14.81 is the (q+1)th percentile of aces.

What is the value of q? Give your answer as an integer between 0 and 100.

Answer: 85

For reference, recall that we find the pth percentile of a collection of n numbers as follows:

  1. Sort the collection in increasing order.
  2. Define h to be p\% of n:

h = \frac{p}{100} \cdot n

  3. If h is an integer, define k = h. Otherwise, let k be the smallest integer greater than h.

  4. Take the kth element of the sorted collection (start counting from 1, not 0).


To start, it’s worth emphasizing that there are n = 14 numbers in aces total. 14.24 is at position 12 (when the positions are numbered 1 through 14).

Let’s try and find a value of p such that 14.24 is the pth percentile. To do so, we might try and find what “percentage” of the way through the distribution 14.24 is; doing so gives \frac{12}{14} \approx 85.71\%. If we follow the process outlined above with p = 85, we get that h = \frac{85}{100} \cdot 14 = 11.9 and thus k = 12, meaning that the 85th percentile is the number at position 12, which is 14.24.

Let’s see what happens when we try the same process with p = 86. This time, we have h = \frac{86}{100} \cdot 14 = 12.04 and thus k = 13, meaning that the 86th percentile is the number at position 13, which is 14.81.

This means that the value of q is 85 – the 85th percentile is 14.24, while the 86th percentile is 14.81.
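
The two positions computed above can be double-checked with a short sketch. The helper name `position` is hypothetical, and it uses integer arithmetic for the "smallest integer greater than or equal to h" step so that floating-point rounding cannot affect k.

```python
aces = [0, 0, 1.05, 1.47, 1.96, 2, 3.25,
        10.53, 11.09, 11.62, 12.19,
        14.24, 14.81, 18.25]            # already sorted

def position(p, n):
    """Position k for the p-th percentile: the smallest integer >= (p / 100) * n."""
    return -(-p * n // 100)             # integer ceiling, avoids floating-point error

for p in (85, 86):
    k = position(p, len(aces))
    print(p, k, aces[k - 1])
# 85 12 14.24
# 86 13 14.81
```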


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 57%.



Problem 13

In addition to the plum DataFrame, we also have access to the season DataFrame, which contains statistics on all players in the WNBA in the 2021 season. The first few rows of season are shown below. (Remember to keep the data description from the top of the exam open in another tab!)

Each row in season corresponds to a single player. For each player, we have:

  • 'Player' (str), their name
  • 'Team' (str), the three-letter code of the team they play on
  • 'G' (int), the number of games they played in the 2021 season
  • 'PPG' (float), the number of points they scored per game played
  • 'APG' (float), the number of assists (passes) they made per game played
  • 'TPG' (float), the number of turnovers they made per game played

Note that all of the numerical columns in season must contain values that are greater than or equal to 0.


Problem 13.1

Which of the following is the best choice for the index of season?

Answer: 'Player'

Ideally, the index of a DataFrame is unique, so that we can use it to “identify” the rows. Here, each row is about a player, so 'Player' should be the index. 'Player' is the only column that is likely to be unique; it is possible that two players have the same name, but it’s still a better choice of index than the other three options, which are definitely not unique.


Difficulty: ⭐️

The average score on this problem was 95%.



Problem 14

In addition to the plum DataFrame, we also have access to the season DataFrame, which contains statistics on all players in the WNBA in the 2021 season. The first few rows of season are shown below. (Remember to keep the data description from the top of the exam open in another tab!)

Each row in season corresponds to a single player. For each player, we have:

  • 'Player' (str), their name
  • 'Team' (str), the three-letter code of the team they play on
  • 'G' (int), the number of games they played in the 2021 season
  • 'PPG' (float), the number of points they scored per game played
  • 'APG' (float), the number of assists (passes) they made per game played
  • 'TPG' (float), the number of turnovers they made per game played

Note that all of the numerical columns in season must contain values that are greater than or equal to 0.


Problem 14.1

Which of the following is the best choice for the index of season?

Answer: 'Player'

Ideally, the index of a DataFrame is unique, so that we can use it to “identify” the rows. Here, each row is about a player, so 'Player' should be the index. 'Player' is the only column that is likely to be unique; it is possible that two players have the same name, but it’s still a better choice of index than the other three options, which are definitely not unique.


Difficulty: ⭐️

The average score on this problem was 95%.


Problem 14.2

Note: For the rest of the exam, assume that the index of season is still 0, 1, 2, 3, …

Below is a histogram showing the distribution of the number of turnovers per game for all players in season.

Suppose, throughout this question, that the mean number of turnovers per game is 1.25. Which of the following is closest to the median number of turnovers per game?

Answer: 1

The median of a distribution is the value that is “halfway” through the distribution, i.e. the value such that half of the values in the distribution are larger than it and half the values in the distribution are smaller than it.

Visually, we’re looking for the location on the x-axis where we can draw a vertical line that splits the area of the histogram in half. While it’s impossible to tell the exact median of the distribution, since we don’t know how the values are distributed within the bars, we can get pretty close by using this principle.

Immediately, we can rule out 0.5, 0.75, 1.5, and 1.75, since they are too far from the “center” of the distribution (imagine drawing vertical lines at any of those points on the x-axis; they don’t split the distribution’s area in half). To decide between 1 and 1.25, we can use the fact that the distribution is right-skewed, meaning that its mean is larger than its median (intuitively, the mean is dragged in the direction of the tail, which is to the right). This means that the median should be less than the mean. We are given that the mean of the distribution is 1.25, so the median should be 1.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 73%.


Problem 14.3

Sabrina Ionescu and Sami Whitcomb are both players on the New York Liberty, and are both California natives.

In “original units”, Sabrina Ionescu had 3.5 turnovers per game. In standard units, her turnovers per game is 3.

In standard units, Sami Whitcomb’s turnovers per game is -1. How many turnovers per game did Sami Whitcomb have in original units? Round your answer to 3 decimal places.

Note: You will need the fact from the previous subpart that the mean number of turnovers per game is 1.25.

Answer: 0.5

To convert a value x to standard units (denoted by x_{\text{su}}), we use the following formula:

x_{\text{su}} = \frac{x - \text{mean of }x}{\text{SD of }x}

Let’s look at the first line given to us: In “original units”, Sabrina Ionescu had 3.5 turnovers per game. In standard units, her turnovers per game is 3.

Substituting the information we know into the above equation gives us:

3 = \frac{3.5 - 1.25}{\text{SD of }x}

In order to convert future values from original units to standard units, we’ll need to know \text{SD of }x, which we don’t currently but can obtain by rearranging the above equation. Doing so yields

\text{SD of }x = \frac{3.5-1.25}{3} = \frac{2.25}{3} = 0.75

Now, let’s look at the second line we’re given: In standard units, Sami Whitcomb’s turnovers per game is -1. How many turnovers per game did Sami Whitcomb have in original units? Round your answer to 3 decimal places.

We have all the information we need to convert Sami Whitcomb’s turnovers per game from standard units to original units! Plugging in the values we know gives us:

\begin{aligned} x_{\text{su}} &= \frac{x - \text{mean of }x}{\text{SD of }x} \\ -1 &= \frac{x - 1.25}{0.75} \\ -0.75 &= x - 1.25 \\ 1.25 - 0.75 &= x \\ x &= \boxed{0.5} \end{aligned}

Thus, in original units, Sami Whitcomb averaged 0.5 turnovers per game.


Difficulty: ⭐️⭐️

The average score on this problem was 87%.


Problem 14.4

What is the smallest possible number of turnovers per game, in standard units? Round your answer to 3 decimal places.

Answer: -1.667

The smallest possible number of turnovers per game in original units is 0 (which a player would have if they never had a turnover – that would mean they’re really good!). To find the smallest possible turnovers per game in standard units, all we need to do is convert 0 from original units to standard units. This will involve our work from the previous subpart.

\begin{aligned} x_{\text{su}} &= \frac{x - \text{mean of }x}{\text{SD of }x} \\ &= \frac{0 - 1.25}{0.75} \\ &= -\frac{1.25}{0.75} \\ &= -\frac{5}{3} = \boxed{-1.667} \end{aligned}
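
The conversions in Problems 14.3 and 14.4 can be verified with a small sketch. The variable and function names (`mean_tpg`, `sd_tpg`, `to_original`, `to_standard`) are made up for this example; the numbers come from the problem statements.

```python
mean_tpg = 1.25

# Sabrina Ionescu: 3.5 in original units is 3 in standard units, which pins down the SD.
sd_tpg = (3.5 - mean_tpg) / 3

def to_original(su):
    """Standard units -> original units."""
    return mean_tpg + su * sd_tpg

def to_standard(x):
    """Original units -> standard units."""
    return (x - mean_tpg) / sd_tpg

print(sd_tpg)                     # 0.75
print(to_original(-1))            # 0.5
print(round(to_standard(0), 3))   # -1.667
```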


Difficulty: ⭐️⭐️

The average score on this problem was 82%.



Source: wi24-final — Q14

Problem 15


Problem 15.1

Suppose that in Olympic ski jumping, ski jumpers jump off of a ramp that’s shaped like a portion of a normal curve. Drawn from left to right, a full normal curve has an inflection point on the ascent, then a peak, then another inflection point on the descent. A ski jump ramp stops at the point that is one third of the way between the inflection point on the ascent and the peak, measured horizontally. Below is an example ski jump ramp, along with the normal curve that generated it.

Fill in the blank below so that the expression evaluates to the area of a ski jump ramp, if the area under the normal curve that generated it is 1.

    from scipy import stats
    stats.norm.cdf(______)

What goes in the blank?

Answer: -2/3

We know that the normal distribution is symmetric about the mean, and that the mean is the “peak” described in the graph. The inflection points occur one standard deviation above and below the mean (the peak), so a point which is one third of the way in between the first inflection point and the peak is -(1-\frac{1}{3}) = -\frac{2}{3} standard deviations from the mean. We can then use stats.norm.cdf(-2/3) to calculate the area under the curve to the left of this point.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 51%.


Problem 15.2

Suppose that in Olympic downhill skiing, skiers compete on mountains shaped like normal distributions with mean 50 and standard deviation 8. Skiers start at the peak and ski down the right side of the mountain, so their x-coordinate increases.

Keenan is an Olympic downhill skier, but he’s only been able to practice on a mountain shaped like a normal distribution with mean 65 and standard deviation 12. In his practice, Keenan always crouches down low when he reaches the point where his x-coordinate is 92, which helps him ski faster. When he competes at the Olympics, at what x-coordinate should he crouch down low, corresponding to the same relative location on the mountain?

Answer: 68

Since we know that both slopes are normal distributions (just scaled and shifted), we can derive this answer by writing Keenan’s crouch point in terms of standard deviations from the mean. He typically crouches at an x-coordinate of 92, whose distance from the mean (in standard deviations) is given by \frac{92 - 65}{12} = 2.25. So, all we need to do is find the x-coordinate that is 2.25 standard deviations above the mean on the Olympic mountain. This is given by 50 + (2.25 * 8) = 68.
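The conversion above can be sketched in a few lines of Python. The helper names `to_standard_units` and `from_standard_units` are hypothetical, chosen here only for illustration.

```python
def to_standard_units(x, mean, sd):
    """Number of SDs that x lies above (positive) or below (negative) the mean."""
    return (x - mean) / sd


def from_standard_units(z, mean, sd):
    """Value that lies z SDs from the mean of a distribution."""
    return mean + z * sd


# Keenan's crouch point on his practice mountain (mean 65, SD 12).
z = to_standard_units(92, mean=65, sd=12)

# Same relative location on the Olympic mountain (mean 50, SD 8).
olympic_x = from_standard_units(z, mean=50, sd=8)

print(z, olympic_x)  # 2.25 68.0
```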


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 72%.


Problem 15.3

Aaron is another Olympic downhill skier. When he competes on the normal curve mountain with mean 50 and standard deviation 8, he crouches down low when his x-coordinate is 54. If the total area of the mountain is 1, approximately how much of the mountain’s area is ahead of Aaron at the moment he crouches down low?

Answer: 0.3

We know that when Aaron reaches the mean (50), exactly 0.5 of the mountain’s area is behind him, since the mean and median are equal for normal distributions like this one. We also see that 54 is one half of a standard deviation above the mean, since \frac{54 - 50}{8} = 0.5. So, all we have to do is find out what proportion of the area lies between the mean and half a standard deviation above it. Using the 68-95-99.7 rule, we know that 68% of the values lie within one standard deviation of the mean, counting both sides. This means 34% of the values are within one standard deviation on one side, and at least 17% are within half a standard deviation on one side. Since the total area is 1, this corresponds to an area of at least 0.17. So, by the time Aaron reaches an x-coordinate of 54, about 0.5 + 0.17 = 0.67 of the mountain is behind him. From here, we calculate the area in front as 1 - 0.67 = 0.33, so we conclude that approximately 0.3 of the area is in front of Aaron.

Note: As a clarification, the 0.17 is an estimate, specifically an underestimate, due to the shape of the normal distribution. The area under a normal curve is not proportional to the number of standard deviations you are from the mean: the curve is tallest near the mean, so the half standard deviation closest to the mean holds more than half of the 0.34.
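To see how close the 68-95-99.7 estimate is, here is a small sketch comparing it with the area computed from the normal CDF. The variable names are illustrative; the problem itself only asks for the approximate answer.

```python
from scipy import stats

# Aaron's crouch point in standard units (mean 50, SD 8).
z = (54 - 50) / 8  # 0.5

# Area ahead of Aaron, computed from the standard normal CDF.
area_ahead = 1 - stats.norm.cdf(z)

# Rough estimate from the explanation: half of the one-sided 0.34.
rough_estimate = 1 - (0.5 + 0.34 / 2)

# Area between the mean and half an SD above it, for comparison with 0.17.
half_sd_area = stats.norm.cdf(z) - 0.5

print(area_ahead, rough_estimate, half_sd_area)
```

The CDF-based area ahead comes out slightly below the rough 0.33, consistent with 0.17 being an underestimate, and both round to 0.3.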


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 50%.



Source: wi25-final — Q4

Problem 16

Beneath Gringotts Wizarding Bank, enchanted mine carts transport wizards through a complex underground railway on the way to their bank vault.

During one section of the journey to Harry’s vault, the track follows the shape of a normal curve, with a peak at x = 50 and a standard deviation of 20.


Problem 16.1

A ferocious dragon, who lives under this section of the railway, is equally likely to be located anywhere within this region. What is the probability that the dragon is located in a position with x \leq 10 or x \geq 80? Select all that apply.

Answer: 1 - (scipy.stats.norm.cdf(1.5) - scipy.stats.norm.cdf(-2)) & scipy.stats.norm.cdf(-2) + scipy.stats.norm.cdf(-1.5)

  • Option 1: This code calculates the probability that a value lies outside the range between z = -2 and z = 1.5, which corresponds to x \leq 10 or x \geq 80. This is done by subtracting the area under the normal curve between -2 and 1.5 from 1. This is correct because it accurately captures the combined probability in the left and right tails of the distribution.

  • Option 2: This code multiplies the cumulative distribution function (CDF) at z = 1.75 by 2. This assumes symmetry around the mean and is used for intervals like |z| \geq 1.75, but that’s not what we want. The correct z-values for this problem are -2 and 1.5, so this option is incorrect.

  • Option 3: This code adds the probability of z \leq -2 and z \geq 1.5, using the fact that P(z \geq 1.5) = P(z \leq -1.5) by symmetry. So, while the code appears to show both as left-tail calculations, it actually produces the correct total tail probability. This option is correct.

  • Option 4: This is a static value with no basis in the z-scores of -2 and 1.5. It’s likely meant as a distractor and does not represent the correct probability for the specified conditions. This option is incorrect.
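As a hedged check of the two correct options, the sketch below converts the endpoints to standard units and evaluates both expressions. It uses `from scipy import stats` rather than the full `scipy.stats` path, and the names `option_1` and `option_3` are only labels for this illustration.

```python
from scipy import stats

mean, sd = 50, 20

# Convert the endpoints x = 10 and x = 80 to standard units.
z_low = (10 - mean) / sd   # -2.0
z_high = (80 - mean) / sd  # 1.5

# Option 1: one minus the area between z = -2 and z = 1.5.
option_1 = 1 - (stats.norm.cdf(z_high) - stats.norm.cdf(z_low))

# Option 3: left tail below -2 plus, by symmetry, the left tail below -1.5.
option_3 = stats.norm.cdf(-2) + stats.norm.cdf(-1.5)

print(option_1, option_3)
```

Both expressions give the same value, a little under 0.09.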


Problem 16.2

Harry wants to know where, in this section of the track, the cart’s height is changing the fastest. He knows from his earlier public school education that the height changes the fastest at the inflection points of a normal distribution. Where are the inflection points in this section of the track?

Answer: x = 30 and x = 70

Recall that the inflection points of a normal distribution are located one standard deviation away from the mean. In this problem, the mean is x = 50 and the standard deviation is 20, so the inflection points occur at x = 30 and x = 70. These are the points where the curve changes concavity and where the height is changing the fastest. Therefore, the correct answer is x = 30 and x = 70.
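The location of the inflection points can also be checked numerically. The following is a minimal sketch, assuming a fine grid of x-values and a finite-difference slope from `np.gradient`; the grid bounds and step size are arbitrary choices made here, not part of the problem.

```python
import numpy as np
from scipy import stats

# Grid of x-values with step 0.01, covering 5 SDs on each side of the mean.
x = np.linspace(-50, 150, 20001)

# Height of the track: a normal curve with mean 50 and SD 20.
height = stats.norm.pdf(x, loc=50, scale=20)

# Approximate slope of the track at each grid point.
slope = np.gradient(height, x)

# The height changes fastest where the slope is most positive or most negative.
steepest_ascent = x[np.argmax(slope)]
steepest_descent = x[np.argmin(slope)]

print(steepest_ascent, steepest_descent)
```

Up to the grid resolution, the two values land at x = 30 and x = 70, one standard deviation on either side of the mean.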


Problem 16.3

Next, consider a different region of the track, where the shape follows some arbitrary distribution with mean 130 and standard deviation 30. We don’t have any information about the shape of the distribution, so it is not necessarily normal.

What is the minimum proportion of area under this section of the track within the range 100 \leq x \leq 190?

Answer: 0.00

We are told that the distribution is not necessarily normal. The mean is 130 and the standard deviation is 30. We’re asked for the minimum proportion of area between x = 100 and x = 190.

Since the distribution isn’t normal and we don’t know its shape, we can’t use the empirical rule (68-95-99.7) or z-scores. We might try using Chebyshev’s Inequality, but that only works for intervals that are equally far below the mean as above the mean. This interval is not like that (it’s 1 standard deviation below the mean and 2 above), so Chebyshev’s Inequality doesn’t apply to it directly. The most we can say using Chebyshev’s Inequality is that the interval from 1 standard deviation below the mean to 1 standard deviation above the mean, which lies inside our range, contains at least 1 - \frac{1}{1^2} = 0 of the data. We can’t make any additional guarantees. So, the minimum proportion of area we can guarantee is 0.00.
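The Chebyshev calculation for this interval can be sketched as follows. The function name `chebyshev_lower_bound` is hypothetical, and taking the smaller of the two distances from the mean is the assumption used here to get a symmetric range that fits inside the interval.

```python
def chebyshev_lower_bound(z):
    """Proportion guaranteed within z SDs of the mean by Chebyshev's inequality (z > 0)."""
    return max(0.0, 1 - 1 / z**2)


mean, sd = 130, 30

# Distance of each endpoint from the mean, in standard units.
z_below = (mean - 100) / sd  # 1.0
z_above = (190 - mean) / sd  # 2.0

# The largest symmetric range "mean ± z SDs" that fits inside the interval.
z_symmetric = min(z_below, z_above)

guaranteed = chebyshev_lower_bound(z_symmetric)
print(guaranteed)  # 0.0
```

For comparison, `chebyshev_lower_bound(2)` is 0.75, which would be the guarantee had the interval extended 2 standard deviations on both sides.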