Below are practice problems tagged for Lecture 13 (rendered directly from the original exam/quiz sources).
Which of the following blocks of code correctly assigns
random_art_museums to an array of the names of 10 art
museums, randomly selected without replacement from those in
art_museums? Select all that apply.
Option 1:
def get_10(df):
    return np.array(df.sample(10).get('Name'))
random_art_museums = get_10(art_museums)

Option 2:
def get_10(art_museums):
    return np.array(art_museums.sample(10).get('Name'))
random_art_museums = get_10(art_museums)

Option 3:
def get_10(art_museums):
    random_art_museums = np.array(art_museums.sample(10).get('Name'))
random_art_museums = get_10(art_museums)

Option 4:
def get_10():
    return np.array(art_museums.sample(10).get('Name'))
random_art_museums = get_10()

Option 5:
random_art_museums = np.array([])
def get_10():
    random_art_museums = np.array(art_museums.sample(10).get('Name'))
    return random_art_museums
get_10()

Option 1
Option 2
Option 3
Option 4
Option 5
None of the above
Answers: Option 1, Option 2, and Option 4
Note that if df is a DataFrame, then
df.sample(10) is a DataFrame containing 10 randomly
selected rows in df. With that in mind, let’s look at all
of our options.
Option 1: get_10 takes in a DataFrame df and returns an array containing 10 randomly selected values in df's 'Name' column. After defining get_10, we assign random_art_museums to the result of calling get_10(art_museums). This assigns random_art_museums as intended, so Option 1 is correct.
Option 2: The only difference from Option 1 is that Option 2 uses the parameter name art_museums and Option 1 uses the parameter name df (both in the def line and in the function body); this does not change the behavior of get_10 or the lines afterward, so Option 2 is also correct.
Option 3: get_10 here does not return anything! So, get_10(art_museums) evaluates to None (which means "nothing" in Python), and random_art_museums is also None, meaning Option 3 is incorrect.
Option 4: get_10 does not take in any inputs. However, the body of get_10 contains a reference to the DataFrame art_museums, which is ultimately where we want to sample from. As a result, get_10 does indeed return an array containing 10 randomly selected museum names, and random_art_museums = get_10() correctly assigns random_art_museums to this array, so Option 4 is correct.
Option 5: get_10 returns the correct array. However, outside of the function, random_art_museums is never assigned to the output of get_10. (The variable name random_art_museums inside the function has nothing to do with the array defined before and outside the function.) As a result, after running the line get_10() at the bottom of the code block, random_art_museums is still an empty array, and as such, Option 5 is incorrect.
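As a quick check of the two failure modes above (no return statement in Option 3, and an unassigned result in Option 5), here is a minimal, self-contained sketch using toy functions and data in place of get_10 and art_museums:
def no_return():
    x = 5   # assigned locally, but nothing is returned

print(no_return())   # None: a function without a return statement returns None

values = []          # plays the role of random_art_museums in Option 5
def local_only():
    values = [1, 2, 3]   # creates a new *local* name; the outer list is untouched
    return values

local_only()         # the returned value is discarded, never saved anywhere
print(values)        # still []: the outer variable was never reassigned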
The average score on this problem was 85%.
London has the most art museums in the top 100 of any city in the
world. The most visited art museum in London is
'Tate Modern'.
Which of the following blocks of code correctly assigns
best_in_london to 'Tate Modern'? Select all
that apply.
Option 1:
def most_common(df, col):
    return df.groupby(col).count().sort_values(by='Rank', ascending=False).index[0]
def most_visited(df, col, value):
    return df[df.get(col)==value].sort_values(by='Visitors', ascending=False).get('Name').iloc[0]
best_in_london = most_visited(art_museums, 'City', most_common(art_museums, 'City'))

Option 2:
def most_common(df, col):
    print(df.groupby(col).count().sort_values(by='Rank', ascending=False).index[0])
def most_visited(df, col, value):
    print(df[df.get(col)==value].sort_values(by='Visitors', ascending=False).get('Name').iloc[0])
best_in_london = most_visited(art_museums, 'City', most_common(art_museums, 'City'))

Option 3:
def most_common(df, col):
    return df.groupby(col).count().sort_values(by='Rank', ascending=False).index[0]
def most_visited(df, col, value):
    print(df[df.get(col)==value].sort_values(by='Visitors', ascending=False).get('Name').iloc[0])
best_in_london = most_visited(art_museums, 'City', most_common(art_museums, 'City'))

Option 1
Option 2
Option 3
None of the above
Answer: Option 1 only
At a glance, it may seem like there’s a lot of reading to do to
answer the question. However, it turns out that all 3 options follow
similar logic; the difference is in their use of print and
return statements. Whenever we want to “save” the output of
a function to a variable name or use it in another function, we need to
return somewhere within our function. Only Option 1
contains a return statement in both
most_common and most_visited, so it is the
only correct option.
Let’s walk through the logic of Option 1 (which we don’t necessarily need to do to answer the problem, but we should in order to enhance our understanding):
First, we use most_common to find the city with the
most art museums. most_common does this by grouping the
input DataFrame df (art_museums, in this case)
by 'City' and using the .count() method to
find the number of rows per 'City'. Note that when using
.count(), all columns in the aggregated DataFrame will
contain the same information, so it doesn’t matter which column you use
to extract the counts per group. After sorting by one of these columns
('Rank', in this case) in decreasing order,
most_common takes the first value in the
index, which will be the name of the 'City'
with the most art museums. This is London,
i.e. most_common(art_museums, 'City') evaluates to
'London' in Option 1 (in Option 2, it evaluates to
None, since most_common there doesn’t
return anything).
Then, we use most_visited to find the museum with the
most visitors in the city with the most museums. This is achieved by
keeping only the rows of the input DataFrame df (again,
art_museums in this case) where the value in the
col ('City') column is value
(most_common(art_museums, 'City'), or
'London'). Now that we only have information for museums in
London, we can sort by 'Visitors' to find the most visited
such museum, and take the first value from the resulting
'Name' column. While all 3 options follow this logic, only
Option 1 returns the desired value, and so only Option
1 assigns best_in_london correctly. (Even if Option 2’s
most_visited used return instead of
print, it still wouldn’t work, since Option 2’s
most_common also uses print instead of
return).
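To make this walkthrough concrete, here is a sketch with a made-up three-row stand-in for art_museums; the visitor counts and ranks below are invented for illustration only:
import babypandas as bpd

# Hypothetical miniature version of art_museums, for illustration only.
toy = bpd.DataFrame().assign(
    Name=['Tate Modern', 'National Gallery', 'Louvre'],
    City=['London', 'London', 'Paris'],
    Rank=[1, 2, 3],
    Visitors=[5800000, 5200000, 9600000])

def most_common(df, col):
    return df.groupby(col).count().sort_values(by='Rank', ascending=False).index[0]

def most_visited(df, col, value):
    return df[df.get(col) == value].sort_values(by='Visitors', ascending=False).get('Name').iloc[0]

most_common(toy, 'City')                              # 'London' (2 museums vs. 1)
most_visited(toy, 'City', most_common(toy, 'City'))   # 'Tate Modern'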
The average score on this problem was 86%.
Since txn has 140,000 rows, Jack wants to get a quick
glimpse at the data by looking at a simple random sample of 10 rows from
txn. He defines the DataFrame ten_txns as
follows:
ten_txns = txn.sample(10, replace=False)
Which of the following code blocks also assign ten_txns
to a simple random sample of 10 rows from txn?
Option 1:
all_rows = np.arange(txn.shape[0])
perm = np.random.permutation(all_rows)
positions = np.random.choice(perm, size=10, replace=False)
ten_txns = txn.take(positions)

Option 2:
all_rows = np.arange(txn.shape[0])
choice = np.random.choice(all_rows, size=10, replace=False)
positions = np.random.permutation(choice)
ten_txns = txn.take(positions)

Option 3:
all_rows = np.arange(txn.shape[0])
positions = np.random.permutation(all_rows).take(np.arange(10))
ten_txns = txn.take(positions)

Option 4:
all_rows = np.arange(txn.shape[0])
positions = np.random.permutation(all_rows.take(np.arange(10)))
ten_txns = txn.take(positions)

Select all that apply.
Option 1
Option 2
Option 3
Option 4
None of the above.
Answer: Option 1, Option 2, and Option 3.
Let’s consider each option.
Option 1: First, all_rows is defined as an array containing the integer positions of all the rows in the DataFrame. Then, we randomly shuffle the elements of this array and store the result in the array perm. Finally, we select 10 integers randomly (without replacement), and use .take() to select the rows from the DataFrame with the corresponding integer locations. In other words, we are randomly selecting ten row positions and taking the rows at those randomly selected positions. This gives a simple random sample of 10 rows from the DataFrame txn, so Option 1 is correct.
Option 2: Option 2 is similar to Option 1, except that the order of the np.random.choice and np.random.permutation operations is switched. This doesn't affect the output, since the choice we made was, by definition, random. Therefore, it doesn't matter whether we shuffle the rows before or after (or not at all), since the most this will do is change the order of a sample which was already randomly selected. So, Option 2 is correct.
Option 3: Here, we randomly shuffle the elements of all_rows, and then we select the first 10 elements with .take(). Since the shuffling of all_rows was random, we don't know which elements are in the first 10 positions of the new shuffled array (in other words, the first 10 elements are random). So, when we select the rows of txn with the corresponding integer locations in the next step, we've simply selected 10 rows at random integer locations. Therefore, this is a valid simple random sample from txn, and Option 3 is correct.
Option 4: The difference between this option and Option 3 is the order in which np.random.permutation and .take() are executed. Here, we select the first 10 elements before permuting (inside the parentheses). As a result, the array we're shuffling with np.random.permutation does not include all the integer locations like all_rows does; it's simply the first ten elements. Therefore, this code produces a random shuffling of the first 10 rows of txn, which is not a random sample, so Option 4 is incorrect.
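To see the contrast between Options 3 and 4 directly, here is a small sketch using a 20-element stand-in for txn's row positions (the size 20 is an arbitrary choice):
import numpy as np

all_rows = np.arange(20)   # stand-in for np.arange(txn.shape[0]), shrunk for readability
positions = np.random.permutation(all_rows).take(np.arange(10))
print(positions)           # 10 distinct random positions: Option 3's approach

# Option 4's mistake: .take() runs first, so we only ever shuffle positions 0 through 9.
print(np.random.permutation(all_rows.take(np.arange(10))))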
The average score on this problem was 82%.
We want to use app_data to estimate the average amount
of time it takes to build an IKEA bed (any product in the
'bed' category). Which of the following strategies would be
an appropriate way to estimate this quantity? Select all that apply.
Query to keep only the beds. Then resample with replacement many
times. For each resample, take the mean of the 'minutes'
column. Compute a 95% confidence interval based on those means.
Query to keep only the beds. Group by 'product' using
the mean aggregation function. Then resample with replacement many
times. For each resample, take the mean of the 'minutes'
column. Compute a 95% confidence interval based on those means.
Resample with replacement many times. For each resample, first query
to keep only the beds and then take the mean of the
'minutes' column. Compute a 95% confidence interval based
on those means.
Resample with replacement many times. For each resample, first query
to keep only the beds. Then group by 'product' using the
mean aggregation function, and finally take the mean of the
'minutes' column. Compute a 95% confidence interval based
on those means.
Answer: Option 1
Only the first answer is correct. This is a question of parameter estimation, so our approach is to use bootstrapping to create many resamples of our original sample, computing the average of each resample. Each resample should always be the same size as the original sample. The first answer choice accomplishes this by querying first to keep only the beds, then resampling from the DataFrame of beds only. This means resamples will have the same size as the original sample. Each resample’s mean will be computed, so we will have many resample means from which to construct our 95% confidence interval.
In the second answer choice, we are actually taking the mean twice.
We first average the build times for all builds of the same product when
grouping by product. This produces a DataFrame of different products
with the average build time for each. We then resample from this
DataFrame, computing the average of each resample. But this is a
resample of products, not of product builds. The size of the resample is
the number of unique products in app_data, not the number
of reported product builds in app_data. Further, we get
incorrect results by averaging numbers that are already averages. For
example, if 5 people build bed A and it takes them each 1 hour, and 1
person builds bed B and it takes them 10 hours, the average amount of
time to build a bed is \frac{5 \cdot 1 + 10}{6} = 2.5 hours. But if we average
the times for bed A (1 hour) and the times for bed B (10 hours), then average
those two averages, we get \frac{1 + 10}{2} = 5.5, which is not the same.
More generally, grouping is not a part of the bootstrapping process
because we want each data value to be weighted equally.
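Here is a two-line check of the bed arithmetic above, assuming nothing beyond numpy:
import numpy as np

times = np.array([1, 1, 1, 1, 1, 10])   # five 1-hour builds of bed A, one 10-hour build of bed B
print(np.mean(times))                   # 2.5: every build weighted equally
print(np.mean([np.mean([1, 1, 1, 1, 1]), np.mean([10])]))   # 5.5: a mean of per-product means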
The last two answer choices are incorrect because they involve
resampling from the full app_data DataFrame before querying
to keep only the beds. This is incorrect because it does not preserve
the sample size. For example, if app_data contains 1000
reported bed builds and 4000 other product builds, then the only
relevant data is the 1000 bed build times, so when we resample, we want
to consider another set of 1000 beds. If we resample from the full
app_data DataFrame, our resample will contain 5000 rows,
but the number of beds will be random, not necessarily 1000. If we query
first to keep only the beds, then resample, our resample will contain
exactly 1000 beds every time. As an added bonus, since we only care
about beds, it’s much faster to resample from a smaller DataFrame of
beds only than it is to resample from all app_data with
plenty of rows we don’t care about.
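Putting the correct strategy into code, here is a sketch of the first answer choice; it assumes app_data is loaded as in the problem, and the column name 'category' is an assumption made for illustration (only 'minutes' is given in the problem):
import numpy as np

# Query first: keep only the beds, so every resample has the same size
# as the original sample of bed build times. ('category' is an assumed column name.)
beds = app_data[app_data.get('category') == 'bed']

means = np.array([])
for i in np.arange(10000):
    resample = beds.sample(beds.shape[0], replace=True)   # same size as the original sample
    means = np.append(means, resample.get('minutes').mean())

# The middle 95% of the resample means forms the confidence interval.
left = np.percentile(means, 2.5)
right = np.percentile(means, 97.5)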
The average score on this problem was 71%.
Shivani wrote a function called doggos defined as
follows:
def doggos(n, lower, upper):
    t = df.sample(n, replace=True).get('longevity')
    return sum(lower <= t < upper)

This plot shows a density histogram of the 'longevity' column.

Answer each of these questions by either writing a single number in the box or selecting “Not enough information”, but not both. What is the probability that:
doggos(1, 10, 11) == 1 is True?
Answer: 0.15
Let’s first understand the function. The function takes inputs
n, lower, and upper and randomly
takes a sample of n rows with replacement from DataFrame
df, gets column longevity from the sample and
saves it as a Series t. The n entries of
t are randomly generated according to the density histogram
shown in the picture. That is, the probability of a particular value
being generated in Series t for a given entry can be
visualized by the density histogram in the picture.
lower <= t < upper takes t and generates a Series of boolean values, either True or False depending on whether the corresponding entry in t lies within the range. And so sum(lower <= t < upper) returns the number of entries in t that lie within the range. (This is because True has a value of 1 and False has a value of 0, so summing booleans is a quick way to count how many Trues there are.)
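One note if you try this idea in standalone numpy code: the range check must combine two elementwise comparisons with &, since a chained comparison does not operate elementwise on arrays. A minimal sketch with made-up longevity values:
import numpy as np

t = np.array([9.5, 10.2, 10.8, 13.1])   # hypothetical longevity values
in_range = (10 <= t) & (t < 11)          # elementwise booleans; note the & and parentheses
print(in_range.sum())                    # 2, since True counts as 1 and False as 0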
Now part a is just asking for the probability that we'll draw a longevity value between 10 and 11, given the density plot (since n is 1, we only draw one longevity value). Note that the probability represented by a bar is given by the width of the bar multiplied by its height. Looking at the bar whose bin spans 10 to 11, we can see that the probability is just (11-10) * 0.15 = 1 * 0.15 = 0.15.
The average score on this problem was 86%.
doggos(2, 0, 12) == 2 is True?
Answer: 0.36
Part b is essentially asking us: What is the probability that after
drawing two longevity values according to the density plot,
both of them will lie in between 0 and 12?
Let’s first start by considering the probability of drawing 1
longevity value that lies between 0 and
12. This is simply the sum of the areas of the three bars spanning 6-10, 10-11, and 11-12, which is (4*0.05) + (1*0.15) + (1*0.25) = 0.6.
Now, because we draw each value independently of one another, we simply square this probability, which gives us an answer of 0.6 * 0.6 = 0.36.
The average score on this problem was 81%.
doggos(2, 13, 20) > 0 is True?
Answer: 0.19
Part c is essentially asking us: What is the probability that after drawing two longevity values according to the density plot, at least one of them will lie between 13 and 20?
While you could directly solve for this probability, a faster method is to solve the complementary problem. That is, we can solve for the probability that neither of the values lies in the given range. Once we have this, we can simply subtract our answer from one, because the only two possibilities are that at least one of the values lies in the range, or neither of them does.
Again, let's solve for the probability of drawing one longevity value that isn't in the range. Staying true to our complementary strategy, this is just 1 minus the probability of drawing a longevity value that is in the range, which is just 1 - (1*0.05 + 1*0.05) = 0.9.
Again, because we draw each value independently, squaring this probability gives us the probability that neither of our drawn values is in the range, or 0.9 * 0.9 = 0.81. Finally, subtracting this from 1 gives us our desired answer of 1 - 0.81 = 0.19.
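The same calculation as a few lines of arithmetic, with the bar heights read off the density histogram:
p_in = 1 * 0.05 + 1 * 0.05      # total area of the bars between 13 and 20
p_neither = (1 - p_in) ** 2     # both independent draws miss the range
print(round(1 - p_neither, 2))  # 0.19: at least one draw lands in the range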
The average score on this problem was 66%.
You sample from a population by assigning each element of the
population a number starting with 1. You include element 1 in your
sample. Then you generate a random number, n, between 2 and
5, inclusive, and you take every nth element after element
1 to be in your sample. For example, if you select n=2,
then your sample will be elements 1, 3, 5, 7, and so
on.
True or False: Before the sample is drawn, you can calculate the probability of selecting each subset of the population.
Answer: True
The answer is True. There are only four possible values of n (2, 3, 4, and 5), each chosen with probability 1/4, so we can list out the sample that results from each value of n. For example, when n = 2, the sample contains elements 1, 3, 5, 7, and so on; similarly, we know the sample for n = 3, 4, and 5. Using this information, we can calculate the probability of selecting any given subset.
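For instance, assuming a population of 20 numbered elements (an arbitrary size chosen just for illustration), the four possible samples can be listed directly:
# Each n from 2 to 5 is equally likely; element 1 is always included.
for n in [2, 3, 4, 5]:
    print(n, list(range(1, 21, n)))   # element 1, then every nth element after it
# n=2 gives [1, 3, 5, ..., 19]; each of these four samples has probability 1/4,
# and every other subset of the population has probability 0.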
The average score on this problem was 97%.
True or False: Each subset of the population is equally likely to be selected.
Answer: False
No, each subset of the population is not equally likely to be selected, since the element assigned number 1 will always be in the sample, due to the way the sampling is conducted. That is, the question says we always include element 1 in the sample, which overrepresents it relative to other parts of the population. More concretely, only four subsets (one per value of n) have any chance of being selected, so those four subsets have probability 1/4 each while every other subset has probability 0.
The average score on this problem was 46%.
Assume df is a DataFrame with distinct rows. Which of
the following best describes df.sample(10)?
an array of length 10, where some of the entries might be the same
an array of length 10, where no two entries can be the same
a DataFrame with 10 rows, where some of the rows might be the same
a DataFrame with 10 rows, where no two rows can be the same
Answer: a DataFrame with 10 rows, where no two rows can be the same
Looking at the documentation for .sample(), we can see that it accepts a few arguments. The first argument specifies the number of rows (here, 10). The next argument is a boolean that specifies whether the sampling happens with or without replacement. By default, the sampling occurs without replacement (which is what happens in this question, since no replace argument is specified, so the default is used). Since we are sampling from a DataFrame, a DataFrame is also returned, which is why "a DataFrame with 10 rows, where no two rows can be the same" is correct.
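A minimal sketch of this default behavior, using a small made-up DataFrame:
import babypandas as bpd

df = bpd.DataFrame().assign(x=[10, 20, 30, 40, 50])   # hypothetical 5-row DataFrame
df.sample(3)                  # a DataFrame of 3 distinct rows: replace=False is the default
df.sample(3, replace=True)    # with replace=True, the same row may appear more than once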
The average score on this problem was 94%.
results = np.array([])
for i in np.arange(10):
    result = np.random.choice(np.arange(1000), replace=False)
    results = np.append(results, result)

After this code executes, results contains:
a simple random sample of size 9, chosen from a set of size 999 with replacement
a simple random sample of size 9, chosen from a set of size 999 without replacement
a simple random sample of size 10, chosen from a set of size 1000 with replacement
a simple random sample of size 10, chosen from a set of size 1000 without replacement
Answer: a simple random sample of size 10, chosen from a set of size 1000 with replacement
Let’s see what the code is doing. The first line initializes an empty
array called results. The for loop runs 10 times. Each
time, it creates a value called result by some process
we’ll inspect shortly and appends this value to the end of the
results array. At the end of the code snippet,
results will be an array containing 10 elements.
Now, let’s look at the process by which each element
result is generated. Each result is a random
element chosen from np.arange(1000) which is the numbers
from 0 to 999, inclusive. That’s 1000 possible numbers. Each time
np.random.choice is called, just one value is chosen from
this set of 1000 possible numbers.
When we sample just one element from a set of values, sampling with replacement is the same as sampling without replacement, because sampling with or without replacement concerns whether subsequent draws can be the same as previous ones. When we’re just sampling one element, it really doesn’t matter whether our process involves putting that element back, as we’re not going to draw again!
Therefore, result is just one random number chosen from
the 1000 possible numbers. Each time the for loop executes,
result gets set to a random number chosen from the 1000
possible numbers. It is possible (though unlikely) that the random
result of the first execution of the loop matches the
result of the second execution of the loop. More generally,
there can be repeated values in the results array since
each entry of this array is independently drawn from the same set of
possibilities. Since repetitions are possible, this means the sample is
drawn with replacement.
Therefore, the results array contains a sample of size
10 chosen from a set of size 1000 with replacement. This is called a
“simple random sample” because each possible sample of 10 values is
equally likely, which comes from the fact that
np.random.choice chooses each possible value with equal
probability by default.
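In fact, because each iteration independently draws one value from the same set of 1000 numbers, the entire loop is equivalent to a single call that samples with replacement; a sketch:
import numpy as np

# Equivalent to the whole loop above: 10 independent draws from 0 through 999.
results = np.random.choice(np.arange(1000), size=10)   # replace=True is the default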
The average score on this problem was 11%.
Describe in your own words the difference between a probability distribution and an empirical distribution. Give an example of what each distribution might look like for a certain experiment. Choose an experiment that we have not already seen in this class.
Answer: There are many possible correct answers. Below are some student responses that earned full credit, lightly edited for clarity.
Probability distributions are theoretical distributions over all possible values of an experiment. Meanwhile, empirical distributions are distributions of the real observed data. An example of this would be choosing a certain suit from a deck of cards. The probability distribution would be uniform, with a 1/4 chance of choosing each suit. Meanwhile, the empirical distribution, obtained by manually drawing a suit 50 times and graphing the observed data, would likely show proportions close to, but not exactly, 1/4 each.
A probability distribution is the distribution describing the theoretical probability of each potential value occurring in an experiment, while the empirical distribution describes the proportion of each of the values in the experiment after running it, including all observed values. In other words, the probability distribution is what we expect to happen, and the empirical distribution is what actually happens.
For example: My friends and I often go to a food court to eat, and we randomly pick a restaurant every time. There is 1 McDonald’s, 1 Subway, and 2 Panda Express restaurants in the food court.
The probability distribution is as follows: McDonald's with probability 1/4, Subway with probability 1/4, and Panda Express with probability 2/4 = 1/2.
After going to the food court 100 times, we look at the empirical distribution to see which restaurants we eat at most often: the observed proportion of trips that ended at each restaurant.
Probability distribution is a theoretical representation of certain outcomes in an event whereas an empirical distribution is the observational representation of the same outcomes in an event produced from an experiment.
An example would be if I had 10 pairs of shoes in my closet: The probability distribution would suggest that each pair of shoes has an equal chance of getting picked on any given day. On the other hand, an empirical distribution would be drawn by recording which pair got picked on a given day in N trials.
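The food court example above can be simulated directly; the restaurant names come from that response, and everything else is a sketch:
import numpy as np

# Panda Express appears twice, encoding the 1/4, 1/4, 1/2 probability distribution.
options = np.array(["McDonald's", "Subway", "Panda Express", "Panda Express"])

visits = np.random.choice(options, 100)   # 100 random trips to the food court
for place in ["McDonald's", "Subway", "Panda Express"]:
    print(place, np.count_nonzero(visits == place) / 100)   # empirical proportions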
The average score on this problem was 82%.
Suppose we take a uniform random sample with replacement from a population, and use the sample mean as an estimate for the population mean. Which of the following is correct?
If we take a larger sample, our sample mean will be closer to the population mean.
If we take a smaller sample, our sample mean will be closer to the population mean.
If we take a larger sample, our sample mean is more likely to be close to the population mean than if we take a smaller sample.
If we take a smaller sample, our sample mean is more likely to be close to the population mean than if we take a larger sample.
Answer: If we take a larger sample, our sample mean is more likely to be close to the population mean than if we take a smaller sample.
Larger samples tend to give better estimates of the population mean than smaller samples. That’s because large samples are more like the population than small samples. We can see this in the extreme. Imagine a sample of 1 element from a population. The sample might vary a lot, depending on the distribution of the population. On the other extreme, if we sample the whole population, our sample mean will be exactly the same as the population mean.
Notice that the correct answer choice uses the words “is more likely to be close to” as opposed to “will be closer to.” We’re talking about a general phenomenon here: larger samples tend to give better estimates of the population mean than smaller samples. We cannot say that if we take a larger sample our sample mean “will be closer to” the population mean, since it’s always possible to get lucky with a small sample and unlucky with a large sample. That is, one particular small sample may happen to have a mean very close to the population mean, and one particular large sample may happen to have a mean that’s not so close to the population mean. This can happen, it’s just not likely to.
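A quick simulation of this phenomenon, using an arbitrary made-up population:
import numpy as np

population = np.random.normal(70, 10, 100000)   # hypothetical population with mean 70

for size in [10, 1000]:
    means = np.array([])
    for i in np.arange(2000):
        sample = np.random.choice(population, size)   # uniform random sample with replacement
        means = np.append(means, sample.mean())
    print(size, np.std(means))   # the sample mean's spread shrinks as the sample grows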
The average score on this problem was 100%.
As in the previous question, let coop_sample be a sample
of 100 rows of games, all corresponding to cooperative games.
Define samp and resamp as follows.
samp = coop_sample.get("Complexity")
resamp = coop_sample.sample(100, replace=True).get("Complexity")Which of the following statements could evaluate to True? Select all that are possible.
len(samp.unique()) < len(resamp.unique())
len(samp.unique()) == len(resamp.unique())
len(samp.unique()) > len(resamp.unique())
Answer: Options 2 and 3
Option 2: This is correct because it is possible for resamp to end up with exactly the same number of unique elements as samp (for example, if every row of coop_sample happens to be drawn exactly once).
Option 3: This is correct because it is possible for resamp to pull the same values more often, leaving it with fewer unique values than samp.
Option 1: This is incorrect because samp contains every value that resamp could possibly contain. When we resample using coop_sample.sample(100, replace=True), we may pull the same value multiple times, which can only decrease (or preserve) the number of unique values, never increase it.
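A small sketch of this behavior, with made-up complexity values standing in for coop_sample:
import numpy as np

samp = np.array([1.5, 2.0, 2.0, 3.5, 4.0])           # hypothetical original sample
resamp = np.random.choice(samp, len(samp))            # resample with replacement
print(len(np.unique(samp)), len(np.unique(resamp)))   # resamp can tie samp, never exceed it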
The average score on this problem was 91%.
Which of the following statements could evaluate to True? Select all that are possible.
np.count_nonzero(samp == 1) < np.count_nonzero(resamp == 1)
np.count_nonzero(samp == 1) == np.count_nonzero(resamp == 1)
np.count_nonzero(samp == 1) > np.count_nonzero(resamp == 1)
Answer: Options 1, 2, and 3
Option 1: It might be helpful to recall what exactly the column "Complexity" holds: the average complexity of the game on a scale of 1 to 5. The code checks whether the number of ones in samp and resamp differ. Because we resample with replace=True, it is possible that resamp contains more ones than samp.
Option 2: Once again, it is possible that resamp contains exactly the same number of ones as samp.
Option 3: When we resample from coop_sample, there is no guarantee that we will sample more ones; other values could be selected instead. This means it is possible for the number of ones in samp to be greater than the number of ones in resamp.
The average score on this problem was 83%.
Which of the following statements could evaluate to True? Select all that are possible.
samp.min() < resamp.min()
samp.min() == resamp.min()
samp.min() > resamp.min()
Answer: Options 1 and 2
Option 1: It is possible that samp's original minimum is never drawn in the resample, making resamp's minimum greater than samp's minimum.
Option 2: If samp's original minimum is drawn, then it will also be the minimum of resamp.
Option 3: It is impossible for resamp's minimum to be less than samp's minimum. This is because all of resamp's values come from samp, so there cannot be a smaller value inside resamp that never appears in samp.
The average score on this problem was 83%.
Which of the following statements could evaluate to True? Select all that are possible.
np.std(samp) < np.std(resamp)
np.std(samp) == np.std(resamp)
np.std(samp) > np.std(resamp)
Answer: Options 1, 2, and 3
Option 1: np.std() gives us the standard deviation of the array we give it. np.std(samp) is the standard deviation of the original "Complexity" values, while np.std(resamp) is the standard deviation of the resampled values, which may include some values multiple times. Since values can be drawn repeatedly, the standard deviation can become smaller if we happen to repeatedly draw values that are close together.
Option 2: If resamp contains exactly the same values as samp, we end up with the same standard deviation, making np.std(samp) == np.std(resamp).
Option 3: Similar to Option 1, we may repeatedly draw values on the extremes, which could increase the standard deviation.
The average score on this problem was 79%.
Suppose the function simulate_lcr from the last question
has been correctly implemented, and we want to use it to see how many
turns a game of Left, Center, Right usually takes.
Note: You can answer this question even if you couldn’t answer the previous one.
Consider the code and histogram below.
turns = np.array([])
for i in np.arange(10000):
    turns = np.append(turns, simulate_lcr())

(bpd.DataFrame().assign(turns=turns).plot(kind="hist", density=True, ec="w", bins=np.arange(0, 66, 6)))
Does this histogram show a probability distribution or an empirical distribution?
Probability Distribution
Empirical Distribution
Answer: Empirical Distribution
An empirical distribution is derived from observed data, in this case, the results of 10,000 simulated games of Left, Center, Right. It represents the frequencies of outcomes (number of turns taken in each game) as observed in these simulations.
The average score on this problem was 54%.
What is the probability of a game of Left, Center, Right lasting 30 turns or more? Choose the closest answer below.
0.01
0.06
0.10
0.60
Answer: 0.06
We’re being asked to find the proportion of values in the histogram that are greater than or equal to 30, which is equal to the area of the histogram to the right of 30. Immediately, we can rule out 0.01 and 0.60, because the area to the right of 30 is more than 1% of the total area and less than 60% of the total area.
The problem then boils down to determining whether the area to the right of 30 is 0.06 or 0.10. While you could solve this by finding the areas of the three bars individually and adding them together, there's a quicker solution. Notice that the x-axis gridlines (the vertical lines in the background in white) appear every 10 units (at x = 0, x = 10, x = 20, x = 30, and so on) and the y-axis gridlines (the horizontal lines in the background in white) appear every 0.01 units (at y = 0, y = 0.01, y = 0.02, and so on). There's a "box" in the grid between x = 30 and x = 40, and between y = 0 and y = 0.01. The area of that box is (40 - 30) \cdot 0.01 = 0.1, which means that if a bar took up the entire box, then 10% of the values in this distribution would fall into that bar's bin.
So, to decide whether the area to the right of 30 is closer to 0.06 or 0.1, we can estimate whether the three bars to the right of 30 would fill up the entire box described above (that is, the box from 30 to 40 on the x-axis and 0 to 0.01 on the y-axis), or whether they would leave much of it empty. Visually, if you broke off the area that is to the right of 40 in the histogram and put it in the box we've just described, quite a bit of the box would still be empty. As such, the area to the right of 30 is less than the area of the box, so it's less than 0.1, and the only valid option is 0.06.
The average score on this problem was 50%.
Suppose a player with n chips takes their turn. What is the probability that they will have to put at least one chip into the center? Give your answer as a mathematical expression involving n.
Answer: 1 - (\frac{5}{6})^n
Recall that the die used to play this game has six sides: L, C, R, Dot, Dot, Dot. The chance of rolling a C on a single die is \frac{1}{6}, so the complement, \frac{5}{6}, is the probability that a single die does not come up C. Since the player rolls n dice independently, the probability that none of them comes up C (that is, no chips go into the center) is (\frac{5}{6})^n. Applying the complement rule once more converts this back to the probability of putting at least one chip into the center, giving the answer: 1 - (\frac{5}{6})^n.
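A simulation check of this formula for one arbitrary value of n:
import numpy as np

die = np.array(['L', 'C', 'R', 'Dot', 'Dot', 'Dot'])
n = 3                                       # an arbitrary number of chips
rolls = np.random.choice(die, (100000, n))  # 100,000 simulated turns, n dice each
at_least_one_c = np.count_nonzero((rolls == 'C').sum(axis=1) > 0) / 100000
print(at_least_one_c, 1 - (5 / 6) ** n)     # both approximately 0.4213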
The average score on this problem was 56%.
Suppose a player with n chips takes their turn. What is the probability that they will end their turn with n chips? Give your answer as a mathematical expression involving n.
Answer: \left( \frac{1}{2} \right)^n
Recall, when it is a player’s turn, they roll one die for each of the n chips they have. The die that they roll has six faces. In three of those faces (L, C, and R), they end up losing a chip, and in the other three of those faces (dot, dot, and dot), they keep the chip. So, for each chip, there is a \frac{3}{6} = \frac{1}{2} chance that they get to keep it after the turn. Since each die roll is independent, there is a \frac{1}{2} \cdot \frac{1}{2} \cdot ... \cdot \frac{1}{2} = \left( \frac{1}{2} \right)^n chance that they get to keep all n chips. (Note that there is no way to earn more chips during a turn, so that’s not something we need to consider.)
The average score on this problem was 69%.