← return to practice.dsc10.com

**Instructor(s):** Janine Tiefenbruck, Suraj Rampure

This exam was administered remotely via Gradescope. The exam was
open-internet, and students were able to use Jupyter Notebooks. They had
**3 hours** to work on it.

Welcome to the Final Exam! In honor of the (at the time) brand-new Comic-Con Museum that just opened at Balboa Park here in San Diego, this exam will contain questions about various museums and zoos around the world.

In this question, we’ll work with the DataFrame
`art_museums`

, which contains the name, city, number of
visitors in 2019, and rank (based on number of visitors) for the 100
most visited art museums in 2019. The first few rows of
`art_museums`

are shown below.

**Tip:** Open this page in another tab, so that it is
easy to refer to this data description as you work through the exam.

Which of the following blocks of code correctly assigns
`random_art_museums`

to an array of the names of 10 art
museums, randomly selected without replacement from those in
`art_museums`

? Select all that apply.

Option 1:

```
def get_10(df):
return np.array(df.sample(10).get('Name'))
= get_10(art_museums) random_art_museums
```

Option 2:

```
def get_10(art_museums):
return np.array(art_museums.sample(10).get('Name'))
= get_10(art_museums) random_art_museums
```

Option 3:

```
def get_10(art_museums):
= np.array(art_museums.sample(10).get('Name'))
random_art_museums
= get_10(art_museums) random_art_museums
```

Option 4:

```
def get_10():
return np.array(art_museums.sample(10).get('Name'))
= get_10() random_art_museums
```

Option 5:

```
= np.array([])
random_art_museums
def get_10():
= np.array(art_museums.sample(10).get('Name'))
random_art_museums return random_art_museums
get_10()
```

Option 1

Option 2

Option 3

Option 4

Option 5

None of the above

**Answers:** Option 1, Option 2, and Option 4

Note that if `df`

is a DataFrame, then
`df.sample(10)`

is a DataFrame containing 10 randomly
selected rows in `df`

. With that in mind, let’s look at all
of our options.

**Option 1:**This implementation of`get_10`

takes in a DataFrame`df`

and returns an array containing 10 randomly selected values in`df`

’s`'Name'`

column. After defining`get_10`

, we assign`random_art_museums`

to the result of calling`get_10(art_museums)`

. This assigns`random_art_museums`

as intended, so Option 1 is correct.**Option 2:**This option is functionally the same as Option 1, so it is also correct. The only difference between Option 2 and Option 1 is that Option 2 uses the parameter name`art_museums`

and Option 1 uses the parameter name`df`

(both in the`def`

line and in the function body); this does not change the behavior of`get_10`

or the lines afterward.**Option 3:**`get_10`

here does not return anything! So,`get_10(art_museums)`

evaluates to`None`

(which means “nothing” in Python), and`random_art_museums`

is also`None`

, meaning Option 3 is incorrect.**Option 4:**At first, it may appear that this option is wrong, as`get_10`

does not take in any inputs. However, the body of`get_10`

contains a reference to the DataFrame`art_museums`

, which is ultimately where we want to sample from. As a result,`get_10`

does indeed return an array containing 10 randomly selected museum names, and`random_art_museums = get_10()`

correctly assigns`random_art_museums`

to this array, so Option 4 is correct.**Option 5:**Here,`get_10`

returns the correct array. However, outside of the function,`random_art_museums`

is never assigned to the output of`get_10`

. (The variable name`random_art_museums`

inside the function has nothing to do with the array defined before and outside the function.) As a result, after running the line`get_10()`

at the bottom of the code block,`random_art_museums`

is still an empty array, and as such, Option 5 is incorrect.

The average score on this problem was 85%.

London has the most art museums in the top 100 of any city in the
world. The most visited art museum in London is
`'Tate Modern'`

.

Which of the following blocks of code correctly assigns
`best_in_london`

to `'Tate Modern'`

? Select all
that apply.

Option 1:

```
def most_common(df, col):
return df.groupby(col).count().sort_values(by='Rank', ascending=False).index[0]
def most_visited(df, col, value):
return df[df.get(col)==value].sort_values(by='Visitors', ascending=False).get('Name').iloc[0]
= most_visited(art_museums, 'City', most_common(art_museums, 'City')) best_in_london
```

Option 2:

```
def most_common(df, col):
print(df.groupby(col).count().sort_values(by='Rank', ascending=False).index[0])
def most_visited(df, col, value):
print(df[df.get(col)==value].sort_values(by='Visitors', ascending=False).get('Name').iloc[0])
= most_visited(art_museums, 'City', most_common(art_museums, 'City')) best_in_london
```

Option 3:

```
def most_common(df, col):
return df.groupby(col).count().sort_values(by='Rank', ascending=False).index[0]
def most_visited(df, col, value):
print(df[df.get(col)==value].sort_values(by='Visitors', ascending=False).get('Name').iloc[0])
= most_visited(art_museums, 'City', most_common(art_museums, 'City')) best_in_london
```

Option 1

Option 2

Option 3

None of the above

**Answer:** Option 1 only

At a glance, it may seem like there’s a lot of reading to do to
answer the question. However, it turns out that all 3 options follow
similar logic; the difference is in their use of `print`

and
`return`

statements. Whenever we want to “save” the output of
a function to a variable name or use it in another function, we need to
`return`

somewhere within our function. Only Option 1
contains a `return`

statement in both
`most_common`

and `most_visited`

, so it is the
only correct option.

Let’s walk through the logic of Option 1 (which we don’t necessarily need to do to answer the problem, but we should in order to enhance our understanding):

- First, we use
`most_common`

to find the city with the most art museums.`most_common`

does this by grouping the input DataFrame`df`

(`art_museums`

, in this case) by`'City'`

and using the`.count()`

method to find the number of rows per`'City'`

. Note that when using`.count()`

, all columns in the aggregated DataFrame will contain the same information, so it doesn’t matter which column you use to extract the counts per group. After sorting by one of these columns (`'Rank'`

, in this case) in decreasing order,`most_common`

takes the first value in the`index`

, which will be the name of the`'City'`

with the most art museums.**This is London**, i.e.`most_common(art_museums, 'City')`

evaluates to`'London'`

in Option 1 (in Option 2, it evaluates to`None`

, since`most_common`

there doesn’t`return`

anything). - Then, we use
`most_visited`

to find the museum with the most visitors in the city with the most museums. This is achieved by keeping only the rows of the input DataFrame`df`

(again,`art_museums`

in this case) where the value in the`col`

(`'City'`

) column is`value`

(`most_common(art_museums, 'City')`

, or`'London'`

). Now that we only have information for museums in London, we can sort by`'Visitors'`

to find the most visited such museum, and take the first value from the resulting`'Name'`

column. While all 3 options follow this logic, only Option 1**returns**the desired value, and so only Option 1 assigns`best_in_london`

correctly. (Even if Option 2’s`most_visited`

used`return`

instead of`print`

, it still wouldn’t work, since Option 2’s`most_common`

also uses`print`

instead of`return`

).

The average score on this problem was 86%.

In this question, we’ll keep working with the
`art_museums`

DataFrame.

(Remember to keep the data description from the top of the exam open in another tab!)

`'Tate Modern'`

is the most popular art museum in London.
But what’s the most popular art museum in each city?

It turns out that there’s no way to answer this easily using the
tools that you know about so far. To help, we’ve created a new Series
method, `.last()`

. If `s`

is a Series,
`s.last()`

returns the last element of `s`

(i.e. the element at the very end of `s`

).
`.last()`

works with `.groupby`

, too (just like
`.mean()`

and `.count()`

).

Fill in the blanks so that the code below correctly assigns
`best_per_city`

to a DataFrame with one row per city, that
describes the name, number of visitors, and rank of the most visited art
museum in each city. `best_per_city`

should be sorted in
decreasing order of number of visitors. The first few rows of
`best_per_city`

are shown below.

`= __(a)__.groupby(__(b)__).last().__(c)__ best_per_city `

- What goes in blank (a)?
- What goes in blank (b)?
- What goes in blank (c)?

**Answers:**

`art_museums.sort_values('Visitors', ascending=True)`

`'City'`

`sort_values('Visitors', ascending=False)`

Let’s take a look at the completed implementation.

`= art_museums.sort_values('Visitors', ascending=True).groupby('City').last().sort_values('Visitors', ascending=False) best_per_city `

We first sort the row in `art_museums`

by the number of
`'Visitors'`

in `ascending`

order. Then
`goupby('City')`

, so that we have one row per city. Recall,
we need an aggregation method after using `groupby()`

. In
this question, we use `last()`

to only keep the one last row
for each city. Since in blank (a) we have sorted the rows by
`'Visitors'`

, so `last()`

keeps the row that
contains the name of the most visited museum in that city. At last we
sort this new DataFrame by `'Visitors'`

in descending order
to fulfill the question’s requirement.

The average score on this problem was 65%.

Assume you’ve defined `best_per_city`

correctly.

Which of the following options evaluates to the number of visitors to the most visited art museum in Amsterdam? Select all that apply.

`best_per_city.get('Visitors').loc['Amsterdam']`

`best_per_city[best_per_city.index == 'Amsterdam'].get('Visitors').iloc[0]`

`best_per_city[best_per_city.index == 'Amsterdam'].get('Visitors').iloc[-1]`

`best_per_city[best_per_city.index == 'Amsterdam'].get('Visitors').loc['Amsterdam']`

None of the above

**Answer:**
`best_per_city.get('Visitors').loc['Amsterdam']`

,
`best_per_city[best_per_city.index == 'Amsterdam'].get('Visitors').iloc[0]`

,
`best_per_city[best_per_city.index == 'Amsterdam'].get('Visitors').iloc[-1]`

,
`best_per_city[best_per_city.index == 'Amsterdam'].get('Visitors').loc['Amsterdam']`

(Select all except “None of the above”)

`best_per_city.get('Visitors').loc['Amsterdam']`

We first
use `.get(column_name)`

to get a series with number of
visitors to the most visited art museum, and then locate the number of
visitors to the most visited art museum in Amsterdam using
`.loc[index]`

since we have `"City"`

as index.

`best_per_city[best_per_city.index == 'Amsterdam'].get('Visitors').iloc[0]`

We first query the `best_per_city`

to only include the
DataFrame with one row with index `'Amsterdam'`

. Then, we get
the `'Visitors'`

column of this DataFrame. Finally, we use
`iloc[0]`

to access the first and the only value in this
column.

`best_per_city[best_per_city.index == 'Amsterdam'].get('Visitors').iloc[-1]`

We first query the `best_per_city`

to only include the
DataFrame with one row with index `'Amsterdam'`

. Then, we get
the `'Visitors'`

column of this DataFrame. Finally, we use
`iloc[-1]`

to access the last and the only value in this
column.

`best_per_city[best_per_city.index == 'Amsterdam'].get('Visitors').loc['Amsterdam']`

We first query the `best_per_city`

to only include the
DataFrame with one row with index `'Amsterdam'`

. Then, we get
the `'Visitors'`

column of this DataFrame. Finally, we use
`loc['Amsterdam']`

to access the value in this column with
index `'Amsterdam'`

.

The average score on this problem was 84%.

The table below shows the average amount of revenue from different sources for art museums in 2003 and 2013.

What is the total variation distance between the distribution of
revenue sources in 2003 and the distribution of revenue sources in 2013?
Give your answer as a proportion (i.e. a decimal between 0 and 1),
**not** a percentage. Round your answer to three decimal
places.

**Answer:** 0.19

Recall, the total variation distance (TVD) is the sum of the absolute differences in proportions, divided by 2. The absolute differences in proportions for each source are as follows:

- Admissions: |0.15 - 0.24| = 0.09
- Restaurants and Catering: |0.09 - 0.12| = 0.03
- Store: |0.52 - 0.33| = 0.19
- Other: |0.24 - 0.31| = 0.07

Then, we have

\text{TVD} = \frac{1}{2} (0.09 + 0.03 + 0.19 + 0.07) = 0.19

The average score on this problem was 95%.

Which type of visualization would be best suited for comparing the two distributions in the table?

Scatter plot

Line plot

Overlaid histogram

Overlaid bar chart

**Answer:** Overlaid bar chart

A **scatter plot** visualizes the relationship between
two numerical variables. In this problem, we only have to visualize the
distribution of a categorical variable.

A **line plot** shows trends in numerical variables over
time. In this problem, we only have categorical variables. Moreover,
when it says over time, it is suitable for plotting change in multiple
years (e.g. 2001, 2002, 2003, … , 2013), or even with data of days. In
this question, we only want to compare the distribution of 2003 and
2013, this makes the line plot not useful. In addition, if you try to
draw a line plot for this question, you will find the line plot fails to
visualize distribution (e.g. the idea of 15%, 9%, 52%, and 24% add up to
100%).

An overlaid graph is useful in this question since this visualizes comparison between the two distributions.

However, an **overlaid histogram** is not useful in this
problem. The key reason is the differences between a histogram and a bar
chart.

Bar Chart: Space between the bars; 1 categorical axis, 1 numerical
axis; order does not matter

Histogram: No space between the bars; intervals on axis; 2 numerical
axes; order matters

In the question, we are plotting 2003 and 2013 distributions of four
categories (Admissions, Restaurants and Catering, Store, and Other).
Thus, an **overlaid bar chart** is more appropriate.

The average score on this problem was 74%.

**Note: This problem is out of scope; it
covers material no longer included in the course.**

Notably, there was an economic recession in 2008-2009. Which of the following can we conclude was an effect of the recession?

The increase in revenue from admissions, as more people were visiting museums.

The decline in revenue from museum stores, as people had less money to spend.

The decline in total revenue, as fewer people were visiting museums.

None of the above

**Answer:** None of the above

Since we are only given the distribution of the revenue, and have no information about the amount of revenue in 2003 and 2013, we cannot conclude how the revenue has changed from 2003 to 2013 after the recession.

For instance, if the total revenue in 2003 was 100 billion USD and the total revenue in 2013 was 50 billion USD, revenue from admissions in 2003 was 100 * 15% = 15 billion USD, and revenue from admissions in 2003 was 50 * 24% = 12 billion USD. In this case, we will have 15 > 12, the revenue from admissions has declined rather than increased (As stated by ‘The increase in revenue from admissions, as more people were visiting museums.’). Similarly, since we don’t know the total revenue in 2003 and 2013, we cannot conclude ‘The decline in revenue from museum stores, as people had less money to spend.’ or ‘The decline in total revenue, as fewer people were visiting museums.’

The average score on this problem was 72%.

The Museum of Natural History has a large collection of dinosaur
bones, and they know the approximate year each bone is from. They want
to use this sample of dinosaur bones to estimate **the total
number of years that dinosaurs lived on Earth**. We’ll make the
assumption that the sample is a uniform random sample from the
population of all dinosaur bones. Which statistic below will give the
best estimate of the population parameter?

sample sum

sample max - sample min

2 \cdot (sample mean - sample min)

2 \cdot (sample max - sample mean)

2 \cdot sample mean

2 \cdot sample median

**Answer:** sample max - sample min

Our goal is to estimate **the total number of years that
dinosaurs lived on Earth**. In other words, we want to know the
range of time that dinosaurs lived on Earth, and by definition range =
biggest value - smallest value. By using “sample max - sample min”, we
calculate the difference between the earliest and the latest dinosaur
bones in this uniform random sample, which helps us to estimate the
population range.

The average score on this problem was 52%.

The curator at the Museum of Natural History, who happens to have taken a data science course in college, points out that the estimate of the parameter obtained from this sample could certainly have come out differently, if the museum had started with a different sample of bones. The curator suggests trying to understand the distribution of the sample statistic. Which of the following would be an appropriate way to create that distribution?

bootstrapping the original sample

using the Centrual Limit Theorem

both bootstrapping and the Central Limit Theorem

neither bootstrapping nor the Central Limit Theorem

**Answer:** neither bootstrapping nor the Central Limit
Theorem

Recall, the Central Limit Theorem (CLT) says that the probability
distribution of **the sum or average** of a large random
sample drawn with replacement will be roughly normal, regardless of the
distribution of the population from which the sample is drawn. Thus, the
theorem only applies when our sample statistics is sum or average, while
in this question, our statistics is range, so CLT does not apply.

Bootstrapping is a valid technique for estimating measures of central tendency, e.g. the population mean, median, standard deviation, etc. It doesn’t work well in estimating extreme or sensitive values, like the population maximum or minimum. Since the statistic we’re trying to estimate is the difference between the population maximum and population minimum, bootstrapping is not appropriate.

The average score on this problem was 20%.

Now, the Museum of Natural History wants to know how many visitors they have in a year. However, their computer systems are rather archaic and so they aren’t able to keep track of the number of tickets sold for an entire year. Instead, they randomly select five days in the year, and keep track of the number of visitors on those days. Let’s call these numbers v_1, v_2, v_3, v_4, and v_5.

Which of the following is the best estimate the number of visitors for the entire year?

v_1 + v_2 + v_3 + v_4 + v_5

\frac{5}{365}\cdot(v_1 + v_2 + v_3 + v_4 + v_5)

\frac{365}{5}\cdot(v_1 + v_2 + v_3 + v_4 + v_5)

365\cdot v_3

**Answer:** \frac{365}{5}\cdot(v_1 + v_2 + v_3 + v_4 +
v_5)

Our sample is the number of visitors on the five days, and our population is the number of visitors in all 365 days.

First, we calculate the sample mean, the average number of visitors in the 5 days, which is m = \frac{1}{5}\cdot(v_1 + v_2 + v_3 + v_4 + v_5). We use this statistic to estimate the population mean, the average number of visitors in this year.

Then, we use the estimated population mean to calculate the estimated pupulation sum, so we multiply the number of days in a year (365) with the estimated population mean. We get 365 m = \frac{365}{5}\cdot(v_1 + v_2 + v_3 + v_4 + v_5)

The average score on this problem was 92%.

Now we’re interested in predicting the admission cost of a museum based on its number of visitors. Suppose:

admission cost and number of visitors are linearly associated with a correlation coefficient of 0.25,

the number of visitors at the Museum of Natural History is six standard deviations below average,

the average cost of museum admission is 15 dollars, and

the standard deviation of admission cost is 3 dollars.

What would the regression line predict for the admission cost (in dollars) at the Museum of Natural History? Give your answer as a number without any units, rounded to three decimal places.

**Answer:** 10.500

Recall, we can make predictions in standard units with the following formula

\text{predicted}\ y_{su} = r \cdot x_{su}

We’re given that the correlation coefficient, r, between visitors and admission cost is
0.25. Here, we’re using the number of visitors (x) to predict admission cost (y). Given that the number of visitors at the
Museum of Natural History is 6 standard deviations
**below** average,

\text{predicted}\ y_{su} = r \cdot x_{su} = 0.25 \cdot -6 = -1.5

We then compute y, which is the admission cost (in dollars) at the Museum of Natural History.

\begin{align*} y_{su} = \frac{y- \text{Mean of } y}{\text{SD of } y}\\ -1.5 &= \frac{y-15}{3} \\ -1.5 \cdot 3 &= y-15\\ -4.5 + 15 &= y\\ y &= \boxed{10.5} \end{align*}

So, the regression line predicts that the admission cost at the Museum of Natural History is $10.50.

The average score on this problem was 62%.

Researchers from the San Diego Zoo, located within Balboa Park, collected physical measurements of several species of penguins in a region of Antarctica.

One piece of information they tracked for each of 330 penguins was its mass in grams. The average penguin mass is 4200 grams, and the standard deviation is 840 grams.

Consider the histogram of mass below.

Select the true statement below.

The median mass of penguins is larger than the average mass of penguins

The median mass of penguins is roughly equal to the average mass of penguins (within 50 grams)

The median mass of penguins is less than the average mass of penguins

It is impossible to determine the relationship between the median and average mass of penguins just by looking at the above histogram

**Answer:** The median mass of penguins is less than the
average mass of penguins

This is a distribution that is skewed to the right, so mean is greater than median.

The average score on this problem was 87%.

Which of the following is a valid conclusion that we can draw solely from the histogram above?

The number of penguins with a mass of exactly 3500 grams is greater than the number of penguins with a mass of exactly 5500 grams.

The number of penguins with a mass of at most 3500 grams is greater than the number of penguins with a mass of at least 5500 grams.

There is an odd number of penguins in the dataset.

The number penguins with a mass of exactly 4000 grams is greater than zero.

None of the above.

**Answer:** The number of penguins with a mass of at
most 3500 grams is greater than the number of penguins with a mass of at
least 5500 grams.

Recall, a histogram has intervals on the axis, so we cannot know the frequency of an exact value. Thus, we cannot conclude statements 1, 3, 4. Since the frequency of an exact value is unknown, for statement 3, it is possible that all numbers we have in this distribution are even. Although in the graph, we are only given frequency rather than number, we can justify statement 2 by comparing the area in the left side of 3500, and the area in the right side of 5500. You can either estimate by visually comparing the areas of both parts or compute the area sum of both sides by estimating the bars’ height and windth.

The average score on this problem was 89%.

For your convenience, we show the histogram of mass again below.

Recall, there are 330 penguins in our dataset. Their average mass is 4200 grams, and the standard deviation of mass is 840 grams.

Per Chebyshev’s inequality, at least what percentage of penguins have a mass between 3276 grams and 5124 grams? Input your answer as a percentage between 0 and 100, without the % symbol. Round to three decimal places.

**Answer:** 17.355

Recall, Chebyshev’s inequality states that No matter what the shape
of the distribution is, the proportion of values in the range “average ±
z SDs” is **at least** 1 -
\frac{1}{z^2}.

To approach the problem, we’ll start by converting 3276 grams and
5124 grams to standard units. Doing so yields \frac{3276 - 4200}{840} = -1.1, similarly,
\frac{5124 - 4200}{840} = 1.1. This
means that 3276 is 1.1 standard deviations **below** the
mean, and 5124 is 1.1 standard deviations **above** the
mean. Thus, we are calculating the proportion of values in the range
“average ± 1.1 SDs”.

When z = 1.1, we have 1 - \frac{1}{z^2} = 1 - \frac{1}{1.1^2} \approx 0.173553719, which as a percentage rounded to three decimal places is 17.355\%.

The average score on this problem was 76%.

Per Chebyshev’s inequality, at least what percentage of penguins have a mass between 1680 grams and 5880 grams?

50%

55.5%

65.25%

68%

75%

88.8%

95%

**Answer:** 75%

Recall: proportion with z SDs of the mean

Percent in Range | All Distributions (via Chebyshev’s Inequality) | Normal Distributions |
---|---|---|

\text{average} \pm 1 \ \text{SD} | \geq 0\% | \approx 68\% |

\text{average} \pm 2\text{SDs} | \geq 75\% | \approx 95\% |

\text{average} \pm 3\text{SDs} | \geq 88\% | \approx 99.73\% |

To approach the problem, we’ll start by converting 3276 grams and
5124 grams to standard units. Doing so yields \frac{1680 - 4200}{840} = -3, similarly,
\frac{5880 - 4200}{840} = 2. This means
that 1680 is 3 standard deviations **below** the mean, and
5880 is 2 standard deviations **above** the mean.

Proportion of values in [-3 SUs, 2 SUs] >= Proportion of values in
[-2 SUs, 2 SUs] >= 75% (Since we cannot assume that the distribution
is normal, we look at the **All Distributions (via Chebyshev’s
Inequality)** column for proportion).

Thus, **at least** 75% of the penguins have a mass
between 1680 grams and 5880 grams.

The average score on this problem was 72%.

The distribution of mass in grams is not roughly normal. Is the distribution of mass in standard units roughly normal?

Yes

No

Impossible to tell

**Answer:** No

The shape of the distribution does not change since we are scaling the x values for all data.

The average score on this problem was 60%.

Suppose all 330 penguin body masses (in grams) that the researchers
collected are stored in an array called `masses`

. We’d like
to estimate the probability that two different randomly selected
penguins from our dataset have body masses within 50 grams of one
another (including a difference of exactly 50 grams). Fill in the
missing pieces of the simulation below so that the function
`estimate_prob_within_50g`

returns an estimate for this
probability.

```
def estimate_prob_within_50g():
= 10000
num_reps = 0
within_50g_count for i in np.arange(num_reps):
= np.random.choice(__(a)__)
two_penguins if __(b)__:
= within_50g_count + 1
within_50g_count return within_50g_count / num_reps
```

What goes in blank (a)? What goes in blank (b)?

**Answer:** (a) `masses, 2, replace=False`

(b) `abs(two_penguins[0] - two_penguins[1])<=50`

- Recall,
`np.random.choice( )`

can have three parameters`array, n, replace=False`

, and returns n elements from the array at random, without replacement. We are randomly choosing**2 different**penguins from the`masses`

**array**, so we are using`np.random.choice( )`

without replacement. - We want to count the number of pairs of penguins that have body
masses difference within 50 grams, so we are using the index to access
the two penguins generated from
`two_penguins`

and calculating their absolute difference with`abs()`

. And in this`if`

condition, we only want to have penguins with absolute difference less than or equal to 50, so we write a`<=`

condition to justify whether the generated pairs of penguins fulfill this requirement.

The average score on this problem was 84%.

Recall, there are 330 penguins in our dataset. Their average mass is 4200 grams, and the standard deviation of mass is 840 grams. Assume that the 330 penguins in our dataset are a random sample from the population of all penguins in Antarctica. Our sample gives us one estimate of the population mean.

To better estimate the population mean, we bootstrapped our sample and plotted a histogram of the resample means, then took the middle 68 percent of those values to get a confidence interval. Which option below shows the histogram of the resample means and the confidence interval we found?

Option 1

Option 2

Option 3

Option 4

**Answer:** Option 2

Recall, according to the Central Limit Theorem (CLT), the probability distribution of the sum or mean of a large random sample drawn with replacement will be roughly normal, regardless of the distribution of the population from which the sample is drawn.

Thus, our graph should have a normal distribution. We eliminate
**Option 4**.

Recall that the standard normal curve has inflection points at z = +-1, which is 68% proportion of a normal
distribution.(inflection point is where a curve goes from “opening down”
to “opening up”) Since we have a confidence intervel of 68% in this
question, by looking at the inflection point, we can eliminate
**Option 3**

To compute the SD of the sample mean’s distribution, when we don’t know the population’s SD, we can use the sample’s SD (840): \text{SD of Distribution of Possible Sample Means} \approx \frac{\text{Sample SD}}{\sqrt{\text{sample size}}} = \frac{840}{\sqrt{330}} \approx 46.24

Recall: proportion with z SDs of the mean

Percent in Range | All Distributions (via Chebyshev’s Inequality) | Normal Distributions |
---|---|---|

\text{average} \pm 1 \ \text{SD} | \geq 0\% | \approx 68\% |

\text{average} \pm 2\text{SDs} | \geq 75\% | \approx 95\% |

\text{average} \pm 3\text{SDs} | \geq 88\% | \approx 99.73\% |

In this question, we want 68% confidence interval, given that the
distribution of sample mean is roughly normal, our CI should have range
\text{sample mean} \pm 1 \ \text{SD}.
Thus, the interval is approximately [4200-46.24 = 4153.76, 4200+46.24=4246.24].
We compare the 68% CI in Option 1, 2 and we choose **Option
2** since it has a 68% CI with approximately the same
interval.

The average score on this problem was 66%.

Suppose `boot_means`

is an array of the resampled means.
Fill in the blanks below so that `[left, right]`

is a 68%
confidence interval for the true mean mass of penguins.

```
= np.percentile(boot_means, __(a)__)
left = np.percentile(boot_means, __(b)__)
right [left, right]
```

What goes in blank (a)? What goes in blank (b)?

**Answer:** (a) 16 (b) 84

Recall, `np.percentile(array, p)`

computes the
`p`

th percentile of the numbers in `array`

. To
compute the 68% CI, we need to know the percentile of left tail and
right tail.

left percentile = (1-0.68)/2 = (0.32)/2 = 0.16 so we have 16th percentile

right percentile = 1-((1-0.68)/2) = 1-((0.32)/2) = 1-0.16 = 0.84 so we have 84th percentile

The average score on this problem was 94%.

Which of the following is a correct interpretation of this confidence interval? Select all that apply.

There is an approximately 68% chance that mean weight of all penguins in Antarctica falls within the bounds of this confidence interval.

Approximately 68% of penguin weights in our sample fall within the bounds of this confidence interval.

Approximately 68% of penguin weights in the population fall within the bounds of this interval.

If we created many confidence intervals using the same method, approximately 68% of them would contain the mean weight of all penguins in Antarctica.

None of the above

**Answer:** Option 4 (If we created many confidence
intervals using the same method, approximately 68% of them would contain
the mean weight of all penguins in Antarctica.)

Recall, what a k% confidence level states is that approximately k% of the time, the intervals you create through this process will contain the true population parameter.

In this question, our population parameter is the mean weight of all penguins in Antarctica. So 86% of the time, the intervals you create through this process will contain the mean weight of all penguins in Antarctica. This is the same as Option 4. However, it will be false if we state it in the reverse order (Option 1) since our population parameter is already fixed.

The average score on this problem was 81%.

Now let’s study the relationship between a penguin’s bill length (in millimeters) and mass (in grams). Suppose we’re given that

- bill length and body mass have a correlation coefficient of 0.55
- the average bill length is 44 mm and the standard deviation of bill lengths is 6 mm
- as before, the average body mass is 4200 grams and the standard deviation of body mass is 840 grams

Which of the four scatter plots below describe the relationship between bill length and body mass, based on the information provided in the question?

Option 1

Option 2

Option 3

Option 4

**Answer** Option 3

Given the correlation coefficient is 0.55, bill length and body mass has a moderate positive correlation. We eliminate Option 1 (strong correlation) and Option 4 (weak correlation).

Given the average bill length is 44 mm, we expect our x-axis to have 44 at the middle, so we eliminate Option 2

The average score on this problem was 91%.

Suppose we want to find the regression line that uses bill length, x, to predict body mass, y. The line is of the form y = mx +\ b. What are m and b?

What is m? Give your answer as a number without any units, rounded to three decimal places.

What is b? Give your answer as a number without units, rounded to three decimal places.

**Answer:** m = 77,
b = 812

m = r \cdot \frac{\text{SD of }y }{\text{SD of }x} = 0.55 \cdot \frac{840}{6} = 77 b = \text{mean of }y - m \cdot \text{mean of }x = 4200-77 \cdot 44 = 812

The average score on this problem was 92%.

What is the predicted body mass (in grams) of a penguin whose bill length is 44 mm? Give your answer as a number without any units, rounded to three decimal places.

**Answer:** 4200

y = mx\ +\ b = 77 \cdot 44 + 812 = 3388 +812 = 4200

The average score on this problem was 95%.

A particular penguin had a predicted body mass of 6800 grams. What is that penguin’s bill length (in mm)? Give your answer as a number without any units, rounded to three decimal places.

**Answer:** 77.766

In this question, we want to compute x value given y value y = mx\ +\ b y - b = mx \frac{y - b}{m} = x\ \ \text{(m is nonzero)} x = \frac{y - b}{m} = \frac{6800 - 812}{77} = \frac{5988}{77} \approx 77.766

The average score on this problem was 88%.

Below is the residual plot for our regression line.

Which of the following is a valid conclusion that we can draw solely from the residual plot above?

For this dataset, there is another line with a lower root mean squared error

The root mean squared error of the regression line is 0

The accuracy of the regression line’s predictions depends on bill length

The relationship between bill length and body mass is likely non-linear

None of the above

**Answer:** The accuracy of the regression line’s
predictions depends on bill length

The vertical spread in this residual plot is uneven, which implies that the regression line’s predictions aren’t equally accurate for all inputs. This doesn’t necessarily mean that fitting a nonlinear curve would be better. It just impacts how we interpret the regression line’s predictions.

The average score on this problem was 40%.

Each individual penguin in our dataset is of a certain species (Adelie, Chinstrap, or Gentoo) and comes from a particular island in Antarctica (Biscoe, Dream, or Torgerson). There are 330 penguins in our dataset, grouped by species and island as shown below.

Suppose we pick one of these 330 penguins, uniformly at random, and name it Chester.

What is the probability that Chester comes from Dream island? Give your answer as a number between 0 and 1, rounded to three decimal places.

**Answer:** 0.373

P(Chester comes from Dream island) = # of penguins in dream island / # of all penguins in the data = \frac{55+68}{330} \approx 0.373

The average score on this problem was 94%.

If we know that Chester comes from Dream island, what is the probability that Chester is an Adelie penguin? Give your answer as a number between 0 and 1, rounded to three decimal places.

**Answer:** 0.447

P(Chester is an Adelie penguin given that Chester comes from Dream island) = # of Adelie penguins from Dream island / # of penguins from Dream island = \frac{55}{55+68} \approx 0.447

The average score on this problem was 91%.

If we know that Chester is not from Dream island, what is the probability that Chester is not an Adelie penguin? Give your answer as a number between 0 and 1, rounded to three decimal places.

**Answer:** 0.575

Method 1

P(Chester is not an Adelie penguin given that Chester is not from Dream island) = # of penguins that are not Adelie penguins from islands other than Dream island / # of penguins in island other than Dream island = \frac{119\ \text{(eliminate all penguins that are Adelie or from Dream island, only Gentoo penguins from Biscoe are left)}}{44+44+119} \approx 0.575

Method 2

P(Chester is not an Adelie penguin given that Chester is not from Dream island) = 1- (# of penguins that are Adelie penguins from islands other than Dream island / # of penguins in island other than Dream island) = 1-\frac{44+44}{44+44+119} \approx 0.575

The average score on this problem was 85%.

We’re now interested in investigating the differences between the masses of Adelie penguins and Chinstrap penguins. Specifically, our null hypothesis is that their masses are drawn from the same population distribution, and any observed differences are due to chance only.

Below, we have a snippet of working code for this hypothesis test,
for a specific test statistic. Assume that `adelie_chinstrap`

is a DataFrame of only Adelie and Chinstrap penguins, with just two
columns – `'species'`

and `'mass'`

.

```
= np.array([])
stats = 500
num_reps for i in np.arange(num_reps):
# --- line (a) starts ---
= np.random.permutation(adelie_chinstrap.get('species'))
shuffled # --- line (a) ends ---
# --- line (b) starts ---
= adelie_chinstrap.assign(species=shuffled)
with_shuffled # --- line (b) ends ---
= with_shuffled.groupby('species').mean()
grouped
# --- line (c) starts ---
= grouped.get('mass').iloc[0] - grouped.get('mass').iloc[1]
stat # --- line (c) ends ---
= np.append(stats, stat) stats
```

Which of the following statements best describe the procedure above?

This is a standard hypothesis test, and our test statistic is the total variation distance between the distribution of Adelie masses and Chinstrap masses

This is a standard hypothesis test, and our test statistic is the difference between the expected proportion of Adelie penguins and the proportion of Adelie penguins in our resample

This is a permutation test, and our test statistic is the total variation distance between the distribution of Adelie masses and Chinstrap masses

This is a permutation test, and our test statistic is the difference in the mean Adelie mass and mean Chinstrap mass

**Answer:** This is a permutation test, and our test
statistic is the difference in the mean Adelie mass and mean Chinstrap
mass (Option 4)

Recall, a permutation test helps us decide whether two random samples
come from the same distribution. This test matches our goal of testing
whether the masses of Adelie penguins and Chinstrap penguins are drawn
from the same population distribution. The code above are also doing
steps of a permutation test. In part (a), it shuffles
`'species'`

and stores the shuffled series to
`shuffled`

. In part (b), it assign the shuffled series of
values to `'species'`

column. Then, it uses
`grouped = with_shuffled.groupby('species').mean()`

to
calculate the mean of each species. In part (c), it computes the
difference between mean mass of the two species by first getting the
`'mass'`

column and then accessing mean mass of each group
(Adelie and Chinstrap) with positional index `0`

and
`1`

.

The average score on this problem was 98%.

Currently, line (c) (marked with a comment) uses .iloc. Which of the following options compute the exact same statistic as line (c) currently does?

Option 1:

`= grouped.get('mass').loc['Adelie'] - grouped.get('mass').loc['Chinstrap'] stat `

Option 2:

`= grouped.get('mass').loc['Chinstrap'] - grouped.get('mass').loc['Adelie'] stat `

Option 1 only

Option 2 only

Both options

Neither option

**Answer:** Option 1 only

We use `df.get(column_name).iloc[positional_index]`

to
access the value in a column with `positional_index`

.
Similarly, we use `df.get(column_name).loc[index]`

to access
value in a column with its `index`

. Remember
`grouped`

is a DataFrame that
`groupby('species')`

, so we have species name
`'Adelie'`

and `'Chinstrap'`

as index for
`grouped`

.

Option 2 is incorrect since it does subtraction in the reverse order
which results in a different `stat`

compared to
`line(c)`

. Its output will be -1
\cdot `stat`

. Recall, in
`grouped = with_shuffled.groupby('species').mean()`

, we use
`groupby()`

and since `'species'`

is a column with
string values, our index will be sorted in alphabetical order. So,
`.iloc[0]`

is `'Adelie'`

and `.iloc[1]`

is `'Chinstrap'`

.

The average score on this problem was 81%.

Is it possible to re-write `line (c)`

in a way that uses
`.iloc[0]`

twice, without any other uses of `.loc`

or `.iloc`

?

Yes, it’s possible

No, it’s not possible

**Answer:** Yes, it’s possible

There are multiple ways to achieve this. For instance
`stat = grouped.get('mass').iloc[0] - grouped.sort_index(ascending = False).get('mass').iloc[0]`

.

The average score on this problem was 64%.

For your convenience, we copy the code for the hypothesis test below.

```
= np.array([])
stats = 500
num_reps for i in np.arange(num_reps):
# --- line (a) starts ---
= np.random.permutation(adelie_chinstrap.get('species'))
shuffled # --- line (a) ends ---
# --- line (b) starts ---
= adelie_chinstrap.assign(species=shuffled)
with_shuffled # --- line (b) ends ---
= with_shuffled.groupby('species').mean()
grouped
# --- line (c) starts ---
= grouped.get('mass').iloc[0] - grouped.get('mass').iloc[1]
stat # --- line (c) ends ---
= np.append(stats, stat) stats
```

What would happen if we removed `line (a)`

, and replaced
`line (b)`

with

`= adelie_chinstrap.sample(adelie_chinstrap.shape[0], replace=False) with_shuffled `

Select the best answer.

This would still run a valid hypothesis test

This would not run a valid hypothesis test, as all values in the

`stats`

array would be exactly the sameThis would not run a valid hypothesis test, even though there would be several different values in the

`stats`

arrayThis would not run a valid hypothesis test, as it would incorporate information about Gentoo penguins

**Answer:** This would not run a valid hypothesis test,
as all values in the `stats`

array would be exactly the same
(Option 2)

Recall, `DataFrame.sample(n, replace = False)`

(or
`DataFrame.sample(n)`

since `replace = False`

is
by default) returns a DataFrame by randomly sampling `n`

rows
from the DataFrame, without replacement. Since our `n`

is
`adelie_chinstrap.shape[0]`

, and we are sampling without
replacement, we will get the exactly same Dataframe (though the order of
rows may be different but the `stats`

array would be exactly
the same).

The average score on this problem was 87%.

For your convenience, we copy the code for the hypothesis test below.

```
= np.array([])
stats = 500
num_reps for i in np.arange(num_reps):
# --- line (a) starts ---
= np.random.permutation(adelie_chinstrap.get('species'))
shuffled # --- line (a) ends ---
# --- line (b) starts ---
= adelie_chinstrap.assign(species=shuffled)
with_shuffled # --- line (b) ends ---
= with_shuffled.groupby('species').mean()
grouped
# --- line (c) starts ---
= grouped.get('mass').iloc[0] - grouped.get('mass').iloc[1]
stat # --- line (c) ends ---
= np.append(stats, stat) stats
```

What would happen if we removed `line (a)`

, and replaced
`line (b)`

with

`= adelie_chinstrap.sample(adelie_chinstrap.shape[0], replace=True) with_shuffled `

Select the best answer.

This would still run a valid hypothesis test

This would not run a valid hypothesis test, as all values in the

`stats`

array would be exactly the sameThis would not run a valid hypothesis test, even though there would be several different values in the

`stats`

arrayThis would not run a valid hypothesis test, as it would incorporate information about Gentoo penguins

**Answer:** This would not run a valid hypothesis test,
even though there would be several different values in the
`stats`

array (Option 3)

Recall, `DataFrame.sample(n, replace = True)`

returns a
new DataFrame by randomly sampling `n`

rows from the
DataFrame, with replacement. Since we are sampling with replacement, we
will have a DataFrame which produces a `stats`

array with
some different values. However, recall, the key idea behind a
permutation test is to shuffle the group labels. So, the above code does
not meet this key requirement since we only want to shuffle the
`"species"`

column without changing the size of the two
species. However, the code may change the size of the two species.

The average score on this problem was 66%.

For your convenience, we copy the code for the hypothesis test below.

```
= np.array([])
stats = 500
num_reps for i in np.arange(num_reps):
# --- line (a) starts ---
= np.random.permutation(adelie_chinstrap.get('species'))
shuffled # --- line (a) ends ---
# --- line (b) starts ---
= adelie_chinstrap.assign(species=shuffled)
with_shuffled # --- line (b) ends ---
= with_shuffled.groupby('species').mean()
grouped
# --- line (c) starts ---
= grouped.get('mass').iloc[0] - grouped.get('mass').iloc[1]
stat # --- line (c) ends ---
= np.append(stats, stat) stats
```

What would happen if we replaced `line (a)`

with

```
= adelie_chinstrap.assign(
with_shuffled =np.random.permutation(adelie_chinstrap.get('species')
species )
```

and replaced line (b) with

```
= with_shuffled.assign(
with_shuffled =np.random.permutation(adelie_chinstrap.get('mass')
mass )
```

Select the best answer.

This would still run a valid hypothesis test

This would not run a valid hypothesis test, as all values in the

`stats`

array would be exactly the sameThis would not run a valid hypothesis test, even though there would be several different values in the

`stats`

arrayThis would not run a valid hypothesis test, as it would incorporate information about Gentoo penguins

**Answer:** This would still run a valid hypothesis test
(Option 1)

Our goal for the permutation test is to randomly assign birth weights
to groups, without changing group sizes. The above code shuffles
`'species'`

and `'mass'`

columns and assigns them
back to the DataFrame. This fulfills our goal.

The average score on this problem was 81%.

Suppose we run the code for the hypothesis test and see the following empirical distribution for the test statistic. In red is the observed statistic.

Suppose our alternative hypothesis is that Chinstrap penguins weigh more on average than Adelie penguins. Which of the following is closest to the p-value for our hypothesis test?

0

\frac{1}{4}

\frac{1}{3}

\frac{2}{3}

\frac{3}{4}

1

**Answer:** \frac{1}{3}

Recall, the p-value is the chance, under the null hypothesis, that the test statistic is equal to the value that was observed in the data or is even further in the direction of the alternative. Thus, we compute the proportion of the test statistic that is equal or less than the observed statistic. (It is less than because less than corresponds to the alternative hypothesis “Chinstrap penguins weigh more on average than Adelie penguins”. Recall, when computing the statistic, we use Adelie’s mean mass minus Chinstrap’s mean mass. If Chinstrap’s mean mass is larger, the statistic will be negative, the direction of less than the observed statistic).

Thus, we look at the proportion of area less than or on the red line (which represents observed statistic), it is around \frac{1}{3}.

The average score on this problem was 80%.

At the San Diego Model Railroad Museum, there are different admission prices for children, adults, and seniors. Over a period of time, as tickets are sold, employees keep track of how many of each type of ticket are sold. These ticket counts (in the order child, adult, senior) are stored as follows.

`= np.array([550, 1550, 400]) admissions_data `

Complete the code below so that it creates an array
`admissions_proportions`

with the proportions of tickets sold
to each group (in the order child, adult, senior).

```
def as_proportion(data):
return __(a)__
= as_proportion(admissions_data) admissions_proportions
```

What goes in blank (a)?

**Answer:** `data/data.sum()`

To calculate proportion for each group, we divide each value in the array (tickets sold to each group) by the sum of all values (total tickets sold). Remember values in an array can be processed as a whole.

The average score on this problem was 95%.

The museum employees have a model in mind for the proportions in which they sell tickets to children, adults, and seniors. This model is stored as follows.

`= np.array([0.25, 0.6, 0.15]) model `

We want to conduct a hypothesis test to determine whether the admissions data we have is consistent with this model. Which of the following is the null hypothesis for this test?

Child, adult, and senior tickets might plausibly be purchased in proportions 0.25, 0.6, and 0.15.

Child, adult, and senior tickets are purchased in proportions 0.25, 0.6, and 0.15.

Child, adult, and senior tickets might plausibly be purchased in proportions other than 0.25, 0.6, and 0.15.

Child, adult, and senior tickets, are purchased in proportions other than 0.25, 0.6, and 0.15.

**Answer:** Child, adult, and senior tickets are
purchased in proportions 0.25, 0.6, and 0.15. (Option 2)

Recall, null hypothesis is the hypothesis that there is no significant difference between specified populations, any observed difference being due to sampling or experimental error. So, we assume the distribution is the same as the model.

The average score on this problem was 88%.

Which of the following test statistics could we use to test our hypotheses? Select all that could work.

sum of differences in proportions

sum of squared differences in proportions

mean of differences in proportions

mean of squared differences in proportions

none of the above

**Answer:** sum of squared differences in proportions,
mean of squared differences in proportions (Option 2, 4)

We need to use squared difference to avoid the case that large positive and negative difference cancel out in the process of calculating sum or mean, resulting in small sum of difference or mean of difference that does not reflect the actual deviation. So, we eliminate Option 1 and 3.

The average score on this problem was 77%.

Below, we’ll perform the hypothesis test with a different test statistic, the mean of the absolute differences in proportions.

Recall that the ticket counts we observed for children, adults, and
seniors are stored in the array
`admissions_data = np.array([550, 1550, 400])`

, and that our
model is `model = np.array([0.25, 0.6, 0.15])`

.

For our hypothesis test to determine whether the admissions data is
consistent with our model, what is the observed value of the test
statistic? Input your answer as a decimal between 0 and 1. Round to
three decimal places. (Suppose that the value you calculated is assigned
to the variable `observed_stat`

, which you will use in later
questions.)

**Answer:** 0.02

We first calculate the proportion for each value in
`admissions_data`

\frac{550}{550+1550+400} = 0.22 \frac{1550}{550+1550+400} = 0.62 \frac{400}{550+1550+400} = 0.16 So, we have
the distribution of the `admissions_data`

Then, we calculate the observed value of the test statistic (the mean of the absolute differences in proportions) \frac{|0.22-0.25|+|0.62-0.6|+|0.16-0.15|}{number\ of\ goups} =\frac{0.03+0.02+0.01}{3} = 0.02

The average score on this problem was 82%.

Now, we want to simulate the test statistic 10,000 times under the
assumptions of the null hypothesis. Fill in the blanks below to complete
this simulation and calculate the p-value for our hypothesis test.
Assume that the variables `admissions_data`

,
`admissions_proportions`

, `model`

, and
`observed_stat`

are already defined as specified earlier in
the question.

```
= np.array([])
simulated_stats for i in np.arange(10000):
= as_proportions(np.random.multinomial(__(a)__, __(b)__))
simulated_proportions = __(c)__
simulated_stat = np.append(simulated_stats, simulated_stat)
simulated_stats
= __(d)__ p_value
```

What goes in blank (a)? What goes in blank (b)? What goes in blank (c)? What goes in blank (d)?

**Answer:** (a) `admissions_data.sum()`

(b)
`model`

(c)
`np.abs(simulated_proportions - model).mean()`

(d)
`np.count_nonzero(simulated_stats >= observed_stat) / 10000`

Recall, in `np.random.multinomial(n, [p_1, ..., p_k])`

,
`n`

is the number of experiments, and
`[p_1, ..., p_k]`

is a sequence of probability. The method
returns an array of length k in which each element contains the number
of occurrences of an event, where the probability of the ith event is
`p_i`

.

We want our `simulated_proportion`

to have the same data
size as `admissions_data`

, so we use
`admissions_data.sum()`

in (a).

Since our null hypothesis is based on `model`

, we simulate
based on distribution in `model`

, so we have
`model`

in (b).

In (c), we compute the mean of the absolute differences in
proportions. `np.abs(simulated_proportions - model)`

gives us
a series of absolute differences, and `.mean()`

computes the
mean of the absolute differences.

In (d), we calculate the `p_value`

. Recall, the
`p_value`

is the chance, under the null hypothesis, that the
test statistic is equal to the value that was observed in the data or is
even further in the direction of the alternative.
`np.count_nonzero(simulated_stats >= observed_stat)`

gives
us the number of `simulated_stats`

greater than or equal to
the `observed_stat`

in the 10000 times simulations, so we
need to divide it by `10000`

to compute the proportion of
`simulated_stats`

greater than or equal to the
`observed_stat`

, and this gives us the
`p_value`

.

The average score on this problem was 79%.

True or False: the p-value represents the probability that the null hypothesis is true.

True

False

**Answer:** False

Recall, the p-value is the chance, under the null hypothesis, that the test statistic is equal to the value that was observed in the data or is even further in the direction of the alternative. It only gives us the strength of evidence in favor of the null hypothesis, which is different from “the probability that the null hypothesis is true”.

The average score on this problem was 64%.

The new statistic that we used for this hypothesis test, the mean of
the absolute differences in proportions, is in fact closely related to
the total variation distance. Given two arrays of length three,
`array_1`

and `array_2`

, suppose we compute the
mean of the absolute differences in proportions between
`array_1`

and `array_2`

and store the result as
`madp`

. What value would we have to multiply
`madp`

by to obtain the total variation distance
`array_1`

and `array_2`

? Input your answer below,
rounding to three decimal places.

**Answer:** 1.5

Recall, the total variation distance (TVD) is the sum of the absolute differences in proportions, divided by 2. When we compute the mean of the absolute differences in proportions, we are computing the sum of the absolute differences in proportions, divided by the number of groups (which is 3). Thus, to get TVD, we first multiply our current statistics (the mean of the absolute differences in proportions) by 3, we get the sum of the absolute differences in proportions. Then according to the definition of TVD, we divide this value by 2. Thus, we have \text{current statistics}\cdot 3 / 2 = \text{current statistics}\cdot 1.5.

The average score on this problem was 65%.