
Welcome! The problems shown below should be worked on **on
paper**, since the quizzes and exams you take in this course will
also be on paper. You do not need to submit your solutions anywhere.

We encourage you to complete this worksheet in groups during an
extra practice session on Friday, February 9th. Solutions will be posted
after all sessions have finished. This problem set is not designed to
take any particular amount of time - focus on understanding concepts,
not on getting through all the questions.

The questions below focus on simulation and sampling. To review for the
midterm more generally, we recommend working through past midterm exams,
especially Winter 2023 and Fall 2023.

King Triton has boarded a Southwest flight. For in-flight refreshments, Southwest serves four types of cookies – chocolate chip, gingerbread, oatmeal, and peanut butter.

The flight attendant comes to King Triton with a box containing 10 cookies:

- 4 chocolate chip
- 3 gingerbread
- 2 oatmeal, and
- 1 peanut butter

The flight attendant tells King Triton to grab 2 cookies out of the box without looking.

Fill in the blanks below to implement a simulation that estimates the probability that both of King Triton’s selected cookies are the same.

```
# 'cho' stands for chocolate chip, 'gin' stands for gingerbread,
# 'oat' stands for oatmeal, and 'pea' stands for peanut butter.
cookie_box = np.array(['cho', 'cho', 'cho', 'cho', 'gin',
                       'gin', 'gin', 'oat', 'oat', 'pea'])
repetitions = 10000
prob_both_same = 0
for i in np.arange(repetitions):
    grab = np.random.choice(__(a)__)
    if __(b)__:
        prob_both_same = prob_both_same + 1
prob_both_same = __(c)__
```

What goes in blank (a)?

`cookie_box, repetitions, replace=False`

`cookie_box, 2, replace=True`

`cookie_box, 2, replace=False`

`cookie_box, 2`

**Answer:**
`cookie_box, 2, replace=False`

We are told that King Triton grabs two cookies out of the box without
looking. Since this is a random choice, we use the function
`np.random.choice` to simulate this. The first input to this
function is a sequence of values to choose from. We already have an
array of values to choose from in the variable `cookie_box`.
Calling `np.random.choice(cookie_box)` would select one
cookie from the cookie box, but we want to select two, so we use an
optional second parameter to specify the number of items to randomly
select. Finally, we should consider whether we want to select with or
without replacement. Since `cookie_box` contains individual
cookies and King Triton is selecting two of them, he cannot choose the
same exact cookie twice. This means we should sample without
replacement, by specifying `replace=False`. Note that
omitting the `replace` parameter would use the default option
of sampling with replacement.
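To see the difference between these calls, here is a small sketch that draws from the same `cookie_box` in a few ways. The seed and the names `without_replacement`, `with_replacement`, `single`, and `all_ten` are illustrative choices, not part of the original problem.

```python
import numpy as np

np.random.seed(10)  # arbitrary seed, only so the sketch is repeatable

cookie_box = np.array(['cho', 'cho', 'cho', 'cho', 'gin',
                       'gin', 'gin', 'oat', 'oat', 'pea'])

# Two different cookies: no position in the box can be picked twice.
without_replacement = np.random.choice(cookie_box, 2, replace=False)

# Default behavior (replace=True): the same cookie could be picked twice.
with_replacement = np.random.choice(cookie_box, 2)

# With no size argument, a single flavor is returned instead of an array.
single = np.random.choice(cookie_box)

# Drawing all 10 without replacement just shuffles the box.
all_ten = np.random.choice(cookie_box, 10, replace=False)

print(without_replacement, with_replacement, single)
print(sorted(all_ten) == sorted(cookie_box))
```

Because drawing all 10 cookies without replacement uses each cookie exactly once, the sorted draw matches the sorted box, which would not be guaranteed with `replace=True`.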

The average score on this problem was 92%.

What goes in blank (b)?

**Answer:** `grab[0] == grab[1]`

The idea of a simulation is to do some random process many times. We
can use the results to approximate a probability by counting up the
number of times some event occurred, and dividing that by the number of
times we did the random process. Here, the random process is selecting
two cookies from the cookie box, and we are doing this 10,000 times. The
approximate probability will be the number of times in which both
cookies are the same divided by 10,000. So we need to count up the
number of times that both randomly selected cookies are the same. We do
this by having an accumulator variable that starts out at 0 and gets
incremented, or increased by 1, every time both cookies are the same.
The code has such a variable, called `prob_both_same`, that
is initialized to 0 and gets incremented when some condition is met.

We need to fill in the condition, which is that both randomly
selected cookies are the same. We’ve already randomly selected the
cookies and stored the results in `grab`, which is an array
of length 2 that comes from the output of a call to
`np.random.choice`. To check if both elements of the
`grab` array are the same, we access the individual elements
using brackets with the position number, and compare using the
`==` symbol to check equality. Note that at the end of the
`for` loop, the variable `prob_both_same` will
contain a count of the number of trials out of 10,000 in which both of
King Triton’s cookies were the same flavor.

The average score on this problem was 79%.

What goes in blank (c)?

`prob_both_same / repetitions`

`prob_both_same / 2`

`np.mean(prob_both_same)`

`prob_both_same.mean()`

**Answer:**
`prob_both_same / repetitions`

After the `for` loop, `prob_both_same` contains
the number of trials out of 10,000 in which both of King Triton’s
cookies were the same flavor. We’d like it to represent the approximate
probability of both cookies being the same flavor, so we need to divide
the current value by the total number of trials, 10,000. Since this
value is stored in the variable `repetitions`, we can divide
`prob_both_same` by `repetitions`.
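Putting the three blanks together, the simulation can be sketched as follows. The seed and the separate counter name `count_both_same` are our own additions for readability (the worksheet reuses `prob_both_same` for both the count and the final probability), and the exact value in the last lines is a hand calculation added for comparison.

```python
import numpy as np

np.random.seed(10)  # arbitrary seed for repeatability

cookie_box = np.array(['cho', 'cho', 'cho', 'cho', 'gin',
                       'gin', 'gin', 'oat', 'oat', 'pea'])
repetitions = 10000
count_both_same = 0  # hypothetical name for the accumulator

for i in np.arange(repetitions):
    # Blank (a): two different cookies from the box.
    grab = np.random.choice(cookie_box, 2, replace=False)
    # Blank (b): did both cookies have the same flavor?
    if grab[0] == grab[1]:
        count_both_same = count_both_same + 1

# Blank (c): turn the count into a proportion of trials.
prob_both_same = count_both_same / repetitions

# Exact value for comparison: there are 45 possible pairs of cookies,
# of which 6 (cho) + 3 (gin) + 1 (oat) + 0 (pea) have matching flavors.
exact = (6 + 3 + 1 + 0) / 45
print(prob_both_same, exact)
```

The simulated value should land near the exact value of 10/45, though it will differ slightly from run to run if the seed is changed.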

The average score on this problem was 93%.

The fine print of the Sun God festival website says “Ticket does not
guarantee entry. Venue subject to capacity restrictions.” RIMAC field,
where the 2022 festival will be held, has a capacity of 20,000 people.
Let’s say that UCSD distributes 21,000 tickets to Sun God 2022 because
prior data shows that 5% of tickets distributed are never actually
redeemed. Let’s suppose that each person with a ticket this year has a
5% chance of not attending (independently of all others). What is the
probability that at least one student who has a ticket cannot get in due
to the capacity restriction? Fill in the blanks in the code below so
that `prob_angry_student` evaluates to an approximation of
this probability.

```
num_angry = 0

for rep in np.arange(10000):
    # randomly choose 21000 elements from [True, False] such that
    # True has probability 0.95, False has probability 0.05
    attending = np.random.choice([True, False], 21000, p=[0.95, 0.05])
    if __(a)__:
        __(b)__
prob_angry_student = __(c)__
```

What goes in the **first** blank?

`np.count_nonzero(attending) == 20001`

`attending[20000] == False`

`attending.sum() > 20000`

`np.count_nonzero(attending) > num_angry`

**Answer:** `attending.sum() > 20000`

Let’s look at the variable `attending`. Since we’re
choosing 21,000 elements from the list `[True, False]` and
there are 21,000 tickets distributed, this code is randomly determining
whether each ticket holder will actually attend the festival. There’s a
95% chance of each ticket holder attending, which is reflected in the
`p=[0.95, 0.05]` argument. Remember that
`np.random.choice` returns an array of random choices, which
in this case means it will contain 21,000 elements, each of which is
`True` or `False`.

We want to figure out the probability of at least one ticket holder
showing up and not being admitted. Another way to say this is we want to
find the probability that more than 20,000 ticket holders show up to
attend the festival. The way we approximate a probability through
simulation is we repeat a process many times and see how often some
event occurred. The event we’re interested in here is that more than
20,000 ticket holders came to Sun God. Since we have an array of
`True` and `False` values corresponding to whether
each ticket holder actually came, we just need to determine if there are
more than 20,000 `True` values in the `attending` array.

There are several ways to count the number of `True`
values in a Boolean array. One way is to sum the array, since in Python
`True` counts as 1 and `False` counts as 0.
Therefore, `attending.sum() > 20000` is the condition we
need to check here.
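As a small illustration of counting `True` values, here is a sketch on a made-up array of five ticket holders (not the 21,000-element array from the problem), with a threshold of 2 standing in for 20,000.

```python
import numpy as np

# Tiny made-up example: three of five ticket holders attend.
attending = np.array([True, False, True, True, False])

num_true_by_sum = attending.sum()                 # True counts as 1, False as 0
num_true_by_count = np.count_nonzero(attending)   # counts the True entries

print(num_true_by_sum)       # 3
print(num_true_by_count)     # 3
print(num_true_by_sum > 2)   # True
```

Both approaches give the same count, so either could be used to build the condition.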

The average score on this problem was 67%.

What goes in the **second** blank?

**Answer:** `num_angry = num_angry + 1`

Remember our goal in simulation is to repeat a process many times to
see how often some event occurs. The repetition comes from the
`for` loop, which runs 10,000 times. Each time, we are
simulating the process of 21,000 students each randomly deciding whether
to show up to Sun God or not. We want to know, out of these 10,000
trials, how frequently more than 20,000 of the students will show up. So
when this happens, we want to record that it happened. The standard way
to do that is to keep a counter variable that starts at 0 and gets
incremented, or increased by one, each time we had more than 20,000
attendees in our simulation.

The framework to do this is already set up because a variable called
`num_angry` is initialized to 0 before the `for`
loop. This variable is our counter variable, meant to count the number
of trials, out of 10,000, that resulted in at least one student being
angry because they showed up to Sun God with a ticket and were denied
entrance. So all we need to do when there are more than 20,000
`True` values in the `attending` array is
increment this counter by one via the code
`num_angry = num_angry + 1`, sometimes abbreviated as
`num_angry += 1`.

The average score on this problem was 59%.

What goes in the **third** blank?

**Answer:** `num_angry / 10000`

To calculate the approximate probability, all we need to do is divide the number of trials in which a student was angry by the total number of trials, which is 10,000.
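With all three blanks filled in, the simulation can be sketched as below. Two things here are our own assumptions rather than part of the problem: the seed, and the `repetitions` variable, which we set to 1,000 instead of 10,000 so that the sketch runs quickly.

```python
import numpy as np

np.random.seed(2022)  # arbitrary seed for repeatability

repetitions = 1000  # the worksheet uses 10000; reduced here for speed
num_angry = 0

for rep in np.arange(repetitions):
    # Each of the 21000 ticket holders attends with probability 0.95.
    attending = np.random.choice([True, False], 21000, p=[0.95, 0.05])
    # First blank: did more people show up than the venue can hold?
    if attending.sum() > 20000:
        # Second blank: record that at least one student was turned away.
        num_angry = num_angry + 1

# Third blank: proportion of trials with at least one angry student.
prob_angry_student = num_angry / repetitions
print(prob_angry_student)
```

On average 21,000 × 0.95 = 19,950 ticket holders attend, which is below capacity, so the event is fairly uncommon but not negligible.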

The average score on this problem was 68%.

Billina Records, a new record company focused on creating new TikTok audios, has its offices on the 23rd floor of a skyscraper with 75 floors (numbered 1 through 75). The owners of the building promised that 10 different random floors will be selected to be renovated.

Below, fill in the blanks to complete a simulation that will estimate the probability that Billina Records’ floor will be renovated.

```
total = 0
repetitions = 10000
for i in np.arange(repetitions):
    choices = np.random.choice(__(a)__, 10, __(b)__)
    if __(c)__:
        total = total + 1
prob_renovate = total / repetitions
```

What goes in blank (a)?

`np.arange(1, 75)`

`np.arange(10, 75)`

`np.arange(0, 76)`

`np.arange(1, 76)`

What goes in blank (b)?

`replace=True`

`replace=False`

What goes in blank (c)?

`choices == 23`

`choices is 23`

`np.count_nonzero(choices == 23) > 0`

`np.count_nonzero(choices) == 23`

`choices.str.contains(23)`

**Answer:** `np.arange(1, 76)`, `replace=False`,
`np.count_nonzero(choices == 23) > 0`

Here, the idea is to randomly choose 10 **different**
floors repeatedly, and each time, check if floor 23 was selected.

Blank (a): The first argument to `np.random.choice` needs
to be an array/list containing the options we want to choose from,
i.e. an array/list containing the values 1, 2, 3, 4, …, 75, since those
are the numbers of the floors. `np.arange(a, b)` returns an
array of integers spaced out by 1 starting from `a` and
ending at `b-1`. As such, the correct call to
`np.arange` is `np.arange(1, 76)`.
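A quick way to check which endpoints each `np.arange` call includes is to look at its first and last elements. The variable name `floors` is our own.

```python
import numpy as np

floors = np.arange(1, 76)
print(floors[0], floors[-1], len(floors))  # 1 75 75

# The other options miss floor 75 or include values that are not floors.
print(np.arange(1, 75)[-1])   # 74, so floor 75 is left out
print(np.arange(0, 76)[0])    # 0, which is not a floor
print(np.arange(10, 75)[0])   # 10, so floors 1 through 9 are left out
```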

Blank (b): Since we want to select 10 different floors, we need to
specify `replace=False` (the default behavior is
`replace=True`).

Blank (c): The `if` condition needs to check if 23 was one
of the 10 numbers that were selected, i.e. if 23 is in
`choices`. It needs to evaluate to a single Boolean value,
i.e. `True` (if 23 was selected) or `False` (if 23
was not selected). Let’s go through each incorrect option to see why
it’s wrong:

- Option 1, `choices == 23`, does not evaluate to a single Boolean value; rather, it evaluates to an array of length 10, containing multiple `True`s and `False`s.
- Option 2, `choices is 23`, does not evaluate to what we want: it checks to see if the array `choices` is the same Python object as the number 23, which it is not (and will never be, since an array cannot be a single number).
- Option 4, `np.count_nonzero(choices) == 23`, does evaluate to a single Boolean, however it is not quite correct. `np.count_nonzero(choices)` will always evaluate to 10, since `choices` is made up of 10 integers randomly selected from 1, 2, 3, 4, …, 75, none of which are 0. As such, `np.count_nonzero(choices) == 23` is the same as `10 == 23`, which is always `False`, regardless of whether or not 23 is in `choices`.
- Option 5, `choices.str.contains(23)`, errors, since `choices` is not a Series (and `.str` can only follow a Series). If `choices` were a Series, this would still error, since the argument to `.str.contains` must be a string, not an int.

By process of elimination, Option 3,
`np.count_nonzero(choices == 23) > 0`, must be the correct
answer. Let’s look at it piece-by-piece:

- As we saw in Option 1, `choices == 23` is a Boolean array that contains `True` each time the selected floor was floor 23 and `False` otherwise. (Since we’re sampling without replacement, floor 23 can only be selected at most once, and so `choices == 23` can only contain the value `True` at most once.)
- `np.count_nonzero(choices == 23)` evaluates to the number of `True`s in `choices == 23`. If it is positive (i.e. 1), it means that floor 23 was selected. If it is 0, it means floor 23 was not selected.
- Thus, `np.count_nonzero(choices == 23) > 0` evaluates to `True` if (and only if) floor 23 was selected.
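To make the comparison of the options concrete, here is a sketch that evaluates each one on a single hand-picked `choices` array. The array values and the variable `floor` are made up for illustration; we store 23 in `floor` so that the `is` comparison runs without a syntax warning.

```python
import numpy as np

# A made-up draw of 10 different floors that happens to include 23.
choices = np.array([5, 23, 61, 12, 40, 74, 3, 18, 55, 30])
floor = 23

option_1 = choices == floor                         # array of 10 Booleans
option_2 = choices is floor                         # identity check, not membership
option_3 = np.count_nonzero(choices == floor) > 0   # single Boolean
option_4 = np.count_nonzero(choices) == floor       # compares 10 to 23

print(option_1)
print(option_2)  # False
print(option_3)  # True
print(option_4)  # False

# Option 5 fails because a NumPy array has no .str attribute.
try:
    choices.str.contains(floor)
except AttributeError as err:
    print("Option 5 raises:", err)
```

Only Option 3 produces a single Boolean that is `True` exactly when floor 23 appears in `choices`.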

The average score on this problem was 75%.

In the previous subpart of this question, your answer to blank (c)
contained the number 23, and the simulated probability was stored in the
variable `prob_renovate`.

Suppose, in blank (c), we change the number 23 to the number 46, and
we store the new simulated probability in the variable name
`other_prob`. (`prob_renovate` is unchanged from
the previous part.)

With these changes, which of the following is the most accurate
representation of the relationship between `other_prob` and
`prob_renovate`?

`other_prob` will be roughly half of `prob_renovate`

`other_prob` will be roughly equal to `prob_renovate`

`other_prob` will be roughly double `prob_renovate`

**Answer:** `other_prob` will be roughly
equal to `prob_renovate`

The calculation we did in the previous subpart was not specific to
the number 23. That is, we could have replaced 23 with any integer
between 1 and 75 inclusive and the simulation would have been just as
valid. The probability we estimated is the probability that **any
one floor** was randomly selected; there is nothing special about
23.

(We say “roughly equal” because the result may turn out slightly different due to randomness.)
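One way to check this is to run the simulation once for each floor. The sketch below wraps the simulation in a helper function, `estimate_renovation_prob`, which is a hypothetical name of our own and not part of the original code; the seed is also arbitrary.

```python
import numpy as np

np.random.seed(10)  # arbitrary seed for repeatability

def estimate_renovation_prob(floor, repetitions=10000):
    """Estimate the chance that `floor` is among 10 different floors drawn from 1 to 75."""
    total = 0
    for i in np.arange(repetitions):
        choices = np.random.choice(np.arange(1, 76), 10, replace=False)
        if np.count_nonzero(choices == floor) > 0:
            total = total + 1
    return total / repetitions

prob_renovate = estimate_renovation_prob(23)
other_prob = estimate_renovation_prob(46)

# Every floor is equally likely to be among the 10 chosen, so both
# estimates should be near 10/75.
print(prob_renovate, other_prob, 10 / 75)
```

The two estimates will generally not match exactly, but both should be close to 10/75, or about 0.133.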

The average score on this problem was 89%.

```
results = np.array([])
for i in np.arange(10):
    result = np.random.choice(np.arange(1000), replace=False)
    results = np.append(results, result)
```

After this code executes, `results` contains:

a simple random sample of size 9, chosen from a set of size 999 with replacement

a simple random sample of size 9, chosen from a set of size 999 without replacement

a simple random sample of size 10, chosen from a set of size 1000 with replacement

a simple random sample of size 10, chosen from a set of size 1000 without replacement

**Answer:** a simple random sample of size 10, chosen
from a set of size 1000 with replacement

Let’s see what the code is doing. The first line initializes an empty
array called `results`. The `for` loop runs 10 times. Each
time, it creates a value called `result` by some process
we’ll inspect shortly and appends this value to the end of the
`results` array. At the end of the code snippet,
`results` will be an array containing 10 elements.

Now, let’s look at the process by which each element
`result` is generated. Each `result` is a random
element chosen from `np.arange(1000)`, which is the numbers
from 0 to 999, inclusive. That’s 1000 possible numbers. Each time
`np.random.choice` is called, just one value is chosen from
this set of 1000 possible numbers.

When we sample just one element from a set of values, sampling with replacement is the same as sampling without replacement, because sampling with or without replacement concerns whether subsequent draws can be the same as previous ones. When we’re just sampling one element, it really doesn’t matter whether our process involves putting that element back, as we’re not going to draw again!

Therefore, `result` is just one random number chosen from
the 1000 possible numbers. Each time the `for` loop executes,
`result` gets set to a random number chosen from the 1000
possible numbers. It is possible (though unlikely) that the random
`result` of the first execution of the loop matches the
`result` of the second execution of the loop. More generally,
there can be repeated values in the `results` array since
each entry of this array is independently drawn from the same set of
possibilities. Since repetitions are possible, this means the sample is
drawn with replacement.

Therefore, the `results` array contains a sample of size
10 chosen from a set of size 1000 with replacement. This is called a
“simple random sample” because each possible sample of 10 values is
equally likely, which comes from the fact that
`np.random.choice` chooses each possible value with equal
probability by default.
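Repeats are hard to spot when drawing from 1000 values, so the sketch below runs the same loop on a much smaller population. The 3-element `small_population` is our own substitution for `np.arange(1000)`, chosen so that repeats are guaranteed: 10 draws cannot all be different when only 3 values exist.

```python
import numpy as np

np.random.seed(10)  # arbitrary seed for repeatability

small_population = np.arange(3)  # hypothetical stand-in for np.arange(1000)

results = np.array([])
for i in np.arange(10):
    # Each call draws a single value, so replace=False has nothing to exclude.
    result = np.random.choice(small_population, replace=False)
    results = np.append(results, result)

print(results)
print(len(results), len(np.unique(results)))
```

The loop finishes without error and produces 10 values with at most 3 distinct ones, which confirms that `replace=False` inside the loop does not prevent repeats across iterations.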

The average score on this problem was 11%.

Suppose we take a uniform random sample with replacement from a population, and use the sample mean as an estimate for the population mean. Which of the following is correct?

If we take a larger sample, our sample mean will be closer to the population mean.

If we take a smaller sample, our sample mean will be closer to the population mean.

If we take a larger sample, our sample mean is more likely to be close to the population mean than if we take a smaller sample.

If we take a smaller sample, our sample mean is more likely to be close to the population mean than if we take a larger sample.

**Answer:** If we take a larger sample, our sample mean
is more likely to be close to the population mean than if we take a
smaller sample.

Larger samples tend to give better estimates of the population mean than smaller samples. That’s because large samples are more like the population than small samples. We can see this in the extreme. Imagine a sample of 1 element from a population. The sample might vary a lot, depending on the distribution of the population. On the other extreme, if we sample the whole population, our sample mean will be exactly the same as the population mean.

Notice that the correct answer choice uses the words “is more likely
to be close to” as opposed to “will be closer to.” We’re talking about a
general phenomenon here: larger samples tend to give better estimates of
the population mean than smaller samples. We cannot say that if we take
a larger sample our sample mean “will be closer to” the population mean,
since it’s always possible to get lucky with a small sample and unlucky
with a large sample. That is, one particular small sample may happen to
have a mean very close to the population mean, and one particular large
sample may happen to have a mean that’s not so close to the population
mean. This *can* happen, it’s just not likely to.
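This tendency can be checked with a simulation. In the sketch below, the population, the sample sizes of 25 and 400, the tolerance of 20 used to define “close,” and the helper name `fraction_close` are all assumptions we made for illustration.

```python
import numpy as np

np.random.seed(10)  # arbitrary seed for repeatability

population = np.arange(1000)          # made-up population: 0 through 999
population_mean = population.mean()   # 499.5
tolerance = 20                        # arbitrary definition of "close"
repetitions = 2000

def fraction_close(sample_size):
    """Fraction of simulated samples whose mean is within `tolerance` of the population mean."""
    close = 0
    for i in np.arange(repetitions):
        # Uniform random sample with replacement (the default).
        sample = np.random.choice(population, sample_size)
        if abs(sample.mean() - population_mean) <= tolerance:
            close = close + 1
    return close / repetitions

small = fraction_close(25)
large = fraction_close(400)
print(small, large)
```

The larger sample size lands close to the population mean in a greater fraction of trials, but neither fraction is 0 or 1, which matches the “more likely to be close” wording of the correct answer.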

The average score on this problem was 100%.