← return to practice.dsc10.com

**Instructor(s):** Suraj Rampure, Puoya Tabaghi, Janine
Tiefenbruck

This exam was administered in-person. The exam was closed-notes,
except students were provided a copy of the DSC
10 Reference Sheet. No calculators were allowed. Students had
**50 minutes** to take this exam.

**In Lecture 14, we covered some of the problems from this exam. See
the slides
here
and the recording
here.**

In September 2020, Governor Gavin Newsom announced that by 2035, all new vehicles sold in California must be zero-emissions vehicles. Electric vehicles (EVs) are among the most popular zero-emissions vehicles (though other examples include plug-in hybrids and hydrogen fuel cell vehicles).

The DataFrame `evs`

consists of **32** rows,
each of which contains information about a different EV model.

`"Brand"`

(str): The vehicle’s manufacturer.`"Model"`

(str): The vehicle’s model name.`"BodyStyle"`

(str): The vehicle’s body style.`"Seats"`

(int): The vehicle’s number of seats.`"TopSpeed"`

(int): The vehicle’s top speed, in kilometers per hour.`"Range"`

(int): The vehicle’s range, or distance it can travel on a single charge, in kilometers.

The first few rows of `evs`

are shown below (though
remember, `evs`

has 32 rows total).

**Throughout this exam, we will refer to evs
repeatedly.**

Assume that:

- The only four values in the
`"Brand"`

column are`"Tesla"`

,`"BMW"`

,`"Audi"`

, and`"Nissan"`

. - We have already run
`import babypandas as bpd`

and`import numpy as np`

.

**Tip:** Open this page in another tab, so that it is
easy to refer to this data description as you work through the exam.

**Note: This problem is out of scope; it
covers material no longer included in the course.**

TritonTire, a UCSD alumni-run car repair shop, has an electronic database of all services it has ever provided, dating back to 1987. The owner, Dasha, says:

Our collective records from all 50 states show that, over the years, our repair shops have performed more and more repairs on EVs. After discussing this with our employees, we believe this is because newer EVs are not made to the same standards as the gasoline cars before them.

In **at most two sentences**, identify the largest
confounding factor that Dasha has not considered in her argument. Your
answer must fit in the box below.

**Answer:** The confounding factor that Dasha hasn’t
considered is the prevalence of EVs. There are a lot more EVs now than
ever before, so even if EVs are made to the same standards as gasoline
cars or better, we’d expect to see more repairs of EVs now.

The average score on this problem was 68%.

Which type of visualization should we use to visualize the
distribution of `"Range"`

?

Bar chart

Histogram

Scatter plot

Line plot

**Answer:** Histogram

`"Range"`

is a numerical (i.e. quantitative) variable, and
we use histograms to visualize the distribution of numerical
variables.

- A bar chart couldn’t work here. Bar charts can show the distribution
of a categorical variable, but
`"Range"`

is not categorical. - A scatter plot visualizes the relationship between two numerical
variables, but we are only dealing with one numerical variable here
(
`"Range"`

). - Similarly, a line plot visualizes the relationship between two numerical variables, but we only have one here.

The average score on this problem was 63%.

Teslas, on average, tend to have higher `"Range"`

s than
BMWs. In which of the following visualizations would we be able to see
this pattern? Select all that apply.

A bar chart that shows the distribution of

`"Brand"`

A bar chart that shows the average

`"Range"`

for each`"Brand"`

An overlaid histogram showing the distribution of

`"Range"`

for each`"Brand"`

A scatter plot with

`"TopSpeed"`

on the x-axis and`"Range"`

on the y-axis

**Answer:**

- A bar chart that shows the average
`"Range"`

for each`"Brand"`

- An overlaid histogram showing the distribution of
`"Range"`

for each`"Brand"`

Let’s look at each option more closely.

**Option 1:**A bar chart showing the distribution of`"Brand"`

would only show us how many cars of each`"Brand"`

there are. It would not tell us anything about the average`"Range"`

of each`"Brand"`

.**Option 2:**A bar chart showing the average range for each`"Brand"`

would help us directly visualize how the average range of each`"Brand"`

compares to one another.**Option 3:**An overlaid histogram, although perhaps a bit messy, would also give us a general idea of the average range of each`"Brand"`

by giving us the distribution of the`"Range"`

of each brand. In the scenario mentioned in the question, we’d expect to see that the Tesla distribution is further right than the BMW distribution.**Option 4:**A scatter plot of`"TopSpeed"`

against`"Range"`

would only illustrate the relationship between`"TopSpeed"`

and`"Range"`

, but would contain no information about the`"Brand"`

of each EV.

The average score on this problem was 91%.

Gabriel thinks `"Seats"`

is a categorical variable because
it can be used to categorize EVs by size. For instance, EVs with 4 seats
are small, EVs with 5 seats are medium, and EVs with 6 or more seats are
large.

Is Gabriel correct?

Yes

No

Justify your answer in **one sentence**. Your answer
must fit in the box below.

**Answer:** No

`"Seats"`

is a numerical variable, since it makes sense to
do arithmetic with the values. For instance, we can find the average
number of `"Seats"`

that a group of cars has. Gabriel’s
argument could apply to any numerical variable; just because we can
place numerical variables into “bins” doesn’t make them categorical.

The average score on this problem was 51%.

Suppose we’ve run the following two lines of code.

```
= evs.get("Brand").apply(max)
first = evs.get("Brand").max() second
```

Note:

- The length of a value
`v`

is defined as`len(v)`

, unless`v`

is a DataFrame, in which case its length is`v.shape[0]`

. - If
`s`

is a string, then`max(s)`

also evaluates to a string.

Fill in the blanks: `first`

is a __(i)__ of length
__(ii)__.

(i):

list

array

string

DataFrame

Series

(ii): _____

**Answer:**

- (i): Series
- (ii): 32

The `.apply`

method applies a function on every element of
a Series. Here, `evs.get("Brand").apply(max)`

applies the
`max`

function on every element of the `"Brand"`

column of `evs`

, producing a new Series with the same length
as `evs`

.

While not necessary to answer the question, if `s`

is a
string, then `max(s)`

evaluates to the single character in
`s`

that is last in the alphabet. For instance,
`max("zebra")`

evaluates to `"z"`

. As such,
`evs.get("Brand").apply(max)`

is a Series of 32 elements,
each of which is a single character, corresponding to the latest
character in the alphabet for each entry in
`evs.get("Brand")`

.

The average score on this problem was 65%.

Fill in the blanks: `second`

is a __(i)__ of length
__(ii)__.

(i):

list

array

string

DataFrame

Series

(ii): _____

**Answer:**

- (i): string
- (ii): 5

The `.max()`

method will find the “largest” element in the
Series it is called in, which in this case is
`evs.get("Brand")`

. The way that strings are ordered is
alphabetically, so `evs.get("Brand").max()`

will be the last
value of `"Brand"`

alphabetically. Since we were told that
the only values in the `"Brand"`

column are
`"Tesla"`

, `"BMW"`

, `"Audi"`

, and
`"Nissan"`

, the “maximum” is `"Tesla"`

, which has
a length of 5.

The average score on this problem was 54%.

Suppose we’ve run the following line of code.

`= evs.groupby("Brand").count() counts `

What value does `counts.get("Range").sum()`

evaluate
to?

**Answer:** 32

`counts`

is a DataFrame with one row per
`"Brand"`

, since we grouped by `"Brand"`

. Since we
used the `.count()`

aggregation method, the columns in
`counts`

will all be the same – they will all contain the
number of rows in `evs`

for each `"Brand"`

(i.e. they will all contain the distribution of `"Brand"`

).
If we sum up the values in any one of the columns in
`counts`

, then, the result will be the total number of rows
in `evs`

, which we know to be 32. Thus,
`counts.get("Range").sum()`

is 32.

The average score on this problem was 56%.

What value does `counts.index[3]`

evaluate to?

**Answer:** `"Tesla"`

Since we grouped by `"Brand"`

to create
`counts`

, the index of `counts`

will be
`"Brand"`

, sorted alphabetically (this sorting happens
automatically when grouping). This means that `counts.index`

will be the array-like sequence
`["Audi", "BMW", "Nissan", "Tesla"]`

, and
`counts.index[3]`

is `"Tesla"`

.

The average score on this problem was 33%.

Consider the following incomplete assignment statement.

`= evs______.mean() result `

In each part, fill in the blank above so that result evaluates to the specified quantity.

A DataFrame, indexed by `"Brand"`

, whose
`"Seats"`

column contains the average number of
`"Seats"`

per `"Brand"`

. (The DataFrame may have
other columns in it as well.)

**Answer:** `.groupby("Brand")`

When we group by a column, the resulting DataFrame contains one row
for every unique value in that column. The question specified that we
wanted some information *per* `"Brand"`

, which implies
that grouping by `"Brand"`

is necessary.

After grouping, we need to use an aggregation method. Here, we wanted
the resulting DataFrame to be such that the `"Seats"`

column
contained the average number of `"Seats"`

per
`"Brand"`

; this is accomplished by using
`.mean()`

, which is already done for us.

Note: With the provided solution, the resulting DataFrame also has
other columns. For instance, it has a `"Range"`

column that
contains the average `"Range"`

for each `"Brand"`

.
That’s fine, since we were told that the resulting DataFrame may have
other columns in it as well. If we wanted to ensure that the only column
in the resulting DataFrame was `"Seats"`

, we could have used
`.get(["Brand", "Seats"])`

before grouping, though this was
not necessary.

The average score on this problem was 76%.

A number, corresponding to the average `"TopSpeed"`

of all
EVs manufactured by Audi in evs

**Answer:**
`[evs.get("Brand") == "Audi"].get("TopSpeed")`

There are two parts to this problem:

Querying, to make sure that we only keep the rows corresponding to Audis. This is accomplished by:

- Using
`evs.get("Brand") == "Audi"`

to create a Boolean Series, with`True`

s for the rows we want to keep and`False`

s for the other rows. - Using Boolean indexing to keep only the rows in which the
aforementioned Series is
`True`

. This is accomplished by`evs[evs.get("Brand") == "Audi"]`

(though the`evs`

part at the front was already provided).

- Using
Accessing the

`"TopSpeed"`

column. This is accomplished by using`.get("TopSpeed")`

.

Then, `evs[evs.get("Brand") == "Audi"].get("TopSpeed")`

is
a Series contaning the `"TopSpeed"`

s of all Audis, and mean
of this Series is the result we’re looking for. The call to
`.mean()`

was already provided for us.

The average score on this problem was 77%.

A number, corresponding to the average of the natural logarithm of
the `"TopSpeed"`

of all EVs in evs. (Hint: The function
`np.log`

computes the natural logarithm of a single
number.)

**Answer:**
`.get("TopSpeed").apply(np.log)`

The `.apply`

method is used to apply a function on every
element of a Series. The relevant Series here is the column containing
the `"TopSpeed"`

of each EV,
i.e. `evs.get("TopSpeed")`

(the `evs`

part was
already provided to us).

After we get that Series, we need to use the function
`np.log`

on every element of it. This is accomplished by
using `.apply(np.log)`

. Putting our steps so far together, we
have `evs.get("TopSpeed").apply(np.log)`

, which is a Series
containing the natural logarithm of the `"TopSpeed"`

of all
EVs in `evs`

.

The number we were asked for was the average of the natural logarithm
of the `"TopSpeed"`

of all EVs in `evs`

; all we
need to do now is use the `.mean()`

method at the end, which
was already done for us.

The average score on this problem was 61%.

The DataFrame below shows the distribution of
`"BodyStyle"`

for all `"Brands"`

in
`evs`

, other than Nissan. We will call EVs made by a
`"Brand"`

other than Nissan “non-Nissan EVs”; there are 24
non-Nissan EVs in evs.

Use the information above to answer the following questions.

Suppose we randomly select one of the non-Nissan EVs and it is either
an SUV or a sedan. What is the most likely `"Brand"`

of the
randomly selected non-Nissan EV?

Tesla

BMW

Audi

**Answer:** Audi

Let’s compute the number of EVs that are either SUVs or sedans for each non-Nissan Brand. (To do this, we’ll look at the right-most two columns in the DataFrame provided.)

- Number of Teslas that are SUVs or sedans: 4 + 3 = 7
- Number of BMWs that are SUVs or sedans: 1 + 1 = 2
- Number of Audis that are SUVs or sedans: 8 + 1 = 9

Since Audi is the `"Brand"`

with the most total SUVs and
sedans, it is the most likely `"Brand"`

to be selected.

Note: You *could* compute conditional probabilities for each
brand, if you’d like, by dividing the counts above by 18 (the total
number of SUVs and sedans). For instance, P(\text{EV is a BMW given that EV is an SUV or
sedan}) = \frac{2}{18}. The `"Brand"`

with the highest
count (Audi, with 9 SUVs or sedans) is also the `"Brand"`

with the highest conditional probability of being selected given that
the selected car is an SUV or sedan (Audi, with \frac{9}{18}).

The average score on this problem was 88%.

Suppose we randomly select two of the non-Nissan EVs without replacement. The probability that both are BMWs is equal to \frac{1}{k} , where k is a positive integer. What is k?

8

56

64

84

92

108

**Answer:** 92

In the first selection, the probability of selecting a BMW is \frac{1+1+1}{24} = \frac{3}{24} (3 is the total number of EVs that are BMW, and 24 is the total number of non-Nissan EVs as given by the question).

In the second selection, since we select **without
replacement**, there are only 23 EVs we can select from. Given
that in the first selection we already selected 1 BMW, there are only 2
BMWs left among the 23 EVs left. Thus, the probability of getting a BMW
in the second selection is \frac{2}{23}.

Putting this all together, the probability that both selections are BMWs is

\frac{3}{24}\cdot\frac{2}{23} = \frac{6}{24} \cdot \frac{1}{23} =\frac{1}{4} \cdot \frac{1}{23} = \frac{1}{92}

So, k = 92.

The average score on this problem was 67%.

Suppose we randomly select one of the non-Nissan EVs and it is an SUV. What is the probability that it is made by Tesla? Give your answer as a simplified fraction.

**Answer:** \frac{4}{13}

The question is asking for the proportion of SUVs that are made by Tesla.

We first need to find the number of SUVs in the DataFrame provided, which is 4 + 1 + 8 = 13. Of those 13 SUVs, 4 are made by Tesla. Thus, the proportion of SUVs made by Tesla is \frac{4}{13}, so the probability that a randomly selected SUV is made by Tesla is \frac{4}{13}.

The average score on this problem was 66%.

Below, we provide the same DataFrame as shown at the start of the previous problem, which contains the distribution of “BodyStyle” for all “Brands” in evs, other than Nissan.

Suppose we’ve run the following few lines of code.

```
= evs[evs.get("Brand") == "Tesla"]
tesla = evs[evs.get("Brand") == "BMW"]
bmw = evs[evs.get("Brand") == "Audi"]
audi
= tesla.merge(bmw, on="BodyStyle").merge(audi, on="BodyStyle") combo
```

How many rows does the DataFrame `combo`

have?

21

24

35

65

72

96

**Answer:** 35

Let’s attempt this problem step-by-step. We’ll first determine the
number of rows in `tesla.merge(bmw, on="BodyStyle")`

, and
then determine the number of rows in `combo`

. **For the
purposes of the solution, let’s use temp to refer to the
first merged DataFrame,
tesla.merge(bmw, on="BodyStyle").**

Recall, when we `merge`

two DataFrames, the resulting
DataFrame contains a single row for every match between the two columns,
and rows in either DataFrame without a match disappear. In this problem,
the column that we’re looking for matches in is
`"BodyStyle"`

.

To determine the number of rows of `temp`

, we need to
determine which rows of `tesla`

have a
`"BodyStyle"`

that matches a row in `bmw`

. From
the DataFrame provided, we can see that the only
`"BodyStyle"`

s in both `tesla`

and
`bmw`

are SUV and sedan. When we merge `tesla`

and
`bmw`

on `"BodyStyle"`

:

- The 4 SUV rows in
`tesla`

each match the 1 SUV row in`bmw`

. This will create 4 SUV rows in`temp`

. - The 3 sedan rows in
`tesla`

each match the 1 sedan row in`bmw`

. This will create 3 sedan rows in`temp`

.

So, `temp`

is a DataFrame with a total of 7 rows, with 4
rows for SUVs and 3 rows for sedans (in the `"BodyStyle"`

)
column. Now, when we merge `temp`

and `audi`

on
`"BodyStyle"`

:

- The 4 SUV rows in
`temp`

each match the 8 SUV rows in`audi`

. This will create 4 \cdot 8 = 32 SUV rows in`combo`

. - The 3 sedan rows in
`temp`

each match the 1 sedan row in`audi`

. This will create 3 \cdot 1 = 3 sedan rows in`combo`

.

Thus, the total number of rows in `combo`

is 32 + 3 = 35.

Note: You may notice that 35 is the result of multiplying the
`"SUV"`

and `"Sedan"`

columns in the DataFrame
provided, and adding up the results. This problem is similar to Problem 5 from the Fall 2021
Midterm.

The average score on this problem was 45%.

TritonTrucks is an EV startup run by UCSD alumni. Their signature EV, the TritonTruck, has a subpar battery (the engineers didn’t pay attention in their Chemistry courses).

A new TritonTruck’s battery needs to be replaced after 500 days, unless it fails first, in which case it needs to be replaced immediately. On any given day, the probability that a given TritonTruck’s battery fails is 0.02, independent of all other days.

Fill in the blanks so that
`average_days_until_replacement`

is an estimate of the
**average number of days a new TritonTruck’s battery lasts without
needing to be replaced**.

```
def days_until_replacement(daily_conditions):
= 0
days for i in __(a)__:
if daily_conditions[i] == True:
__(b)__else:
return days
return days
= 0
total = 10000
repetitions for i in np.arange(repetitions):
# The first element of the first argument to np.random.choice is
# chosen with probability 0.98
= np.random.choice(__(c)__, 500, p=[0.98, 0.02])
daily_conditions = total + days_until_replacement(daily_conditions)
total = total / repetitions average_days_until_replacement
```

What goes in blanks (a), (b), and (c)?

**Answer:**

- Blank (a):
`np.arange(len(daily_conditions))`

- Blank (b):
`days = days + 1`

- Blank (c):
`[True, False]`

At a high-level, here’s how this code block works:

`daily_conditions`

is an array of length 500, in each each element is`True`

with probability 0.98 and`False`

with probability 0.02. Each element of`daily_conditions`

is meant to represent whether or not the given TritonTruck’s battery failed on that day. For instance, if the first four elements of`daily_conditions`

are`[True, True, False, True, ...]`

, it means the battery was fine the first two days, but failed on the third day.- The function
`days_until_replacement`

takes in`daily_conditions`

and returns the number of days until the battery failed for the first time. In other words, it returns the number of elements before the first`False`

in`daily_conditions`

. In the example above, where the first four elements of`daily_conditions`

are`[True, True, False, True, ...]`

,`days_until_replacement`

would return 2, since the battery lasted 2 days until it needed to be replaced. It doesn’t matter what is in`daily_conditions`

after the first`False`

.

With that in mind, let’s fill in the pieces.

Blank (a): We need to loop over all elements in

`daily_conditions`

. There are two ways to do this, in theory – by looping over the elements themselves (e.g.`for cond in daily_conditions`

) or their positions (e.g.`for i in np.arange(len(daily_conditions))`

). However, here we*must*loop over their positions, because the body of`days_until_replacement`

uses`daily_conditions[i]`

, which only makes sense if`i`

is the position of an element in`daily_conditions`

. Since`daily_conditions`

has 500 elements, the possible positions are 0, 1, 2, …, 499. Since`len(daily_conditions)`

is 500, both`np.arange(len(daily_conditions))`

and`np.arange(500)`

yield the same correct result here.Blank (b):

`days`

is the “counter” variable that is being used to keep track of the number of days the battery lasted before failing. If`daily_conditions[i]`

is`True`

, it means that the battery lasted another day without failing, and so 1 needs to be added to`days`

. As such, the correct answer here is`days = days + 1`

(or`days += 1`

). (If`daily_conditions[i]`

is`False`

, then the battery has failed, and so we return the number of days until the first failure.)Blank (c): This must be

`[True, False]`

, as mentioned above. There are other valid answers too, including`np.array([True, False])`

and`[1, 0]`

.

The average score on this problem was 56%.

Histograms A and B below both display the distribution of the
`"Seats"`

column, using different bins. Each histogram
includes all 32 rows of `evs`

.

How many EVs in evs have exactly 6 seats?

**Answer:** 3

Here are two ways to solve this problem. In both solutions, we only
look at Histogram A, since only Histogram A contains a bin that
corresponds to EVs with exactly 6 `"Seats"`

. Recall, in
histograms, bins are inclusive of the left endpoint and exclusive of the
right endpoint, which means the [6, 7) bin represents EVs with >= 6
`"Seats"`

and < 7 `"Seats"`

; since the number
of `"Seats"`

is a whole number, this corresponds to exactly 6
`"Seats"`

.

**Solution 1**

Since the bin [6, 7) has a width of 1, its height is equal to its area, which is equal to the proportion of values in that bin. There are 32 values total, so all proportions (and, thus, the height of the [6, 7) bar) must be a multiple of \frac{1}{32}. The height is close to but just under 0.01; this implies the height is \frac{3}{32}, since that is also close to but just under 0.01 (\frac{4}{32} is way above and \frac{2}{32} is way below). So, we conclude that the number of EVs with 6 seats is 3.

**Solution 2**

The height of the [6, 7) bar is ever-so-slightly less than 0.1. If it’s height was 0.1, it would imply that the proportion of values in the [6, 7) bin was 0.1 \cdot (7 - 6) = 0.1, which would imply that the number of values in the [6, 7) bin is 0.1 \cdot 32 = 3.2. However, since the number of values in a bin must be an integer, the number of values in this bin is 3 (which is slightly less than 3.2).

The average score on this problem was 61%.

How many EVs in evs have exactly 5 seats?

**Answer:** 22

Now, we must look at Histogram B. In the previous part, we computed
that there are 3 EVs with exactly 6 `"Seats"`

in
`evs`

. Histogram B shows us the proportion, and thus number,
of EVs with 5 or 6 `"Seats"`

, through its [5, 7) bin
(remember, this bin corresponds to EVs with >= 5 `"Seats"`

and < 7 `"Seats"`

). If we can find the number of EVs with
5 or 6 `"Seats"`

, we can subtract 3 from it to determine the
number of EVs with exactly 5 `"Seats"`

.

Since it’s not quite clear what the height of the [5, 7) bar is, we
can create a *range* for the height of the bar, and use that to
come up with a range for the area of the bar, and hence a range for the
number of values in the bin. We can then use the fact that the number of
values in a bin must be an integer to narrow down our answer.

- The height of the [5, 7) bar is less than 0.4. This means that the
area of the [5, 7) bar is less than (7 - 5)
\cdot 0.4 = 0.8, which means that the proportion of EVs with 5 or
6
`"Seats"`

is less than 0.8, and the number of EVs with 5 or 6`"Seats"`

is less than 0.8 \cdot 32 = 25.6. - The height of the [5, 7) bar is a bit above 0.375, which is the
midpoint between 0.35 and 0.4 on the y-axis. This means that the area of the [5,
7) bar is more than (7 - 5) \cdot 0.375 =
0.75, which means that the number of EVs with 5 or 6
`"Seats"`

is more than 0.75 \cdot 32 = 24.

We’ve found that the number of EVs with 5 or 6 `"Seats"`

is more than 24 and less than 25.6. There is only one integer in this
range – 25 – so the number of EVs with 5 or 6 `"Seats"`

is
25. Finally, the number of EVs with exactly 5 seats is 25 - 3 = 22.

The average score on this problem was 35%.

Histogram C also displays the distribution of the
`"Seats"`

column, but uses just a single bin, [4, 9]. What is
the height of the sole bar in Histogram C?

**Answer:** \frac{1}{5}
(or 0.2)

Recall, the total area of a (density) histogram is 1. Since Histogram C only has one bar, the area of that one bar must be 1. we can use this fact to find what its height must be.

\begin{align*} \text{Area} &= \text{Width} \cdot \text{Height} \\ 1 &= (9 - 4) \cdot \text{Height} \\ \frac{1}{5} &= \text{Height} \end{align*}

The average score on this problem was 68%.