← return to practice.dsc10.com

The problems in this worksheet are taken from past exams. Work on
them **on paper**, since the exams you take in this course
will also be on paper.

We encourage you to complete this
worksheet in a live discussion section. Solutions will be made available
after all discussion sections have concluded. You don’t need to submit
your answers anywhere.**Note: We do not plan to cover all
problems here in the live discussion section**; the problems we don’t
cover can be used for extra practice.

For this question we will use data from the 2021 Women’s National Basketball Association (WNBA) season for the next several problems. In basketball, players score points by shooting the ball into a hoop. The team that scores the most points wins the game.

Kelsey Plum, a WNBA player, attended La Jolla Country Day School,
which is adjacent to UCSD’s campus. Her current team is the Las Vegas
Aces (three-letter code `'LVA'`

). **In 2021, the Las
Vegas Aces played 31 games, and Kelsey Plum played in all
31.**

The DataFrame `plum`

contains her stats for all games the
Las Vegas Aces played in 2021. The first few rows of `plum`

are shown below (though the full DataFrame has 31 rows, not 5):

Each row in `plum`

corresponds to a single game. For each
game, we have:

`'Date'`

(`str`

), the date on which the game was played`'Opp'`

(`str`

), the three-letter code of the opponent team`'Home'`

(`bool`

),`True`

if the game was played in Las Vegas (“home”) and`False`

if it was played at the opponent’s arena (“away”)`'Won'`

(`bool`

),`True`

if the Las Vegas Aces won the game and`False`

if they lost`'PTS'`

(`int`

), the number of points Kelsey Plum scored in the game`'AST'`

(`int`

), the number of assists (passes) Kelsey Plum made in the game`'TOV'`

(`int`

), the number of turnovers Kelsey Plum made in the game (a turnover is when you lose the ball – turnovers are bad!)

What type of visualization is best suited for visualizing the trend in the number of points Kelsey Plum scored per game in 2021?

Histogram

Bar chart

Line chart

Scatter plot

**Answer: ** Line chart

Here, there are two quantitative variables (number of points and game number), and one of them involves some element of time (game number). Line charts are appropriate when one quantitative variable is time.

The average score on this problem was 75%.

Fill in the blanks below so that `total_june`

evaluates to
the total number of points Kelsey Plum scored in June.

```
= plum[__(a)__]
june_only = june_only.__(b)__ total_june
```

What goes in blank (a)?

What goes in blank (b)?

**Answer:**

`plum.get('Date').str.contains('-06-')`

`get('PTS').sum()`

To find the total number of points Kelsey Plum scored in June, one
approach is to first create a DataFrame with only the rows for June.
During the month of June, the `'Date'`

values contain
`'-06-'`

(since June is the 6th month), so
`plum.get('Date').str.contains('-06-')`

is a Series
containing `True`

only for the June rows and
`june_only = plum[plum.get('Date').str.contains('-06-')]`

is
a DataFrame containing only the June rows.

Then, all we need is the sum of the `'PTS'`

column, which
is given by `june_only.get('PTS').sum()`

.

The average score on this problem was 90%.

For your convenience, we show the first few rows of `plum`

again below.

There is exactly one team in the WNBA that Plum’s team did not win
any games against during the 2021 season. Fill in the blanks below so
that `never_beat`

evaluates to a string containing the
three-letter code of that team.

`= plum.groupby(__(a)__).sum().__(b)__ never_beat `

What goes in blank (a)?

What goes in blank (b)?

**Answer:**

`'Opp'`

`sort_values('Won').index[0]`

The key insight here is that the values in the `'Won'`

column are Boolean, and when Boolean values are used in arithmetic they
are treated as 1s (`True`

) and 0s (`False`

). The
`sum`

of several `'Won'`

values is the same as the
number of wins.

If we group `plum`

by `'Opp'`

and use
`.sum()`

, the resulting `'Won'`

column contains
the number of wins that Plum’s team had against each unique opponent. If
we sort this DataFrame by `'Won'`

in increasing order (which
is the default behavior of `sort_values`

), the row at the top
will correspond to the `'Opp'`

that Plum’s team had no wins
against. Since we grouped by `'Opp'`

, team names are stored
in the index, so `.index[0]`

will give us the name of the
desired team.

The average score on this problem was 67%.

Suppose `df`

is a DataFrame and `b`

is any
boolean array whose length is the same as the number of rows of
`df`

.

True or False: For any such boolean array `b`

,
`df[b].shape[0]`

is less than or equal to
`df.shape[0]`

.

True

False

**Answer: ** True

The brackets in `df[b]`

perform a query, or filter, to
keep only the rows of `df`

for which `b`

has a
`True`

entry. Typically, `b`

will come from some
condition, such as the entry in a certain column of `df`

equaling a certain value. Regardless, `df[b]`

contains a
subset of the rows of `df`

, and `.shape[0]`

counts
the number of rows, so `df[b].shape[0]`

must be less than or
equal to `df.shape[0]`

.

The average score on this problem was 86%.

You are given a DataFrame called `books`

that contains
columns `'author'`

(string), `'title'`

(string),
`'num_chapters'`

(int), and `'publication_year'`

(int).

Suppose that after doing `books.groupby('Author').max()`

,
one row says

author | title | num_chapters | publication_year |
---|---|---|---|

Charles Dickens | Oliver Twist | 53 | 1838 |

Based on this data, can you conclude that Charles Dickens is the alphabetically last of all author names in this dataset?

Yes

No

**Answer: ** No

When we group by `'Author'`

, all books by the same author
get aggregated together into a single row. The aggregation function is
applied separately to each other column besides the column we’re
grouping by. Since we’re grouping by `'Author'`

here, the
`'Author'`

column never has the `max()`

function
applied to it. Instead, each unique value in the `'Author'`

column becomes a value in the index of the grouped DataFrame. We are
told that the Charles Dickens row is just one row of the output, but we
don’t know anything about the other rows of the output, or the other
authors. We can’t say anything about where Charles Dickens falls when
authors are ordered alphabetically (but it’s probably not last!)

The average score on this problem was 94%.

Based on this data, can you conclude that Charles Dickens wrote
*Oliver Twist*?

Yes

No

**Answer: ** Yes

Grouping by `'Author'`

collapses all books written by the
same author into a single row. Since we’re applying the
`max()`

function to aggregate these books, we can conclude
that *Oliver Twist* is alphabetically last among all books in the
`books`

DataFrame written by Charles Dickens. So Charles
Dickens did write *Oliver Twist* based on this data.

The average score on this problem was 95%.

Based on this data, can you conclude that *Oliver Twist* has
53 chapters?

Yes

No

**Answer: ** No

The key to this problem is that `groupby`

applies the
aggregation function, `max()`

in this case, independently to
each column. The output should be interpreted as follows:

- Among all books in
`books`

written by Charles Dickens,*Oliver Twist*is the title that is alphabetically last. - Among all books in
`books`

written by Charles Dickens, 53 is the greatest number of chapters. - Among all books in
`books`

written by Charles Dickens, 1838 is the latest year of publication.

However, the book titled *Oliver Twist*, the book with 53
chapters, and the book published in 1838 are not necessarily all the
same book. We cannot conclude, based on this data, that *Oliver
Twist* has 53 chapters.

The average score on this problem was 74%.

Based on this data, can you conclude that Charles Dickens wrote a book with 53 chapters that was published in 1838?

Yes

No

**Answer: ** No

As explained in the previous question, the `max()`

function is applied separately to each column, so the book written by
Charles Dickens with 53 chapters may not be the same book as the book
written by Charles Dickens published in 1838.

The average score on this problem was 73%.

You are given a table called books that contains columns
`'author'`

(`str`

), `'title'`

(`str`

), `'num_chapters'`

(`int`

), and
`'publication_year'`

(`int`

).

What will be the output of the following code?
`books.groupby(“publication_year”).mean().shape[1]`

`1`

`2`

`3`

`4`

**Answer: ** `1`

The output will return 1. Notice that the final function call is to
`.shape[1]`

. We know that `.shape[1]`

is a call to
see how many columns are in the resulting data frame. When we group by
publication year, there is only one column that will be aggregated by
the groupby call (which is the `'num_chapters'`

column). The
other columns are string, and therefore, will not be aggregated in the
groupby call (since you can’t take the mean of a string). Consequently
`.shape[1]`

will only result one column for the mean of the
`'num_chapters'`

column.

The average score on this problem was 67%.

Which of the following strategies would work to compute the absolute difference in the average number of chapters per book for authors “Dean Koontz” and “Charles Dickens”?

group by

`'author'`

, aggregate with`.mean()`

, use`get`

on`'num_chapters'`

column compute the absolute value of the difference between`iloc["Charles Dickens"]`

and`iloc["Dean Koontz"]`

do two queries to get two separate tables (one for each of “Dean Koontz” and “Charles Dickens”), use

`get`

on the`'num_chapters'`

column of each table, use the Series method`.mean()`

on each, compute the absolute value of the difference in these two meansgroup by both

`'author'`

and`'title'`

, aggregate with`.mean()`

, use get on`'num_chapters'`

column, use`loc`

twice to find values in that column corresponding to “Dean Koontz” and “Charles Dickens”, compute the absolute value of the difference in these two valuesquery using a compound condition to get all books corresponding to “Dean Koontz” or “Charles Dickens”, group by

`'author'`

, aggregate with`.mean()`

, compute absolute value of the difference in`index[0]`

and`index[1]`

**Answer: ** do two queries to get two separate tables
(one for each of “Dean Koontz” and “Charles Dickens”), use
`get`

on the `'num_chapters'`

column of each
table, use the Series method `.mean()`

on each, compute the
absolute value of the difference in these two means

Logically, we want to somehow separate data for author “Dean Koontz”
and “Charles Dickens”. (If we don’t we’ll be taking a mean that includes
the chapters of books from both authors.) To achieve this separation, we
can create two separate tables with a query that specifies a value on
the `'author'`

column. Now having two separate tables, we can
aggregate on the `'num_chapters'`

(the column of interest).
To get the `'num_chapters'`

column we can use the
`get`

method. To actually acquire the mean of the
`'num_chapters'`

column we can evoke the `.mean()`

call.

The average score on this problem was 80%.

Which of the following will produce the same value as the total number of books in the table?

`books.groupby('Title').count().shape[0]`

`books.groupby('Author').count().shape[0]`

`books.groupby(['Author, 'Title']).count().shape[0]`

**Answer: **
`books.groupby(['Author, 'Title']).count().shape[0]`

The key in this question is to understand that different authors can
create books with the same name. The first two options check for each
unique book title (the first response) and check for each unique other
(the second response). To ensure we have all unique author and title
pairs we must group based on both `'Author'`

and
`'Title'`

. To actually get the number of rows we can take
`.shape[0]`

.

The average score on this problem was 56%.

You have a DataFrame called `prices`

that contains
information about food prices at 18 different grocery stores. There is
column called `'broccoli'`

that contains the price in dollars
for one pound of broccoli at each grocery store. There is also a column
called `'ice_cream'`

that contains the price in dollars for a
pint of store-brand ice cream.

Using the code,

`='hist', y='broccoli', bins=np.arange(0.8, 2.11, 0.1), density=True) prices.plot(kind`

we produced the histogram below:

How many grocery stores sold broccoli for a price greater than or equal to $1.30 per pound, but less than $1.40 per pound (the tallest bar)?

**Answer: ** 4 grocery stores

We are given that the bins start at 0.8 and have a width of 0.1, which means one of the bins has endpoints 1.3 and 1.4. This bin (the tallest bar) includes all grocery stores that sold broccoli for a price greater than or equal to $1.30 per pound, but less than $1.40 per pound.

This bar has a width of 0.1 and we’d estimate the height to be around 2.2, though we can’t say exactly. Multiplying these values, the area of the bar is about 0.22, which means about 22 percent of the grocery stores fall into this bin. There are 18 grocery stores in total, as we are told in the introduction to this question. We can compute using a calculator that 22 percent of 18 is 3.96. Since the actual number of grocery stores this represents must be a whole number, this bin must represent 4 grocery stores.

The reason for the slight discrepancy between 3.96 and 4 is that we used 2.2 for the height of the bar, a number that we determined by eye. We don’t know the exact height of the bar. It is reassuring to do the calculation and get a value that’s very close to an integer, since we know the final answer must be an integer.

The average score on this problem was 71%.

Suppose we now plot the same data with different bins, using the following line of code:

`='hist', y='broccoli', bins=[0.8, 1, 1.1, 1.5, 1.8, 1.9, 2.5], density=True) prices.plot(kind`

What would be the height on the y-axis for the bin corresponding to the interval [\$1.10, \$1.50)? Input your answer below.

**Answer: ** 1.25

First, we need to figure out how many grocery stores the bin [\$1.10, \$1.50) contains. We already know from the previous subpart that there are four grocery stores in the bin [\$1.30, \$1.40). We could do similar calculations to find the number of grocery stores in each of these bins:

- [\$1.10, \$1.20)
- [\$1.20, \$1.30)
- [\$1.40, \$1.50)

However, it’s much simpler and faster to use the fact that when the bins are all equally wide, the height of a bar is proportional to the number of data values it contains. So looking at the histogram in the previous subpart, since we know the [\$1.30, \$1.40) bin contains 4 grocery stores, then the [\$1.10, \$1.20) bin must contain 1 grocery store, since it’s only a quarter as tall. Again, we’re taking advantage of the fact that there must be an integer number of grocery stores in each bin when we say it’s 1/4 as tall. Our only options are 1/4, 1/2, or 3/4 as tall, and among those choices, it’s clear.

Therefore, by looking at the relative heights of the bars, we can quickly determine the number of grocery stores in each bin:

- [\$1.10, \$1.20): 1 grocery store
- [\$1.20, \$1.30): 3 grocery stores
- [\$1.30, \$1.40): 4 grocery stores
- [\$1.40, \$1.50): 1 grocery store

Adding these numbers together, this means there are 9 grocery stores whose broccoli prices fall in the interval [\$1.10, \$1.50). In the new histogram, these 9 grocery stores will be represented by a bar of width 1.50-1.10 = 0.4. The area of the bar should be \frac{9}{18} = 0.5. Therefore the height must be \frac{0.5}{0.4} = 1.25.

The average score on this problem was 33%.

You are interested in finding out the number of stores in which a pint of ice cream was cheaper than a pound of broccoli. Will you be able to determine the answer to this question by looking at the plot produced by the code below?

`'broccoli', 'ice_cream']).plot(kind='barh') prices.get([`

Yes

No

**Answer: ** Yes

When we use `.plot`

without specifying a `y`

column, it uses every column in the DataFrame as a `y`

column
and creates an overlaid plot. Since we first use `get`

with
the list `['broccoli', 'ice_cream']`

, this keeps the
`'broccoli'`

and `'ice_cream'`

columns from
`prices`

, so our bar chart will overlay broccoli prices with
ice cream prices. Notice that this `get`

is unnecessary
because `prices`

only has these two columns, so it would have
been the same to just use `prices`

directly. The resulting
bar chart will look something like this:

Each grocery store has its broccoli price represented by the length of the blue bar and its ice cream price represented by the length of the red bar. We can therefore answer the question by simply counting the number of red bars that are shorter than their corresponding blue bars.

The average score on this problem was 78%.

You are interested in finding out the number of stores in which a pint of ice cream was cheaper than a pound of broccoli. Will you be able to determine the answer to this question by looking at the plot produced by the code below?

`'broccoli', 'ice_cream']).plot(kind='hist') prices.get([`

Yes

No

**Answer: ** No

This will create an overlaid histogram of broccoli prices and ice cream prices. So we will be able to see the distribution of broccoli prices together with the distribution of ice cream prices, but we won’t be able to pair up particular broccoli prices with ice cream prices at the same store. This means we won’t be able to answer the question. The overlaid histogram would look something like this:

This tells us that broadly, ice cream tends to be more expensive than broccoli, but we can’t say anything about the number of stores where ice cream is cheaper than broccoli.

The average score on this problem was 81%.

Some code and the scatterplot that produced it is shown below:

`'broccoli', 'ice_cream']).plot(kind='scatter', x='broccoli', y='ice_cream')) (prices.get([`

Can you use this plot to figure out the number of stores in which a pint of ice cream was cheaper than a pound of broccoli?

If so, say how many such stores there are and explain how you came to that conclusion.

If not, explain why this scatterplot cannot be used to answer the question.

**Answer: ** Yes, and there are 2 such stores.

In this scatterplot, each grocery store is represented as one dot. The x-coordinate of that dot tells the price of broccoli at that store, and the y-coordinate tells the price of ice cream. If a grocery store’s ice cream price is cheaper than its broccoli price, the dot in the scatterplot will have y<x. To identify such dots in the scatterplot, imagine drawing the line y=x. Any dot below this line corresponds to a point with y<x, which is a grocery store where ice cream is cheaper than broccoli. As we can see, there are two such stores.

The average score on this problem was 78%.

Included is a DataFrame named `sungod`

that contains
information on the artists who have performed at Sun God in years past.
**For each year that the festival was held, we have one row for
each artist that performed that year.** The columns are:

`'Year'`

(`int`

): the year of the festival`'Artist'`

(`str`

): the name of the artist`'Appearance_Order'`

(`int`

): the order in which the artist appeared in that year’s festival (1 means they came onstage first)

The rows of `sungod`

are arranged in **no particular
order**. The first few rows of `sungod`

are shown
below (though `sungod`

has **many more rows**
than pictured here).

Assume:

Only one artist ever appeared at a time (for example, we can’t have two separate artists with a

`'Year'`

of 2015 and an`'Appearance_Order'`

of 3).An artist may appear in multiple different Sun God festivals (they could be invited back).

We have already run

`import babypandas as bpd`

and`import numpy as np`

.

On the graph paper below, draw the histogram that would be produced by this code.

```
(5))
sungod.take(np.arange(='hist', density=True,
.plot(kind=np.arange(0, 7, 2), y='Appearance_Order');
bins )
```

In your drawing, make sure to label the height of each bar in the histogram on the vertical axis. You can scale the axes however you like, and the two axes don’t need to be on the same scale.

**Answer: **

To draw the histogram, we first need to bin the data and figure out
how many data values fall into each bin. The code includes
`bins=np.arange(0, 7, 2)`

which means the bin endpoints are
0, 2, 4, 6. This gives us three bins:
[0, 2), [2,
4), and [4, 6]. Remember that
all bins, except for the last one, include the left endpoint but not the
right. The last bin includes both endpoints.

Now that we know what the bins are, we can count up the number of
values in each bin. We need to look at the
`'Appearance_Order'`

column of
`sungod.take(np.arange(5))`

, or the first five rows of
`sungod`

. The values there are 1,
4, 3, 1, 3. The two 1s fall into
the first bin [0, 2). The two 3s fall into the second bin [2, 4), and the one 4 falls into the last bin [4, 6]. This means the proportion of values
in each bin are \frac{2}{5}, \frac{2}{5},
\frac{1}{5} from left to right.

To figure out the height of each bar in the histogram, we use the fact that the area of a bar in a density histogram should equal the proportion of values in that bin. The area of a rectangle is height times width, so height is area divided by width.

For the bin [0, 2), the area is \frac{2}{5} = 0.4 and the width is 2, so the height is \frac{0.4}{2} = 0.2.

For the bin [2, 4), the area is \frac{2}{5} = 0.4 and the width is 2, so the height is \frac{0.4}{2} = 0.2.

For the bin [4, 6], the area is \frac{1}{5} = 0.2 and the width is 2, so the height is \frac{0.2}{2} = 0.1.

Since the bins are all the same width, the fact that there an equal number of values in the first two bins and half as many in the third bin means the first two bars should be equally tall and the third should be half as tall. We can use this to draw the rest of the histogram quickly once we’ve drawn the first bar.

The average score on this problem was 45%.