The problems in this worksheet are taken from past exams. Work on
them on paper, since the exams you take in this course
will also be on paper.
We encourage you to complete this
worksheet in a live discussion section. Solutions will be made available
after all discussion sections have concluded. You don’t need to submit
your answers anywhere.
Note: We do not plan to cover all
problems here in the live discussion section; the problems we don’t
cover can be used for extra practice.
Included is a DataFrame named sungod that contains information on the artists who have performed at Sun God in years past. For each year that the festival was held, we have one row for each artist that performed that year. The columns are:
- 'Year' (int): the year of the festival
- 'Artist' (str): the name of the artist
- 'Appearance_Order' (int): the order in which the artist appeared in that year’s festival (1 means they came onstage first)
The rows of sungod are arranged in no particular order. The first few rows of sungod are shown below (though sungod has many more rows than pictured here).
Assume:
- Only one artist ever appeared at a time (for example, we can’t have two separate artists with a 'Year' of 2015 and an 'Appearance_Order' of 3).
- An artist may appear in multiple different Sun God festivals (they could be invited back).
- We have already run import babypandas as bpd and import numpy as np.
Fill in the blank in the code below so that
chronological
is a DataFrame with the same rows as
sungod
, but ordered chronologically by appearance on stage.
That is, earlier years should come before later years, and within a
single year, artists should appear in the DataFrame in the order they
appeared on stage at Sun God. Note that groupby
automatically sorts the index in ascending order.
chronological = sungod.groupby(___________).max().reset_index()
['Year', 'Artist', 'Appearance_Order']
['Year', 'Appearance_Order']
['Appearance_Order', 'Year']
None of the above.
Answer:
['Year', 'Appearance_Order']
The fact that groupby
automatically sorts the index in
ascending order is important here. Since we want earlier years before later years, we could group by 'Year'; however, if we group by 'Year' alone, all the artists who performed in a given year will be aggregated together, which is not what we want. Within each
year, we want to organize the artists in ascending order of
'Appearance_Order'
. In other words, we need to group by
'Year'
with 'Appearance_Order'
as subgroups.
Therefore, the correct way to reorder the rows of sungod
as
desired is
sungod.groupby(['Year', 'Appearance_Order']).max().reset_index()
.
Note that we need to reset the index so that the resulting DataFrame has
'Year'
and 'Appearance_Order'
as columns, like
in sungod
.
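To make this concrete, here is a minimal sketch on a made-up three-row version of sungod (the values below are hypothetical, not the real festival data):

# Hypothetical miniature version of sungod, in no particular order.
mini_sungod = bpd.DataFrame().assign(
    Year=[2015, 2014, 2015],
    Artist=['B', 'C', 'A'],
    Appearance_Order=[2, 1, 1]
)
chronological = mini_sungod.groupby(['Year', 'Appearance_Order']).max().reset_index()
# Since groupby sorts the index, the rows now appear in the order
# (2014, 1) -> 'C', (2015, 1) -> 'A', (2015, 2) -> 'B'.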
The average score on this problem was 85%.
Another DataFrame called music contains a row for every music artist that has ever released a song. The columns are:
- 'Name' (str): the name of the music artist
- 'Genre' (str): the primary genre of the artist
- 'Top_Hit' (str): the most popular song by that artist, based on sales, radio play, and streaming
- 'Top_Hit_Year' (int): the year in which the top hit song was released
You want to know how many musical genres have been represented at Sun
God since its inception in 1983. Which of the following expressions
produces a DataFrame called merged
that could help
determine the answer?
merged = sungod.merge(music, left_on='Year', right_on='Top_Hit_Year')
merged = music.merge(sungod, left_on='Year', right_on='Top_Hit_Year')
merged = sungod.merge(music, left_on='Artist', right_on='Name')
merged = music.merge(sungod, left_on='Artist', right_on='Name')
Answer:
merged = sungod.merge(music, left_on='Artist', right_on='Name')
The question we want to answer is about Sun God music artists’
genres. In order to answer, we’ll need a DataFrame consisting of rows of
artists that have performed at Sun God since its inception in 1983. If
we merge the sungod
DataFrame with the music
DataFrame based on the artist’s name, we’ll end up with a DataFrame
containing a row for each Sun God performance, matched with that artist’s information from music, including their 'Genre'.
Since the column containing artists’ names is called
'Artist'
in sungod
and 'Name'
in
music
, the correct syntax for this merge is
merged = sungod.merge(music, left_on='Artist', right_on='Name')
.
Note that we could also interchange the left DataFrame with the right
DataFrame, as swapping the roles of the two DataFrames in a merge only
changes the ordering of rows and columns in the output, not the data
itself. This can be written in code as
merged = music.merge(sungod, left_on='Name', right_on='Artist')
,
but this is not one of the answer choices.
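Once merged is built, one way to count the genres represented at Sun God (a sketch, not the only valid approach) would be:

merged = sungod.merge(music, left_on='Artist', right_on='Name')
# One row per genre after grouping; the number of rows is the number of
# distinct genres that have been represented at Sun God.
num_genres = merged.groupby('Genre').count().shape[0]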
The average score on this problem was 86%.
Consider an artist that has only appeared once at Sun God. At the time of their Sun God performance, we’ll call the artist:
- 'up-and-coming' if their top hit came out after their Sun God performance,
- 'outdated' if their top hit came out more than five years before their Sun God performance, and
- 'trending' otherwise.
Complete the function below so it outputs the appropriate description for any input artist who has appeared exactly once at Sun God.
def classify_artist(artist):
    filtered = merged[merged.get('Artist') == artist]
    year = filtered.get('Year').iloc[0]
    top_hit_year = filtered.get('Top_Hit_Year').iloc[0]
    if ___(a)___ > 0:
        return 'up-and-coming'
    elif ___(b)___:
        return 'outdated'
    else:
        return 'trending'
What goes in blank (a)?
Answer: top_hit_year - year
Before we can answer this question, we need to understand what the
first three lines of the classify_artist
function are
doing. The first line creates a DataFrame with only one row,
corresponding to the particular artist that’s passed in as input to the
function. We know there is just one row because we are told that the
artist being passed in as input has appeared exactly once at Sun God.
The next two lines create two variables:
- year contains the year in which the artist performed at Sun God, and
- top_hit_year contains the year in which their top hit song was released.
Now, we can fill in blank (a). Notice that the body of the
if
clause is return 'up-and-coming'
. Therefore
we need a condition that corresponds to up-and-coming, which we are told
means the top hit came out after the artist appeared at Sun God. Using
the variables that have been defined for us, this condition is
top_hit_year > year
. However, the if
statement condition is already partially set up with > 0
included. We can simply rearrange our condition
top_hit_year > year
by subtracting year
from both sides to obtain top_hit_year - year > 0
, which
fits the desired format.
The average score on this problem was 89%.
What goes in blank (b)?
Answer: year - top_hit_year > 5
For this part, we need a condition that corresponds to an artist
being outdated, which happens when their top hit came out more than five
years prior to their appearance at Sun God. There are several ways to
state this condition: year - top_hit_year > 5
,
year > top_hit_year + 5
, or any equivalent condition
would be considered correct.
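Putting both blanks together, the completed function looks like this (using one of the equivalent conditions for blank (b)):

def classify_artist(artist):
    filtered = merged[merged.get('Artist') == artist]
    year = filtered.get('Year').iloc[0]
    top_hit_year = filtered.get('Top_Hit_Year').iloc[0]
    if top_hit_year - year > 0:
        return 'up-and-coming'
    elif year - top_hit_year > 5:
        return 'outdated'
    else:
        return 'trending'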
The average score on this problem was 89%.
King Triton, UCSD’s mascot, is quite the traveler! For this question,
we will be working with the flights
DataFrame, which
details several facts about each of the flights that King Triton has
been on over the past few years. The first few rows of
flights
are shown below.
Here’s a description of the columns in flights:
- 'DATE': the date on which the flight occurred. Assume that there were no “redeye” flights that spanned multiple days.
- 'FLIGHT': the flight number. Note that this is not unique; airlines reuse flight numbers on a daily basis.
- 'FROM' and 'TO': the 3-letter airport codes for the departure and arrival airports, respectively. Note that it’s not possible to have a flight from and to the same airport.
- 'DIST': the distance of the flight, in miles.
- 'HOURS': the length of the flight, in hours.
- 'SEAT': the kind of seat King Triton sat in on the flight; the only possible values are 'WINDOW', 'MIDDLE', and 'AISLE'.
Suppose we create a DataFrame called socal containing only King Triton’s flights departing from SAN, LAX, or SNA (John Wayne Airport in Orange County). socal has 10 rows; the bar chart below shows how many of these 10 flights departed from each airport.
Consider the DataFrame that results from merging socal
with itself, as follows:
double_merge = socal.merge(socal, left_on='FROM', right_on='FROM')
How many rows does double_merge
have?
Answer: 38
There are two flights from LAX. When we merge socal
with
itself on the 'FROM'
column, each of these flights gets
paired up with each of these flights, for a total of four rows in the
output. That is, the first flight from LAX gets paired with both the
first and second flights from LAX. Similarly, the second flight from LAX
gets paired with both the first and second flights from LAX.
Following this logic, each of the five flights from SAN gets paired with each of the five flights from SAN, for an additional 25 rows in the output, and each of the three flights from SNA gets paired with each of the three flights from SNA, for 9 more rows. The total is therefore 2^2 + 5^2 + 3^2 = 4 + 25 + 9 = 38 rows.
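Here is a sketch that reproduces this counting argument on a made-up socal with the same departure counts as the bar chart (2 flights from LAX, 5 from SAN, 3 from SNA); the 'TO' and 'DIST' values are placeholders:

socal_demo = bpd.DataFrame().assign(
    FROM=['LAX'] * 2 + ['SAN'] * 5 + ['SNA'] * 3,
    TO=['SFO'] * 10,    # placeholder arrival airports
    DIST=[500] * 10     # placeholder distances
)
double_merge = socal_demo.merge(socal_demo, left_on='FROM', right_on='FROM')
# double_merge.shape[0] evaluates to 2**2 + 5**2 + 3**2 = 38.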
The average score on this problem was 27%.
We define a “route” to be a departure and arrival airport pair. For
example, all flights from 'SFO'
to 'SAN'
make
up the “SFO to SAN route”. This is different from the “SAN to SFO
route”.
Fill in the blanks below so that
most_frequent.get('FROM').iloc[0]
and
most_frequent.get('TO').iloc[0]
correspond to the departure
and destination airports of the route that King Triton has spent the
most time flying on.
most_frequent = flights.groupby(__(a)__).__(b)__
most_frequent = most_frequent.reset_index().sort_values(__(c)__)
What goes in blank (a)?
Answer: ['FROM', 'TO']
We want to organize flights by route. This means we need to group by
both 'FROM'
and 'TO'
so any flights with the
same pair of departure and arrival airports get grouped together. To
group by multiple columns, we must use a list containing all these
column names, as in flights.groupby(['FROM', 'TO'])
.
The average score on this problem was 72%.
What goes in blank (b)?
count()
mean()
sum()
max()
Answer: sum()
Every .groupby
command needs an aggregation function!
Since we are asked to find the route that King Triton has spent the most
time flying on, we want to total the times for all flights on a given
route.
Note that .count()
would tell us how many flights King
Triton has taken on each route. That’s meaningful information, but not
what we need to address the question of which route he spent the most
time flying on.
The average score on this problem was 58%.
What goes in blank (c)?
by='HOURS', ascending=True
by='HOURS', ascending=False
by='HOURS', descending=True
by='DIST', ascending=False
Answer:
by='HOURS', ascending=False
We want to know the route that King Triton spent the most time flying
on. After we group flights by route, summing flights on the same route,
the 'HOURS'
column contains the total amount of time spent
on each route. We need most_frequent.get('FROM').iloc[0]
and most_frequent.get('TO').iloc[0]
to correspond with the
departure and destination airports of the route that King Triton has
spent the most time flying on. To do this, we need to sort in descending
order of time, to bring the largest time to the top of the DataFrame. So
we must sort by 'HOURS'
with
ascending=False
.
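Putting the three blanks together, the completed code is:

most_frequent = flights.groupby(['FROM', 'TO']).sum()
most_frequent = most_frequent.reset_index().sort_values(by='HOURS', ascending=False)
# The departure and destination airports of the route with the most total flying time:
most_frequent.get('FROM').iloc[0]
most_frequent.get('TO').iloc[0]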
The average score on this problem was 94%.
For the next several problems, we will use data from the 2021 Women’s National Basketball Association (WNBA) season. In basketball, players score points by shooting the ball into a hoop. The team that scores the most points wins the game.
Kelsey Plum, a WNBA player, attended La Jolla Country Day School,
which is adjacent to UCSD’s campus. Her current team is the Las Vegas
Aces (three-letter code 'LVA'
). In 2021, the Las
Vegas Aces played 31 games, and Kelsey Plum played in all
31.
The DataFrame plum
contains her stats for all games the
Las Vegas Aces played in 2021. The first few rows of plum
are shown below (though the full DataFrame has 31 rows, not 5):
Each row in plum corresponds to a single game. For each game, we have:
- 'Date' (str), the date on which the game was played
- 'Opp' (str), the three-letter code of the opponent team
- 'Home' (bool), True if the game was played in Las Vegas (“home”) and False if it was played at the opponent’s arena (“away”)
- 'Won' (bool), True if the Las Vegas Aces won the game and False if they lost
- 'PTS' (int), the number of points Kelsey Plum scored in the game
- 'AST' (int), the number of assists (passes) Kelsey Plum made in the game
- 'TOV' (int), the number of turnovers Kelsey Plum made in the game (a turnover is when you lose the ball – turnovers are bad!)
Consider the function unknown
, defined below.
def unknown(df):
    grouped = plum.groupby('Opp').max().get(['Date', 'PTS'])
    return np.array(grouped.reset_index().index)[df]
What does unknown(3)
evaluate to?
'2021-06-05'
'WAS'
The date on which Kelsey Plum scored the most points
The three-letter code of the opponent on which Kelsey Plum scored the most points
The number 0
The number 3
An error
Answer: The number 3
plum.groupby('Opp').max()
finds the largest value in the
'Date'
, 'Home'
, 'Won'
,
'PTS'
, 'AST'
, and 'TOV'
columns
for each unique 'Opp'
(independently for each column).
grouped = plum.groupby('Opp').max().get(['Date', 'PTS'])
keeps only the 'Date'
and 'PTS'
columns. Note
that in grouped
, the index is 'Opp'
, the
column we grouped on.
When grouped.reset_index()
is called, the index is
switched back to the default of 0, 1, 2, 3, 4, and so on. Then,
grouped.reset_index().index
is an Index
containing the numbers [0, 1, 2, 3, 4, ...]
, and
np.array(grouped.reset_index().index)
is
np.array([0, 1, 2, 3, 4, ...])
. In this array, the number
at position i
is just i
, so the number at
position df
is df
. Here, df
is
the argument to unknown
, and we were asked for the value of
unknown(3)
, so the correct answer is the number at position
3 in np.array([0, 1, 2, 3, 4, ...])
which is 3.
Note that if we asked for unknown(50)
(or
unknown(k)
, where k
is any integer above 30),
the answer would be “An error”, since grouped
could not
have had 51 rows. plum
has 31 rows, so grouped
has at most 31 rows (but likely less, since Kelsey Plum’s team likely
played the same opponent multiple times).
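To see the key step in isolation, here is a minimal sketch using a made-up five-element index, standing in for however many rows grouped actually has:

positions = np.array([0, 1, 2, 3, 4])  # what np.array(grouped.reset_index().index) looks like
positions[3]                           # evaluates to 3, just like unknown(3)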
The average score on this problem was 72%.
For your convenience, we show the first few rows of plum
again below.
Suppose that Plum’s team, the Las Vegas Aces, won at least one game in Las Vegas and lost at least one game in Las Vegas. Also, suppose they won at least one game in an opponent’s arena and lost at least one game in an opponent’s arena.
Consider the DataFrame home_won
, defined below.
home_won = plum.groupby(['Home', 'Won']).mean().reset_index()
How many rows does home_won
have?
How many columns does home_won
have?
Answer: 4 rows and 5 columns.
plum.groupby(['Home', 'Won']).mean()
contains one row
for every unique combination of 'Home'
and
'Won'
. There are two values of 'Home'
-
True
and False
– and two values of
'Won'
– True
and False
– leading
to 4 combinations. We can assume that there was at least one row in
plum
for each of these 4 combinations due to the assumption
given in the problem:
Suppose that Plum’s team, the Las Vegas Aces, won at least one game in Las Vegas and lost at least one game in Las Vegas. Also, suppose they won at least one game in an opponent’s arena and lost at least one game in an opponent’s arena.
plum
started with 7 columns: 'Date'
,
'Opp'
, 'Home'
, 'Won'
,
'PTS'
, 'AST'
, and 'TOV'
. After
grouping by ['Home', 'Won']
and using .mean()
,
'Home'
and 'Won'
become the index. The
resulting DataFrame contains all of the columns that the
.mean()
aggregation method can work on. We cannot take the
mean of 'Date'
and 'Opp'
, because those
columns are strings, so
plum.groupby(['Home', 'Won']).mean()
contains a
MultiIndex
with 2 “columns” – 'Home'
and
'Won'
– and 3 regular columns – 'PTS', 'AST', and 'TOV'
. Then, when using
.reset_index()
, 'Home'
and 'Won'
are restored as regular columns, meaning that
plum.groupby(['Home', 'Won']).mean().reset_index()
has
2 + 3 = 5 columns.
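A minimal sketch on a made-up four-game version of plum (one hypothetical game per combination of 'Home' and 'Won'; the values are invented, and we assume the same .mean() behavior described above, where string columns are dropped) confirms the shape:

plum_mini = bpd.DataFrame().assign(
    Date=['2021-05-15', '2021-05-18', '2021-05-21', '2021-05-23'],
    Opp=['SEA', 'MIN', 'PHO', 'CHI'],
    Home=[True, True, False, False],
    Won=[True, False, True, False],
    PTS=[20, 17, 25, 12],
    AST=[5, 3, 7, 2],
    TOV=[1, 4, 2, 3]
)
home_won = plum_mini.groupby(['Home', 'Won']).mean().reset_index()
# home_won.shape evaluates to (4, 5): one row per ('Home', 'Won') combination,
# and the string columns 'Date' and 'Opp' are dropped by .mean().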
The average score on this problem was 78%.
Consider the DataFrame home_won
once again.
home_won = plum.groupby(['Home', 'Won']).mean().reset_index()
Now consider the DataFrame puzzle
, defined below. Note
that the only difference between home_won
and
puzzle
is the use of .count()
instead of
.mean()
.
puzzle = plum.groupby(['Home', 'Won']).count().reset_index()
How do the number of rows and columns in home_won
compare to the number of rows and columns in puzzle
?
home_won
and puzzle
have the same number of
rows and columns
home_won
and puzzle
have the same number of
rows, but a different number of columns
home_won
and puzzle
have the same number of
columns, but a different number of rows
home_won
and puzzle
have both a different
number of rows and a different number of columns
Answer: home_won
and
puzzle
have the same number of rows, but a different number
of columns
All that changed between home_won
and
puzzle
is the aggregation method. The aggregation method
has no influence on the number of rows in the output DataFrame, as there
is still one row for each of the 4 unique combinations of
'Home'
and 'Won'
.
However, puzzle
has 7 columns, instead of 5. In the
solution to the above subpart, we noticed that we could not use
.mean()
on the 'Date'
and 'Opp'
columns, since they contained strings. However, we can use
.count()
(since .count()
just determines the
number of non-NA values in each group), and so the 'Date'
and 'Opp'
columns are not “lost” when aggregating. Hence,
puzzle
has 2 more columns than home_won
.
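Continuing the hypothetical plum_mini sketch from above, the difference in shape can be checked directly:

puzzle = plum_mini.groupby(['Home', 'Won']).count().reset_index()
# puzzle.shape evaluates to (4, 7): .count() keeps 'Date' and 'Opp',
# so puzzle has two more columns than home_won, but the same four rows.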
The average score on this problem was 85%.