These problems are taken from past quizzes and exams. Work on them
on paper, since the quizzes and exams you take in this
course will also be on paper.
We encourage you to complete these
problems during discussion section. Solutions will be made available
after all discussion sections have concluded. You don’t need to submit
your answers anywhere.
Note: We do not plan to cover all of
these problems during the discussion section; the problems we don’t
cover can be used for extra practice.
For the problems that follow, we will work with a dataset consisting
of various skyscrapers in the US, which we’ve loaded into a DataFrame
called sky
. The first few rows of sky
are
shown below (though the full DataFrame has more rows):
Each row of sky
corresponds to a single skyscraper. For
each skyscraper, we have:
its name, which is stored in the index of sky
(string)
the 'material'
it is made up of (string)
the 'city'
in the US where it is located
(string)
the number of 'floors'
(levels) it contains
(int)
its 'height'
in meters (float), and
the 'year'
in which it was opened (int)
Below, identify the data type of the result of each of the following expressions, or select “error” if you believe the expression results in an error.
sky.sort_values('height')
int or float
Boolean
string
array
Series
DataFrame
error
Answer: DataFrame
sky is a DataFrame. The sort_values method only changes the order of the rows in the Series or DataFrame it is called on; it does not change the data structure. As such, sky.sort_values('height') is also a DataFrame.
The average score on this problem was 87%.
sky.sort_values('height').get('material').loc[0]
int or float
Boolean
string
array
Series
DataFrame
error
Answer: error
sky.sort_values('height')
is a DataFrame, and
sky.sort_values('height').get('material')
is a Series
corresponding to the 'material'
column, sorted by
'height'
in increasing order. So far, there are no
errors.
Remember, the .loc
accessor is used to access
elements in a Series based on their index.
sky.sort_values('height').get('material').loc[0]
is asking
for the element in the
sky.sort_values('height').get('material')
Series with index
0. However, the index of sky
is made up of building names.
Since there is no building named 0
, .loc[0]
causes an error.
The average score on this problem was 79%.
sky.sort_values('height').get('material').iloc[0]
int or float
Boolean
string
array
Series
DataFrame
error
Answer: string
As we mentioned above,
sky.sort_values('height').get('material')
is a Series
containing values from the 'material'
column (but sorted).
Remember, there is no element in this Series with an index of 0, so
sky.sort_values('height').get('material').loc[0]
errors.
However, .iloc[0]
works differently than
.loc[0]
; .iloc[0]
will give us the first
element in a Series (independent of what’s in the index). So,
sky.sort_values('height').get('material').iloc[0]
gives us
back a value from the 'material'
column, which is made up
of strings, so it gives us a string. (Specifically, it gives us the
'material'
type of the skyscraper with the smallest
'height'
.)
The average score on this problem was 89%.
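The difference between .loc[0] and .iloc[0] can be checked on a small example. This is only a sketch: it uses pandas as a stand-in for babypandas, and the building names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical rows; only the structure mirrors the sky DataFrame.
sky = pd.DataFrame(
    {
        "material": ["steel", "concrete", "steel"],
        "height": [132.0, 44.0, 148.0],
    },
    index=["Building A", "Building B", "Building C"],
)

materials = sky.sort_values("height").get("material")

# .iloc is positional: position 0 is the shortest building after sorting.
first_material = materials.iloc[0]

# .loc is label-based: no building is labeled 0, so the lookup fails.
try:
    materials.loc[0]
    loc_failed = False
except KeyError:
    loc_failed = True
```

Here first_material is the string for the shortest hypothetical building, and loc_failed ends up True because 0 is not one of the labels in the index.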
sky.get('floors').max()
int or float
Boolean
string
array
Series
DataFrame
error
Answer: int or float
The Series sky.get('floors')
is made up of integers, and
sky.get('floors').max()
evaluates to the largest number in
the Series, which is also an integer.
The average score on this problem was 91%.
sky.index[0]
int or float
Boolean
string
array
Series
DataFrame
error
Answer: string
sky.index
contains the values
'Bayard-Condict Building'
,
'The Yacht Club at Portofino'
,
'City Investing Building'
, etc. sky.index[0]
is then 'Bayard-Condict Building'
, which is a string.
The average score on this problem was 91%.
Write a single line of code that evaluates to the name of the tallest
skyscraper in the sky
DataFrame.
Answer:
sky.sort_values(by='height', ascending=False).index[0]
To answer this question, we first sort the DataFrame by the column we are interested in, 'height'. Because we want the tallest building, we set the ascending parameter to False so that the heights appear in descending order, which gives sky.sort_values(by='height', ascending=False). After sorting, the tallest building is in the first row of the sorted DataFrame, so all that remains is to get its name, which is stored in the index. We access the index of a DataFrame with .index, and since we want its first element, we follow it with [0]. Putting it all together, the name of the tallest skyscraper in the sky DataFrame is
sky.sort_values(by='height', ascending=False).index[0]
.
Write a single line of code that evaluates to the average number of floors across all skyscrapers in the DataFrame.
Answer: sky.get('floors').mean()
To answer this question, we first need the number of floors in each skyscraper, which sky.get('floors') gives us as a Series. We then need the average of these values, which the .mean() method computes. Putting this all together, we get
sky.get('floors').mean()
.
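Both one-line answers (the tallest skyscraper and the average number of floors) can be tried on a tiny DataFrame. This sketch assumes pandas in place of babypandas, with made-up names and numbers.

```python
import pandas as pd

# Hypothetical data; the real sky DataFrame has more rows and columns.
sky = pd.DataFrame(
    {"floors": [30, 12, 41], "height": [132.0, 44.0, 148.0]},
    index=["Building A", "Building B", "Building C"],
)

# Sort tallest-first, then read the first label of the index.
tallest = sky.sort_values(by="height", ascending=False).index[0]

# Average of the 'floors' column.
avg_floors = sky.get("floors").mean()
```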
Suppose students
is a DataFrame of all students who took
DSC 10 last quarter. students
has one row per student,
where:
The index contains students’ PIDs as strings starting with
"A"
.
The "Overall"
column contains students’ overall
percentage grades as floats.
The "Animal"
column contains students’ favorite
animals as strings.
What type is students.get("Overall")? If this expression errors, select “this errors.”
float
string
array
Series
this errors
Answer: Series
The average score on this problem was 73%.
What type is students.get("PID")? If this expression errors, select “this errors.”
float
string
array
Series
this errors
Answer: this errors
The average score on this problem was 67%.
Vanessa is one student who took DSC 10 last quarter. Her PID is A12345678, she earned the sixth-highest overall percentage grade in the class, and her favorite animal is the giraffe.
Supposing that students
is already sorted by
"Overall"
in descending order, fill in the
blanks so that animal_one
and animal_two
both evaluate to "giraffe"
.
animal_one = students.get(__(x)__).loc[__(y)__]
animal_two = students.get(__(x)__).iloc[__(z)__]
Answer:
x
: "Animal"
y
: "A12345678"
z
: 5
The average score on this problem was 69%.
If students
wasn’t already sorted by
"Overall"
in descending order, which of your answers would
need to change?
Neither y
nor z
would need to change
Both y
and z
would need to change
y
only
z
only
Answer: z
only
The average score on this problem was 82%.
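To see why only z depends on the sort order, we can compare label-based and position-based lookups before and after sorting. This is a sketch using pandas in place of babypandas; the three students are hypothetical, and in this toy data the giraffe fan has the highest grade rather than the sixth-highest.

```python
import pandas as pd

# Hypothetical students, deliberately not sorted by "Overall".
students = pd.DataFrame(
    {"Overall": [71.0, 98.5, 88.0], "Animal": ["otter", "giraffe", "owl"]},
    index=["A00000001", "A12345678", "A00000003"],
)
by_grade = students.sort_values(by="Overall", ascending=False)

# .loc looks up by PID, so the row order does not matter.
loc_unsorted = students.get("Animal").loc["A12345678"]
loc_sorted = by_grade.get("Animal").loc["A12345678"]

# .iloc looks up by position, so the answer changes with the row order.
iloc_unsorted = students.get("Animal").iloc[0]
iloc_sorted = by_grade.get("Animal").iloc[0]
```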
You are given a DataFrame called sports
, indexed by
'Sport'
containing one column,
'PlayersPerTeam'
. The first few rows of the DataFrame are
shown below:
Sport | PlayersPerTeam |
---|---|
baseball | 9 |
basketball | 5 |
field hockey | 11 |
Which of the following evaluates to
'basketball'
?
sports.loc[1]
sports.iloc[1]
sports.index[1]
sports.get('Sport').iloc[1]
Answer: sports.index[1]
We are told that the DataFrame is indexed by 'Sport'
and
'basketball'
is one of the elements of the index. To access
an element of the index, we use .index
to extract the index
and square brackets to extract an element at a certain position.
Therefore, sports.index[1]
will evaluate to
'basketball'
.
The first two answer choices attempt to use .loc
or
.iloc
directly on a DataFrame. We typically use
.loc
or .iloc
on a Series that results from
using .get
on some column. Although we don’t typically do
it this way, it is possible to use .loc
or
.iloc
directly on a DataFrame, but doing so would produce
an entire row of the DataFrame. Since we want just one word,
'basketball'
, the first two answer choices must be
incorrect.
The last answer choice is incorrect because we can’t use
.get
with the index, only with a column. The index is never
considered a column.
The average score on this problem was 88%.
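The three rows shown in the problem are enough to check this reasoning. The sketch below uses pandas rather than babypandas, so the behavior of .iloc on a DataFrame is an assumption about pandas specifically.

```python
import pandas as pd

# The first few rows from the problem, indexed by 'Sport'.
sports = pd.DataFrame(
    {"PlayersPerTeam": [9, 5, 11]},
    index=pd.Index(["baseball", "basketball", "field hockey"], name="Sport"),
)

# Position 1 of the index is a single string.
name = sports.index[1]

# .iloc on the DataFrame gives the whole row (a Series in pandas).
row = sports.iloc[1]
```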
Suppose you are given a DataFrame of employees for a given company.
The DataFrame, called employees
, is indexed by
'employee_id'
(string) with a column called
'years'
(int) that contains the number of years each
employee has worked for the company.
Suppose that the code
employees.sort_values(by='years', ascending=False).index[0]
outputs '2476'
.
True or False: The number of years that employee 2476 has worked for the company is greater than the number of years that any other employee has worked for the company.
True
False
Answer: False
This is false because there could be other employees who have worked at the company for just as long as employee 2476.
The code says that when the employees
DataFrame is
sorted in descending order of 'years'
, employee 2476 is in
the first row. There might, however, be a tie among several employees
for their value of 'years'
. In that case, employee 2476 may
wind up in the first row of the sorted DataFrame, but we cannot say that
the number of years employee 2476 has worked for the company is greater
than the number of years that any other employee has worked for the
company.
If the statement had said greater than or equal to instead of greater than, the statement would have been true.
The average score on this problem was 29%.
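A small example with a tie makes the issue concrete. This sketch uses pandas in place of babypandas, and the employee IDs other than '2476' and all values of 'years' are made up.

```python
import pandas as pd

# Hypothetical data in which two employees are tied for the most years.
employees = pd.DataFrame(
    {"years": [12, 12, 3]},
    index=pd.Index(["2476", "1833", "0921"], name="employee_id"),
)

# The first label after sorting is one of the tied employees.
top_id = employees.sort_values(by="years", ascending=False).index[0]

# Count how many employees share the largest value of 'years'.
longest = employees.get("years").max()
num_tied = employees[employees.get("years") == longest].shape[0]
```

Since num_tied is 2 here, whichever employee lands in the first row has not worked strictly longer than every other employee.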
What will be the output of the following code?
employees.assign(start=2021-employees.get('years'))
employees.sort_values(by='start').index.iloc[-1]
the employee id of an employee who has worked there for the most years
the employee id of an employee who has worked there for the fewest years
an error message complaining about iloc[-1]
an error message complaining about something else
Answer: an error message complaining about something else
The problem is that the first line of code does not actually add a
new column to the employees
DataFrame because the
expression is not saved. So the second line tries to sort by a column,
'start'
, that doesn’t exist in the employees
DataFrame and runs into an error when it can’t find a column by that
name.
This code also has a problem with iloc[-1]
, since
iloc
cannot be used on the index, but since the problem
with the missing 'start'
column is encountered first, that
will be the error message displayed.
The average score on this problem was 27%.
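The sketch below reproduces the first problem and shows one way the code could be repaired. It uses pandas in place of babypandas, with made-up employee data; the variable names with_start and latest_starter are ours, not part of the original problem.

```python
import pandas as pd

# Hypothetical employees.
employees = pd.DataFrame({"years": [12, 7, 3]}, index=["2476", "1833", "0921"])

# assign returns a new DataFrame; without saving it, employees is unchanged.
employees.assign(start=2021 - employees.get("years"))
has_start = "start" in employees.columns

# Sorting by the missing column fails before .index is ever reached.
try:
    employees.sort_values(by="start")
    sort_failed = False
except KeyError:
    sort_failed = True

# A repaired version: save the result, and index the index with [-1].
with_start = employees.assign(start=2021 - employees.get("years"))
latest_starter = with_start.sort_values(by="start").index[-1]
```

With the repair, latest_starter is the ID of an employee with the latest start year, that is, one who has worked there for the fewest years.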
Suppose flower_data
is a DataFrame with information on
different species of flowers, where:
The "species"
column contains the name of the
species of flower, as a string. Each value in this column is
unique.
The "petals"
column contains the average number of
petals of flowers of this species, as an int
.
The "length"
column contains the average stem length
of flowers of this species in inches, as a float
.
One of these three columns is a good choice to use as the index of
this DataFrame. Write a line of code that sets this column as the index
of flower_data
, and assigns the resulting DataFrame to the
variable flowers
.
Answer:
flowers = flower_data.set_index("species")
The average score on this problem was 79%.
Important: The following questions will use
flowers
instead of flower_data
.
Which of the following expressions evaluates to a DataFrame that is
sorted by "petals"
in descending order?
flowers.sort_values(by = "petals", ascending = True)
flowers.sort_values(by = "petals", ascending = False)
flowers.get("petals").sort_values(ascending = True)
flowers.get("petals").sort_values(ascending = False)
Answer: Option B, flowers.sort_values(by = "petals", ascending = False)
The average score on this problem was 83%.
Suppose that the 4th row of flowers
corresponds to a
rare species of flower named "fire lily"
. Fill in the
blanks below so that both of these expressions evaluate to the stem
length in inches of "fire lily"
.
i. flowers.get("length").loc[__(x)__]
ii. flowers.get("length").iloc[__(y)__]
Answer: (x): "fire lily"
, (y):
3
The average score on this problem was 83%.
Suppose that the 3rd row of flowers
corresponds to the
species "stinking corpse lily"
. Using the
flowers
DataFrame and the string method
.split()
, write an expression that evaluates to
"corpse"
.
Answer:
flowers.index[2].split(" ")[1]
The average score on this problem was 46%.
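The answers to this group of questions can be checked together on a toy version of flower_data. This is a sketch using pandas in place of babypandas; the petal counts and stem lengths are invented, and only the row positions of the two named species follow the problem.

```python
import pandas as pd

# Hypothetical values; "stinking corpse lily" is row 3 and "fire lily" is row 4.
flower_data = pd.DataFrame(
    {
        "species": ["daisy", "tulip", "stinking corpse lily", "fire lily"],
        "petals": [34, 6, 5, 6],
        "length": [4.0, 14.5, 0.5, 30.0],
    }
)

# The unique string column becomes the index.
flowers = flower_data.set_index("species")

# Label-based and position-based lookups of the same stem length.
by_label = flowers.get("length").loc["fire lily"]
by_position = flowers.get("length").iloc[3]

# The 3rd label of the index, split on spaces, then its middle word.
middle_word = flowers.index[2].split(" ")[1]
```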
An art museum records information about its collection in a DataFrame
called art
. The columns of art
are as
follows:
"title" (str): the name of the art piece.
"artist" (str): the name of the artist.
"year" (int): the year the art piece was produced.
"price" (float): the selling price of the art piece in dollars.
Write an expression that evaluates to the number of art pieces made in 1950 that cost less than $10,000.
Answer:
art[(art.get("year") == 1950) & (art.get("price") < 10000)].shape[0]
The average score on this problem was 72%.
The laptops
DataFrame contains information on various
factors that influence the pricing of laptops. Each row represents a
laptop, and the columns are:
"Mfr" (str): the company that manufactures the laptop, like “Apple” or “Dell”.
"Model" (str): the model name of the laptop, such as “MacBook Pro”.
"OS" (str): the operating system, such as “macOS” or “Windows 11”.
"Screen Size" (float): the diagonal length of the screen, in inches.
"Price" (float): the price of the laptop, in dollars.
Fill in the blanks so that rotten_apple
evaluates to the
number of laptops manufactured by "Apple"
that are priced
below the median price of all laptops.
x = __(a)__
y = __(b)__
rotten_apple = laptops[x __(c)__ y].__(d)__
Note: (a) and (b) are interchangeable
Answer (a):
laptops.get("Mfr") == "Apple"
The average score on this problem was 71%.
Answer (b):
laptops.get("Price") < laptops.get("Price").median()
The average score on this problem was 71%.
Answer (c): &
The average score on this problem was 43%.
Answer (d): shape[0]
The average score on this problem was 43%.
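Filling in the four blanks gives the code below, run here on a made-up laptops DataFrame. This is a sketch using pandas in place of babypandas, with only the two columns the query needs.

```python
import pandas as pd

# Hypothetical laptops; the median of these five prices is 1100.0.
laptops = pd.DataFrame(
    {
        "Mfr": ["Apple", "Apple", "Dell", "Dell", "Lenovo"],
        "Price": [999.0, 2499.0, 650.0, 1200.0, 1100.0],
    }
)

x = laptops.get("Mfr") == "Apple"                          # Boolean Series
y = laptops.get("Price") < laptops.get("Price").median()   # Boolean Series

# & combines the two conditions element-wise; shape[0] counts the rows kept.
rotten_apple = laptops[x & y].shape[0]
```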
The DataFrame items
describes various items available to
collect or purchase using bells, the currency used in the game
Animal Crossing: New Horizons.
For each item, we have:
"Item" (str): The name of the item.
"Cost" (int): The cost of the item in bells. Items that cost 0 bells cannot be purchased and must be collected through other means (such as crafting).
"Location" (str): The store or action through which the item can be obtained.
The first 6 rows of items are below, though items has more rows than are shown here.
Fill in the blanks so that count_1
and
count_2
both evaluate to the number of items in
items
with a "Cost"
of 0.
count_1 = items.groupby(__(a)__).__(b)__().get("Item").loc[__(c)__]
count_2 = items[__(d)__].shape[0]
Answer:
a
: "Cost"
b
: count
c
: 0
d
: items.get("Cost") == 0
The average score on this problem was 81%.
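With the blanks filled in, both approaches can be compared on a small example. This sketch uses pandas in place of babypandas, and the item names, costs, and locations are placeholders rather than real game data.

```python
import pandas as pd

# Hypothetical items; two of them cost 0 bells.
items = pd.DataFrame(
    {
        "Item": ["item a", "item b", "item c", "item d"],
        "Cost": [0, 490, 0, 1200],
        "Location": ["Crafting", "Shop", "Crafting", "Shop"],
    }
)

# Grouping by "Cost" makes the costs the index, so .loc[0] means "cost 0".
count_1 = items.groupby("Cost").count().get("Item").loc[0]

# Querying keeps only the rows with a cost of 0, then counts them.
count_2 = items[items.get("Cost") == 0].shape[0]
```

Note that .loc[0] works here only because 0 is a label in the index of the grouped DataFrame, not because it is the first position.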
You are given a DataFrame called books
that contains
columns 'author'
(string), 'title'
(string),
'num_chapters'
(int), and 'publication_year'
(int).
Suppose that after doing books.groupby('author').max(),
one row says
author | title | num_chapters | publication_year |
---|---|---|---|
Charles Dickens | Oliver Twist | 53 | 1838 |
Based on this data, can you conclude that Charles Dickens is the alphabetically last of all author names in this dataset?
Yes
No
Answer: No
When we group by 'author', all books by the same author get aggregated together into a single row. The aggregation function is applied separately to each column other than the one we’re grouping by. Since we’re grouping by 'author' here, the 'author' column never has the max() function applied to it. Instead, each unique value in the 'author' column becomes a value in the index of the grouped DataFrame. We are told that the Charles Dickens row is just one row of the output, but we don’t know anything about the other rows of the output, or the other authors. We can’t say anything about where Charles Dickens falls when authors are ordered alphabetically (but it’s probably not last!).
The average score on this problem was 94%.
Based on this data, can you conclude that Charles Dickens wrote Oliver Twist?
Yes
No
Answer: Yes
Grouping by 'author'
collapses all books written by the
same author into a single row. Since we’re applying the
max()
function to aggregate these books, we can conclude
that Oliver Twist is alphabetically last among all books in the
books
DataFrame written by Charles Dickens. So Charles
Dickens did write Oliver Twist based on this data.
The average score on this problem was 95%.
Based on this data, can you conclude that Oliver Twist has 53 chapters?
Yes
No
Answer: No
The key to this problem is that groupby
applies the
aggregation function, max()
in this case, independently to
each column. The output should be interpreted as follows:
Among all books written by Charles Dickens, Oliver Twist is the title that is alphabetically last.
Among all books written by Charles Dickens, 53 is the greatest number of chapters.
Among all books written by Charles Dickens, 1838 is the latest year of publication.
However, the book titled Oliver Twist, the book with 53 chapters, and the book published in 1838 are not necessarily all the same book. We cannot conclude, based on this data, that Oliver Twist has 53 chapters.
The average score on this problem was 74%.
Based on this data, can you conclude that Charles Dickens wrote a book with 53 chapters that was published in 1838?
Yes
No
Answer: No
As explained in the previous question, the max()
function is applied separately to each column, so the book written by
Charles Dickens with 53 chapters may not be the same book as the book
written by Charles Dickens published in 1838.
The average score on this problem was 73%.
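The fact that max() is applied to each column independently is easy to see on a toy example. This sketch uses pandas in place of babypandas, with entirely hypothetical authors, titles, and numbers chosen so that the aggregated row mixes values from three different books.

```python
import pandas as pd

# Hypothetical books; "Author X" has three books with different maxima.
books = pd.DataFrame(
    {
        "author": ["Author X", "Author X", "Author X", "Author Y"],
        "title": ["Zeta", "Alpha", "Midway", "Only Book"],
        "num_chapters": [10, 53, 20, 8],
        "publication_year": [1830, 1820, 1838, 1900],
    }
)

# One row per author; each column holds that column's own maximum.
grouped = books.groupby("author").max()
row = grouped.loc["Author X"]

# Look for a single book with both the max title and the max chapter count.
same_book = books[
    (books.get("title") == row.get("title"))
    & (books.get("num_chapters") == row.get("num_chapters"))
]
```

The aggregated row for "Author X" reads Zeta, 53, 1838, yet same_book is empty: no single book has that title and that many chapters.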
King Triton, UCSD’s mascot, is quite the traveler! For this question,
we will be working with the flights
DataFrame, which
details several facts about each of the flights that King Triton has
been on over the past few years. The first few rows of
flights
are shown below.
Here’s a description of the columns in flights
:
'DATE': the date on which the flight occurred. Assume that there were no “redeye” flights that spanned multiple days.
'FLIGHT': the flight number. Note that this is not unique; airlines reuse flight numbers on a daily basis.
'FROM' and 'TO': the 3-letter airport code for the departure and arrival airports, respectively. Note that it’s not possible to have a flight from and to the same airport.
'DIST': the distance of the flight, in miles.
'HOURS': the length of the flight, in hours.
'SEAT': the kind of seat King Triton sat in on the flight; the only possible values are 'WINDOW', 'MIDDLE', and 'AISLE'.
Which of these correctly evaluates to the number of flights King Triton took to San Diego (airport code 'SAN')?
flights.loc['SAN'].shape[0]
flights[flights.get('TO') == 'SAN'].shape[0]
flights[flights.get('TO') == 'SAN'].shape[1]
len(flights.sort_values('TO', ascending=False).loc['SAN'])
Answer:
flights[flights.get('TO') == 'SAN'].shape[0]
The strategy is to create a DataFrame with only the flights that went
to San Diego, then count the number of rows. The first step is to query
with the condition flights.get('TO') == 'SAN'
and the
second step is to extract the number of rows with
.shape[0]
.
Some of the other answer choices use .loc['SAN']
but
.loc
only works with the index, and flights
does not have airport codes in its index.
The average score on this problem was 95%.
Fill in the blanks below so that the result also evaluates to the
number of flights King Triton took to San Diego (airport code
'SAN'
).
flights.groupby(__(a)__).count().get('FLIGHT').__(b)__
What goes in blank (a)?
'DATE'
'FLIGHT'
'FROM'
'TO'
What goes in blank (b)?
.index[0]
.index[-1]
.loc['SAN']
.iloc['SAN']
.iloc[0]
True or False: If we change .get('FLIGHT')
to
.get('SEAT')
, the results of the above code block will not
change. (You may assume you answered the previous two subparts
correctly.)
True
False
Answer: 'TO'
,
.loc['SAN']
, True
The strategy here is to group all of King Triton’s flights according
to where they landed, and count up the number that landed in San Diego.
The expression flights.groupby('TO').count()
evaluates to a
DataFrame indexed by arrival airport where, for any arrival airport,
each column has a count of the number of King Triton’s flights that
landed at that airport. To get the count for San Diego, we need the
entry in any column for the row corresponding to San Diego. The code
.get('FLIGHT')
says we’ll use the 'FLIGHT'
column, but any other column would be equivalent. To access the entry of
this column corresponding to San Diego, we have to use .loc
because we know the name of the value in the index should be
'SAN'
, but we don’t know the row number or integer
position.
The average score on this problem was 89%.
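The query-based and groupby-based counts can be compared directly. This sketch uses pandas in place of babypandas, and the flights below are invented; only the column names come from the problem.

```python
import pandas as pd

# Hypothetical flights; three of them land at 'SAN'.
flights = pd.DataFrame(
    {
        "FLIGHT": ["AA 100", "UA 200", "DL 300", "AA 100"],
        "FROM": ["LAX", "SFO", "SAN", "JFK"],
        "TO": ["SAN", "SAN", "LAX", "SAN"],
        "SEAT": ["WINDOW", "AISLE", "MIDDLE", "WINDOW"],
    }
)

# Querying approach from the previous part.
by_query = flights[flights.get("TO") == "SAN"].shape[0]

# Grouping approach: one row per arrival airport, each column holding counts.
counts = flights.groupby("TO").count()
by_group_flight = counts.get("FLIGHT").loc["SAN"]
by_group_seat = counts.get("SEAT").loc["SAN"]
```

All three values agree, which matches the claim that any column of the grouped DataFrame could be used.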
Consider the DataFrame san
, defined below.
san = flights[(flights.get('FROM') == 'SAN') & (flights.get('TO') == 'SAN')]
Which of these DataFrames must have the same number
of rows as san
?
flights[(flights.get('FROM') == 'SAN') and (flights.get('TO') == 'SAN')]
flights[(flights.get('FROM') == 'SAN') | (flights.get('TO') == 'SAN')]
flights[(flights.get('FROM') == 'LAX') & (flights.get('TO') == 'SAN')]
flights[(flights.get('FROM') == 'LAX') & (flights.get('TO') == 'LAX')]
Answer:
flights[(flights.get('FROM') == 'LAX') & (flights.get('TO') == 'LAX')]
The DataFrame san
contains all rows of
flights
that have a departure airport of 'SAN'
and an arrival airport of 'SAN'
. But as you may know, and
as you’re told in the data description, there are no flights from an
airport to itself. So san
is actually an empty DataFrame
with no rows!
We just need to find which of the other DataFrames would necessarily
be empty, and we can see that
flights[(flights.get('FROM') == 'LAX') & (flights.get('TO') == 'LAX')]
will be empty for the same reason.
Note that none of the other answer choices are correct. The first
option uses the Python keyword and
instead of the symbol
&
, which behaves unexpectedly but does not give an
empty DataFrame. The second option will be non-empty because it will
contain all flights that have San Diego as the departure airport or
arrival airport, and we already know from the first few rows of
flights
that there are some of these. The third option will
contain all the flights that King Triton has taken from
'LAX'
to 'SAN'
. Perhaps he’s never flown this
route, or perhaps he has. This DataFrame could be empty, but it’s not
necessarily going to be empty, as the question requires.
The average score on this problem was 70%.
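The four answer choices can be tried on a small made-up flights DataFrame. This is a sketch in plain pandas, not babypandas; in particular, the ValueError raised by the and keyword is how pandas behaves, and babypandas may respond differently.

```python
import pandas as pd

# Hypothetical flights; none goes from an airport to itself.
flights = pd.DataFrame(
    {
        "FROM": ["LAX", "SFO", "SAN", "JFK"],
        "TO": ["SAN", "SAN", "LAX", "SAN"],
    }
)

from_san = flights.get("FROM") == "SAN"
to_san = flights.get("TO") == "SAN"
from_lax = flights.get("FROM") == "LAX"
to_lax = flights.get("TO") == "LAX"

san_rows = flights[from_san & to_san].shape[0]         # impossible route
either_rows = flights[from_san | to_san].shape[0]      # from or to San Diego
lax_to_san_rows = flights[from_lax & to_san].shape[0]  # depends on the data
lax_lax_rows = flights[from_lax & to_lax].shape[0]     # impossible route

# In pandas, `and` asks for a single truth value of a whole Series and fails.
try:
    flights[from_san and to_san]
    and_raised = False
except ValueError:
    and_raised = True
```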
In September 2020, Governor Gavin Newsom announced that by 2035, all new vehicles sold in California must be zero-emissions vehicles. Electric vehicles (EVs) are among the most popular zero-emissions vehicles (though other examples include plug-in hybrids and hydrogen fuel cell vehicles).
The DataFrame evs
consists of 32 rows,
each of which contains information about a different EV model.
"Brand" (str): The vehicle’s manufacturer.
"Model" (str): The vehicle’s model name.
"BodyStyle" (str): The vehicle’s body style.
"Seats" (int): The vehicle’s number of seats.
"TopSpeed" (int): The vehicle’s top speed, in kilometers per hour.
"Range" (int): The vehicle’s range, or distance it can travel on a single charge, in kilometers.
The first few rows of evs are shown below (though remember, evs has 32 rows total).
Assume that:
The only values in the "Brand" column are "Tesla", "BMW", "Audi", and "Nissan".
We have already run import babypandas as bpd and import numpy as np.
Suppose we’ve run the following line of code.
counts = evs.groupby("Brand").count()
What value does counts.get("Range").sum()
evaluate
to?
Answer: 32
counts
is a DataFrame with one row per
"Brand"
, since we grouped by "Brand"
. Since we
used the .count()
aggregation method, the columns in
counts
will all be the same – they will all contain the
number of rows in evs
for each "Brand"
(i.e. they will all contain the distribution of "Brand"
).
If we sum up the values in any one of the columns in
counts
, then, the result will be the total number of rows
in evs
, which we know to be 32. Thus,
counts.get("Range").sum()
is 32.
The average score on this problem was 56%.
What value does counts.index[3]
evaluate to?
Answer: "Tesla"
Since we grouped by "Brand"
to create
counts
, the index of counts
will be
"Brand"
, sorted alphabetically (this sorting happens
automatically when grouping). This means that counts.index
will be the array-like sequence
["Audi", "BMW", "Nissan", "Tesla"]
, and
counts.index[3]
is "Tesla"
.
The average score on this problem was 33%.
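Both facts about counts (its columns sum to the number of rows, and its index is sorted alphabetically) can be verified on a smaller stand-in for evs. This sketch uses pandas in place of babypandas, with 6 made-up rows instead of 32.

```python
import pandas as pd

# Hypothetical EVs using the four brands from the problem.
evs = pd.DataFrame(
    {
        "Brand": ["Tesla", "BMW", "Audi", "Nissan", "Tesla", "Audi"],
        "Range": [450, 360, 400, 270, 505, 425],
    }
)

counts = evs.groupby("Brand").count()

# The per-brand counts add up to the total number of rows in evs.
total = counts.get("Range").sum()

# Grouping sorts the group labels, so the index is in alphabetical order.
brands = list(counts.index)
last_brand = counts.index[3]
```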
Consider the following incomplete assignment statement.
result = evs______.mean()
In each part, fill in the blank above so that result evaluates to the specified quantity.
A DataFrame, indexed by "Brand"
, whose
"Seats"
column contains the average number of
"Seats"
per "Brand"
. (The DataFrame may have
other columns in it as well.)
Answer: .groupby("Brand")
When we group by a column, the resulting DataFrame contains one row
for every unique value in that column. The question specified that we
wanted some information per "Brand"
, which implies
that grouping by "Brand"
is necessary.
After grouping, we need to use an aggregation method. Here, we wanted
the resulting DataFrame to be such that the "Seats"
column
contained the average number of "Seats"
per
"Brand"
; this is accomplished by using
.mean()
, which is already done for us.
Note: With the provided solution, the resulting DataFrame also has
other columns. For instance, it has a "Range"
column that
contains the average "Range"
for each "Brand"
.
That’s fine, since we were told that the resulting DataFrame may have
other columns in it as well. If we wanted to ensure that the only column
in the resulting DataFrame was "Seats"
, we could have used
.get(["Brand", "Seats"])
before grouping, though this was
not necessary.
The average score on this problem was 76%.
A number, corresponding to the average "TopSpeed"
of all
EVs manufactured by Audi in evs
Answer:
[evs.get("Brand") == "Audi"].get("TopSpeed")
There are two parts to this problem:
Querying, to make sure that we only keep the rows corresponding to Audis. This is accomplished by:
Using evs.get("Brand") == "Audi" to create a Boolean Series, with Trues for the rows we want to keep and Falses for the other rows.
Keeping only the rows of evs for which this Boolean Series is True. This is accomplished by evs[evs.get("Brand") == "Audi"] (though the evs part at the front was already provided).
Accessing the "TopSpeed" column. This is accomplished by using .get("TopSpeed").
Then, evs[evs.get("Brand") == "Audi"].get("TopSpeed") is a Series containing the "TopSpeed"s of all Audis, and the mean of this Series is the result we’re looking for. The call to .mean() was already provided for us.
The average score on this problem was 77%.
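Both fill-in-the-blank answers can be run on a small stand-in for evs. This sketch uses pandas in place of babypandas and keeps only numeric columns besides "Brand", since recent pandas versions raise an error when .mean() is applied to string columns after grouping; all values are made up.

```python
import pandas as pd

# Hypothetical EVs; the two Audis have top speeds of 200 and 210.
evs = pd.DataFrame(
    {
        "Brand": ["Tesla", "BMW", "Audi", "Nissan", "Tesla", "Audi"],
        "Seats": [5, 5, 5, 5, 7, 5],
        "TopSpeed": [233, 180, 200, 144, 250, 210],
    }
)

# First part: a DataFrame indexed by "Brand" with per-brand averages.
per_brand = evs.groupby("Brand").mean()
tesla_seats = per_brand.get("Seats").loc["Tesla"]

# Second part: query for Audis, get "TopSpeed", then average.
audi_speed = evs[evs.get("Brand") == "Audi"].get("TopSpeed").mean()
```

The same Audi average also appears in per_brand, in the "TopSpeed" column of the "Audi" row.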