The problems in this worksheet are taken from past exams. Work on
them on paper, since the exams you take in this course
will also be on paper.
We encourage you to complete this
worksheet in a live discussion section. Solutions will be made available
after all discussion sections have concluded. You don’t need to submit
your answers anywhere.
Note: We do not plan to cover all
problems here in the live discussion section; the problems we don’t
cover can be used for extra practice.
IKEA is a Swedish furniture company that designs and sells ready-to-assemble furniture and other home furnishings.
An IKEA fan created an app where people can log the amount of time it
took them to assemble their IKEA furniture. The DataFrame app_data
has a row for each product build that was logged on the app. The columns are:
'product' (str): the name of the product, which includes the product line as the first word, followed by a description of the product
'category' (str): a categorical description of the type of product
'assembly_time' (str): the amount of time to assemble the product, formatted as 'x hr, y min' where x and y represent integers, possibly zero
The first few rows of app_data are shown below, though app_data has many more rows than pictured (5000 rows total).
Assume that we have already run import babypandas as bpd and import numpy as np.
Suppose that when someone downloads the app, the app requires
them to choose a username, which must be different from all other
registered usernames.
True or False: If app_data had included
a column with the username of the person who reported each product
build, it would make sense to index app_data by username.
True
False
Answer: False
Even though people must have distinct usernames, one person can build
multiple different IKEA products and log their time for each build. So
we don’t expect every row of app_data to have a distinct
username associated with it, and therefore username would not be
suitable as an index, since the index should have distinct values.
The average score on this problem was 52%.
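To make the reasoning above concrete, here is a minimal sketch. It uses pandas as a stand-in for babypandas, and the usernames and rows are made up for illustration (app_data_demo is a hypothetical name, not the real app_data). The is_unique attribute is a pandas feature and is only used here to check for repeats.

```python
import pandas as pd  # stand-in for babypandas (assumption)

# Hypothetical logged builds: one user logs two different products.
app_data_demo = pd.DataFrame({
    'username': ['flatpack_fan', 'allen_key', 'flatpack_fan'],
    'product': ['BILLY Bookcase', 'MALM Bed frame', 'KALLAX Shelf unit'],
})

# Usernames are unique per account, but not per logged build (row).
usernames_are_distinct = app_data_demo.get('username').is_unique
print(usernames_are_distinct)
```

Because 'flatpack_fan' appears in two rows, the usernames are not distinct across rows, which is why username would be a poor index here.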
Assume you have a DataFrame named ikea that contains
information about IKEA products, including columns called
'product' (str): the name of the product,
'assembly_cost' (int): the assembly cost of each product, and
'packages' (int): the number of packages each product comes in.
Complete the expression below so that
it evaluates to the name of the product for which the average assembly
cost per package is lowest.
(ikea.assign(assembly_per_package = ___(a)___)
.sort_values(by='assembly_per_package').___(b)___)
What goes in blank (a)?
Answer:
ikea.get('assembly_cost')/ikea.get('packages')
This column, as its name suggests, contains the average assembly cost per package, obtained by dividing the total assembly cost of each product by the number of packages that product comes in. This code uses the fact that arithmetic operations between two Series happen element-wise.
The average score on this problem was 91%.
What goes in blank (b)?
Answer: get('product').iloc[0]
After adding the 'assembly_per_package' column and
sorting by that column in the default ascending order, the product with
the lowest 'assembly_per_package' will be in the very first
row. To access the name of that product, we need to get the
column containing product names and use iloc to access an
element of that Series by integer position.
The average score on this problem was 66%.
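Putting blanks (a) and (b) together, the full expression can be tried on a small example. This is a sketch that uses pandas as a stand-in for babypandas. The DataFrame ikea_demo and all of its values are invented for illustration.

```python
import pandas as pd  # stand-in for babypandas (assumption)

# Hypothetical products; costs and package counts are made up.
ikea_demo = pd.DataFrame({
    'product': ['BILLY Bookcase', 'MALM Bed frame', 'LACK Side table'],
    'assembly_cost': [40, 90, 10],
    'packages': [2, 3, 1],
})

# Element-wise division gives cost per package: 20.0, 30.0, 10.0.
cheapest = (ikea_demo.assign(assembly_per_package =
                             ikea_demo.get('assembly_cost') / ikea_demo.get('packages'))
            .sort_values(by='assembly_per_package')  # ascending by default
            .get('product')
            .iloc[0])                                # first row after sorting
print(cheapest)  # LACK Side table
```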
For this question, we will work with a dataset consisting of various
skyscrapers in the US, which we’ve loaded into a DataFrame called
sky. The first few rows of sky are shown below
(though the full DataFrame has more rows):
Each row of sky corresponds to a single skyscraper. For
each skyscraper, we have:
its name, which is stored in the index of sky (string)
the 'material' it is made up of (string)
the 'city' in the US where it is located (string)
the number of 'floors' (levels) it contains (int)
its 'height' in meters (float), and
the 'year' in which it was opened (int)
Note that the height of a floor may be different in each building.
Below, identify the data type of the result of each of the following expressions, or select “error” if you believe the expression results in an error.
sky.sort_values('height')
int or float
Boolean
string
array
Series
DataFrame
error
Answer: DataFrame
sky is a DataFrame. The sort_values method only changes the order of
the rows in the Series or DataFrame it is called on; it does not change
the data structure. As such, sky.sort_values('height') is also a DataFrame.
The average score on this problem was 87%.
sky.sort_values('height').get('material').loc[0]
int or float
Boolean
string
array
Series
DataFrame
error
Answer: error
sky.sort_values('height') is a DataFrame, and
sky.sort_values('height').get('material') is a Series
corresponding to the 'material' column, sorted by
'height' in increasing order. So far, there are no errors.
Remember, the .loc accessor is used to access
elements in a Series based on their index.
sky.sort_values('height').get('material').loc[0] is asking
for the element in the sky.sort_values('height').get('material')
Series with index 0. However, the index of sky is made up of building names.
Since there is no building named 0, .loc[0] causes an error.
The average score on this problem was 79%.
sky.sort_values('height').get('material').iloc[0]
int or float
Boolean
string
array
Series
DataFrame
error
Answer: string
As we mentioned above, sky.sort_values('height').get('material')
is a Series containing values from the 'material' column (but sorted).
Remember, there is no element in this Series with an index of 0, so
sky.sort_values('height').get('material').loc[0] errors.
However, .iloc[0] works differently than .loc[0]; .iloc[0] will give us
the first element in a Series (independent of what’s in the index). So,
sky.sort_values('height').get('material').iloc[0] gives us
back a value from the 'material' column, which is made up
of strings, so it gives us a string. (Specifically, it gives us the
'material' type of the skyscraper with the smallest 'height'.)
The average score on this problem was 89%.
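The contrast between .loc[0] and .iloc[0] in the last two parts can be checked on a tiny example. This sketch uses pandas as a stand-in for babypandas. The building names, materials, and heights in sky_demo are invented and are not taken from the real sky DataFrame.

```python
import pandas as pd  # stand-in for babypandas (assumption)

# Hypothetical skyscrapers, indexed by (made-up) building name.
sky_demo = pd.DataFrame(
    {'material': ['steel', 'concrete', 'steel'],
     'height': [40.0, 28.5, 45.7]},
    index=['Building A', 'Building B', 'Building C'],
)

materials = sky_demo.sort_values('height').get('material')

# .iloc[0] is positional: the material of the shortest building.
first_material = materials.iloc[0]

# .loc[0] looks for the index label 0, and no building is named 0.
try:
    materials.loc[0]
    loc_failed = False
except KeyError:
    loc_failed = True

print(first_material, loc_failed)  # concrete True
```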
sky.get('floors').max()
int or float
Boolean
string
array
Series
DataFrame
error
Answer: int or float
The Series sky.get('floors') is made up of integers, and
sky.get('floors').max() evaluates to the largest number in
the Series, which is also an integer.
The average score on this problem was 91%.
sky.index[0]
int or float
Boolean
string
array
Series
DataFrame
error
Answer: string
sky.index contains the values 'Bayard-Condict Building',
'The Yacht Club at Portofino', 'City Investing Building', etc.
sky.index[0] is then 'Bayard-Condict Building', which is a string.
The average score on this problem was 91%.
Included is a DataFrame named sungod that contains
information on the artists who have performed at Sun God in years past.
For each year that the festival was held, we have one row for
each artist that performed that year. The columns are:
'Year' (int): the year of the festival
'Artist' (str): the name of the artist
'Appearance_Order' (int): the order in which the artist appeared in that year’s festival (1 means they came onstage first)
The rows of sungod are arranged in no particular
order. The first few rows of sungod are shown
below (though sungod has many more rows than pictured here).
Assume:
Only one artist ever appeared at a time (for example, we can’t
have two separate artists with a 'Year' of 2015 and an
'Appearance_Order' of 3).
An artist may appear in multiple different Sun God festivals (they could be invited back).
We have already run import babypandas as bpd and import numpy as np.
Which of the following is a valid reason not to set
the index of sungod to 'Artist'?
Select all correct answers.
Two different artists have the same name.
An artist performed at Sun God in more than one year.
Several different artists performed at Sun God in the same year.
Many different artists share the same value of 'Appearance_Order'.
None of the above.
Answer: "Two different artists have the same name." and "An artist performed at Sun God in more than one year."
For this question, it is crucial to know that an index should not
contain duplicate values, so we need to consider reasons why
'Artist' might contain two values that are the same. Let’s
go through the answer choices in order.
For the first option, if two different artists had the same name,
this would lead to duplicate values in the 'Artist' column.
Therefore, this is a valid reason not to index sungod by 'Artist'.
For the second option, if one artist performed at Sun God in more
than one year, their name would appear multiple times in the
'Artist' column, once for each year they performed. This
would also be a valid reason not to index sungod by 'Artist'.
For the third option, if several different artists performed at Sun
God in the same year, that would not necessarily create duplicates in
the 'Artist' column, unless of course two of the artists
had the same name, which we’ve already addressed in the first answer
choice. This is not a valid reason to avoid indexing sungod by 'Artist'.
For the last answer choice, if many different artists share the same
value of 'Appearance_Order', this would not create
duplicates in the 'Artist' column. Therefore, this is also
not a valid reason to avoid indexing sungod by 'Artist'.
The average score on this problem was 83%.
Suppose in a new cell, we type the following.
sungod.sort_values(by='Year')
After we run that cell, we type the following in a second cell.
sungod.get('Artist').iloc[0]
What is the output when we run the second cell? Note that the first Sun God festival was held in 1983.
'Blues Traveler'
The artist who appeared on stage first in 1983.
An artist who appeared in 1983, but not necessarily the one who appeared first.
Not enough information to tell.
Answer: 'Blues Traveler'
In the first cell, although we seem to be sorting sungod
by 'Year', we aren’t actually changing the DataFrame
sungod at all because we don’t save the sorted DataFrame.
Remember that DataFrame methods don’t change the underlying
DataFrame; to keep the result, you must assign the output back to
the name of the DataFrame. So the first 'Artist' name will
still be 'Blues Traveler'.
Suppose we had saved the sorted DataFrame as in the code below.
sungod = sungod.sort_values(by='Year')
sungod.get('Artist').iloc[0]
In this case, the output would be the name of an artist who appeared in 1983, but not necessarily the one who appeared first. There will be several artists associated with the year 1983, and we don’t know which of them will be first in the sorted DataFrame.
The average score on this problem was 12%.
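Since so few students got this one right, it may help to see the difference between discarding and saving the sorted DataFrame. The sketch below uses pandas as a stand-in for babypandas. The DataFrame sungod_demo and its artist names are placeholders, not real festival data.

```python
import pandas as pd  # stand-in for babypandas (assumption)

# Hypothetical rows in no particular order.
sungod_demo = pd.DataFrame({
    'Year': [2015, 1983, 1999],
    'Artist': ['Artist A', 'Artist B', 'Artist C'],
    'Appearance_Order': [1, 2, 1],
})

# The sorted copy is created and then thrown away.
sungod_demo.sort_values(by='Year')
unsaved_first = sungod_demo.get('Artist').iloc[0]

# Here the sorted copy is kept under a new name.
sorted_demo = sungod_demo.sort_values(by='Year')
saved_first = sorted_demo.get('Artist').iloc[0]

print(unsaved_first, saved_first)  # Artist A Artist B
```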
Write one line of code below to create a DataFrame called
openers containing the artists that appeared first on stage
at a past Sun God festival. The DataFrame openers should
have all the same columns as sungod.
Answer:
openers = sungod[sungod.get('Appearance_Order')==1]
Since we want only certain rows of sungod, we need to
query. The condition to satisfy is that the 'Appearance_Order' column
should have a value of 1 to indicate that this artist performed first in
a certain year’s festival.
The average score on this problem was 84%.
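The query above can be tried on a small example. This sketch uses pandas as a stand-in for babypandas, with a hypothetical sungod_demo DataFrame whose rows are made up.

```python
import pandas as pd  # stand-in for babypandas (assumption)

# Hypothetical festival rows; artist names are placeholders.
sungod_demo = pd.DataFrame({
    'Year': [2015, 1983, 1999, 1983],
    'Artist': ['Artist A', 'Artist B', 'Artist C', 'Artist D'],
    'Appearance_Order': [1, 2, 1, 1],
})

# The comparison produces a Boolean Series; indexing keeps the True rows.
openers_demo = sungod_demo[sungod_demo.get('Appearance_Order') == 1]
print(openers_demo)
```

The result keeps all three columns and only the rows for artists who came onstage first, one per festival year.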
Suppose you are given a DataFrame of employees for a given company.
The DataFrame, called employees, is indexed by
'employee_id' (string) with a column called
'years' (int) that contains the number of years each
employee has worked for the company.
Suppose that the code
employees.sort_values(by='years', ascending=False).index[0]
outputs '2476'.
True or False: The number of years that employee 2476 has worked for the company is greater than the number of years that any other employee has worked for the company.
True
False
Answer: False
This is false because there could be other employees who have worked at the company for just as long as employee 2476.
The code says that when the employees DataFrame is
sorted in descending order of 'years', employee 2476 is in
the first row. There might, however, be a tie among several employees
for their value of 'years'. In that case, employee 2476 may
wind up in the first row of the sorted DataFrame, but we cannot say that
the number of years employee 2476 has worked for the company is greater
than the number of years that any other employee has worked for the
company.
If the statement had said greater than or equal to instead of greater than, the statement would have been true.
The average score on this problem was 29%.
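A small example with a tie shows why the statement fails. This sketch uses pandas as a stand-in for babypandas. The employee ids other than '2476' and all of the 'years' values are invented.

```python
import pandas as pd  # stand-in for babypandas (assumption)

# Hypothetical employees; two of them are tied for the most years.
employees_demo = pd.DataFrame(
    {'years': [12, 30, 30, 5]},
    index=['1001', '2476', '3090', '4412'],
)

top_id = employees_demo.sort_values(by='years', ascending=False).index[0]

# Count how many employees share the largest value of 'years'.
top_years = employees_demo.get('years').max()
num_with_top_years = employees_demo[employees_demo.get('years') == top_years].shape[0]

print(top_id, num_with_top_years)
```

Whichever of the tied ids ends up in the first row, two employees share the maximum, so the first-row employee has not worked strictly longer than everyone else.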
What will be the output of the following code?
employees.assign(start=2021-employees.get('years'))
employees.sort_values(by='start').index.iloc[-1]
the employee id of an employee who has worked there for the most years
the employee id of an employee who has worked there for the fewest years
an error message complaining about iloc[-1]
an error message complaining about something else
Answer: an error message complaining about something else
The problem is that the first line of code does not actually add a
new column to the employees DataFrame because the
expression is not saved. So the second line tries to sort by a column,
'start', that doesn’t exist in the employees
DataFrame and runs into an error when it can’t find a column by that
name.
This code also has a problem with iloc[-1], since
iloc cannot be used on the index, but since the problem
with the missing 'start' column is encountered first, that
will be the error message displayed.
The average score on this problem was 27%.
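Both problems described above can be observed separately. This sketch uses pandas as a stand-in for babypandas, so the exact error type and message in babypandas may differ. The DataFrame employees_demo and its values are invented.

```python
import pandas as pd  # stand-in for babypandas (assumption)

# Hypothetical employees, indexed by employee id.
employees_demo = pd.DataFrame(
    {'years': [12, 30, 5]},
    index=['1001', '2476', '4412'],
)

# The new DataFrame with a 'start' column is created but not saved.
employees_demo.assign(start=2021 - employees_demo.get('years'))
has_start = 'start' in employees_demo.columns

# Sorting by the missing column fails before .index.iloc is ever reached.
try:
    employees_demo.sort_values(by='start')
    first_error = None
except KeyError as err:
    first_error = type(err).__name__

# Separately, an index has no .iloc accessor.
index_has_iloc = hasattr(employees_demo.index, 'iloc')

print(has_start, first_error, index_has_iloc)  # False KeyError False
```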