These problems are taken from past quizzes and exams. Work on them
on paper, since the quizzes and exams you take in this
course will also be on paper.
We encourage you to complete these
problems during discussion section. Solutions will be made available
after all discussion sections have concluded. You don’t need to submit
your answers anywhere.
Note: We do not plan to cover all of
these problems during the discussion section; the problems we don’t
cover can be used for extra practice.
You have a DataFrame called prices that contains information about food prices at 18 different grocery stores. There is a column called 'broccoli' that contains the price in dollars for one pound of broccoli at each grocery store. There is also a column called 'ice_cream' that contains the price in dollars for a pint of store-brand ice cream.
Using the code,
prices.plot(kind='hist', y='broccoli', bins=np.arange(0.8, 2.11, 0.1), density=True)
we produced the histogram below:
How many grocery stores sold broccoli for a price greater than or equal to $1.30 per pound, but less than $1.40 per pound (the tallest bar)?
Answer: 4 grocery stores
We are given that the bins start at 0.8 and have a width of 0.1, which means one of the bins has endpoints 1.3 and 1.4. This bin (the tallest bar) includes all grocery stores that sold broccoli for a price greater than or equal to $1.30 per pound, but less than $1.40 per pound.
This bar has a width of 0.1 and we’d estimate the height to be around 2.2, though we can’t say exactly. Multiplying these values, the area of the bar is about 0.22, which means about 22 percent of the grocery stores fall into this bin. There are 18 grocery stores in total, as we are told in the introduction to this question. We can compute using a calculator that 22 percent of 18 is 3.96. Since the actual number of grocery stores this represents must be a whole number, this bin must represent 4 grocery stores.
The reason for the slight discrepancy between 3.96 and 4 is that we used 2.2 for the height of the bar, a number that we determined by eye. We don’t know the exact height of the bar. It is reassuring to do the calculation and get a value that’s very close to an integer, since we know the final answer must be an integer.
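The arithmetic above can be sketched in a few lines of Python. The height of 2.2 is the eyeballed estimate from the solution, not an exact value, and the variable names are only illustrative.

```python
# Values taken from the problem; bar_height is an estimate read off the plot.
num_stores = 18
bin_width = 0.1
bar_height = 2.2

# In a density histogram, area = proportion of values in the bin.
area = bin_width * bar_height
estimated_count = area * num_stores

# A count of stores must be a whole number, so round the estimate.
count = round(estimated_count)
print(estimated_count, count)
```

Rounding is only safe here because the estimate lands very close to an integer.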
The average score on this problem was 71%.
Suppose we now plot the same data with different bins, using the following line of code:
prices.plot(kind='hist', y='broccoli', bins=[0.8, 1, 1.1, 1.5, 1.8, 1.9, 2.5], density=True)
What would be the height on the y-axis for the bin corresponding to the interval [\$1.10, \$1.50)? Input your answer below.
Answer: 1.25
First, we need to figure out how many grocery stores the bin [\$1.10, \$1.50) contains. We already know from the previous subpart that there are four grocery stores in the bin [\$1.30, \$1.40). We could do similar calculations to find the number of grocery stores in each of these bins:
However, it’s much simpler and faster to use the fact that when the bins are all equally wide, the height of a bar is proportional to the number of data values it contains. So looking at the histogram in the previous subpart, since we know the [\$1.30, \$1.40) bin contains 4 grocery stores, then the [\$1.10, \$1.20) bin must contain 1 grocery store, since it’s only a quarter as tall. Again, we’re taking advantage of the fact that there must be an integer number of grocery stores in each bin when we say it’s 1/4 as tall. Our only options are 1/4, 1/2, or 3/4 as tall, and among those choices, it’s clear.
Therefore, by looking at the relative heights of the bars, we can quickly determine the number of grocery stores in each bin:
Adding these numbers together, this means there are 9 grocery stores whose broccoli prices fall in the interval [\$1.10, \$1.50). In the new histogram, these 9 grocery stores will be represented by a bar of width 1.50-1.10 = 0.4. The area of the bar should be \frac{9}{18} = 0.5. Therefore the height must be \frac{0.5}{0.4} = 1.25.
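As a quick check of the last step, here is the same calculation in Python. The count of 9 stores is taken from the solution above, and the variable names are only illustrative.

```python
# Numbers from the solution: 9 of the 18 stores fall in [1.10, 1.50).
stores_in_bin = 9
num_stores = 18
left, right = 1.10, 1.50

area = stores_in_bin / num_stores   # proportion of stores in the bin
width = right - left                # bin width in dollars
height = area / width               # density = area / width

# Round to hide floating-point noise in the subtraction.
print(round(height, 2))
```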
The average score on this problem was 33%.
You are interested in finding out the number of stores in which a pint of ice cream was cheaper than a pound of broccoli. Will you be able to determine the answer to this question by looking at the plot produced by the code below?
prices.get(['broccoli', 'ice_cream']).plot(kind='barh')
Yes
No
Answer: Yes
When we use .plot without specifying a y column, it uses every column in the DataFrame as a y column and creates an overlaid plot. Since we first use get with the list ['broccoli', 'ice_cream'], this keeps the 'broccoli' and 'ice_cream' columns from prices, so our bar chart will overlay broccoli prices with ice cream prices. Notice that this get is unnecessary because prices only has these two columns, so it would have been the same to just use prices directly. The resulting bar chart will look something like this:
Each grocery store has its broccoli price represented by the length of the blue bar and its ice cream price represented by the length of the red bar. We can therefore answer the question by simply counting the number of red bars that are shorter than their corresponding blue bars.
The average score on this problem was 78%.
You are interested in finding out the number of stores in which a pint of ice cream was cheaper than a pound of broccoli. Will you be able to determine the answer to this question by looking at the plot produced by the code below?
prices.get(['broccoli', 'ice_cream']).plot(kind='hist')
Yes
No
Answer: No
This will create an overlaid histogram of broccoli prices and ice cream prices. So we will be able to see the distribution of broccoli prices together with the distribution of ice cream prices, but we won’t be able to pair up particular broccoli prices with ice cream prices at the same store. This means we won’t be able to answer the question. The overlaid histogram would look something like this:
This tells us that broadly, ice cream tends to be more expensive than broccoli, but we can’t say anything about the number of stores where ice cream is cheaper than broccoli.
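To see why the pairing matters, here is a small sketch with made-up prices (not the real data). Both ice cream lists contain exactly the same values, so they would produce identical histograms, yet the number of stores where ice cream is cheaper differs.

```python
def count_cheaper_ice_cream(broccoli, ice_cream):
    """Count stores where the ice cream price is below the broccoli price."""
    return sum(i < b for b, i in zip(broccoli, ice_cream))

# Hypothetical prices for three stores, for illustration only.
broccoli = [1.0, 2.0, 3.0]
ice_cream_a = [2.0, 3.0, 4.0]
ice_cream_b = [4.0, 3.0, 2.0]   # same values as ice_cream_a, different stores

# Identical distributions, so the histograms would match...
print(sorted(ice_cream_a) == sorted(ice_cream_b))
# ...but the store-by-store comparison gives different counts.
print(count_cheaper_ice_cream(broccoli, ice_cream_a))
print(count_cheaper_ice_cream(broccoli, ice_cream_b))
```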
The average score on this problem was 81%.
Some code and the scatterplot it produced are shown below:
(prices.get(['broccoli', 'ice_cream']).plot(kind='scatter', x='broccoli', y='ice_cream'))
Can you use this plot to figure out the number of stores in which a pint of ice cream was cheaper than a pound of broccoli?
If so, say how many such stores there are and explain how you came to that conclusion.
If not, explain why this scatterplot cannot be used to answer the question.
Answer: Yes, and there are 2 such stores.
In this scatterplot, each grocery store is represented as one dot. The x-coordinate of that dot tells the price of broccoli at that store, and the y-coordinate tells the price of ice cream. If a grocery store’s ice cream price is cheaper than its broccoli price, the dot in the scatterplot will have y<x. To identify such dots in the scatterplot, imagine drawing the line y=x. Any dot below this line corresponds to a point with y<x, which is a grocery store where ice cream is cheaper than broccoli. As we can see, there are two such stores.
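The "below the line y=x" test can also be written as a filter over points. The coordinates here are invented for illustration and are not read from the actual scatterplot.

```python
# Hypothetical (broccoli, ice_cream) price pairs, one per store.
points = [(1.2, 3.5), (1.9, 1.5), (1.4, 4.0), (2.0, 1.8), (1.0, 2.2)]

# A dot lies below the line y = x exactly when y < x.
below_line = [(x, y) for x, y in points if y < x]
print(len(below_line))
```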
The average score on this problem was 78%.
Included is a DataFrame named sungod that contains information on the artists who have performed at Sun God in years past. For each year that the festival was held, we have one row for each artist that performed that year. The columns are:
'Year' (int): the year of the festival
'Artist' (str): the name of the artist
'Appearance_Order' (int): the order in which the artist appeared in that year’s festival (1 means they came onstage first)
The rows of sungod are arranged in no particular order. The first few rows of sungod are shown below (though sungod has many more rows than pictured here).
Assume:
Only one artist ever appeared at a time (for example, we can’t have two separate artists with a 'Year' of 2015 and an 'Appearance_Order' of 3).
An artist may appear in multiple different Sun God festivals (they could be invited back).
We have already run import babypandas as bpd and import numpy as np.
On the graph paper below, draw the histogram that would be produced by this code.
(sungod.take(np.arange(5))
       .plot(kind='hist', density=True,
             bins=np.arange(0, 7, 2), y='Appearance_Order'))
In your drawing, make sure to label the height of each bar in the histogram on the vertical axis. You can scale the axes however you like, and the two axes don’t need to be on the same scale.
Answer:
To draw the histogram, we first need to bin the data and figure out how many data values fall into each bin. The code includes bins=np.arange(0, 7, 2), which means the bin endpoints are 0, 2, 4, 6. This gives us three bins: [0, 2), [2, 4), and [4, 6]. Remember that all bins, except for the last one, include the left endpoint but not the right. The last bin includes both endpoints.
Now that we know what the bins are, we can count up the number of values in each bin. We need to look at the 'Appearance_Order' column of sungod.take(np.arange(5)), or the first five rows of sungod. The values there are 1, 4, 3, 1, 3. The two 1s fall into the first bin [0, 2). The two 3s fall into the second bin [2, 4), and the one 4 falls into the last bin [4, 6]. This means the proportions of values in the bins are \frac{2}{5}, \frac{2}{5}, \frac{1}{5} from left to right.
To figure out the height of each bar in the histogram, we use the fact that the area of a bar in a density histogram should equal the proportion of values in that bin. The area of a rectangle is height times width, so height is area divided by width.
For the bin [0, 2), the area is \frac{2}{5} = 0.4 and the width is 2, so the height is \frac{0.4}{2} = 0.2.
For the bin [2, 4), the area is \frac{2}{5} = 0.4 and the width is 2, so the height is \frac{0.4}{2} = 0.2.
For the bin [4, 6], the area is \frac{1}{5} = 0.2 and the width is 2, so the height is \frac{0.2}{2} = 0.1.
Since the bins are all the same width, the fact that there are an equal number of values in the first two bins and half as many in the third bin means the first two bars should be equally tall and the third should be half as tall. We can use this to draw the rest of the histogram quickly once we’ve drawn the first bar.
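The binning and height calculation can be sketched in plain Python. The helper density_heights is a hypothetical name written for this illustration, not part of babypandas, and it assumes every value falls inside the bin range.

```python
def density_heights(values, edges):
    """Bar heights of a density histogram.

    Bins are [left, right), except the last bin, which also
    includes its right endpoint. Assumes all values are in range.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    n = len(values)
    # height = proportion / width
    return [c / n / (edges[i + 1] - edges[i]) for i, c in enumerate(counts)]

# 'Appearance_Order' values from the first five rows, bin endpoints 0, 2, 4, 6.
heights = density_heights([1, 4, 3, 1, 3], [0, 2, 4, 6])
print(heights)
```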
The average score on this problem was 45%.
Nintendo collected data on the heights of a sample of Animal Crossing: New Horizons players. A histogram of the heights in their sample is given below.
What percentage of players in Nintendo’s sample are at least 62.5 inches tall? Give your answer as an integer rounded to the nearest multiple of 5.
Answer: 80%
The average score on this problem was 73%.
Suppose there are 200 students enrolled in DSC 10, and that the histogram below displays the distribution of the number of Instagram followers each student has, measured in 100s. That is, if a student is represented in the first bin, they have between 0 and 200 Instagram followers.
How many students in DSC 10 have between 200 and 800 Instagram followers? Give your answer as an integer.
Answer: 90
Remember, the key property of histograms is that the proportion of values in a bin is equal to the area of the corresponding bar. To find the number of values in the range 2-8 (the x-axis is measured in hundreds), we’ll need to find the proportion of values in the range 2-8 and multiply that by 200, which is the total number of students in DSC 10. To find the proportion of values in the range 2-8, we’ll need to find the areas of the 2-4, 4-6, and 6-8 bars.
Area of the 2-4 bar: \text{width} \cdot \text{height} = 2 \cdot 0.1 = 0.2.
Area of the 4-6 bar: \text{width} \cdot \text{height} = 2 \cdot 0.0625 = 0.125.
Area of the 6-8 bar: \text{width} \cdot \text{height} = 2 \cdot 0.0625 = 0.125.
Then, the total proportion of values in the range 2-8 is 0.2 + 0.125 + 0.125 = 0.45, so the total number of students with between 200 and 800 Instagram followers is 0.45 \cdot 200 = 90.
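The same sum can be checked in Python. The bar heights are the ones used in the solution above, and the dictionary keys are just labels chosen for this sketch.

```python
total_students = 200
bin_width = 2   # each bin spans 2 units on the x-axis (200 followers)

# Heights of the three bars covering the range 2-8, as read in the solution.
heights = {"2-4": 0.1, "4-6": 0.0625, "6-8": 0.0625}

# Area of each bar is width * height; the areas add up to the proportion.
proportion = sum(bin_width * h for h in heights.values())
students = proportion * total_students
print(round(students))
```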
The average score on this problem was 49%.
Suppose the height of a bar in the above histogram is h. How many students are represented in the corresponding bin, in terms of h?
Hint: Just as in the first subpart, you’ll need to use the assumption from the start of the problem.
20 \cdot h
100 \cdot h
200 \cdot h
400 \cdot h
800 \cdot h
Answer: 400 \cdot h
As we said at the start of the last solution, the key property of histograms is that the proportion of values in a bin is equal to the area of the corresponding bar. Then, the number of students represented by a bar is the total number of students in DSC 10 (200) multiplied by the area of the bar.
Since all bars in this histogram have a width of 2, the area of a bar in this histogram is \text{width} \cdot \text{height} = 2 \cdot h. If there are 200 students in total, then the number of students represented in a bar with height h is 200 \cdot 2 \cdot h = 400 \cdot h.
To verify our answer, we can check to see if it makes sense in the context of the previous subpart. The 2-4 bin has a height of 0.1, and 400 \cdot 0.1 = 40. The total number of students in the range 2-8 was 90, so it makes sense that 40 of them came from the 2-4 bar, since the 2-4 bar takes up about half of the area of the 2-8 range.
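The formula and the sanity check can be sketched as a small function. The name students_in_bar and its default arguments are assumptions made for this illustration.

```python
def students_in_bar(h, total_students=200, bin_width=2):
    """Number of students in a bar of height h: total * width * height."""
    return total_students * bin_width * h

# Heights of the 2-4, 4-6, and 6-8 bars from the previous subpart.
counts = [students_in_bar(h) for h in (0.1, 0.0625, 0.0625)]
print(counts, sum(counts))
```

The three counts should add up to the 90 students found in the previous subpart.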
The average score on this problem was 36%.
The DataFrame uc has information about the number of degrees awarded by each University of California campus, for each of the past six academic years "2018-2019" through "2023-2024". Since there are ten UC campuses and six years of data, there are 60 rows. The columns are "Campus" (str), "Year" (str), and "Degrees" (int).
Typically, graduation ceremonies happen only at the end of an academic year. For example, students who earn a degree in the 2024-2025 academic year celebrate with a graduation ceremony in 2025.
The function ceremony_year takes as input a string formatted like those in the "Year" column of uc, and returns an int corresponding to the year of the graduation ceremony for that academic year. For example, ceremony_year("2024-2025") should return 2025.
Fill in the return statement of this function below.
def ceremony_year(academic_year):
    return _________
Answer:
int(academic_year.split("-")[1])
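Filling the answer into the function gives the following runnable version. The comments describe one way to read the expression.

```python
def ceremony_year(academic_year):
    # "2024-2025".split("-") gives ["2024", "2025"]; the ceremony
    # happens in the second year, which we convert from str to int.
    return int(academic_year.split("-")[1])

result = ceremony_year("2024-2025")
print(result, type(result))
```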
The average score on this problem was 70%.
What does the following expression evaluate to? Write your answer exactly how the output would appear in Python.
uc.get("Year").apply(ceremony_year).min()
Answer: 2019
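One way to see this without the uc DataFrame is to apply the function to the six academic years directly. This sketch uses a plain Python list in place of the "Year" column, which is an assumption made to keep the example self-contained.

```python
def ceremony_year(academic_year):
    return int(academic_year.split("-")[1])

# The six academic years described in the problem, "2018-2019" to "2023-2024".
years = [f"{start}-{start + 1}" for start in range(2018, 2024)]

# Mimic .apply(ceremony_year).min() with a list comprehension and min().
ceremony_years = [ceremony_year(y) for y in years]
print(min(ceremony_years))
```

The earliest academic year is "2018-2019", whose ceremony takes place in 2019.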
The average score on this problem was 87%.
In the ikea DataFrame, the first word of each string in the 'product' column represents the product line. For example, the HEMNES line of products includes several different products, such as beds, dressers, and bedside tables.
The code below assigns a new column to the ikea DataFrame containing the product line associated with each product.
(ikea.assign(product_line = ikea.get('product').apply(extract_product_line)))
What are the input and output types of the extract_product_line function?
takes a string as input, returns a string
takes a string as input, returns a Series
takes a Series as input, returns a string
takes a Series as input, returns a Series
Answer: takes a string as input, returns a string
To use the Series method .apply, we first need a Series, containing values of any type. We pass in the name of a function to .apply and essentially, .apply calls the given function on each value of the Series, producing a Series with the resulting outputs of those function calls. In this case, .apply(extract_product_line) is called on the Series ikea.get('product'), which contains string values. This means the function extract_product_line must take strings as inputs. We’re told that the code assigns a new column to the ikea DataFrame containing the product line associated with each product, and we know that the product line is a string, as it’s the first word of the product name. This means the function extract_product_line must output a string.
The average score on this problem was 72%.
Complete the return statement in the extract_product_line function below.
For example, extract_product_line('HEMNES Daybed frame with 3 drawers, white, Twin') should return 'HEMNES'.
def extract_product_line(x):
    return _________
What goes in the blank?
Answer: x.split(' ')[0]
This function should take as input a string x, representing a product name, and return the first word of that string, representing the product line. Since words are separated by spaces, we want to split the string on the space character ' '.
It’s also correct to answer x.split()[0] without specifying to split on spaces, because the default behavior of the string .split method is to split on any whitespace, which includes any number of spaces, tabs, newlines, etc. Since we’re only extracting the first word, which will be separated from the rest of the product name by a single space, it’s equivalent to split using single spaces and using the default of any whitespace.
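The sketch below compares the two answers on the example from the problem. The padded string at the end is a made-up input, included only to show where the two versions of .split would differ.

```python
def extract_product_line(x):
    return x.split(' ')[0]

name = 'HEMNES Daybed frame with 3 drawers, white, Twin'
with_space = extract_product_line(name)
with_default = name.split()[0]
print(with_space, with_default)

# Hypothetical input with leading spaces: splitting on ' ' yields empty
# strings first, while the default split skips leading whitespace.
padded = '  HEMNES Daybed'
print(padded.split(' '))
print(padded.split())
```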
The average score on this problem was 84%.
Complete the implementation of the to_minutes function below. This function takes as input a string formatted as 'x hr, y min' where x and y represent integers, and returns the corresponding number of minutes, as an integer (type int in Python).
For example, to_minutes('3 hr, 5 min') should return 185.
def to_minutes(time):
    first_split = time.split(' hr, ')
    second_split = first_split[1].split(' min')
    return _________
What goes in the blank?
Answer:
int(first_split[0])*60+int(second_split[0])
As the last subpart demonstrated, if we want to compare times, it doesn’t make sense to do so when times are represented as strings. In the to_minutes function, we convert a time string into an integer number of minutes.
The first step is to understand the logic. Every hour contains 60 minutes, so for a time string formatted like 'x hr, y min' the total number of minutes comes from multiplying the value of x by 60 and adding y.
The second step is to understand how to extract the x and y values from the time string using the string method .split. The string method .split takes as input some separator string and breaks the string into pieces at each instance of the separator string. It then returns a list of all those pieces. The first line of code, therefore, creates a list called first_split containing two elements. The first element, accessed by first_split[0], contains the part of the time string that comes before ' hr, '. That is, first_split[0] evaluates to the string 'x'. Similarly, first_split[1] contains the part of the time string that comes after ' hr, '. So it is formatted like 'y min'. If we split this string again using the separator of ' min', the result will be a list whose first element is the string 'y'. This list is saved as second_split, so second_split[0] evaluates to the string 'y'.
Now we have the pieces we need to compute the number of minutes, using the idea of multiplying the value of x by 60 and adding y. We have to be careful with data types here, as the bolded instructions warn us that the function must return an integer. Right now, first_split[0] evaluates to the string 'x' and second_split[0] evaluates to the string 'y'. We need to convert these strings to integers before we can multiply and add. Once we convert using the int function, then we can multiply the number of hours by 60 and add the number of minutes. Therefore, the solution is int(first_split[0])*60+int(second_split[0]).
Note that failure to convert strings to integers using the int function would lead to very different behavior. Let’s take the example time string of '3 hr, 5 min' as input to our function. With the return statement as int(first_split[0])*60+int(second_split[0]), the function would return 185 on this input, as desired. With the return statement as first_split[0]*60+second_split[0], the function would return a string of length 61, looking something like this '3333...33335'. That’s because the * and + symbols do have meaning for strings, they’re just different meanings than when used with integers.
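Putting the pieces together, the completed function and the string-only variant can be compared directly. The name to_minutes_no_int is a hypothetical helper added only to show what happens without the int conversion.

```python
def to_minutes(time):
    first_split = time.split(' hr, ')             # e.g. ['3', '5 min']
    second_split = first_split[1].split(' min')   # e.g. ['5', '']
    return int(first_split[0]) * 60 + int(second_split[0])

def to_minutes_no_int(time):
    # Same splits, but without int(): * repeats the string
    # and + concatenates, so the result is a string.
    first_split = time.split(' hr, ')
    second_split = first_split[1].split(' min')
    return first_split[0] * 60 + second_split[0]

minutes = to_minutes('3 hr, 5 min')
wrong = to_minutes_no_int('3 hr, 5 min')
print(minutes)
print(type(wrong), len(wrong))
```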
The average score on this problem was 71%.
An art museum records information about its collection in a DataFrame called art. The columns of art are as follows:
"title" (str): the name of the art piece.
"artist" (str): the name of the artist.
"year" (int): the year the art piece was produced.
"price" (float): the selling price of the art piece in dollars.
Which of the following correctly plots a density histogram showing the distribution of "price" in art? Select all that apply.
art.get(["price"]).plot(kind="hist", density=True)
art.get(["price"]).plot(kind="hist")
art.drop(columns=["artist", "year", "price"]).plot(kind="hist", density=True)
art.plot(kind="hist", y="price")
art.plot(kind="hist", y="price", density=True)
art.plot(kind="hist", x="price", density=True)
Answer: Options 1 and 5
The average score on this problem was 83%.
The density histogram below shows the distribution of "price" in art. If the museum has 100 art pieces in total, how many pieces cost at least $3,000 but less than $4,500?
Answer: 30
The average score on this problem was 55%.
The laptops DataFrame contains information on various factors that influence the pricing of laptops. Each row represents a laptop, and the columns are:
"Mfr" (str): the company that manufactures the laptop, like “Apple” or “Dell”.
"Model" (str): the model name of the laptop, such as “MacBook Pro”.
"OS" (str): the operating system, such as “macOS” or “Windows 11”.
"Screen Size" (float): the diagonal length of the screen, in inches.
"Price" (float): the price of the laptop, in dollars.
A density histogram of "Screen Size" is shown below.
What percentage of computer models have a "Screen Size" of at least 16 inches but less than 18 inches? Give your answer as an integer rounded to the nearest multiple of 5.
Answer: 35%
The average score on this problem was 72%.