Below are practice problems tagged for Lecture 2 (rendered directly from the original exam/quiz sources).
Nate’s favorite number is 5. He calls a number “lucky” if it’s
greater than 500 or if it contains a 5 anywhere in its representation.
For example, 1000.04 and 5.23 are both lucky
numbers.
Complete the implementation of the function check_lucky,
which takes in a number as a float and returns True if it
is lucky and False otherwise. Then, add a column named
"is_lucky" to txn that contains
True for lucky transaction amounts and False
for all other transaction amounts, and save the resulting DataFrame to
the variable luck.
def check_lucky(x):
    return __(a)__
luck = txn.assign(is_lucky = __(b)__)
What goes in blank (a)?
What goes in blank (b)?
Answer: (a):
x > 500 or "5" in str(x), (b):
txn.get("amount").apply(check_lucky)
(a): We want this function to return True if the number
is lucky (greater than 500, or containing a 5). Checking the first
condition is easy: we can simply use x > 500. To check the second
condition, we’ll convert the number to a string so that we can check
whether it contains "5" using the in keyword.
Once we have these two conditions written out, we can combine them with
the or keyword, since either one is enough for the number
to be considered lucky. This gives us the full statement
x > 500 or "5" in str(x). Since this will evaluate to
True if and only if the number is lucky, this is all we
need in the return statement.
The average score on this problem was 51%.
(b): Now that we have the check_lucky function, we want to
use it to find if each number in the amount column is lucky
or not. To do this, we can use .apply() to apply the
function elementwise (row-by-row) to the "amount" column,
which will create a new Series of Booleans indicating if each element in
the "amount" column is lucky.
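The .apply() step can be sketched on a small made-up table. This assumes plain pandas in place of the course's babypandas, and the amounts in txn below are hypothetical, since the real txn DataFrame is not shown here.

```python
import pandas as pd

def check_lucky(x):
    return x > 500 or "5" in str(x)

# Hypothetical transaction amounts standing in for the real txn DataFrame.
txn = pd.DataFrame({"amount": [1000.04, 5.23, 12.3, 45.0]})

# .apply calls check_lucky once per entry and collects the results in a Series.
luck = txn.assign(is_lucky=txn.get("amount").apply(check_lucky))
print(luck)
```

Note that .assign returns a new DataFrame, so txn itself is left without an "is_lucky" column.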
The average score on this problem was 86%.
Fill in the blanks below so that lucky_prop evaluates to
the proportion of fraudulent "visa" card transactions whose
transaction amounts are lucky.
visa_fraud = __(a)__
lucky_prop = visa_fraud.__(b)__.mean()
What goes in blank (a)?
What goes in blank (b)?
Answer: (a):
luck[(luck.get("card")=="visa") & (luck.get("is_fraud"))],
(b): get("is_lucky")
(a): The first step in this question is to query the DataFrame so
that we have only the rows which are fraudulent transactions from “visa”
cards. luck.get("card")=="visa" evaluates to
True if and only if the transaction was from a Visa card,
so this is the first part of our condition. To find transactions which
were fraudulent, we can simply find the rows with a value of
True in the "is_fraud" column. We can do this
with luck.get("is_fraud"), which is equivalent to
luck.get("is_fraud") == True in this case since the
"is_fraud" column only contains Trues and Falses. Since we
want only the rows where both of these conditions hold, we can combine
these two conditions with the logical & operator, and
place this inside of square brackets to query the luck DataFrame for
only the rows where both conditions are true, giving us
luck[(luck.get("card")=="visa") & (luck.get("is_fraud"))].
Note that we use the & instead of the keyword
and since & is used for elementwise
comparisons between two Series, like we’re doing here, whereas the
and keyword is used for comparing two Booleans (not two
Series containing Booleans).
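The difference between & and the and keyword can be seen on a small example. This is a sketch assuming plain pandas rather than babypandas, and the four rows of luck below are hypothetical.

```python
import pandas as pd

# Hypothetical rows standing in for the real luck DataFrame.
luck = pd.DataFrame({
    "card": ["visa", "visa", "mastercard", "visa"],
    "is_fraud": [True, False, True, True],
    "is_lucky": [True, True, False, False],
})

# & combines the two Boolean Series entry by entry.
visa_fraud = luck[(luck.get("card") == "visa") & (luck.get("is_fraud"))]
print(visa_fraud)

# `and` needs a single True/False from each side, so pandas raises a ValueError
# when asked for the truth value of a whole Series.
try:
    luck[(luck.get("card") == "visa") and (luck.get("is_fraud"))]
    and_failed = False
except ValueError:
    and_failed = True
print(and_failed)
```

Only the rows that are both Visa and fraudulent (the first and last rows here) survive the query.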
The average score on this problem was 52%.
(b): We already have a Boolean column is_lucky
indicating if each transaction had a lucky amount. Recall that booleans
are equivalent to 1s and 0s, where 1 represents true and 0 represents
false, so to find the proportion of lucky amounts we can simply take the
mean of the is_lucky column. The reason that taking the mean is
equivalent to finding the proportion of lucky amounts comes from the
definition of the mean: the sum of all values divided by the number of
entries. If all entries are ones and zeros, then summing the values is
equivalent to counting the number of ones (Trues) in the Series.
Therefore, the mean will be given by the number of Trues divided by the
length of the Series, which is exactly the proportion of lucky numbers
in the column.
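The argument above can be checked with plain Python, using a short hypothetical list of Booleans in place of the is_lucky column.

```python
# Hypothetical Boolean values standing in for the is_lucky column.
is_lucky = [True, False, True, True]

# Summing treats True as 1 and False as 0, so this counts the Trues.
num_true = sum(is_lucky)

# Mean = sum of values / number of entries.
mean_value = num_true / len(is_lucky)

# Proportion of Trues, computed by counting directly.
proportion = is_lucky.count(True) / len(is_lucky)

print(mean_value, proportion)
```

Both computations give the same number, which is why .mean() on a Boolean column returns a proportion.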
The average score on this problem was 61%.
Fill in the blanks below so that lucky_prop is one value
in the Series many_props.
many_props = luck.groupby(__(a)__).mean().get(__(b)__)
What goes in blank (a)?
What goes in blank (b)?
Answer: (a): ["card", "is_fraud"],
(b): "is_lucky"
(a): lucky_prop is the proportion of fraudulent “visa”
card transactions that have a lucky amount. The idea is to create a
Series with the proportions of fraudulent or non-fraudulent transactions
from each card type that have a lucky amount. To do this, we’ll want to
group by the column that describes the card type ("card"),
and the column that describes whether a transaction is fraudulent
("is_fraud"). Putting this in the proper syntax for a
groupby with multiple columns, we have
["card", "is_fraud"]. The order doesn’t matter, so
["is_fraud", "card"] is also correct.
The average score on this problem was 55%.
(b): Once we have this grouped DataFrame, we know that the entry in
each column will be the mean of that column for some combination of
"is_fraud" and "card". And, since
"is_lucky" contains Booleans, we know that this mean is
equivalent to the proportion of transactions which were lucky for each
"is_fraud" and "card" combination. One such
combination is fraudulent “visa” transactions, so
lucky_prop is one element of this Series.
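To see that lucky_prop really is one entry of many_props, here is a sketch on hypothetical data. It assumes plain pandas rather than babypandas, and a luck DataFrame that has only the three columns shown, so that every remaining column can be averaged.

```python
import pandas as pd

# Hypothetical rows standing in for the real luck DataFrame.
luck = pd.DataFrame({
    "card": ["visa", "visa", "visa", "mastercard", "mastercard"],
    "is_fraud": [True, True, False, True, True],
    "is_lucky": [True, False, True, False, False],
})

# Proportion computed by querying, as in the previous problem.
visa_fraud = luck[(luck.get("card") == "visa") & (luck.get("is_fraud"))]
lucky_prop = visa_fraud.get("is_lucky").mean()

# One proportion per (card, is_fraud) combination.
many_props = luck.groupby(["card", "is_fraud"]).mean().get("is_lucky")

# Convert to a dictionary keyed by (card, is_fraud) pairs for easy lookup.
props = many_props.to_dict()
print(props[("visa", True)], lucky_prop)
```

The entry for the ("visa", True) group matches the value obtained by querying first and averaging afterwards.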
The average score on this problem was 67%.
Notice that bookstore has an index of
"ISBN" and sales does not. Why is that?
There is no good reason. We could have set the index of
sales to "ISBN".
There can be two different books with the same
"ISBN".
"ISBN" is already being used as the index of
bookstore, so it shouldn’t also be used as the index of
sales.
The bookstore can sell multiple copies of the same book.
Answer: The bookstore can sell multiple copies of the same book.
In the sales DataFrame, each row represents an
individual sale, meaning multiple rows can have the same
"ISBN" if multiple copies of the same book are sold.
Therefore we can’t use it as the index because it is not a unique
identifier for rows of sales.
The average score on this problem was 87%.
Is "ISBN" a numerical or categorical variable?
numerical
categorical
Answer: categorical
Even though "ISBN" consists of numbers, it is used to
identify and categorize books rather than to quantify or measure
anything, thus it is categorical. It doesn’t make sense to compare ISBN
numbers like you would compare numbers on a number line, or to do
arithmetic with ISBN numbers.
The average score on this problem was 75%.
Which type of data visualization should be used to compare authors by median rating?
scatter plot
line plot
bar chart
histogram
Answer: bar chart
A bar chart is best, as it visualizes numerical values (median ratings) across discrete categories (authors).
The average score on this problem was 88%.
For each expression below, determine the data type of the output and the value of the expression, if possible. If there is not enough information to determine the expression’s value, write “Unknown” in the corresponding blank.
apts.get("Rent").iloc[43] * 4 / 2
Answer: type float, value Unknown
We know that all values in the column Rent are
ints. So, when we call .iloc[43] on this
column (which grabs the 44th entry in the column), we know the result
will be an int. We then perform some multiplication and
division with this value. Importantly, when we divide an
int, the type is automatically changed to a
float, so the type of the final output will be a
float. Since we do not explicitly know what the 44th entry
in the Rent column is, the exact value of this
float is unknown to us.
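The type change caused by division can be checked with a plain Python int. The rent of 1500 below is a hypothetical value, since the actual 44th entry is not known; pandas would return NumPy integer and float types, which behave the same way here.

```python
# Hypothetical rent; the real 44th entry of the column is unknown.
rent = 1500

# Multiplying two ints gives an int, but / always produces a float.
product = rent * 4
result = product / 2

print(type(product).__name__, type(result).__name__, result)
```

Even though 6000 divides evenly by 2, the result is the float 3000.0 rather than the int 3000.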
The average score on this problem was 77%.
apts.get("Neighborhood").iloc[2][-3]
Answer: type str, value 'w'
This code takes the third entry (the entry at index 2) from the
Neighborhood column of apts, which is a
str, and it takes the third to last letter of that string.
The third entry in the Neighborhood column is
'Midway', and the third to last letter of
'Midway' is 'w'. So, our result is a
string with value w.
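Negative indexing on the string from this answer can be verified directly:

```python
neighborhood = "Midway"

# Index -1 is the last character, -2 the second to last, and so on.
last = neighborhood[-1]
third_to_last = neighborhood[-3]

print(last, third_to_last, type(third_to_last).__name__)
```

Indexing a string always returns another string, here of length one.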
The average score on this problem was 73%.
(apts.get("Laundry") + 5).max()
Answer: type int, value 6
This code deals with the Laundry column of
apts, which is a Series of Trues and
Falses. One property of Trues and
Falses is that they are also interpreted by Python as ones
and zeroes. So, the code (apts.get("Laundry") + 5).max()
adds five to each of the ones and zeroes in this column, and then takes
the maximum value from the column, which would be an int of
value 6.
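The same arithmetic can be sketched with a plain Python list. The values below are hypothetical and assume at least one apartment has laundry, which is what makes the maximum 6 rather than 5.

```python
# Hypothetical Laundry values; at least one True is assumed.
laundry = [True, False, True]

# True + 5 is 6 and False + 5 is 5, because Booleans act as 1 and 0.
shifted = [has_laundry + 5 for has_laundry in laundry]
largest = max(shifted)

print(shifted, largest, type(largest).__name__)
```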
The average score on this problem was 69%.
apts.get("Complex").str.contains("Verde")
Answer: type Series, value Unknown
This code takes the column (series) "Complex" and
returns a new series of True and False values.
Each True in the new column is a result of an entry in the
"Complex" column containing "Verde". Each
False in the new column is a result of an entry in the
"Complex" column failing to contain "Verde".
Since we are not given the entirety of the "Complex"
column, the exact value of the resulting series is unknown to us.
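A short sketch of .str.contains, assuming plain pandas and a hypothetical "Complex" column (the names below are made up for illustration):

```python
import pandas as pd

# Hypothetical complex names; the real column is only partially shown.
complexes = pd.Series(["Costa Verde Village", "Regents Court", "Verde Park"])

# One Boolean per entry: True where the name contains "Verde".
has_verde = complexes.str.contains("Verde")

print(has_verde)
print(type(has_verde).__name__)
```

The output has the same length as the input, so its exact contents depend on every entry of the original column.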
The average score on this problem was 64%.
apts.get("Sqft").median() > 1000
Answer: type bool, value Unknown
This code finds the median of the column (series) "Sqft"
and compares it to a value of 1000, resulting in a bool
value of True or False. Since we do not know
the median of the "Sqft" column, the exact value of the
resulting code is unknown to us.
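The comparison can be illustrated with Python's statistics module and hypothetical square footages. With pandas the result would be a NumPy Boolean, which plays the same role as the Python bool shown here.

```python
from statistics import median

# Hypothetical square footages; the real "Sqft" column is not fully known.
sqft = [800, 950, 1200]

middle = median(sqft)
is_large = middle > 1000

print(middle, is_large, type(is_large).__name__)
```

Whatever the data, the comparison produces a single Boolean; only its value depends on the column.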
The average score on this problem was 87%.