Lecture 2 — Practice

Below are practice problems tagged for Lecture 2 (rendered directly from the original exam/quiz sources).

Problem 1

Problem 1.1

Nate’s favorite number is 5. He calls a number “lucky” if it’s greater than 500 or if it contains a 5 anywhere in its representation. For example, 1000.04 and 5.23 are both lucky numbers.

Complete the implementation of the function check_lucky, which takes in a number as a float and returns True if it is lucky and False otherwise. Then, add a column named "is_lucky" to txn that contains True for lucky transaction amounts and False for all other transaction amounts, and save the resulting DataFrame to the variable luck.

Answer: (a): x > 500 or "5" in str(x), (b): txn.get("amount").apply(check_lucky)

(a): We want this function to return True if the number is lucky (greater than 500 or containing a 5). Checking the first condition is easy: we can simply use x > 500. To check the second condition, we’ll convert the number to a string so that we can check whether it contains "5" using the in keyword. Once we have these two conditions written out, we can combine them with the or keyword, since either one is enough for the number to be considered lucky. This gives us the full statement x > 500 or "5" in str(x). Since this evaluates to True if and only if the number is lucky, this is all we need in the return statement.

Difficulty: ⭐️⭐️⭐️

The average score on this problem was 51%.

(b): Now that we have the check_lucky function, we want to use it to find whether each number in the "amount" column is lucky. To do this, we can use .apply() to apply the function elementwise (row-by-row) to the "amount" column, which will create a new Series of Booleans indicating whether each element in the "amount" column is lucky.

Difficulty: ⭐️⭐️

The average score on this problem was 86%.
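The two answers above can be put together in a minimal sketch. The sample amounts in txn are made up for illustration, plain pandas stands in for the course library, and the use of .assign to add the column is an assumption, since the original blanks are not shown here.

```python
import pandas as pd

def check_lucky(x):
    # Lucky if greater than 500, or if a "5" appears in the number's string form.
    return x > 500 or "5" in str(x)

# Hypothetical transaction amounts, not taken from the real txn DataFrame.
txn = pd.DataFrame({"amount": [1000.04, 5.23, 12.0, 499.99]})

# Apply check_lucky to every amount and store the Booleans in a new column.
luck = txn.assign(is_lucky=txn.get("amount").apply(check_lucky))
print(luck)
```

Here 1000.04 is lucky only because it exceeds 500, 5.23 is lucky only because it contains a 5, and 12.0 and 499.99 satisfy neither condition.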

Problem 1.2

Fill in the blanks below so that lucky_prop evaluates to the proportion of fraudulent "visa" card transactions whose transaction amounts are lucky.

Answer: (a): luck[(luck.get("card")=="visa") & (luck.get("is_fraud"))], (b): get("is_lucky")

(a): The first step in this question is to query the DataFrame so that we have only the rows which are fraudulent transactions from “visa” cards. luck.get("card")=="visa" evaluates to True if and only if the transaction was from a Visa card, so this is the first part of our condition. To find transactions which were fraudulent, we can simply find the rows with a value of True in the "is_fraud" column. We can do this with luck.get("is_fraud"), which is equivalent to luck.get("is_fraud") == True in this case since the "is_fraud" column only contains Trues and Falses. Since we want only the rows where both of these conditions hold, we can combine them with the logical & operator and place the result inside square brackets to query the luck DataFrame, giving us luck[(luck.get("card")=="visa") & (luck.get("is_fraud"))]. Note that we use & instead of the keyword and, since & performs an elementwise comparison between two Series, as we’re doing here, whereas the and keyword compares two individual Booleans (not two Series containing Booleans).

Difficulty: ⭐️⭐️⭐️

The average score on this problem was 52%.

(b): We already have a Boolean column is_lucky indicating if each transaction had a lucky amount. Recall that booleans are equivalent to 1s and 0s, where 1 represents true and 0 represents false, so to find the proportion of lucky amounts we can simply take the mean of the is_lucky column. The reason that taking the mean is equivalent to finding the proportion of lucky amounts comes from the definition of the mean: the sum of all values divided by the number of entries. If all entries are ones and zeros, then summing the values is equivalent to counting the number of ones (Trues) in the Series. Therefore, the mean will be given by the number of Trues divided by the length of the Series, which is exactly the proportion of lucky numbers in the column.

Difficulty: ⭐️⭐️⭐️

The average score on this problem was 61%.
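As a rough illustration of both parts, the sketch below builds a small stand-in for luck (the rows are invented, and plain pandas is assumed), queries it with the two conditions, and takes the mean of the Boolean column.

```python
import pandas as pd

# Hypothetical stand-in for the luck DataFrame.
luck = pd.DataFrame({
    "card": ["visa", "visa", "mastercard", "visa"],
    "is_fraud": [True, True, True, False],
    "is_lucky": [True, False, True, True],
})

# Keep only rows that are both "visa" and fraudulent; & works elementwise on Series.
fraud_visa = luck[(luck.get("card") == "visa") & (luck.get("is_fraud"))]

# The mean of a Boolean column is the proportion of Trues.
lucky_prop = fraud_visa.get("is_lucky").mean()
print(lucky_prop)
```

In this made-up data, two rows are fraudulent Visa transactions and one of them is lucky, so the proportion is one half.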

Problem 1.3

Fill in the blanks below so that lucky_prop is one value in the Series many_props.

Answer: (a): ["card", "is_fraud"], (b): "is_lucky"

(a): lucky_prop is the proportion of fraudulent “visa” card transactions that have a lucky amount. The idea is to create a Series with the proportions of fraudulent or non-fraudulent transactions from each card type that have a lucky amount. To do this, we’ll want to group by the column that describes the card type ("card"), and the column that describes whether a transaction is fraudulent ("is_fraud"). Putting this in the proper syntax for a groupby with multiple columns, we have ["card", "is_fraud"]. The order doesn’t matter, so ["is_fraud", "card"] is also correct.

Difficulty: ⭐️⭐️⭐️

The average score on this problem was 55%.

(b): Once we have this grouped DataFrame, we know that the entry in each column will be the mean of that column for some combination of "is_fraud" and "card". And, since "is_lucky" contains Booleans, we know that this mean is equivalent to the proportion of transactions which were lucky for each "is_fraud" and "card" combination. One such combination is fraudulent “visa” transactions, so lucky_prop is one element of this Series.

Difficulty: ⭐️⭐️⭐️

The average score on this problem was 67%.
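The grouping can be sketched as follows. The data is invented, plain pandas is assumed, and the groupby(...).mean().get(...) shape is inferred from the answers above rather than copied from the original blanks.

```python
import pandas as pd

# Hypothetical stand-in for the luck DataFrame.
luck = pd.DataFrame({
    "card": ["visa", "visa", "mastercard", "visa"],
    "is_fraud": [True, True, True, False],
    "is_lucky": [True, False, True, True],
})

# One mean per (card, is_fraud) combination; for Booleans this is a proportion.
many_props = luck.groupby(["card", "is_fraud"]).mean().get("is_lucky")

# The same proportion computed directly for fraudulent "visa" transactions.
lucky_prop = luck[(luck.get("card") == "visa") & (luck.get("is_fraud"))].get("is_lucky").mean()

print(many_props)
print(lucky_prop in many_props.tolist())
```

Each entry of many_props corresponds to one combination of card type and fraud status, and the fraudulent Visa entry matches lucky_prop.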

Problem 2

Problem 2.1

Notice that bookstore has an index of "ISBN" and sales does not. Why is that?

Answer: The bookstore can sell multiple copies of the same book.

In the sales DataFrame, each row represents an individual sale, meaning multiple rows can have the same "ISBN" if multiple copies of the same book are sold. Therefore we can’t use it as the index because it is not a unique identifier for rows of sales.

Difficulty: ⭐️⭐️

The average score on this problem was 87%.

Problem 2.2

Answer: categorical

Even though "ISBN" consists of numbers, it is used to identify and categorize books rather than to quantify or measure anything, thus it is categorical. It doesn’t make sense to compare ISBN numbers like you would compare numbers on a number line, or to do arithmetic with ISBN numbers.

Difficulty: ⭐️⭐️

The average score on this problem was 75%.

Problem 2.3

Which type of data visualization should be used to compare authors by median rating?

Answer: bar chart

A bar chart is best, as it visualizes numerical values (median ratings) across discrete categories (authors).

Difficulty: ⭐️⭐️

The average score on this problem was 88%.

Problem 3

For each expression below, determine the data type of the output and the value of the expression, if possible. If there is not enough information to determine the expression’s value, write “Unknown” in the corresponding blank.

Problem 3.1

Answer:

type: float
value: Unknown

We know that all values in the column Rent are ints. So, when we call .iloc[43] on this column (which grabs the 44th entry in the column), we know the result will be an int. We then perform some multiplication and division with this value. Importantly, when we divide an int, the type is automatically changed to a float, so the type of the final output will be a float. Since we do not explicitly know what the 44th entry in the Rent column is, the exact value of this float is unknown to us.

Difficulty: ⭐️⭐️

The average score on this problem was 77%.
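The type rule used above can be checked with ordinary Python numbers. The rent value below is made up, and the arithmetic is only an illustration, not the expression from the problem.

```python
rent = 2400  # hypothetical int, standing in for one entry of the Rent column

halved = rent / 2         # true division always produces a float
floor_halved = rent // 2  # floor division of two ints stays an int

print(type(halved), halved)
print(type(floor_halved), floor_halved)
```

Even though 2400 divides evenly by 2, the / operator still returns a float.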

Problem 3.2

Answer:

type: str
value: “w”

This code takes the third entry (the entry at index 2) from the Neighborhood column of apts, which is a str, and it takes the third to last letter of that string. The third entry in the Neighborhood column is 'Midway', and the third to last letter of 'Midway' is 'w'. So, our result is a string with value w.

Difficulty: ⭐️⭐️⭐️

The average score on this problem was 73%.
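Negative indexing on the string from the explanation can be tried directly. The variable name here is an illustrative choice.

```python
neighborhood = "Midway"  # the third entry of the Neighborhood column, per the solution

# Index -1 is the last character, -2 the second to last, -3 the third to last.
letter = neighborhood[-3]
print(letter, type(letter))
```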

Problem 3.3

Answer:

type: int
value: 6

This code deals with the Laundry column of apts, which is a Series of Trues and Falses. One property of Trues and Falses is that they are also interpreted by Python as ones and zeroes. So, the code (apts.get("Laundry") + 5).max() adds five to each of the ones and zeroes in this column, and then takes the maximum value from the column, which would be an int of value 6.

Difficulty: ⭐️⭐️⭐️

The average score on this problem was 69%.
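A small sketch of the Boolean arithmetic, using an invented Laundry column in plain pandas. It assumes, as the solution does, that at least one apartment has a value of True.

```python
import pandas as pd

# Hypothetical Laundry column; the real apts DataFrame is not shown here.
laundry = pd.Series([True, False, True])

# True behaves like 1 and False like 0, so adding 5 gives 6s and 5s.
shifted = laundry + 5
top = shifted.max()

print(shifted.tolist())
print(top)
```

In plain pandas the maximum comes back as a NumPy integer rather than a built-in int, but it still compares equal to 6.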

Problem 3.4

Answer:

type: Series
value: Unknown

This code takes the column (series) "Complex" and returns a new series of True and False values. Each True in the new column is a result of an entry in the "Complex" column containing "Verde". Each False in the new column is a result of an entry in the "Complex" column failing to contain "Verde". Since we are not given the entirety of the "Complex" column, the exact value of the resulting series is unknown to us.

Difficulty: ⭐️⭐️⭐️

The average score on this problem was 64%.
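The behavior described above resembles what .str.contains does on a Series of strings. The sketch below assumes that method is the one in the original expression, which is not shown here, and the complex names are invented.

```python
import pandas as pd

# Hypothetical Complex column.
complexes = pd.Series(["Costa Verde Village", "Renaissance", "Verde Park"])

# One Boolean per entry: True where the name contains "Verde".
has_verde = complexes.str.contains("Verde")

print(type(has_verde))
print(has_verde.tolist())
```

The result is always a Series, but its contents depend on every entry of the column, which is why the value is unknown without the full data.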

Problem 3.5

Answer:

type: bool
value: Unknown

This code finds the median of the column (series) "Sqft" and compares it to a value of 1000, resulting in a bool value of True or False. Since we do not know the median of the "Sqft" column, the exact value of the resulting code is unknown to us.

Difficulty: ⭐️⭐️

The average score on this problem was 87%.
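For comparison with the previous problem, here is a sketch of a median comparison that collapses a whole column to a single True or False. The square footages are invented, and plain pandas is assumed.

```python
import pandas as pd

# Hypothetical Sqft column.
sqft = pd.Series([850, 1200, 1500])

# .median() reduces the Series to one number, so the comparison yields one Boolean.
is_large = sqft.median() > 1000

print(is_large)
```

Unlike a comparison applied to a whole Series, this produces one value rather than one value per row.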
