Lecture 9 — Practice



Lecture 9 — Collected Practice Questions

Below are practice problems tagged for Lecture 9 (rendered directly from the original exam/quiz sources).


Problem 1

Suppose we create a DataFrame called socal containing only King Triton’s flights departing from SAN, LAX, or SNA (John Wayne Airport in Orange County). socal has 10 rows; the bar chart below shows how many of these 10 flights departed from each airport.


Consider the DataFrame that results from merging socal with itself, as follows:

double_merge = socal.merge(socal, left_on='FROM', right_on='FROM')

How many rows does double_merge have?

Answer: 38

There are two flights from LAX. When we merge socal with itself on the 'FROM' column, each of these flights gets paired up with each of these flights, for a total of four rows in the output. That is, the first flight from LAX gets paired with both the first and second flights from LAX. Similarly, the second flight from LAX gets paired with both the first and second flights from LAX.

Following this logic, each of the five flights from SAN gets paired with each of the five flights from SAN, for an additional 25 rows in the output. For SNA, there will be 9 rows in the output. The total is therefore 2^2 + 5^2 + 3^2 = 4 + 25 + 9 = 38 rows.
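
To sanity-check this counting logic, here is a minimal sketch using babypandas. The flight data is hypothetical; only the airport counts (5 SAN, 2 LAX, 3 SNA) match the bar chart described in the problem.

    import babypandas as bpd
    import numpy as np

    # Hypothetical stand-in for socal: 5 flights from SAN, 2 from LAX, 3 from SNA.
    socal = bpd.DataFrame().assign(
        FROM=['SAN'] * 5 + ['LAX'] * 2 + ['SNA'] * 3,
        FLIGHT=np.arange(10)  # dummy flight identifiers
    )

    double_merge = socal.merge(socal, left_on='FROM', right_on='FROM')
    # Each airport with k flights contributes k * k rows to the output.
    print(double_merge.shape[0])  # 5**2 + 2**2 + 3**2 = 38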


Difficulty: ⭐️⭐️⭐️⭐️⭐️

The average score on this problem was 27%.


Problem 2

We also have a DataFrame that contains the distribution of “BodyStyle” for all “Brands” in evs, other than Nissan.

Suppose we’ve run the following few lines of code.

tesla = evs[evs.get("Brand") == "Tesla"]
bmw = evs[evs.get("Brand") == "BMW"]
audi = evs[evs.get("Brand") == "Audi"]

combo = tesla.merge(bmw, on="BodyStyle").merge(audi, on="BodyStyle")

How many rows does the DataFrame combo have?

Answer: 35

Let’s attempt this problem step-by-step. We’ll first determine the number of rows in tesla.merge(bmw, on="BodyStyle"), and then determine the number of rows in combo. For the purposes of the solution, let’s use temp to refer to the first merged DataFrame, tesla.merge(bmw, on="BodyStyle").

Recall that when we merge two DataFrames, the resulting DataFrame contains a single row for every match between the two columns, and rows in either DataFrame without a match disappear. In this problem, the column we’re looking for matches in is "BodyStyle".

To determine the number of rows of temp, we need to determine which rows of tesla have a "BodyStyle" that matches a row in bmw. From the DataFrame provided, we can see that the only "BodyStyle"s in both tesla and bmw are SUV and sedan. When we merge tesla and bmw on "BodyStyle":

  • The 4 SUV rows in tesla each match the 1 SUV row in bmw. This will create 4 SUV rows in temp.
  • The 3 sedan rows in tesla each match the 1 sedan row in bmw. This will create 3 sedan rows in temp.

So, temp is a DataFrame with a total of 7 rows: 4 rows for SUVs and 3 rows for sedans (in the "BodyStyle" column). Now, when we merge temp and audi on "BodyStyle":

  • The 4 SUV rows in temp each match the 8 SUV rows in audi. This will create 4 \cdot 8 = 32 SUV rows in combo.
  • The 3 sedan rows in temp each match the 1 sedan row in audi. This will create 3 \cdot 1 = 3 sedan rows in combo.

Thus, the total number of rows in combo is 32 + 3 = 35.

Note: You may notice that 35 is the result of multiplying the entries of the "SUV" and "Sedan" columns in the DataFrame provided and adding up the results. This problem is similar to Problem 5 from the Fall 2021 Midterm.
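
As a quick arithmetic check, the sum-of-products shortcut from the note can be computed directly. The body-style counts below are the ones quoted in the solution:

    # {body style: (count in tesla, count in bmw, count in audi)}
    counts = {'SUV': (4, 1, 8), 'sedan': (3, 1, 1)}

    # Each shared body style contributes the product of the three counts,
    # so the total row count of combo is the sum of these products.
    total = sum(t * b * a for t, b, a in counts.values())
    print(total)  # 4*1*8 + 3*1*1 = 35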


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 45%.


Source: fa22-midterm — Q7

Problem 3

Below, we provide the same DataFrame as shown at the start of the previous problem, which contains the distribution of “BodyStyle” for all “Brands” in evs, other than Nissan.

Suppose we’ve run the following few lines of code.

tesla = evs[evs.get("Brand") == "Tesla"]
bmw = evs[evs.get("Brand") == "BMW"]
audi = evs[evs.get("Brand") == "Audi"]

combo = tesla.merge(bmw, on="BodyStyle").merge(audi, on="BodyStyle")

How many rows does the DataFrame combo have?

Answer: 35

Let’s attempt this problem step-by-step. We’ll first determine the number of rows in tesla.merge(bmw, on="BodyStyle"), and then determine the number of rows in combo. For the purposes of the solution, let’s use temp to refer to the first merged DataFrame, tesla.merge(bmw, on="BodyStyle").

Recall that when we merge two DataFrames, the resulting DataFrame contains a single row for every match between the two columns, and rows in either DataFrame without a match disappear. In this problem, the column we’re looking for matches in is "BodyStyle".

To determine the number of rows of temp, we need to determine which rows of tesla have a "BodyStyle" that matches a row in bmw. From the DataFrame provided, we can see that the only "BodyStyle"s in both tesla and bmw are SUV and sedan. When we merge tesla and bmw on "BodyStyle":

  • The 4 SUV rows in tesla each match the 1 SUV row in bmw. This will create 4 SUV rows in temp.
  • The 3 sedan rows in tesla each match the 1 sedan row in bmw. This will create 3 sedan rows in temp.

So, temp is a DataFrame with a total of 7 rows: 4 rows for SUVs and 3 rows for sedans (in the "BodyStyle" column). Now, when we merge temp and audi on "BodyStyle":

  • The 4 SUV rows in temp each match the 8 SUV rows in audi. This will create 4 \cdot 8 = 32 SUV rows in combo.
  • The 3 sedan rows in temp each match the 1 sedan row in audi. This will create 3 \cdot 1 = 3 sedan rows in combo.

Thus, the total number of rows in combo is 32 + 3 = 35.

Note: You may notice that 35 is the result of multiplying the entries of the "SUV" and "Sedan" columns in the DataFrame provided and adding up the results. This problem is similar to Problem 5 from the Fall 2021 Midterm.


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 45%.


Source: fa23-final — Q2

Problem 4

Consider the DataFrame combo, defined below.

    combo = txn.groupby(["is_fraud", "method", "card"]).mean()


Problem 4.1

What is the maximum possible value of combo.shape[0]? Give your answer as an integer.

Answer: 16

combo.shape[0] will give us the number of rows of the combo DataFrame. Since we’re grouping by "is_fraud", "method", and "card", we will have one row for each unique combination of values in these columns. There are 2 possible values for "is_fraud", 2 possible values for "method", and 4 possible values for "card", so the total number of possibilities is 2 * 2 * 4 = 16. This is the maximum number possible because 16 combinations of "is_fraud", "method", and "card" are possible, but they may not all exist in the data.


Difficulty: ⭐️⭐️

The average score on this problem was 75%.


Problem 4.2

What is the value of combo.shape[1]?

Answer: 2

combo.shape[1] will give us the number of columns of the DataFrame. In this case, we’re using .mean() as our aggregation function, so the resulting DataFrame will only have columns with numeric types (since BabyPandas automatically ignores columns which have a data type incompatible with the aggregation function). In this case, "amount" and "lifetime" are the only numeric columns, so combo will have 2 columns.
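
To see both parts of this problem concretely, here is a minimal sketch with made-up transaction data; only the column structure is meant to mirror txn.

    import babypandas as bpd

    # Hypothetical miniature version of txn.
    txn_mini = bpd.DataFrame().assign(
        is_fraud=[True, False, False, True],
        method=['chip', 'swipe', 'chip', 'swipe'],
        card=['Visa', 'Visa', 'Mastercard', 'Amex'],
        amount=[12.5, 40.0, 7.25, 99.0],   # numeric, so kept by .mean()
        lifetime=[3, 10, 2, 5],            # numeric, so kept by .mean()
    )

    combo = txn_mini.groupby(['is_fraud', 'method', 'card']).mean()
    # One row per combination that actually appears in the data, and one
    # column per remaining numeric column.
    print(combo.shape)  # (4, 2): 4 observed combinations, 2 numeric columns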


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 47%.



Source: fa23-final — Q5

Problem 5

The DataFrame seven, shown below to the left, consists of a simple random sample of 7 rows from txn, with just the "is_fraud" and "amount" columns selected.

The DataFrame locations, shown below to the right, is missing some values in its "is_fraud" column.


Problem 5.1

Fill in the blanks to complete the "is_fraud" column of locations so that the DataFrame seven.merge(locations, on="is_fraud") has 19 rows.

Answer A correct answer has one True and three False rows.

We’re merging on the "is_fraud" column, so we want to look at which rows have which values for "is_fraud". There are only two possible values (True and False), and we see that there are two Trues and five Falses in seven.

Now, think about what happens “under the hood” for this merge, and how many rows are created when it occurs. Python will match each True in seven with each True in the "is_fraud" column of locations, and make a new row for each such pair. For example, since Toronto’s row in locations has a True value, the merged DataFrame will have one row where Toronto is matched with the transaction of $34.92 and one where Toronto is matched with the transaction of $25.07. More broadly, each True in locations creates 2 rows in the merged DataFrame, and each False in locations creates 5 rows in the merged DataFrame.

The question now boils down to creating 19 by summing 2s and 5s. Notice that 19 = 3\cdot5+2\cdot2. This means we can achieve the desired 19 rows by making sure the locations DataFrame has three False rows and two True rows. Since locations already has one True, we can fill in the remaining spots with three Falses and one True. It doesn’t matter which rows we make True and which ones we make False, since either way the merge will produce the same number of rows (5 for every False and 2 for every True).


Difficulty: ⭐️⭐️

The average score on this problem was 88%.


Problem 5.2

True or False: It is possible to fill in the four blanks in the "is_fraud" column of locations so that the DataFrame seven.merge(locations, on="is_fraud") has 14 rows.

Answer: False

As we discovered by solving problem 5.1, each False value in locations gives rise to 5 rows of the merged DataFrame, and each True value gives rise to 2 rows. This means that the number of rows in the merged DataFrame will be m\cdot5 + n\cdot2, where m is the number of Falses in locations and n is the number of Trues in locations. Namely, m and n are integers that add up to 5. There are only a few possibilities, so we can try them all and see that none add up to 14 (the short sketch after this list automates the check):

  • 0\cdot5 + 5\cdot2 = 10

  • 1\cdot5 + 4\cdot2 = 13

  • 2\cdot5 + 3\cdot2 = 16

  • 3\cdot5 + 2\cdot2 = 19

  • 4\cdot5 + 1\cdot2 = 22
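
Since the case analysis is so small, it can be automated. Here is a sketch that enumerates every way to fill the four blanks, recalling that locations already contains one True:

    # Each False in locations contributes 5 rows; each True contributes 2.
    for n_true in range(1, 6):     # at least one True is already filled in
        n_false = 5 - n_true
        print(n_false, n_true, 5 * n_false + 2 * n_true)
    # Prints row counts 22, 19, 16, 13, 10 -- 14 never appears.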


Difficulty: ⭐️⭐️

The average score on this problem was 79%.



Source: fa23-midterm — Q4

Problem 6

For those who plan on having children, an important consideration when deciding whether to live in an area is the cost of raising children in that area. The DataFrame expensive, defined below, contains all of the rows in living_cost where the "avg_childcare_cost" is at least $20,000.

expensive = living_cost[living_cost.get("avg_childcare_cost") 
                        >= 20000]

We’ll call a county an “expensive county" if there is at least one "family_type" in that county with an "avg_childcare_cost" of at least $20,000. Note that all expensive counties appear in the expensive DataFrame, but some may appear multiple times (if they have multiple "family_type"s with an "avg_childcare_cost" of at least $20,000).

Recall that the "is_metro" column contains Boolean values indicating whether or not each county is part of a metropolitan (urban) area. For all rows of living_cost (and, hence, expensive) corresponding to the same geographic location, the value of "is_metro" is the same. For instance, every row corresponding to San Diego County has an "is_metro" value of True.

Fill in the blanks below so that the result is a DataFrame indexed by "state" where the "is_metro" column gives the proportion of expensive counties in each state that are part of a metropolitan area. For example, if New Jersey has five expensive counties and four of them are metropolitan, the row corresponding to New Jersey should have a value of 0.8 in the "is_metro" column.

(expensive.groupby(____(a)____).max()
          .reset_index()
          .groupby(____(b)____).____(c)____)


Problem 6.1

What goes in blank (a)?

Answer: ["state", "county"] or ["county", "state"]

We are told that all expensive counties appear in the expensive DataFrame, but some may appear multiple times, for several different "family_type" values. The question we want to answer, however, is about the proportion of expensive counties in each state that are part of a metropolitan area, which has nothing to do with "family_type". In other words, we don’t want or need multiple rows corresponding to the same US county.

To keep just one row for each US county, we can group by both "state" and "county" (in either order). Then the resulting DataFrame will have one row for each unique combination of "state" and "county", or one row for each US county. The .max() aggregation method keeps the maximum value from the "is_metro" column in each US county. If there are multiple rows in expensive corresponding to the same US county, we are told that they will all have the same value in the "is_metro" column, so taking the maximum just takes any one of these values, which are all the same. We could have just as easily taken the minimum.

Notice the presence of .reset_index() in the provided code. That is a clue that we may need to group by multiple columns in this problem!


Difficulty: ⭐️⭐️⭐️⭐️⭐️

The average score on this problem was 14%.


Problem 6.2

What goes in blank (b)?

Answer: "state"

Now that we have one row for each US county that is considered expensive, we want to proceed by calculating the proportion of expensive counties within each state that are in a metropolitan area. Our goal is to organize the counties by state and create a DataFrame indexed only by "state" so we want to group by "state" to achieve this.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 68%.


Problem 6.3

What goes in blank (c)?

Answer: mean()

Recall that the "is_metro" column consists of Boolean values, where True equals 1 and False equals 0. Notice that if we take the average of the "is_metro" column for all the counties in a given state, we’ll be computing the sum of these 0s and 1s (or the number of True values) divided by the total number of expensive counties in that state. This gives the proportion of expensive counties in the state that are in a metropolitan area. Thus, when we group the expensive counties according to what state they are in, we can use the .mean() aggregation method to calculate the proportion of expensive counties in each state that are in a metropolitan area.


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 35%.



Source: fa23-midterm — Q6

Problem 7

Recall that living_cost has 31430 rows, one for each of the ten possible "family_type" values in each of the 3143 US counties.

Consider the function state_merge, defined below.

def state_merge(A, B):
    state_A = living_cost[living_cost.get("state") == A]
    state_B = living_cost[living_cost.get("state") == B]
    return state_A.merge(state_B, on="family_type").shape[0]

Suppose Montana ("MT") has 5 counties, and suppose state_merge("MT", "NV") evaluates to 1050. How many counties does Nevada ("NV") have? Give your answer as an integer.

Answer: 21

We are told Montana has 5 counties. We don’t know how many counties Nevada has, but let’s call the number of counties in Nevada x and see how many rows the merged DataFrame should have, in terms of x. If Montana has 5 counties, since there are 10 "family_type" values per county, this means the state_A DataFrame has 50 rows.

Similarly, if Nevada has x counties, then state_B has 10x rows. When we merge on "family_type", each of the 5 rows in state_A with a given "family_type" (say "2a3c") will match with each of the x rows in state_B with that same "family_type". This will lead to 5x rows in the output corresponding to each "family_type", and since there are 10 different values for "family_type", this means the final output will have 50x rows.

We are told that the merged DataFrame has 1050 rows, so we can find x by solving 50x = 1050, which leads to x = 21.
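
If you’d like to verify the 50x counting argument on synthetic data, here is a hedged sketch; the labels are invented, and only the row counts matter.

    import babypandas as bpd

    family_types = ['type_' + str(i) for i in range(10)]  # 10 family types

    # 5 Montana counties and 21 Nevada counties, one row per
    # (county, family_type) pair, mimicking state_A and state_B.
    state_A = bpd.DataFrame().assign(family_type=family_types * 5)    # 50 rows
    state_B = bpd.DataFrame().assign(family_type=family_types * 21)   # 210 rows

    # Each family_type matches 5 * 21 = 105 times; 10 types give 1050 rows.
    print(state_A.merge(state_B, on='family_type').shape[0])  # 1050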


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 36%.


Source: fa24-midterm — Q8

Problem 8

Select the correct way to fill in the blank such that the code below evaluates to True.

treat.groupby(______).mean().shape[0] == treat.shape[0]

Answer: ["address", "candy"]

.shape returns a tuple containing the number of rows and the number of columns of a DataFrame, respectively, so .shape[0] gives the number of rows. In this question, we are checking whether the number of rows of treat after grouping equals the number of rows of the original treat itself. This is only possible when there is a unique row for each value in the grouping column, or for each combination of values in the grouping columns. Since it is possible for an address to give out different types of candy, values in "address" can show up multiple times. Similarly, values in "candy" can also show up multiple times, since more than one house may give out a specific candy. And since a neighborhood typically has more than one house, "neighborhood" will also appear multiple times.

Each address gives out a specific candy only once, so ["address", "candy"] has a unique row for each combination. This makes the number of rows in the grouped DataFrame equal to the number of rows in treat itself. Multiple neighborhoods might give out the same candy, or a single neighborhood could give out multiple candies, so ["candy", "neighborhood"] is not the answer. Finally, a neighborhood can have multiple addresses, and each address could give out more than one candy, so an ["address", "neighborhood"] combination could occur multiple times in treat, meaning this is not an answer either. Since ["address", "candy"] is the only choice that gives a unique row for each combination, the grouped DataFrame contains the same number of rows as treat itself.
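
The general test here is whether the grouping column(s) uniquely identify the rows. Here is a minimal sketch with hypothetical treat-style data, using .count() in place of .mean() so the row-count logic is visible regardless of column types:

    import babypandas as bpd

    # Hypothetical treat data: one row per (address, candy) pair.
    treat = bpd.DataFrame().assign(
        address=['12 Elm', '12 Elm', '9 Oak', '4 Pine'],
        candy=['Twix', 'KitKat', 'Twix', 'KitKat'],
        neighborhood=['North', 'North', 'North', 'South'],
    )

    # Grouping by a combination that is unique per row keeps every row...
    print(treat.groupby(['address', 'candy']).count().shape[0] == treat.shape[0])  # True
    # ...while grouping by a column with repeated values collapses rows.
    print(treat.groupby('candy').count().shape[0] == treat.shape[0])  # False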


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 69%.


Source: fa24-midterm — Q9

Problem 9

Assume that all houses in treat give out the same size candy, say fun-sized. Suppose we have an additional DataFrame, trick, which is indexed by "candy" and has one column, "price", containing the cost in dollars of a single piece of fun-sized candy, as a float.

Suppose that:

  • trick has 25 rows: 15 of its candies also appear in treat, while the other 10 do not.

  • treat has 200 rows.

Consider the following line of code:

trick_or_treat = trick.merge(treat, left_index = True, right_on = "candy")

How many rows does trick_or_treat have?

Answer: 200

We are told that trick has 25 rows: 15 for candies that are in treat and 10 additional candies. This means that each candy in trick appears exactly once, because 15 + 10 = 25. In addition, a general property of merging DataFrames is that the number of rows created for one shared value is the product of the number of occurrences of that value in each DataFrame. For example, if Twix occurs 5 times in treat, the number of times it occurs in trick_or_treat is 5 * 1 = 5 (it occurs once in trick). Using this logic, we can determine how many rows are in trick_or_treat: since each candy’s count in treat is multiplied by one, and those counts sum to 200, the number of rows will be 200.


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 39%.


Problem 10

Recall that we have the complete set of currently available discounts in the DataFrame offers.

The DataFrame with_offers is created as follows.

    with_offers = (ikea.take(np.arange(6))
                       .merge(offers, left_on='category',
                              right_on='eligible_category'))


Problem 10.1

How many rows does with_offers have?

Answer: 11

First, consider the DataFrame ikea.take(np.arange(6)), which contains the first six rows of ikea. We know the contents of these first six rows from the preview of the DataFrame at the start of this exam. To merge with offers, we need to look at the 'category' of each of these six rows and see how many rows of offers have the same value in the 'eligible_category' column.

The first row of ikea.take(np.arange(6)) is a 'bed'. There are two 'bed' offers, so that will create two rows in the output DataFrame. Similarly, the second row of ikea.take(np.arange(6)) creates two rows in the output DataFrame because there are two 'outdoor' offers. The third row creates just one row in the output, since there is only one 'dresser' offer. Continuing row-by-row in this way, we can sum the number of rows created to get: 2+2+1+2+2+2 = 11.

Pandas Tutor is a really helpful tool to visualize the merge process.


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 41%.


Problem 10.2

How many rows of with_offers have a value of 20 in the 'percent_off' column?

Answer: 2

There is just one offer with a value of 20 in the 'percent_off' column, and this corresponds to an offer on a 'bed'. Since there are two rows of ikea.take(np.arange(6)) with a 'category' of 'bed', each of these will match with the 20 percent-off offer, creating two rows of with_offers with a value of 20 in the 'percent_off' column.

A Pandas Tutor visualization of this merge confirms our answer: the two rows with a value of 20 in the 'percent_off' column appear as rows 0 and 2 of the output DataFrame.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 70%.


Problem 10.3

If you can use just one offer per product, you’d want to use the one that saves you the most money, which we’ll call the best offer.

True or False: The expression below evaluates to a Series indexed by 'product' with the name of the best offer for each product that appears in the with_offers DataFrame.

with_offers.groupby('product').max().get('offer')

Answer: False

Recall that groupby applies the aggregation function separately to each column. Applying the .max() aggregate on the 'offer' column for each group gives the name that is latest in alphabetical order because it contains strings, whereas applying the .max() aggregate on the 'percent_off' column gives the largest numerical value. These don’t necessarily go together in with_offers.

In particular, the element of with_offers.groupby('product').max().get('offer') corresponding to the LAPPLAND TV storage unit will be 'get_organized_promo'. This happens because the two rows of with_offers corresponding to the LAPPLAND TV storage unit have values of 'get_organized_promo' and 'birthday coupon', but 'get_organized_promo' is alphabetically later, so it’s considered the max by .groupby. However, the 'birthday coupon' is actually a better offer, since it’s 15 percent off, while the 'get_organized_promo' is only 10 percent off. The expression does not actually find the best offer for each product, but instead finds the latest alphabetical offer for each product.



Difficulty: ⭐️⭐️⭐️

The average score on this problem was 69%.


Problem 10.4

You want to add a column to with_offers containing the price after the offer discount is applied.

with_offers = with_offers.assign(after_price = _________)
with_offers

Which of the following could go in the blank? Select all that apply.

Answer: with_offers.get('price')*(100 - with_offers.get('percent_off'))/100, (with_offers.get('price') - with_offers.get('price')*with_offers.get('percent_off')/100)

Notice that all the answer choices use with_offers.get('price'), which is a Series of prices, and with_offers.get('percent_off'), which is a Series of associated percentages off. Using Series arithmetic, which works element-wise, the goal is to create a Series of prices after the discount is applied.

For example, the first row of with_offers corresponds to an item with a price of 254 dollars, and a discount of 20 percent off, coming from the snooze sale. This means its price after the discount should be 80 percent of the original value, which is 254*0.8 = 203.2 dollars.

Let’s go through each answer in order, working with this example.

The first answer choice takes the price and subtracts the percent off divided by 100. For our example, this would compute the discounted price as 254 - 20/100 = 253.8 dollars, which is incorrect.

The second answer choice multiplies the price by the quantity 100 minus the percent off, then divides by 100. This works because when we subtract the percent off from 100 and divide by 100, the result represents the proportion or fraction of the cost we must pay, and so multiplying by the price gives the price after the discount. For our example, this comes out to 254*(100-20)/100= 203.2 dollars, which is correct.

The third answer choice is also correct. This corresponds to taking the original price and subtracting from it the dollar amount off, which comes from multiplying the price by the percent off and dividing by 100. For our example, this would be computed as 254 - 254*20/100 = 203.2 dollars, which is correct.

The fourth answer multiplies the price by the quantity 100 minus the percent off divided by 100. For our example, this would compute the discounted price as 254*(100 - 20/100) = 25349.2 dollars, a number that’s nearly one hundred times the original price!

Therefore, only the second and third answer choices are correct.
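
For concreteness, here is a sketch applying the two correct expressions to a toy version of the relevant columns. The 254-dollar, 20-percent row is the example from the solution; the second row is made up.

    import babypandas as bpd

    mini = bpd.DataFrame().assign(price=[254.0, 100.0], percent_off=[20, 15])

    # Second answer choice: pay (100 - percent_off) percent of the price.
    option_2 = mini.get('price') * (100 - mini.get('percent_off')) / 100
    # Third answer choice: subtract the dollar discount from the price.
    option_3 = mini.get('price') - mini.get('price') * mini.get('percent_off') / 100

    print(option_2.iloc[0], option_3.iloc[0])  # 203.2 203.2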


Difficulty: ⭐️⭐️

The average score on this problem was 79%.



Problem 11

For each offer currently available, the DataFrame offers has a separate row for each category of products to which the offer can be applied. The columns are:

The full offers DataFrame is shown below. All rows are pictured.


The DataFrame with_offers is created as follows.

    with_offers = (ikea.take(np.arange(6))
                       .merge(offers, left_on='category',
                              right_on='eligible_category'))


Problem 11.1

How many rows does with_offers have?

Answer: 11

First, consider the DataFrame ikea.take(np.arange(6)), which contains the first six rows of ikea. We know the contents of these first six rows from the preview of the DataFrame at the start of this exam. To merge with offers, we need to look at the 'category' of each of these six rows and see how many rows of offers have the same value in the 'eligible_category' column.

The first row of ikea.take(np.arange(6)) is a 'bed'. There are two 'bed' offers, so that will create two rows in the output DataFrame. Similarly, the second row of ikea.take(np.arange(6)) creates two rows in the output DataFrame because there are two 'outdoor' offers. The third row creates just one row in the output, since there is only one 'dresser' offer. Continuing row-by-row in this way, we can sum the number of rows created to get: 2+2+1+2+2+2 = 11.

Pandas Tutor is a really helpful tool to visualize the merge process.


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 41%.


Problem 11.2

How many rows of with_offers have a value of 20 in the 'percent_off' column?

Answer: 2

There is just one offer with a value of 20 in the 'percent_off' column, and this corresponds to an offer on a 'bed'. Since there are two rows of ikea.take(np.arange(6)) with a 'category' of 'bed', each of these will match with the 20 percent-off offer, creating two rows of with_offers with a value of 20 in the 'percent_off' column.

A Pandas Tutor visualization of this merge confirms our answer: the two rows with a value of 20 in the 'percent_off' column appear as rows 0 and 2 of the output DataFrame.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 70%.


Problem 11.3

If you can use just one offer per product, you’d want to use the one that saves you the most money, which we’ll call the best offer.

True or False: The expression below evaluates to a Series indexed by 'product' with the name of the best offer for each product that appears in the with_offers DataFrame.

with_offers.groupby('product').max().get('offer')

Answer: False

Recall that groupby applies the aggregation function separately to each column. Applying the .max() aggregate on the 'offer' column for each group gives the name that is latest in alphabetical order because it contains strings, whereas applying the .max() aggregate on the 'percent_off' column gives the largest numerical value. These don’t necessarily go together in with_offers.

In particular, the element of with_offers.groupby('product').max().get('offer') corresponding to the LAPPLAND TV storage unit will be 'get_organized_promo'. This happens because the two rows of with_offers corresponding to the LAPPLAND TV storage unit have values of 'get_organized_promo' and 'birthday coupon', but 'get_organized_promo' is alphabetically later, so it’s considered the max by .groupby. However, the 'birthday coupon' is actually a better offer, since it’s 15 percent off, while the 'get_organized_promo' is only 10 percent off. The expression does not actually find the best offer for each product, but instead finds the latest alphabetical offer for each product.



Difficulty: ⭐️⭐️⭐️

The average score on this problem was 69%.


Problem 11.4

You want to add a column to with_offers containing the price after the offer discount is applied.

with_offers = with_offers.assign(after_price = _________)
with_offers

Which of the following could go in the blank? Select all that apply.

Answer: with_offers.get('price')*(100 - with_offers.get('percent_off'))/100, (with_offers.get('price') - with_offers.get('price')*with_offers.get('percent_off')/100)

Notice that all the answer choices use with_offers.get('price'), which is a Series of prices, and with_offers.get('percent_off'), which is a Series of associated percentages off. Using Series arithmetic, which works element-wise, the goal is to create a Series of prices after the discount is applied.

For example, the first row of with_offers corresponds to an item with a price of 254 dollars, and a discount of 20 percent off, coming from the snooze sale. This means its price after the discount should be 80 percent of the original value, which is 254*0.8 = 203.2 dollars.

Let’s go through each answer in order, working with this example.

The first answer choice takes the price and subtracts the percent off divided by 100. For our example, this would compute the discounted price as 254 - 20/100 = 253.8 dollars, which is incorrect.

The second answer choice multiplies the price by the quantity 100 minus the percent off, then divides by 100. This works because when we subtract the percent off from 100 and divide by 100, the result represents the proportion or fraction of the cost we must pay, and so multiplying by the price gives the price after the discount. For our example, this comes out to 254*(100-20)/100= 203.2 dollars, which is correct.

The third answer choice is also correct. This corresponds to taking the original price and subtracting from it the dollar amount off, which comes from multiplying the price by the percent off and dividing by 100. For our example, this would be computed as 254 - 254*20/100 = 203.2 dollars, which is correct.

The fourth answer multiplies the price by the quantity 100 minus the percent off divided by 100. For our example, this would compute the discounted price as 254*(100 - 20/100) = 25349.2 dollars, a number that’s nearly one hundred times the original price!

Therefore, only the second and third answer choices are correct.


Difficulty: ⭐️⭐️

The average score on this problem was 79%.



Problem 12

Fill in the blank in the code below so that chronological is a DataFrame with the same rows as sungod, but ordered chronologically by appearance on stage. That is, earlier years should come before later years, and within a single year, artists should appear in the DataFrame in the order they appeared on stage at Sun God. Note that groupby automatically sorts the index in ascending order.

chronological = sungod.groupby(___________).max().reset_index()

Answer: ['Year', 'Appearance_Order']

The fact that groupby automatically sorts the index in ascending order is important here. Since we want earlier years before later years, we could group by 'Year'; however, if we group by 'Year' alone, all the artists who performed in a given year will be aggregated together, which is not what we want. Within each year, we want to organize the artists in ascending order of 'Appearance_Order'. In other words, we need to group by 'Year' with 'Appearance_Order' as subgroups. Therefore, the correct way to reorder the rows of sungod as desired is sungod.groupby(['Year', 'Appearance_Order']).max().reset_index(). Note that we need to reset the index so that the resulting DataFrame has 'Year' and 'Appearance_Order' as columns, like in sungod.
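
Here is a minimal sketch of why this works, using made-up Sun God data:

    import babypandas as bpd

    sungod = bpd.DataFrame().assign(
        Year=[2015, 2014, 2015, 2014],
        Appearance_Order=[2, 1, 1, 2],
        Artist=['A', 'B', 'C', 'D'],
    )

    chronological = sungod.groupby(['Year', 'Appearance_Order']).max().reset_index()
    # Each (Year, Appearance_Order) pair is unique, so .max() just keeps each
    # artist, and the rows come out sorted by Year, then Appearance_Order.
    print(chronological.get('Artist'))  # B, D, C, A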


Difficulty: ⭐️⭐️

The average score on this problem was 85%.


Problem 13

Another DataFrame called music contains a row for every music artist that has ever released a song. The columns are:

You want to know how many musical genres have been represented at Sun God since its inception in 1983. Which of the following expressions produces a DataFrame called merged that could help determine the answer?

Answer: merged = sungod.merge(music, left_on='Artist', right_on='Name')

The question we want to answer is about Sun God music artists’ genres. In order to answer, we’ll need a DataFrame consisting of rows of artists that have performed at Sun God since its inception in 1983. If we merge the sungod DataFrame with the music DataFrame based on the artist’s name, we’ll end up with a DataFrame containing one row for each artist that has ever performed at Sun God. Since the column containing artists’ names is called 'Artist' in sungod and 'Name' in music, the correct syntax for this merge is merged = sungod.merge(music, left_on='Artist', right_on='Name'). Note that we could also interchange the left DataFrame with the right DataFrame, as swapping the roles of the two DataFrames in a merge only changes the ordering of rows and columns in the output, not the data itself. This can be written in code as merged = music.merge(sungod, left_on='Name', right_on='Artist'), but this is not one of the answer choices.


Difficulty: ⭐️⭐️

The average score on this problem was 86%.


Problem 14

In the next few parts, consider the following answer choices.

  A. The name of the country with the most cities.

  B. The name of the country with the fewest cities.

  C. The number of cities in the country with the most cities.

  D. The number of cities in the country with the fewest cities.

  E. The last city, alphabetically, in the first country, alphabetically.

  F. The first city, alphabetically, in the first country, alphabetically.

  G. Nothing, because it errors.


Problem 14.1

What does the following expression evaluate to?

sun.groupby("Country").max().get("City").iloc[0]

Answer: E. The last city, alphabetically, in the first country, alphabetically.

Let’s break down the code:

  • sun.groupby("Country").max(): This line of code groups the sun DataFrame by the "Country" column and then determines the maximum for every other column within each country group. Since the values in the "City" column are stored as strings, and the maximum of a Series of strings is the last string alphabetically, the values in the "City" column of this DataFrame will contain the last city, alphabetically, of each country.

  • .get("City"): .get("City") accesses the "City" column.

  • .iloc[0]: Finally, .iloc[0] selects the "City" value from the first row. The first row corresponds to the first country alphabetically because groupby sorted the DataFrame by "Country" in ascending order. The value in the "City" column that .iloc[0] selects, then, is the name of the last city, alphabetically, in the first country, alphabetically.


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 36%.



Problem 14.2

What does the following expression evaluate to?

sun.groupby("Country").sum().get("City").iloc[0]

Answer: G. Nothing, because it errors.

Let’s break down the code:

  • sun.groupby("Country").sum(): This groups the sun DataFrame by the "Country" column and computes the sum for each numeric column within each country group. Since "City" is non-numeric, it will be dropped.

  • .get("City"): This operation attempts to retrieve the "City" column from the resulting DataFrame. However, since the "City" column was dropped in the previous step, this will raise a KeyError, indicating that the column is not present in the DataFrame.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 73%.



Problem 14.3

What does the following expression evaluate to?

sun.groupby("Country").count().sort_values("Jan").index[-1]

Answer: A. The name of the country with the most cities.

Let’s break down the code:

  • sun.groupby("Country").count(): This groups the sun DataFrame by the "Country" column. The .count() method then returns the number of rows in each group for each column. Since we’re grouping by "Country", and since the rows in sun correspond to cities, this is counting the number of cities in each country.

  • .sort_values("Jan"): The result of the previous operation is a DataFrame with "Country" as the index and the number of cities per country stored in every other column. The "City", "Jan", "Feb", "Mar", etc. columns in the resulting DataFrame all contain the same information. Sorting by "Jan" sorts the DataFrame by the number of cities each country has in ascending order.

  • .index[-1]: This retrieves the last index value from the sorted DataFrame, which corresponds to the name of the country with the most cities.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 61%.


Problem 14.4

What does the following expression evaluate to?

sun.groupby("Country").count().sort_values("City").get("City").iloc[-1]

Answer: C. The number of cities in the country with the most cities.

Let’s break down the code:

  • sun.groupby("Country").count(): This groups the sun DataFrame by the "Country" column. The .count() method then returns the number of rows in each group for each column. Since we’re grouping by "Country", and since the rows in sun correspond to cities, this is counting the number of cities in each country.

  • .sort_values("City"): The result of the previous operation is a DataFrame with "Country" as the index and the number of "City"s per "Country" stored in every other column. The "City", "Jan", "Feb", "Mar", etc. columns in the resulting DataFrame all contain the same information. Sorting by "City" sorts the DataFrame by the number of cities each country has in ascending order.

  • .get("City"): This retrieves the "City" column from the sorted DataFrame, which contains the number of cities in each country.

  • .iloc[-1]: This gets the last value from the "City" column, which corresponds to the number of cities in the country with the most cities.
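
Since one toy dataset can illustrate the expressions in this problem, here is a hedged sketch; the cities, countries, and sunshine hours below are invented.

    import babypandas as bpd

    sun_mini = bpd.DataFrame().assign(
        Country=['Albania', 'Albania', 'Brazil'],
        City=['Tirana', 'Durres', 'Rio de Janeiro'],
        Jan=[124.0, 130.0, 211.0],
    )

    # Last city alphabetically in the first country alphabetically:
    print(sun_mini.groupby('Country').max().get('City').iloc[0])  # 'Tirana'

    counts = sun_mini.groupby('Country').count()
    # Name of the country with the most cities:
    print(counts.sort_values('Jan').index[-1])  # 'Albania'
    # Number of cities in the country with the most cities:
    print(counts.sort_values('City').get('City').iloc[-1])  # 2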


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 57%.



Problem 15

Vanessa is a big Formula 1 racing fan, and wants to plan a trip to Monaco, where the Monaco Grand Prix is held. Monaco is an example of a “city-state” — that is, a city that is its own country. Singapore is another example of a city-state.

We’ll say that a row of sun corresponds to a city-state if its "Country" and "City" values are equal.


Problem 15.1

Fill in the blanks so that the expression below is equal to the total number of sunshine hours in October of all city-states in sun.

    sun[__(a)__].__(b)__

What goes in blanks (a) and (b)?

Answer: (a): sun.get("Country") == sun.get("City"), (b): .get("Oct").sum()

What goes in blank (a)? sun.get("Country") == sun.get("City")
This expression compares the "Country" column to the "City" column for each row in the sun DataFrame. It returns a Boolean Series where each value is True if the corresponding "Country" and "City" are the same (indicating a city-state) and False otherwise.


Difficulty: ⭐️⭐️

The average score on this problem was 79%.



What goes in blank (b)? .get("Oct").sum()
Here, we select the "Oct" column, which represents the sunshine hours in October, and compute the sum of its values. By using this after querying for city-states, we calculate the total sunshine hours in October across all city-states in the sun DataFrame.


Difficulty: ⭐️⭐️

The average score on this problem was 85%.



Source: sp23-final — Q1

Problem 16


Problem 16.1

Complete the implementation of the function most_sunshine, which takes in country, the name of a country, and month, the name of a month (e.g. "Apr"), and returns the name of the city (as a string) in country with the most sunshine hours in month, among the cities in sun. Assume there are no ties.

    def most_sunshine(country, month):
        country_only = __(a)__
        return country_only.__(b)__

What goes in blanks (a) and (b)?

Answer: (a): sun[sun.get("Country") == country], (b): sort_values(month).get("City").iloc[-1] or sort_values(month, ascending=False).get("City").iloc[0]

What goes in blank (a)? sun[sun.get("Country") == country] To identify cities only within the specified country, we need to query for the rows in the sun DataFrame where the "Country" column matches the given country. The expression sun.get("Country") == country creates a Boolean Series, where each entry is True if the corresponding row’s "Country" column matches the provided country and False otherwise. When this Boolean Series is used to index into the sun DataFrame, it keeps only the rows for which sun.get("Country") == country is True, effectively giving us only the cities from the specified country.


Difficulty: ⭐️⭐️

The average score on this problem was 78%.



What goes in blank (b)? sort_values(month).get("City").iloc[-1] or sort_values(month, ascending=False).get("City").iloc[0]
To determine the city with the most sunshine hours in the specified month, we sort the queried DataFrame (which only contains cities from the specified country) based on the values in the month column. There are two ways to achieve the desired result:

  • Sort in ascending (increasing) order and then select the last item. This is because after sorting in ascending order, the city with the most sunshine hours will be at the bottom of the DataFrame. We use .iloc[-1] to get the last item after selecting the "City" column with .get("City").
  • Sort in descending order and then select the first item. After sorting in descending order, the city with the most sunshine hours will be at the top of the DataFrame. We use .iloc[0] to get the first item after selecting the "City" column with .get("City").

Both methods will give us the name of the city with the most sunshine hours in the specified month.
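
Assembling the two blanks (using the first variant of blank (b)), the completed function reads:

    def most_sunshine(country, month):
        country_only = sun[sun.get("Country") == country]
        return country_only.sort_values(month).get("City").iloc[-1]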


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 52%.



Problem 16.2

In this part only, assume that all "City" names in sun are unique.

Consider the DataFrame cities defined below.

cities = sun.groupby("City").mean().reset_index()

Fill in the blanks so that the DataFrame that results from the sequence of steps described below is identical to cities.

“Sort sun by (c) in (d) order (e).”

What goes in blank (c)?


What goes in blank (d)?


What goes in blank (e)?

Answer: (c): "City", (d): ascending, (e): drop the "Country" column, and reset the index

Let’s start by understanding the code provided in the question:
The .groupby("City") method groups the data in the sun DataFrame by unique city names. Since every city name in the DataFrame is unique, this means that each group will consist of just one row corresponding to that city.

After grouping by city, the .mean() method computes the average of each column for each group. Again, as each city name is unique, this operation doesn’t aggregate multiple rows but merely reproduces the original values for each city. (For example, the value in the "Jan" column for the row with the index "Hamilton" will just be 229.8, which we see in the first row of the preview of sun.)

Finally, .reset_index() is used to reset the DataFrame’s index. When using .groupby, the column we group by (in this case, "City") becomes the index. By resetting the index, we’re making "City" a regular column again and setting the index to 0, 1, 2, 3, …


What goes in blank (c)? "City"
When we group on "City", the index of the DataFrame is set to "City" names, sorted in ascending alphabetical order (this is always the behavior of groupby). Since all city names are unique, the number of rows in sun.groupby("City").mean() is the same as the number of rows in sun, and so grouping on "City" effectively sorts the DataFrame by "City" and sets the index to "City". To replicate the order in cities, then, we must sort sun by the "City" column in ascending order.


Difficulty: ⭐️

The average score on this problem was 97%.



What goes in blank (d)? ascending
Addressed above.


Difficulty: ⭐️⭐️

The average score on this problem was 77%.



What goes in blank (e)? drop the "Country" column, and reset the index
In the provided code, after grouping by "City" and computing the mean, we reset the index. This means the "City" column is no longer the index but a regular column, and the DataFrame gets a fresh integer index. To replicate this structure, we need to reset the index in our sorted DataFrame. Additionally, when we applied the .mean() method after grouping, any non-numeric columns (like "Country") that we can’t take the mean of are automatically excluded from the resulting DataFrame. To match the structure of cities, then, we must drop the "Country" column from our sorted DataFrame.


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 46%.



Problem 16.3

True or False: In the code below, Z is guaranteed to evaluate to True.

x = sun.groupby(["Country", "Year"]).mean().shape[0]
y = sun.groupby("Country").mean().shape[0]
z = (x >= y)

Answer: True
Let’s look at each line of code separately:

  • x = sun.groupby(["Country", "Year"]).mean().shape[0]: This line groups the sun DataFrame by both "Country" and "Year", then computes the mean. As a result, each unique combination of "Country" and "Year" will have its own row. For instance, if there are three different values in the "Year" column for a particular country, that country will appear three times in the DataFrame sun.groupby(["Country", "Year"]).mean().

  • y = sun.groupby("Country").mean().shape[0]: When grouping by "Country" alone, each unique country in the sun DataFrame is represented by one row, independent of the information in other columns.

  • z = (x >= y): This comparison checks whether the number of rows produced by grouping by both "Country" and "Year" (which is x) is greater than or equal to the number of rows produced by grouping only by "Country" (which is y).

Given our grouping logic:

  • If every country in the sun DataFrame has only a single unique value in the "Year" column (e.g. if the "Year" value for all cities in the United States was always 3035.9, and if the "Year" value for all cities in Nigeria was always 1845.4, etc.), then the number of rows when grouping by both "Country" and "Year" will be equal to the number of rows when grouping by "Country" alone. In this scenario, x will be equal to y.

  • If at least one country in the sun DataFrame has at least two different values in the "Year" column (e.g. if there are at least two cities in the United States with different values in the "Year" column), then there will be more rows when grouping by both "Country" and "Year" compared to grouping by "Country" alone. This means x will be greater than y.

Considering the above scenarios, there’s no situation where the value of x can be less than the value of y. Therefore, z will always evaluate to True.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 70%.


In the next few parts, consider the following answer choices.

  A. The name of the country with the most cities.

  B. The name of the country with the fewest cities.

  C. The number of cities in the country with the most cities.

  D. The number of cities in the country with the fewest cities.

  E. The last city, alphabetically, in the first country, alphabetically.

  F. The first city, alphabetically, in the first country, alphabetically.

  G. Nothing, because it errors.


Problem 16.4

What does the following expression evaluate to?

sun.groupby("Country").max().get("City").iloc[0]

Answer: E. The last city, alphabetically, in the first country, alphabetically.

Let’s break down the code:

  • sun.groupby("Country").max(): This line of code groups the sun DataFrame by the "Country" column and then determines the maximum for every other column within each country group. Since the values in the "City" column are stored as strings, and the maximum of a Series of strings is the last string alphabetically, the values in the "City" column of this DataFrame will contain the last city, alphabetically, of each country.

  • .get("City"): .get("City") accesses the "City" column.

  • .iloc[0]: Finally, .iloc[0] selects the "City" value from the first row. The first row corresponds to the first country alphabetically because groupby sorted the DataFrame by "Country" in ascending order. The value in the "City" column that .iloc[0] selects, then, is the name of the last city, alphabetically, in the first country, alphabetically.


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 36%.



Problem 16.5

What does the following expression evaluate to?

sun.groupby("Country").sum().get("City").iloc[0]

Answer: G. Nothing, because it errors.

Let’s break down the code:

  • sun.groupby("Country").sum(): This groups the sun DataFrame by the "Country" column and computes the sum for each numeric column within each country group. Since "City" is non-numeric, it will be dropped.

  • .get("City"): This operation attempts to retrieve the "City" column from the resulting DataFrame. However, since the "City" column was dropped in the previous step, this will raise a KeyError, indicating that the column is not present in the DataFrame.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 73%.



Problem 16.6

What does the following expression evaluate to?

sun.groupby("Country").count().sort_values("Jan").index[-1]

Answer: A. The name of the country with the most cities.

Let’s break down the code:

  • sun.groupby("Country").count(): This groups the sun DataFrame by the "Country" column. The .count() method then returns the number of rows in each group for each column. Since we’re grouping by "Country", and since the rows in sun correspond to cities, this is counting the number of cities in each country.

  • .sort_values("Jan"): The result of the previous operation is a DataFrame with "Country" as the index and the number of cities per country stored in every other column. The "City", "Jan", "Feb", "Mar", etc. columns in the resulting DataFrame all contain the same information. Sorting by "Jan" sorts the DataFrame by the number of cities each country has in ascending order.

  • .index[-1]: This retrieves the last index value from the sorted DataFrame, which corresponds to the name of the country with the most cities.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 61%.



Problem 16.7

What does the following expression evaluate to?

sun.groupby("Country").count().sort_values("City").get("City").iloc[-1]

Answer: C. The number of cities in the country with the most cities.

Let’s break down the code:

  • sun.groupby("Country").count(): This groups the sun DataFrame by the "Country" column. The .count() method then returns the number of rows in each group for each column. Since we’re grouping by "Country", and since the rows in sun correspond to cities, this is counting the number of cities in each country.

  • .sort_values("City"): The result of the previous operation is a DataFrame with "Country" as the index and the number of "City"s per "Country" stored in every other column. The "City", "Jan", "Feb", "Mar", etc. columns in the resulting DataFrame all contain the same information. Sorting by "City" sorts the DataFrame by the number of cities each country has in ascending order.

  • .get("City"): This retrieves the "City" column from the sorted DataFrame, which contains the number of cities in each country.

  • .iloc[-1]: This gets the last value from the "City" column, which corresponds to the number of cities in the country with the most cities.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 57%.



Source: sp23-final — Q2

Problem 17

Vanessa is a big Formula 1 racing fan, and wants to plan a trip to Monaco, where the Monaco Grand Prix is held. Monaco is an example of a “city-state” — that is, a city that is its own country. Singapore is another example of a city-state.

We’ll say that a row of sun corresponds to a city-state if its "Country" and "City" values are equal.


Problem 17.1

Fill in the blanks so that the expression below is equal to the total number of sunshine hours in October of all city-states in sun.

    sun[__(a)__].__(b)__

What goes in blanks (a) and (b)?

Answer: (a): sun.get("Country") == sun.get("City"), (b): .get("Oct").sum()

What goes in blank (a)? sun.get("Country") == sun.get("City")
This expression compares the "Country" column to the "City" column for each row in the sun DataFrame. It returns a Boolean Series where each value is True if the corresponding "Country" and "City" are the same (indicating a city-state) and False otherwise.


Difficulty: ⭐️⭐️

The average score on this problem was 79%.



What goes in blank (b)? .get("Oct").sum()
Here, we select the "Oct" column, which represents the sunshine hours in October, and compute the sum of its values. By using this after querying for city-states, we calculate the total sunshine hours in October across all city-states in the sun DataFrame.


Difficulty: ⭐️⭐️

The average score on this problem was 85%.


Problem 17.2

Fill in the blanks below so that the expression below is also equal to the total number of sunshine hours in October of all city-states in sun.

Note: What goes in blank (b) is the same as what goes in blank (b) above.

sun.get(["Country"]).merge(__(c)__).__(b)__

What goes in blank (c)?

Answer: sun, left_on="Country", right_on="City"

Let’s break down the code:

  • sun.get(["Country"]): This extracts just the "Country" column from the sun DataFrame, as a DataFrame. (It’s extracted as a DataFrame since we passed a list to .get instead of a single string.)

  • .merge(sun, left_on="Country", right_on="City"): Here, we’re using the .merge method to merge a version of sun with just the "Country" column (which is our left DataFrame) with the entire sun DataFrame (which is our right DataFrame). The merge is done by matching "Country"s from the left DataFrame with "City"s from the right DataFrame. This way, rows in the resulting DataFrame correspond to city-states, as it only contains entries where a country’s name is the same as a city’s name.

  • .get("Oct").sum(): After merging, we use .get("Oct") to retrieve the "Oct" column, which represents the sunshine hours in October. Finally, .sum() computes the total number of sunshine hours in October for all the identified city-states.
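
Putting blank (c) together with blank (b) from the previous part, the full expression reads:

    (sun.get(["Country"])
        .merge(sun, left_on="Country", right_on="City")
        .get("Oct").sum())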


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 50%.



Source: sp23-midterm — Q5

Problem 18

Teresa and Sophia are bored while waiting in line at Bistro and decide to start flipping a UCSD-themed coin, with a picture of King Triton’s face as the heads side and a picture of his mermaid-like tail as the tails side.


Teresa flips the coin 21 times and sees 13 heads and 8 tails. She stores this information in a DataFrame named teresa that has 21 rows and 2 columns, such that:

Then, Sophia flips the coin 11 times and sees 4 heads and 7 tails. She stores this information in a DataFrame named sophia that has 11 rows and 2 columns, such that:


Problem 18.1

How many rows are in the following DataFrame? Give your answer as an integer.

    teresa.merge(sophia, on="flips")

Hint: The answer is less than 200.

Answer: 108

Since we used the argument on="flips", rows from teresa and sophia will be combined whenever they have matching values in their "flips" columns.

For the teresa DataFrame:

  • There are 13 rows with "Heads" in the "flips" column.
  • There are 8 rows with "Tails" in the "flips" column.

For the sophia DataFrame:

  • There are 4 rows with "Heads" in the "flips" column.
  • There are 7 rows with "Tails" in the "flips" column.

The merged DataFrame will also only have the values "Heads" and "Tails" in its "flips" column.

  • The 13 "Heads" rows from teresa will each pair with the 4 "Heads" rows from sophia. This results in 13 \cdot 4 = 52 rows with "Heads".

  • The 8 "Tails" rows from teresa will each pair with the 7 "Tails" rows from sophia. This results in 8 \cdot 7 = 56 rows with "Tails".

Then, the total number of rows in the merged DataFrame is 52 + 56 = 108.
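
Here is a quick sketch confirming this count with stand-in data; only the heads/tails counts matter, so the DataFrames below have just a "flips" column.

    import babypandas as bpd

    teresa = bpd.DataFrame().assign(flips=['Heads'] * 13 + ['Tails'] * 8)
    sophia = bpd.DataFrame().assign(flips=['Heads'] * 4 + ['Tails'] * 7)

    # 13 * 4 matches for "Heads" plus 8 * 7 matches for "Tails".
    print(teresa.merge(sophia, on='flips').shape[0])  # 108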


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 54%.



Problem 18.2

Let A be your answer to the previous part. Now, suppose that one additional row is added to each of teresa and sophia, with the value "Total" in the "flips" column.

Suppose we again merge teresa and sophia on the "flips" column. In terms of A, how many rows are in the new merged DataFrame?

Answer: A+1

The additional row in each DataFrame has a unique "flips" value of "Total". When we merge on the "flips" column, this unique value will only create a single new row in the merged DataFrame, as it pairs the "Total" from teresa with the "Total" from sophia. The rest of the rows are the same as in the previous merge, and as such, they will contribute the same number of rows, A, to the merged DataFrame. Thus, the total number of rows in the new merged DataFrame will be A (from the original matching rows) plus 1 (from the new "Total" rows), which sums up to A+1.


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 46%.



Source: sp24-final — Q2

Problem 19

Michelle and Abel are each touring apartments for where they might live next year. Michelle wants to be close to UCSD so she can attend classes easily. Abel is graduating and wants to live close to the beach so he can surf. Each person makes their own DataFrame (called michelle and abel respectively), to keep track of all the apartments that they toured. Both michelle and abel came from querying apts, so both DataFrames have the same columns and structure as apts.

Here are some details about the apartments they toured.

We’ll assume for this problem only that there is just one apartment of each size available at each complex, so that if they both tour a one bedroom apartment at the same complex, it is the exact same apartment with the same "Apartment ID".


Problem 19.1

What does the following expression evaluate to?

michelle.merge(abel, left_index=True, right_index=True).shape[0]

Answer: 8

This expression uses the indices of michelle and abel to merge. Since both use the index of "Apartment ID" and we are assuming that there is only one apartment of each size available at each complex, we only need to see how many unique apartments michelle and abel share. Since there are 8 complexes that they both visited, only the one bedroom apartments in these complexes will be displayed in the resulting merged DataFrame. Therefore, we will only have 8 apartments, or 8 rows.


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 48%.


Problem 19.2

What does the following expression evaluate to?

michelle.merge(abel, on="Bed").shape[0]

Answer: 240

This expression merges on the "Bed" column, so we need to look at the data in this column for the two DataFrames. Within this column, michelle and abel share only one specific type of value: "One". From the details given, michelle has 12 rows containing this value, while abel has 20 rows containing it. Since we are merging on this column, each "One" row in michelle is matched with every "One" row in abel, so each row in michelle turns into twenty rows after the merge.

Thus, to compute the total number of rows resulting from this merge, we multiply the number of matching rows in michelle by the number of matching rows in abel: 12 \cdot 20 = 240.
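
A small sketch confirming this count; the three "Studio" rows are hypothetical stand-ins for michelle's apartments that have no match in abel:

import pandas as pd

michelle = pd.DataFrame({"Bed": ["One"] * 12 + ["Studio"] * 3})
abel = pd.DataFrame({"Bed": ["One"] * 20 + ["Two"] * 20})

# Only the "One" rows match: 12 * 20 = 240.
print(michelle.merge(abel, on="Bed").shape[0])  # 240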


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 33%.


Problem 19.3

What does the following expression evaluate to?


michelle.merge(abel, on="Complex").shape[0]

Answer: 32

To approach this question, we first need to determine how many complexes Michelle and Abel have in common: 8. We also know that each complex was toured twice by both Michelle and Abel, so there are two copies of each complex in the michelle and abel DataFrames. Therefore, when we merge the DataFrames, the two copies of each complex will match with each other, effectively creating four copies for each complex from the original two. Since this is done for each complex, we have 8 \cdot (2 \cdot 2) = 32.


Difficulty: ⭐️⭐️⭐️⭐️⭐️

The average score on this problem was 19%.


Problem 19.4

What does the following expression evaluate to?

abel.merge(abel, on="Bed").shape[0]

Answer: 800

Since this question deals purely with the abel DataFrame, we need to fully understand what is inside it. There are 40 apartments (or rows): 20 one bedrooms and 20 two bedrooms. When we self-merge on the "Bed" column, it is imperative to know that every one bedroom apartment will be matched with the 20 other one bedroom apartments (including itself)! This also goes for the two bedroom apartments. Therefore, we have 20 \cdot 20 + 20 \cdot 20 = 800.
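
A small sketch of the self-merge; only the "Bed" column matters for the row count:

import pandas as pd

# 20 one-bedroom and 20 two-bedroom rows, as described above.
abel = pd.DataFrame({"Bed": ["One"] * 20 + ["Two"] * 20})

# Each row pairs with all 20 rows of its own kind: 20*20 + 20*20 = 800.
print(abel.merge(abel, on="Bed").shape[0])  # 800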


Difficulty: ⭐️⭐️⭐️⭐️⭐️

The average score on this problem was 28%.



Source: sp24-final — Q3

Problem 20

We wish to compare the average rent for studio apartments in different complexes.


Problem 20.1

Our goal is to create a DataFrame studio_avg where each complex with studio apartments appears once. The DataFrame should include a column named "Rent" that contains the average rent for all studio apartments in that complex. For each of the following strategies, determine if the code provided works as intended, gives an incorrect answer, or errors.

Strategy (i):

studio = apts[apts.get("Bed") == "Studio"]
studio_avg = studio.groupby("Complex").mean().reset_index()

Strategy (ii):

studio_avg = apts.groupby("Complex").min().reset_index()

Strategy (iii):

grouped = apts.groupby(["Bed", "Complex"]).mean().reset_index()
studio_avg = grouped[grouped.get("Bed") == "Studio"]

Strategy (iv):

grouped = apts.groupby("Complex").mean().reset_index()
studio_avg = grouped[grouped.get("Bed") == "Studio"]

Answer:

  • (i) Works as intended
  • (ii) Gives an incorrect answer
  • (iii) Works as intended
  • (iv) Errors
  • (i): studio is set to a DataFrame that is queried from the apts DataFrame so that it contains only rows that have the "Studio" value in "Bed". Then, with studio, we group by "Complex" and aggregate by the mean. Finally, we reset the index. Since we have a DataFrame that only contains "Studio" rows, grouping by "Complex" takes the mean of every numerical column (including the rent) per "Complex", effectively reaching our goal.

Difficulty: ⭐️

The average score on this problem was 96%.



  • (ii): studio_avg is created by grouping by "Complex" and aggregating by the minimum. However, since the question asks for the average rent, taking the minimum rent of every complex does not produce what the question asks for.

Difficulty: ⭐️

The average score on this problem was 95%.



  • (iii): grouped is made by first grouping by both the "Bed" and "Complex" columns, then taking the mean and resetting the index. Since we are grouping by both of these columns, we separate each type of "Bed" by the "Complex" it belongs to while aggregating every numerical column by the mean. After resetting the index, we are left with a DataFrame that contains the mean of every "Bed" and "Complex" combination. A sample of the DataFrame might look like this:

Bed    Complex                Rent
One    Costa Verde Village    3200
One    Westwood               3000

(Note: This is not an accurate representation of the DataFrame's true values at all!)

Then, when we assign studio_avg, we take this DataFrame and keep only the rows in which grouped’s "Bed" column contains "Studio". As we already grouped and aggregated by the mean for each "Bed" and "Complex" pair, we arrive at the solution the question requests.


Difficulty: ⭐️⭐️

The average score on this problem was 84%.



  • (iv): For this grouped, we only group by the "Complex" column, aggregate by the mean, and reset the index. Then, we attempt to assign studio_avg to the result of a query on our grouped DataFrame. However, this doesn’t work, because when we grouped by "Complex" and aggregated by the mean to create grouped, the .groupby() removed our "Bed" column, since it isn’t numerical. Therefore, when we attempt to query by "Bed", babypandas cannot locate such a column, resulting in an error.

Difficulty: ⭐️⭐️⭐️

The average score on this problem was 60%.


Problem 20.2

Consider the DataFrame alternate_approach, defined as follows:

grouped = apts.groupby(["Bed", "Complex"]).mean().reset_index()
alternate_approach = grouped.groupby("Complex").min()

Suppose that the "Rent" column of alternate_approach has all the same values as the "Rent" column of studio_avg, where studio_avg is the DataFrame described in part (a). Which of the following are valid conclusions about apts? Select all that apply.

Answer: Options 2 and 5

alternate_approach first groups by "Bed" and "Complex", takes the mean of all the columns, and resets the index so that "Bed" and "Complex" are no longer in the index. Now there is one row per "Bed" and "Complex" combination that exists in apts, and all columns contain the mean value for each of these combinations. Then it groups by "Complex" again, taking the minimum value of all columns. The output is a DataFrame indexed by "Complex" where the "Rent" column contains the minimum rent (from all the average prices for each type of "Bed").

  • Option 1 is incorrect. This is not necessarily true. The "Rent" column in alternate_approach contains the minimum of all the average prices for each type of "Bed", and the "Rent" column in studio_avg contains the average rent for studios in each complex. Even though they contain the same values, this does not mean that no studios exist in any complexes; if that were the case, studio_avg would be an empty DataFrame while alternate_approach would not be.
  • Option 2 is correct. If these columns are the same, then each complex must have at least one studio apartment. If this were not the case and there were complexes with no studio apartments, complexes that appear in alternate_approach would not appear in studio_avg.
  • Option 3 is incorrect. This is not necessarily true. Complexes can have more than one studio. studio_avg has the average of all these studios for each complex, and alternate_approach has the minimum rent (from all the average prices for each type of bedroom). Just because the columns are the same does not mean that there is only one studio per complex.
  • Option 4 is incorrect. This is not necessarily true. Just because the columns are the same does not mean that complexes only have studios. Perhaps studios simply have the lowest average rent across all complexes, as explored in Option 5 below.
  • Option 5 is correct. studio_avg contains the average price for a studio in each complex. alternate_approach contains the minimum of the average rents of all types of bedrooms for each complex. Since these columns are the same, the average price of a studio must be lower than (or equal to) the average price of a one bedroom (or any other type of bedroom) in every complex, for all the rent values in alternate_approach to align with all the values in studio_avg.
  • Option 6 is incorrect. As shown above, there are correct answers to this question.

Difficulty: ⭐️⭐️⭐️

The average score on this problem was 73%.


Problem 20.3

Which data visualization should we use to compare the average prices of studio apartments across complexes?

Answer: Bar chart

The complex is a categorical variable, so we should use a bar chart to compare average prices across complexes.

  • Scatter plots show the relationship between two numerical variables.
  • Line charts are typically used to depict changes over time.
  • Histograms are used to depict the distribution of a numerical variable.

Difficulty: ⭐️⭐️

The average score on this problem was 85%.



Source: sp24-midterm — Q6

Problem 21

Suppose Charlie and Norah each have separate DataFrames for their contacts, called charlie and norah, respectively. These DataFrames have the same column names and format as your DataFrame, contacts.

As illustrated in the diagram below, Charlie has 172 contacts in total, whereas Norah has 88 contacts. 12 of these contacts are shared, meaning they appear in both charlie and norah.


Problem 21.1

What does the following expression evaluate to?

charlie.merge(norah, left_index=True, right_index=True).shape[0]    

Answer: 12

The code merges the DataFrames charlie and norah on their indexes, so the resulting DataFrame will contain one row for every match between their indexes (the 'Person' index, since both DataFrames follow the same format as contacts). From the Venn diagram, we know that Charlie and Norah have 12 contacts in common, so the resulting DataFrame will contain 12 rows: one row for each shared contact.

Thus, charlie.merge(norah, left_index=True, right_index=True).shape[0] evaluates to the number of rows in the resulting DataFrame, which is 12.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 66%.


Problem 21.2

One day, when updating her phone’s operating system, Norah accidentally duplicates the 12 contacts she has in common with Charlie. Now, the norah DataFrame has 100 rows.

What does the following expression evaluate to?

norah.merge(norah, left_index=True, right_index=True).shape[0]   

Answer: 24 \cdot 2 + 76 = 124

Since Norah duplicated the 12 contacts she shares with Charlie, the norah DataFrame now has 76 unique rows + 12 original rows + 12 duplicated rows. Note that the code above merges norah with itself on the index.

After merging, the resulting DataFrame will contain the 76 unique rows, as there is only one match for each unique row. As for the duplicated contacts, each of those 24 rows matches twice. Thus the resulting DataFrame has 76 + 2 \cdot 24 = 124 rows.

For better understanding, imagine we have a smaller DataFrame nor with only one contact, Jim. After duplication, it will have two identical rows of Jim. For easier explanation, let’s denote the original row Jim1 and the duplicated row Jim2. When merging nor with itself, Jim1 can be matched with Jim1 and Jim2, and Jim2 can be matched with Jim1 and Jim2, resulting in 2 \cdot 2 = 4 rows.
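
A minimal sketch reproducing this count, with made-up contact names as the index:

import pandas as pd

# Hypothetical reconstruction: 76 unique contacts plus 12 contacts that
# each appear twice, for 100 rows total, indexed by contact name.
names = ["unique" + str(i) for i in range(76)] + ["shared" + str(i) for i in range(12)] * 2
norah = pd.DataFrame({"Phone": range(100)}, index=names)

# Unique names match once; each duplicated name matches 2 * 2 = 4 times.
# 76 + 12 * 4 = 124.
print(norah.merge(norah, left_index=True, right_index=True).shape[0])  # 124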


Difficulty: ⭐️⭐️⭐️⭐️⭐️

The average score on this problem was 3%.



Source: sp24-midterm — Q8

Problem 22

You wonder if any of your friends have the same birthday, for example two people both born on April 3rd. Fill in the blanks below so that the given expression evaluates to the largest number of people in contacts who share the same birthday.

Note: People do not need to be born in the same year to share a birthday!

contacts.groupby(___(a)___).___(b)___.get("Phone").___(c)___

  • (a): ["Month", "Day"]
  • (b): count()
  • (c): max()

.groupby(["Month", "Day"]).count() groups the DataFrame contacts by each unique combination of "Month" and "Day" (birthday) and then counts the number of rows in each group (i.e., the number of people born on that date).

.get('Phone') gets a Series which contains the counts of people born on each date, and .max() finds the largest number of people sharing the same birthday.
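
A minimal sketch of the full expression, using a made-up contacts table in which two people share April 3rd as a birthday:

import pandas as pd

# Hypothetical contacts excerpt.
contacts = pd.DataFrame({
    "Month": ["April", "April", "June"],
    "Day": [3, 3, 14],
    "Phone": ["555-0100", "555-0101", "555-0102"],
})

# Count rows per (Month, Day) group, then take the largest count.
print(contacts.groupby(["Month", "Day"]).count().get("Phone").max())  # 2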


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 72%.


Source: sp25-midterm — Q5

Problem 23

Suppose we have another DataFrame called trade_partners that has a row for every country that the United States trades with. trade_partners is indexed by "Country" and has two columns:

All countries in tariffs are included in trade_partners (including "European Union"), but not all countries in trade_partners are included in tariffs. The first three rows of trade_partners are shown below.


Problem 23.1

Write one line of code to merge tariffs with trade_partners and store the result in merged.

Answer: merged = tariffs.merge(trade_partners, left_on="Country", right_index=True)

tariffs and trade_partners are both DataFrames that describe the US’s trade relationships with other countries. Since both DataFrames contain one row for each country, we need to merge them using the column that corresponds to the country name: in tariffs, that is the "Country" column, and in trade_partners, that is the index.


Difficulty: ⭐️⭐️

The average score on this problem was 80%.


Problem 23.2

How many rows does merged have?

Answer: 50

Since each DataFrame has exactly one row per country, the merged result will also have one row for every country they share. And because every country in tariffs appears in trade_partners (though not vice versa), the merged DataFrame will contain exactly as many rows as there are countries in tariffs (which is 50).


Difficulty: ⭐️⭐️

The average score on this problem was 83%.


Problem 23.3

In which of the following DataFrames does the "Proportion" column sum to 1? Select all that apply.

Answer: trade_partners and trade_partners.groupby("Continent").sum()

Solving this problem is best done by working through each answer choice and eliminating the incorrect ones. In the problem statement, we are told that the "Proportion" column contains floats representing the proportion of US imports coming from each country.

  • trade_partners: Since the "Proportion" column contains proportions of a whole, the sum of that column should equal 1, so this choice is correct.
  • trade_partners.groupby("Continent").mean(): Grouping by continent and taking the mean proportion within each continent leaves the "Proportion" column containing group averages. Since we are no longer working with all of the individual proportions, we cannot guarantee that the column sums to 1.
  • trade_partners.groupby("Continent").sum(): Because this choice takes the sum of the proportions within each continent, all of the proportions are still accounted for, so the "Proportion" column of the resulting DataFrame still sums to 1.
  • merged: As we determined in the previous part, merged contains all of the rows in tariffs, but not all of the rows in trade_partners. The "Proportion" column of trade_partners sums to 1, and since some of those rows are omitted from merged, the "Proportion" column in merged cannot sum to 1.


Difficulty: ⭐️⭐️

The average score on this problem was 88%.


Problem 23.4

Write one line of code that would produce an appropriate data visualization showing the median reciprocal tariff for each continent.

Answer: merged.groupby("Continent").median().plot(kind="barh", y="Reciprocal Tariff");

This question calls for a visualization showing the median reciprocal tariff for each continent. The first step is identifying which DataFrame to use when plotting. The problem asks for a link between "Reciprocal Tariff", a column in the tariffs DataFrame, and "Continent", a column in the trade_partners DataFrame, so the merged DataFrame must be used to create the plot.

Within merged, the question calls for the median reciprocal tariff for each continent. Currently, merged has one row per country rather than per continent, so before plotting, merged must be grouped by "Continent" and aggregated with .median() to get the median "Reciprocal Tariff" for each continent.

From there, all that is left is plotting the data. Since there is one categorical variable ("Continent") and one numerical variable ("Reciprocal Tariff"), a bar chart is appropriate. Finally, because the DataFrame is already indexed by continent after the groupby, all that needs to be specified in the call to .plot is the kind of plot and the y variable, "Reciprocal Tariff".


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 68%.



Source: su24-final — Q1

Problem 24


Problem 24.1

Fill in the blanks so that the expression below evaluates to the proportion of stages won by the country with the most stage wins.

    stages.groupby(__(i)__).__(ii)__.get("Type").__(iii)__ / stages.shape[0]

Answer:

  • (i): "Winner Country"
    To calculate the number of stages won by each country, we need to group the data by "Winner Country". This will allow us to compute the counts for each group.

  • (ii): count()
    Once the data is grouped, we use the .count() method to calculate the number of stages won by each country.

  • (iii): max()
    Finds the maximum number of stages won by a single country. Finally, we divide the maximum stage wins by the total number of stages (stages.shape[0]) to calculate the proportion of stages won by the top country.


Difficulty: ⭐️

The average score on this problem was 90%.


Problem 24.2

The distance of a stage alone does not encapsulate its difficulty, as riders feel more tired as the tour goes on. Because of this, we want to consider “real distance” a measurement of the length of a stage that takes into account how far into the tour the riders are. The “real distance” is calculated with the following process:

  1. Add one to the stage number.

  2. Take the square root of the result of (i).

  3. Multiply the result of (ii) by the raw distance of the stage.

Complete the implementation of the function real_distance, which takes in stages (a DataFrame), stage (a string, the name of the column containing stage numbers), and distance (a string, the name of the column containing stage distances). real_distance returns a Series containing all of the “real distances” of the stages, as calculated above.

    def real_distance(stages, stage, distance):
         ________

Answer: return stages.get(distance) * np.sqrt(stages.get(stage) + 1)

  • (i): First, we need to add one to the stage number. The stage parameter specifies the name of the column containing the stage numbers. stages.get(stage) retrieves this column as a Series, and we can directly add 1 to each element of the Series with stages.get(stage) + 1.

  • (ii): Then, to take the square root of the result of (i), we can use np.sqrt(stages.get(stage) + 1).

  • (iii): Finally, we want to multiply the result of (ii) by the raw distance of each stage. The distance parameter specifies the name of the column containing the raw distances. stages.get(distance) retrieves this column as a Series, and we can directly multiply it by np.sqrt(stages.get(stage) + 1). A runnable version of the function is sketched below.
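
Putting the pieces together, here is a runnable version of the function, checked against a made-up one-row stages table:

import numpy as np
import pandas as pd

def real_distance(stages, stage, distance):
    # (i) add one to the stage number, (ii) take the square root,
    # (iii) multiply by the raw distance of the stage.
    return stages.get(distance) * np.sqrt(stages.get(stage) + 1)

# Hypothetical check: a stage numbered 3 with raw distance 100
# has real distance 100 * sqrt(3 + 1) = 200.
stages = pd.DataFrame({"Stage": [3], "Distance": [100.0]})
print(real_distance(stages, "Stage", "Distance").iloc[0])  # 200.0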


Difficulty: ⭐️⭐️

The average score on this problem was 89%.


Problem 24.3

Sometimes, stages are repeated in different editions of the Tour de France, meaning that there are some pairs of "Origin" and "Destination" that appear more than once in stages. Fill in the blanks so that the expression below evaluates how often the most common "Origin" and "Destination" pair in the stages DataFrame appears.

stages.groupby(__(i)__).__(ii)__.sort_values(by = "Date").get("Type").iloc[__(iii)__]

Answer:

  • (i): ["Origin", "Destination"]
    To analyze the frequency of stages with the same origin and destination, we need to group the data by the columns ["Origin", "Destination"]. This groups the stages into unique pairs of origin and destination.

  • (ii): count()
    After grouping, we apply the .count() method to calculate how many times each pair of ["Origin", "Destination"] appears in the dataset. The result is the frequency of each pair.

  • (iii): -1
    After obtaining the frequencies, we sort the resulting groups by their counts in ascending order (this is the default behavior of .sort_values()). The most common pair will then be the last entry in the sorted result. Using .get("Type") extracts the Series of counts, and .iloc[-1] retrieves the count of the most common pair, which is at the last position of the sorted Series. (A runnable sketch follows this list.)
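
A minimal sketch of the full expression, using a made-up stages table in which one "Origin" and "Destination" pair appears twice:

import pandas as pd

# Hypothetical stages excerpt: the Nice-to-Paris stage appears twice.
stages = pd.DataFrame({
    "Origin": ["Nice", "Nice", "Gap"],
    "Destination": ["Paris", "Paris", "Briançon"],
    "Date": ["1903-07-01", "1920-07-05", "1930-07-10"],
    "Type": ["Flat", "Flat", "Mountain"],
})

# After the groupby, every column holds the pair counts, so sorting by
# "Date" sorts by count; the last entry is the most common pair's count.
result = (
    stages.groupby(["Origin", "Destination"]).count()
    .sort_values(by="Date").get("Type").iloc[-1]
)
print(result)  # 2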


Difficulty: ⭐️⭐️

The average score on this problem was 84%.


Problem 24.4

Fill in the blanks so that the value of mystery_three is the "Destination" of the longest stage before Stage 12.

    mystery = stages[stages.get(__(i)__) < 12]
    mystery_two = mystery.sort_values(by = "Distance", ascending = __(ii)__)
    mystery_three = mystery_two.get(__(iii)__).iloc[-1]

Answer:

  • (i): "Stage"
    To filter the DataFrame to include only rows corresponding to stages before Stage 12, we use the "Stage" column. The condition stages.get("Stage") < 12 creates a boolean mask that selects only the rows where the stage number is less than 12.

  • (ii): True
    To find the longest stage, the rows need to be sorted by the "Distance" column. Setting ascending=True ensures that shorter stages come first and the longest stage appears last in the sorted DataFrame.

  • (iii): "Destination"
    After sorting, we want to retrieve the "Destination" of the longest stage. Using .get("Destination") retrieves the "Destination" column, and .iloc[-1] accesses the last row in the sorted DataFrame, corresponding to the longest stage before Stage 12.


Difficulty: ⭐️

The average score on this problem was 92%.



Source: su24-midterm — Q5

Problem 25


Problem 25.1

Fill in the blanks so that the sentence below correctly describes the meaning of mystery.

“The __(i)__ __(ii)__ of the __(iii)__ __(iv)__ .”

mystery = kart.groupby("Region").mean().get("Ranking").min()

What goes in (i)?

What goes in (ii)?

What goes in (iii)?

What goes in (iv)?

Answer:

  • (i) average
  • (ii) ranking
  • (iii) best
  • (iv) region

mystery groups by "Region", calculates the mean of each column, and retrieves the "Ranking" column. This gives the average ranking for each region. We then retrieve the minimum value of this column. Note that in this context, a lower average ranking indicates a better team. Hence, we get “the average ranking of the best region”.


Difficulty for blank (i): ⭐️⭐️⭐️

The average score on this problem was 67%.


Difficulty for blank (ii): ⭐️

The average score on this problem was 95%.


Difficulty for blank (iii): ⭐️⭐️⭐️⭐️⭐️

The average score on this problem was 14%.


Difficulty for blank (iv): ⭐️

The average score on this problem was 95%.


Problem 25.2

Fill in the blank in the code snippet below so that my_bool is True if there is at least one university with two teams in the same division, and False if there are no universities with two teams in the same division. Your answer must include the use of .groupby() in order to receive credit.

my_df = ______
my_bool = my_df.get("Team").max() > 1

Answer: kart.groupby(["University", "Division"]).count()

In the second line of code, we get the column "Team" and check whether its maximum value is greater than 1. In other words, it checks whether any university has multiple teams in a division. This means that our input DataFrame needs to have a numeric column "Team" that denotes the number of teams each university has in each division, which means we need to use .groupby(). Since we want this information for each university within each division, we need to group by both "University" and "Division". Then, because we want the number of teams within each division within each university, we apply the .count() aggregate. Now all the columns in this DataFrame contain the number of teams per division per university, since .count() reports the number of observations in each category.

From here, my_bool takes this transformed DataFrame, gets the "Team" column (which contains the number of teams per division per university), and checks whether any value is greater than 1, by taking the maximum value of the column and comparing it to 1.
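
A minimal sketch with made-up data, in which one university fields two teams in the same division:

import pandas as pd

# Hypothetical kart excerpt.
kart = pd.DataFrame({
    "University": ["UCSD", "UCSD", "UCLA"],
    "Division": [1, 1, 2],
    "Team": ["Tritons A", "Tritons B", "Bruins"],
})

# Each count is the number of teams per (University, Division) pair.
my_df = kart.groupby(["University", "Division"]).count()
my_bool = my_df.get("Team").max() > 1
print(my_bool)  # True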


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 57%.



Source: su24-midterm — Q7

Problem 26

The DataFrame div_one, shown below to the left, contains the six Division 1 teams from kart, with just the "Team" and "Region" columns selected.

The DataFrame coach, shown below to the right, contains some additional information about the coaches of the teams in div_one, with a piece of info missing.


Problem 26.1

Fill in the blank in the Region column with either South, West, East, or Midwest so that the DataFrame div_one.merge(coach, on="Region") has:

Answer:

  • 8 rows: South
  • 9 rows: East
  • 10 rows: West
  • 11 rows: Midwest

To determine the number of rows after merging the two DataFrames (div_one and coach) on the column "Region", we need to calculate how many matches occur for each region. When performing the merge, each row from div_one is combined with every matching row in coach. The total number of rows in the resulting DataFrame depends on how many matching rows each region value in div_one has in coach. (Rows with no match, such as div_one’s West rows when coach contains no West row, simply disappear.)

  1. South (8 rows):
    • div_one has 3 rows with Midwest and 1 row with East.
    • coach has 2 rows with Midwest and 2 rows with East.
    • Each Midwest row in div_one matches both Midwest rows in coach, resulting in 3 \cdot 2 = 6 matches.
    • The single East row in div_one matches the 2 East rows in coach, contributing 1 \cdot 2 = 2 matches.
    • Total for South: 3 \cdot 2 (Midwest) + 1 \cdot 2 (East) = 8 rows.
  2. East (9 rows):
    • div_one has 3 rows with Midwest and 1 row with East.
    • coach has 2 rows with Midwest and 3 rows with East.
    • Each Midwest row in div_one matches both Midwest rows in coach, giving 3 \cdot 2 = 6 matches.
    • The single East row in div_one matches the 3 East rows in coach, contributing 1 \cdot 3 = 3 matches.
    • Total for East: 3 \cdot 2 (Midwest) + 1 \cdot 3 (East) = 9 rows.
  3. West (10 rows):
    • div_one has 3 rows with Midwest, 1 row with East, and 2 rows with West.
    • coach has 2 rows with Midwest, 2 rows with East, and 1 row with West.
    • Each Midwest row in div_one matches both Midwest rows in coach, resulting in 3 \cdot 2 = 6 matches.
    • The single East row in div_one matches the 2 East rows in coach, contributing 1 \cdot 2 = 2 matches.
    • The 2 West rows in div_one each match the single West row in coach, contributing 2 \cdot 1 = 2 matches.
    • Total for West: 3 \cdot 2 (Midwest) + 1 \cdot 2 (East) + 2 \cdot 1 (West) = 10 rows.
  4. Midwest (11 rows):
    • div_one has 3 rows with Midwest and 1 row with East.
    • coach has 3 rows with Midwest and 2 rows with East.
    • Each Midwest row in div_one matches all three Midwest rows in coach, resulting in 3 \cdot 3 = 9 matches.
    • The single East row in div_one matches the 2 East rows in coach, contributing 1 \cdot 2 = 2 matches.
    • Total for Midwest: 3 \cdot 3 (Midwest) + 1 \cdot 2 (East) = 11 rows.

Difficulty: ⭐️⭐️⭐️

The average score on this problem was 61%.


Problem 26.2

What is the value of the following Python expression?

coach.merge(coach, on="Coach").merge(coach, on="Coach").shape[0]

Answer: 12

First merge: The expression starts by merging the coach DataFrame with itself on the "Coach" column. Each row in coach will match with every row (including itself) that has the same coach.

  • Matching process:
    • Coach Jason (rows 0 and 2): These match with each other, creating 2 \cdot 2 = 4 matches.
    • Coach Jack (row 1): Since this row only matches with itself, it creates 1 match.
    • Coach Ashley (row 3): Also creates 1 match.
    • Coach Nick (row 4): Again, 1 match.
    • Coach Zoe (row 5): Another 1 match.

Total rows from this merge: 4 + 1 + 1 + 1 + 1 = 8 rows.

Second merge: Now we take the result of the first merge (which has 8 rows) and merge it again with the original coach DataFrame on the "Coach" column.

  • Matching process:
    • Coach Jason, 4 rows (from the first merge): Each of these rows matches with the 2 rows for Coach Jason in the original coach DataFrame. This results in 4 \cdot 2 = 8 matches.
    • Coach Jack, Coach Ashley, Coach Nick, and Coach Zoe (one row each): Each matches with its single corresponding row in the original coach DataFrame, adding 4 more rows in total.

Total rows from this merge: 8 + 1 + 1 + 1 + 1 = 12 rows.

The expression returns the number of rows in the final merged DataFrame, which is 12.
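
A minimal sketch reproducing this computation, with the "Coach" column reconstructed from the row descriptions above (the other columns of coach don't affect the row count):

import pandas as pd

# Coach Jason appears in rows 0 and 2; everyone else appears once.
coach = pd.DataFrame({"Coach": ["Jason", "Jack", "Jason", "Ashley", "Nick", "Zoe"]})

# First self-merge: 2*2 + 1 + 1 + 1 + 1 = 8 rows.
# Second merge: 4*2 + 1 + 1 + 1 + 1 = 12 rows.
print(coach.merge(coach, on="Coach").merge(coach, on="Coach").shape[0])  # 12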


Difficulty: ⭐️⭐️⭐️⭐️⭐️

The average score on this problem was 19%.



Problem 27

You are given a table called books that contains columns 'author' (str), 'title' (str), 'num_chapters' (int), and 'publication_year' (int).

Problem 27.1

What will be the output of the following code?

books.groupby("publication_year").mean().shape[1]

Answer: 1

The output is 1. Notice that the final call is .shape[1], which tells us how many columns are in the resulting DataFrame. When we group by publication year, only one column is aggregated by the groupby call: the 'num_chapters' column. The other columns are strings, and therefore are not aggregated (since you can’t take the mean of a string). Consequently, .shape[1] reports exactly one column, containing the mean of the 'num_chapters' column.
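
A minimal sketch with made-up values. (babypandas drops the string columns automatically when taking the mean, as explained above; in plain pandas, numeric_only=True requests the same behavior.)

import pandas as pd

# Hypothetical books table matching the described schema.
books = pd.DataFrame({
    "author": ["Dean Koontz", "Charles Dickens"],
    "title": ["Watchers", "Bleak House"],
    "num_chapters": [38, 67],
    "publication_year": [1987, 1853],
})

# Only 'num_chapters' survives the aggregation, so there is one column.
print(books.groupby("publication_year").mean(numeric_only=True).shape[1])  # 1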


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 67%.



Problem 28

You are given a table called books that contains columns 'author' (str), 'title' (str), 'num_chapters' (int), and 'publication_year' (int).

Problem 28.1

What will be the output of the following code?

books.groupby("publication_year").mean().shape[1]

Answer: 1

The output is 1. Notice that the final call is .shape[1], which tells us how many columns are in the resulting DataFrame. When we group by publication year, only one column is aggregated by the groupby call: the 'num_chapters' column. The other columns are strings, and therefore are not aggregated (since you can’t take the mean of a string). Consequently, .shape[1] reports exactly one column, containing the mean of the 'num_chapters' column.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 67%.


Problem 28.2

Which of the following strategies would work to compute the absolute difference in the average number of chapters per book for authors “Dean Koontz” and “Charles Dickens”?

Answer: do two queries to get two separate tables (one for each of “Dean Koontz” and “Charles Dickens”), use get on the 'num_chapters' column of each table, use the Series method .mean() on each, compute the absolute value of the difference in these two means

Logically, we want to somehow separate the data for authors “Dean Koontz” and “Charles Dickens”. (If we don’t, we’ll be taking a mean that includes the chapters of books from both authors.) To achieve this separation, we can create two separate tables with queries that specify a value in the 'author' column. Now, having two separate tables, we can focus on 'num_chapters' (the column of interest). To get the 'num_chapters' column we can use the .get method, and to compute the mean of that column we can invoke .mean().


Difficulty: ⭐️⭐️

The average score on this problem was 80%.


Problem 28.3

Which of the following will produce the same value as the total number of books in the table?

Answer: books.groupby(['author', 'title']).count().shape[0]

The key to this question is to understand that different authors can write books with the same name. The first two options check only for unique book titles (the first option) or unique authors (the second option). To ensure we count all unique author and title pairs, we must group by both 'author' and 'title'. To get the number of rows, we take .shape[0].


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 56%.



Problem 29

If we merge a table with n rows with a table with m rows, how many rows does the resulting table have?

Answer: not enough information to tell

The question does not provide enough information to know the size of the resulting table with certainty. When merging two tables, the tables can be merged with an inner, left, right, or outer join, and each of these joins can produce a different number of rows. Since the question does not specify the type of join, it is impossible to tell the size of the resulting table.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 74%.


Problem 30

Give an example of a dataset and a question you would want to answer about that dataset which you would answer by grouping with subgroups (using multiple columns in the groupby command). Explain how you would use the groupby command to answer your question.

Creative responses that are different than ones we’ve already seen in this class will earn the most credit.

Answer: There are many possible correct answers. Below are some student responses that earned full credit, lightly edited for clarity.


Consider the dataset of Olympic medals (Bronze, Silver, Gold) that a country won for a specific sport, with columns 'sport', 'country', 'medals'.

Question: In which sport did the US win the most medals?

We can group by country and then subgroup by sport. We can then use a combination of reset_index() and sort_values(by='medals'), and then use .get and .iloc[-1] to get our answer to the question.


Given a data set of cell phone purchase volume at every electronics store, we might want to find the difference in popularity of Samsung phones and iPhones in every state. I would use the groupby command to first group by state, followed by phone brand, and then aggregate with the sum() method. The resulting table would show the total iPhone and Samsung phone sales separately for each state which I could then use to calculate the difference in proportion of each brand’s sales volumes.


You are given a table called cars with columns: 'brands' (Toyota, Honda, etc.), 'model' (Prius, Accord, etc.), 'price' of the car, and 'fuel_type' (gas, hybrid, electric). Since you are environmentally friendly you only want cars that are electric, but you want to find the cheapest one. Find the brand that has the cheapest average price for an electric car.

You want to groupby on both 'brands' and 'fuel_type' and use the aggregate command mean() to find the average price per fuel type for each brand. Then you would find only the electric fuel types and sort values to find the cheapest.


Difficulty: ⭐️⭐️

The average score on this problem was 81%.


Problem 31


Problem 31.1

Consider the function unknown, defined below.

def unknown(df):
    grouped = plum.groupby('Opp').max().get(['Date', 'PTS'])
    return np.array(grouped.reset_index().index)[df]

What does unknown(3) evaluate to?

Answer: The number 3

plum.groupby('Opp').max() finds the largest value in the 'Date', 'Home', 'Won', 'PTS', 'AST', and 'TOV' columns for each unique 'Opp' (independently for each column). grouped = plum.groupby('Opp').max().get(['Date', 'PTS']) keeps only the 'Date' and 'PTS' columns. Note that in grouped, the index is 'Opp', the column we grouped on.

When grouped.reset_index() is called, the index is switched back to the default of 0, 1, 2, 3, 4, and so on. Then, grouped.reset_index().index is an Index containing the numbers [0, 1, 2, 3, 4, ...], and np.array(grouped.reset_index().index) is np.array([0, 1, 2, 3, 4, ...]). In this array, the number at position i is just i, so the number at position df is df. Here, df is the argument to unknown, and we were asked for the value of unknown(3), so the correct answer is the number at position 3 in np.array([0, 1, 2, 3, 4, ...]) which is 3.

Note that if we asked for unknown(50) (or unknown(k), where k is any integer above 30), the answer would be “An error”, since grouped could not have had 51 rows. plum has 31 rows, so grouped has at most 31 rows (but likely fewer, since Kelsey Plum’s team likely played the same opponent multiple times).
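
A minimal sketch showing why position i of a reset index holds the value i; any DataFrame works here:

import numpy as np
import pandas as pd

# Hypothetical grouped result with a string index.
grouped = pd.DataFrame({"PTS": [10, 20, 30, 40, 50]}, index=["a", "b", "c", "d", "e"])

# After reset_index(), the index is 0, 1, 2, 3, 4, so position 3 holds 3.
print(np.array(grouped.reset_index().index)[3])  # 3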


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 72%.


Problem 31.2

For your convenience, we show the first few rows of plum again below.

Suppose that Plum’s team, the Las Vegas Aces, won at least one game in Las Vegas and lost at least one game in Las Vegas. Also, suppose they won at least one game in an opponent’s arena and lost at least one game in an opponent’s arena.

Consider the DataFrame home_won, defined below.

home_won = plum.groupby(['Home', 'Won']).mean().reset_index()
  1. How many rows does home_won have?

  2. How many columns does home_won have?

Answer: 4 rows and 5 columns.

plum.groupby(['Home', 'Won']).mean() contains one row for every unique combination of 'Home' and 'Won'. There are two values of 'Home' (True and False) and two values of 'Won' (True and False), leading to 4 combinations. We can assume that there was at least one row in plum for each of these 4 combinations due to the assumption given in the problem:

Suppose that Plum’s team, the Las Vegas Aces, won at least one game in Las Vegas and lost at least one game in Las Vegas. Also, suppose they won at least one game in an opponent’s arena and lost at least one game in an opponent’s arena.

plum started with 7 columns: 'Date', 'Opp', 'Home', 'Won', 'PTS', 'AST', and 'TOV'. After grouping by ['Home', 'Won'] and using .mean(), 'Home' and 'Won' become the index. The resulting DataFrame contains all of the columns that the .mean() aggregation method can work on. We cannot take the mean of 'Date' and 'Opp', because those columns are strings, so plum.groupby(['Home', 'Won']).mean() contains a MultiIndex with 2 “columns” ('Home' and 'Won') and 3 regular columns ('PTS', 'AST', and 'TOV'). Then, when using .reset_index(), 'Home' and 'Won' are restored as regular columns, meaning that plum.groupby(['Home', 'Won']).mean().reset_index() has 2 + 3 = 5 columns.
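
A minimal sketch with made-up stats, one game per (Home, Won) combination. (babypandas’ .mean() drops the string columns automatically, as described above; in plain pandas we request that behavior explicitly with numeric_only=True.)

import pandas as pd

# Hypothetical miniature plum with the same seven columns.
plum = pd.DataFrame({
    "Date": ["d1", "d2", "d3", "d4"],
    "Opp": ["SEA", "PHO", "CHI", "NYL"],
    "Home": [True, True, False, False],
    "Won": [True, False, True, False],
    "PTS": [20, 15, 25, 10],
    "AST": [5, 4, 6, 3],
    "TOV": [2, 3, 1, 4],
})

home_won = plum.groupby(["Home", "Won"]).mean(numeric_only=True).reset_index()
print(home_won.shape)  # (4, 5)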


Difficulty: ⭐️⭐️

The average score on this problem was 78%.


Problem 31.3

Consider the DataFrame home_won once again.

home_won = plum.groupby(['Home', 'Won']).mean().reset_index()

Now consider the DataFrame puzzle, defined below. Note that the only difference between home_won and puzzle is the use of .count() instead of .mean().

puzzle = plum.groupby(['Home', 'Won']).count().reset_index()

How do the number of rows and columns in home_won compare to the number of rows and columns in puzzle?

Answer: home_won and puzzle have the same number of rows, but a different number of columns

All that changed between home_won and puzzle is the aggregation method. The aggregation method has no influence on the number of rows in the output DataFrame, as there is still one row for each of the 4 unique combinations of 'Home' and 'Won'.

However, puzzle has 7 columns, instead of 5. In the solution to the above subpart, we noticed that we could not use .mean() on the 'Date' and 'Opp' columns, since they contained strings. However, we can use .count() (since .count() just determines the number of non-NA values in each group), and so the 'Date' and 'Opp' columns are not “lost” when aggregating. Hence, puzzle has 2 more columns than home_won.


Difficulty: ⭐️⭐️

The average score on this problem was 85%.



Problem 32


Problem 32.1

There is exactly one team in the WNBA that Plum’s team did not win any games against during the 2021 season. Fill in the blanks below so that never_beat evaluates to a string containing the three-letter code of that team.

never_beat = plum.groupby(__(a)__).sum().__(b)__
  1. What goes in blank (a)?

  2. What goes in blank (b)?

Answer:

  1. 'Opp'

  2. sort_values('Won').index[0]

The key insight here is that the values in the 'Won' column are Boolean, and when Boolean values are used in arithmetic they are treated as 1s (True) and 0s (False). The sum of several 'Won' values is the same as the number of wins.

If we group plum by 'Opp' and use .sum(), the resulting 'Won' column contains the number of wins that Plum’s team had against each unique opponent. If we sort this DataFrame by 'Won' in increasing order (which is the default behavior of sort_values), the row at the top will correspond to the 'Opp' that Plum’s team had no wins against. Since we grouped by 'Opp', team names are stored in the index, so .index[0] will give us the name of the desired team.
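
A minimal sketch with made-up games, in which the team never beats SEA; only the two relevant columns are included:

import pandas as pd

# Hypothetical miniature plum.
plum = pd.DataFrame({
    "Opp": ["SEA", "SEA", "PHO"],
    "Won": [False, False, True],
})

# Booleans sum as 0/1, so the team with zero wins sorts to the top.
print(plum.groupby("Opp").sum().sort_values("Won").index[0])  # SEA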


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 67%.



Problem 33


Problem 33.1

What type of visualization is best suited for visualizing the trend in the number of points Kelsey Plum scored per game in 2021?

Answer: Line chart

Here, there are two quantitative variables (number of points and game number), and one of them involves some element of time (game number). Line charts are appropriate when one quantitative variable is time.


Difficulty: ⭐️⭐️

The average score on this problem was 75%.


Problem 33.2

Fill in the blanks below so that total_june evaluates to the total number of points Kelsey Plum scored in June.

june_only = plum[__(a)__]
total_june = june_only.__(b)__
  1. What goes in blank (a)?

  2. What goes in blank (b)?

Answer:

  1. plum.get('Date').str.contains('-06-')

  2. get('PTS').sum()

To find the total number of points Kelsey Plum scored in June, one approach is to first create a DataFrame with only the rows for June. During the month of June, the 'Date' values contain '-06-' (since June is the 6th month), so plum.get('Date').str.contains('-06-') is a Series containing True only for the June rows and june_only = plum[plum.get('Date').str.contains('-06-')] is a DataFrame containing only the June rows.

Then, all we need is the sum of the 'PTS' column, which is given by june_only.get('PTS').sum().
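
A minimal sketch with two made-up games, one in June and one in July:

import pandas as pd

# Hypothetical miniature plum.
plum = pd.DataFrame({
    "Date": ["2021-06-15", "2021-07-02"],
    "PTS": [20, 25],
})

# Keep only rows whose date contains '-06-', then total the points.
june_only = plum[plum.get("Date").str.contains("-06-")]
print(june_only.get("PTS").sum())  # 20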


Difficulty: ⭐️

The average score on this problem was 90%.


Problem 33.3

For your convenience, we show the first few rows of plum again below.

There is exactly one team in the WNBA that Plum’s team did not win any games against during the 2021 season. Fill in the blanks below so that never_beat evaluates to a string containing the three-letter code of that team.

never_beat = plum.groupby(__(a)__).sum().__(b)__
  1. What goes in blank (a)?

  2. What goes in blank (b)?

Answer:

  1. 'Opp'

  2. sort_values('Won').index[0]

The key insight here is that the values in the 'Won' column are Boolean, and when Boolean values are used in arithmetic they are treated as 1s (True) and 0s (False). The sum of several 'Won' values is the same as the number of wins.

If we group plum by 'Opp' and use .sum(), the resulting 'Won' column contains the number of wins that Plum’s team had against each unique opponent. If we sort this DataFrame by 'Won' in increasing order (which is the default behavior of sort_values), the row at the top will correspond to the 'Opp' that Plum’s team had no wins against. Since we grouped by 'Opp', team names are stored in the index, so .index[0] will give us the name of the desired team.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 67%.



Problem 34


Problem 34.1

The DataFrame plum contains Kelsey Plum’s stats for all games the Las Vegas Aces played in 2021. The first few rows of plum are shown below (though the full DataFrame has 31 rows, not 5):

Each row in plum corresponds to a single game. For each game, we have:

Suppose that Plum’s team, the Las Vegas Aces, won at least one game in Las Vegas and lost at least one game in Las Vegas. Also, suppose they won at least one game in an opponent’s arena and lost at least one game in an opponent’s arena.

Consider the DataFrame home_won, defined below.

home_won = plum.groupby(['Home', 'Won']).mean().reset_index()
  1. How many rows does home_won have?

  2. How many columns does home_won have?

Answer: 4 rows and 5 columns.

plum.groupby(['Home', 'Won']).mean() contains one row for every unique combination of 'Home' and 'Won'. There are two values of 'Home' (True and False) and two values of 'Won' (True and False), leading to 4 combinations. We can assume that there was at least one row in plum for each of these 4 combinations due to the assumption given in the problem:

Suppose that Plum’s team, the Las Vegas Aces, won at least one game in Las Vegas and lost at least one game in Las Vegas. Also, suppose they won at least one game in an opponent’s arena and lost at least one game in an opponent’s arena.

plum started with 7 columns: 'Date', 'Opp', 'Home', 'Won', 'PTS', 'AST', and 'TOV'. After grouping by ['Home', 'Won'] and using .mean(), 'Home' and 'Won' become the index. The resulting DataFrame contains all of the columns that the .mean() aggregation method can work on. We cannot take the mean of 'Date' and 'Opp', because those columns are strings, so plum.groupby(['Home', 'Won']).mean() contains a MultiIndex with 2 “columns” ('Home' and 'Won') and 3 regular columns ('PTS', 'AST', and 'TOV'). Then, when using .reset_index(), 'Home' and 'Won' are restored as regular columns, meaning that plum.groupby(['Home', 'Won']).mean().reset_index() has 2 + 3 = 5 columns.


Difficulty: ⭐️⭐️

The average score on this problem was 78%.



Problem 35


Problem 35.1

What type of visualization is best suited for visualizing the trend in the number of points Kelsey Plum scored per game in 2021?

Answer: Line chart

Here, there are two quantitative variables (number of points and game number), and one of them involves some element of time (game number). Line charts are appropriate when one quantitative variable is time.


Difficulty: ⭐️⭐️

The average score on this problem was 75%.


Problem 35.2

Fill in the blanks below so that total_june evaluates to the total number of points Kelsey Plum scored in June.

june_only = plum[__(a)__]
total_june = june_only.__(b)__
  1. What goes in blank (a)?

  2. What goes in blank (b)?

Answer:

  1. plum.get('Date').str.contains('-06-')

  2. get('PTS').sum()

To find the total number of points Kelsey Plum scored in June, one approach is to first create a DataFrame with only the rows for June. During the month of June, the 'Date' values contain '-06-' (since June is the 6th month), so plum.get('Date').str.contains('-06-') is a Series containing True only for the June rows and june_only = plum[plum.get('Date').str.contains('-06-')] is a DataFrame containing only the June rows.

Then, all we need is the sum of the 'PTS' column, which is given by june_only.get('PTS').sum().


Difficulty: ⭐️

The average score on this problem was 90%.


Problem 35.3

Consider the function unknown, defined below.

def unknown(df):
    grouped = plum.groupby('Opp').max().get(['Date', 'PTS'])
    return np.array(grouped.reset_index().index)[df]

What does unknown(3) evaluate to?

Answer: The number 3

plum.groupby('Opp').max() finds the largest value in the 'Date', 'Home', 'Won', 'PTS', 'AST', and 'TOV' columns for each unique 'Opp' (independently for each column). grouped = plum.groupby('Opp').max().get(['Date', 'PTS']) keeps only the 'Date' and 'PTS' columns. Note that in grouped, the index is 'Opp', the column we grouped on.

When grouped.reset_index() is called, the index is switched back to the default of 0, 1, 2, 3, 4, and so on. Then, grouped.reset_index().index is an Index containing the numbers [0, 1, 2, 3, 4, ...], and np.array(grouped.reset_index().index) is np.array([0, 1, 2, 3, 4, ...]). In this array, the number at position i is just i, so the number at position df is df. Here, df is the argument to unknown, and we were asked for the value of unknown(3), so the correct answer is the number at position 3 in np.array([0, 1, 2, 3, 4, ...]) which is 3.

Note that if we asked for unknown(50) (or unknown(k), where k is any integer above 30), the answer would be “An error”, since grouped could not have had 51 rows. plum has 31 rows, so grouped has at most 31 rows (but likely fewer, since Kelsey Plum’s team likely played the same opponent multiple times).


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 72%.


Problem 35.4

For your convenience, we show the first few rows of plum again below.

Suppose that Plum’s team, the Las Vegas Aces, won at least one game in Las Vegas and lost at least one game in Las Vegas. Also, suppose they won at least one game in an opponent’s arena and lost at least one game in an opponent’s arena.

Consider the DataFrame home_won, defined below.

home_won = plum.groupby(['Home', 'Won']).mean().reset_index()
  1. How many rows does home_won have?

  2. How many columns does home_won have?

Answer: 4 rows and 5 columns.

plum.groupby(['Home', 'Won']).mean() contains one row for every unique combination of 'Home' and 'Won'. There are two values of 'Home' (True and False) and two values of 'Won' (True and False), leading to 4 combinations. We can assume that there was at least one row in plum for each of these 4 combinations due to the assumption given in the problem:

Suppose that Plum’s team, the Las Vegas Aces, won at least one game in Las Vegas and lost at least one game in Las Vegas. Also, suppose they won at least one game in an opponent’s arena and lost at least one game in an opponent’s arena.

plum started with 7 columns: 'Date', 'Opp', 'Home', 'Won', 'PTS', 'AST', and 'TOV'. After grouping by ['Home', 'Won'] and using .mean(), 'Home' and 'Won' become the index. The resulting DataFrame contains all of the columns that the .mean() aggregation method can work on. We cannot take the mean of 'Date' and 'Opp', because those columns are strings, so plum.groupby(['Home', 'Won']).mean() contains a MultiIndex with 2 “columns” ('Home' and 'Won') and 3 regular columns ('PTS', 'AST', and 'TOV'). Then, when using .reset_index(), 'Home' and 'Won' are restored as regular columns, meaning that plum.groupby(['Home', 'Won']).mean().reset_index() has 2 + 3 = 5 columns.


Difficulty: ⭐️⭐️

The average score on this problem was 78%.


Problem 35.5

Consider the DataFrame home_won once again.

home_won = plum.groupby(['Home', 'Won']).mean().reset_index()

Now consider the DataFrame puzzle, defined below. Note that the only difference between home_won and puzzle is the use of .count() instead of .mean().

puzzle = plum.groupby(['Home', 'Won']).count().reset_index()

How do the number of rows and columns in home_won compare to the number of rows and columns in puzzle?

Answer: home_won and puzzle have the same number of rows, but a different number of columns

All that changed between home_won and puzzle is the aggregation method. The aggregation method has no influence on the number of rows in the output DataFrame, as there is still one row for each of the 4 unique combinations of 'Home' and 'Won'.

However, puzzle has 7 columns, instead of 5. In the solution to the above subpart, we noticed that we could not use .mean() on the 'Date' and 'Opp' columns, since they contained strings. However, we can use .count() (since .count() just determines the number of non-NA values in each group), and so the 'Date' and 'Opp' columns are not “lost” when aggregating. Hence, puzzle has 2 more columns than home_won.


Difficulty: ⭐️⭐️

The average score on this problem was 85%.


Problem 35.6

For your convenience, we show the first few rows of plum again below.

There is exactly one team in the WNBA that Plum’s team did not win any games against during the 2021 season. Fill in the blanks below so that never_beat evaluates to a string containing the three-letter code of that team.

never_beat = plum.groupby(__(a)__).sum().__(b)__
  1. What goes in blank (a)?

  2. What goes in blank (b)?

Answer:

  1. 'Opp'

  2. sort_values('Won').index[0]

The key insight here is that the values in the 'Won' column are Boolean, and when Boolean values are used in arithmetic they are treated as 1s (True) and 0s (False). The sum of several 'Won' values is the same as the number of wins.

If we group plum by 'Opp' and use .sum(), the resulting 'Won' column contains the number of wins that Plum’s team had against each unique opponent. If we sort this DataFrame by 'Won' in increasing order (which is the default behavior of sort_values), the row at the top will correspond to the 'Opp' that Plum’s team had no wins against. Since we grouped by 'Opp', team names are stored in the index, so .index[0] will give us the name of the desired team.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 67%.


Problem 35.7

Recall that plum has 31 rows, one corresponding to each of the 31 games Kelsey Plum’s team played in the 2021 WNBA season.

Fill in the blank below so that win_bool evaluates to True.

def modify_series(s):
    return __(a)__

n_wins = plum.get('Won').sum()
win_bool = n_wins == (31 + modify_series(plum.get('Won')))

What goes in blank (a)?

Answer: -(s == False).sum()

n_wins equals the number of wins that Plum’s team had. Recall that her team played 31 games in total. In order for (31 + modify_series(plum.get('Won'))) to be equal to her team’s number of wins, modify_series(plum.get('Won')) must be equal to her team’s number of losses, multiplied by -1.

To see this algebraically, let modified = modify_series(plum.get('Won')). Then:

31 + \text{modified} = \text{wins}

\text{modified} = \text{wins} - 31 = -(31 - \text{wins}) = -\text{losses}

The function modify_series(s) takes in a Series containing the wins and losses for each of Plum’s team’s games and needs to return the number of losses, multiplied by -1. s.sum() returns the number of wins, and (s == False).sum() returns the number of losses. Then, -(s == False).sum() returns the number of losses multiplied by -1, as desired.
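
A minimal sketch verifying the answer with a made-up 31-game season (20 wins, 11 losses):

import pandas as pd

def modify_series(s):
    # Number of losses, multiplied by -1.
    return -(s == False).sum()

# Hypothetical 'Won' column.
won = pd.Series([True] * 20 + [False] * 11)
n_wins = won.sum()
print(n_wins == 31 + modify_series(won))  # True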


Difficulty: ⭐️⭐️

The average score on this problem was 76%.



Problem 36

Suppose we have access to another DataFrame, new_york, that contains the latitude and longitude of every single skyscraper in New York City that is also in sky. The first few rows of new_york are shown below.

Below, we define a new DataFrame, sky_with_location, that merges together both sky and new_york.

sky_with_location = sky.merge(new_york, left_index=True, right_on='name')

Given that new_york has n rows, select the true statement below.

Answer: sky_with_location has exactly n rows.

Here, we are merging sky and new_york on skyscraper names (stored in the index in sky and in the 'name' column in new_york). The resulting DataFrame, sky_with_location, will have one row for each “match” between skyscrapers in sky and new_york. Since skyscraper names are presumably unique, sky_with_location will have one row for each skyscraper that is in both sky and new_york.

The skyscrapers that are in both sky and new_york are just the skyscrapers in new_york, since all of the non-New York skyscrapers in sky won’t be in new_york. As such, sky_with_location has the same number of rows as new_york. new_york has n rows, so sky_with_location also has n rows.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 64%.


Source: wi23-final — Q15

Problem 37

At a recent game night, you played several new board games and liked them so much that you now want to buy copies for yourself.

The DataFrame stores is shown below in full. Each row represents a game you want to buy and a local board game store where that game is available for purchase. If a game is not available at a certain store, there will be no row corresponding to that store and that game.


Problem 37.1

The DataFrame prices has five rows. Below we merge stores with prices and display the output in full.

merged = stores.merge(prices, on="Game")
merged


In the space below, specify what the DataFrame prices could look like. The column labels should go in the top row, and the row labels (index) should go in the leftmost column. You may not need to use all the columns provided, but you are told that prices has five rows, so you should use all rows provided.

Note: There are several correct answers to this question.

Answer:

We can use the merged DataFrame to figure out the price that corresponds to each game in stores. For example, merged shows that the price of Ticket to Ride should be 47.99, so we create a row for that game, and we repeat this process to find the remaining rows. Since prices has 5 rows but only four distinct games appear in merged, we must make up a fifth game and its price. In the solution above, the last row (index 4) contains “Sushi Go” and 9.99, but this could be any game not already listed in the other rows, at any price. Because “Sushi Go” does not appear in stores, it produces no matches in the merge and therefore does not affect the output.


Difficulty: ⭐️⭐️

The average score on this problem was 84%.


Suppose merged now contains all the available games and their corresponding prices at each store (in particular, a given game is sold for the same price at all stores). You want to buy as many games as possible, but you only want to go to one store. Which store should you go to in order to maximize the number of games you can buy?

Fill in the blanks so that where_to_go evaluates to the name of the store you should buy your games from.

where_to_go = (merged.groupby("Store").__(a)__.sort_values(by="Price", ascending=False).__(b)__)


Problem 37.2

What goes in blank (a)?

Answer: count()

The problem asks us which store would allow us to buy as many games as possible. The provided code begins with merged.groupby("Store").__(a)__, so we want the aggregation method that finds the number of games at each store. That aggregation method is count().


Difficulty: ⭐️⭐️

The average score on this problem was 87%.


Problem 37.3

What goes in blank (b)?

Answer: index[0]

Recall that groupby("Store") places the unique values of the "Store" column in the index. The remaining code sorts the DataFrame so that the store with the most games is at the top, meaning the first row corresponds to the store we want. Since store names are in the index, we grab the first one with index[0].
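
Assembling blanks (a) and (b), the completed expression is:

    where_to_go = (merged.groupby("Store").count()
                   .sort_values(by="Price", ascending=False)
                   .index[0])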


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 53%.


Problem 37.4

Suppose you go to the store where_to_go and buy one copy of each of the available games that you enjoyed at game night. How much money will you spend? Write one line of code that evaluates to the answer, using the merged DataFrame and no others.

Answer: merged[merged.get("Store") == where_to_go].get("Price").sum()

We want to figure out how much money we would spend at where_to_go, the store where we can buy the most games. We query the merged DataFrame to keep only the rows where the store is equal to where_to_go, then get the "Price" column and add up its values with .sum().


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 74%.



Source: wi23-midterm — Q9

Problem 38


Problem 38.1

The DataFrame directors contains historical information about the director of the National Hurricane Center (NHC). A preview of directors is shown below.


We would like to merge storms with directors to produce a DataFrame with the same information as storms plus one additional column with the name of the director who was leading the NHC at the time of each storm. However, when we try to merge with the command shown below, Python fails to produce the desired DataFrame.

directors.merge(storms, left_on="Tenure", right_on="Year")

Which of the following is a problem with our attempted merge? Select all that apply.

Answer: Option 3 and Option 4

Recall that left_df.merge(right_df, left_on='column_a', right_on='column_b') merges left_df with right_df, using left_on= and right_on= to specify which column from each DataFrame to use as the key. This means that column_a becomes the key for the left DataFrame and column_b becomes the key for the right DataFrame, so the column names do not need to be the same. What matters is that 'column_a' and 'column_b' contain the same type of data and the same kind of information for the merge to be successful.

Option 4 is correct because the years are formatted differently in storms and in directors. In storms, the "Year" column contains ints, each representing a year, whereas in directors, the "Tenure" column contains strings, each representing a span of years. There is no overlap between the values in these columns, and the merge will actually error because we are trying to merge two columns of different types.

Option 3 is correct because the merge fails to happen, due to the error caused by merging columns of different types with no overlapping values.

Option 1 is incorrect because you can merge DataFrames whose key columns have different names by using left_on and right_on.

Option 2 is incorrect because a correctly-specified merge works regardless of which DataFrame is on the left and which is on the right; the order of the DataFrames has no impact on whether the merge succeeds.
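
To see the failure concretely, here is a minimal sketch with hypothetical one-row versions of the two DataFrames:

    import babypandas as bpd

    # "Tenure" holds strings like "1968-1973", while "Year" holds ints.
    directors = bpd.DataFrame().assign(Director=["Robert Simpson"],
                                       Tenure=["1968-1973"])
    storms = bpd.DataFrame().assign(Name=["Katrina"], Year=[2005])

    # Merging an int column with a string column raises an error in pandas
    # (and hence in babypandas); even if it didn't, no values would match.
    directors.merge(storms, left_on="Tenure", right_on="Year")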


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 61%.



Source: wi23-midterm — Q10

Problem 39


Problem 39.1

Recall that all the named storms in storms occurred between 1965 and 2015, a fifty-year time period.

Below is a histogram and the code that produced it. Use it to answer the questions that follow.

    (storms.groupby(["Name", "Year"]).count()
        .reset_index()
        .plot(kind="hist", y="Year",
            density=True, ec="w",
            bins=np.arange(1965, 2020, 5)));


Approximately (a) percent of named storms in this fifty-year time period occurred in 1995 or later. Give your answer to the nearest multiple of five.

Answer: 55

We can find the percentage of named storms by using the fact that a histogram’s total area is normalized to 1 when density=True. We can therefore calculate the area to the left of 1995 as width * height: the height (density) of the bars there is about 0.015, and the width is 1995 - 1965 = 30 years. This gives us 1 - (1995 - 1965) * 0.015 = 0.55 as the proportion of storms occurring in 1995 or later. To convert this to a percentage, we multiply by 100: 0.55 * 100 = 55\%.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 52%.



Problem 39.2

True or False? The line plot generated by the code below will have no downward-sloping segments after 1995.

    (storms.groupby(["Name", "Year"]).count()
        .reset_index()
        .groupby("Year").count()
        .plot(kind="line", y="Name"));

Answer: False

The previous histogram shows upward-sloping segments for each 5-year period after 1995; however, we are now plotting a line graph that shows a continuous timeline. Thus, there may be downward-sloping segments within any 5-year period.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 51%.



Source: wi23-midterm — Q11

Problem 40


Problem 40.1

The code below defines a variable called month_formed.

month_formed = (storms.groupby(["Name", "Year"]).min()
                      .reset_index()
                      .groupby("Month").count()
                      .get("Name"))

What is the data type of month_formed?

Answer: Series

It’s helpful to analyze the code piece by piece. The first part, .groupby(["Name", "Year"]).min(), puts both "Name" and "Year" in the index and finds the minimum value of each remaining column within each group. We are still working with a DataFrame at this point. The next part, .reset_index(), makes "Name" and "Year" columns again; again, this is a DataFrame. The next part, .groupby("Month").count(), makes "Month" the index and counts the rows in each group. Finally, .get("Name") isolates the "Name" column, so month_formed is a Series.


Difficulty: ⭐️⭐️

The average score on this problem was 81%.



Problem 40.2

Which of the following expressions evaluates to the proportion of storms in our data set that were formed in August?

Answer: month_formed.loc[8]/month_formed.sum()

Option 1: Recall that August is the eighth month, so using .loc[8] will find the label 8 in month_formed, whose value is the number of storms formed in August. Dividing the number of storms formed in August by the total number of storms gives us the proportion of storms that formed in August.

Option 2: It is important to realize that the months have become the index of month_formed, but that doesn’t necessarily mean that the index starts in January or that a storm formed in every month before August. For example, if there were no storms in March, then there would be no 3 in the index. Recall that .iloc[7] selects whatever is in position 7, and because the index is not guaranteed to contain all months, we cannot be certain that .iloc[7] corresponds to August.

Option 3: The code month_formed[month_formed.index == 8].shape[0] will return 1. Querying for month 8 gives us the row for August, but .shape[0] returns the number of rows, which is 1 because of the groupby. This means Option 3’s code does not give us the number of storms that formed in August, so it cannot be used to find the proportion.

Option 4: Remember that month_formed’s index is "Month". This means that there is no column named "Month", so the code will error, and it cannot give us the proportion of storms that formed in August.
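
Here is a minimal sketch, with made-up counts, of why .loc[8] is safe but .iloc[7] is not. Suppose no storms formed before June, so the index of month_formed starts at 6:

    import babypandas as bpd

    month_formed = (bpd.DataFrame()
                    .assign(Month=[6, 7, 8, 9, 10],
                            Name=[2, 5, 10, 7, 1])
                    .set_index("Month")
                    .get("Name"))

    month_formed.loc[8] / month_formed.sum()  # 10 / 25 = 0.4
    # month_formed.iloc[7] would error here: there is no position 7.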


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 35%.



Source: wi24-final — Q9

Problem 41

In 2024, the Olympics will include breaking (also known as breakdancing) for the first time. The breaking competition will include 16 athletes, who will compete in a single-elimination tournament.

In the first round, all 16 athletes will compete against an opponent in a face-to-face “battle”. The 8 winners, as determined by the judges, will move on to the next round. Elimination continues until the final round contains just 2 competitors, and the winner of this final battle wins the tournament.

The table below shows how many competitors participate in each round:

After the 2024 Olympics, suppose we make a DataFrame called breaking containing information about the performance of each athlete during each round. breaking will have one row for each athlete’s performance in each round that they participated in. Therefore, there will be 16 + 8 + 4 + 2 = 30 rows in breaking.

In the "name" column of breaking, we will record the athlete’s name (which we’ll assume to be unique), and in the other columns we’ll record the judges’ scores in the categories on which the athletes will be judged (creativity, personality, technique, variety, performativity, and musicality).


Problem 41.1

How many rows of breaking correspond to the winner of the tournament? Give your answer as an integer.

Answer: 4

Since the winner of the tournament must have won during the 1st, 2nd, 3rd, and final rounds, there will be a total of four rows in breaking corresponding to this winner.


Difficulty: ⭐️

The average score on this problem was 94%.


Problem 41.2

How many athletes’ names appear exactly twice in the "name" column of breaking? Give your answer as an integer.

Answer: 4

For an athlete to appear on exactly two rows in breaking, they must get through the 1st round but get eliminated in the 2nd round. There are a total of 8 athletes in the 2nd round, of which 4 are eliminated.


Difficulty: ⭐️⭐️

The average score on this problem was 82%.


Problem 41.3

If we merge breaking with itself on the "name" column, how many rows will the resulting DataFrame have? Give your answer as an integer.

Hint: Parts (a) and (b) of this question are relevant to part (c).

Answer: 74

This question asks us the number of rows in the DataFrame that results from merging breaking with itself on the name column. Let’s break this problem down.

Concept:

When a DataFrame is merged with another DataFrame, there will be one row in the output DataFrame for every matching value between the two DataFrames in the column you’re merging on. In general, this means you can calculate the number of rows in the output DataFrame the following way:

\text{Number of instances of \texttt{a} in \texttt{df1} } \cdot \text{ Number of instances of \texttt{a} in \texttt{df2}} +

\text{Number of instances of \texttt{b} in \texttt{df1}} \cdot \text{Number of instances of \texttt{b} in \texttt{df2}} +

\vdots

\text{Number of instances of \texttt{n} in \texttt{df1}} \cdot \text{ Number of instances of \texttt{n} in \texttt{df2}}

For example, if there were 2 instances of "Jack" in df1 and 3 instances of "Jack" in df2, there would be 2 \cdot 3 = 6 instances of "Jack" in the output DataFrame.

So, when we’re merging a DataFrame with itself, you can calculate the number of rows in the output DataFrame the following way:

\text{Number of instances of \texttt{a} in \texttt{df1} } \cdot \text{ Number of instances of \texttt{a} in \texttt{df1} } +

\text{Number of instances of \texttt{b} in \texttt{df1}} \cdot \text{Number of instances of \texttt{b} in \texttt{df1} } +

\vdots

\text{Number of instances of \texttt{n} in \texttt{df1}} \cdot \text{ Number of instances of \texttt{n} in \texttt{df1} }

which is the same as

\left(\text{Number of instances of \texttt{a} in \texttt{df1} }\right)^2 +

\left(\text{Number of instances of \texttt{b} in \texttt{df1} }\right)^2 +

\vdots

\left(\text{Number of instances of \texttt{n} in \texttt{df1} }\right)^2

The Problem:

So, if we can figure out how many instances of each athlete are in breaking, we can calculate the number of rows in the merged DataFrame by squaring these numbers and adding them together.

As it turns out, we can absolutely do this! Consider the following information, drawn from the DataFrame:

  1. 8 athletes are eliminated in the first round, thus only appearing one time in breaking (in the rows corresponding to round one).
  2. 4 athletes are eliminated in the second round, appearing twice in breaking (in the rows corresponding to rounds one and two).
  3. 2 athletes are eliminated in the third round, appearing three times in breaking (in the rows corresponding to rounds one, two, and three).
  4. 2 athletes reach the fourth round, appearing four times in breaking (in the rows corresponding to rounds one, two, three, and four).

Using the formula above for calculating the number of rows in the output DataFrame when merging a DataFrame with itself:

For 8 athletes, they will appear 1^2 = 1 time in the merged DataFrame. For 4 athletes, they will appear 2^2 = 4 times in the merged DataFrame. For 2 athletes, they will appear 3^2 = 9 times in the merged DataFrame. And for the last 2 athletes, they will appear 4^2 = 16 times in the merged DataFrame.

So, the total number of rows in the merged DataFrame is

8(1) + 4(4) + 2(9) + 2(16) = 8 + 16 + 18 + 32 = 74 rows.
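
We can verify this arithmetic with a quick numpy computation over each athlete’s number of appearances:

    import numpy as np

    # 8 athletes appear once, 4 twice, 2 three times, and 2 four times.
    appearances = np.array([1] * 8 + [2] * 4 + [3] * 2 + [4] * 2)
    (appearances ** 2).sum()   # 74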


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 39%.


Problem 41.4

Recall that the number of competitors in each round is 16, 8, 4, 2. Write one line of code that evaluates to the array np.array([16, 8, 4, 2]). You must use np.arange in your solution, and you may not use np.array or the DataFrame breaking.

Answer: 2 ** np.arange(4, 0, -1)

This problem asks us to write one line of code that evaluates to the array np.array([16, 8, 4, 2]).

Concept:

Right away, it should jump out at you that these are powers of 2 in reverse order. Namely,

[2^4, 2^3, 2^2, 2^1].

The key insight is that exponentiation works element-wise on an array. In other words:

2 ** np.array([4, 3, 2, 1]) evaluates element-wise to [2^4, 2^3, 2^2, 2^1] = [16, 8, 4, 2].

Given this information, it is simply a matter of constructing a call to np.arange() that results in the array [4, 3, 2, 1]. While there are many calls that achieve this outcome, one example is np.arange(4, 0, -1).

So, the full expression that evaluates to np.array([16, 8, 4, 2]) is 2 ** np.arange(4, 0, -1).


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 38%.



Source: wi24-midterm — Q7

Problem 42

Part of the gameplay of Clue involves moving around the gameboard. The gameboard has 9 rooms, arranged on a grid, and players roll dice to determine how many spaces they can move.

The DataFrame dist contains a row and a column for each of the 9 rooms. The entry in row r and column c represents the shortest distance between rooms r and c on the Clue gameboard, or the smallest dice roll that would be required to move between rooms r and c. Since you don’t need to move at all to get from a room to the same room, the entries on the diagonal are all 0.

dist is indexed by "Room", and the room names appear exactly as they appear in the index of the clue DataFrame. These same values are also the column labels in dist.


Problem 42.1

Two of the following expressions are equivalent, meaning they evaluate to the same value without erroring. Select these two expressions.

Explain in one sentence why these two expressions are the same.

Answer: dist.get("kitchen").loc["library"] and dist.get("library").loc["kitchen"]

dist.get("kitchen").iloc["library"] and dist.get("library").iloc["kitchen"] are both wrong because they uses iloc inappropriately. iloc[] takes in an integer number representing the location of column, row, or cell you would like to extract and it does not take a column or index name.

dist.get("kitchen").loc["library"] and dist.get("library").loc["kitchen"] lead to the same answer because the DataFrame has a unique property! The entry at r, c is the same as the entry at c, r because both are the distances for the same two rooms. The distance from the kitchen to library is the same as the distance from the library to kichen.


Difficulty: ⭐️⭐️

The average score on this problem was 84%.


Problem 42.2

On the Clue gameboard, there are two “secret passages.” Each secret passage connects two rooms. Players can immediately move through secret passages without rolling, so in dist we record the distance as 0 between two rooms that are connected by a secret passage.

Suppose we run the following code.

    nonzero = 0
    for col in dist.columns:
        nonzero = nonzero + np.count_nonzero(dist.get(col))

Determine the value of nonzero after the above code is run.

Answer: nonzero = 68

The nonzero variable counts the entries in the DataFrame where the distance between two rooms is not 0. There are 9 \cdot 9 = 81 entries in the DataFrame, since there are 9 rooms. Because the diagonal of the DataFrame is 0 (the distance from a room to itself is 0), there are at most 81 - 9 = 72 nonzero entries.

We are also told that there are 2 secret passages, each of which connects 2 different rooms, meaning the distance between these rooms is 0. Each secret passage will cause 2 entries in the DataFrame to have a distance of 0. For instance, if the secret passage was between the kitchen and dining room, then the distance from the kitchen to the dining room would be 0, but also the distance from the dining room to the kitchen would be 0. Since there are 2 secret passages and each gives rise to 2 entries that are 0, this is 4 additional entries that are 0. This means there are 68 nonzero entries in the DataFrame, coming from 81 - 9 - 4 = 68.
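
The same reasoning can be checked at a smaller scale. Below is a minimal sketch with 3 hypothetical rooms and a single secret passage between rooms A and B:

    import babypandas as bpd
    import numpy as np

    dist = (bpd.DataFrame()
            .assign(Room=["A", "B", "C"],
                    A=[0, 0, 4],
                    B=[0, 0, 2],
                    C=[4, 2, 0])
            .set_index("Room"))

    nonzero = 0
    for col in dist.columns:
        nonzero = nonzero + np.count_nonzero(dist.get(col))

    # nonzero is 3*3 - 3 (diagonal) - 2 (one passage) = 4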


Difficulty: ⭐️⭐️⭐️⭐️⭐️

The average score on this problem was 28%.


Problem 42.3

Fill in blanks so that the expression below evaluates to a DataFrame with all the same information as dist, plus one extra column called "Cardholder" containing Janine’s knowledge of who holds each room card.

    dist.merge(___(a)___, ___(b)___, ___(c)___)
  1. What goes in blank (a)?

  2. What goes in blank (b)?

  3. What goes in blank (c)?

Answer:

  • (a): clue.get(["Cardholder"])
  • (b): left_index=True
  • (c): right_index=True

Since we want to create a DataFrame that looks like dist with an extra column of "Cardholder", we want to extract just that column from clue to merge with dist. We do this with clue.get(["Cardholder"]). This is necessary because when we merge two DataFrames, we get all columns from either DataFrame in the end result.

When deciding what columns to merge on, we need to look for columns from each DataFrame that share common values. In this case, the common values in the two DataFrames are not in columns, but in the index, so we use left_index=True and right_index=True.
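
Assembling the three blanks, the completed expression is:

    dist.merge(clue.get(["Cardholder"]), left_index=True, right_index=True)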


Difficulty: ⭐️⭐️⭐️⭐️⭐️

The average score on this problem was 28%.


Problem 42.4

Suppose we generate a scatter plot as follows.

    dist.plot(kind="scatter", x="kitchen", y="study");

Suppose the scatterplot has a point at (4, 6). What can we conclude about the Clue gameboard?

Answer: Another room besides the kitchen is 6 spaces away from the study.

The scatter plot shows how far each room is from the kitchen (the x-axis) and how far it is from the study (the y-axis); each room is represented by one point. A point at (4, 6) means there is a room that is 4 spaces away from the kitchen and 6 spaces away from the study. This room can’t be the kitchen or the study itself, since a room must be distance 0 from itself. Therefore, based on the y-coordinate, we conclude that another room besides the kitchen is 6 spaces away from the study.


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 47%.



Source: wi25-final — Q9

Problem 43

The Triwizard Tournament is an international competition between three wizarding academies: Hogwarts, Durmstrang, and Beauxbatons.

In a Triwizard Tournament, wizards from each school compete in three dangerous magical challenges. If one school wins two or more challenges, that school is the tournament champion. Otherwise, there is no champion, since each school won a single challenge.

The DataFrame triwiz has a row for each challenge from the first 20 Triwizard Tournaments. With 20 tournaments each having 3 challenges, triwiz has exactly 60 rows. The first six rows are shown below.

The columns are:


Problem 43.1

Fill in the blanks below to create the DataFrame champions, which is indexed by "Winner" and has just one column, "Year", containing the number of years in which each school was the tournament champion. champions is shown in full below.

Note that the values in the "Year" column add up to 14, not 20. That means there were 6 years in which there was a tie (for example, 1299 was one such year).

    grouped = triwiz.groupby(__(a)__).__(b)__.__(c)__
    filtered = grouped[__(d)__] 
    champions = filtered.groupby(__(e)__).__(f)__.__(g)__

What goes in blank (a)?

Answer: ["Year", "Winner"] or ["Winner", "Year"]

Grouping by both the "Year" and "Winner" columns ensures that each school’s win in a given year is represented as a single row.


Difficulty: ⭐️⭐️

The average score on this problem was 89%.


Problem 43.2

What goes in blank (b)?

Answer: count()

Since each winner in a given year appears as a single row, we use count() to determine how many times each school won that year.


Difficulty: ⭐️

The average score on this problem was 90%.


Problem 43.3

What goes in blank (c)?

Answer: reset_index()

Grouping by multiple columns creates a multi-index. reset_index() flattens the DataFrame back to normal, where each row represents a given year for each winning school.


Difficulty: ⭐️⭐️

The average score on this problem was 80%.


Problem 43.4

What goes in blank (d)?

Answer: grouped.get("Host") != 1 or grouped.get("Host") > 1 or grouped.get("Host") >= 2

The Triwizard Tournament champion is defined as a school that wins two or more challenges in a given year. After grouping with .count(), all of the non-index columns contain the same value: the number of challenges each winning school won that year. A school does not appear in this DataFrame at all for years in which it won no challenges, so we only need to check that this value is not 1. Using grouped.get("Challenge") in the query is also valid, because all of the non-index columns contain the same value.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 71%.


Problem 43.5

What goes in blank (e)?

Answer: "Winner"

The resulting DataFrame should be indexed by "Winner", therefore the DataFrame is grouped by the "Winner" column.


Difficulty: ⭐️

The average score on this problem was 90%.


Problem 43.6

What goes in blank (f)?

Answer: count()

Grouping with .count() again ensures that the resulting columns represent the number of times each "Winner" (school) in the index won across all years.


Difficulty: ⭐️⭐️

The average score on this problem was 84%.


Problem 43.7

What goes in blank (g)?

Answer: get(["Year"])

The question asks for a DataFrame with "Year" as the only column, so brackets are used around "Year" to ensure the output is a DataFrame rather than a Series.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 69%.


Problem 43.8

How many rows are in the DataFrame that results from merging triwiz with itself on "Year"? Give your answer as an integer.

Answer: 180

The most important part of this question is understanding how merge works in babypandas.

When we merge two DataFrames on "Year", we match every row in triwiz with every row that has the same value in the "Year" column (including the row itself, since we are merging triwiz with itself). Since there are three challenges per year, each year appears 3 times in the DataFrame, so matching all rows from each year with each other gives 3 * 3 = 9 rows per year. Since there are 20 years in the DataFrame, the merged DataFrame has 20 * 9 = 180 rows in total.
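
The same logic can be checked at a smaller scale with a hypothetical two-year version of triwiz:

    import babypandas as bpd

    # 2 years x 3 challenges; merging on "Year" gives 2 * (3*3) = 18 rows.
    toy = bpd.DataFrame().assign(Year=[1294, 1294, 1294, 1299, 1299, 1299],
                                 Challenge=[1, 2, 3, 1, 2, 3])
    toy.merge(toy, on="Year").shape[0]   # 18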


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 69%.


Problem 43.9

How many rows are in the DataFrame that results from merging triwiz with itself on "Challenge"? Give your answer as an integer.

Answer: 1200

Similar to the previous part, we are now matching all rows for a given challenge with each other. There are 3 challenges per tournament, so the values in the "Challenge" column are 1, 2, and 3. Each such value appears 20 times, once for each year. As a result, for each of the 3 challenges there are 20 * 20 = 400 rows. Therefore, we have 400 * 3 = 1200 rows total.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 59%.


Problem 43.10

How many rows are in the DataFrame that results from merging triwiz with itself on "Host"? Select the expression that evaluates to this number.

Answer: 2 * 21^2 + 18^2

The key to this problem is determining how many times each school hosts the tournament within this dataset. The host is determined on a rotating basis; based on the DataFrame description, the order is Hogwarts, Durmstrang, and then Beauxbatons. Since there are only 20 years in the dataset, the last school in the rotation hosts one fewer time than the other two: Hogwarts hosts 7 times, Durmstrang hosts 7 times, and Beauxbatons hosts 6 times. Since a school appears three times in the DataFrame for each year it hosts, the schools appear 21, 21, and 18 times respectively. As in the previous parts, merging matches all rows with the same host to each other, so the total number of rows is 21^2 + 21^2 + 18^2 = 2 \cdot 21^2 + 18^2 = 1206, which matches the last answer choice.


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 43%.



Source: wi25-midterm — Q2

Problem 44

As a broke college student, you are on a mission to find the dining hall with the greatest number of affordable menu items.


Problem 44.1

To begin, you want a DataFrame with the same columns as dining, but with an additional column "Affordability" which classifies each menu item as follows:

  1. "Cheap", for items that cost $6 or less.

  2. "Moderate", for items that cost more than $6 and at most $14.

  3. "Expensive", for items that cost more than $14.

Fill in the blanks below to assign this new DataFrame to the variable with_affordability.

    def categorize(price):
        price_as_float = __(a)__  
        if price_as_float __(b)__:
            return __(c)__
        elif price_as_float > 6:
            return "Moderate"
        else:
            return __(d)__
    with_affordability = dining.assign(Affordability = __(e)__)

Answer (a): float(price.strip("$")) or float(price.replace("$", "")) or float(price.split("$")[1])

To solve this problem, we must keep in mind that the prices in the "Price" column are formatted as strings with dollar signs in front of them. For example, we might see a value such as "$9.50". Our function’s goal is to transform any given value in the "Price" column into a float matching the corresponding dollar amount.

Therefore, one strategy is to first use the .strip string method to strip the price of the initial dollar sign. Since we need our output as a float, we can then call the Python float function on the result.

The other strategies are similar. We can replace the dollar sign with the empty string, effectively removing it, then convert to a float. Similarly, we can split the price according to the instances of "$". This will return a list of two elements, the first of which contains everything before the dollar sign, which is the empty string, and the second of which contains everything after the dollar sign, which is the part we want. We extract it with [1] and then convert the answer to a float.
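
All three strategies behave the same way on the sample price string from above:

    price = "$9.50"
    float(price.strip("$"))         # 9.5
    float(price.replace("$", ""))   # 9.5
    float(price.split("$")[1])      # 9.5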


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 61%.

Answer (b): > 14

The key to getting this question correct is understanding how the placement of price_as_float > 6 affects the order of the rest of the conditionals in the function. We test whether a price is "Moderate" by checking price_as_float > 6, but "Moderate" means more than 6 dollars and at most 14 dollars. So we must first check whether the price is "Expensive" before checking whether it is "Moderate". This way, prices in the "Expensive" range get caught by the first if statement, and then prices in the "Moderate" range get caught by the elif.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 74%.

Answer (c): "Expensive"

In the previous part of this question we checked to see if the price_as_float > 14. We did this to see if the given price falls into the "Expensive" category. Since the categorize function is meant to take in a price and output the corresponding category, we just need to output the correct category for this conditional which is "Expensive".


Difficulty: ⭐️⭐️

The average score on this problem was 75%.

Answer (d): "Cheap"

In the previous parts of this question, we implemented checks for whether a price is "Expensive" or "Moderate". This leaves only the "Cheap" category, which can be handled by the else statement: if a price is neither "Expensive" nor "Moderate", it must be "Cheap".


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 74%.

Answer (e): dining.get("Price").apply(categorize)

Now that we have implemented the categorize function, we need to apply it to the "Price" column so that it can be added as a new column into dining. Keep in mind our function takes in one price and outputs its price category. Thus, we need a way to apply this function to every value in the Series that corresponds to the "Price" column. To do this, we first need to get the Series from the DataFrame and then use apply on that Series with our categorize function as the input.
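
Putting all five blanks together, the completed code reads:

    def categorize(price):
        # Convert a price string like "$9.50" to a float, then classify it.
        price_as_float = float(price.strip("$"))
        if price_as_float > 14:
            return "Expensive"
        elif price_as_float > 6:
            return "Moderate"
        else:
            return "Cheap"

    with_affordability = dining.assign(
        Affordability=dining.get("Price").apply(categorize))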


Difficulty: ⭐️⭐️

The average score on this problem was 89%.


Problem 44.2

Now, you want to determine, for each dining hall, the number of menu items that fall into each affordability category. Fill in the blanks to define a DataFrame called counts containing this information. counts should have exactly three columns, named "Dining Hall", "Affordability", and "Count".

    counts = with_affordability.groupby(__(f)__).count().reset_index()
    counts = counts.assign(Count=__(g)__).__(h)__

Answer (f): ["Dining Hall", "Affordability"] or ["Affordability", "Dining Hall"]

The key to solving this problem is understanding exactly what the question is asking. You are asked to create a DataFrame that displays, for each dining hall, the number of menu items that fall into each affordability category. This indicates that you need to group by both "Dining Hall" and "Affordability", because grouping by multiple columns produces a row for every combination of values in those columns. Remember that to group by multiple columns, you pass the column names in a list to groupby. The order of the columns does not matter.


Difficulty: ⭐️⭐️

The average score on this problem was 81%.

Answer (g): counts.get("Price") or counts.get("Calories") or counts.get("Item")

For this part, you need to create a column called "Count" with the number of menu items that fall into each affordability category within each dining hall. The .count() aggregation method works by counting the number of values in each column within each group. In the resulting DataFrame, all the columns have the same values, so we can use the values from any one of the non-index columns in the grouped DataFrame. In this case, those columns are "Price", "Calories", and "Item". Therefore to fill in the blank, you would get any one of these columns.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 62%.

Answer (h): get(["Dining Hall", "Affordability", "Count"]) or drop(columns=["Item", "Price", "Calories"]) (columns can be in any order)

Recall that the question asked for a DataFrame with only three columns. You have just added one of them, "Count". This means you must either drop the remaining columns ("Price", "Calories", and "Item") or simply get the desired columns ("Dining Hall", "Affordability", and "Count").
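
Assembling blanks (f), (g), and (h), the completed code reads:

    counts = (with_affordability.groupby(["Dining Hall", "Affordability"])
              .count().reset_index())
    counts = (counts.assign(Count=counts.get("Price"))
              .get(["Dining Hall", "Affordability", "Count"]))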


Difficulty: ⭐️⭐️

The average score on this problem was 77%.


Problem 44.3

Suppose you determine that "The Bistro" is the dining hall with the most menu items in the "Cheap" category, so you will drag yourself there for every meal. Which of the following expressions must evaluate to the number of "Cheap" menu items available at "The Bistro"? Select all that apply.

Answer: (counts[counts.get("Affordability") == "Cheap"].sort_values(by="Count").get("Count").iloc[-1]) and counts[(counts.get("Affordability") == "Cheap") & (counts.get("Dining Hall") == "The Bistro")].get("Count").iloc[0]

  • Option 1: This code sorts all rows by "Count" in descending order (largest to smallest) and selects the count from the first row. This is incorrect because there can be another combination of "Dining Hall" and "Affordability" that has a number of menu items larger than the number of "Cheap" items at "The Bistro". For example, maybe there are 100 "Cheap" items at "The Bistro" but 200 "Moderate" items at "Pines".

  • Option 2: This code selects the largest value in the "Count" column, which may not necessarily be the number of "Cheap" items at "The Bistro". Taking the same example as above, maybe there are 100 "Cheap" items at "The Bistro" but 200 "Moderate" items at "Pines".

  • Option 3: This code first filters counts to only include rows where "Affordability" is "Cheap". Then, it sorts by "Count" in ascending order (smallest to largest). Since the question states that "The Bistro" has the most "Cheap" menu items, selecting the last row (iloc[-1]) from the "Count" column correctly retrieves the number of "Cheap" menu items at "The Bistro".

  • Option 4: This code filters counts to only include rows where "Dining Hall" is "The Bistro", then returns the maximum value in the "Count" column. However, "The Bistro" may have more menu items in the "Moderate" or "Expensive" categories than it does in the "Cheap" category. This query does not isolate "Cheap" items, so it incorrectly returns the highest count across all affordability levels rather than just "Cheap".

For example, if "The Bistro" has 100 "Cheap" menu items, 200 "Moderate" menu items, and 50 "Cheap" menu items, this code would evaluate to 200, even though that is not the number of "Cheap" items at "The Bistro". This is possible because we are told that "The Bistro" is the dining hall with the most menu items in the "Cheap" category, meaning in this example that every other dining hall has fewer than 100 "Cheap" items. However, this tells us nothing about the number of menu items at "The Bistro" in the other affordability categories, which can be greater than 100.

  • Option 5: This code queries for the rows where "Affordability" is "Cheap" and "Dining Hall" is "The Bistro". There is only one such row! The code then selects the first (and only) value in the "Count" column, which correctly evaluates to the number of "Cheap" items at "The Bistro".

Difficulty: ⭐️⭐️

The average score on this problem was 84%.



Source: wi25-midterm — Q5

Problem 45

It’s the grand opening of UCSD’s newest dining attraction: The Bread Basket! As a hardcore bread enthusiast, you celebrate by eating as much bread as possible. There are only a few menu items at The Bread Basket, shown with their costs in the table below:

Bread Cost
Sourdough 2
Whole Wheat 3
Multigrain 4

Suppose you are given an array eaten containing the names of each type of bread you ate.

For example, eaten could be defined as follows:

    eaten = np.array(["Whole Wheat", "Sourdough", "Whole Wheat", 
                      "Sourdough", "Sourdough"])

In this example, eaten represents five slices of bread that you ate, for a total cost of $12. Pricey!

In this problem, you’ll calculate the total cost of your bread-eating extravaganza in various ways. In all cases, your code must calculate the total cost for an arbitrary eaten array, which might not be exactly the same as the example shown above.


Problem 45.1

One way to calculate the total cost of the bread in the eaten array is outlined below. Fill in the missing code.

    breads = ["Sourdough", "Whole Wheat", "Multigrain"]
    prices = [2, 3, 4]
    total_cost = 0
    for i in __(a)__:
        total_cost = (total_cost + 
                     np.count_nonzero(eaten == __(b)__) * __(c)__)

Answer (a): [0, 1, 2] or np.arange(len(breads)) or range(3) or equivalent

Let’s read through the code skeleton and develop the answers from intuition. First, we notice a for loop, but we don’t yet know what sequence we’ll be looping through.

Then we notice a variable total_cost that is initialized to 0. This suggests we’ll use the accumulator pattern to keep a running total.

Inside the loop, we see that we are indeed adding onto the running total. The amount by which we increase total_cost is np.count_nonzero(eaten == __(b)__) * __(c)__. Let’s remember what this does. It first compares the eaten array to some value and then counts how many True values are in the resulting Boolean array. In other words, it counts how many times the entries in the eaten array equal some particular value. This is a big clue about how to fill in the code. There are only three possible values in the eaten array: "Sourdough", "Whole Wheat", and "Multigrain". For example, if blank (b) were filled with "Sourdough", we would be counting how many of the slices in the eaten array were "Sourdough". Since each such slice costs 2 dollars, we could find the total cost of all "Sourdough" slices by multiplying this count by 2.

Understanding this helps us understand what the code is doing: it is separately computing the cost of each type of bread ("Sourdough", "Whole Wheat", "Multigrain") and adding this onto the running total. Once we understand what the code is doing, we can figure out how to fill in the blanks.

We just discussed filling in blank (b) with "Sourdough", in which case we would have to fill in blank (c) with 2. But that just gives the contribution of "Sourdough" to the overall cost. We also need the contributions from "Whole Wheat" and "Multigrain". Somehow, blank (b) needs to take on the values in the provided breads list. Similarly, blank (c) needs to take on the values in the prices list, and we need to make sure that we iterate through both of these lists simultaneously: we should access breads[0] when we access prices[0], for example. We can use the loop variable i to help us, and fill in blank (b) with breads[i] and blank (c) with prices[i]. This means i needs to take on the values 0, 1, 2, so we can loop through the sequence [0, 1, 2] in blank (a). This is the same as range(len(breads)) and range(len(prices)).

Bravo! We have everything we want, and the code block is now complete.
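
For reference, here is the completed code, run on the example eaten array defined earlier; it produces a total_cost of 12, matching the example:

    import numpy as np

    eaten = np.array(["Whole Wheat", "Sourdough", "Whole Wheat",
                      "Sourdough", "Sourdough"])

    breads = ["Sourdough", "Whole Wheat", "Multigrain"]
    prices = [2, 3, 4]
    total_cost = 0
    for i in [0, 1, 2]:
        # Count the slices of breads[i] in eaten, times that bread's price.
        total_cost = (total_cost +
                      np.count_nonzero(eaten == breads[i]) * prices[i])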


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 58%.

Answer (b): breads[i]

See explanation above.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 61%.

Answer (c): prices[i]

See explanation above.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 61%.


Problem 45.2

Another way to calculate the total cost of the bread in the eaten array uses the merge method. Fill in the missing code below.

    available = bpd.DataFrame().assign(Type = ["Sourdough", 
                "Whole Wheat", "Multigrain"]).assign(Cost = [2, 3, 4])
    consumed = bpd.DataFrame().assign(Eaten = eaten)
    combined = available.merge(consumed, left_on = __(d)__, 
                               right_on = __(e)__)   
    total_cost = combined.__(f)__

Answer (d): "Type"

It always helps to sketch out the DataFrames.

Let’s first develop some intuition based on the keywords.

  1. We create two DataFrames as seen above, and we perform a merge operation. We want to fill in which columns to merge on. It should be easy to see that blank (e) should be the "Eaten" column in consumed, so let’s fill that in.

  2. We are getting total_cost from the combined DataFrame in some way. If we want the total amount of something, which aggregation function might we use?

In general, if we have df_left.merge(df_right, left_on=col_left, right_on=col_right), assuming all entries are non-empty, the merge process looks at each entry in the specified column of df_left and grabs all entries in the specified column of df_right that match it. Based on this, we know that the combined DataFrame will contain a column of all the breads we have eaten and their corresponding prices. Blank (d) is also settled: we can get the list of all breads eaten by merging on the "Type" column in available to match with "Eaten".


Difficulty: ⭐️⭐️

The average score on this problem was 81%.

Answer (e): "Eaten"

See explanation above.


Difficulty: ⭐️⭐️

The average score on this problem was 80%.

Answer (f): get("Cost").sum()

The combined DataFrame would look something like this:

To get the total cost of the bread in "Eaten", we can take the sum of the "Cost" column.
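
Assembling blanks (d), (e), and (f), the completed code reads:

    combined = available.merge(consumed, left_on="Type", right_on="Eaten")
    total_cost = combined.get("Cost").sum()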


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 70%.