Lecture 5 — Practice



Lecture 5 — Collected Practice Questions

Below are practice problems tagged for Lecture 5 (rendered directly from the original exam/quiz sources).


Problem 1

In this question, we’ll keep working with the art_museums DataFrame.


Problem 1.1

(Remember to keep the data description from the top of the exam open in another tab!)

'Tate Modern' is the most popular art museum in London. But what’s the most popular art museum in each city?

It turns out that there’s no way to answer this easily using the tools that you know about so far. To help, we’ve created a new Series method, .last(). If s is a Series, s.last() returns the last element of s (i.e. the element at the very end of s). .last() works with .groupby, too (just like .mean() and .count()).

Fill in the blanks so that the code below correctly assigns best_per_city to a DataFrame with one row per city, that describes the name, number of visitors, and rank of the most visited art museum in each city. best_per_city should be sorted in decreasing order of number of visitors. The first few rows of best_per_city are shown below.


best_per_city = __(a)__.groupby(__(b)__).last().__(c)__
  1. What goes in blank (a)?
  2. What goes in blank (b)?
  3. What goes in blank (c)?

Answers:

  1. art_museums.sort_values('Visitors', ascending=True)
  2. 'City'
  3. sort_values('Visitors', ascending=False)

Let’s take a look at the completed implementation.

best_per_city = art_museums.sort_values('Visitors', ascending=True).groupby('City').last().sort_values('Visitors', ascending=False)

We first sort the rows in art_museums by 'Visitors' in ascending order. Then we call .groupby('City'), so that we end up with one row per city. Recall that we need an aggregation method after using .groupby(); in this question, we use .last() to keep only the last row for each city. Since in blank (a) we sorted the rows by 'Visitors' in ascending order, .last() keeps the row containing the name of the most visited museum in each city. Finally, we sort this new DataFrame by 'Visitors' in descending order to fulfill the question’s requirement.
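The pattern above can be checked on a tiny made-up version of art_museums (the museums and visitor counts below are invented for illustration):

```python
import pandas as pd

# Hypothetical mini-version of art_museums; names and numbers are made up.
art_museums = pd.DataFrame({
    'Name': ['Tate Modern', 'National Gallery', 'Louvre', "Musee d'Orsay"],
    'City': ['London', 'London', 'Paris', 'Paris'],
    'Visitors': [5_800_000, 5_200_000, 9_600_000, 3_600_000],
})

# Sort ascending so the most visited museum in each city ends up last,
# then .last() keeps exactly that row for each city.
best_per_city = (art_museums
                 .sort_values('Visitors', ascending=True)
                 .groupby('City')
                 .last()
                 .sort_values('Visitors', ascending=False))
```

On this data, best_per_city keeps the Louvre for Paris and Tate Modern for London, with Paris (the larger visitor count) in the first row.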


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 65%.


Problem 1.2

Assume you’ve defined best_per_city correctly.

Which of the following options evaluates to the number of visitors to the most visited art museum in Amsterdam? Select all that apply.

Answer: best_per_city.get('Visitors').loc['Amsterdam'], best_per_city[best_per_city.index == 'Amsterdam'].get('Visitors').iloc[0], best_per_city[best_per_city.index == 'Amsterdam'].get('Visitors').iloc[-1], best_per_city[best_per_city.index == 'Amsterdam'].get('Visitors').loc['Amsterdam'] (Select all except “None of the above”)

best_per_city.get('Visitors').loc['Amsterdam'] We first use .get('Visitors') to extract the column of visitor counts as a Series, and then, since 'City' is the index, we use .loc['Amsterdam'] to access the number of visitors to the most visited art museum in Amsterdam.

best_per_city[best_per_city.index == 'Amsterdam'].get('Visitors').iloc[0] We first query the best_per_city to only include the DataFrame with one row with index 'Amsterdam'. Then, we get the 'Visitors' column of this DataFrame. Finally, we use iloc[0] to access the first and the only value in this column.

best_per_city[best_per_city.index == 'Amsterdam'].get('Visitors').iloc[-1] We first query the best_per_city to only include the DataFrame with one row with index 'Amsterdam'. Then, we get the 'Visitors' column of this DataFrame. Finally, we use iloc[-1] to access the last and the only value in this column.

best_per_city[best_per_city.index == 'Amsterdam'].get('Visitors').loc['Amsterdam'] We first query the best_per_city to only include the DataFrame with one row with index 'Amsterdam'. Then, we get the 'Visitors' column of this DataFrame. Finally, we use loc['Amsterdam'] to access the value in this column with index 'Amsterdam'.
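All four correct options can be verified on a small made-up best_per_city (the cities and counts below are invented):

```python
import pandas as pd

# Hypothetical best_per_city-style DataFrame, indexed by city.
best_per_city = pd.DataFrame(
    {'Name': ['Rijksmuseum', 'Louvre'], 'Visitors': [2_700_000, 9_600_000]},
    index=['Amsterdam', 'Paris'])
best_per_city.index.name = 'City'

# Four equivalent ways to get Amsterdam's visitor count.
a = best_per_city.get('Visitors').loc['Amsterdam']
b = best_per_city[best_per_city.index == 'Amsterdam'].get('Visitors').iloc[0]
c = best_per_city[best_per_city.index == 'Amsterdam'].get('Visitors').iloc[-1]
d = best_per_city[best_per_city.index == 'Amsterdam'].get('Visitors').loc['Amsterdam']
```

In the one-row DataFrame produced by the query, the first element, the last element, and the element labeled 'Amsterdam' are all the same value.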


Difficulty: ⭐️⭐️

The average score on this problem was 84%.



Problem 2


Problem 2.1

Which of these correctly evaluates to the number of flights King Triton took to San Diego (airport code 'SAN')?

Answer: flights[flights.get('TO') == 'SAN'].shape[0]

The strategy is to create a DataFrame with only the flights that went to San Diego, then count the number of rows. The first step is to query with the condition flights.get('TO') == 'SAN' and the second step is to extract the number of rows with .shape[0].

Some of the other answer choices use .loc['SAN'] but .loc only works with the index, and flights does not have airport codes in its index.
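A minimal sketch of the query-then-count pattern, using made-up flight data:

```python
import pandas as pd

# Hypothetical flights data; routes are invented for illustration.
flights = pd.DataFrame({
    'FROM': ['LAX', 'SAN', 'SFO', 'LAX'],
    'TO':   ['SAN', 'SFO', 'SAN', 'JFK'],
})

# Keep only flights to San Diego, then count the remaining rows.
num_to_san = flights[flights.get('TO') == 'SAN'].shape[0]
```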


Difficulty: ⭐️

The average score on this problem was 95%.


Problem 2.2

Fill in the blanks below so that the result also evaluates to the number of flights King Triton took to San Diego (airport code 'SAN').

flights.groupby(__(a)__).count().get('FLIGHT').__(b)__            

What goes in blank (a)?

What goes in blank (b)?

True or False: If we change .get('FLIGHT') to .get('SEAT'), the results of the above code block will not change. (You may assume you answered the previous two subparts correctly.)

Answer: 'TO', .loc['SAN'], True

The strategy here is to group all of King Triton’s flights according to where they landed, and count up the number that landed in San Diego. The expression flights.groupby('TO').count() evaluates to a DataFrame indexed by arrival airport where, for any arrival airport, each column has a count of the number of King Triton’s flights that landed at that airport. To get the count for San Diego, we need the entry in any column for the row corresponding to San Diego. The code .get('FLIGHT') says we’ll use the 'FLIGHT' column, but any other column would be equivalent. To access the entry of this column corresponding to San Diego, we have to use .loc because we know the name of the value in the index should be 'SAN', but we don’t know the row number or integer position.
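The groupby-then-.loc strategy can be sketched on the same kind of made-up data (the flight numbers and seats below are invented):

```python
import pandas as pd

# Hypothetical flights data with 'FLIGHT' and 'SEAT' columns to count.
flights = pd.DataFrame({
    'FLIGHT': ['DL101', 'UA202', 'WN303', 'AA404'],
    'SEAT':   ['12A', '3C', '20F', '9B'],
    'FROM':   ['LAX', 'SAN', 'SFO', 'LAX'],
    'TO':     ['SAN', 'SFO', 'SAN', 'JFK'],
})

# After .count(), every column holds the same numbers, so .get('SEAT')
# would work just as well as .get('FLIGHT').
num_to_san = flights.groupby('TO').count().get('FLIGHT').loc['SAN']
```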


Difficulty: ⭐️⭐️

The average score on this problem was 89%.


Problem 2.3

Consider the DataFrame san, defined below.

san = flights[(flights.get('FROM') == 'SAN') & (flights.get('TO') == 'SAN')]

Which of these DataFrames must have the same number of rows as san?

Answer: flights[(flights.get('FROM') == 'LAX') & (flights.get('TO') == 'LAX')]

The DataFrame san contains all rows of flights that have a departure airport of 'SAN' and an arrival airport of 'SAN'. But as you may know, and as you’re told in the data description, there are no flights from an airport to itself. So san is actually an empty DataFrame with no rows!

We just need to find which of the other DataFrames would necessarily be empty, and we can see that flights[(flights.get('FROM') == 'LAX') & (flights.get('TO') == 'LAX')] will be empty for the same reason.

Note that none of the other answer choices are correct. The first option uses the Python keyword and instead of the symbol &, which does not behave as intended for elementwise conditions and does not give an empty DataFrame. The second option will be non-empty because it will contain all flights that have San Diego as the departure airport or the arrival airport, and we already know from the first few rows of flights that there are some of these. The third option will contain all the flights that King Triton has taken from 'LAX' to 'SAN'. Perhaps he’s never flown this route, or perhaps he has. This DataFrame could be empty, but it’s not necessarily going to be empty, as the question requires.
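A quick sketch on made-up data shows why both same-airport queries are necessarily empty:

```python
import pandas as pd

# Hypothetical flights data; no row ever has FROM equal to TO.
flights = pd.DataFrame({
    'FROM': ['LAX', 'SAN', 'SFO'],
    'TO':   ['SAN', 'SFO', 'SAN'],
})

# Both conditions require a flight from an airport to itself,
# which never occurs, so both results have zero rows.
san = flights[(flights.get('FROM') == 'SAN') & (flights.get('TO') == 'SAN')]
lax = flights[(flights.get('FROM') == 'LAX') & (flights.get('TO') == 'LAX')]
```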


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 70%.



Problem 3

We define a “route” to be a departure and arrival airport pair. For example, all flights from 'SFO' to 'SAN' make up the “SFO to SAN route”. This is different from the “SAN to SFO route”.

Fill in the blanks below so that most_frequent.get('FROM').iloc[0] and most_frequent.get('TO').iloc[0] correspond to the departure and destination airports of the route that King Triton has spent the most time flying on.

most_frequent = flights.groupby(__(a)__).__(b)__
most_frequent = most_frequent.reset_index().sort_values(__(c)__)


Problem 3.1

What goes in blank (a)?

Answer: ['FROM', 'TO']

We want to organize flights by route. This means we need to group by both 'FROM' and 'TO' so any flights with the same pair of departure and arrival airports get grouped together. To group by multiple columns, we must use a list containing all these column names, as in flights.groupby(['FROM', 'TO']).


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 72%.


Problem 3.2

What goes in blank (b)?

Answer: sum()

Every .groupby command needs an aggregation function! Since we are asked to find the route that King Triton has spent the most time flying on, we want to total the times for all flights on a given route.

Note that .count() would tell us how many flights King Triton has taken on each route. That’s meaningful information, but not what we need to address the question of which route he spent the most time flying on.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 58%.


Problem 3.3

What goes in blank (c)?

Answer: by='HOURS', ascending=False

We want to know the route that King Triton spent the most time flying on. After we group flights by route, summing flights on the same route, the 'HOURS' column contains the total amount of time spent on each route. We need most_frequent.get('FROM').iloc[0] and most_frequent.get('TO').iloc[0] to correspond with the departure and destination airports of the route that King Triton has spent the most time flying on. To do this, we need to sort in descending order of time, to bring the largest time to the top of the DataFrame. So we must sort by 'HOURS' with ascending=False.
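The full pipeline for Problem 3 can be sketched on made-up flight data (routes and hours invented; an 'HOURS' column of flight durations is assumed):

```python
import pandas as pd

# Hypothetical flights data; SFO-to-SAN is flown twice here.
flights = pd.DataFrame({
    'FROM':  ['SFO', 'SFO', 'SAN', 'SFO'],
    'TO':    ['SAN', 'SAN', 'SFO', 'JFK'],
    'HOURS': [3.0, 2.9, 1.6, 5.5],
})

# Group by the (departure, arrival) pair, total the hours per route,
# and sort so the route with the most total flying time comes first.
most_frequent = flights.groupby(['FROM', 'TO']).sum()
most_frequent = most_frequent.reset_index().sort_values(by='HOURS', ascending=False)
```

Here the SFO-to-SAN route totals 5.9 hours, more than the single 5.5-hour SFO-to-JFK flight, so it ends up in the first row.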


Difficulty: ⭐️

The average score on this problem was 94%.



Source: fa22-midterm — Q4

Problem 4

Suppose we’ve run the following line of code.

counts = evs.groupby("Brand").count()


Problem 4.1

What value does counts.get("Range").sum() evaluate to?

Answer: 32

counts is a DataFrame with one row per "Brand", since we grouped by "Brand". Since we used the .count() aggregation method, the columns in counts will all be the same – they will all contain the number of rows in evs for each "Brand" (i.e. they will all contain the distribution of "Brand"). If we sum up the values in any one of the columns in counts, then, the result will be the total number of rows in evs, which we know to be 32. Thus, counts.get("Range").sum() is 32.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 56%.


Problem 4.2

What value does counts.index[3] evaluate to?

Answer: "Tesla"

Since we grouped by "Brand" to create counts, the index of counts will be "Brand", sorted alphabetically (this sorting happens automatically when grouping). This means that counts.index will be the array-like sequence ["Audi", "BMW", "Nissan", "Tesla"], and counts.index[3] is "Tesla".
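Both facts about counts (every column sums to the number of rows, and the index is sorted alphabetically) can be checked on a made-up evs (brands and ranges invented):

```python
import pandas as pd

# Hypothetical evs-style DataFrame with 6 rows across four brands.
evs = pd.DataFrame({
    'Brand': ['Tesla', 'BMW', 'Audi', 'Tesla', 'Nissan', 'BMW'],
    'Range': [350, 280, 300, 330, 240, 260],
})

# Grouping sorts the index alphabetically, and every column of the
# .count() result contains the distribution of 'Brand'.
counts = evs.groupby('Brand').count()
```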


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 33%.



Source: fa23-final — Q3

Problem 5

Consider the variable is_fraud_mean, defined below.

is_fraud_mean = txn.get("is_fraud").mean()

Which of the following expressions are equivalent to is_fraud_mean? Select all that apply.

Answer: B, C, D, and E.

The correct responses are B, C, D, and E. First, we must see that txn.get("is_fraud").mean() will calculate the mean of the "is_fraud" column, which is a float representing the proportion of values in the "is_fraud" column that are True. With this in mind, we can consider each option:

  • Option A: This operation will result in a DataFrame. We first group by "is_fraud", creating one row for fraudulent transactions and one row for non-fraudulent ones. We then take the mean of each numerical column, which will determine the entries of the DataFrame. Since this results in a DataFrame and not a float, this answer choice cannot be correct.

  • Option B: Here we simply take the mean of the "is_fraud" column using the definition of the mean as the sum of the values divided by the number of values. This is equivalent to the original.

  • Option C: np.count_nonzero will return the number of nonzero values in a sequence. Since we only have True and False values in the "is_fraud" column, and Python considers True to be 1 and False to be 0, this means counting the number of ones is equivalent to the sum of all the values. So, we end up with an expression equivalent to the formula for the mean which we saw in part B.

  • Option D: Recall that "is_fraud" contains Boolean values, and that True evaluates to 1 and False evaluates to 0. txn.get("is_fraud") > 0.8 conducts an elementwise comparison, evaluating whether each value in the column is greater than 0.8, and returning the resulting Series of Booleans. Any True (1) value in the column is greater than 0.8, so it stays True; any False (0) value is not, so it stays False. The values in the resulting Series are therefore identical to the original column, so taking the mean of either gives the same value.

  • Option E: txn.get("is_fraud") == 0 performs an elementwise comparison, returning a series which has the value True where "is_fraud" is False (0), and False where "is_fraud" is True. Therefore the mean of this Series represents the proportion of values in the "is_fraud" column that are False. Since every value in that column is either False or True, the proportion of True values is equivalent to one minus the proportion of False values.
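The exact option expressions aren’t reproduced here, but plausible forms matching the descriptions of B through E can be checked on a tiny made-up txn:

```python
import numpy as np
import pandas as pd

# Hypothetical txn-style DataFrame with a Boolean "is_fraud" column.
txn = pd.DataFrame({'is_fraud': [True, False, False, True, False]})

col = txn.get('is_fraud')
mean_a = col.mean()                            # the original expression
mean_b = col.sum() / txn.shape[0]              # sum divided by count (B)
mean_c = np.count_nonzero(col) / txn.shape[0]  # True values are nonzero (C)
mean_d = (col > 0.8).mean()                    # True > 0.8, False is not (D)
mean_e = 1 - (col == 0).mean()                 # one minus proportion False (E)
```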


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 69%.


Source: fa23-midterm — Q2

Problem 6

Suppose we define the three variables below.

J = living_cost.get("county") == "Benton County"
K = living_cost.get("state") == "IN"
L = living_cost.get("family_type") == "1a2c"

Feel free to use these variables in your solutions to the following questions.


Problem 6.1

Fill in the blanks so that the expression below evaluates to the average yearly childcare cost for families with one adult and two children in Benton County, IN.

    __(a)__.__(b)__.iloc[0]
  1. What goes in blank (a)?

  2. What goes in blank (b)?

Answer:

  1. living_cost[J & K & L]

The first step is to query the DataFrame for the rows that meet our specific criteria. In this case, we want the rows where the county is "Benton County", the state is "IN", and the family has 1 adult and 2 children. J, K, and L specify these criteria. When used to query the living_cost DataFrame, we obtain a DataFrame with only one row, corresponding to this family type in this specific county.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 69%.

Answer:

  2. get("avg_childcare_cost")

Once we have a DataFrame that contains the row we need, we need to use it to get the average yearly childcare cost. To do that, we get out the "avg_childcare_cost" column and access the entry in row 0 with .iloc[0]. This works because after querying, there is only one row, and it corresponds exactly to the families with one adult and two children in Benton County, IN.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 57%.


Problem 6.2

Fill in the blanks so that the expression below evaluates to the number of states with a county named Benton County.

    __(c)__.__(d)__ / 10
  1. What goes in blank (c)?

  2. What goes in blank (d)?

Answer:

  1. living_cost[J]

Since we want to find how many states have a county named "Benton County", we first want to obtain all the rows of the DataFrame where the county is "Benton County". Variable J specifies this condition, so we use it to query and obtain a DataFrame with the rows in living_cost where the county is "Benton County."


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 60%.

Answer:

  2. shape[0]

Now that we have all the rows in the DataFrame where the county is "Benton County", let’s consider how many rows it has. Each row of this DataFrame represents a unique combination of "state" and "family_type" for counties called "Benton County". We know from the data description that each of the ten family structures is present in each county. This means that for each state with a county called "Benton County", our DataFrame has exactly ten rows. Therefore, the number of states with a county called "Benton County" is the total number of rows in our DataFrame divided by ten. Therefore, we should fill in blank (d) with .shape[0] to extract the number of rows from our DataFrame.


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 40%.



Source: fa24-midterm — Q5

Problem 7

Write a Python expression that evaluates to the address of the house with the most pieces of candy available (the most pieces, not the most varieties).

It’s okay if you need to write on multiple lines, but your code should represent a single expression in Python.

Answer: treat.groupby("address").sum().sort_values(by="how_many", ascending=False).index[0] or treat.groupby("address").sum().sort_values(by="how_many").index[-1]

In the treat DataFrame, there are multiple rows for each address: one for each candy being given out, with its quantity. Since we want the address with the most pieces of candy available, we need to combine this information, so we start by grouping by address: treat.groupby("address"). Since we want to add up the number of pieces of candy available per address, we use the .sum() aggregation method. Now we have a DataFrame with one row per address, where the value in each column is the sum of the corresponding values for that address. To get the address with the most pieces of candy available, we can simply sort by the "how_many" column, since it stores the total amount of candy per house. Setting ascending=False means that the address with the greatest amount of candy will be in the first row. Since the addresses are in the index as a result of the groupby, we can access this value using .index[0].

Note: If you do not set ascending=False, then the address with the most amount of candy available will be the last row which you can access by index[-1].
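A minimal sketch on made-up treat data (addresses, candies, and quantities invented):

```python
import pandas as pd

# Hypothetical treat-style data: one row per (address, candy) pair.
treat = pd.DataFrame({
    'address':  ['12 Elm', '12 Elm', '9 Oak', '9 Oak'],
    'candy':    ['Twix', 'Laffy Taffy', 'Twix', 'KitKat'],
    'how_many': [30, 20, 45, 10],
})

# Total the candy per address, then sort so the biggest total is first.
most_candy = (treat.groupby('address').sum()
              .sort_values(by='how_many', ascending=False)
              .index[0])
```

Here '9 Oak' has 55 total pieces versus 50 at '12 Elm', so it comes out on top.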


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 67%.


Source: fa24-midterm — Q11

Problem 8

As in the last problem, we’ll continue working with the tot DataFrame that came from merging trick with treat. The "price" column contains the cost in dollars of a single piece of fun-sized candy, as a float.

In this problem, we want to use tot to calculate the average cost per piece of Halloween candy at each house. For example, suppose one house has 30 Twix, which cost $0.20 each, and 20 Laffy Taffy, which cost $0.10 each. Then this house spent $8.00 on 50 pieces of candy, for an average cost of $0.16 per piece.

Which of the following correctly sets ac to a DataFrame indexed by "address" with a column called "avg_cost" that contains the average cost per piece of Halloween candy at each address? Select all that apply.

Way 1:

ac = tot.groupby("address").sum()
ac = ac.assign(avg_cost = ac.get("price") / 
                          ac.get("how_many")).get(["avg_cost"])

Way 2:

ac = tot.assign(x = tot.get("price") / tot.get("how_many"))
ac = ac.groupby("address").sum()
ac = ac.assign(avg_cost = ac.get("x").mean()).get(["avg_cost"])

Way 3:

ac = tot.assign(x = tot.get("price") / tot.get("how_many"))
ac = ac.groupby("address").sum()
ac = ac.assign(avg_cost = ac.get("x") / 
                          ac.get("how_many")).get(["avg_cost"])

Way 4:

ac = tot.assign(x = tot.get("how_many") * tot.get("price"))
ac = ac.groupby("address").sum()
ac = ac.assign(avg_cost = ac.get("x").mean()).get(["avg_cost"])

Way 5:

ac = tot.assign(x = tot.get("how_many") * tot.get("price"))
ac = ac.groupby("address").sum()
ac = ac.assign(avg_cost = ac.get("x") / 
                          ac.get("how_many")).get(["avg_cost"])

Answer: Way 5

We need the average cost per piece at each house.

The correct formula would be: (total spent on candy) / (total pieces of candy)

Let’s go through each Way and assess if it is valid or not.

Way 1: When we sum the "price" column directly, we’re summing the per-piece prices, not the total spent. This gives wrong totals. For example, if a house has 30 pieces at $0.20 and 20 at $0.10, summing prices gives $0.30 instead of $8.00.

Way 2: This first calculates price/quantity for each candy type, then takes the mean of these ratios. This is mathematically incorrect for finding average cost per piece.

  • For Twix: $0.20/30 = $0.00667 per piece
  • For Laffy Taffy: $0.10/20 = $0.005 per piece
  • Takes mean: ($0.00667 + $0.005)/2 = $0.00583
  • This is wrong because it’s taking mean of ratios instead of ratio of totals

Way 3: Similar to Way 2, but even more problematic as it divides by quantity twice.

  • For Twix: $0.20/30 = $0.00667
  • For Laffy Taffy: $0.10/20 = $0.005
  • Sums these: $0.00667 + $0.005 = $0.01167
  • Divides by total quantity again: $0.01167/50 = $0.000233

Way 4: Correctly calculates total spent (x = quantity * price) but then takes the mean of the totals instead of dividing by total quantity.

  • For Twix: 30 × $0.20 = $6.00
  • For Laffy Taffy: 20 × $0.10 = $2.00
  • Takes mean of these totals: ($6.00 + $2.00)/2 = $4.00 (wrong)
  • This is wrong because it takes mean of totals instead of dividing by total quantity

Way 5: This is correct because:

  • First calculates total spent on each candy type (quantity * price per piece)
  • Groups by address and sums both the total spent and total quantities
  • Finally divides total spent by total pieces to get average cost per piece

Using our example:

  • 30 Twix at $0.20 = $6.00
  • 20 Laffy Taffy at $0.10 = $2.00
  • Total spent = $8.00
  • Total pieces = 50
  • Average = $8.00/50 = $0.16 per piece, the correct answer.
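Way 5 can be verified directly on the worked example from the problem statement (a hypothetical house, '12 Elm', is used as the address):

```python
import pandas as pd

# Hypothetical tot-style data matching the worked example:
# one house with 30 Twix at $0.20 and 20 Laffy Taffy at $0.10.
tot = pd.DataFrame({
    'address':  ['12 Elm', '12 Elm'],
    'how_many': [30, 20],
    'price':    [0.20, 0.10],
})

# Way 5: total spent per row, sum per address, then divide totals.
ac = tot.assign(x = tot.get('how_many') * tot.get('price'))
ac = ac.groupby('address').sum()
ac = ac.assign(avg_cost = ac.get('x') /
                          ac.get('how_many')).get(['avg_cost'])
```

The result is $8.00 / 50 = $0.16 per piece, matching the hand calculation.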

Difficulty: ⭐️⭐️⭐️

The average score on this problem was 71%.


Source: fa24-midterm — Q13

Problem 9

Extra Credit

Define the variable double as follows.

double = treat.groupby("candy").count().groupby("address").count()

Now, suppose you know that

double.loc[1].get("how_many") evaluates to 5.

Which of the following is a valid interpretation of this information? Select all that apply.

Answer: Option 2

Let’s approach this solution by breaking down the line of code into two intermediate steps, so that we can parse them one at a time:

  • intermediate_one = treat.groupby("candy").count()
  • double = intermediate_one.groupby("address").count()

Step 1: intermediate_one = treat.groupby("candy").count()

The first of our two operations groups the treat DataFrame by the "candy" column, and aggregates using the .count() method. This creates an output DataFrame that is indexed by "candy", where the values in each column represent the number of times each candy appeared in the treat DataFrame.

Remember, in our original DataFrame, each row represents one type of candy being given out by one house. So, each row in intermediate_one will contain the number of houses giving out each candy. For example, if the values in the columns in the row with row label Milky Way were all 3, it would mean that there are 3 houses giving out Milky Ways.

Step 2: double = intermediate_one.groupby("address").count()

The second of our two operations groups the intermediate_one DataFrame by the "address" column, and aggregates using the .count() method. This creates an output DataFrame that is indexed by "address", where the values in each column represent the number of times that each value in the address column appeared in the intermediate_one DataFrame. However, these are more difficult to interpret, so let’s break down what this means in the context of our problem.

The values in the intermediate_one DataFrame represent how many houses are giving out a specific type of candy (this is the result of our first operation). So, when we group by these values, the resulting groups will be defined by all candies that are given out by the same number of houses. For example, if the values in the columns with row label 5 were all 2, it would mean that there are 2 types of candy that are being given out by 5 houses. More concretely, this would mean that the value 5 showed up 2 times in the intermediate_one DataFrame, which means there must have been 2 candies that were being given out by 5 houses (see above).

Combining these two results, we can interpret the output of our original line of code:

double = treat.groupby("candy").count().groupby("address").count() outputs a DataFrame where the value in each row represents the number of different candies that are being given out by the same number of houses.

Now, we can easily interpret this line of code:

double.loc[1].get("how_many") evaluates to 5.

This means that there are 5 different types of candies that are being given out by only 1 house. This corresponds to Option 2 and only Option 2 in our answer choices, so Option 2 is the correct answer.
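The double groupby is easier to see on a tiny made-up treat (here 'Twix' is given out by two houses, while the other two candies are each given out by one house):

```python
import pandas as pd

# Hypothetical treat-style data: one row per (address, candy) pair.
treat = pd.DataFrame({
    'address':  ['12 Elm', '12 Elm', '9 Oak', '9 Oak'],
    'candy':    ['Twix', 'KitKat', 'Twix', 'Smarties'],
    'how_many': [30, 20, 45, 10],
})

# First groupby: how many houses give out each candy.
per_candy = treat.groupby('candy').count()
# Second groupby: how many candies are given out by that many houses.
double = per_candy.groupby('address').count()
```

On this data, double.loc[1].get('how_many') is 2, meaning two candy types (KitKat and Smarties) are each given out by exactly one house.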


Difficulty: ⭐️⭐️⭐️⭐️⭐️

The average score on this problem was 15%.


Problem 10

Complete the expression below so that it evaluates to a DataFrame indexed by 'category' with one column called 'price' containing the median cost of the products in each category.

ikea.___(a)___.get(___(b)___)


Problem 10.1

What goes in blank (a)?

Answer: groupby('category').median()

The question prompt says that the resulting DataFrame should be indexed by 'category', so this is a hint that we may want to group by 'category'. When using groupby, we need to specify an aggregation function, and since we’re looking for the median cost of all products in a category, we use the .median() aggregation function.


Difficulty: ⭐️

The average score on this problem was 90%.


Problem 10.2

What goes in blank (b)?

Answer: ['price']

The question prompt also says that the resulting DataFrame should have just one column, the 'price' column. To keep only certain columns in a DataFrame, we use .get, passing in a list of columns to keep. Remember to include the square brackets, otherwise .get will return a Series instead of a DataFrame.
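A minimal sketch on made-up ikea data (categories, prices, and a hypothetical 'weight' column invented):

```python
import pandas as pd

# Hypothetical ikea-style data.
ikea = pd.DataFrame({
    'category': ['chair', 'chair', 'table', 'table'],
    'price':    [40.0, 60.0, 120.0, 180.0],
    'weight':   [5.0, 7.0, 20.0, 30.0],
})

# Passing a list to .get keeps a one-column DataFrame, not a Series.
medians = ikea.groupby('category').median().get(['price'])
```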


Difficulty: ⭐️⭐️

The average score on this problem was 76%.



Problem 11

Write one line of code below to create a DataFrame called openers containing the artists that appeared first on stage at a past Sun God festival. The DataFrame openers should have all the same columns as sungod.

Answer: openers = sungod[sungod.get('Appearance_Order') == 1]

Since we want only certain rows of sungod, we need to query. The condition to satisfy is that the 'Appearance_Order' column should have a value of 1 to indicate that this artist performed first in a certain year’s festival.


Difficulty: ⭐️⭐️

The average score on this problem was 84%.


Problem 12

What was the largest number of artists that ever performed in a single Sun God festival? Select all expressions that evaluate to the correct answer.

Answer: sungod.groupby('Year').count().get('Artist').max(), sungod.get('Appearance_Order').max()

Let’s go through all the answer choices.

For the first option, sungod.groupby('Appearance_Order').count() will create a DataFrame with one row for each unique value of 'Appearance_Order', and each column will contain the same value: the number of Sun God festivals that had at least that many performers. For example, the first row of sungod.groupby('Appearance_Order').count() will correspond to an 'Appearance_Order' of 1, and each column will contain a count of the number of Sun God festivals with at least one performer. Since every festival has at least one performer, the largest count in any column, including 'Year', will be in this first row. So sungod.groupby('Appearance_Order').count().get('Year').max() represents the total number of Sun God festivals, which is not the quantity we’re trying to find.

For the second option, sungod.groupby('Year').count() will create a DataFrame with one row per year, with each column containing a count of the number of artists that performed in that year’s festival. If we take the largest such count in any one column, we are finding the largest number of artists that ever performed in a single Sun God festival. Therefore, sungod.groupby('Year').count().get('Artist').max() is correct.

The third option works because we can find the desired quantity by simply looking for the largest value in the 'Appearance_Order' column. For example, if the largest number of artists to ever perform in a Sun God festival was, say, 17, then for that year’s festival, the last artist to appear would have a value of 17 in the 'Appearance_Order' column. There can be no 18 anywhere in the 'Appearance_Order' column, otherwise that would mean there was some festival with 18 performers. Therefore, sungod.get('Appearance_Order').max() is correct.

The fourth option is not even correct Python code. The DataFrame produced by sungod.groupby('Year').max() is indexed by 'Year' and no longer has 'Year' as a column. So we’d get an error if we tried to access this nonexistent column, as in sungod.groupby('Year').max().get('Year').
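The two correct expressions can be checked on made-up sungod data (years and artists invented; here 2019 has three performers and 2020 has two):

```python
import pandas as pd

# Hypothetical sungod-style data: one row per artist per festival.
sungod = pd.DataFrame({
    'Year':             [2019, 2019, 2019, 2020, 2020],
    'Artist':           ['A', 'B', 'C', 'D', 'E'],
    'Appearance_Order': [1, 2, 3, 1, 2],
})

# Largest artist count in any single year, two equivalent ways.
via_groupby = sungod.groupby('Year').count().get('Artist').max()
via_order   = sungod.get('Appearance_Order').max()
```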


Difficulty: ⭐️⭐️

The average score on this problem was 78%.


Problem 13


Problem 13.1

We’d like to find the mean number of Instagram followers of all students in DSC 10. One correct way to do so is given below.

    mean_1 = survey.get("IG Followers").mean()

Another two possible ways to do this are given below.

    # Possible method 1.
    mean_2 = survey.groupby("Section").mean().get("IG Followers").mean()

    # Possible method 2.
    X = survey.groupby("Section").sum().get("IG Followers").sum()
    Y = survey.groupby("Section").count().get("IG Followers").sum()
    mean_3 = X / Y

Is mean_2 equal to mean_1?

Is mean_3 equal to mean_1?

Answer:

Is mean_2 equal to mean_1? Yes, if both sections have the same number of students, otherwise maybe.

mean_2 is the “mean of means” – it finds the mean number of "IG Followers" for each section, then finds the mean of those two numbers. This is not, in general, equal to the overall mean, because it doesn’t consider the fact that Section A may have more students than Section B (or vice versa); if this is the case, then Section A needs to be weighted more heavily in the calculation of the overall mean.

Let’s look at a few examples to illustrate our point.

  • Suppose Section A has 2 students who both have 10 followers, and Section B has 1 student who has 5 followers. The overall mean number of followers is \frac{10 + 10 + 5}{3} = 8.33..., while the mean of the means is \frac{10 + 5}{2} = 7.5. These are not the same number, so mean_2 is not always equal to mean_1. We can rule out “Yes” as an option.

  • Suppose Section A has 2 students, where one has 5 followers and one has 7, and Section B has 2 students, where one has 3 followers and one has 13 followers. Then, the overall mean is \frac{5 + 7 + 3 + 13}{4} = \frac{28}{4} = 7, while the mean of the means is \frac{\frac{5+7}{2} + \frac{3 + 13}{2}}{2} = \frac{6 + 8}{2} = 7. If you experiment (or even write out a full proof), you’ll notice that as long as Sections A and B have the same number of students, the overall mean number of followers across the two sections is equal to the mean of their section-specific mean numbers of followers. We can rule out “No” as an option.

  • Suppose Section A has 2 students who both have 10 followers, and Section B has 3 students who all have 10 followers. Then, the overall mean is 10, and so is the mean of means. So, it’s possible for Sections A and B to have different numbers of students and for mean_2 to be equal to mean_1. It’s not always true, though, which is why the answer is “Yes, if both sections have the same number of students, otherwise maybe” and we can rule out the “otherwise no” case.


Difficulty: ⭐️⭐️⭐️⭐️⭐️

The average score on this problem was 15%.



Is mean_3 equal to mean_1? Yes.

Let’s break down the calculations:

  • For X: survey.groupby("Section").sum().get("IG Followers") calculates the total number of followers for each section separately. The subsequent .sum() then adds these totals together, providing the total number of followers in the entire class.
  • For Y: survey.groupby("Section").count().get("IG Followers") counts the number of students in each section. The subsequent .sum() then aggregates these counts, giving the total number of students in the entire class.

Then, mean_3 = X / Y divides the total number of Instagram followers by the total number of students to get the overall average number of followers per student for the entire dataset. This is precisely how mean_1 is calculated. Hence, mean_3 and mean_1 will always be equal, so the answer is “Yes.”


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 58%.



Source: sp23-midterm — Q1

Problem 14


Problem 14.1

What is the type of the following expression?

(survey
 .sort_values(by="Class Standing")
)

Answer: DataFrame

The method .sort_values(by="Class Standing") sorts the survey DataFrame by the "Class Standing" column and returns a new DataFrame (without modifying the original one). The resulting DataFrame will be sorted by class standing in ascending order unless specified otherwise by providing the ascending=False argument.


Difficulty: ⭐️

The average score on this problem was 94%.


Problem 14.2

What is the type of the following expression?

(survey
 .sort_values(by="Class Standing")
 .groupby("College").count()
)

Answer: DataFrame

The method .sort_values(by="Class Standing") sorts the survey DataFrame based on the entries in the "Class Standing" column and produces a new DataFrame. Following this, .groupby("College") organizes the sorted DataFrame by grouping entries using the "College" column. After that, the .count() aggregation method computes the number of rows for each "College". The result is a DataFrame whose index contains the unique values in "College" and whose columns all contain the same values – the number of rows in survey.sort_values(by="Class Standing") for each "College".

Note that if we didn’t sort before grouping – that is, if our expression was just survey.groupby("College").count() – the resulting DataFrame would be the same! That’s because the order of the rows in survey does not impact the number of rows in survey that belong to each "College".
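We can verify this on a small made-up stand-in for survey (the real data isn't shown here, so all values below are hypothetical):

```python
import pandas as pd

# Made-up stand-in for survey.
survey = pd.DataFrame({
    "College": ["Sixth", "Muir", "Sixth", "Warren"],
    "Class Standing": ["Junior", "Senior", "Freshman", "Sophomore"],
    "IG Followers": [120, 80, 45, 300],
})

with_sort = survey.sort_values(by="Class Standing").groupby("College").count()
without_sort = survey.groupby("College").count()

# Row order doesn't affect group sizes, so the results are identical.
print(with_sort.equals(without_sort))  # True
```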


Difficulty: ⭐️

The average score on this problem was 93%.



Problem 14.3

What is the type of the following expression?

(survey
 .sort_values(by="Class Standing")
 .groupby("College").count()
 .get("IG Followers")
)

Answer: Series

The above expression, before .get("IG Followers"), is the same as in the previous subpart. We know that

survey.sort_values(by="Class Standing").groupby("College").count()

is a DataFrame indexed by "College" with multiple columns, each of which contains the number of rows in survey.sort_values(by="Class Standing") for each "College". Then, using .get("IG Followers") extracts just the "IG Followers" column from the aforementioned counts DataFrame, and we know that columns are stored as Series. We know that this column exists, because survey.sort_values(by="Class Standing").groupby("College").count() will have the same columns as survey, minus "College", which was moved to the index.


Difficulty: ⭐️

The average score on this problem was 92%.



Problem 14.4

What is the type of the following expression?

(survey
 .sort_values(by="Class Standing")
 .groupby("College").count()
 .get("IG Followers")
 .iloc[0]
)

Answer: int or float

In the previous subpart, we saw that

survey.sort_values(by="Class Standing").groupby("College").count().get("IG Followers")

is a Series whose index contains the unique names of "College"s and whose values are the number of rows in survey.sort_values(by="Class Standing") for each college. The number of rows in survey.sort_values(by="Class Standing") for any particular college – say, "Sixth" – is a number, and so the values in this Series are all numbers. As such, the answer is int or float.


Difficulty: ⭐️

The average score on this problem was 91%.



Problem 14.5

What value does the following expression evaluate to? If you believe the expression errors, provide a one sentence explanation of why.

For example, if you think the expression evaluates to an int, don’t write “int” or “the largest value in the "IG Followers" column”, write the specific int, like “23”.

Hint: You have enough information to answer the problem; don’t forget to look at the data description page.

(survey
.sort_values(by="Class Standing")
.groupby("College").count()
.get("IG Followers")
.index[0]
)

Answer: "ERC"

From the previous subpart, we saw that

survey
.sort_values(by="Class Standing")
.groupby("College").count()
.get("IG Followers")

is a Series, indexed by "College", whose values contain the number of rows in survey.sort_values(by="Class Standing") for each "College". When we group by "College", the resulting DataFrame (and hence, our Series) is sorted in ascending order by "College". This means that the index of our Series is sorted alphabetically by "College" names. Of the "College" names mentioned in the data description ("ERC", "Marshall", "Muir", "Revelle", "Seventh", "Sixth", and "Warren"), the first name alphabetically is "ERC", so using .index[0] on the above index gives us "ERC".
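A quick sketch with made-up data shows the index sorting in action:

```python
import pandas as pd

# Made-up survey rows; colleges deliberately out of alphabetical order.
survey = pd.DataFrame({
    "College": ["Warren", "ERC", "Muir", "ERC"],
    "Class Standing": ["Senior", "Junior", "Senior", "Freshman"],
    "IG Followers": [10, 250, 95, 40],
})

counts = (survey
          .sort_values(by="Class Standing")
          .groupby("College").count()
          .get("IG Followers"))

# groupby sorts the index, so the first label is the alphabetically first college.
print(counts.index[0])  # ERC
```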


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 32%.


Problem 14.6

Consider again the expression from 1.5. Suppose we remove the piece

.sort_values(by="Class Standing")

but keep all other parts the same. Does this change the value the expression evaluates to?

Answer: No, this does not change the value the expression evaluates to.

We addressed this in the second subpart of the problem: “Note that if we didn’t sort before grouping – that is, if our expression was just survey.groupby("College").count() – the resulting DataFrame would be the same! That’s because the order of the rows in survey does not impact the number of rows in survey that belong to each "College".”


Difficulty: ⭐️⭐️

The average score on this problem was 80%.


Problem 14.7

What value does the following expression evaluate to? If you believe the expression errors, provide a one sentence explanation of why.

For example, if you think the expression evaluates to an int, don’t write “int” or “the largest value in the "IG Followers" column”, write the specific int, like “23”.

Hint: You have enough information to answer the problem; don’t forget to look at the data description page.

(survey
 .sort_values(by="Class Standing")
 .groupby("College").count()
 .get("IG Followers")
 .loc[0]
)

Answer: The code produces an error because, after the grouping operation, the resulting Series uses the unique college names as indices, and there isn’t a college named 0.

In the previous few subparts, we’ve established that

survey
.sort_values(by="Class Standing")
.groupby("College").count()
.get("IG Followers")

is a Series, indexed by "College", whose values contain the number of rows in survey.sort_values(by="Class Standing") for each "College".

The .loc accessor is used to extract a value from a Series given its label, or index value. However, since the index values in the aforementioned Series are "College" names, there is no value whose index is 0, so this throws an error.
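A made-up example makes the error concrete:

```python
import pandas as pd

# Made-up survey rows.
survey = pd.DataFrame({
    "College": ["Muir", "ERC", "ERC"],
    "Class Standing": ["Senior", "Junior", "Freshman"],
    "IG Followers": [95, 40, 310],
})

counts = survey.groupby("College").count().get("IG Followers")

# .loc looks up by label, and the labels here are college names,
# so there is no entry labeled 0.
try:
    counts.loc[0]
except KeyError:
    print("KeyError: no label 0")

# Label- and position-based access both work when used correctly:
print(counts.loc["ERC"])  # 2
print(counts.iloc[0])     # 2 (index is alphabetical, so ERC is first)
```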


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 54%.



Source: sp23-midterm — Q4

Problem 15


Problem 15.1

Consider the following block of code.

    A = survey.shape[0]
    B = survey.groupby(["Unread Emails", "IG Followers"]).count().shape[0]

Suppose the expression A == B evaluates to True. Given this fact, what can we conclude?

Answer: There are no two students in the class with both the same number of unread emails and the same number of Instagram followers.

The DataFrame survey.groupby(["Unread Emails", "IG Followers"]).count() will have one row for every unique combination of "Unread Emails" and "IG Followers". If two students had the same number of "Unread Emails" and "IG Followers", they would be grouped together, resulting in fewer groups than the total number of students. But since A == B, it indicates that there are as many unique combinations of these two columns as there are rows in the survey DataFrame. Thus, no two students share the same combination of "Unread Emails" and "IG Followers".

Note that if student X has 2 "Unread Emails" and 5 "IG Followers", student Y has 2 "Unread Emails" and 3 "IG Followers", and student Z has 3 "Unread Emails" and 5 "IG Followers", they all have different combinations of "Unread Emails" and "IG Followers", meaning that they’d all be represented by different rows in survey.groupby(["Unread Emails", "IG Followers"]).count(). This is despite the fact that some of their numbers of "Unread Emails" and "IG Followers" are not unique.
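The three students X, Y, and Z from the note above can be encoded in a toy DataFrame to check this reasoning:

```python
import pandas as pd

# Students X, Y, and Z from the example above (made-up survey subset).
survey = pd.DataFrame({
    "Section": ["A", "A", "B"],
    "Unread Emails": [2, 2, 3],
    "IG Followers": [5, 3, 5],
})

A = survey.shape[0]
B = survey.groupby(["Unread Emails", "IG Followers"]).count().shape[0]

# Individual values repeat, but every (emails, followers) pair is distinct,
# so the number of groups equals the number of students.
print(A == B)  # True
```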


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 72%.



Problem 15.2

We’d like to find the mean number of Instagram followers of all students in DSC 10. One correct way to do so is given below.

    mean_1 = survey.get("IG Followers").mean()

Another two possible ways to do this are given below.

    # Possible method 1.
    mean_2 = survey.groupby("Section").mean().get("IG Followers").mean()

    # Possible method 2.
    X = survey.groupby("Section").sum().get("IG Followers").sum()
    Y = survey.groupby("Section").count().get("IG Followers").sum()
    mean_3 = X / Y

Is mean_2 equal to mean_1?

Is mean_3 equal to mean_1?

Answer:

Is mean_2 equal to mean_1? Yes, if both sections have the same number of students, otherwise maybe.

mean_2 is the “mean of means” – it finds the mean number of "IG Followers" for each section, then finds the mean of those two numbers. This is not in general equal to the overall mean, because it doesn’t consider the fact that Section A may have more students than Section B (or vice versa); if this is the case, then Section A needs to be weighted more heavily in the calculation of the overall mean.

Let’s look at a few examples to illustrate our point.

  • Suppose Section A has 2 students who both have 10 followers, and Section B has 1 student who has 5 followers. The overall mean number of followers is \frac{10 + 10 + 5}{3} \approx 8.33, while the mean of the means is \frac{10 + 5}{2} = 7.5. These are not the same number, so mean_2 is not always equal to mean_1. We can rule out “Yes” as an option.

  • Suppose Section A has 2 students where one has 5 followers and one has 7, and Section B has 2 students where one has 3 followers and one has 13. Then, the overall mean is \frac{5 + 7 + 3 + 13}{4} = \frac{28}{4} = 7, while the mean of the means is \frac{\frac{5+7}{2} + \frac{3 + 13}{2}}{2} = \frac{6 + 8}{2} = 7. If you experiment (or even write out a full proof), you’ll find that as long as Sections A and B have the same number of students, the overall mean number of followers across the two sections is equal to the mean of their section-specific mean numbers of followers. We can rule out “No” as an option.

  • Suppose Section A has 2 students who both have 10 followers, and Section B has 3 students who both have 10 followers. Then, the overall mean is 10, and so is the mean of means. So, it’s possible for there to be a different number of students in Sections A and B and for mean_2 to be equal to mean_1. It’s not always true, though, which is why the answer is “Yes, if both sections have the same number of students, otherwise maybe” and we can rule out the “otherwise no” case.
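The first of these examples can be checked directly in code (a sketch using the made-up numbers from the example, since the real survey isn't available):

```python
import pandas as pd

# Section A has two students with 10 followers each; Section B has one with 5.
survey = pd.DataFrame({
    "Section": ["A", "A", "B"],
    "IG Followers": [10, 10, 5],
})

mean_1 = survey.get("IG Followers").mean()
mean_2 = survey.groupby("Section").mean().get("IG Followers").mean()

print(mean_1)  # about 8.33
print(mean_2)  # 7.5
```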


Difficulty: ⭐️⭐️⭐️⭐️⭐️

The average score on this problem was 15%.



Is mean_3 equal to mean_1? Yes.

Let’s break down the calculations:

  • For X: survey.groupby("Section").sum().get("IG Followers") calculates the total number of followers for each section separately. The subsequent .sum() then adds these totals together, providing the total number of followers in the entire class.
  • For Y: survey.groupby("Section").count().get("IG Followers") counts the number of students in each section. The subsequent .sum() then aggregates these counts, giving the total number of students in the entire class.

Then, mean_3 = X / Y divides the total number of Instagram followers by the total number of students to get the overall average number of followers per student for the entire dataset. This is precisely how mean_1 is calculated. Hence, mean_3 and mean_1 will always be equal, so the answer is “Yes.”
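Checking this on the same made-up data where mean_2 gave the wrong answer, mean_3 matches mean_1 even though the sections have different sizes:

```python
import pandas as pd

# Sections of different sizes (made-up data).
survey = pd.DataFrame({
    "Section": ["A", "A", "B"],
    "IG Followers": [10, 10, 5],
})

mean_1 = survey.get("IG Followers").mean()

X = survey.groupby("Section").sum().get("IG Followers").sum()   # 25 followers total
Y = survey.groupby("Section").count().get("IG Followers").sum() # 3 students total
mean_3 = X / Y

print(mean_3 == mean_1)  # True
```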


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 58%.



Source: sp24-final — Q15

Problem 16

Imagine a DataFrame constructed from apts called bedrooms, which has one row for each bedroom in an apartment in apts. More specifically, a one bedroom apartment in apts will appear as one row in bedrooms, a two bedroom apartment in apts will appear as two rows in bedrooms, and a three bedroom apartment in apts will appear as three rows in bedrooms. Studio apartments will not appear in bedrooms at all.

The "Apartment ID" column of bedrooms contains the "Apartment ID" of the apartment in apts. Notice that this is not the index of bedrooms since these values are no longer unique. The "Cost" column of bedrooms contains the rent of the apartment divided by the number of bedrooms. All rows of bedrooms with a shared "Apartment ID" should therefore have the same value in the "Cost" column.


Problem 16.1

Recall that apts has 800 rows. How many rows does bedrooms have?

Answer: Not enough information.

It is entirely possible that bedrooms has more or fewer than 800 rows; we don’t have enough information to tell.

If most of the 800 rows in apts are studio apartments, most rows in apts will not have corresponding rows in bedrooms (studio apartments are not reflected in bedrooms). This would lower the total number of rows in bedrooms to less than 800.

If most of the 800 rows in apts are three-bedroom apartments, most rows in apts will each have three corresponding rows in bedrooms. This would increase the total number of rows in bedrooms to more than 800.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 68%.


Problem 16.2

Suppose no_studio is defined as follows. (Remember, we previously converted the "Bed" column to integers.)

no_studio = apts[apts.get("Bed") != 0]

Which of the following statements evaluate to the same value as the expression below?

bedrooms.get("Cost").mean()

Select all that apply.

Answer: Options 2, 4, and 5.

Let’s refer to bedrooms.get("Cost").mean() as “the bedroom code” for this solution.

Option 1 is incorrect. Option 1 takes the mean of all non-studio apartment rents in apts. This value is significantly larger than what is produced by the bedroom code (the average value of the "Cost" column in bedrooms), since all "Cost" values in bedrooms are less than or equal to their corresponding "Rent" values in apts. So, these two expressions cannot be equal.

Option 2 is correct. We can view the bedroom code as the same as summing all of the values in the "Cost" column of bedrooms and dividing by the total number of rows of bedrooms. This is a fraction; we can make some clever substitutions in this fraction to show it is the same as the code for Option 2:


\dfrac{\text{sum of ``Cost" in bedrooms}}{\# \text{ of rows in bedrooms}} \to \dfrac{\text{sum of ``Rent" in no}\_\text{studio}}{\# \text{ of rows in bedrooms}} \to \dfrac{\text{sum of ``Rent" in no}\_\text{studio}}{\text{sum of ``Bed" in apts}}


Option 3 is incorrect. The first part of Option 3, no_studio.get("Rent") / no_studio.get("Bed"), produces a Series with one entry per non-studio apartment: that apartment’s per-bedroom cost. These are the same values that appear in the "Cost" column of bedrooms, but without the duplicated rows for multi-bedroom apartments. Taking the .mean() of this Series weights every apartment equally, whereas the bedroom code weights every bedroom equally, so these two expressions cannot be equal in general.

Option 4 is correct. We can show the bedroom code is equivalent to the code in Option 4 as follows:


\dfrac{\text{sum of ``Cost" in bedrooms}}{\# \text{ of rows in bedrooms}} \to \dots \to \dfrac{\text{sum of ``Rent" in no}\_\text{studio}}{\text{sum of ``Bed" in apts}} \to

\dfrac{\text{sum of ``Rent" in no}\_\text{studio}}{\text{sum of ``Bed" in no}\_\text{studio}} \to \text{sum} \left( \dfrac{\text{each entry in ``Rent" in no}\_\text{studio}}{\text{sum of ``Bed" in no}\_\text{studio}} \right)


Option 5 is correct. We can show the bedroom code is equivalent to the code in Option 5 as follows:


\dfrac{\text{sum of ``Cost" in bedrooms}}{\# \text{ of rows in bedrooms}} \to \dots \to \dfrac{\text{sum of ``Rent" in no}\_\text{studio}}{\text{sum of ``Bed" in no}\_\text{studio}} \to

\dfrac{\left(\dfrac{\text{sum of ``Rent" in no}\_\text{studio}}{\# \text{ of rows in no}\_\text{studio}}\right)}{\left(\dfrac{\text{sum of ``Bed" in no}\_\text{studio}}{\# \text{ of rows in no}\_\text{studio}}\right)} \to \dfrac{\text{mean of ``Rent" in no}\_\text{studio}}{\text{mean of ``Bed" in no}\_\text{studio}}
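These equivalences can be sanity-checked numerically on a tiny made-up version of apts (four apartments, including one studio), building bedrooms by hand:

```python
import pandas as pd

# A tiny stand-in for apts (made-up rents and bedroom counts).
apts = pd.DataFrame({
    "Apartment ID": [1, 2, 3, 4],
    "Bed": [0, 1, 2, 3],
    "Rent": [1000, 1500, 2400, 3600],
})

# Build bedrooms by hand: one row per bedroom, cost = rent / bedrooms.
# (The studio, with 0 bedrooms, contributes no rows.)
rows = []
for _, apt in apts.iterrows():
    for _ in range(int(apt["Bed"])):
        rows.append({"Apartment ID": apt["Apartment ID"],
                     "Cost": apt["Rent"] / apt["Bed"]})
bedrooms = pd.DataFrame(rows)

no_studio = apts[apts.get("Bed") != 0]

target = bedrooms.get("Cost").mean()  # the bedroom code
option_2 = no_studio.get("Rent").sum() / apts.get("Bed").sum()
option_4 = (no_studio.get("Rent") / no_studio.get("Bed").sum()).sum()
option_5 = no_studio.get("Rent").mean() / no_studio.get("Bed").mean()

print(target, option_2, option_4, option_5)  # all four are 1250.0
```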



Difficulty: ⭐️⭐️⭐️

The average score on this problem was 53%.



Source: sp24-midterm — Q3

Problem 17

Write a Python expression that evaluates to a DataFrame of only your contacts whose phone numbers end in "6789".

Note: Do not use slicing, if you know what that is. You must use methods from this course.

Answer: contacts[contacts.get("Phone").str.contains("6789")] or contacts[contacts.get("Phone").str.contains("-6789")]

  • contacts.get("Phone"): retrieves the "Phone" column from the contacts DataFrame as a Series.
  • .str.contains("6789"): applies a string method that checks for the presence of the substring "6789" in each phone number. Given the way phone numbers are formatted in this dataset, a consecutive run of four digits like "6789" can only appear at the end of the number. It returns a Boolean Series with True for phone numbers containing this substring.
  • contacts[...]: keeps only the rows where the condition is True.
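A sketch on hypothetical contacts (names and numbers made up) shows why the format matters:

```python
import pandas as pd

# Hypothetical contacts DataFrame.
contacts = pd.DataFrame({
    "Name": ["Ava", "Ben", "Cruz"],
    "Phone": ["858-123-6789", "619-678-9123", "858-555-0000"],
})

# "6789" appears as a consecutive run only in Ava's number; in Ben's,
# the digits 6, 7, 8, 9 are split by a dash ("678-9"), so the format
# guarantees a match can only occur in the last four digits.
matches = contacts[contacts.get("Phone").str.contains("6789")]
print(matches.get("Name"))
```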

Difficulty: ⭐️⭐️⭐️

The average score on this problem was 63%.


Source: sp25-midterm — Q1

Problem 18


Problem 18.1

Which country has the highest "Reciprocal Tariff"? Write one line of code that evaluates to the name of this country.

Answer: tariffs.sort_values(by="Reciprocal Tariff", ascending=False).get("Country").iloc[0] or tariffs.set_index("Country").sort_values(by="Reciprocal Tariff", ascending=False).index[0]

To find the country with the highest “Reciprocal Tariff”, we need to first sort tariffs by the "Reciprocal Tariff" column in descending order: tariffs.sort_values(by="Reciprocal Tariff", ascending=False). After sorting in descending order, we know that the country with the highest reciprocal tariff will be in the first row. Then, to get the name of this country, we select the "Country" column using .get("Country"). This gives us a Series of countries in order of descending reciprocal tariffs, and we can access the first value in this Series by using .iloc[0]. Note that if we sort in ascending order instead, we can use .iloc[-1] to access the last value in the Series.

Alternatively, since each country appears only once in tariffs, we can start by setting the "Country" column as the index: tariffs.set_index("Country"). Then, we sort the resulting DataFrame by the "Reciprocal Tariff" column in descending order, as explained above. Since the country names are now in the index, the first row of this sorted DataFrame will have the country with the highest reciprocal tariff as its index value, which we can access by using .index[0]. Note that if we sort in ascending order instead, we can use .index[-1] to access the last value in the index.
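Both approaches can be tried on a made-up tariffs table (countries and rates invented):

```python
import pandas as pd

# Made-up stand-in for tariffs.
tariffs = pd.DataFrame({
    "Country": ["Freedonia", "Sylvania", "Grand Fenwick"],
    "Reciprocal Tariff": [25, 90, 40],
})

by_sort = (tariffs.sort_values(by="Reciprocal Tariff", ascending=False)
           .get("Country").iloc[0])
by_index = (tariffs.set_index("Country")
            .sort_values(by="Reciprocal Tariff", ascending=False).index[0])

print(by_sort, by_index)  # Sylvania Sylvania
```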


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 71%.


Problem 18.2

How many countries have a "Reciprocal Tariff" above 30%? Write one line of code that evaluates to the number of such countries.

Answer: tariffs[tariffs.get("Reciprocal Tariff") > 30].shape[0]
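The query keeps only the rows where the condition holds, and .shape[0] counts them. For example, on a made-up tariffs table:

```python
import pandas as pd

# Made-up stand-in for tariffs; note that exactly 30 is excluded by >.
tariffs = pd.DataFrame({
    "Country": ["Freedonia", "Sylvania", "Grand Fenwick", "Ruritania"],
    "Reciprocal Tariff": [25, 90, 40, 30],
})

num_above_30 = tariffs[tariffs.get("Reciprocal Tariff") > 30].shape[0]
print(num_above_30)  # 2
```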


Difficulty: ⭐️⭐️

The average score on this problem was 89%.



Problem 19

The following code computes an array containing the unique kinds of dogs that are heavier than 20 kg or taller than 40 cm on average.

foo = df.__(a)__.__(b)__
np.array(foo[__(c)__].__d__)


Problem 19.1

Fill in blank (a).

Answer: groupby('kind')

We start this problem by grouping the DataFrame by 'kind', since we’re only interested in whether each unique 'kind' of dog fits some sort of constraint. We don’t perform any querying yet: we first need to group the DataFrame by 'kind' before we can apply any boolean conditions to the per-kind averages.


Difficulty: ⭐️

The average score on this problem was 97%.


Problem 19.2

Fill in blank (b).

Answer: .mean()

After we do .groupby('kind'), we need to apply .mean() since the problem asks if each unique 'kind' of dog satisfies certain constraints on average. .mean() calculates the average of each column within each group, which is what we want.


Difficulty: ⭐️

The average score on this problem was 94%.


Problem 19.3

Fill in blank (c).

Answer: (foo.get('weight') > 20) | (foo.get('height') > 40)

Once we have grouped the dogs by 'kind' and have calculated the average stats of each kind of dog, we can do some querying with two conditionals: foo.get('weight') > 20 gets the kinds of dogs that are heavier than 20 kg on average, and foo.get('height') > 40 gets the kinds of dogs that are taller than 40 cm on average. We combine these two conditions with the | operator since we want the kinds of dogs that satisfy either condition. Note that each comparison needs its own parentheses: | has higher precedence than >, so writing the condition without them raises an error.
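Putting all of the blanks together on a made-up df of dogs:

```python
import numpy as np
import pandas as pd

# Made-up stand-in for df: several dogs per kind.
df = pd.DataFrame({
    "kind": ["herding", "herding", "toy", "toy", "working"],
    "weight": [25, 31, 3, 4, 50],
    "height": [55, 60, 20, 25, 70],
})

foo = df.groupby("kind").mean()
# Parenthesize each comparison before combining with |.
result = np.array(foo[(foo.get("weight") > 20) | (foo.get("height") > 40)].index)
print(result)  # ['herding' 'working']
```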


Difficulty: ⭐️

The average score on this problem was 93%.


Problem 19.4

Which of the following should fill in blank (d)?

Answer: .index

Note that earlier, we did groupby('kind'), which automatically sets each unique 'kind' as the index. Since this is exactly what we want, simply using .index will give us all the kinds of dogs that satisfy the given conditions.


Difficulty: ⭐️

The average score on this problem was 94%.


Problem 19.5

Let’s say that right after blank (b), we added reset_index(). Now, which of the following should fill in blank (d)?

Answer: .get('kind')

Now that we have reset the index of the DataFrame, 'kind' is once again its own column, so we can simply use .get('kind').


Difficulty: ⭐️

The average score on this problem was 100%.




Problem 21

The following code computes the breed of the cheapest toy dog.

df[__(a)__].__(b)__.__(c)__


Problem 21.1

Fill in part (a).

Answer: df.get('kind') == 'toy'

To find the cheapest toy dog, we can start by narrowing down our dataframe to only include dogs that are of kind toy. We do this by constructing the following boolean condition: df.get('kind') == 'toy', which will check whether a dog is of kind toy (i.e. whether or not a given row’s 'kind' value is equal to 'toy'). As a result, df[df.get('kind') == 'toy'] will retrieve all rows for which the 'kind' column is equal to 'toy'.


Difficulty: ⭐️

The average score on this problem was 91%.


Problem 21.2

Fill in part (b).

Answer: .sort_values('price')

Next, we can sort the resulting DataFrame by price, which will make the minimum price (i.e. the cheapest toy dog) easily accessible to us later on. To sort the DataFrame, simply use .sort_values() with the parameter 'price', as follows: .sort_values('price').


Difficulty: ⭐️⭐️

The average score on this problem was 86%.


Problem 21.3

Which of the following can fill in blank (c)? Select all that apply.

Answer: index[0]

  • loc[0]: loc retrieves an element by its row label, which in this case is a 'breed' name, not an integer position, so there is no row labeled 0. Furthermore, loc on a DataFrame returns the entire row, which is not what we are looking for. (Note that we are trying to find the singular 'breed' of the cheapest toy dog.)
  • iloc[0]: While iloc does retrieve elements by integer position, iloc on a DataFrame returns the entire row, which is not what we are looking for.
  • index[0]: Since 'breed' is the index of our DataFrame, and since we have already filtered and sorted the DataFrame, taking the first index value with index[0] returns the 'breed' of the cheapest toy dog.
  • min(): min() is a method for finding the smallest value of a Series, not a DataFrame, and even then it would return the smallest price rather than the corresponding 'breed'.
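The full pipeline, on a made-up df indexed by 'breed':

```python
import pandas as pd

# Made-up stand-in for df, with 'breed' as the index.
df = pd.DataFrame({
    "kind": ["toy", "toy", "working"],
    "price": [1200, 800, 1500],
}, index=["Pug", "Chihuahua", "Husky"])
df.index.name = "breed"

# Filter to toy dogs, sort ascending by price, take the first index value.
cheapest_toy = df[df.get("kind") == "toy"].sort_values("price").index[0]
print(cheapest_toy)  # Chihuahua
```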

Difficulty: ⭐️⭐️

The average score on this problem was 81%.



Source: su24-final — Q1

Problem 22


Problem 22.1

Fill in the blanks so that the expression below evaluates to the proportion of stages won by the country with the most stage wins.

    stages.groupby(__(i)__).__(ii)__.get("Type").__(iii)__ / stages.shape[0]

Answer:

  • (i): "Winner Country"
    To calculate the number of stages won by each country, we need to group the data by the Winner Country. This will allow us to compute the counts for each group.

  • (ii): count()
    Once the data is grouped, we use the .count() method to calculate the number of stages won by each country.

  • (iii): max()
    Finds the maximum number of stages won by a single country. Finally, we divide the maximum stage wins by the total number of stages (stages.shape[0]) to calculate the proportion of stages won by the top country.
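On a made-up stages table, the completed expression looks like this:

```python
import pandas as pd

# Made-up stand-in for stages: France wins 2 of the 4 stages.
stages = pd.DataFrame({
    "Winner Country": ["France", "France", "Belgium", "Italy"],
    "Type": ["Flat", "Mountain", "Flat", "Flat"],
})

prop = (stages.groupby("Winner Country").count()
        .get("Type").max() / stages.shape[0])
print(prop)  # 0.5
```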


Difficulty: ⭐️

The average score on this problem was 90%.


Problem 22.2

The distance of a stage alone does not encapsulate its difficulty, as riders feel more tired as the tour goes on. Because of this, we want to consider “real distance” a measurement of the length of a stage that takes into account how far into the tour the riders are. The “real distance” is calculated with the following process:

  1. Add one to the stage number.

  2. Take the square root of the result of (i).

  3. Multiply the result of (ii) by the raw distance of the stage.

Complete the implementation of the function real_distance, which takes in stages (a DataFrame), stage (a string, the name of the column containing stage numbers), and distance (a string, the name of the column containing stage distances). real_distance returns a Series containing all of the “real distances” of the stages, as calculated above.

    def real_distance(stages, stage, distance):
         ________

Answer: return stages.get(distance) * np.sqrt(stages.get(stage) + 1)

  • (i): First, we need to add one to the stage number. The stage parameter specifies the name of the column containing the stage numbers. stages.get(stage) retrieves this column as a Series, and we can directly add 1 to each element in the Series with stages.get(stage) + 1.

  • (ii): Then, to take the square root of the result of (i), we can use np.sqrt(stages.get(stage) + 1)

  • (iii): Finally, we want to multiply the result of (ii) by the raw distance of the stage. The distance parameter specifies the name of the column containing the raw distances of each stage. stages.get(distance) retrieves this column as a pandas Series, and we can directly multiply it by np.sqrt(stages.get(stage) + 1).
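The completed function, checked on made-up stage numbers and distances chosen so the arithmetic is easy to follow:

```python
import numpy as np
import pandas as pd

def real_distance(stages, stage, distance):
    # (i) add one, (ii) take the square root, (iii) multiply by raw distance
    return stages.get(distance) * np.sqrt(stages.get(stage) + 1)

# Made-up stages: sqrt(0 + 1) = 1 and sqrt(3 + 1) = 2 keep things simple.
stages = pd.DataFrame({"Stage": [0, 3], "Distance": [100.0, 50.0]})
result = real_distance(stages, "Stage", "Distance")
print(result)  # 100.0 and 100.0
```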


Difficulty: ⭐️⭐️

The average score on this problem was 89%.


Problem 22.3

Sometimes, stages are repeated in different editions of the Tour de France, meaning that there are some pairs of "Origin" and "Destination" that appear more than once in stages. Fill in the blanks so that the expression below evaluates how often the most common "Origin" and "Destination" pair in the stages DataFrame appears.

stages.groupby(__(i)__).__(ii)__.sort_values(by = "Date").get("Type").iloc[__(iii)__]

Answer:

  • (i): ["Origin", "Destination"]
    To analyze the frequency of stages with the same origin and destination, we need to group the data by the columns ["Origin", "Destination"]. This groups the stages into unique pairs of origin and destination.

  • (ii): count()
    After grouping, we apply the .count() method to calculate how many times each pair of ["Origin", "Destination"] appears in the dataset. The result is the frequency of each pair.

  • (iii): -1
    After obtaining the frequencies, we sort the resulting groups by their counts in ascending order (this is the default behavior of .sort_values()). The most common pair will then be the last entry in the sorted result. Using .get("Type") extracts the series of counts, and .iloc[-1] retrieves the count of the most common pair, which is at the last position of the sorted series.


Difficulty: ⭐️⭐️

The average score on this problem was 84%.


Problem 22.4

Fill in the blanks so that the value of mystery_three is the "Destination" of the longest stage before Stage 12.

    mystery = stages[stages.get(__(i)__) < 12]
    mystery_two = mystery.sort_values(by = "Distance", ascending = __(ii)__)
    mystery_three = mystery_two.get(__(iii)__).iloc[-1]

Answer:

  • (i): "Stage"
    To filter the DataFrame to include only rows corresponding to stages before Stage 12, we use the "Stage" column. The condition stages.get("Stage") < 12 creates a boolean mask that selects only the rows where the stage number is less than 12.

  • (ii): True
    To find the longest stage, the rows need to be sorted by the "Distance" column. Setting ascending=True ensures that shorter stages come first and the longest stage appears last in the sorted DataFrame.

  • (iii): "Destination"
    After sorting, we want to retrieve the "Destination" of the longest stage. Using .get("Destination") retrieves the "Destination" column, and .iloc[-1] accesses the last row in the sorted DataFrame, corresponding to the longest stage before Stage 12.


Difficulty: ⭐️

The average score on this problem was 92%.



Source: su24-midterm — Q2

Problem 23

Fill in the blanks so that the expression below evaluates to the region in the "Region" column that comes first alphabetically.

kart.groupby(___(a)___).count().index.___(b)___

What goes in (a)?

What goes in (b)?

Answer:

  • (a) "Region"
  • (b) min()

In order to curate a unique list of all the regions in the DataFrame, we want to groupby "Region". Using count() on this will count the number of observations for each region, which does not really matter here since we just want the unique regions. .index retrieves the list of all the unique regions, and min() returns the one that comes first alphabetically.
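On a made-up kart table:

```python
import pandas as pd

# Made-up stand-in for kart.
kart = pd.DataFrame({
    "Region": ["West", "East", "South", "East"],
    "Total Points": [10, 40, 25, 30],
})

# groupby makes the unique regions the index; .min() on the index
# returns the alphabetically first one.
first_region = kart.groupby("Region").count().index.min()
print(first_region)  # East
```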


Difficulty: ⭐️

The average score on this problem was 100%.


Difficulty: ⭐️⭐️

The average score on this problem was 81%.


Source: su24-midterm — Q3

Problem 24

Fill in the blanks so that the expression below evaluates to the "University" of the lowest scoring Division 2 team over the whole season.

kart[__(a)__].__(b)__(__(c)__).get("University").iloc[0]

Answer:

  • (a) kart.get("Division") == "Division 2"
  • (b) sort_values
  • (c) "Total Points" or by="Total Points"

First we want to filter the DataFrame for only the Division 2 teams using kart.get("Division") == "Division 2".

Since we are interested in the lowest scoring team in this division, we sort by the "Total Points" column using sort_values(by="Total Points"). Since we retrieve .iloc[0], i.e. the top row of the DataFrame, we do not have to specify the sort order: the default behavior of sort_values is ascending, so the lowest scorer comes first.
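Here is the full expression on a tiny, made-up kart DataFrame (the universities and point totals are hypothetical):

```python
import pandas as pd

# Hypothetical kart DataFrame; universities and point totals are made up.
kart = pd.DataFrame({
    "University": ["UCSD", "UCLA", "USC"],
    "Division": ["Division 2", "Division 2", "Division 1"],
    "Total Points": [55, 40, 90],
})

# Filter to Division 2, sort ascending (the default), take the top row.
lowest = (kart[kart.get("Division") == "Division 2"]
          .sort_values(by="Total Points")
          .get("University")
          .iloc[0])
print(lowest)  # 'UCLA'
```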


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 74%.


Difficulty: ⭐️

The average score on this problem was 90%.


Difficulty: ⭐️⭐️

The average score on this problem was 86%.


Problem 25

Suppose df is a DataFrame and b is any boolean array whose length is the same as the number of rows of df.

True or False: For any such boolean array b, df[b].shape[0] is less than or equal to df.shape[0].

Answer: True

The brackets in df[b] perform a query, or filter, to keep only the rows of df for which b has a True entry. Typically, b will come from some condition, such as the entry in a certain column of df equaling a certain value. Regardless, df[b] contains a subset of the rows of df, and .shape[0] counts the number of rows, so df[b].shape[0] must be less than or equal to df.shape[0].
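A small demonstration of this property, with an arbitrary toy DataFrame and condition:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4]})
b = df.get("x") > 2  # any boolean array of matching length works

# Querying can only keep a subset of the rows, never add rows.
assert df[b].shape[0] <= df.shape[0]
print(df[b].shape[0], df.shape[0])  # 2 4
```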


Difficulty: ⭐️⭐️

The average score on this problem was 86%.


Problem 26

You are given a DataFrame called books that contains columns 'author' (string), 'title' (string), 'num_chapters' (int), and 'publication_year' (int).

Suppose that after doing books.groupby('author').max(), one row says

author title num_chapters publication_year
Charles Dickens Oliver Twist 53 1838


Problem 26.1

Based on this data, can you conclude that Charles Dickens is the alphabetically last of all author names in this dataset?

Answer: No

When we group by 'author', all books by the same author get aggregated together into a single row. The aggregation function is applied separately to each other column besides the column we’re grouping by. Since we’re grouping by 'author' here, the 'author' column never has the max() function applied to it. Instead, each unique value in the 'author' column becomes a value in the index of the grouped DataFrame. We are told that the Charles Dickens row is just one row of the output, but we don’t know anything about the other rows of the output, or the other authors. We can’t say anything about where Charles Dickens falls when authors are ordered alphabetically (but it’s probably not last!)


Difficulty: ⭐️

The average score on this problem was 94%.


Problem 26.2

Based on this data, can you conclude that Charles Dickens wrote Oliver Twist?

Answer: Yes

Grouping by 'author' collapses all books written by the same author into a single row. Since we’re applying the max() function to aggregate these books, we can conclude that Oliver Twist is alphabetically last among all books in the books DataFrame written by Charles Dickens. So Charles Dickens did write Oliver Twist based on this data.


Difficulty: ⭐️

The average score on this problem was 95%.


Problem 26.3

Based on this data, can you conclude that Oliver Twist has 53 chapters?

Answer: No

The key to this problem is that groupby applies the aggregation function, max() in this case, independently to each column. The output should be interpreted as follows:

  • Among all books in books written by Charles Dickens, Oliver Twist is the title that is alphabetically last.
  • Among all books in books written by Charles Dickens, 53 is the greatest number of chapters.
  • Among all books in books written by Charles Dickens, 1838 is the latest year of publication.

However, the book titled Oliver Twist, the book with 53 chapters, and the book published in 1838 are not necessarily all the same book. We cannot conclude, based on this data, that Oliver Twist has 53 chapters.
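To see this concretely, here is a toy two-book version of books (with illustrative values) where the alphabetically last title and the chapter maximum come from different books:

```python
import pandas as pd

# Hypothetical two-book version of books; values are illustrative.
books = pd.DataFrame({
    "author": ["Charles Dickens", "Charles Dickens"],
    "title": ["Oliver Twist", "Bleak House"],
    "num_chapters": [53, 67],
    "publication_year": [1838, 1853],
})

row = books.groupby("author").max()
# 'Oliver Twist' is the alphabetically last title, but the chapter
# maximum (67) comes from a different book entirely.
print(row.get("title").iloc[0], row.get("num_chapters").iloc[0])
```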


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 74%.


Problem 26.4

Based on this data, can you conclude that Charles Dickens wrote a book with 53 chapters that was published in 1838?

Answer: No

As explained in the previous question, the max() function is applied separately to each column, so the book written by Charles Dickens with 53 chapters may not be the same book as the book written by Charles Dickens published in 1838.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 73%.



Problem 27

Below, identify the data type of the result of each of the following expressions, or select “error” if you believe the expression results in an error.


Problem 27.1

sky.sort_values('height')

Answer: DataFrame

sky is a DataFrame. All the sort_values method does is change the order of the rows in the Series/DataFrame it is called on; it does not change the data structure. As such, sky.sort_values('height') is also a DataFrame.


Difficulty: ⭐️⭐️

The average score on this problem was 87%.


Problem 27.2

sky.sort_values('height').get('material').loc[0]

Answer: error

sky.sort_values('height') is a DataFrame, and sky.sort_values('height').get('material') is a Series corresponding to the 'material' column, sorted by 'height' in increasing order. So far, there are no errors.

Remember, the .loc accessor is used to access elements in a Series based on their index. sky.sort_values('height').get('material').loc[0] is asking for the element in the sky.sort_values('height').get('material') Series with index 0. However, the index of sky is made up of building names. Since there is no building named 0, .loc[0] causes an error.


Difficulty: ⭐️⭐️

The average score on this problem was 79%.


Problem 27.3

sky.sort_values('height').get('material').iloc[0]

Answer: string

As we mentioned above, sky.sort_values('height').get('material') is a Series containing values from the 'material' column (but sorted). Remember, there is no element in this Series with an index of 0, so sky.sort_values('height').get('material').loc[0] errors. However, .iloc[0] works differently than .loc[0]; .iloc[0] will give us the first element in a Series (independent of what’s in the index). So, sky.sort_values('height').get('material').iloc[0] gives us back a value from the 'material' column, which is made up of strings, so it gives us a string. (Specifically, it gives us the 'material' type of the skyscraper with the smallest 'height'.)
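A tiny sketch with a hypothetical sky, indexed by building name, shows the .loc/.iloc distinction:

```python
import pandas as pd

# Hypothetical two-building version of sky, indexed by building name.
sky = pd.DataFrame(
    {"material": ["steel", "brick"], "height": [120.0, 90.0]},
    index=["Tower A", "Tower B"],
)

materials = sky.sort_values("height").get("material")
print(materials.iloc[0])  # 'brick', first by position after sorting
# materials.loc[0] would raise a KeyError: there is no label 0 in the index
```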


Difficulty: ⭐️⭐️

The average score on this problem was 89%.


Problem 27.4

sky.get('floors').max()

Answer: int or float

The Series sky.get('floors') is made up of integers, and sky.get('floors').max() evaluates to the largest number in the Series, which is also an integer.


Difficulty: ⭐️

The average score on this problem was 91%.


Problem 27.5

sky.index[0]

Answer: string

sky.index contains the values 'Bayard-Condict Building', 'The Yacht Club at Portofino', 'City Investing Building', etc. sky.index[0] is then 'Bayard-Condict Building', which is a string.


Difficulty: ⭐️

The average score on this problem was 91%.



Problem 28

sky.groupby('material').max()

Answer: DataFrame

When grouping and using an aggregation method, the result is always a DataFrame. The DataFrame sky.groupby('material').max() contains all of the columns in sky, minus 'material', which has been moved to the index. It contains one row for each unique 'material'.

Note that no columns were “dropped”, as may happen when using .mean(), because .max() can work on Series of any type. You can take the max of strings, while you cannot take the mean of strings.
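For instance, on a small hypothetical sky:

```python
import pandas as pd

# Hypothetical sky with a string column besides 'material'.
sky = pd.DataFrame({
    "material": ["steel", "steel", "brick"],
    "city": ["San Diego", "New York", "Chicago"],
    "height": [152.0, 300.0, 100.0],
})

out = sky.groupby("material").max()
print(out.columns.tolist())  # ['city', 'height'], the string column survives
print(out.shape[0])          # 2, one row per unique material
```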


Difficulty: ⭐️⭐️

The average score on this problem was 78%.


Problem 29

In this question, we’ll write code to learn more about the skyscrapers in the beautiful city of San Diego. (Unrelated fun fact – since the San Diego Airport is so close to downtown, buildings in downtown San Diego legally cannot be taller than 152 meters.)


Problem 29.1

Below, fill in the blank to create a DataFrame, named san_tall, consisting of just the skyscrapers in San Diego that are over 100 meters tall.

condition = ______
san_tall = sky[(sky.get('city') == 'San Diego') & condition]

What goes in the blank?

Answer: sky.get('height') > 100

We need to query for all of the skyscrapers that satisfy two conditions – the 'city' must be 'San Diego' and the 'height' must be above 100. The first condition was already implemented for us, so we just need to construct a Boolean Series that implements the second condition.

Here, we want all of the rows where 'height' is above 100, so we get the 'height' column and compare it to 100 like so: sky.get('height') > 100.


Difficulty: ⭐️

The average score on this problem was 95%.


Problem 29.2

Suppose san_tall from the previous part was created correctly. Fill in the blanks so that height_many_floors evaluates to the height (in meters) of the skyscraper with the most floors, amongst all skyscrapers in San Diego that are over 100 meters tall.

height_many_floors = san_tall.______.iloc[0]

What goes in the blank?

Answer: sort_values('floors', ascending=False).get('height')

The end of the line given to us is .iloc[0]. We know that .iloc[0] extracts the first element in whatever Series it is called on, so what comes before .iloc[0] must be a Series where the first element is the 'height' of the skyscraper with the most floors, among all skyscrapers in San Diego that are over 100 meters tall. The DataFrame we are working with, san_tall, already only has skyscrapers in San Diego that are over 100 meters tall.

This means that in the blank, all we need to do is:

  1. Sort skyscrapers by 'floors' in decreasing order (so that the first row is the skyscraper with the most 'floors').
  2. Extract the 'height' column.

As such, a complete answer is height_many_floors = san_tall.sort_values('floors', ascending=False).get('height').iloc[0].


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 74%.


Problem 29.3

height_many_floors, the value you computed in the previous part (2.2) was a number.

True or False: Assuming that the DataFrame san_tall contains all skyscrapers in San Diego, height_many_floors is the height (in meters) of the tallest skyscraper in San Diego.

Answer: False

height_many_floors is the height of the skyscraper with the most 'floors'. However, this is not necessarily the tallest skyscraper (i.e. the skyscraper with the largest 'height')! Consider the following scenario:

  • Building A: 15 floors, height of 150 feet
  • Building B: 20 floors, height of 100 feet

height_many_floors would be 100, but it is not the 'height' of the taller building.
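Here is the Building A / Building B scenario from above, written as code:

```python
import pandas as pd

# Hypothetical san_tall-style data mirroring the scenario above.
san_tall = pd.DataFrame({
    "height": [150.0, 100.0],
    "floors": [15, 20],
}, index=["Building A", "Building B"])

# Sorting by floors (descending) puts Building B first, even though
# Building A is taller.
height_many_floors = (san_tall.sort_values("floors", ascending=False)
                      .get("height")
                      .iloc[0])
print(height_many_floors)  # 100.0
```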


Difficulty: ⭐️⭐️

The average score on this problem was 84%.



Problem 30

Below, identify the data type of the result of each of the following expressions, or select “error” if you believe the expression results in an error.


Problem 30.1

sky.get('city').apply(len)

Answer: Series

The .apply method takes in a function and evaluates that function on every element in a Series. Here, sky.get('city').apply(len) is using the function len on every element in the Series sky.get('city'). The result is also a Series, containing the lengths of the names of each 'city'.


Difficulty: ⭐️⭐️

The average score on this problem was 79%.


Problem 30.2

sky.get('city').apply(max)

Answer: Series

This is a tricky problem!

The function that apply takes in must work on individual elements in a Series, i.e. it must work with just a single argument. We saw this in the above subpart, where sky.get('city').apply(len) applied len on each 'city' name.

Here, we are trying to apply the max function on each 'city' name. The max of a single item does not work in Python, because taking the max requires comparing two or more elements. Try it out - in a notebook, run the expression max(5), and you’ll see an error. So, if we tried to use .apply(max) on a Series of numbers, we’d run into an error.

However, we are using .apply(max) on a Series of strings, and it turns out that Python does allow us to take the max of a string! The max of a string in Python is its alphabetically last character, so max('hello') evaluates to 'o'. This means that sky.get('city').apply(max) does actually run without error; it evaluates to a Series containing the alphabetically last character in the name of each 'city'.

(This subpart was trickier than we intended – we ended up giving credit to both “error” and “Series”.)
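You can check both claims directly:

```python
import pandas as pd

# max of a string is its alphabetically last character.
assert max("hello") == "o"

# Hypothetical city names; .apply(max) calls max on each string.
cities = pd.Series(["San Diego", "La Jolla"])
print(cities.apply(max).tolist())  # ['o', 'o']
```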


Difficulty: ⭐️⭐️

The average score on this problem was 89%.



Problem 31

Below, identify the data type of the result of each of the following expressions, or select “error” if you believe the expression results in an error.


Problem 31.1

sky.sort_values('height')

Answer: DataFrame

sky is a DataFrame. All the sort_values method does is change the order of the rows in the Series/DataFrame it is called on; it does not change the data structure. As such, sky.sort_values('height') is also a DataFrame.


Difficulty: ⭐️⭐️

The average score on this problem was 87%.


Problem 31.2

sky.sort_values('height').get('material').loc[0]

Answer: error

sky.sort_values('height') is a DataFrame, and sky.sort_values('height').get('material') is a Series corresponding to the 'material' column, sorted by 'height' in increasing order. So far, there are no errors.

Remember, the .loc accessor is used to access elements in a Series based on their index. sky.sort_values('height').get('material').loc[0] is asking for the element in the sky.sort_values('height').get('material') Series with index 0. However, the index of sky is made up of building names. Since there is no building named 0, .loc[0] causes an error.


Difficulty: ⭐️⭐️

The average score on this problem was 79%.


Problem 31.3

sky.sort_values('height').get('material').iloc[0]

Answer: string

As we mentioned above, sky.sort_values('height').get('material') is a Series containing values from the 'material' column (but sorted). Remember, there is no element in this Series with an index of 0, so sky.sort_values('height').get('material').loc[0] errors. However, .iloc[0] works differently than .loc[0]; .iloc[0] will give us the first element in a Series (independent of what’s in the index). So, sky.sort_values('height').get('material').iloc[0] gives us back a value from the 'material' column, which is made up of strings, so it gives us a string. (Specifically, it gives us the 'material' type of the skyscraper with the smallest 'height'.)


Difficulty: ⭐️⭐️

The average score on this problem was 89%.


Problem 31.4

sky.get('floors').max()

Answer: int or float

The Series sky.get('floors') is made up of integers, and sky.get('floors').max() evaluates to the largest number in the Series, which is also an integer.


Difficulty: ⭐️

The average score on this problem was 91%.


Problem 31.5

sky.groupby('material').max()

Answer: DataFrame

When grouping and using an aggregation method, the result is always a DataFrame. The DataFrame sky.groupby('material').max() contains all of the columns in sky, minus 'material', which has been moved to the index. It contains one row for each unique 'material'.

Note that no columns were “dropped”, as may happen when using .mean(), because .max() can work on Series of any type. You can take the max of strings, while you cannot take the mean of strings.


Difficulty: ⭐️⭐️

The average score on this problem was 78%.


Problem 31.6

sky.index[0]

Answer: string

sky.index contains the values 'Bayard-Condict Building', 'The Yacht Club at Portofino', 'City Investing Building', etc. sky.index[0] is then 'Bayard-Condict Building', which is a string.


Difficulty: ⭐️

The average score on this problem was 91%.



Problem 32

Below, identify the data type of the result of each of the following expressions, or select “error” if you believe the expression results in an error.


Problem 32.1

sky.sort_values('height')

Answer: DataFrame

sky is a DataFrame. All the sort_values method does is change the order of the rows in the Series/DataFrame it is called on; it does not change the data structure. As such, sky.sort_values('height') is also a DataFrame.


Difficulty: ⭐️⭐️

The average score on this problem was 87%.


Problem 32.2

sky.sort_values('height').get('material').loc[0]

Answer: error

sky.sort_values('height') is a DataFrame, and sky.sort_values('height').get('material') is a Series corresponding to the 'material' column, sorted by 'height' in increasing order. So far, there are no errors.

Remember, the .loc accessor is used to access elements in a Series based on their index. sky.sort_values('height').get('material').loc[0] is asking for the element in the sky.sort_values('height').get('material') Series with index 0. However, the index of sky is made up of building names. Since there is no building named 0, .loc[0] causes an error.


Difficulty: ⭐️⭐️

The average score on this problem was 79%.


Problem 32.3

sky.sort_values('height').get('material').iloc[0]

Answer: string

As we mentioned above, sky.sort_values('height').get('material') is a Series containing values from the 'material' column (but sorted). Remember, there is no element in this Series with an index of 0, so sky.sort_values('height').get('material').loc[0] errors. However, .iloc[0] works differently than .loc[0]; .iloc[0] will give us the first element in a Series (independent of what’s in the index). So, sky.sort_values('height').get('material').iloc[0] gives us back a value from the 'material' column, which is made up of strings, so it gives us a string. (Specifically, it gives us the 'material' type of the skyscraper with the smallest 'height'.)


Difficulty: ⭐️⭐️

The average score on this problem was 89%.


Problem 32.4

sky.get('city').apply(len)

Answer: Series

The .apply method takes in a function and evaluates that function on every element in a Series. Here, sky.get('city').apply(len) is using the function len on every element in the Series sky.get('city'). The result is also a Series, containing the lengths of the names of each 'city'.


Difficulty: ⭐️⭐️

The average score on this problem was 79%.


Problem 32.5

sky.get('city').apply(max)

Answer: Series

This is a tricky problem!

The function that apply takes in must work on individual elements in a Series, i.e. it must work with just a single argument. We saw this in the above subpart, where sky.get('city').apply(len) applied len on each 'city' name.

Here, we are trying to apply the max function on each 'city' name. The max of a single item does not work in Python, because taking the max requires comparing two or more elements. Try it out - in a notebook, run the expression max(5), and you’ll see an error. So, if we tried to use .apply(max) on a Series of numbers, we’d run into an error.

However, we are using .apply(max) on a Series of strings, and it turns out that Python does allow us to take the max of a string! The max of a string in Python is its alphabetically last character, so max('hello') evaluates to 'o'. This means that sky.get('city').apply(max) does actually run without error; it evaluates to a Series containing the alphabetically last character in the name of each 'city'.

(This subpart was trickier than we intended – we ended up giving credit to both “error” and “Series”.)


Difficulty: ⭐️⭐️

The average score on this problem was 89%.


Problem 32.6

sky.get('floors').max()

Answer: int or float

The Series sky.get('floors') is made up of integers, and sky.get('floors').max() evaluates to the largest number in the Series, which is also an integer.


Difficulty: ⭐️

The average score on this problem was 91%.


Problem 32.7

sky.groupby('material').max()

Answer: DataFrame

When grouping and using an aggregation method, the result is always a DataFrame. The DataFrame sky.groupby('material').max() contains all of the columns in sky, minus 'material', which has been moved to the index. It contains one row for each unique 'material'.

Note that no columns were “dropped”, as may happen when using .mean(), because .max() can work on Series of any type. You can take the max of strings, while you cannot take the mean of strings.


Difficulty: ⭐️⭐️

The average score on this problem was 78%.


Problem 32.8

sky.index[0]

Answer: string

sky.index contains the values 'Bayard-Condict Building', 'The Yacht Club at Portofino', 'City Investing Building', etc. sky.index[0] is then 'Bayard-Condict Building', which is a string.


Difficulty: ⭐️

The average score on this problem was 91%.



Problem 33

In this question, we’ll write code to learn more about the skyscrapers in the beautiful city of San Diego. (Unrelated fun fact – since the San Diego Airport is so close to downtown, buildings in downtown San Diego legally cannot be taller than 152 meters.)

Below, fill in the blank to create a DataFrame, named san_tall, consisting of just the skyscrapers in San Diego that are over 100 meters tall.

condition = ______
san_tall = sky[(sky.get('city') == 'San Diego') & condition]

What goes in the blank?

Answer: sky.get('height') > 100

We need to query for all of the skyscrapers that satisfy two conditions – the 'city' must be 'San Diego' and the 'height' must be above 100. The first condition was already implemented for us, so we just need to construct a Boolean Series that implements the second condition.

Here, we want all of the rows where 'height' is above 100, so we get the 'height' column and compare it to 100 like so: sky.get('height') > 100.


Difficulty: ⭐️

The average score on this problem was 95%.


Source: wi23-final — Q2

Problem 34

You are unsure whether it would make sense to use "BGG Rank" as the index of the games DataFrame, because you are unsure whether this column has duplicate values. Perhaps, for example, two games are tied and both have a rank of 6.

Select all of the expressions below that evaluate to True when the "BGG Rank" column could be used as the index (no duplicates), and False when it could not be used as the index (duplicates). In other words, these are the expressions that can be used to detect the presence of duplicate values.

Note: We will not set the index of games; instead, we’ll leave it with the default index.

Answer: Options 2 and 3

Option 2: games.groupby("BGG Rank").count() puts all of the unique "BGG Rank" values into the index. The aggregation method .count() then turns every remaining column into the number of times each "BGG Rank" appears. Since all of those columns hold the same counts, we just need one of them; here we get the "Name" column with .get("Name"). Finally, .max() == 1 checks whether the largest count for any "BGG Rank" is one, which is exactly the no-duplicates condition.

Option 3: Let’s work from the inside out. np.unique(games.get("BGG Rank")) gives an array of the unique values in the "BGG Rank" column, and wrapping it in len() counts how many unique values there are. Recall that games.shape[0] is the number of rows in games. So this expression checks whether the number of unique "BGG Rank" values equals the number of rows; if it does, every rank is unique.

Option 1: games.get("BGG Rank") gives a Series of the "BGG Rank" column. np.arange(games.shape[0]) creates an array running from 0 up to (but not including) games.shape[0], the number of rows in games: something like array([0, 1, 2, ..., n - 1]). Subtracting this array decreases each rank by an increasing amount, which produces a Series of numbers that has nothing to do with uniqueness: we end up checking something about the differences, not about whether any rank repeats. So although the code runs, it does not detect duplicates.

Option 4: games.get("BGG Rank").max() gives the largest value in "BGG Rank", and games.shape[0] is the number of rows in games. We should never make assumptions about what is inside "BGG Rank": the values need not line up nicely as 1, 2, 3, ..., games.shape[0], so the maximum could exceed games.shape[0] even when every rank is unique. The expression is therefore not guaranteed to be zero exactly when there are no duplicates, so it does not detect duplicates.
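To see Options 2 and 3 at work, here is a toy games DataFrame with a deliberately duplicated rank (all names and ranks are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical games table with a deliberately duplicated rank.
games = pd.DataFrame({
    "BGG Rank": [1, 2, 2, 4],
    "Name": ["Gloomhaven", "Pandemic", "Catan", "Root"],
})

# Option 2: the biggest per-rank count is 1 only when no rank repeats.
opt2 = games.groupby("BGG Rank").count().get("Name").max() == 1
# Option 3 (paraphrased): as many unique ranks as rows means no duplicates.
opt3 = len(np.unique(games.get("BGG Rank"))) == games.shape[0]
print(opt2, opt3)  # False False, since rank 2 appears twice
```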


Difficulty: ⭐️⭐️

The average score on this problem was 76%.


Source: wi23-midterm — Q3

Problem 35


Problem 35.1

The DataFrame amelia was created by the code below, and is shown in its entirety.

amelia = (storms[(storms.get("Name") == "AMELIA") &
         (storms.get("Year") == 1978)]
         .get(["Year", "Month", "Day", "Time",
         "Status", "Latitude", "Longitude"]))


Use the space provided to show the DataFrame that results from

        amelia.groupby("Status").max()

The column labels should go in the top row, and the row labels (index) should go in the leftmost column. You may not need to use all the rows and columns provided.


Answer:

Status Year Month Day Time Latitude Longitude
TD 1978 8 31 6pm 29.3N 99.2W
TS 1978 7 31 6am 28.0N 98.2W

Remember that calling .groupby('column name') moves that column into the index. This means the 'Status' values become the new bolded index, found in the leftmost column of the DataFrame. Also, the index labels are sorted alphabetically: TS and TD both start with T, but D comes before S in the alphabet, so row TD comes before row TS.

Next, it is important to look at .max(), which takes the maximum of each column separately within each unique 'Status' group. Recall how .max() interacts with strings: the max of a column of strings is the alphabetically last value.
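The same behavior can be reproduced on a toy stand-in for amelia (the values below are illustrative, not the real storm track):

```python
import pandas as pd

# Toy stand-in for amelia; values are illustrative.
amelia = pd.DataFrame({
    "Status": ["TS", "TD", "TS", "TD"],
    "Day": [30, 31, 29, 30],
    "Latitude": ["28.0N", "29.3N", "27.5N", "29.0N"],
})

out = amelia.groupby("Status").max()
print(out.index.tolist())  # ['TD', 'TS'], index sorted alphabetically
# Each column is maxed separately; string maxima are alphabetical.
print(out.get("Latitude").loc["TD"])  # '29.3N'
```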


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 59%.



Source: wi24-final — Q2

Problem 36

Frank X. Kugler has Olympic medals in three sports (wrestling, weightlifting, and tug of war), which is more sports than any other Olympic medalist. Furthermore, his medals for all three of these sports are included in the olympians DataFrame. Fill in the blanks below so that the expression below evaluates to "Frank X. Kugler".

                (olympians.groupby(__(a)__).__(b)__
                          .reset_index()
                          .groupby(__(c)__).__(d)__
                          .sort_values(by="Age", ascending=False)
                          .index[0])


Problem 36.1

What goes in blank (a)?

Answer: ["Name", "Sport"] or ["Sport", "Name"]

The question wants us to find the name (Frank X. Kugler) who has records that correspond to three distinct sports. We know that the same athlete might have multiple records for a distinct sport if they participated in the same sport for multiple years. Therefore we should groupby "Name" and "Sport" to create a DataFrame with unique Name-Sport pairs. This is a DataFrame that contains the athletes and their sports (for each athlete, their corresponding sports are distinct). If an athlete participated in 2 sports, for example, they would have 2 rows corresponding to them in the DataFrame, 1 row for each distinct sport.


Difficulty: ⭐️⭐️

The average score on this problem was 75%.


Problem 36.2

What goes in blank (b)?

Answer: .sum() or .mean() or .min(), etc.

Any aggregation method applied after df.groupby() would work. We only want to remove cases in which an athlete participates in the same sport in multiple years, leaving unique name-sport pairs; we don’t care about the aggregated numeric values. Notice that .unique() is not correct, because it is not an aggregation method available after grouping a DataFrame: it raises AttributeError: 'DataFrameGroupBy' object has no attribute 'unique'. However, .unique() can be used after Series.groupby(). For more info: https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.SeriesGroupBy.unique.html


Difficulty: ⭐️

The average score on this problem was 96%.


Problem 36.3

What goes in blank (c)?

Answer: "Name"

Now after resetting the index, we have "Name" and "Sport" columns containing unique name-sport pairs. The objective is to count how many different sports each Olympian has medals in. To do that, we groupby "Name" and later use the .count() method. This would give a new DataFrame that has a count of how many times each name shows up in our previous DataFrame with unique name-sport pairs.


Difficulty: ⭐️⭐️

The average score on this problem was 81%.


Problem 36.4

What goes in blank (d)?

Answer: .count()

The .count() method is applied to each group. In this context, .count() will get the number of entries for each Olympian across different sports, since the previous steps ensured each sport per Olympian is uniquely listed (due to the initial groupby on both "Name" and "Sport"). It does not matter what we sort_values by, because the .groupby('Name').count() method will just put a count of each "Name" in all of the columns, regardless of the column name or what value was originally in it.
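Putting all four blanks together on a toy olympians DataFrame (data made up to mirror the Frank X. Kugler situation):

```python
import pandas as pd

# Toy olympians: Frank has three sports, one repeated across years.
olympians = pd.DataFrame({
    "Name": ["Frank X. Kugler"] * 4 + ["Other Athlete"] * 2,
    "Sport": ["Wrestling", "Wrestling", "Weightlifting", "Tug of War",
              "Swimming", "Swimming"],
    "Age": [30, 34, 30, 30, 22, 26],
})

# Dedupe name-sport pairs, then count sports per athlete.
result = (olympians.groupby(["Name", "Sport"]).mean()
                   .reset_index()
                   .groupby("Name").count()
                   .sort_values(by="Age", ascending=False)
                   .index[0])
print(result)  # 'Frank X. Kugler'
```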


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 64%.



Source: wi24-midterm — Q1

Problem 37

Each of the following expressions evaluates to an integer. Determine the value of that integer, if possible, or circle “not enough information."

Important: Before proceeding, make sure to read the page called Clue: The Murder Mystery Game.


Problem 37.1

(clue.get("Cardholder") == "Janine").sum() 

Answer: 6

This code counts the number of times that Janine appears in the Cardholder column. This is because clue.get("Cardholder") == "Janine" will return a Series of True and False values of length 22 where True corresponds to a card belonging to Janine. Since 6 cards were dealt to her, the expression evaluates to 6.
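A minimal sketch of this Boolean-sum trick, with a hypothetical hand of cards:

```python
import pandas as pd

# Hypothetical cardholders; True counts as 1 when summed.
cardholder = pd.Series(["Janine", "Marcus", "Janine", "Unknown"])
janine_count = (cardholder == "Janine").sum()
print(janine_count)  # 2
```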


Difficulty: ⭐️⭐️

The average score on this problem was 78%.


Problem 37.2

np.count_nonzero(clue.get("Category").str.contains("p")) 

Answer: 13

This code counts the number of cells that contain the letter "p" in the Category column. clue.get("Category").str.contains("p") returns a Series that is True if "p" is part of the entry in the "Category" column and False otherwise. The category values "suspect" and "weapon" both contain the letter "p", and since there are 6 of one and 7 of the other, the expression evaluates to 13.


Difficulty: ⭐️⭐️

The average score on this problem was 75%.


Problem 37.3

clue[(clue.get("Category") == "suspect") & (clue.get("Cardholder") == "Janine")].shape[0] 

Answer: not enough information

This code first filters only for rows that contain both "suspect" as the category and "Janine" as the cardholder and returns the number of rows of that DataFrame with .shape[0]. However, from the information given, we do not know how many "suspect" cards Janine has.


Difficulty: ⭐️⭐️

The average score on this problem was 83%.


Problem 37.4

len(clue.take(np.arange(5, 20, 3)).index) 

Answer: 5

np.arange(5, 20, 3) is the array np.array([5, 8, 11, 14, 17]). Recall that .take filters the DataFrame to contain only certain rows, in this case rows 5, 8, 11, 14, and 17. Next, .index extracts the index of the DataFrame, and the length of the index equals the number of rows in the DataFrame. There are 5 rows.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 69%.


Problem 37.5

len(clue[clue.get("Category") >= "this"].index) 

Answer: 7

Similarly to the previous problem, we are getting the number of rows of the DataFrame clue after filtering it. clue.get("Category") >= "this" returns a Boolean Series with True wherever a string in "Category" is alphabetically greater than or equal to "this". This only happens when the string is "weapon", which occurs 7 times.
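The underlying string comparisons can be checked directly:

```python
# Strings compare character by character, left to right.
print("weapon" >= "this")   # True:  'w' comes after 't'
print("suspect" >= "this")  # False: 's' comes before 't'
print("room" >= "this")     # False: 'r' comes before 't'
```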


Difficulty: ⭐️⭐️⭐️⭐️⭐️

The average score on this problem was 29%.


Problem 37.6

clue.groupby("Cardholder").count().get("Category").sum() 

Answer: 22

groupby("Cardholder").count() returns a DataFrame indexed by "Cardholder" where each column contains the number of cards that each cardholder has. Then we sum the values in the "Category" column, which evaluates to 22 because summing the number of cards each cardholder has gives the total number of cards in play!
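A sketch with an invented 22-card deal shows why the sum recovers the deck size:

```python
import pandas as pd

# Invented deal of 22 cards across three cardholders.
clue = pd.DataFrame({
    "Cardholder": ["Janine"] * 8 + ["Marco"] * 7 + ["Unknown"] * 7,
    "Category":   ["suspect", "weapon", "room"] * 7 + ["room"],
})

# One row per cardholder; every column holds that cardholder's card count.
counts = clue.groupby("Cardholder").count()

# Summing the per-cardholder counts gives back all 22 cards.
print(counts.get("Category").sum())  # 22
```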


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 52%.



Source: wi24-midterm — Q3

Problem 38

An important part of the game is knowing when you’ve narrowed it down to just one suspect with one weapon in one room. Then you can make your accusation and win the game!

Suppose the DataFrames grouped and filtered are defined as follows.

    grouped = (clue.reset_index()
                   .groupby(["Category", "Cardholder"])
                   .count()
                   .reset_index())
    filtered = grouped[grouped.get("Cardholder") == "Unknown"]


Problem 38.1

Fill in the blank below so that "Ready to accuse" is printed when Janine has enough information to make an accusation and win the game.

    if filtered.get("Card").______ == 3:
        print("Ready to accuse")

What goes in the blank?

Answer: sum()

It is helpful to first visualize how both the grouped (left) and filtered (right) DataFrames could look:

Image 1

The grouped DataFrame contains the number of cards for each "Category"/"Cardholder" combination.

Image 2

The filtered DataFrame contains the number of "Unknown" cards in each of the categories "room", "suspect", and "weapon".

Now, let’s think about the scenario presented. We want a method that returns 3 from filtered.get("Card").___. We do not use count() because that is an aggregation method that appears after a .groupby, and there is no grouping here.

According to the instructions, we want to know when we’ve narrowed it down to just one suspect with one weapon in one room. This means each row of the filtered DataFrame should have a 1 in the "Card" column when you are ready to accuse. sum() works because when you have only 1 unknown card in each of the three categories, you have a sum of 3 unknown cards in total. You can make an accusation now!
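For instance, a hypothetical filtered DataFrame in the winning state looks like this, and the sum check fires exactly then:

```python
import pandas as pd

# Hypothetical winning state: exactly one unknown card per category.
filtered = pd.DataFrame({
    "Category":   ["room", "suspect", "weapon"],
    "Cardholder": ["Unknown"] * 3,
    "Card":       [1, 1, 1],
})

if filtered.get("Card").sum() == 3:
    print("Ready to accuse")  # prints, since 1 + 1 + 1 == 3
```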


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 50%.


Problem 38.2

Now, let’s look at a different way to do the same thing. Fill in the blank below so that "Ready to accuse" is printed when Janine has enough information to make an accusation and win the game.

    if filtered.get("Card").______ == 1:
        print("Ready to accuse")

What goes in the blank?

Answer: max()

This problem follows the same logic as the first except we only want to accuse when filtered.get("Card").___ == 1. As we saw in the previous part, we only want to accuse when all the numbers in the "Card" column are 1, as this represents one unknown in each category. This means the largest number in the "Card" column must be 1, so we can fill in the blank with max().
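The same hypothetical winning state also passes the max() version of the check:

```python
import pandas as pd

# Hypothetical winning state again: one unknown card in each category.
filtered = pd.DataFrame({
    "Category":   ["room", "suspect", "weapon"],
    "Cardholder": ["Unknown"] * 3,
    "Card":       [1, 1, 1],
})

# max() == 1 means no category has more than one unknown card left.
if filtered.get("Card").max() == 1:
    print("Ready to accuse")
```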


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 40%.



Source: wi25-final — Q2

Problem 39

The Death Eaters are a powerful group of dark wizards who oppose Harry Potter and his allies. Each Death Eater receives a unique identification number based on their order of initiation, ranging from 1 to N, where N represents the total number of Death Eaters.

Your task is to estimate the value of N so you can understand how many enemies you face. You have a random sample of identification numbers in a DataFrame named death_eaters containing a single column called "ID".


Problem 39.1

Which of the options below would be an appropriate estimate for the total number of Death Eaters? Select all that apply.

Answer: death_eaters.get("ID").max() and int(death_eaters.get("ID").mean() * 2)

  • Option 1: death_eaters.get("ID").max() returns the maximum ID from the sample. This is an appropriate estimate since the population size must be at least the size of the largest ID in our sample. For instance, if the maximum ID observed is 250, then the total number of Death Eaters must be at least 250.

  • Option 2: death_eaters.get("ID").sum() returns the sum of all ID numbers in the sample. The total sum of IDs has no meaningful connection to the population size, which makes this an inappropriate estimate.

  • Option 3: death_eaters.groupby("ID").count() groups the data by ID and counts occurrences. Since each ID is unique and death_eaters only includes the "ID" column, grouping simply shows that each ID appears once. This is not an appropriate estimate for N.

  • Option 4: int(death_eaters.get("ID").mean() * 2) returns twice the mean of the sample IDs as an integer. The mean of a random sample of the numbers 1 through N usually falls about halfway between 1 and N. So we can appropriately estimate N by doubling this mean.

  • Option 5: death_eaters.shape[0] returns the number of rows in death_eaters (i.e., the sample size). The sample size does not reflect the total population size, making it an inappropriate estimate.
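A quick simulation (with an assumed true N of 400 and sample size 50, both invented for illustration) shows the two valid estimators landing near N:

```python
import numpy as np
import pandas as pd

np.random.seed(42)
N = 400  # true population size, known only because we simulate it

# Random sample of 50 distinct ID numbers from 1, ..., N.
ids = np.random.choice(np.arange(1, N + 1), 50, replace=False)
death_eaters = pd.DataFrame({"ID": ids})

# Option 1: the sample max can never exceed N, and is usually close.
print(death_eaters.get("ID").max())

# Option 4: twice the sample mean is also usually close to N.
print(int(death_eaters.get("ID").mean() * 2))
```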


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 66%.


Problem 39.2

Each box that you selected in part (a) is an example of what?

Answer: a statistic

The options in part (a) calculate a numerical value from the random sample death_eaters. This fits the definition of a statistic.


Difficulty: ⭐️⭐️

The average score on this problem was 82%.


Problem 39.3

Suppose you have access to a function called estimate, which takes in a Series of Death Eater ID numbers and returns an estimate for N. Fill in the blanks below to do the following:

    boot_estimates = np.array([])
    
    for i in np.arange(10000):
        boot_estimates = np.append(boot_estimates, __(a)__)

    left_72 = __(b)__
    

What goes in blank (a)?

Answer: estimate(death_eaters.sample(death_eaters.shape[0], replace=True).get("ID"))

In the given code, we use a for loop to generate 10,000 bootstrapped estimates of N and append them to the array boot_estimates. Blank (a) specifically computes one bootstrapped estimate of N. Here’s how key parts of the solution work:

  • death_eaters.sample(death_eaters.shape[0], replace=True): To bootstrap, we need to resample the data with replacement. The .sample() method takes as arguments the sample size (death_eaters.shape[0]) and whether to sample with replacement (replace=True).

  • .get("ID"): Since estimate() takes a Series as input, we need to extract the ID column from the resample.

  • estimate(): The resampled ID column is passed into the estimate() function to generate one bootstrapped estimate of N.
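Putting the pieces together with an invented estimate function (doubling the sample mean, one of the valid estimators from part (a)) and simulated data:

```python
import numpy as np
import pandas as pd

def estimate(ids):
    # One valid estimator from part (a): twice the sample mean.
    return int(ids.mean() * 2)

np.random.seed(0)
# Simulated sample of 50 IDs from an assumed population of 400.
death_eaters = pd.DataFrame(
    {"ID": np.random.choice(np.arange(1, 401), 50, replace=False)})

boot_estimates = np.array([])
for i in np.arange(10000):
    # One bootstrap resample, same size as the original, with replacement.
    resample = death_eaters.sample(death_eaters.shape[0], replace=True)
    boot_estimates = np.append(boot_estimates, estimate(resample.get("ID")))

print(len(boot_estimates))  # 10000
```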


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 62%.


Problem 39.4

What goes in blank (b)?

Answer: np.percentile(boot_estimates, 14)

A 72% confidence interval captures the middle 72% of our distribution. This leaves 28% of the data outside the interval: 14% in the lower tail and 14% in the upper tail. Thus, the left endpoint corresponds to the 14th percentile of boot_estimates. The np.percentile() function takes as arguments the array from which to compute the percentile (boot_estimates) and the desired percentile (14).
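As a sketch with stand-in bootstrapped estimates (simulated here, since we don't have the real boot_estimates array):

```python
import numpy as np

np.random.seed(0)
# Stand-in for boot_estimates: 10,000 simulated estimates near 400.
boot_estimates = np.random.normal(400, 20, 10000)

# Middle 72%: trim 14% from each tail.
left_72 = np.percentile(boot_estimates, 14)
right_72 = np.percentile(boot_estimates, 86)
print(left_72, right_72)
```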


Difficulty: ⭐️

The average score on this problem was 91%.