Lecture 4 — Practice


Lecture 4 — Collected Practice Questions

Below are practice problems tagged for Lecture 4 (rendered directly from the original exam/quiz sources).


Problem 1


Problem 1.1

Which of these would it make sense to use as the index of flights?

Answer: None of these are good choices for the index

When choosing an index, we have to make sure that the index is different for each row of the DataFrame. The index in this case should uniquely identify the flight.

'DATE' does not uniquely identify a flight because there are many different flights in a single day. 'FLIGHT' does not uniquely identify a flight because airlines reuse flight numbers on a daily basis, as we are told in the data description. Neither 'FROM' nor 'TO' uniquely identifies a flight, as there are many flights each day that depart from each airport and arrive at each airport.

Therefore, there is no single column that’s sufficient to uniquely identify a flight, but if we could use multiple columns to create what’s called a multi-index, we’d probably want to use 'DATE' and 'FLIGHT' because each row of our DataFrame should have a unique pair of values in these columns. That’s because airlines don’t reuse flight numbers within a single day.
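The uniqueness argument above can be checked directly. The sketch below is only an illustration: it uses plain pandas rather than the course's babypandas, and the three-row flights table is a made-up stand-in, not the exam data.

```python
import pandas as pd

# Hypothetical stand-in for flights; these values are invented for illustration.
flights = pd.DataFrame({
    'DATE': ['2024-01-01', '2024-01-01', '2024-01-02'],
    'FLIGHT': ['UA 123', 'UA 456', 'UA 123'],
})

# Each column on its own has repeats, so neither can serve as the index.
date_unique = flights.get('DATE').is_unique
flight_unique = flights.get('FLIGHT').is_unique

# No ('DATE', 'FLIGHT') pair is repeated, so the two columns together identify a flight.
pair_has_repeats = flights.duplicated(subset=['DATE', 'FLIGHT']).any()
```

Here date_unique and flight_unique are both False, while pair_has_repeats is also False, which is exactly the situation the explanation describes.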


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 57%.


Problem 1.2

What type of variable is 'FLIGHT'?

Answer: Categorical

'FLIGHT' is a categorical variable because it doesn’t make sense to do arithmetic with the values in the 'FLIGHT' column. 'FLIGHT' is just a label for each flight, and the fact that it includes some numbers does not make it numerical. We could have just as well used letter codes to distinguish flights.


Difficulty: ⭐️

The average score on this problem was 98%.



Problem 2

Fill in the blanks below so that the result is a DataFrame with the same columns as flights plus a new column, 'SPEED', containing the average speed of each flight, in miles per hour.

flights.__(a)__(SPEED=__(b)__)


Problem 2.1

What goes in blank (a)?

Answer: assign

We want to add a new column, so we must use assign. We can also tell that the answer will be assign because it’s the only DataFrame method that takes an input of the form SPEED=___. Remember that when using assign, we get to call the new column anything we want, and we don’t use quotes around its name.


Difficulty: ⭐️

The average score on this problem was 100%.


Problem 2.2

What goes in blank (b)?

Answer: (flights.get('DIST') / flights.get('HOURS'))

In this blank, we’ll need a Series or array containing the average speed of each flight, in miles per hour.

To calculate the average speed of an individual flight in miles per hour, we’d simply divide the total number of miles by the total amount of time in hours. For example, a flight that travels 500 miles in one hour travels at 500 miles per hour. Note that this is an average speed; at some points of the journey, the plane may have been moving faster than this speed, at other times slower. Because we are calculating an average speed for the whole trip by simply dividing, we don’t need to use .mean().

Once we know how to calculate the average speed for an individual flight, we can do the same operation on each flight all at once using Series arithmetic. flights.get('DIST') is a Series containing the distances of each flight, and flights.get('HOURS') is a Series containing the times of each flight, in the same order. When we divide these two Series, corresponding entries are divided and the result is a Series of average speeds for each flight, as desired.
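As a small sketch of the Series arithmetic described above, assuming plain pandas in place of babypandas and two invented flights:

```python
import pandas as pd

# Hypothetical distances (miles) and durations (hours).
flights = pd.DataFrame({'DIST': [500, 1200], 'HOURS': [1.0, 2.5]})

# Dividing two Series divides corresponding entries.
with_speed = flights.assign(SPEED=flights.get('DIST') / flights.get('HOURS'))
speeds = list(with_speed.get('SPEED'))
```

The result with_speed has the original columns plus 'SPEED'; flights itself is left unchanged because assign returns a new DataFrame.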


Difficulty: ⭐️

The average score on this problem was 93%.



Source: fa24-final — Q1

Problem 3


Problem 3.1

Notice that bookstore has an index of "ISBN" and sales does not. Why is that?

Answer: The bookstore can sell multiple copies of the same book.

In the sales DataFrame, each row represents an individual sale, meaning multiple rows can have the same "ISBN" if multiple copies of the same book are sold. Therefore we can’t use it as the index because it is not a unique identifier for rows of sales.


Difficulty: ⭐️⭐️

The average score on this problem was 87%.


Problem 3.2

Is "ISBN" a numerical or categorical variable?

Answer: categorical

Even though "ISBN" consists of numbers, it is used to identify and categorize books rather than to quantify or measure anything, thus it is categorical. It doesn’t make sense to compare ISBN numbers like you would compare numbers on a number line, or to do arithmetic with ISBN numbers.


Difficulty: ⭐️⭐️

The average score on this problem was 75%.


Problem 3.3

Which type of data visualization should be used to compare authors by median rating?

Answer: bar chart

A bar chart is best, as it visualizes numerical values (median ratings) across discrete categories (authors).


Difficulty: ⭐️⭐️

The average score on this problem was 88%.



Source: fa24-midterm — Q1

Problem 4

Which of the following columns would be an appropriate index for the treat DataFrame?

Answer: None of these.

The index uniquely identifies each row of a DataFrame. As a result, for a column to be a candidate for the index, it must not contain repeat items. Since it is possible for an address to give out different types of candy, values in "address" can show up multiple times. Similarly, values in "candy" can also show up multiple times as it will appear anytime a house gives it out. Finally, a neighborhood has multiple houses, so if more than one of those houses show up, that value in "neighborhood" will appear multiple times. Since "address", "candy", and "neighborhood" can potentially have repeat values, none of them can be the index for treat.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 54%.


Source: fa24-midterm — Q2

Problem 5

Which of the following expressions evaluate to "M&M"? Select all that apply.

Answer: treat.get("candy").iloc[1] and treat.sort_values(by="candy", ascending = False).get("candy").loc[1]

  • Option 1: treat.get("candy").iloc[1] gets the candy column and then retrieves the value at index location 1, which would be "M&M".

  • Option 2: treat.sort_values(by="candy", ascending=False).get("candy").iloc[1] sorts the candy column in descending order (alphabetically, the last candy is at the top) and then retrieves the value at index location 1 in the candy column. The entire dataset is not shown, but in the given rows, the second-to-last candy alphabetically is "Skittles", so we know that "M&M" will not be the second-to-last alphabetical candy in the full dataset.

  • Option 3: treat.sort_values(by="candy", ascending=False).get("candy").loc[1] is very similar to the last option; however, this time, .loc[1] is used instead of .iloc[1]. This means that instead of looking at the row in position 1 (second row) of the sorted DataFrame, we are finding the row with an index label of 1. When the rows are sorted by candy in descending order, the index labels remain with their original rows, so the "M&M" row is retrieved when we search for the index label 1.

  • Option 4: treat.set_index("candy").index[-1] sets the index to the candy column and then retrieves the last element in the index (candy). The entire dataset is not shown, but in the given rows, the last value would be "Skittles" and not "M&M". The last value of the full dataset could be "M&M", but since we are not sure, this option is not selected.
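The difference between .iloc and .loc after sorting can be seen in a tiny example. This is a sketch using plain pandas and a made-up three-row treat table with the default index labels 0, 1, 2; the real treat DataFrame has more rows and columns.

```python
import pandas as pd

# Hypothetical stand-in for treat; only the 'candy' column is modeled.
treat = pd.DataFrame({'candy': ['Twix', 'M&M', 'Skittles']})

by_position = treat.get('candy').iloc[1]        # second row as stored

sorted_candy = treat.sort_values(by='candy', ascending=False).get('candy')
sorted_by_position = sorted_candy.iloc[1]       # second row after sorting
sorted_by_label = sorted_candy.loc[1]           # row whose index label is 1
```

In this toy table, by_position and sorted_by_label are both "M&M", while sorted_by_position is "Skittles", because sorting moves rows but each row keeps its original label.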


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 66%.


Problem 6

Complete the expression below so that it evaluates to the name of the product for which the average assembly cost per package is lowest.

(ikea.assign(assembly_per_package = ___(a)___)
     .sort_values(by='assembly_per_package').___(b)___)


Problem 6.1

What goes in blank (a)?

Answer: ikea.get('assembly_cost')/ikea.get('packages')

This column, as its name suggests, contains the average assembly cost per package, obtained by dividing the total assembly cost of each product by the number of packages that product comes in. This code uses the fact that arithmetic operations between two Series happen element-wise.


Difficulty: ⭐️

The average score on this problem was 91%.


Problem 6.2

What goes in blank (b)?

Answer: get('product').iloc[0]

After adding the 'assembly_per_package' column and sorting by that column in the default ascending order, the product with the lowest 'assembly_per_package' will be in the very first row. To access the name of that product, we need to get the column containing product names and use iloc to access an element of that Series by integer position.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 66%.



Problem 7

Assume you have a DataFrame named ikea that contains information about IKEA products, including columns called 'product'(str): the name of the product, 'assembly_cost'(int): the assembly cost of each product, and 'packages'(int): the number of packages each product comes in. Complete the expression below so that it evaluates to the name of the product for which the average assembly cost per package is lowest.

(ikea.assign(assembly_per_package = ___(a)___)
     .sort_values(by='assembly_per_package').___(b)___)


Problem 7.1

What goes in blank (a)?

Answer: ikea.get('assembly_cost')/ikea.get('packages')

This column, as its name suggests, contains the average assembly cost per package, obtained by dividing the total assembly cost of each product by the number of packages that product comes in. This code uses the fact that arithmetic operations between two Series happen element-wise.


Difficulty: ⭐️

The average score on this problem was 91%.


Problem 7.2

What goes in blank (b)?

Answer: get('product').iloc[0]

After adding the 'assembly_per_package' column and sorting by that column in the default ascending order, the product with the lowest 'assembly_per_package' will be in the very first row. To access the name of that product, we need to get the column containing product names and use iloc to access an element of that Series by integer position.
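Putting both blanks together, the full expression might look like the sketch below. It assumes plain pandas instead of babypandas, and the product names and numbers are invented.

```python
import pandas as pd

# Hypothetical stand-in for ikea; product names and numbers are invented.
ikea = pd.DataFrame({
    'product': ['Shelf A', 'Desk B', 'Chair C'],
    'assembly_cost': [40, 90, 30],
    'packages': [2, 3, 1],
})

cheapest = (ikea.assign(assembly_per_package=ikea.get('assembly_cost') / ikea.get('packages'))
                .sort_values(by='assembly_per_package')   # ascending by default
                .get('product')
                .iloc[0])                                 # first row = lowest cost per package
```

With these numbers the per-package costs are 20, 30, and 30, so cheapest is 'Shelf A'.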


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 66%.



Problem 8

Which of the following is a valid reason not to set the index of sungod to 'Artist'? Select all correct answers.

Answer: Two different artists have the same name., An artist performed at Sun God in more than one year.

For this question, it is crucial to know that an index should not contain duplicate values, so we need to consider reasons why 'Artist' might contain two values that are the same. Let’s go through the answer choices in order.

For the first option, if two different artists had the same name, this would lead to duplicate values in the 'Artist' column. Therefore, this is a valid reason not to index sungod by 'Artist'.

For the second option, if one artist performed at Sun God in more than one year, their name would appear multiple times in the 'Artist' column, once for each year they performed. This would also be a valid reason not to index sungod by 'Artist'.

For the third option, if several different artists performed at Sun God in the same year, that would not necessarily create duplicates in the 'Artist' column, unless of course two of the artists had the same name, which we’ve already addressed in the first answer choice. This is not a valid reason to avoid indexing sungod by 'Artist'.

For the last answer choice, if many different artists share the same value of 'Appearance_Order', this would not create duplicates in the 'Artist' column. Therefore, this is also not a valid reason to avoid indexing sungod by 'Artist'.


Difficulty: ⭐️⭐️

The average score on this problem was 83%.


Problem 9

Suppose in a new cell, we type the following.

    sungod.sort_values(by='Year')

After we run that cell, we type the following in a second cell.

    sungod.get('Artist').iloc[0]

What is the output when we run the second cell? Note that the first Sun God festival was held in 1983.

Answer: 'Blues Traveler'

In the first cell, although we seem to be sorting sungod by 'Year', we aren’t actually changing the DataFrame sungod at all because we don’t save the sorted DataFrame. Remember that DataFrame methods don’t actually change the underlying DataFrame unless you explicitly make that happen by saving the output as the name of the DataFrame. So the first 'Artist' name will still be 'Blues Traveler'.

Suppose we had saved the sorted DataFrame as in the code below.

    sungod = sungod.sort_values(by='Year')   
    sungod.get('Artist').iloc[0]

In this case, the output would be the name of an artist who appeared in 1983, but not necessarily the one who appeared first. There will be several artists associated with the year 1983, and we don’t know which of them will be first in the sorted DataFrame.
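The effect of saving versus not saving the sorted DataFrame can be sketched as follows. The table is a hypothetical stand-in for sungod with placeholder artist names and invented years, and plain pandas is assumed in place of babypandas.

```python
import pandas as pd

# Hypothetical stand-in for sungod; artist names and years are placeholders.
sungod = pd.DataFrame({
    'Year': [2015, 1983, 1999],
    'Artist': ['Artist A', 'Artist B', 'Artist C'],
})

sungod.sort_values(by='Year')                  # sorted copy is discarded
first_unsaved = sungod.get('Artist').iloc[0]   # still the original first row

sungod = sungod.sort_values(by='Year')         # sorted copy is saved
first_saved = sungod.get('Artist').iloc[0]     # now the earliest year comes first
```

Here first_unsaved is 'Artist A' and first_saved is 'Artist B'.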


Difficulty: ⭐️⭐️⭐️⭐️⭐️

The average score on this problem was 12%.


Source: sp24-final — Q9

Problem 10

For each expression below, determine the data type of the output and the value of the expression, if possible. If there is not enough information to determine the expression’s value, write “Unknown” in the corresponding blank.


Problem 10.1

apts.get("Rent").iloc[43] * 4 / 2

type: ____ value: ____

Answer:

  • type: float
  • value: Unknown

We know that all values in the column Rent are ints. So, when we call .iloc[43] on this column (which grabs the 44th entry in the column), we know the result will be an int. We then perform some multiplication and division with this value. Importantly, when we divide an int, the type is automatically changed to a float, so the type of the final output will be a float. Since we do not explicitly know what the 44th entry in the Rent column is, the exact value of this float is unknown to us.
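The type rule can be illustrated with an ordinary Python int. The rent value here is hypothetical, since the real 44th entry is not shown; a value pulled from a pandas column would be a NumPy number type that follows the same rule.

```python
rent = 2500              # hypothetical int; the actual entry is unknown
result = rent * 4 / 2    # / always produces a float, even when the division is exact
```

So result is the float 5000.0, not the int 5000.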


Difficulty: ⭐️⭐️

The average score on this problem was 77%.


Problem 10.2

apts.get("Neighborhood").iloc[2][-3]

type: ____ value: ____

Answer:

  • type: str
  • value: “w”

This code takes the third entry (the entry at position 2) from the Neighborhood column of apts, which is a str, and then takes the third-to-last character of that string. The third entry in the Neighborhood column is 'Midway', and the third-to-last character of 'Midway' is 'w'. So, our result is a string with value 'w'.
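Negative string indexing can be checked on the string itself:

```python
neighborhood = 'Midway'             # the third entry of the column, per the explanation
third_from_end = neighborhood[-3]   # -1 is 'y', -2 is 'a', -3 is 'w'
```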


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 73%.


Problem 10.3

(apts.get("Laundry") + 5).max()

type: ____ value: ____

Answer:

  • type: int
  • value: 6

This code deals with the Laundry column of apts, which is a Series of Trues and Falses. One property of Trues and Falses is that they are also interpreted by Python as ones and zeroes. So, the code (apts.get("Laundry") + 5).max() adds five to each of the ones and zeroes in this column, and then takes the maximum value from the column, which would be an int of value 6.
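A minimal sketch of the Boolean arithmetic, using plain pandas and a made-up Laundry column that, like the one in the problem, is assumed to contain at least one True:

```python
import pandas as pd

# Hypothetical Laundry column.
laundry = pd.Series([True, False, True])

shifted = laundry + 5     # True becomes 6, False becomes 5
largest = shifted.max()
```

If the column contained no True values at all, the maximum would be 5 instead of 6.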


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 69%.


Problem 10.4

apts.get("Complex").str.contains("Verde")

type: ____ value: ____

Answer:

  • type: Series
  • value: Unknown

This code takes the column (series) "Complex" and returns a new series of True and False values. Each True in the new column is a result of an entry in the "Complex" column containing "Verde". Each False in the new column is a result of an entry in the "Complex" column failing to contain "Verde". Since we are not given the entirety of the "Complex" column, the exact value of the resulting series is unknown to us.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 64%.


Problem 10.5

apts.get("Sqft").median() > 1000

type: ____ value: ____

Answer:

  • type: bool
  • value: Unknown

This code finds the median of the column (series) "Sqft" and compares it to a value of 1000, resulting in a bool value of True or False. Since we do not know the median of the "Sqft" column, the exact value of the resulting code is unknown to us.


Difficulty: ⭐️⭐️

The average score on this problem was 87%.



Source: sp24-midterm — Q2

Problem 11

The first contact in contacts is your friend Calvin, who has an interesting phone number, with all the digits in descending order: 987-654-3210. Fill in the blanks below so that each expression evaluates to the sum of the digits in Calvin’s phone number.


Problem 11.1

contacts.get("Phone").apply(sum_phone).iloc[___(a)___]

Answer: 0

(a) should be filled with 0 because .iloc[0] refers to the first item in a Series, which corresponds to Calvin.


Difficulty: ⭐️⭐️

The average score on this problem was 89%.


Problem 11.2

sum_phone(contacts.get("Phone").loc[___(b)___])

Answer: "Calvin"

(b) should be filled with "Calvin" because .loc[] accesses an element of a Series by its row label. In this case, "Calvin" is the index label of the Series element that contains Calvin’s phone number.


Difficulty: ⭐️⭐️

The average score on this problem was 84%.


Problem 11.3

np.arange(__(c)__,__(d)__,__(e)__).sum() 

Answer:

  • (c): 0 or alternate solution 9

  • (d): 10 or alternate solution -1

  • (e): 1 or alternate solution -1

The expression uses np.arange() to generate a range of numbers and then sums them up. From the problem, we can see that Calvin’s phone number includes every digit from 9 to 0, so summing this is equivalent to summing the digits from 9 down to 0 or from 0 to 9.

np.arange(0, 10, 1) generates [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]. Alternatively, using the numbers in descending order (like the digits in Calvin’s phone number): np.arange(9, -1, -1) generates [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]. Both correctly sum up to 45.
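Both solutions can be verified with NumPy:

```python
import numpy as np

ascending = np.arange(0, 10, 1)      # 0, 1, ..., 9 (the stop value 10 is excluded)
descending = np.arange(9, -1, -1)    # 9, 8, ..., 0 (the stop value -1 is excluded)

total_up = ascending.sum()
total_down = descending.sum()
```

Both totals are 45.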


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 74%.



Source: su24-midterm — Q1

Problem 12


Problem 12.1

Which column of kart would work best as the index?

Answer: "Team"

Recall that a good index is a column whose values uniquely identify each observation. The only column that fits this description is "Team", since each row represents a unique team.


Difficulty: ⭐️⭐️

The average score on this problem was 86%.


Problem 12.2

Write a line of Python code that sets the index of kart to the column that you identified in part (1).

Answer: kart = kart.set_index("Team")

We use set_index("Team") to set the DataFrame’s index to "Team", and we assign the result back to kart to save this change.


Difficulty: ⭐️⭐️

The average score on this problem was 81%.


Problem 12.3

Fill in the blanks below to complete the implementation of the function division_to_int, which takes in a string, division, the name of a division, and returns the division’s number as an int. Example behavior is given below.

  >>> division_to_int("Division 1")
  1
  >>> division_to_int("Division 2")
  2

def division_to_int(division):
  __(i)__ __(ii)__(division.__(iii)__[__(iv)__])

Answer:

  • (i) return
  • (ii) int
  • (iii) split() or split(" ")
  • (iv) -1 or 1

Using the argument division, we just want to retrieve the number. To accomplish this, we can use split() on division to break the string into pieces and then access the number. Note that the default behavior of split() is to split on whitespace.

The output of division.split() is a list of two strings: "Division" and the number. To retrieve the number from this list, we can index with either -1 (since it is the last element of the list) or 1 (since it is the second element of the list). Because we want the output to be an integer, we use int to convert the string to an int. Finally, to have the function output the desired value, we start with a return statement.
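With the blanks filled in (using the -1 variant for blank (iv)), the function reads:

```python
def division_to_int(division):
    # split() breaks the string on whitespace: "Division 1" -> ["Division", "1"]
    # [-1] takes the last piece, and int() converts it from str to int
    return int(division.split()[-1])
```

Calling division_to_int("Division 1") returns the int 1, matching the example behavior.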


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 67%.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 71%.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 52%.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 52%.


Problem 12.4

Write a single line of code that applies division_to_int to the existing "Division" column, and updates the "Division" column in kart.

Answer: kart.assign(Division=kart.get("Division").apply(division_to_int))

First, let’s build the values for the new column. We get the column to transform with kart.get("Division") and use .apply(division_to_int) to apply the function to each entry. To store this transformed Series as the "Division" column, we use the .assign method on the DataFrame with Division as the keyword. Note that when the column name passed to .assign already exists, .assign replaces the values in that column with the new ones.
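A sketch of this step on a made-up two-row kart table, assuming plain pandas in place of babypandas. Note that assign returns a new DataFrame; for the change to persist under the name kart, the result would need to be assigned back to kart, as in Problem 12.2.

```python
import pandas as pd

def division_to_int(division):
    return int(division.split()[-1])

# Hypothetical stand-in for kart; team names are invented.
kart = pd.DataFrame({
    'Team': ['Team A', 'Team B'],
    'Division': ['Division 1', 'Division 2'],
})

# The keyword Division matches an existing column, so that column is replaced.
updated = kart.assign(Division=kart.get('Division').apply(division_to_int))
```

Here updated has the same two columns as kart, with 'Division' now holding the ints 1 and 2.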


Difficulty: ⭐️⭐️

The average score on this problem was 81%.


For the rest of this exam, assume that the changes above have not been made and that kart is the same DataFrame that was described on the attached information sheet.


Source: su24-midterm — Q3

Problem 13

Fill in the blanks so that the expression below evaluates to the "University" of the lowest scoring Division 2 team over the whole season.

kart[__(a)__].__(b)__(__(c)__).get("University").iloc[0]

Answer:

  • (a) kart.get("Division") == "Division 2"
  • (b) sort_values
  • (c) "Total Points" or by="Total Points"

First we want to filter the DataFrame for only the Division 2 teams using kart.get("Division") == "Division 2".

Since we are interested in the lowest scoring team in this division, we use sort_values on the "Total Points" column, giving sort_values(by="Total Points"). Because .iloc[0] retrieves the top row and the default behavior of sort_values is ascending order, we do not need to specify the sort order.
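The filter-then-sort pattern can be sketched on a made-up kart table. Plain pandas is assumed in place of babypandas, and the universities and point totals are invented.

```python
import pandas as pd

# Hypothetical stand-in for kart.
kart = pd.DataFrame({
    'University': ['U1', 'U2', 'U3', 'U4'],
    'Division': ['Division 1', 'Division 2', 'Division 2', 'Division 1'],
    'Total Points': [10, 80, 55, 90],
})

# Keep only Division 2 rows, then sort ascending so the lowest score is first.
division_2 = kart[kart.get('Division') == 'Division 2']
lowest = division_2.sort_values(by='Total Points').get('University').iloc[0]
```

Here lowest is 'U3'. The overall lowest scorer, 'U1', is excluded because it is in Division 1, which is why the filter has to come first.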


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 74%.


Difficulty: ⭐️

The average score on this problem was 90%.


Difficulty: ⭐️⭐️

The average score on this problem was 86%.


Problem 14

The command .set_index can take as input one column, to be used as the index, or a sequence of columns to be used as a nested index (sometimes called a MultiIndex). A MultiIndex is the default behavior of the dataframe returned by .groupby with multiple columns.

You are given a dataframe called restaurants that contains information on a variety of local restaurants’ daily number of customers and daily income. There is a row for each restaurant for each date in a given five-year time period.

The columns of restaurants are 'name' (str), 'year' (int), 'month' (int), 'day' (int), 'num_diners' (int), and 'income' (float).

Assume that in our data set, there are not two different restaurants that go by the same name (chain restaurants, for example).

Which of the following would be the best way to set the index for this dataset?

Answer: restaurants.set_index(['name', 'year', 'month', 'day'])

The correct answer is to create an index with the 'name', 'year', 'month', and 'day' columns. The question states that there is a row for each restaurant for each date in the five-year span. Therefore, each row is identified by a restaurant together with a specific date (the day, the month, and the year). To capture this in the index, we must set the index to be a multi-index with columns ['name', 'year', 'month', 'day'].

Looking at the other options, using only the 'name' column would not account for the fact that the dataframe contains daily data on customers and income for each restaurant. Similarly, the second option of ['name', 'month', 'day'] would not account for the fact that the data spans five years, so each combination of restaurant, month, and day would appear multiple times, once for each year in which that date occurs.
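Why 'year' is needed can be seen with two invented rows for the same restaurant on the same calendar day in different years. This is a sketch in plain pandas; the restaurant name and numbers are hypothetical.

```python
import pandas as pd

# Hypothetical rows: one restaurant on March 14 of two different years.
restaurants = pd.DataFrame({
    'name': ['Cafe X', 'Cafe X'],
    'year': [2019, 2020],
    'month': [3, 3],
    'day': [14, 14],
    'num_diners': [120, 95],
    'income': [2400.0, 1900.0],
})

# Without 'year', the two rows share the same index entry.
without_year = restaurants.set_index(['name', 'month', 'day']).index.is_unique

# With 'year', every row has a distinct index entry.
with_year = restaurants.set_index(['name', 'year', 'month', 'day']).index.is_unique
```

In this example without_year is False and with_year is True.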


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 53%.


Problem 15

Suppose you are given a DataFrame of employees for a given company. The DataFrame, called employees, is indexed by 'employee_id' (string) with a column called 'years' (int) that contains the number of years each employee has worked for the company.


Problem 15.1

Suppose that the code

employees.sort_values(by='years', ascending=False).index[0]

outputs '2476'.

True or False: The number of years that employee 2476 has worked for the company is greater than the number of years that any other employee has worked for the company.

Answer: False

This is false because there could be other employees who worked at the company equally long as employee 2476.

The code says that when the employees DataFrame is sorted in descending order of 'years', employee 2476 is in the first row. There might, however, be a tie among several employees for their value of 'years'. In that case, employee 2476 may wind up in the first row of the sorted DataFrame, but we cannot say that the number of years employee 2476 has worked for the company is greater than the number of years that any other employee has worked for the company.

If the statement had said greater than or equal to instead of greater than, the statement would have been true.
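A tie is easy to construct. In the sketch below (plain pandas, invented employee data), two employees share the largest value of 'years', so whichever one lands in the first row is not strictly ahead of everyone else.

```python
import pandas as pd

# Hypothetical employees with a tie for the most years.
employees = pd.DataFrame(
    {'years': [12, 30, 30]},
    index=pd.Index(['1001', '2476', '3050'], name='employee_id'),
)

top_id = employees.sort_values(by='years', ascending=False).index[0]
top_years = employees.get('years').loc[top_id]

# Count how many employees share the top value.
tied_for_top = (employees.get('years') == top_years).sum()
```

Here tied_for_top is 2, so the employee in the first row has worked at least as long as everyone else, but not longer than everyone else.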


Difficulty: ⭐️⭐️⭐️⭐️⭐️

The average score on this problem was 29%.


Problem 15.2

What will be the output of the following code?

employees.assign(start=2021-employees.get('years'))
employees.sort_values(by='start').index.iloc[-1]

Answer: an error message complaining about something else

The problem is that the first line of code does not actually add a new column to the employees DataFrame because the expression is not saved. So the second line tries to sort by a column, 'start', that doesn’t exist in the employees DataFrame and runs into an error when it can’t find a column by that name.

This code also has a problem with iloc[-1], since iloc cannot be used on the index, but since the problem with the missing 'start' column is encountered first, that will be the error message displayed.
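Both problems can be observed in a sketch using plain pandas and invented data. The exact error types are those raised by pandas; babypandas may word its messages differently.

```python
import pandas as pd

# Hypothetical employees DataFrame indexed by 'employee_id'.
employees = pd.DataFrame(
    {'years': [3, 10]},
    index=pd.Index(['1001', '2476'], name='employee_id'),
)

employees.assign(start=2021 - employees.get('years'))   # result is discarded

try:
    employees.sort_values(by='start')
    first_error = None
except KeyError:
    first_error = 'KeyError'     # there is no column named 'start'

# The index is not a Series, so it has no .iloc accessor.
has_iloc = hasattr(employees.index, 'iloc')
```

The sort fails first with a KeyError, and has_iloc is False, so the .iloc[-1] call would also fail if execution ever reached it.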


Difficulty: ⭐️⭐️⭐️⭐️⭐️

The average score on this problem was 27%.



Problem 16

You are given a DataFrame called sports, indexed by 'Sport' containing one column, 'PlayersPerTeam'. The first few rows of the DataFrame are shown below:

Sport PlayersPerTeam
baseball 9
basketball 5
field hockey 11


Which of the following evaluates to 'basketball'?

Answer: sports.index[1]

We are told that the DataFrame is indexed by 'Sport' and 'basketball' is one of the elements of the index. To access an element of the index, we use .index to extract the index and square brackets to extract an element at a certain position. Therefore, sports.index[1] will evaluate to 'basketball'.

The first two answer choices attempt to use .loc or .iloc directly on a DataFrame. We typically use .loc or .iloc on a Series that results from using .get on some column. Although we don’t typically do it this way, it is possible to use .loc or .iloc directly on a DataFrame, but doing so would produce an entire row of the DataFrame. Since we want just one word, 'basketball', the first two answer choices must be incorrect.

The last answer choice is incorrect because we can’t use .get with the index, only with a column. The index is never considered a column.
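The three rows shown in the problem are enough to try this out. The sketch below rebuilds them in plain pandas, which is an assumption; the course uses babypandas.

```python
import pandas as pd

# The first few rows of sports, indexed by 'Sport'.
sports = pd.DataFrame(
    {'PlayersPerTeam': [9, 5, 11]},
    index=pd.Index(['baseball', 'basketball', 'field hockey'], name='Sport'),
)

second_sport = sports.index[1]                  # a single label from the index
second_row = sports.iloc[1]                     # an entire row, returned as a Series
sport_is_column = 'Sport' in sports.columns     # the index is not a column
```

Here second_sport is 'basketball', second_row is a whole row rather than one word, and sport_is_column is False.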


Difficulty: ⭐️⭐️

The average score on this problem was 88%.


Source: wi23-midterm — Q1

Problem 17


Problem 17.1

Which column would be an appropriate index for storms?

Answer: None of these.

An index should be unique. In this case, 'Name', 'Date', 'Time', and 'Latitude' all have repeated values, so none of them is unique. Remember that each value of 'Name' is reused every six years, and multiple hurricanes could share the same date, time, or latitude.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 69%.



Source: wi23-midterm — Q4

Problem 18


Problem 18.1

Suppose there are n different storms included in storms. Say we create a new DataFrame from storms by adding a column called 'Duration' that contains the number of minutes since the first data entry for that storm, as an int. The first few rows of this new DataFrame are shown below.


Next we sort this DataFrame in ascending order of 'Duration' and save the result as storms_by_duration. Which of the following statements must be true? Select all that apply.

Answer: “The first n rows of storms_by_duration will all correspond to different storms, because they will contain the first reading from each storm in the data set.”

Let’s first analyze the directions. According to the directions, we added the column 'Duration', so we know how long each storm lasted. Then we sorted the DataFrame in ascending order, which will put the storms with the shortest duration at the top.

Each row will be tied to a unique storm because each storm can only have one minimum. This means storms_by_duration’s first n rows will contain the shortest duration for each unique storm, which corresponds to the first option.

Option 2: This is incorrect because, even though the DataFrame is sorted in ascending order, a single storm can have several large values of 'Duration' that are close together, so the last n rows are not guaranteed to come from different storms. For example, suppose the storm 'alice' has one reading with a duration of 60 and its longest reading has a duration of 62. Then 60 is sorted before 62, but both can fall within the last n rows of the DataFrame, causing 'alice' to appear twice.

Option 3: This is incorrect because there can be more than n rows, since a storm can appear multiple times. For example, the storm Anna appears three times on August 21, 1965, even before sorting.

Option 4: This is incorrect. The code written will take the first n rows of the table, get the names, and find the number of unique named storms. Names are not unique, so it is possible for the storms to share the same name. This can be seen in the DataFrame example above.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 66%.



Source: wi24-final — Q1

Problem 19

Which of the following columns would be an appropriate index for the olympians DataFrame?

Answer: None of these.

To decide what an appropriate index would be, we need to keep in mind that in each row, the index should have a unique value – that is, we want the index to uniquely identify rows of the DataFrame. In this case, there will obviously be repeats in "team" and "sport", since these will appear multiple times for each Olympic event. Although the name is unique for each athlete, the same athlete could compete in multiple Olympics (for example, Michael Phelps competed in both 2008 and 2004). So, none of these options is a valid index.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 74%.


Source: wi25-midterm — Q1

Problem 20

Which of the following columns would be an appropriate index for the dining DataFrame?

Answer: None of these

None of the listed columns would be an appropriate index, since they all possibly contain duplicates. We are told that each row represents a single menu item available at one of the UCSD dining halls. This means that each row represents a combination of both "Dining Hall" and "Item", so no one column is sufficient to uniquely identify a row. We can see this in the preview of the first few rows of the DataFrame. There are multiple rows with the same value in the "Dining Hall" column, and also multiple rows with the same value in the "Item" column.

While "Price" and "Calories" could be unique, it doesn’t make sense to refer to a row by its price or number of calories. Further, we have no information that guarantees the values in these columns are unique (they’re probably not).


Difficulty: ⭐️⭐️

The average score on this problem was 76%.