Discussion 2: Python Basics, Arrays, and DataFrames



The problems in this worksheet are taken from past exams. Work on them on paper, since the exams you take in this course will also be on paper.

We encourage you to complete this worksheet in a live discussion section. Solutions will be made available after all discussion sections have concluded. You don’t need to submit your answers anywhere.

Note: We do not plan to cover all problems here in the live discussion section; the problems we don’t cover can be used for extra practice.


Problem 1

IKEA is a Swedish furniture company that designs and sells ready-to-assemble furniture and other home furnishings.

An IKEA fan created an app where people can log the amount of time it took them to assemble their IKEA furniture. The DataFrame app_data has a row for each product build that was logged on the app. The columns are:

The first few rows of app_data are shown below, though app_data has many more rows than pictured (5000 rows total).

 

Assume that we have already run import babypandas as bpd and import numpy as np.


Suppose that when someone downloads the app, the app requires them to choose a username, which must be different from all other registered usernames.

True or False: If app_data had included a column with the username of the person who reported each product build, it would make sense to index app_data by username.

Answer: False

Even though people must have distinct usernames, one person can build multiple different IKEA products and log their time for each build. So we don’t expect every row of app_data to have a distinct username associated with it, and therefore username would not be suitable as an index, since the index should have distinct values.
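As a rough illustration of the reasoning above, the sketch below builds a tiny made-up table in which one user logs two builds. The usernames, products, and times are invented, and plain pandas is assumed as a stand-in for babypandas.

```python
import pandas as pd  # assumption: pandas stands in for babypandas

# Hypothetical rows, invented for illustration only.
builds = pd.DataFrame({
    'username': ['ana', 'ben', 'ana'],
    'product': ['chair', 'desk', 'shelf'],
    'minutes': [30, 75, 45],
})

# 'ana' logged two builds, so there are fewer distinct usernames than rows.
num_distinct = len(builds.get('username').unique())
has_repeats = num_distinct < builds.shape[0]
print(has_repeats)
```

Because the column has repeated values, it would not work well as an index.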


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 52%.


Problem 2

Assume you have a DataFrame named ikea that contains information about IKEA products, including the following columns: 'product' (str), the name of the product; 'assembly_cost' (int), the assembly cost of each product; and 'packages' (int), the number of packages each product comes in.

Complete the expression below so that it evaluates to the name of the product for which the average assembly cost per package is lowest.

(ikea.assign(assembly_per_package = ___(a)___)
     .sort_values(by='assembly_per_package').___(b)___)


Problem 2.1

What goes in blank (a)?

Answer: ikea.get('assembly_cost')/ikea.get('packages')

This column, as its name suggests, contains the average assembly cost per package, obtained by dividing the total assembly cost of each product by the number of packages that product comes in. This code uses the fact that arithmetic operations between two Series happen element-wise.
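A minimal sketch of element-wise division, using a small made-up ikea table. The product names and numbers are hypothetical, and pandas is assumed in place of babypandas.

```python
import pandas as pd  # assumption: pandas stands in for babypandas

# Hypothetical data, invented for illustration only.
ikea = pd.DataFrame({
    'product': ['desk', 'shelf', 'lamp'],
    'assembly_cost': [90, 40, 30],
    'packages': [3, 2, 1],
})

# Each cost is divided by the package count in the same row:
# 90/3, 40/2, 30/1
per_package = ikea.get('assembly_cost') / ikea.get('packages')
print(per_package)
```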


Difficulty: ⭐️

The average score on this problem was 91%.


Problem 2.2

What goes in blank (b)?

Answer: get('product').iloc[0]

After adding the 'assembly_per_package' column and sorting by that column in the default ascending order, the product with the lowest 'assembly_per_package' will be in the very first row. To access the name of that product, we need to get the column containing product names and use iloc to access an element of that Series by integer position.
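Putting both blanks together, the full expression can be tried on a small made-up table. The rows below are hypothetical, and pandas is assumed in place of babypandas.

```python
import pandas as pd  # assumption: pandas stands in for babypandas

# Hypothetical data, invented for illustration only.
ikea = pd.DataFrame({
    'product': ['desk', 'shelf', 'lamp'],
    'assembly_cost': [90, 40, 30],
    'packages': [3, 2, 1],
})

# Blank (a) fills the assign call; blank (b) is .get('product').iloc[0].
cheapest = (ikea.assign(assembly_per_package=ikea.get('assembly_cost') / ikea.get('packages'))
                .sort_values(by='assembly_per_package')
                .get('product').iloc[0])
print(cheapest)
```

Here the per-package costs are 30, 20, and 30, so the shelf ends up in the first row after sorting.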


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 66%.



Problem 3

For this question, we will work with a dataset consisting of various skyscrapers in the US, which we’ve loaded into a DataFrame called sky. The first few rows of sky are shown below (though the full DataFrame has more rows):

 

Each row of sky corresponds to a single skyscraper. For each skyscraper, we have:

Note that the height of a floor may be different in each building.

Below, identify the data type of the result of each of the following expressions, or select “error” if you believe the expression results in an error.


Problem 3.1

sky.sort_values('height')

Answer: DataFrame

sky is a DataFrame. All the sort_values method does is change the order of the rows in the Series/DataFrame it is called on; it does not change the data structure. As such, sky.sort_values('height') is also a DataFrame.


Difficulty: ⭐️⭐️

The average score on this problem was 87%.


Problem 3.2

sky.sort_values('height').get('material').loc[0]

Answer: error

sky.sort_values('height') is a DataFrame, and sky.sort_values('height').get('material') is a Series corresponding to the 'material' column, sorted by 'height' in increasing order. So far, there are no errors.

Remember, the .loc accessor is used to access elements in a Series based on their index. sky.sort_values('height').get('material').loc[0] is asking for the element in the sky.sort_values('height').get('material') Series with index 0. However, the index of sky is made up of building names. Since there is no building named 0, .loc[0] causes an error.
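The sketch below reproduces this failure on a tiny made-up version of sky. The building names and values are hypothetical, and pandas is assumed in place of babypandas; in pandas, looking up a missing label with .loc raises a KeyError.

```python
import pandas as pd  # assumption: pandas stands in for babypandas

# Hypothetical skyscrapers, indexed by name as in the problem.
sky = pd.DataFrame({
    'material': ['steel', 'concrete'],
    'height': [120.0, 95.0],
}, index=['Tower A', 'Tower B'])

materials = sky.sort_values('height').get('material')

try:
    materials.loc[0]       # no building is labeled 0
    raised = False
except KeyError:
    raised = True

print(raised)
```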


Difficulty: ⭐️⭐️

The average score on this problem was 79%.


Problem 3.3

sky.sort_values('height').get('material').iloc[0]

Answer: string

As we mentioned above, sky.sort_values('height').get('material') is a Series containing values from the 'material' column (but sorted). Remember, there is no element in this Series with an index of 0, so sky.sort_values('height').get('material').loc[0] errors. However, .iloc[0] works differently than .loc[0]; .iloc[0] will give us the first element in a Series (independent of what’s in the index). So, sky.sort_values('height').get('material').iloc[0] gives us back a value from the 'material' column, which is made up of strings, so it gives us a string. (Specifically, it gives us the 'material' type of the skyscraper with the smallest 'height'.)
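For contrast, here is the same kind of made-up table accessed by position instead of by label. The data is hypothetical and pandas is assumed in place of babypandas.

```python
import pandas as pd  # assumption: pandas stands in for babypandas

# Hypothetical skyscrapers, indexed by name as in the problem.
sky = pd.DataFrame({
    'material': ['steel', 'concrete'],
    'height': [120.0, 95.0],
}, index=['Tower A', 'Tower B'])

# .iloc[0] takes the first element by position, whatever the index holds.
first_material = sky.sort_values('height').get('material').iloc[0]
print(first_material, type(first_material))
```

'Tower B' has the smaller height, so its material comes first after sorting.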


Difficulty: ⭐️⭐️

The average score on this problem was 89%.


Problem 3.4

sky.get('floors').max()

Answer: int or float

The Series sky.get('floors') is made up of integers, and sky.get('floors').max() evaluates to the largest number in the Series, which is also an integer.


Difficulty: ⭐️

The average score on this problem was 91%.


Problem 3.5

sky.index[0]

Answer: string

sky.index contains the values 'Bayard-Condict Building', 'The Yacht Club at Portofino', 'City Investing Building', etc. sky.index[0] is then 'Bayard-Condict Building', which is a string.


Difficulty: ⭐️

The average score on this problem was 91%.



Problem 4

Included is a DataFrame named sungod that contains information on the artists who have performed at Sun God in years past. For each year that the festival was held, we have one row for each artist that performed that year. The columns are:

The rows of sungod are arranged in no particular order. The first few rows of sungod are shown below (though sungod has many more rows than pictured here).

Assume:

Which of the following is a valid reason not to set the index of sungod to 'Artist'? Select all correct answers.

Answer: Two different artists have the same name., An artist performed at Sun God in more than one year.

For this question, it is crucial to know that an index should not contain duplicate values, so we need to consider reasons why 'Artist' might contain two values that are the same. Let’s go through the answer choices in order.

For the first option, if two different artists had the same name, this would lead to duplicate values in the 'Artist' column. Therefore, this is a valid reason not to index sungod by 'Artist'.

For the second option, if one artist performed at Sun God in more than one year, their name would appear multiple times in the 'Artist' column, once for each year they performed. This would also be a valid reason not to index sungod by 'Artist'.

For the third option, if several different artists performed at Sun God in the same year, that would not necessarily create duplicates in the 'Artist' column, unless of course two of the artists had the same name, which we’ve already addressed in the first answer choice. This is not a valid reason to avoid indexing sungod by 'Artist'.

For the last answer choice, if many different artists share the same value of 'Appearance_Order', this would not create duplicates in the 'Artist' column. Therefore, this is also not a valid reason to avoid indexing sungod by 'Artist'.


Difficulty: ⭐️⭐️

The average score on this problem was 83%.


Problem 5

Suppose in a new cell, we type the following.

    sungod.sort_values(by='Year')

After we run that cell, we type the following in a second cell.

    sungod.get('Artist').iloc[0]

What is the output when we run the second cell? Note that the first Sun God festival was held in 1983.

Answer: 'Blues Traveler'

In the first cell, although we seem to be sorting sungod by 'Year', we aren’t actually changing the DataFrame sungod at all because we don’t save the sorted DataFrame. Remember that DataFrame methods return a new DataFrame and leave the original unchanged; to keep the result, you must explicitly assign the output back to the name of the DataFrame. So the first 'Artist' name will still be 'Blues Traveler'.

Suppose we had saved the sorted DataFrame as in the code below.

    sungod = sungod.sort_values(by='Year')   
    sungod.get('Artist').iloc[0]

In this case, the output would be the name of an artist who appeared in 1983, but not necessarily the one who appeared first. There will be several artists associated with the year 1983, and we don’t know which of them will be first in the sorted DataFrame.
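Both cases can be checked on a small made-up table. The artist names and years below are placeholders rather than real Sun God data, and pandas is assumed in place of babypandas.

```python
import pandas as pd  # assumption: pandas stands in for babypandas

# Hypothetical rows, invented for illustration only.
sungod = pd.DataFrame({
    'Year': [2015, 1983, 1983],
    'Artist': ['Artist X', 'Artist Y', 'Artist Z'],
})

sungod.sort_values(by='Year')                  # result is discarded
first_unsorted = sungod.get('Artist').iloc[0]  # still the original first row

sorted_sungod = sungod.sort_values(by='Year')  # result is saved
first_sorted_year = sorted_sungod.get('Year').iloc[0]

print(first_unsorted, first_sorted_year)
```

The sketch checks the year of the first sorted row rather than the artist, since two rows are tied at 1983 and either could come first.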


Difficulty: ⭐️⭐️⭐️⭐️⭐️

The average score on this problem was 12%.


Problem 6

Write one line of code below to create a DataFrame called openers containing the artists that appeared first on stage at a past Sun God festival. The DataFrame openers should have all the same columns as sungod.

Answer: openers = sungod[sungod.get('Appearance_Order')==1]

Since we want only certain rows of sungod, we need to query. The condition to satisfy is that the 'Appearance_Order' column should have a value of 1 to indicate that this artist performed first in a certain year’s festival.
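A minimal sketch of this query on made-up rows. The artist names are placeholders, and pandas is assumed in place of babypandas.

```python
import pandas as pd  # assumption: pandas stands in for babypandas

# Hypothetical rows, invented for illustration only.
sungod = pd.DataFrame({
    'Year': [2015, 2015, 2016],
    'Artist': ['Artist X', 'Artist Y', 'Artist Z'],
    'Appearance_Order': [1, 2, 1],
})

# The comparison produces a Series of True/False values, one per row;
# only the rows marked True are kept.
openers = sungod[sungod.get('Appearance_Order') == 1]
print(openers)
```

The result keeps every column of sungod and only the rows where the condition holds.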


Difficulty: ⭐️⭐️

The average score on this problem was 84%.


Problem 7

Suppose you are given a DataFrame of employees for a given company. The DataFrame, called employees, is indexed by 'employee_id' (string) with a column called 'years' (int) that contains the number of years each employee has worked for the company.


Problem 7.1

Suppose that the code

employees.sort_values(by='years', ascending=False).index[0]

outputs '2476'.

True or False: The number of years that employee 2476 has worked for the company is greater than the number of years that any other employee has worked for the company.

Answer: False

This is false because there could be other employees who have worked at the company for just as long as employee 2476.

The code says that when the employees DataFrame is sorted in descending order of 'years', employee 2476 is in the first row. There might, however, be a tie among several employees for their value of 'years'. In that case, employee 2476 may wind up in the first row of the sorted DataFrame, but we cannot say that the number of years employee 2476 has worked for the company is greater than the number of years that any other employee has worked for the company.

If the statement had said greater than or equal to instead of greater than, the statement would have been true.
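The tie scenario can be sketched with a made-up employees table in which two people share the largest value of 'years'. Apart from '2476', the IDs and all values are hypothetical, and pandas is assumed in place of babypandas.

```python
import pandas as pd  # assumption: pandas stands in for babypandas

# Hypothetical employees; two of them are tied at 10 years.
employees = pd.DataFrame(
    {'years': [10, 10, 3]},
    index=['2476', '1111', '0042'],
)
employees.index.name = 'employee_id'

top_id = employees.sort_values(by='years', ascending=False).index[0]
max_years = employees.get('years').max()

# Count how many employees share the maximum.
num_at_max = (employees.get('years') == max_years).sum()
print(top_id, num_at_max)
```

Whichever ID lands in the first row, that employee's tenure is greater than or equal to everyone else's, but not strictly greater.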


Difficulty: ⭐️⭐️⭐️⭐️⭐️

The average score on this problem was 29%.


Problem 7.2

What will be the output of the following code?

employees.assign(start=2021-employees.get('years'))
employees.sort_values(by='start').index.iloc[-1]

Answer: an error message complaining about something else

The problem is that the first line of code does not actually add a new column to the employees DataFrame because the result of assign is not saved. So the second line tries to sort by a column, 'start', that doesn’t exist in the employees DataFrame and runs into an error when it can’t find a column by that name.

This code also has a problem with iloc[-1], since iloc cannot be used on the index (plain brackets, as in .index[-1], would be needed instead). However, since the problem with the missing 'start' column is encountered first, that will be the error message displayed.
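Both problems can be observed on a small made-up employees table. The IDs and values are hypothetical, and pandas is assumed in place of babypandas, so the exact error messages in the course environment may differ.

```python
import pandas as pd  # assumption: pandas stands in for babypandas

# Hypothetical employees, invented for illustration only.
employees = pd.DataFrame(
    {'years': [10, 4, 3]},
    index=['2476', '1111', '0042'],
)

employees.assign(start=2021 - employees.get('years'))  # result is discarded
has_start = 'start' in employees.columns

# Sorting by the missing column fails before .index.iloc is ever reached.
try:
    employees.sort_values(by='start')
    first_error = None
except KeyError as err:
    first_error = type(err).__name__

# Separately, an index object has no iloc accessor.
index_has_iloc = hasattr(employees.index, 'iloc')
print(has_start, first_error, index_has_iloc)
```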


Difficulty: ⭐️⭐️⭐️⭐️⭐️

The average score on this problem was 27%.


