Discussion 2: Arrays and DataFrames


Welcome! The problems shown below should be worked out on paper, since the quizzes and exams you take in this course will also be on paper.

We encourage you to complete this worksheet in a live discussion section. Solutions will be made available after all discussion sections have concluded. You don’t need to submit your answers anywhere.

Problem 1

Evaluate the expression (np.arange(1, 7, 2.5) * np.arange(8, 2, -2))[2] .

Answer: 24.0

Although this question looks daunting at first, it is best solved by breaking it into parts. Start with np.arange(1, 7, 2.5). np.arange() creates a numpy array of regularly spaced values between a start value and an end value (the start is inclusive, the end is exclusive). Here, the start is 1, the end is 7, and the step size is 2.5. So np.arange(1, 7, 2.5) evaluates to np.array([1.0, 3.5, 6.0]): we start at 1 and keep adding 2.5, stopping at the last value that is less than 7. The resulting array contains only floats because the step size, 2.5, is not an int, and all elements of an array must have the same data type. Next, consider np.arange(8, 2, -2). The negative step size may seem tricky, but the logic is the same as before: we start at 8 and keep subtracting 2, stopping before we reach 2. The result is np.array([8, 6, 4]). Now that we have evaluated both np.arange(1, 7, 2.5) and np.arange(8, 2, -2), it is time to multiply.
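To check the two np.arange calls described above, a short sketch like the following can be run in a notebook. The variable names first and second are only for illustration and do not appear in the problem.

```python
import numpy as np

# Start at 1, step by 2.5, stop before reaching 7.
first = np.arange(1, 7, 2.5)

# Start at 8, step by -2, stop before reaching 2.
second = np.arange(8, 2, -2)

print(first)         # values 1.0, 3.5, 6.0
print(second)        # values 8, 6, 4
print(first.dtype)   # a float type, because the step 2.5 is a float
print(second.dtype)  # an integer type, because every argument is an int
```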

Multiplying two numpy arrays of the same length multiplies them element by element (pairwise). In our case, np.array([1.0, 3.5, 6.0]) * np.array([8, 6, 4]) results in np.array([8.0, 21.0, 24.0]). Again paying attention to data types, the result contains floats rather than ints because multiplying an int by a float gives a float. Now that we have evaluated (np.arange(1, 7, 2.5) * np.arange(8, 2, -2)) to be np.array([8.0, 21.0, 24.0]), we just need to access the element at position 2 (positions start at 0), which is 24.0.
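The multiplication and indexing steps can be confirmed with a sketch like the one below. The names product and third are illustrative only.

```python
import numpy as np

# Element-by-element product of the two arrays from the problem.
product = np.arange(1, 7, 2.5) * np.arange(8, 2, -2)

# Positions start at 0, so position 2 is the third element.
third = product[2]

print(product)  # values 8.0, 21.0, 24.0
print(third)    # 24.0
```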

Problem 2

For the problems that follow, we will work with a dataset consisting of various skyscrapers in the US, which we’ve loaded into a DataFrame called sky. The first few rows of sky are shown below (though the full DataFrame has more rows):


Each row of sky corresponds to a single skyscraper. For each skyscraper, we have:

Below, identify the data type of the result of each of the following expressions, or select “error” if you believe the expression results in an error.

Problem 2.1

sky.sort_values('height')

Answer: DataFrame

sky is a DataFrame. The sort_values method only changes the order of the rows in the Series/DataFrame it is called on; it does not change the data structure. As such, sky.sort_values('height') is also a DataFrame.
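As a rough illustration, the sketch below builds a tiny stand-in for sky using pandas and checks the type after sorting. The building names come from this worksheet, but the heights are invented for the example and are not the real data.

```python
import pandas as pd

# Hypothetical stand-in for sky; the heights are made up.
sky = pd.DataFrame(
    {"height": [200, 300, 100]},
    index=[
        "Bayard-Condict Building",
        "The Yacht Club at Portofino",
        "City Investing Building",
    ],
)

sorted_sky = sky.sort_values('height')

print(type(sorted_sky))        # still a DataFrame
print(list(sorted_sky.index))  # same rows, now ordered by increasing height
```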

Difficulty: ⭐️⭐️

The average score on this problem was 87%.

Problem 2.2

sky.sort_values('height').get('material').loc[0]

Answer: error

sky.sort_values('height') is a DataFrame, and sky.sort_values('height').get('material') is a Series corresponding to the 'material' column, sorted by 'height' in increasing order. So far, there are no errors.

Remember, the .loc accessor is used to access elements in a Series based on their index. sky.sort_values('height').get('material').loc[0] is asking for the element in the sky.sort_values('height').get('material') Series with index 0. However, the index of sky is made up of building names. Since there is no building named 0, .loc[0] causes an error.
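The failure can be reproduced on a small stand-in for sky, sketched here with pandas. The materials and heights are invented for illustration. In pandas, a missing label in .loc raises a KeyError, which the sketch catches so that it runs to completion.

```python
import pandas as pd

# Hypothetical stand-in for sky; the values are made up.
sky = pd.DataFrame(
    {
        "material": ["steel", "concrete", "concrete"],
        "height": [200, 300, 100],
    },
    index=[
        "Bayard-Condict Building",
        "The Yacht Club at Portofino",
        "City Investing Building",
    ],
)

materials = sky.sort_values('height').get('material')

# The index holds building names, so the label 0 does not exist.
try:
    materials.loc[0]
    raised = False
except KeyError:
    raised = True

# .loc does work when given a label that is actually in the index.
by_name = materials.loc['City Investing Building']

print(raised)   # True
print(by_name)  # concrete
```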

Difficulty: ⭐️⭐️

The average score on this problem was 79%.

Problem 2.3

sky.sort_values('height').get('material').iloc[0]

Answer: string

As we mentioned above, sky.sort_values('height').get('material') is a Series containing values from the 'material' column (but sorted). Remember, there is no element in this Series with an index of 0, so sky.sort_values('height').get('material').loc[0] errors. However, .iloc[0] works differently than .loc[0]; .iloc[0] will give us the first element in a Series (independent of what’s in the index). So, sky.sort_values('height').get('material').iloc[0] gives us back a value from the 'material' column, which is made up of strings, so it gives us a string. (Specifically, it gives us the 'material' type of the skyscraper with the smallest 'height'.)
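A minimal sketch of the .iloc[0] behavior, again using pandas and a made-up stand-in for sky (the materials and heights are assumptions for the example):

```python
import pandas as pd

# Hypothetical stand-in for sky; the values are made up.
sky = pd.DataFrame(
    {
        "material": ["steel", "concrete", "concrete"],
        "height": [200, 300, 100],
    },
    index=[
        "Bayard-Condict Building",
        "The Yacht Club at Portofino",
        "City Investing Building",
    ],
)

# .iloc[0] takes the first element by position, ignoring the index labels.
first_material = sky.sort_values('height').get('material').iloc[0]

print(first_material)                   # material of the shortest building
print(isinstance(first_material, str))  # True
```

In this made-up data, the shortest building is 'City Investing Building', so the result is its 'material' value rather than the one in the first row of the unsorted DataFrame.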

Difficulty: ⭐️⭐️

The average score on this problem was 89%.

Problem 2.4

sky.get('floors').max()

Answer: int or float

The Series sky.get('floors') is made up of integers, and sky.get('floors').max() evaluates to the largest number in the Series, which is also an integer.

Difficulty: ⭐️

The average score on this problem was 91%.

Problem 2.5

sky.index[0]

Answer: string

sky.index contains the values 'Bayard-Condict Building', 'The Yacht Club at Portofino', 'City Investing Building', etc. sky.index[0] is then 'Bayard-Condict Building', which is a string.
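The same check can be sketched on a small stand-in for sky built with pandas. Only the index matters here; the 'floors' values are invented so that the DataFrame has a column.

```python
import pandas as pd

# Hypothetical stand-in for sky; the floor counts are made up.
sky = pd.DataFrame(
    {"floors": [12, 44, 34]},
    index=[
        "Bayard-Condict Building",
        "The Yacht Club at Portofino",
        "City Investing Building",
    ],
)

first_name = sky.index[0]

print(first_name)        # Bayard-Condict Building
print(type(first_name))  # str
```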

Difficulty: ⭐️

The average score on this problem was 91%.

Problem 3

Problem 3.1

Write a single line of code that evaluates to the name of the tallest skyscraper in the sky DataFrame.

Answer: sky.sort_values(by='height', ascending=False).index[0]

To answer this question, we first sort the DataFrame by the column we are interested in, 'height'. Because we want the tallest building, we set the ascending parameter to False so that the heights appear in descending order, which gives sky.sort_values(by='height', ascending=False). After sorting, the tallest building is in the first row of the sorted DataFrame, so all that remains is to get its name, which is stored in the index. We can access the index of a DataFrame with .index, and since we want its first element, we add [0]. Putting it all together, the name of the tallest skyscraper in sky is sky.sort_values(by='height', ascending=False).index[0].
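To see the answer in action, the sketch below runs it on a tiny stand-in for sky built with pandas. The heights are invented, so the building that comes out as tallest here says nothing about the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for sky; the heights are made up.
sky = pd.DataFrame(
    {"height": [200, 300, 100]},
    index=[
        "Bayard-Condict Building",
        "The Yacht Club at Portofino",
        "City Investing Building",
    ],
)

# Sort tallest-first, then take the first label in the index.
tallest_name = sky.sort_values(by='height', ascending=False).index[0]

print(tallest_name)  # the building with height 300 in this made-up data
```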

Problem 3.2

Write a single line of code that evaluates to the average number of floors across all skyscrapers in the DataFrame.

Answer: sky.get('floors').mean()

To answer this question, we first need the number of floors in each skyscraper, which we can get with sky.get('floors'). This is a Series, so we can find the average number of floors with the .mean() method. Putting this all together, we get sky.get('floors').mean().
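A quick sketch of this answer on a made-up stand-in for sky, using pandas (the floor counts are invented and chosen so the average is easy to verify by hand):

```python
import pandas as pd

# Hypothetical stand-in for sky; the floor counts are made up.
sky = pd.DataFrame(
    {"floors": [12, 44, 34]},
    index=[
        "Bayard-Condict Building",
        "The Yacht Club at Portofino",
        "City Investing Building",
    ],
)

# (12 + 44 + 34) / 3 = 30.0
average_floors = sky.get('floors').mean()

print(average_floors)  # 30.0
```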

Problem 3.3

Write a single line of code that evaluates to the height of the tallest skyscraper in New York City.

Answer: sky[sky.get('city') == 'New York City'].get('height').max()

In order to answer this question, we must first query the DataFrame to only include skyscrapers that are located in New York City. We can do this with a line such as sky[sky.get('city') == 'New York City']. After doing this, we know that the resulting DataFrame is only going to include skyscrapers from New York City, and we now can focus on getting the tallest building. In order to do so, we first need to get the heights of all the buildings in the resulting DataFrame which can be done with .get('height'). Now that we have gotten all the heights, we finally need to get the largest height, which can simply be done by using the .max() Series method. Putting it all together, we have a line that looks like sky[sky.get('city') == 'New York City'].get('height').max().
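The query-then-max pattern can be tried on a small stand-in for sky, sketched here with pandas. The cities and heights are invented for the example, and the name nyc is only a convenience for showing the intermediate step.

```python
import pandas as pd

# Hypothetical stand-in for sky; the cities and heights are made up.
sky = pd.DataFrame(
    {
        "city": ["New York City", "Miami Beach", "New York City"],
        "height": [200, 300, 100],
    },
    index=[
        "Bayard-Condict Building",
        "The Yacht Club at Portofino",
        "City Investing Building",
    ],
)

# Keep only the rows where 'city' is 'New York City'.
nyc = sky[sky.get('city') == 'New York City']

# The answer from the solution, written as a single line.
tallest_nyc = sky[sky.get('city') == 'New York City'].get('height').max()

print(len(nyc))     # 2 rows remain after the query
print(tallest_nyc)  # 200, the largest height among the New York City rows
```

Note that the tallest building overall in this made-up data (height 300) is excluded by the query, which is why the result is 200.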
