# Extra Practice Session 2: Arrays, DataFrames, Querying, and Grouping

Welcome! The problems shown below should be worked on on paper, since the quizzes and exams you take in this course will also be on paper. You do not need to submit your solutions anywhere.

We encourage you to complete this worksheet in groups during an extra practice session on Friday, January 19. Solutions will be posted after all sessions have finished. This problem set is not designed to take any particular amount of time - focus on understanding concepts, not on getting through all the questions.

## Problem 1

Consider the following assignment statement.

puffin = np.array([5, 9, 13, 17, 21])

### Problem 1.1

Provide arguments to call np.arange with so that the array penguin is identical to the array puffin.

penguin = np.arange(____)

Answer: We need to provide np.arange with three arguments: 5, anything in (21, 25], 4. For instance, something line penguin = np.arange(5, 25, 4) would work.

##### Difficulty: ⭐️

The average score on this problem was 90%.

### Problem 1.2

Fill in the blanks so that the array parrot is also identical to the array puffin.
Hint: Start by choosing y so that parrot has length 5.

parrot = __(x)__ * np.arange(0, __(y)__, 2) + __(z)__

• x: 2
• y: anything in (8, 10]
• z: 5

##### Difficulty: ⭐️⭐️⭐️

The average score on this problem was 74%.

## Problem 2

Suppose students is a DataFrame of all students who took DSC 10 last quarter. students has one row per student, where:

• The index contains students’ PIDs as strings starting with "A".

• The "Overall" column contains students’ overall percentage grades as floats.

• The "Animal" column contains students’ favorite animals as strings.

### Problem 2.1

What type is students.get("Overall")? If this expression errors, select “this errors."

• float

• string

• array

• Series

• this errors

##### Difficulty: ⭐️⭐️⭐️

The average score on this problem was 73%.

### Problem 2.2

What type is students.get("PID")? If this expression errors, select “this errors."

• float

• string

• array

• Series

• this errors

##### Difficulty: ⭐️⭐️⭐️

The average score on this problem was 67%.

Vanessa is one student who took DSC 10 last quarter. Her PID is A12345678, she earned the sixth-highest overall percentage grade in the class, and her favorite animal is the giraffe.

### Problem 2.3

Supposing that students is already sorted by "Overall" in descending order, fill in the blanks so that animal_one and animal_two both evaluate to "giraffe".

animal_one = students.get(__(x)__).loc[__(y)__]
animal_two = students.get(__(x)__).iloc[__(z)__]

• x: "Animal"
• y: "A12345678"
• z: 5

##### Difficulty: ⭐️⭐️⭐️

The average score on this problem was 69%.

### Problem 2.4

If students wasn’t already sorted by "Overall" in descending order, which of your answers would need to change?

• Neither y nor z would need to change

• Both y and z would need to change

• y only

• z only

Answer: z only

##### Difficulty: ⭐️⭐️

The average score on this problem was 82%.

## Problem 3

You are given a DataFrame called sports, indexed by 'Sport' containing one column, 'PlayersPerTeam'. The first few rows of the DataFrame are shown below:

Sport PlayersPerTeam
baseball 9
field hockey 11

Which of the following evaluates to 'basketball'?

• sports.loc[1]

• sports.iloc[1]

• sports.index[1]

• sports.get('Sport').iloc[1]

Answer: sports.index[1]

We are told that the DataFrame is indexed by 'Sport' and 'basketball' is one of the elements of the index. To access an element of the index, we use .index to extract the index and square brackets to extract an element at a certain position. Therefore, sports.index[1] will evaluate to 'basketball'.

The first two answer choices attempt to use .loc or .iloc directly on a DataFrame. We typically use .loc or .iloc on a Series that results from using .get on some column. Although we don’t typically do it this way, it is possible to use .loc or .iloc directly on a DataFrame, but doing so would produce an entire row of the DataFrame. Since we want just one word, 'basketball', the first two answer choices must be incorrect.

The last answer choice is incorrect because we can’t use .get with the index, only with a column. The index is never considered a column.

##### Difficulty: ⭐️⭐️

The average score on this problem was 88%.

## Problem 4

Suppose you are given a DataFrame of employees for a given company. The DataFrame, called employees, is indexed by 'employee_id' (string) with a column called 'years' (int) that contains the number of years each employee has worked for the company.

### Problem 4.1

Suppose that the code

employees.sort_values(by='years', ascending=False).index[0]

outputs '2476'.

True or False: The number of years that employee 2476 has worked for the company is greater than the number of years that any other employee has worked for the company.

• True

• False

This is false because there could be other employees who worked at the company equally long as employee 2476.

The code says that when the employees DataFrame is sorted in descending order of 'years', employee 2476 is in the first row. There might, however, be a tie among several employees for their value of 'years'. In that case, employee 2476 may wind up in the first row of the sorted DataFrame, but we cannot say that the number of years employee 2476 has worked for the company is greater than the number of years that any other employee has worked for the company.

If the statement had said greater than or equal to instead of greater than, the statement would have been true.

##### Difficulty: ⭐️⭐️⭐️⭐️⭐️

The average score on this problem was 29%.

### Problem 4.2

What will be the output of the following code?

employees.assign(start=2021-employees.get('years'))
employees.sort_values(by='start').index.iloc[-1]
• the employee id of an employee who has worked there for the most years

• the employee id of an employee who has worked there for the fewest years

• an error message complaining about iloc[-1]

• an error message complaining about something else

The problem is that the first line of code does not actually add a new column to the employees DataFrame because the expression is not saved. So the second line tries to sort by a column, 'start', that doesn’t exist in the employees DataFrame and runs into an error when it can’t find a column by that name.

This code also has a problem with iloc[-1], since iloc cannot be used on the index, but since the problem with the missing 'start' column is encountered first, that will be the error message displayed.

##### Difficulty: ⭐️⭐️⭐️⭐️⭐️

The average score on this problem was 27%.

## Problem 5

Suppose df is a DataFrame and b is any boolean array whose length is the same as the number of rows of df.

True or False: For any such boolean array b, df[b].shape[0] is less than or equal to df.shape[0].

• True

• False

The brackets in df[b] perform a query, or filter, to keep only the rows of df for which b has a True entry. Typically, b will come from some condition, such as the entry in a certain column of df equaling a certain value. Regardless, df[b] contains a subset of the rows of df, and .shape[0] counts the number of rows, so df[b].shape[0] must be less than or equal to df.shape[0].

##### Difficulty: ⭐️⭐️

The average score on this problem was 86%.

## Problem 6

You are given a DataFrame called books that contains columns 'author' (string), 'title' (string), 'num_chapters' (int), and 'publication_year' (int).

Suppose that after doing books.groupby('Author').max(), one row says

author title num_chapters publication_year
Charles Dickens Oliver Twist 53 1838

### Problem 6.1

Based on this data, can you conclude that Charles Dickens is the alphabetically last of all author names in this dataset?

• Yes

• No

When we group by 'Author', all books by the same author get aggregated together into a single row. The aggregation function is applied separately to each other column besides the column we’re grouping by. Since we’re grouping by 'Author' here, the 'Author'column never has the max() function applied to it. Instead, each unique value in the 'Author' column becomes a value in the index of the grouped DataFrame. We are told that the Charles Dickens row is just one row of the output, but we don’t know anything about the other rows of the output, or the other authors. We can’t say anything about where Charles Dickens falls when authors are ordered alphabetically (but it’s probably not last!)

##### Difficulty: ⭐️

The average score on this problem was 94%.

### Problem 6.2

Based on this data, can you conclude that Charles Dickens wrote Oliver Twist?

• Yes

• No

Grouping by 'Author' collapses all books written by the same author into a single row. Since we’re applying the max() function to aggregate these books, we can conclude that Oliver Twist is alphabetically last among all books in the books DataFrame written by Charles Dickens. So Charles Dickens did write Oliver Twist based on this data.

##### Difficulty: ⭐️

The average score on this problem was 95%.

### Problem 6.3

Based on this data, can you conclude that Oliver Twist has 53 chapters?

• Yes

• No

The key to this problem is that groupby applies the aggregation function, max() in this case, independently to each column. The output should be interpreted as follows:

• Among all books in books written by Charles Dickens, Oliver Twist is the title that is alphabetically last.
• Among all books in books written by Charles Dickens, 53 is the greatest number of chapters.
• Among all books in books written by Charles Dickens, 1838 is the latest year of publication.

However, the book titled Oliver Twist, the book with 53 chapters, and the book published in 1838 are not necessarily all the same book. We cannot conclude, based on this data, that Oliver Twist has 53 chapters.

##### Difficulty: ⭐️⭐️⭐️

The average score on this problem was 74%.

### Problem 6.4

Based on this data, can you conclude that Charles Dickens wrote a book with 53 chapters that was published in 1838?

• Yes

• No

As explained in the previous question, the max() function is applied separately to each column, so the book written by Charles Dickens with 53 chapters may not be the same book as the book written by Charles Dickens published in 1838.

##### Difficulty: ⭐️⭐️⭐️

The average score on this problem was 73%.

## Problem 7

King Triton, UCSD’s mascot, is quite the traveler! For this question, we will be working with the flights DataFrame, which details several facts about each of the flights that King Triton has been on over the past few years. The first few rows of flights are shown below.

Here’s a description of the columns in flights:

• 'DATE': the date on which the flight occurred. Assume that there were no “redeye” flights that spanned multiple days.
• 'FLIGHT': the flight number. Note that this is not unique; airlines reuse flight numbers on a daily basis.
• 'FROM' and 'TO': the 3-letter airport code for the departure and arrival airports, respectively. Note that it’s not possible to have a flight from and to the same airport.
• 'DIST': the distance of the flight, in miles.
• 'HOURS': the length of the flight, in hours.
• 'SEAT': the kind of seat King Triton sat in on the flight; the only possible values are 'WINDOW', 'MIDDLE', and 'AISLE'.

### Problem 7.1

Which of these correctly evaluates to the number of flights King Triton took to San Diego (airport code 'SAN')?

• flights.loc['SAN'].shape[0]

• flights[flights.get('TO') == 'SAN'].shape[0]

• flights[flights.get('TO') == 'SAN'].shape[1]

• len(flights.sort_values('TO', ascending=False).loc['SAN'])

Answer: flights[flights.get('TO') == 'SAN'].shape[0]

The strategy is to create a DataFrame with only the flights that went to San Diego, then count the number of rows. The first step is to query with the condition flights.get('TO') == 'SAN' and the second step is to extract the number of rows with .shape[0].

Some of the other answer choices use .loc['SAN'] but .loc only works with the index, and flights does not have airport codes in its index.

##### Difficulty: ⭐️

The average score on this problem was 95%.

### Problem 7.2

Fill in the blanks below so that the result also evaluates to the number of flights King Triton took to San Diego (airport code 'SAN').

flights.groupby(__(a)__).count().get('FLIGHT').__(b)__            

What goes in blank (a)?

• 'DATE'

• 'FLIGHT'

• 'FROM'

• 'TO'

What goes in blank (b)?

• .index[0]

• .index[-1]

• .loc['SAN']

• .iloc['SAN']

• .iloc[0]

True or False: If we change .get('FLIGHT') to .get('SEAT'), the results of the above code block will not change. (You may assume you answered the previous two subparts correctly.)

• True

• False

Answer: 'TO', .loc['SAN'], True

The strategy here is to group all of King Triton’s flights according to where they landed, and count up the number that landed in San Diego. The expression flights.groupby('TO').count() evaluates to a DataFrame indexed by arrival airport where, for any arrival airport, each column has a count of the number of King Triton’s flights that landed at that airport. To get the count for San Diego, we need the entry in any column for the row corresponding to San Diego. The code .get('FLIGHT') says we’ll use the 'FLIGHT' column, but any other column would be equivalent. To access the entry of this column corresponding to San Diego, we have to use .loc because we know the name of the value in the index should be 'SAN', but we don’t know the row number or integer position.

##### Difficulty: ⭐️⭐️

The average score on this problem was 89%.

### Problem 7.3

Consider the DataFrame san, defined below.

san = flights[(flights.get('FROM') == 'SAN') & (flights.get('TO') == 'SAN')]

Which of these DataFrames must have the same number of rows as san?

• flights[(flights.get('FROM') == 'SAN') and (flights.get('TO') == 'SAN')]

• flights[(flights.get('FROM') == 'SAN') | (flights.get('TO') == 'SAN')]

• flights[(flights.get('FROM') == 'LAX') & (flights.get('TO') == 'SAN')]

• flights[(flights.get('FROM') == 'LAX') & (flights.get('TO') == 'LAX')]

Answer: flights[(flights.get('FROM') == 'LAX') & (flights.get('TO') == 'LAX')]

The DataFrame san contains all rows of flights that have a departure airport of 'SAN' and an arrival airport of 'SAN'. But as you may know, and as you’re told in the data description, there are no flights from an airport to itself. So san is actually an empty DataFrame with no rows!

We just need to find which of the other DataFrames would necessarily be empty, and we can see that flights[(flights.get('FROM') == 'LAX') & (flights.get('TO') == 'LAX')] will be empty for the same reason.

Note that none of the other answer choices are correct. The first option uses the Python keyword and instead of the symbol &, which behaves unexpectedly but does not give an empty DataFrame. The second option will be non-empty because it will contain all flights that have San Diego as the departure airport or arrival airport, and we already know from the first few rows of flight that there are some of these. The third option will contain all the flights that King Triton has taken from 'LAX' to 'SAN'. Perhaps he’s never flown this route, or perhaps he has. This DataFrame could be empty, but it’s not necessarily going to be empty, as the question requires.

##### Difficulty: ⭐️⭐️⭐️

The average score on this problem was 70%.

## Problem 8

The American Kennel Club (AKC) organizes information about dog breeds. We’ve loaded their dataset into a DataFrame called df. The index of df contains the dog breed names as str values.

The columns are:

• 'kind' (str): the kind of dog (herding, hound, toy, etc.). There are six total kinds.
• 'size' (str): small, medium, or large.
• 'longevity' (float): typical lifetime (years).
• 'price' (float): average purchase price (dollars).
• 'kids' (int): suitability for children. A value of 1 means high suitability, 2 means medium, and 3 means low.
• 'weight' (float): typical weight (kg).
• 'height' (float): typical height (cm).

The rows of df are arranged in no particular order. The first five rows of df are shown below (though df has many more rows than pictured here).

Assume we have already run import babypandas as bpd and import numpy as np.

The following code computes the breed of the cheapest toy dog.

df[__(a)__].__(b)__.__(c)__

### Problem 8.1

Fill in part (a).

Answer: df.get('kind') == 'toy'

To find the cheapest toy dog, we can start by narrowing down our dataframe to only include dogs that are of kind toy. We do this by constructing the following boolean condition: df.get('kind') == 'toy', which will check whether a dog is of kind toy (i.e. whether or not a given row’s 'kind' value is equal to 'toy'). As a result, df[df.get('kind') == 'toy'] will retrieve all rows for which the 'kind' column is equal to 'toy'.

##### Difficulty: ⭐️

The average score on this problem was 91%.

### Problem 8.2

Fill in part (b).

Answer: .sort_values('price')

Next, we can sort the resulting dataframe by price, which will make the minimum price (i.e. the cheapest toy dog) easily accessible to us later on. To sort the dataframe, simply use .sort_values(), with parameter 'price' as follows: .sort_values('price')

##### Difficulty: ⭐️⭐️

The average score on this problem was 86%.

### Problem 8.3

Which of the following can fill in blank (c)? Select all that apply.

• loc[0]

• iloc[0]

• index[0]

• min()

Answer: index[0]

• loc[0]: loc retrieves an element by the row label, which in this case is by 'breed', not by index value. Furthermore, loc actually returns the entire row, which is not what we are looking for. (Note that we are trying to find the singular 'breed' of the cheapest toy dog.)
• iloc[0]: While iloc does retrieve elements by index position, iloc actually returns the entire row, which is not what we are looking for.
• index[0]: Note that since 'breed' is the index column of our dataframe, and since we have already filtered and sorted the dataframe, simply taking the 'breed' at index 0, or index[0] will return the 'breed' of the cheapest toy dog.
• min(): min() is a method used to find the smallest value on a series not a dataframe.

##### Difficulty: ⭐️⭐️

The average score on this problem was 81%.

## Problem 9

The following code computes an array containing the unique kinds of dogs that are heavier than 20 kg or taller than 40 cm on average.

foo = df.__(a)__.__(b)__
np.array(foo[__(c)__].__d__)

### Problem 9.1

Fill in blank (a).

Answer: groupby('kind')

We start this problem by grouping the dataframe by 'kind' since we’re only interested in whether each unique 'kind' of dog fits some sort of constraint. We don’t quite perform querying yet since we need to group the DataFrame first. In other words, we first need to group the DataFrame into each 'kind' before we could apply any sort of boolean conditionals.

##### Difficulty: ⭐️

The average score on this problem was 97%.

### Problem 9.2

Fill in blank (b).

Answer: .mean()

After we do .groupby('kind'), we need to apply .mean() since the problem asks if each unique 'kind' of dog satisfies certain constraints on average. .mean() calculates the average of each column of each group which is what we want.

##### Difficulty: ⭐️

The average score on this problem was 94%.

### Problem 9.3

Fill in blank (c).

Answer: (foo.get('weight') > 20 | foo.get(height) > 40)

Once we have grouped the dogs by 'kind' and have calculated the average stats of each kind of dog, we can do some querying with two conditionals: foo.get('weight') > 20 gets the kinds of dogs that are heavier than 20 kg on average and foo.get('height') > 40) gets the kinds of dogs that are taller than 40 cm on average. We combine these two conditions with the | operator since we want the kind of dogs that satisfy either condition.

##### Difficulty: ⭐️

The average score on this problem was 93%.

### Problem 9.4

Which of the following should fill in blank (d)?

• .index

• .unique()

• .get('kind')

• .get(['kind'])

Answer: .index

Note that earlier, we did groupby('kind'), which automatically sets each unique 'kind' as the index. Since this is what we want anyways, simply doing .index will give us all the kinds of dogs that satisfy the given conditions.

##### Difficulty: ⭐️

The average score on this problem was 94%.

## Problem 10

In September 2020, Governor Gavin Newsom announced that by 2035, all new vehicles sold in California must be zero-emissions vehicles. Electric vehicles (EVs) are among the most popular zero-emissions vehicles (though other examples include plug-in hybrids and hydrogen fuel cell vehicles).

The DataFrame evs consists of 32 rows, each of which contains information about a different EV model.

• "Brand" (str): The vehicle’s manufacturer.
• "Model" (str): The vehicle’s model name.
• "BodyStyle" (str): The vehicle’s body style.
• "Seats" (int): The vehicle’s number of seats.
• "TopSpeed" (int): The vehicle’s top speed, in kilometers per hour.
• "Range" (int): The vehicle’s range, or distance it can travel on a single charge, in kilometers.

The first few rows of evs are shown below (though remember, evs has 32 rows total).

Assume that:

• The only four values in the "Brand" column are "Tesla", "BMW", "Audi", and "Nissan".
• We have already run import babypandas as bpd and import numpy as np.

Suppose we’ve run the following line of code. ​

counts = evs.groupby("Brand").count()

### Problem 10.1

What value does counts.get("Range").sum() evaluate to?

counts is a DataFrame with one row per "Brand", since we grouped by "Brand". Since we used the .count() aggregation method, the columns in counts will all be the same – they will all contain the number of rows in evs for each "Brand" (i.e. they will all contain the distribution of "Brand"). If we sum up the values in any one of the columns in counts, then, the result will be the total number of rows in evs, which we know to be 32. Thus, counts.get("Range").sum() is 32.

##### Difficulty: ⭐️⭐️⭐️

The average score on this problem was 56%.

### Problem 10.2

What value does counts.index[3] evaluate to?

Answer: "Tesla"

Since we grouped by "Brand" to create counts, the index of counts will be "Brand", sorted alphabetically (this sorting happens automatically when grouping). This means that counts.index will be the array-like sequence ["Audi", "BMW", "Nissan", "Tesla"], and counts.index[3] is "Tesla".

##### Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 33%.

## Problem 11

Consider the following incomplete assignment statement.

result = evs______.mean()

In each part, fill in the blank above so that result evaluates to the specified quantity.

### Problem 11.1

A DataFrame, indexed by "Brand", whose "Seats" column contains the average number of "Seats" per "Brand". (The DataFrame may have other columns in it as well.)

Answer: .groupby("Brand")

When we group by a column, the resulting DataFrame contains one row for every unique value in that column. The question specified that we wanted some information per "Brand", which implies that grouping by "Brand" is necessary.

After grouping, we need to use an aggregation method. Here, we wanted the resulting DataFrame to be such that the "Seats" column contained the average number of "Seats" per "Brand"; this is accomplished by using .mean(), which is already done for us.

Note: With the provided solution, the resulting DataFrame also has other columns. For instance, it has a "Range" column that contains the average "Range" for each "Brand". That’s fine, since we were told that the resulting DataFrame may have other columns in it as well. If we wanted to ensure that the only column in the resulting DataFrame was "Seats", we could have used .get(["Brand", "Seats"]) before grouping, though this was not necessary.

##### Difficulty: ⭐️⭐️

The average score on this problem was 76%.

### Problem 11.2

A number, corresponding to the average "TopSpeed" of all EVs manufactured by Audi in evs

Answer: [evs.get("Brand") == "Audi"].get("TopSpeed")

There are two parts to this problem:

1. Querying, to make sure that we only keep the rows corresponding to Audis. This is accomplished by:

• Using evs.get("Brand") == "Audi" to create a Boolean Series, with Trues for the rows we want to keep and Falses for the other rows.
• Using Boolean indexing to keep only the rows in which the aforementioned Series is True. This is accomplished by evs[evs.get("Brand") == "Audi"] (though the evs part at the front was already provided).
2. Accessing the "TopSpeed" column. This is accomplished by using .get("TopSpeed").

Then, evs[evs.get("Brand") == "Audi"].get("TopSpeed") is a Series contaning the "TopSpeed"s of all Audis, and mean of this Series is the result we’re looking for. The call to .mean() was already provided for us.

##### Difficulty: ⭐️⭐️

The average score on this problem was 77%.