Discussion 3: Querying, Grouping, and Plotting



These problems are taken from past quizzes and exams. Work on them on paper, since the quizzes and exams you take in this course will also be on paper.

We encourage you to complete these problems during discussion section. Solutions will be made available after all discussion sections have concluded. You don’t need to submit your answers anywhere.

Note: We do not plan to cover all of these problems during the discussion section; the problems we don’t cover can be used for extra practice.


Problem 1

The following code computes an array containing the unique kinds of dogs that are heavier than 20 kg or taller than 40 cm on average.

foo = df.__(a)__.__(b)__
np.array(foo[__(c)__].__(d)__)


Problem 1.1

Fill in blank (a).


Problem 1.2

Fill in blank (b).


Problem 1.3

Fill in blank (c).


Problem 1.4

Which of the following should fill in blank (d)?



Problem 2

You have a DataFrame called prices that contains information about food prices at 18 different grocery stores. There is a column called 'broccoli' that contains the price in dollars for one pound of broccoli at each grocery store. There is also a column called 'ice_cream' that contains the price in dollars for a pint of store-brand ice cream at each grocery store.


Problem 2.1

Using the code,

prices.plot(kind='hist', y='broccoli', bins=np.arange(0.8, 2.11, 0.1), density=True)

we produced the histogram below:

How many grocery stores sold broccoli for a price greater than or equal to $1.30 per pound, but less than $1.40 per pound (the tallest bar)?


Problem 2.2

Suppose we now plot the same data with different bins, using the following line of code:

prices.plot(kind='hist', y='broccoli', bins=[0.8, 1, 1.1, 1.5, 1.8, 1.9, 2.5], density=True)

What would be the height on the y-axis for the bin corresponding to the interval [\$1.10, \$1.50)? Input your answer below.
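
As a reminder of how density histograms treat bins of unequal width, here is a minimal sketch that uses NumPy on made-up data. The array toy and the bin edges are hypothetical and unrelated to prices. With density=True, the height of each bar is the proportion of values in the bin divided by the width of the bin, so the areas of the bars sum to 1.

```python
import numpy as np

# Hypothetical data, not taken from the prices DataFrame.
toy = np.array([1, 2, 2, 3, 5, 6, 7, 9])
edges = [0, 4, 5, 10]  # unequal bin widths: 4, 1, and 5

# Raw counts per bin, and the heights a density histogram would use.
counts, _ = np.histogram(toy, bins=edges)
heights, _ = np.histogram(toy, bins=edges, density=True)
widths = np.diff(edges)

# Height = (proportion of values in the bin) / (bin width).
manual_heights = counts / counts.sum() / widths

# The total area of all bars in a density histogram is 1.
total_area = (heights * widths).sum()
```

Two bins in this sketch hold the same number of values but get different heights, because their widths differ.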


Problem 2.3

You are interested in finding out the number of stores in which a pint of ice cream was cheaper than a pound of broccoli. Will you be able to determine the answer to this question by looking at the plot produced by the code below?

prices.get(['broccoli', 'ice_cream']).plot(kind='barh')


Problem 2.4

You are interested in finding out the number of stores in which a pint of ice cream was cheaper than a pound of broccoli. Will you be able to determine the answer to this question by looking at the plot produced by the code below?

prices.get(['broccoli', 'ice_cream']).plot(kind='hist')


Problem 2.5

Some code and the scatterplot it produced are shown below:

(prices.get(['broccoli', 'ice_cream']).plot(kind='scatter', x='broccoli', y='ice_cream'))

Can you use this plot to figure out the number of stores in which a pint of ice cream was cheaper than a pound of broccoli?

If so, say how many such stores there are and explain how you came to that conclusion.

If not, explain why this scatterplot cannot be used to answer the question.



Problem 3

Suppose df is a DataFrame and b is any boolean array whose length is the same as the number of rows of df.

True or False: For any such boolean array b, df[b].shape[0] is less than or equal to df.shape[0].
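
To recall how selecting rows with a boolean array works, here is a minimal sketch on made-up data. It uses pandas rather than babypandas, on the assumption that the df[b] syntax behaves the same way in both; the names toy, kept, num_kept, and num_true are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with four rows.
toy = pd.DataFrame({"x": [10, 20, 30, 40]})

# One boolean per row of toy.
b = np.array([True, False, True, False])

# Keeps only the rows whose corresponding entry in b is True.
kept = toy[b]

num_kept = kept.shape[0]  # number of rows that survived the query
num_true = b.sum()        # number of True entries in b
```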


Problem 4

You are given a DataFrame called books that contains columns 'author' (string), 'title' (string), 'num_chapters' (int), and 'publication_year' (int).

Suppose that after running books.groupby('author').max(), one row of the result is:

author           title         num_chapters  publication_year
Charles Dickens  Oliver Twist  53            1838
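
As a refresher on how aggregation works after grouping, here is a minimal sketch on made-up data. It uses pandas, assuming babypandas aggregates the same way; the DataFrame toy and its columns 'owner', 'pet', and 'age' are hypothetical and unrelated to books.

```python
import pandas as pd

# Hypothetical data: two rows for "ann" and one row for "bob".
toy = pd.DataFrame({
    "owner": ["ann", "ann", "bob"],
    "pet": ["zebra", "ant", "cat"],
    "age": [1, 9, 4],
})

# .max() is applied to each column separately within each group.
# For strings, the maximum is the alphabetically last value.
result = toy.groupby("owner").max()
ann_row = result.loc["ann"]
```

In this sketch, the "ann" row of result pairs 'zebra' with 9, even though no single row of toy contains both values.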


Problem 4.1

Based on this data, can you conclude that Charles Dickens is the alphabetically last of all author names in this dataset?


Problem 4.2

Based on this data, can you conclude that Charles Dickens wrote Oliver Twist?


Problem 4.3

Based on this data, can you conclude that Oliver Twist has 53 chapters?


Problem 4.4

Based on this data, can you conclude that Charles Dickens wrote a book with 53 chapters that was published in 1838?



Problem 5

Included is a DataFrame named sungod that contains information on the artists who have performed at Sun God in years past. For each year that the festival was held, we have one row for each artist that performed that year. The columns are:

The rows of sungod are arranged in no particular order. The first few rows of sungod are shown below (though sungod has many more rows than pictured here).

Assume:

On the graph paper below, draw the histogram that would be produced by this code.

(
sungod.take(np.arange(5))
      .plot(kind='hist', density=True,
            bins=np.arange(0, 7, 2), y='Appearance_Order')
);

In your drawing, make sure to label the height of each bar in the histogram on the vertical axis. You can scale the axes however you like, and the two axes don’t need to be on the same scale.


Problem 6

King Triton, UCSD’s mascot, is quite the traveler! For this question, we will be working with the flights DataFrame, which details several facts about each of the flights that King Triton has been on over the past few years. The first few rows of flights are shown below.


Here’s a description of the columns in flights:

Problem 6.1

Which of these correctly evaluates to the number of flights King Triton took to San Diego (airport code 'SAN')?


Problem 6.2

Fill in the blanks below so that the result also evaluates to the number of flights King Triton took to San Diego (airport code 'SAN').

flights.groupby(__(a)__).count().get('FLIGHT').__(b)__            

What goes in blank (a)?

What goes in blank (b)?

True or False: If we change .get('FLIGHT') to .get('SEAT'), the results of the above code block will not change. (You may assume you answered the previous two subparts correctly.)


Problem 6.3

Consider the DataFrame san, defined below.

san = flights[(flights.get('FROM') == 'SAN') & (flights.get('TO') == 'SAN')]

Which of these DataFrames must have the same number of rows as san?



Problem 7

The American Kennel Club (AKC) organizes information about dog breeds. We’ve loaded their dataset into a DataFrame called df. The index of df contains the dog breed names as str values.

The columns are:

The rows of df are arranged in no particular order. The first five rows of df are shown below (though df has many more rows than pictured here).

Assume we have already run import babypandas as bpd and import numpy as np.



The following code computes the breed of the cheapest toy dog.

df[__(a)__].__(b)__.__(c)__


Problem 7.1

Fill in part (a).


Problem 7.2

Fill in part (b).


Problem 7.3

Which of the following can fill in blank (c)? Select all that apply.



Problem 8

In September 2020, Governor Gavin Newsom announced that by 2035, all new vehicles sold in California must be zero-emissions vehicles. Electric vehicles (EVs) are among the most popular zero-emissions vehicles (though other examples include plug-in hybrids and hydrogen fuel cell vehicles).


The DataFrame evs consists of 32 rows, each of which contains information about a different EV model.

The first few rows of evs are shown below (though remember, evs has 32 rows total).


Assume that:



Suppose we’ve run the following line of code.

counts = evs.groupby("Brand").count()
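
To recall what .groupby(...).count() returns, here is a minimal sketch on made-up data. It uses pandas, assuming babypandas behaves the same way; the DataFrame toy and its columns 'kind' and 'weight' are hypothetical and unrelated to evs.

```python
import pandas as pd

# Hypothetical data with three distinct values in the "kind" column.
toy = pd.DataFrame({
    "kind": ["pear", "apple", "pear", "fig"],
    "weight": [3, 1, 4, 2],
})

# One row per distinct "kind"; each remaining column holds the
# number of rows of toy that belong to that group.
toy_counts = toy.groupby("kind").count()

labels = list(toy_counts.index)         # group labels, in ascending order
sizes = list(toy_counts.get("weight"))  # number of rows in each group
```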


Problem 8.1

What value does counts.get("Range").sum() evaluate to?


Problem 8.2

What value does counts.index[3] evaluate to?



Problem 9

Consider the following incomplete assignment statement.

result = evs______.mean()

In each part, fill in the blank above so that result evaluates to the specified quantity.


Problem 9.1

A DataFrame, indexed by "Brand", whose "Seats" column contains the average number of "Seats" per "Brand". (The DataFrame may have other columns in it as well.)


Problem 9.2

A number, corresponding to the average "TopSpeed" of all EVs manufactured by Audi in evs.



Problem 10

Nintendo collected data on the heights of a sample of Animal Crossing: New Horizons players. A histogram of the heights in their sample is given below.

What percentage of players in Nintendo’s sample are at least 62.5 inches tall? Give your answer as an integer rounded to the nearest multiple of 5.


Problem 11

You are given a DataFrame called restaurants that contains information on a variety of local restaurants’ daily number of customers and daily income. There is a row for each restaurant for each date in a given five-year time period.

The columns of restaurants are 'name' (string), 'year' (int), 'month' (int), 'day' (int), 'num_diners' (int), and 'income' (float).

Assume that in our data set, no two different restaurants share the same 'name' (as the locations of a chain restaurant might, for example).


Problem 11.1

What type of visualization would be best to display the data in a way that helps to answer the question “Do more customers bring in more income?”


Problem 11.2

What type of visualization would be best to display the data in a way that helps to answer the question “Have restaurants’ daily incomes been declining over time?”



Problem 12

Suppose there are 200 students enrolled in DSC 10, and that the histogram below displays the distribution of the number of Instagram followers each student has, measured in 100s. That is, if a student is represented in the first bin, they have between 0 and 200 Instagram followers.


Problem 12.1

How many students in DSC 10 have between 200 and 800 Instagram followers? Give your answer as an integer.


Problem 12.2

Suppose the height of a bar in the above histogram is h. How many students are represented in the corresponding bin, in terms of h?

Hint: Just as in the first subpart, you’ll need to use the assumption from the start of the problem.


