Summer 2024 Midterm Exam



Instructor(s): Nishant Kheterpal

This exam was administered in-person. The exam was closed-notes, except students were provided a copy of the DSC 10 Reference Sheet. No calculators were allowed. Students had 50 minutes to take this exam.


In this exam, you’ll work with a data set representing the results of the Collegiate Mario Kart League. Each row represents a team and their performance in the league. The columns of kart are as follows:

The first few rows of kart are shown below, though kart has 12 rows in total (since there are 12 teams total, 6 in each division).


Assume that we have already run import babypandas as bpd and import numpy as np.


Problem 1


Problem 1.1

Which column of kart would work best as the index?

Answer: "Team"

Recall that a good index is a column whose values act as a unique identifier for each observation. The only column that fits this description is "Team", since each row represents a unique team.


Difficulty: ⭐️⭐️

The average score on this problem was 86%.


Problem 1.2

Write a line of Python code that sets the index of kart to the column that you identified in part (1).

Answer: kart = kart.set_index("Team")

We use set_index("Team") to set the DataFrame’s index to "Team", and assign the result back to kart to save the change, since set_index returns a new DataFrame rather than modifying kart in place.


Difficulty: ⭐️⭐️

The average score on this problem was 81%.


Problem 1.3

Fill in the blanks below to complete the implementation of the function division_to_int, which takes in a string, division, the name of a division, and returns the division’s number as an int. Example behavior is given below.

  >>> division_to_int("Division 1")
  1
  >>> division_to_int("Division 2")
  2
def division_to_int(division):
  __(i)__ __(ii)__(division.__(iii)__[__(iv)__])

Answer:

  • (i) return
  • (ii) int
  • (iii) split() or split(" ")
  • (iv) -1 or 1

Using the argument division, we just want to retrieve the number. To accomplish this, we can use split() on the division to separate the string in order to access the number. Note that the default behavior of split() is to split by the space.

The output of division.split() is now a list of “Division” and the number. To retrieve the number in this list, we can index for it with either -1 (since it is the last element of the list) or 1 (since it is the second element in the list). Because we want the output to be an integer type, we use int to cast the value to an integer. Finally, to have the function output the desired value, we start with a return statement.
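Putting the four blanks together gives the function below. This sketch uses split() and index -1, which is one of the accepted combinations; split(" ") and index 1 behave the same way on these inputs.

```python
def division_to_int(division):
    # "Division 2".split() gives ["Division", "2"]; take the last piece and cast it.
    return int(division.split()[-1])

print(division_to_int("Division 1"))  # 1
print(division_to_int("Division 2"))  # 2
```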


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 67%.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 71%.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 52%.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 52%.


Problem 1.4

Write a single line of code that applies division_to_int to the existing "Division" column, and updates the "Division" column in kart.

Answer: kart.assign(Division=kart.get("Division").apply(division_to_int))

First, we get the column to transform with kart.get("Division") and call .apply(division_to_int) to apply the function to each of its values. To store the transformed Series under the name "Division", we call .assign on the DataFrame with Division= set to that Series. Note that when the column name passed to .assign already exists, .assign replaces that column’s values with the new ones instead of adding a second column.
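The behavior of .assign on an existing column can be checked on a small example. The sketch below uses pandas directly (babypandas is a restricted wrapper around it) and a made-up two-row stand-in for kart, so the team names are hypothetical.

```python
import pandas as pd

def division_to_int(division):
    return int(division.split()[-1])

# Hypothetical two-row stand-in for kart; the real table has more rows and columns.
kart = pd.DataFrame({
    "Team": ["Team A", "Team B"],
    "Division": ["Division 1", "Division 2"],
})

updated = kart.assign(Division=kart.get("Division").apply(division_to_int))

print(updated.get("Division").tolist())  # [1, 2]
print(kart.get("Division").tolist())     # ['Division 1', 'Division 2']
```

The second print shows that .assign returns a new DataFrame and leaves the one it was called on untouched, so the result has to be stored under a name to be used later.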


Difficulty: ⭐️⭐️

The average score on this problem was 81%.


For the rest of this exam, assume that the changes above have not been made and that kart is the same DataFrame that was described on the attached information sheet.


Problem 2

Fill in the blanks so that the expression below evaluates to the region in the "Region" column that comes first alphabetically.

kart.groupby(___(a)___).count().index.___(b)___

What goes in (a)?

What goes in (b)?

Answer:

  • (a) "Region"
  • (b) min()

To obtain the unique regions in the DataFrame, we group by "Region". Applying count() counts the number of rows for each region; the counts themselves do not matter here, since we only need the unique regions. .index then retrieves those unique regions, which form the index of the grouped DataFrame. To get the one that comes first alphabetically, we use min().
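A small pandas sketch of the same chain, using a made-up four-row table (the team names and region assignments are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for kart with only the columns needed here.
kart = pd.DataFrame({
    "Team": ["Team A", "Team B", "Team C", "Team D"],
    "Region": ["West", "Midwest", "East", "Midwest"],
})

regions = kart.groupby("Region").count().index  # one entry per unique region
first_region = regions.min()

print(list(regions))  # ['East', 'Midwest', 'West']
print(first_region)   # East
```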


Difficulty: ⭐️

The average score on this problem was 100%.


Difficulty: ⭐️⭐️

The average score on this problem was 81%.


Problem 3

Fill in the blanks so that the expression below evaluates to the "University" of the lowest scoring Division 2 team over the whole season.

kart[__(a)__].__(b)__(__(c)__).get("University").iloc[0]

Answer:

  • (a) kart.get("Division") == "Division 2"
  • (b) sort_values
  • (c) "Total Points" or by="Total Points"

First we want to filter the DataFrame for only the Division 2 teams using kart.get("Division") == "Division 2".

Since we are interested in the lowest scoring team in this division, we sort by the "Total Points" column with sort_values(by="Total Points"). Because .iloc[0] retrieves the top row of the sorted DataFrame, we do not need to specify the sort order: the default behavior of sort_values is ascending, which puts the lowest score first.
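The filter, sort, and positional lookup can be traced on a small example. This is a sketch in pandas with made-up universities and point totals, not the actual kart data.

```python
import pandas as pd

# Hypothetical rows; only the columns used by the expression are included.
kart = pd.DataFrame({
    "University": ["Univ A", "Univ B", "Univ C", "Univ D"],
    "Division": ["Division 1", "Division 2", "Division 2", "Division 2"],
    "Total Points": [5200, 4100, 2900, 4800],
})

lowest = (
    kart[kart.get("Division") == "Division 2"]  # keep Division 2 rows only
    .sort_values(by="Total Points")             # ascending by default
    .get("University")
    .iloc[0]                                    # first row = lowest total
)
print(lowest)  # Univ C
```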


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 74%.


Difficulty: ⭐️

The average score on this problem was 90%.


Difficulty: ⭐️⭐️

The average score on this problem was 86%.


Problem 4

Consider the scatterplot generated by the following expression:

kart.plot(kind="scatter", x="Total Points", y="Races Won")

Which of the following questions would you be able to answer from this scatterplot? Select all that apply.

Answer: Option 2 and Option 3

  • Option 1 is incorrect since the scatter plot shows all the observations based on the x and y values and they are not labeled by team. Thus, team information is not shown on this scatterplot.
  • Option 4 is incorrect since the scatter plot does not label the teams and thus you would not know this information.
  • Option 2 is correct. Since the x axis is total points, you can count the number of observations that fall at x≥6000.
  • Option 3 is correct. The phrase “100 total points per race that they won” describes a relationship between total points and races won. Since the x-axis shows total points and the y-axis shows races won, a team scoring exactly 100 points per race won lies on the line x=100y. Teams that scored less than this satisfy x<100y, so you can count the points on that side of the line (to its left).

Difficulty: ⭐️

The average score on this problem was 95%.


Problem 5


Problem 5.1

Fill in the blanks so that the sentence below correctly describes the meaning of mystery.

“The __(i)__ __(ii)__ of the __(iii)__ __(iv)__ .”

mystery = kart.groupby("Region").mean().get("Ranking").min()

What goes in (i)?

What goes in (ii)?

What goes in (iii)?

What goes in (iv)?

Answer:

  • (i) average
  • (ii) ranking
  • (iii) best
  • (iv) region

mystery groups by the "Region", calculates the mean values and retrieves the "Ranking" column. So we have the average ranking for all the regions. Now we retrieve the minimum value of this column. Note that in this context, a lower average ranking indicates a better team. Hence, we get “the average ranking of the best region”.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 67%.


Difficulty: ⭐️

The average score on this problem was 95%.


Difficulty: ⭐️⭐️⭐️⭐️⭐️

The average score on this problem was 14%.


Difficulty: ⭐️

The average score on this problem was 95%.


Problem 5.2

Fill in the blank in the code snippet below so that my_bool is True if there is at least one university with two teams in the same division, and False if there are no universities with two teams in the same division. Your answer must include the use of .groupby() in order to receive credit.

my_df = ______
my_bool = my_df.get("Team").max() > 1

Answer: kart.groupby(["University", "Division"]).count()

In the second line of code, we get the column “Team” and check whether its maximum value is greater than 1. In other words, it is checking whether there are any universities with multiple teams in a division. This means that our input DataFrame needs to have a numeric column "Team" that denotes the number of teams for each university for each division. This means we need to use groupby(). Since we want this information for each university within each division, we need to groupby both “University” and “Division”. Then, because we want the number of teams within each division within each university, you want to apply the count() aggregate. Now all the columns in this DataFrame contain the number of teams per division per university since count notes the number of observations in each category.

From here, my_bool takes this grouped DataFrame, gets the "Team" column containing the number of teams per division per university, and checks whether any of those counts exceeds 1 by testing whether the column’s maximum is greater than 1.
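As a quick check of this reasoning, the sketch below runs the two lines in pandas on a made-up table in which one hypothetical university ("Univ X") has two Division 1 teams.

```python
import pandas as pd

# Hypothetical rows: Univ X fields two teams in Division 1.
kart = pd.DataFrame({
    "Team": ["Team A", "Team B", "Team C", "Team D"],
    "University": ["Univ X", "Univ X", "Univ Y", "Univ X"],
    "Division": ["Division 1", "Division 1", "Division 2", "Division 2"],
})

# One row per (University, Division) pair; every column holds the group size.
my_df = kart.groupby(["University", "Division"]).count()
my_bool = my_df.get("Team").max() > 1

print(my_df.get("Team").tolist())  # [2, 1, 1]
print(my_bool)                     # True
```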


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 57%.



Problem 6

Below are two density histograms representing the distribution of Total Points for Division 1 and the distribution of Total Points for Division 2 teams (remember, there are six teams in each division):


Problem 6.1

Assuming we know the bin values for each histogram, what can we conclude from these two histograms? Select all that apply:

Answer: Option 1, 2, 3, and 4

Since we are working with density histograms, each rectangle’s area represents the relative frequency of the corresponding bin. Given that there are six teams in each division, we can use the relative frequencies to approximate the number of teams in each bin.


  • Option 1 is correct:
    • The rightmost bin on the histogram for Division 2 has a height of approximately 0.00017 and a width of 1000. The area of the bar, which represents the relative frequency for this bin, is calculated as follows:
      • Relative Frequency = 0.00017 × 1000 = 0.17
      • This indicates that around 17% of Division 2 teams fall within this bin. Given that there are six teams in Division 2, we can estimate the number of teams in this bin by:
      • 0.17 × 6 = 1.02, rounding to the nearest whole number, this results in approximately 1 team.
  • Option 2 is correct:
    • From the Division 1 histogram, we can see that the bin from 4000 to 4500 is empty. Since the bins are inclusive on the left and exclusive on the right, a team that scored exactly 4000 points would fall in this bin, so we can conclude that no Division 1 team scored 4000 points.
  • Option 3 is correct:
    • Similarly, for the [2000, 3000) bin in Division 2, we can calculate the area of the bar and then multiply it by the total number of teams (6) to estimate the number of teams that fall within this range.
  • Option 4 is correct:
    • The lowest-scoring bin for Division 1 is [3500, 4000). Division 2 has a bin that starts at a lower range [2000, 3000), which is below Division 1’s minimum bin. Therefore, we can determine the number of Division 2 teams that scored fewer points than the lowest-scoring Division 1 team.

Difficulty: ⭐️⭐️

The average score on this problem was 85%.


Problem 6.2

Suppose that we changed the histogram of total points for Division 2 teams so that the bins were [2000, 4000), [4000, 6000), [6000, 8000). If the bin defined by [2000, 4000) contained one team, as it does in the original graph, what would the height of the middle bar (with bin [4000, 6000)) be? Do not simplify your answer.

Answer: \frac{1}{3000}

First, we need to calculate the number of teams that scored in the range of [4000,6000) in the original histogram for Division 2 teams:

Area of bar in bin[4000, 5000) * 6 + Area of bar in bin[5000, 6000) * 6 = 0.0005 * 1000 * 6 + 0.00017 * 1000 * 6 = 4.02

Rounding to the nearest whole number, we find that approximately 4 teams fall within this range.

Next, we want to calculate the height of the bar with bin [4000, 6000) in the new histogram. Using the fact that this bin should contain 4 teams, we have:

Height * Width * 6 = 4

Solving for the height:

Height = 4 / (6 * 2000) = \frac{1}{3000}


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 64%.


Problem 6.3

Suppose we drew different bins for the histogram of total points for Division 2 teams. If the bin defined by [2000, 4000) contained one team, as it does in the original graph, and the bin defined by [4000, 4500) contained two teams, what would the height of the bar with bin [4500, 5000) be? Do not simplify your answer.

Answer: \frac{1}{3000}

In the original histogram, the bin defined by [4000, 5000) contained: 0.0005 * 1000 * 6 = 3 teams

According to the description, the bin defined by [4000, 4500) contained 2 teams.

Therefore, we can conclude that the bin defined by [4500, 5000) contained: 3 - 2 = 1 team

In the bin [4500, 5000) of the new histogram, we have: Height * Width * 6 = 1

Solving for the height: Height = 1 / (6 * Width) = 1 / (6 * 500) = \frac{1}{3000}
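The height calculations in Problems 6.2 and 6.3 follow the same rule: in a density histogram, height times width equals the proportion of teams in the bin. A minimal sketch with exact fractions, where density_height is a helper name introduced here for illustration:

```python
from fractions import Fraction

def density_height(teams_in_bin, total_teams, bin_width):
    # height * width = proportion of teams in the bin, so divide by the width.
    return Fraction(teams_in_bin, total_teams) / bin_width

TOTAL = 6  # six teams in each division

# Problem 6.2: [4000, 6000) holds 3 + 1 = 4 teams and is 2000 wide.
print(density_height(3 + 1, TOTAL, 2000))  # 1/3000

# Problem 6.3: [4500, 5000) holds 3 - 2 = 1 team and is 500 wide.
print(density_height(3 - 2, TOTAL, 500))   # 1/3000
```

The two answers coincide because the second bin holds a quarter of the teams in a quarter of the width.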


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 32%.



Problem 7

The DataFrame div_one, shown below to the left, contains the six Division 1 teams from kart, with just the "Team" and "Region" columns selected.

The DataFrame coach, shown below to the right, contains some additional information about the coaches of the teams in div_one, with a piece of info missing.


Problem 7.1

Fill in the blank in the Region column with either South, West, East, or Midwest so that the DataFrame div_one.merge(coach, on="Region") has:

Answer:

  • 8 rows: South
  • 9 rows: East
  • 10 rows: West
  • 11 rows: Midwest

To determine the number of rows after merging two DataFrames (div_one and coach) on the column "Region", we need to calculate how many matches occur for each region. When performing the merge, each row from div_one will be combined with every matching row in coach. The total number of rows in the resulting DataFrame will depend on how many times each region value in div_one has matching rows in coach.

  1. South (8 rows):
    • div_one has 3 rows with Midwest, and 1 row with East.
    • coach has 2 rows with Midwest, and 2 rows with East.
    • For each Midwest row in div_one, it matches with both Midwest rows in coach, resulting in 3×2=6 matches.
    • The single East row in div_one matches with the 2 East rows in coach, contributing 1×2=2 matches.
    • Total for South: 3 * 2(Midwest) + 1 * 2(East) = 8 rows.
  2. East (9 rows):
    • div_one has 3 rows with Midwest, and 1 row with East.
    • coach has 2 rows with Midwest, and 3 rows with East.
    • For each Midwest row in div_one, it matches with both Midwest rows in coach, giving 3×2=6 matches.
    • The single East row in div_one matches with the 3 East rows in coach, contributing 1×3=3 matches.
    • Total for East: 3 * 2(Midwest) + 1 * 3(East) = 9 rows.
  3. West (10 rows):
    • div_one has 3 rows with Midwest, 1 row with East, and 2 rows with West.
    • coach has 2 rows with Midwest, 2 rows with East, and 1 row with West.
    • For each Midwest row in div_one, it matches with both Midwest rows in coach, resulting in 3×2=6 matches.
    • The single East row in div_one matches with the 2 East rows in coach, contributing 1×2=2 matches.
    • The 2 West rows in div_one each match with the single West row in coach, contributing 2×1=2 matches.
    • Total for West: 3 * 2(Midwest) + 1 * 2(East) + 2 * 1(West) = 10 rows.
  4. Midwest (11 rows):
    • div_one has 3 rows with Midwest, and 1 row with East.
    • coach has 3 rows with Midwest, and 2 rows with East.
    • For each Midwest row in div_one, it matches with all three Midwest rows in coach, resulting in 3×3=9 matches.
    • The single East row in div_one matches with the 2 East rows in coach, contributing 1×2=2 matches.
    • Total for Midwest: 3 * 3(Midwest) + 1 * 2(East) = 11 rows.
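The four totals above can be reproduced by counting key matches. The sketch below uses only the region counts stated in the explanation (it assumes any other filled-in rows of coach have regions that do not appear in div_one), and merged_row_count is a helper name introduced for illustration.

```python
from collections import Counter

def merged_row_count(left_keys, right_keys):
    # An inner merge produces (left count) * (right count) rows per shared key.
    left, right = Counter(left_keys), Counter(right_keys)
    return sum(left[key] * right[key] for key in left)

div_one_regions = ["Midwest"] * 3 + ["East"] + ["West"] * 2
coach_known = ["Midwest"] * 2 + ["East"] * 2  # counts taken from the explanation

for fill in ["South", "East", "West", "Midwest"]:
    print(fill, merged_row_count(div_one_regions, coach_known + [fill]))
# South 8
# East 9
# West 10
# Midwest 11
```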

Difficulty: ⭐️⭐️⭐️

The average score on this problem was 61%.


Problem 7.2

What is the value of the following Python expression?

coach.merge(coach, on="Coach").merge(coach, on="Coach").shape[0]

Answer: 12

First Merge: The expression starts by merging the coach DataFrame with itself on the “Coach” column. Each row in coach that has the same coach will match with every other row (including itself) that has the same coach.

  • Matching process:
    • Coach Jason (rows 0 and 2): Matches with each other, creating 2×2=4 matches.
    • Coach Jack (row 1): Since this row only matches with itself, it creates 1×1=1 match.
    • Coach Ashley (row 3): Also creates 1×1=1 match.
    • Coach Nick (row 4): Again, 1×1=1 match.
    • Coach Zoe (row 5): Another 1×1=1 match.

Total rows from this merge: 4+1+1+1+1=8 rows.

Second Merge: Now we take the result of the first merge (which has 8 rows) and merge it again with the original coach DataFrame on the “Coach” column.

  • Matching process:
    • Coach Jason 4 rows (from the first merge): each of these rows will match with the 2 rows for “Coach Jason” in the original coach DataFrame. This results in 4×2=8 matches.
    • Coach Jack, Coach Ashley, Coach Nick, and Coach Zoe (each has a single row): each matches only its own row in the original coach DataFrame, contributing 1×1=1 row apiece, or 4 more rows in total.

Total rows from this merge: 8+1+1+1+1=12 rows.

Result: The expression returns the number of rows in the final merged DataFrame, which is 12.
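The row counts for both merges can be confirmed in pandas. This sketch keeps only the "Coach" column, with the names in the row order given in the explanation; the other columns of coach are omitted because they do not affect the number of rows.

```python
import pandas as pd

# Coach names for rows 0 through 5; Jason appears in rows 0 and 2.
coach = pd.DataFrame({
    "Coach": ["Jason", "Jack", "Jason", "Ashley", "Nick", "Zoe"],
})

first = coach.merge(coach, on="Coach")
second = first.merge(coach, on="Coach")

print(first.shape[0])   # 8
print(second.shape[0])  # 12
```

Equivalently, a coach who appears k times contributes k × k × k rows after the two merges: 2 × 2 × 2 = 8 for Jason plus 1 for each of the other four.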


Difficulty: ⭐️⭐️⭐️⭐️⭐️

The average score on this problem was 19%.



Problem 8

There’s been an expansion of Collegiate Mario Kart across the country, so to determine a champion, we’d like to set up a bracket for a Mario Kart tournament and fill it with four college teams from bigkart. We won’t display bigkart for this question, but assume it’s a larger version of kart: there are 50 rows in bigkart, and thus 50 teams. Selection is performed uniformly at random, so that each team has the same chance of being selected. Please leave your answers in unsimplified form: answers of the form (\frac{3}{4}) \cdot (\frac{2}{3}) or \left[1 - (\frac{1}{2})^4\right] are preferred.


Problem 8.1

Assume we populate our tournament by randomly selecting four teams from bigkart with replacement. What is the probability there are no duplicates among the four teams selected for the tournament? Do not simplify your answer.

Answer: \frac{1}{1} \cdot \frac{49}{50} \cdot \frac{48}{50} \cdot \frac{47}{50}

We need to find the probability that there are no duplicates among the four teams selected for the tournament from bigkart with replacement. Since we are selecting four times, we want each selected team to be unique.

  1. First Selection:
    • We can select any of the 50 teams, so the probability is \frac{50}{50} = 1.
  2. Second Selection:
    • To avoid duplicates, we must pick a team different from the first one.
    • The probability is \frac{49}{50}.
  3. Third Selection:
    • We must pick a team different from the first two selections.
    • The probability is \frac{48}{50}.
  4. Fourth Selection:
    • We must pick a team different from the first three selections.
    • The probability is \frac{47}{50}.

The total probability that there are no duplicates among the four teams selected is the product of these probabilities: \frac{1}{1} \cdot \frac{49}{50} \cdot \frac{48}{50} \cdot \frac{47}{50}
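The product above can be evaluated exactly with Python's fractions module. The function name prob_no_duplicates is introduced here for illustration.

```python
from fractions import Fraction

def prob_no_duplicates(num_teams, num_draws):
    """Chance that num_draws uniform draws with replacement are all distinct."""
    prob = Fraction(1)
    for already_picked in range(num_draws):
        # The next draw must avoid every team picked so far.
        prob *= Fraction(num_teams - already_picked, num_teams)
    return prob

answer = prob_no_duplicates(50, 4)
print(answer)         # 13818/15625
print(float(answer))  # 0.884352
```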


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 45%.


Problem 8.2

Now, assume we populate our tournament by randomly selecting four teams from bigkart without replacement. Additionally, assume 30 teams are from Division 1 and 20 teams are from Division 2. What is the probability that there is at least one Division 2 team among the four teams selected for the tournament? Do not simplify your answer.

Answer: 1 - \frac{30}{50} \cdot \frac{29}{49} \cdot \frac{28}{48} \cdot \frac{27}{47}

We are selecting four teams from bigkart without replacement, and we want the probability that at least one Division 2 team is selected, which we denote P(A). We know that there are 30 Division 1 teams and 20 Division 2 teams.

We first calculate the complement probability P(A^c), the probability that all four teams are from Division 1.

  1. First Selection:
    • Probability of picking a Division 1 team is \frac{30}{50}.
  2. Second Selection:
    • After one Division 1 team is selected, there are 29 Division 1 teams left and 49 teams remaining in total.
    • The probability is \frac{29}{49}.
  3. Third Selection:
    • After two Division 1 teams are selected, there are 28 Division 1 teams left and 48 teams remaining in total.
    • The probability is \frac{28}{48}.
  4. Fourth Selection:
    • After three Division 1 teams are selected, there are 27 Division 1 teams left and 47 teams remaining in total.
    • The probability is \frac{27}{47}.

The probability that all four teams are from Division 1 is: \frac{30}{50} \cdot \frac{29}{49} \cdot \frac{28}{48} \cdot \frac{27}{47}

To find the probability of at least one Division 2 team being selected, we use P(A) = 1 - P(A^c): 1 - \frac{30}{50} \cdot \frac{29}{49} \cdot \frac{28}{48} \cdot \frac{27}{47}
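The complement calculation can be carried out exactly as follows. This is a sketch; prob_all_from_group is a helper name introduced for illustration.

```python
from fractions import Fraction

def prob_all_from_group(group_size, total, num_draws):
    """Chance that num_draws draws without replacement all come from one group."""
    prob = Fraction(1)
    for drawn in range(num_draws):
        # Each draw removes one team from both the group and the overall pool.
        prob *= Fraction(group_size - drawn, total - drawn)
    return prob

p_all_division_1 = prob_all_from_group(30, 50, 4)
p_at_least_one_division_2 = 1 - p_all_division_1

print(p_all_division_1)           # 783/6580
print(p_at_least_one_division_2)  # 5797/6580
```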


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 57%.



Problem 9

We want to estimate the probability that the University of Michigan is among the four teams selected when schools are selected without replacement.

schools = np.array(kart.get("University"))
mystery_one = 0
num_trials = 10000
for i in np.arange(num_trials):
    bracket = __(i)__
    if "Michigan" in bracket:
        mystery_one = __(ii)__
mystery_two = mystery_one / num_trials


Problem 9.1

Fill in the blanks to complete a simulation.

Answer:

  • (i) np.random.choice(schools, 4, replace=False)
  • (ii) mystery_one + 1

(i) np.random.choice() allows us to select random samples from the schools array. The general syntax is np.random.choice(arr, size, replace=True, p=[p_0, p_1, …]):

  • arr: The array to sample from. In this case, schools.
  • size: The number of elements to draw. In this case, 4, because we want to select four teams.
  • replace: A boolean value that determines whether the same item can be chosen more than once. Since teams cannot be picked more than once in a single selection, we use replace=False.
  • p: An optional parameter that sets the probabilities for each item in arr. If omitted, the function assumes each item has an equal chance of being selected. In this problem, we do not need to specify p because we are performing uniform random sampling, meaning every team has an equal chance.

(ii) This blank updates mystery_one, which counts how many trials had “Michigan” among the randomly selected teams. In each iteration of the loop, if “Michigan” is in the randomly selected bracket, we increment mystery_one by 1, so the completed line reads mystery_one = mystery_one + 1.
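With both blanks filled in, the simulation can be run end to end. The sketch below substitutes a made-up schools array for np.array(kart.get("University")): it assumes 12 distinct names, one of which is "Michigan", and the other names are placeholders. A seed is set only so that reruns give the same estimate.

```python
import numpy as np

np.random.seed(10)  # fixed seed for reproducibility; not part of the exam code

# Hypothetical stand-in for the real column: 12 distinct school names.
schools = np.array(["Michigan"] + ["School " + str(i) for i in range(1, 12)])

mystery_one = 0
num_trials = 10000
for i in np.arange(num_trials):
    bracket = np.random.choice(schools, 4, replace=False)
    if "Michigan" in bracket:
        mystery_one = mystery_one + 1
mystery_two = mystery_one / num_trials

print(mystery_two)  # an estimate; near 4/12 under the 12-distinct-schools assumption
```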

Difficulty: ⭐️⭐️⭐️

The average score on this problem was 67%.


Problem 9.2

What is the meaning of mystery_two after the code has finished running?

  • The number of times Michigan was in the tournament
  • The number of trials we ran
  • The proportion of times Michigan was in the tournament
  • None of these answers is what mystery_two represents

Answer: The proportion of times Michigan was in the tournament

If “Michigan” is found in bracket, mystery_one is incremented by 1. This means mystery_one keeps track of how many times Michigan appears in the four selected teams across all 10,000 trials. Therefore, at the end of the loop, mystery_one contains the total number of trials in which Michigan was selected.

mystery_two is calculated as mystery_one / num_trials. Since mystery_one is the count of trials where Michigan was selected, dividing it by num_trials (the total number of trials) gives the proportion of trials where Michigan was chosen among the four teams.


Difficulty: ⭐️⭐️

The average score on this problem was 81%.


Problem 9.3

For the next two parts only, imagine we wanted to simulate a 16-team tournament, where teams are selected with replacement. Which blank should be filled in?

Answer: blank (i)

When simulating a 16-team tournament where teams are selected with replacement, blank (i) is the one to change, because that is where the selection occurs. We need to adjust this line to select more teams (16) and to allow replacement.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 67%.


Problem 9.4

What code should be used to fill in the blank you selected above?

Answer: np.random.choice(schools, 16, replace=True)

We change size=4 to size=16 to select 16 teams, and replace=True allows the same team to be selected multiple times within a single trial.
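A short sketch of why replace=True matters here, again assuming a made-up array of 12 distinct school names in place of the real column: drawing 16 teams from 12 forces repeats, and asking for 16 without replacement is rejected by NumPy.

```python
import numpy as np

np.random.seed(10)  # fixed seed for reproducibility

# Hypothetical stand-in for np.array(kart.get("University")).
schools = np.array(["Michigan"] + ["School " + str(i) for i in range(1, 12)])

bracket = np.random.choice(schools, 16, replace=True)
print(len(bracket))             # 16
print(len(np.unique(bracket)))  # at most 12, so at least one school repeats

try:
    np.random.choice(schools, 16, replace=False)
except ValueError as err:
    # NumPy cannot draw 16 distinct items from a population of 12.
    print("replace=False fails:", err)
```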


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 45%.


