Winter 2026 Final Exam



Instructor(s): Peter Chi, Sam Lau

This exam was administered in-person. Students were allowed one page of double-sided handwritten notes. No calculators were allowed. Students had 3 hours to take this exam.

This exam covered material from the Winter 2026 offering of DSC 10.


Note (groupby / pandas 2.0): Pandas 2.0+ no longer silently drops columns that can’t be aggregated after a groupby, so code written for older pandas may behave differently or raise errors. In these practice materials we use .get() to select the column(s) we want after .groupby(...).mean() (or other aggregations) so that our solutions run on current pandas. On real exams you will not be penalized for omitting .get() when the old behavior would have produced the same answer.


We will work with a DataFrame called zoo containing information about animals at the San Diego Zoo. Each row corresponds to one animal. The DataFrame includes columns such as:

A preview of zoo is shown below.

Assume we have already run import babypandas as bpd and import numpy as np.


Problem 1

A silly capybara has scrambled up Jeffrey’s Python code! Unscramble the code so that after it runs, jeffrey is a single string containing the name of the exhibit with the largest mean reptile weight.

Available lines (each may be used once, more than once, or not at all):

For each of Lines 1–9, select a line from A–N such that the code produces the desired result when run in order. Your code may not need all 9 lines — use “Not used” for any remaining lines at the end. Line 1 has been filled in for you (E). Not all of the code above will be used, and some lines may be used more than once.

State the correct sequence of line letters for Lines 2–7 (and indicate which of Lines 8–9 are not used).

Answer: Line 1: E (given). Lines 2–7: C, A, N, J, L, H. Lines 8 and 9: Not used.

Working code:

jeffrey = zoo
jeffrey = jeffrey[jeffrey.get('kind') == 'Reptile']
jeffrey = jeffrey.get(['exhibit', 'weight_lb'])
jeffrey = jeffrey.groupby('exhibit')
jeffrey = jeffrey.mean()
jeffrey = jeffrey.sort_values(by='weight_lb', ascending=False)
jeffrey = jeffrey.index[0]

Filter to reptiles, keep only the exhibit and weight_lb columns, group by exhibit, and take the mean weight per exhibit. Sorting in descending order puts the exhibit with the largest mean reptile weight first, so index[0] is its name.


Problem 2

Kate is the San Diego Zoo’s newest zookeeper! She creates the DataFrames first_5 and first_7 using the code below.

first_5 = zoo.take(np.arange(5))
first_7 = zoo.take(np.arange(7))


Problem 2.1

(a) Kate’s first merge call is shown below.

merged_a = first_7.merge(first_7, on='species')

How many rows does merged_a have? Write your answer as a single number in the box, or select Not Enough Information.

Answer: 17

In first_7, there are 2 giant pandas, 3 African elephants, and 2 polar bears. Merging first_7 with itself on species pairs every row with every row of the same species, creating \(2^2 + 3^2 + 2^2 = 17\) rows.


Problem 2.2

(b) Kate’s second merge call is shown below.

merged_b = merged_a.merge(first_5, on='species')

How many rows does merged_b have? Write your answer as a single number in the box, or select Not Enough Information.

Answer: 35

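The merge from part (a) has 17 rows: \(2^2 = 4\) panda rows, \(3^2 = 9\) elephant rows, and \(2^2 = 4\) polar bear rows. first_5 contains only the 2 pandas and 3 elephants, so the polar bear rows find no match and are dropped, giving \(4 \cdot 2 + 9 \cdot 3 = 2^3 + 3^3 = 8 + 27 = 35\) rows.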
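The row counts in Problems 2.1 and 2.2 come from how an inner merge pairs rows: each shared key contributes (matches on the left) × (matches on the right) rows. Below is a minimal sketch in plain Python that counts rows this way. The helper name merge_keys and the exact row order of the species column are assumptions for illustration; only the counts (2 pandas, 3 elephants, 2 polar bears) come from the solution above.

```python
from collections import Counter

def merge_keys(left, right):
    """Return the key column of an inner merge of two lists of keys.

    Each key appears (count in left) * (count in right) times,
    mirroring how an inner merge pairs every matching row.
    """
    left_counts, right_counts = Counter(left), Counter(right)
    merged = []
    for key in left_counts:
        # A key missing from `right` has count 0 and contributes no rows.
        merged.extend([key] * (left_counts[key] * right_counts[key]))
    return merged

# Hypothetical species column for the first 7 rows of zoo.
species = ['Giant Panda'] * 2 + ['African Elephant'] * 3 + ['Polar Bear'] * 2
first_7 = species
first_5 = species[:5]   # 2 pandas + 3 elephants, no polar bears

merged_a = merge_keys(first_7, first_7)
merged_b = merge_keys(merged_a, first_5)
print(len(merged_a), len(merged_b))
```

The same counting rule applies to any inner merge on a single column, which is why duplicated keys can make a merged table much longer than either input.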



Problem 3

For this problem, let weight and age be defined by:

weight = np.array(zoo.get('weight_lb'))
age = np.array(zoo.get('age'))


Problem 3.1

(a) Avi wants to quickly estimate each animal’s weight in kilograms instead of pounds using the shortcut \(1 \text{ lb} \approx 0.45 \text{ kg}\).

Which of the following expressions output an array of the animal weights in kilograms using this shortcut? Select all that apply.

Answer: A and C

  • A: Multiplies pounds by 0.45 directly.
  • C: weight * 0.90 / 2 equals 0.45 * weight.
  • B subtracts a scalar (wrong).
  • D divides instead of multiplying.
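The reasoning in the bullets above can be checked with element-wise array arithmetic. This is a small sketch with made-up weights (the values below are hypothetical and not taken from zoo), and the dividing expression is only an illustrative stand-in for an option that divides instead of multiplying.

```python
import numpy as np

# Hypothetical weights in pounds, for illustration only.
weight = np.array([220.0, 10000.0, 900.0])

direct = weight * 0.45          # multiply pounds by 0.45
two_step = weight * 0.90 / 2    # same factor of 0.45, applied in two steps
divided = weight / 0.45         # dividing makes every value larger, not smaller

print(np.allclose(direct, two_step))
```

Because arithmetic on a NumPy array is applied to every element, both multiplying expressions return an array of the same length as weight.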


Problem 3.2

(b) Michelle wants to find the range of age, i.e., the age of the oldest zoo animal minus the age of the youngest zoo animal.

Which of the following expressions incorrectly computes this value?

Answer: B

age - age.max() is always non-positive, so .min() gives the negative of the range, not the range.
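A short sketch of why shifting by the maximum flips the sign. The ages are hypothetical, and (age - age.max()).min() is assumed to be the form of the incorrect expression described above.

```python
import numpy as np

# Hypothetical ages, for illustration only.
age = np.array([3, 12, 7, 25, 1])

age_range = age.max() - age.min()        # oldest minus youngest
shifted_min = (age - age.max()).min()    # every shifted entry is <= 0

print(age_range, shifted_min)
```

The two results have the same magnitude and opposite signs, so the second expression would need a leading minus sign (or abs) to give the range.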



Problem 4

Bianca is writing helper functions for the San Diego Zoo’s animal tracking tools. Complete the code below so that mystery returns a string made of the first letter of each word in a species name. Then, Bianca can use blank (b) to create bianca, a copy of zoo with one extra column called initials. For example, mystery("Giant Panda") should return "GP".

def mystery(value):
    _____(a)_____
    return result

bianca = zoo.assign(initials=_____(b)_____)


Problem 4.1

(a) Which snippets could replace blank (a)? Select all that apply.

Snippet 1.

result = ""
for word in value:
    result = result + word[0]

Snippet 2.

words = value.split()
result = ""
for word in words:
    result = result + word[0]

Snippet 3.

temp = bpd.DataFrame().assign(word=value.split())
result = temp.get('word').get(0)

Snippet 4.

words = np.array(value.split())
for i in np.arange(len(words)):
    words[i] = words[i][0]
result = ''.join(words)

Answer: Snippet 2 and Snippet 4

  • Snippet 2: Splits the species name into words, then concatenates the first letter of each word.
  • Snippet 4: Puts words in a NumPy array, replaces each word by its first character, then joins into one string.
  • Snippet 1: Loops over characters of value, not words, so it does not produce initials.
  • Snippet 3: Never takes the first letter of any word; at best it retrieves a single whole word from the word column, not a string of initials.
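To see the difference between looping over words and looping over characters, here is a small sketch that wraps Snippet 2 and Snippet 1 in two functions. The function names are made up for this comparison; the bodies are the snippets from the problem.

```python
def initials_by_word(value):
    # Snippet 2: split into words, then take the first letter of each word.
    words = value.split()
    result = ""
    for word in words:
        result = result + word[0]
    return result

def initials_by_char(value):
    # Snippet 1: iterating over a string yields single characters,
    # so word[0] is just that character and the loop rebuilds the input.
    result = ""
    for word in value:
        result = result + word[0]
    return result

print(initials_by_word("Giant Panda"))   # GP
print(initials_by_char("Giant Panda"))   # Giant Panda
```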


Problem 4.2

(b) Which snippets could replace blank (b)? Select all that apply.

Answer: Only zoo.get('species').apply(mystery) (the first option).

Apply mystery to each value in the species column. The other options apply mystery incorrectly or call a method that does not exist.



Problem 5

Sam has subsetted zoo to include only the animals whose status is "Endangered". Suppose that in Sam’s subset:


Problem 5.1

(a) What is the probability that a randomly selected animal from Sam’s sample lives in the Asian Passage but is not a mammal? Give your answer as a simplified fraction, or select Not Enough Information.

Answer: ()

Let (A) = mammal, (B) = Asian Passage. (P(B) = ), (P(A B) = ). Then (P(B A) = P(B) - P(A B) = - = ).


Problem 5.2

(b) Sam chooses two animals uniformly at random with replacement from his sample. What is the probability that at least one of the two animals is a mammal living in the Asian Passage? Give your answer as a simplified fraction, or select Not Enough Information.

Answer: ()

One draw: (P(A B) = ). Probability neither draw is that event: (()^2 = ). So (P() = 1 - = ).
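The two rules used in this problem are \(P(B \text{ and not } A) = P(B) - P(A \text{ and } B)\) and \(P(\text{at least one}) = 1 - P(\text{none})\). The sketch below applies them with exact fractions. The probabilities in it are hypothetical placeholders chosen only for illustration; they are not the values from Sam's subset.

```python
from fractions import Fraction

# Hypothetical probabilities (NOT the exam's values).
p_asian = Fraction(1, 2)              # P(B): lives in the Asian Passage
p_mammal_and_asian = Fraction(1, 5)   # P(A and B): mammal in the Asian Passage

# Part (a) pattern: in the Asian Passage but not a mammal.
p_asian_not_mammal = p_asian - p_mammal_and_asian

# Part (b) pattern: two independent draws with replacement.
p_neither = (1 - p_mammal_and_asian) ** 2
p_at_least_one = 1 - p_neither

print(p_asian_not_mammal, p_at_least_one)
```

Using Fraction keeps every result as a simplified fraction, which matches the answer format the problem asks for.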



Problem 6

Sofia gets a summer internship at the San Diego Zoo. Her first task is to study the relationship between conservation status and how much food an animal eats each day. She treats zoo as an SRS of animals from zoos worldwide. Complete the code below so that bootstrap(zoo) returns a 75% bootstrap confidence interval for the statistic:

median daily food of Endangered animals − median daily food of Vulnerable animals.

def bootstrap(df):
    vulnerable = df[df.get('status') == 'Vulnerable']
    endangered = df[df.get('status') == 'Endangered']
    median_diffs = []
    for i in range(1000):
        _____(a)_____
    return np.percentile(median_diffs, _____(b)_____)


Problem 6.1

(a) Which snippets could replace blank (a)? Select all that apply.

Answer: The second and third options (bootstrap within each status group, then difference of medians). The first samples from the full df instead of within each group.


Problem 6.2

(b) Which expression should replace blank (b)?

Answer: [12.5, 87.5] — use np.percentile(median_diffs, [12.5, 87.5]) for the middle 75% (12.5% in each tail).
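Putting both blanks together, the procedure resamples within each status group, records the difference of medians, and reads off the 12.5th and 87.5th percentiles. The sketch below uses plain NumPy arrays instead of babypandas DataFrames. The function name, the seed, and the food values are all assumptions made for illustration.

```python
import numpy as np

def bootstrap_median_diff(endangered, vulnerable, reps=1000, seed=0):
    """75% bootstrap CI for median(endangered) - median(vulnerable)."""
    rng = np.random.default_rng(seed)
    median_diffs = []
    for _ in range(reps):
        # Resample each group separately, with replacement, at its own size.
        e = rng.choice(endangered, size=len(endangered), replace=True)
        v = rng.choice(vulnerable, size=len(vulnerable), replace=True)
        median_diffs.append(np.median(e) - np.median(v))
    # Middle 75%: leave 12.5% in each tail.
    return np.percentile(median_diffs, [12.5, 87.5])

# Hypothetical daily food amounts in pounds.
endangered_food = np.array([12.0, 30.5, 8.0, 45.0, 22.0, 15.5, 60.0])
vulnerable_food = np.array([5.0, 18.0, 9.5, 27.0, 14.0, 11.0])

interval = bootstrap_median_diff(endangered_food, vulnerable_food)
print(interval)
```

Resampling each group at its original size keeps the group sizes fixed across bootstrap samples, which is the property the second and third options share.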



Problem 7

Assume Peter ran bootstrap code similar to lecture to bootstrap the age column from zoo 5,000 times and stored the median of each bootstrap sample in ages5000. He then bootstraps the median age 1,000 more times and stores results in ages1000:

>>> ages1000
array([5, 9, 12, ..., 16, 19, 12], shape=(1000,))
>>> ages5000
array([12, 7, 13, ..., 11, 11, 16], shape=(5000,))


Problem 7.1

(a) Suppose Peter creates a 95% bootstrap confidence interval using ages5000. Select all true statements about this interval.

Answer: Only the fourth statement.

  • CLT width formulas apply to means, not medians (first false).
  • Midpoint of percentile interval need not equal the sample median (second false).
  • Bootstrap still needs a representative sample for an external population (third false).
  • If zoo is the entire SD Zoo census, the population median is known — no CI needed (fourth true).


Problem 7.2

(b) Select all true statements about ages5000 and ages1000.

Answer: The third and fourth statements.

More resamples improve Monte Carlo accuracy of percentiles but do not systematically change the variance of the bootstrap distribution of the statistic (first and second false).



Problem 8

Raymond wants to estimate the average daily food consumption (pounds) of animals at the San Diego Zoo. He runs my_sample = zoo.sample(30). His sample consumes 1200 pounds of food per day in total, with a sample standard deviation of approximately 88.

Use \(\sqrt{30} \approx 5.5\) where hints suggest.


Problem 8.1

(a) A 95% CLT-based confidence interval for the average daily food can be written as \([x, 9x]\) where \(x > 0\). What is \(x\)? (Simplified fraction or whole number, or Not Enough Information.)

Answer: 8

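Width \(= 9x - x = 8x\). For a 95% CLT interval, the width is 4 SDs of the sample mean: \(4 \cdot \frac{88}{\sqrt{30}} \approx 4 \cdot 16 = 64\), so \(8x = 64 \Rightarrow x = 8\).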


Problem 8.2

(b) Let \(z > 0\) so that some normal tail fraction uses \(z\) SDs. With the same sample, Raymond makes a valid CLT-based CI whose right endpoint is \(\frac{5}{4}\) times the midpoint. What is \(z\)? (Simplified fraction or whole number, or Not Enough Information.)

(Hint: ( ).)

Answer: \(\frac{5}{8}\)

Sample mean \(= 1200/30 = 40\). Right endpoint \(= 50\), so margin \(= 50 - 40 = 10 = z \cdot \frac{88}{\sqrt{30}} \approx 16z\), hence \(z = 10/16 = 5/8\).
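The arithmetic for parts (a) and (b) can be sketched in a few lines of plain Python. This assumes the approximation \(\sqrt{30} \approx 5.5\), which reproduces the values 16 and 64 used in the solutions, and the course convention that a 95% CLT interval spans 2 SDs of the sample mean on each side. The variable names are illustrative.

```python
n = 30
total_food = 1200
sample_sd = 88
sqrt_n = 5.5    # assumed approximation of sqrt(30)

sample_mean = total_food / n        # midpoint of any CLT interval
sd_of_mean = sample_sd / sqrt_n     # SD of the sample mean

# Part (a): 95% interval is mean +/- 2 SDs of the sample mean.
left = sample_mean - 2 * sd_of_mean
right = sample_mean + 2 * sd_of_mean
x = left

# Part (b): right endpoint is 50, so the margin is 50 - 40.
margin = 50 - sample_mean
z = margin / sd_of_mean

print(left, right, z)
```

The check right == 9 * left confirms that the interval really has the form \([x, 9x]\).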


Problem 8.3

(c) Ella takes an SRS at another zoo and builds a 95% CLT interval for average daily food. Raymond uses 90% with his SD Zoo sample. Ella’s interval is narrower than Raymond’s. Select the best explanation.

Answer: Yes — Ella’s sample might have smaller standard deviation.

A lower confidence level tends to narrow an interval, but Ella’s interval can still be narrower at 95% than Raymond’s at 90% if her sample SD is much smaller or her sample size \(n\) is much larger.


Problem 8.4

(d) Assume 12,000 animals at the SD Zoo. If a CLT interval for the average daily food is ([a, b]), what is the interval for total daily food?

Answer: \([12000a, 12000b]\). Multiply both endpoints by the population size \(N = 12000\).



Problem 9

Punch is a baby Japanese macaque; zookeepers recorded his last 50 interactions (20 friendly). For a typical macaque, 50% of interactions are friendly. Minchan tests whether Punch is treated worse (fewer friendly) using:

Assume total_variation_distance(dist1, dist2) computes TVD as in class. Code skeleton:

A. expected = 50 * 0.5
B. obs = abs(20 - expected)
C. stats = np.array([])
D. for i in np.arange(1000):
E.     n = np.random.multinomial(50, [0.5, 0.5])[0]
F.     stat = abs(n - expected)
G.     stats = np.append(stats, stat)
H. p_value = np.count_nonzero(stats >= obs) / 1000


Problem 9.1

(a) If Minchan runs the code unchanged, select all true statements.

Answer: Only the third statement. Using abs makes the test effectively two-sided, roughly doubling the one-sided p-value.


Problem 9.2

(b) Line B should be:

Answer: obs = 20 - expected (one-sided: count friendly minus expected under null; no abs).


Problem 9.3

(c) Line E should be:

Answer: Left as-is. np.random.multinomial(50, [0.5, 0.5]) simulates 50 interactions under the null, and [0] takes the count of friendly interactions.


Problem 9.4

(d) Line F should be:

Answer: stat = n - expected (match the observed statistic without abs).


Problem 9.5

(e) Line H should be:

Answer: np.count_nonzero(stats <= obs) / 1000 — one-sided alternative “less than 0.5” means more extreme = smaller or equal stat; divide by number of simulations.
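With the corrections from parts (b) through (e) applied, the full one-sided test can be sketched as below. The only deliberate departures from the exam skeleton are the seeded generator (np.random.default_rng, used here so the run is reproducible) and the n_sims name; both are choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(10)   # seeded for reproducibility (an assumption)
n_sims = 1000

expected = 50 * 0.5
obs = 20 - expected                            # Line B without abs
stats = np.array([])
for i in np.arange(n_sims):
    n = rng.multinomial(50, [0.5, 0.5])[0]     # friendly count under the null
    stat = n - expected                        # Line F without abs
    stats = np.append(stats, stat)

# Line H: "fewer friendly" means small statistics are the extreme ones.
p_value = np.count_nonzero(stats <= obs) / n_sims
print(p_value)
```

Keeping the sign of the statistic is what makes the direction of the inequality in the last line meaningful.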



Problem 10

Austin tests whether mammals and birds have the same proportion of “fast” animals (max speed > 40 mph). He uses a DataFrame df with:

df keeps only rows of zoo where kind is Mammal or Bird. Assume df is representative of all mammals and birds for inference where stated.


Problem 10.1

(a) Select all valid statements of the null hypothesis for Austin’s permutation test.

Answer: First, second, and fifth — population-level “same distribution / same proportion / no association.” Not “in df only,” not speeds, not 0.5.


Problem 10.2

(b) Select the valid alternative hypothesis.

Answer: First. The difference in proportions is \(\neq 0\) (two-sided). The other options condition on the wrong thing or use the wrong null value.


Problem 10.3

(c) Select a valid test statistic for this test.

Answer: None of the choices above are valid — for a two-sided test on proportions you need a symmetric statistic (e.g. absolute difference in proportions or TVD on is_fast by kind); signed difference is for one-sided; other options are wrong per exam solution.


Problem 10.4

(d) Correct way to simulate under the null:

Answer: Shuffle kind — permutation test breaks association under “same distribution.”


Problem 10.5

(e) Consider

def stat(tbl):
    props = tbl.groupby("kind").mean().get("is_fast")
    return np.std(props)

Using stat as the test statistic with 1,000 null simulations, select all true statements.

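Answer: Second, fourth, fifth, sixth. For two values, \(\text{np.std}([p_m, p_b]) = |p_m - p_b|/2\), so the statistic is nonnegative, its null distribution peaks near 0, and it is proportional to both the absolute difference in proportions and the TVD for binary groups.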
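The identity behind this answer is easy to check numerically: np.std computes the population SD by default, and for two numbers that equals half their absolute difference. The proportions below are hypothetical.

```python
import numpy as np

# Hypothetical proportions of fast animals among mammals and birds.
p_m, p_b = 0.35, 0.20

sd = np.std([p_m, p_b])               # population SD of the two proportions
half_abs_diff = abs(p_m - p_b) / 2

print(np.isclose(sd, half_abs_diff))
```

Since the statistic is a fixed multiple of the absolute difference, ranking simulated statistics against the observed one gives the same p-value either way.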


Problem 10.6

(f) Austin wants a 95% CLT interval for the proportion of mammals that are fast, with width at most 0.1 for any true proportion. Minimum number of mammals to sample?

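Answer: 400. In the worst case \(p = 0.5\), the width is \(4 \cdot \frac{\sqrt{0.5 \cdot 0.5}}{\sqrt{n}} = \frac{2}{\sqrt{n}}\); requiring \(\frac{2}{\sqrt{n}} \le 0.1\) gives \(\sqrt{n} \ge 20\), so \(n \ge 400\).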
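A brute-force check of this sample size, assuming the 95% CLT interval for a proportion has width 4 SDs of the sample proportion and that \(p = 0.5\) is the worst case. The function name and the small tolerance guarding against floating-point rounding are choices made for this sketch.

```python
import math

def worst_case_width(n):
    """Width of a 95% CLT interval for a proportion when p = 0.5."""
    # 4 * sqrt(0.5 * 0.5) / sqrt(n) simplifies to 2 / sqrt(n).
    return 4 * 0.5 / math.sqrt(n)

n = 1
# Increase n until the worst-case width is at most 0.1.
while worst_case_width(n) > 0.1 + 1e-12:
    n += 1

print(n)
```

Any other true proportion has a smaller SD than \(p = 0.5\), so a sample size that works in the worst case works for every proportion.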



Problem 11

Hemanth fits a regression line predicting daily_food_lb from weight_lb:

\[\text{predicted daily\_food\_lb} = 0.04 \cdot \text{weight\_lb} + 2\]

Given: \(r = 0.8\), mean of weight_lb is 200, SD of weight_lb is 100, SD of daily_food_lb is 5.


Problem 11.1

(a) Select all statements that must be true.

Answer: None of the above — (r=0.8) is not “80% on line”; OLS minimizes MSE; bootstrap could repeat rows.


Problem 11.2

(b) What is the mean of daily_food_lb? (Single number.)

Answer: 10. The regression line passes through \((\bar{x}, \bar{y})\): \(0.04 \cdot 200 + 2 = 10\).


Problem 11.3

(c) Convert both variables to kg (\(1\) lb \(\approx 0.45\) kg) and refit predicting daily_food_kg from weight_kg. Compared to the original line, does each increase, decrease, or stay the same?

Answer: Slope same, intercept decreases, r same. Scaling both axes by 0.45 leaves the slope and \(r\) unchanged, while the intercept scales with \(y\): \(2 \cdot 0.45 = 0.9\).


Problem 11.4

(d) Standardize weight_lb only, then refit predicting daily_food_lb. Compared to the original line, does each increase, decrease, or stay the same?

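Answer: Slope increases, intercept increases, r same. The new slope is \(r \cdot SD_y = 0.8 \cdot 5 = 4\), measured in lb of food per 1 SD of \(x\) (equivalently \(0.04 \cdot 100 = 4\)); the new intercept is \(\bar{y} = 10\).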
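The answers to parts (b) through (d) all follow from the summary statistics, so they can be checked without any data. The sketch below is plain Python; the variable names and the 0.45 conversion factor (the lb-to-kg shortcut used earlier in the exam) are the only assumptions.

```python
import math

r = 0.8
mean_x, sd_x = 200, 100      # weight_lb
sd_y = 5                     # daily_food_lb
slope, intercept = 0.04, 2   # original regression line

# Sanity check: slope = r * SD_y / SD_x.
slope_from_r = r * sd_y / sd_x

# Part (b): the line passes through (mean_x, mean_y).
mean_y = slope * mean_x + intercept

# Part (c): scaling x and y by the same factor k keeps the slope
# and multiplies the intercept by k.
k = 0.45
slope_kg = (k * sd_y) / (k * sd_x) * r
intercept_kg = k * mean_y - slope_kg * (k * mean_x)

# Part (d): standardizing x gives mean 0 and SD 1 for the predictor.
slope_su = r * sd_y / 1
intercept_su = mean_y - slope_su * 0

print(mean_y, slope_kg, intercept_kg, slope_su, intercept_su)
```

In part (d) the intercept equals the mean of daily_food_lb because a standardized predictor has mean 0, so the point of averages sits on the vertical axis.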


Problem 11.5

(e) Bootstrap 1000 times; each time predict daily_food_lb for 150 lb and 400 lb animals. For which weight will the 1000 predictions vary more?

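Answer: 400 lb. The bootstrapped lines pivot near \((\bar{x}, \bar{y})\), so a prediction farther from \(\bar{x} = 200\) has a higher variance: 400 lb is 200 lb from the mean, while 150 lb is only 50 lb away.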



Problem 12

Congratulations on finishing DSC 10! Before you turn in your exam:

  1. Make sure you’ve written your PID on each page where required.
  2. Fill bubbles and squares completely.

If you’d like, you may draw a picture about DSC 10 in the space provided on the paper exam.

(This question is not graded.)

(No graded solution.)

