Discussion 9: Total Variation Distance and Permutation Testing



The problems in this worksheet are taken from past exams. Work on them on paper, since the exams you take in this course will also be on paper.

We encourage you to complete this worksheet in a live discussion section. Solutions will be made available after all discussion sections have concluded. You don’t need to submit your answers anywhere.

Note: We do not plan to cover all problems here in the live discussion section; the problems we don’t cover can be used for extra practice.


Problem 1

Let’s suppose there are 4 different types of shots a basketball player can take – layups, midrange shots, threes, and free throws.

The DataFrame breakdown has 4 rows and 50 columns – one row for each of the 4 shot types mentioned above, and one column for each of 50 different players. Each column of breakdown describes the distribution of shot types for a single player.

The first few columns of breakdown are shown below.

For instance, 30% of Kelsey Plum’s shots are layups, 30% of her shots are midrange shots, 20% of her shots are threes, and 20% of her shots are free throws.
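Recall that the total variation distance between two categorical distributions is half the sum of the absolute differences in their proportions. As a quick illustration, here is that computation on Kelsey Plum's column and a second, made-up column (not any actual player from breakdown):

```python
import numpy as np

# Two shot distributions over (layups, midrange, threes, free throws).
# The first matches the Kelsey Plum column described above; the
# second is hypothetical, for illustration only.
plum = np.array([0.3, 0.3, 0.2, 0.2])
other = np.array([0.2, 0.4, 0.3, 0.1])

# TVD: half the sum of absolute differences between the proportions.
tvd_value = np.abs(plum - other).sum() / 2
print(round(tvd_value, 3))  # 0.2
```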


Problem 1.1

Below, we’ve drawn an overlaid bar chart showing the shot distributions of Kelsey Plum and Chiney Ogwumike, a player on the Los Angeles Sparks.


What is the total variation distance (TVD) between Kelsey Plum’s shot distribution and Chiney Ogwumike’s shot distribution? Give your answer as a proportion between 0 and 1 (not a percentage) rounded to three decimal places.


Problem 1.2

Recall that breakdown has information for 50 different players. We want to find the player whose shot distribution is the most similar to Kelsey Plum’s, i.e. the one with the lowest TVD from Kelsey Plum’s shot distribution.

Fill in the blanks below so that most_sim_player evaluates to the name of the player with the most similar shot distribution to Kelsey Plum. Assume that the column named 'Kelsey Plum' is the first column in breakdown (and again that breakdown has 50 columns total).

most_sim_player = ''
lowest_tvd_so_far = __(a)__
other_players = np.array(breakdown.columns).take(__(b)__)
for player in other_players:
    player_tvd = tvd(breakdown.get('Kelsey Plum'),
                     breakdown.get(player))
    if player_tvd < lowest_tvd_so_far:
        lowest_tvd_so_far = player_tvd
        __(c)__
  1. What goes in blank (a)?

  2. What goes in blank (b)?

  3. What goes in blank (c)?
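The loop above relies on a helper function tvd that the problem assumes already exists. A minimal sketch consistent with how it's called (taking two arrays or Series of proportions) might look like this:

```python
import numpy as np

def tvd(dist1, dist2):
    """Total variation distance between two categorical
    distributions given as arrays/Series of proportions."""
    return np.abs(np.asarray(dist1) - np.asarray(dist2)).sum() / 2
```

For instance, `tvd(np.array([0.5, 0.5]), np.array([0.9, 0.1]))` evaluates to 0.4 (up to floating-point rounding).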



Problem 2

You survey 100 DSC majors and 140 CSE majors to ask them which video streaming service they use most. The resulting distributions are given in the table below. Note that each column sums to 1.

Service              DSC Majors   CSE Majors
Netflix              0.4          0.35
Hulu                 0.25         0.2
Disney+              0.1          0.1
Amazon Prime Video   0.15         0.3
Other                0.1          0.05

For example, 20% of CSE Majors said that Hulu is their most used video streaming service. Note that if a student doesn’t use video streaming services, their response is counted as Other.


Problem 2.1

What is the total variation distance (TVD) between the distribution for DSC majors and the distribution for CSE majors? Give your answer as an exact decimal.


Problem 2.2

Suppose we only break down video streaming services into four categories: Netflix, Hulu, Disney+, and Other (which now includes Amazon Prime Video). Now we recalculate the TVD between the two distributions. How does the TVD now compare to your answer to part (a)?



Problem 3

In some cities, the number of sunshine hours per month is relatively consistent throughout the year. São Paulo, Brazil, is one such city: in every month of the year, its number of sunshine hours per month is somewhere between 139 and 173. In New York City, on the other hand, it ranges from 139 to 268.

Gina and Abel, both San Diego natives, are interested in assessing how “consistent” the number of sunshine hours per month in San Diego appears to be. Specifically, they’d like to test the following hypotheses:

As their test statistic, Gina and Abel choose the total variation distance. To simulate samples under the null, they will sample from a categorical distribution with 12 categories — January, February, and so on, through December — each of which has an equal probability of being chosen.
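Sampling counts from a categorical distribution with equally likely categories is exactly what np.random.multinomial does. As a generic illustration of the technique (using a fair six-sided die rather than the twelve months of this problem):

```python
import numpy as np

np.random.seed(42)  # fixed seed so this sketch is reproducible

# 600 rolls of a fair six-sided die: one count per face.
counts = np.random.multinomial(600, np.ones(6) / 6)

# Dividing the counts by the total converts them to proportions,
# i.e. a simulated categorical distribution.
proportions = counts / 600
print(counts.sum())  # 600
```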


Problem 3.1

In order to run their hypothesis test, Gina and Abel need a way to calculate their test statistic. Below is an incomplete implementation of a function that computes the TVD between two arrays of length 12, each of which represent a categorical distribution.

    def calculate_tvd(dist1, dist2):
        return np.mean(np.abs(dist1 - dist2)) * ____

Fill in the blank so that calculate_tvd works as intended.


Moving forward, assume that calculate_tvd works correctly.

Now, complete the implementation of the function uniform_test, which takes in an array observed_counts of length 12 containing the number of sunshine hours each month in a city and returns the p-value for the hypothesis test stated at the start of the question.

    def uniform_test(observed_counts):
        # The values in observed_counts are counts, not proportions!
        total_count = observed_counts.sum()
        uniform_dist = __(b)__
        tvds = np.array([])
        for i in np.arange(10000):
            simulated = __(c)__
            tvd = calculate_tvd(simulated, __(d)__)
            tvds = np.append(tvds, tvd)
        return np.mean(tvds __(e)__ calculate_tvd(uniform_dist, __(f)__))


Problem 3.2

What goes in blank (b)? (Hint: The function np.ones(k) returns an array of length k in which all elements are 1.)



Problem 3.3

What goes in blank (c)?



Problem 3.4

What goes in blank (d)?



Problem 3.5

What goes in blank (e)?


Problem 3.6

What goes in blank (f)?



Problem 4

We will use data from the 2021 Women’s National Basketball Association (WNBA) season for the next several problems. In basketball, players score points by shooting the ball into a hoop. The team that scores the most points wins the game.

Kelsey Plum, a WNBA player, attended La Jolla Country Day School, which is adjacent to UCSD’s campus. Her current team is the Las Vegas Aces (three-letter code 'LVA'). In 2021, the Las Vegas Aces played 31 games, and Kelsey Plum played in all 31.

The DataFrame plum contains her stats for all games the Las Vegas Aces played in 2021. The first few rows of plum are shown below (though the full DataFrame has 31 rows, not 5):

Each row in plum corresponds to a single game. For each game, we have:

Consider the definition of the function diff_in_group_means:

def diff_in_group_means(df, group_col, num_col):
    s = df.groupby(group_col).mean().get(num_col)
    return s.loc[False] - s.loc[True]
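Since babypandas mirrors the pandas methods used here (groupby, mean, get, loc), the function can be sanity-checked on a tiny pandas DataFrame with a Boolean grouping column and made-up values (this toy data is hypothetical, not from plum):

```python
import pandas as pd

def diff_in_group_means(df, group_col, num_col):
    s = df.groupby(group_col).mean().get(num_col)
    return s.loc[False] - s.loc[True]

# Hypothetical data: a Boolean 'Won' column and an 'AST' count.
toy = pd.DataFrame({'Won': [True, True, False, False],
                    'AST': [5, 7, 3, 5]})

# Mean AST when Won is False (4.0) minus when Won is True (6.0).
print(diff_in_group_means(toy, 'Won', 'AST'))  # -2.0
```

Note the order of subtraction: the function computes the False-group mean minus the True-group mean, which is why it can come out negative even when the True group is described as having the larger average.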


Problem 4.1

It turns out that Kelsey Plum averages 0.61 more assists in games that she wins (“winning games”) than in games that she loses (“losing games”). Fill in the blanks below so that observed_diff evaluates to -0.61.

observed_diff = diff_in_group_means(plum, __(a)__, __(b)__)
  1. What goes in blank (a)?

  2. What goes in blank (b)?


Problem 4.2

After observing that Kelsey Plum averages more assists in winning games than in losing games, we become interested in conducting a permutation test for the following hypotheses:

To conduct our permutation test, we place the following code in a for-loop.


won = plum.get('Won')
ast = plum.get('AST')
shuffled = plum.assign(Won_shuffled=np.random.permutation(won)) \
               .assign(AST_shuffled=np.random.permutation(ast))

Which of the following options does not compute a valid simulated test statistic for this permutation test?


Problem 4.3

Suppose we generate 10,000 simulated test statistics, using one of the valid options from part 1. The empirical distribution of test statistics, with a red line at observed_diff, is shown below.

Roughly one-quarter of the area of the histogram above is to the left of the red line. What is the correct interpretation of this result?



Problem 5

An IKEA fan created an app where people can log the amount of time it took them to assemble their IKEA furniture. The DataFrame app_data has a row for each product build that was logged on the app. The column 'product' contains the name of the product, and the column 'minutes' contains integer values representing the number of minutes it took to assemble each product.

You are browsing the IKEA showroom, deciding whether to purchase the BILLY bookcase or the LOMMARP bookcase. You are concerned about the amount of time it will take to assemble your new bookcase, so you look up the assembly times reported in app_data. Thinking of the data in app_data as a random sample of all IKEA purchases, you want to perform a permutation test to test the following hypotheses.

Null Hypothesis: The assembly time for the BILLY bookcase and the assembly time for the LOMMARP bookcase come from the same distribution.

Alternative Hypothesis: The assembly time for the BILLY bookcase and the assembly time for the LOMMARP bookcase come from different distributions.
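Before working through the parts below, it may help to recall the general shape of a permutation test: shuffle the group labels, recompute the statistic, and repeat. A generic sketch with hypothetical arrays (not the actual app_data):

```python
import numpy as np

np.random.seed(0)  # fixed seed so this sketch is reproducible

# Hypothetical assembly times and product labels.
minutes = np.array([25, 30, 42, 35, 50, 28, 33, 47])
labels = np.array(['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'])

def mean_diff(labels, values):
    # Difference in group means: group A minus group B.
    return values[labels == 'A'].mean() - values[labels == 'B'].mean()

observed = mean_diff(labels, minutes)

# Shuffle the labels many times to simulate the statistic under the null.
diffs = np.array([])
for _ in range(1000):
    shuffled = np.random.permutation(labels)
    diffs = np.append(diffs, mean_diff(shuffled, minutes))

# Two-sided p-value: fraction of shuffles at least as extreme as observed.
p_value = np.mean(np.abs(diffs) >= np.abs(observed))
```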


Problem 5.1

Suppose we query app_data to keep only the BILLY bookcases, then average the 'minutes' column. In addition, we separately query app_data to keep only the LOMMARP bookcases, then average the 'minutes' column. If the null hypothesis is true, which of the following statements about these two averages is correct?


Problem 5.2

For the permutation test, we’ll use as our test statistic the average assembly time for BILLY bookcases minus the average assembly time for LOMMARP bookcases, in minutes.

Complete the code below to generate one simulated value of the test statistic in a new way, without using np.random.permutation.

billy = (app_data.get('product') == 
        'BILLY Bookcase, white, 31 1/2x11x79 1/2')
lommarp = (app_data.get('product') == 
          'LOMMARP Bookcase, dark blue-green, 25 5/8x78 3/8')
billy_lommarp = app_data[billy|lommarp]
billy_mean = np.random.choice(billy_lommarp.get('minutes'), billy.sum(), replace=False).mean()
lommarp_mean = _________
billy_mean - lommarp_mean

What goes in the blank?



Problem 6

The DataFrame apps contains application data for a random sample of 1,000 applicants for a particular credit card from the 1990s. The columns are:

The first few rows of apps are shown below, though remember that apps has 1,000 rows.




In apps, our sample of 1,000 credit card applications, applicants who were approved for the credit card have fewer dependents, on average, than applicants who were denied. The mean number of dependents for approved applicants is 0.98, versus 1.07 for denied applicants.

To test whether this difference is purely due to random chance, or whether the distributions of the number of dependents for approved and denied applicants are truly different in the population of all credit card applications, we decide to perform a permutation test.

Consider the incomplete code block below.

def shuffle_status(df):
    shuffled_status = np.random.permutation(df.get("status"))
    return df.assign(status=shuffled_status).get(["status", "dependents"])

def test_stat(df):
    grouped = df.groupby("status").mean().get("dependents")
    approved = grouped.loc["approved"]
    denied = grouped.loc["denied"]
    return __(a)__

stats = np.array([])
for i in np.arange(10000):
    shuffled_apps = shuffle_status(apps)
    stat = test_stat(shuffled_apps)
    stats = np.append(stats, stat)

p_value = np.count_nonzero(__(b)__) / 10000

Below are six options for filling in blanks (a) and (b) in the code above.

           Blank (a)                  Blank (b)
Option 1   denied - approved          stats >= test_stat(apps)
Option 2   denied - approved          stats <= test_stat(apps)
Option 3   approved - denied          stats >= test_stat(apps)
Option 4   np.abs(denied - approved)  stats >= test_stat(apps)
Option 5   np.abs(denied - approved)  stats <= test_stat(apps)
Option 6   np.abs(approved - denied)  stats >= test_stat(apps)

The correct way to fill in the blanks depends on how we choose our null and alternative hypotheses.


Problem 6.1

Suppose we choose the following pair of hypotheses.

Which of the six presented options could correctly fill in blanks (a) and (b) for this pair of hypotheses? Select all that apply.


Problem 6.2

Now, suppose we choose the following pair of hypotheses.

Which of the six presented options could correctly fill in blanks (a) and (b) for this pair of hypotheses? Select all that apply.


Problem 6.3

Option 6 from the start of this question is repeated below.

           Blank (a)                  Blank (b)
Option 6   np.abs(approved - denied)  stats >= test_stat(apps)

We want to create a new option, Option 7, that replicates the behavior of Option 6, but with blank (a) filled in as shown:

           Blank (a)          Blank (b)
Option 7   approved - denied

Which expression below could go in blank (b) so that Option 7 is equivalent to Option 6?



Problem 6.4

In our implementation of this permutation test, we followed the procedure outlined in lecture to draw new pairs of samples under the null hypothesis and compute test statistics — that is, we randomly assigned each row to a group (approved or denied) by shuffling one of the columns in apps, then computed the test statistic on this random pair of samples.

Let’s now explore an alternative solution to drawing pairs of samples under the null hypothesis and computing test statistics. Here’s the approach:

  1. Shuffle, i.e. re-order, the rows of the DataFrame.
  2. Use the values at the top of the resulting "dependents" column as the new “denied” sample, and the values at the bottom of the resulting "dependents" column as the new “approved” sample. Note that we don’t necessarily split the DataFrame exactly in half — the sizes of these new samples depend on the number of “denied” and “approved” values in the original DataFrame!

Once we generate our pair of random samples in this way, we’ll compute the test statistic on the random pair, as usual. Here, we’ll use as our test statistic the difference between the mean number of dependents for denied and approved applicants, in the order denied minus approved.

Fill in the blanks to complete the simulation below.

Hint: np.random.permutation shouldn’t appear anywhere in your code.

    def shuffle_all(df):
        '''Returns a DataFrame with the same rows as df, but reordered.'''
        return __(a)__

    def fast_stat(df):
        # This function does not and should not contain any randomness.
        denied = np.count_nonzero(df.get("status") == "denied")
        mean_denied = __(b)__.get("dependents").mean()
        mean_approved = __(c)__.get("dependents").mean()
        return mean_denied - mean_approved

    stats = np.array([])
    for i in np.arange(10000):
        stat = fast_stat(shuffle_all(apps))
        stats = np.append(stats, stat)



Problem 7

Researchers from the San Diego Zoo, located within Balboa Park, collected physical measurements of three species of penguins (Adelie, Chinstrap, or Gentoo) in a region of Antarctica. One piece of information they tracked for each of 330 penguins was its mass in grams. The average penguin mass is 4200 grams, and the standard deviation is 840 grams.

We’re interested in investigating the differences between the masses of Adelie penguins and Chinstrap penguins. Specifically, our null hypothesis is that their masses are drawn from the same population distribution, and any observed differences are due to chance only.

Below, we have a snippet of working code for this hypothesis test, for a specific test statistic. Assume that adelie_chinstrap is a DataFrame of only Adelie and Chinstrap penguins, with just two columns – 'species' and 'mass'.

stats = np.array([])
num_reps = 500
for i in np.arange(num_reps):
    # --- line (a) starts ---
    shuffled = np.random.permutation(adelie_chinstrap.get('species'))
    # --- line (a) ends ---
    
    # --- line (b) starts ---
    with_shuffled = adelie_chinstrap.assign(species=shuffled)
    # --- line (b) ends ---

    grouped = with_shuffled.groupby('species').mean()

    # --- line (c) starts ---
    stat = grouped.get('mass').iloc[0] - grouped.get('mass').iloc[1]
    # --- line (c) ends ---

    stats = np.append(stats, stat)


Problem 7.1

Which of the following statements best describe the procedure above?


Problem 7.2

For your convenience, we copy the code for the hypothesis test below.

stats = np.array([])
num_reps = 500
for i in np.arange(num_reps):
    # --- line (a) starts ---
    shuffled = np.random.permutation(adelie_chinstrap.get('species'))
    # --- line (a) ends ---
    
    # --- line (b) starts ---
    with_shuffled = adelie_chinstrap.assign(species=shuffled)
    # --- line (b) ends ---

    grouped = with_shuffled.groupby('species').mean()

    # --- line (c) starts ---
    stat = grouped.get('mass').iloc[0] - grouped.get('mass').iloc[1]
    # --- line (c) ends ---

    stats = np.append(stats, stat)

What would happen if we removed line (a), and replaced line (b) with

with_shuffled = adelie_chinstrap.sample(adelie_chinstrap.shape[0], replace=False)

Select the best answer.


Problem 7.3

For your convenience, we copy the code for the hypothesis test below.

stats = np.array([])
num_reps = 500
for i in np.arange(num_reps):
    # --- line (a) starts ---
    shuffled = np.random.permutation(adelie_chinstrap.get('species'))
    # --- line (a) ends ---
    
    # --- line (b) starts ---
    with_shuffled = adelie_chinstrap.assign(species=shuffled)
    # --- line (b) ends ---

    grouped = with_shuffled.groupby('species').mean()

    # --- line (c) starts ---
    stat = grouped.get('mass').iloc[0] - grouped.get('mass').iloc[1]
    # --- line (c) ends ---

    stats = np.append(stats, stat)

What would happen if we removed line (a), and replaced line (b) with

with_shuffled = adelie_chinstrap.sample(adelie_chinstrap.shape[0], replace=True)

Select the best answer.


Problem 7.4

For your convenience, we copy the code for the hypothesis test below.

stats = np.array([])
num_reps = 500
for i in np.arange(num_reps):
    # --- line (a) starts ---
    shuffled = np.random.permutation(adelie_chinstrap.get('species'))
    # --- line (a) ends ---
    
    # --- line (b) starts ---
    with_shuffled = adelie_chinstrap.assign(species=shuffled)
    # --- line (b) ends ---

    grouped = with_shuffled.groupby('species').mean()

    # --- line (c) starts ---
    stat = grouped.get('mass').iloc[0] - grouped.get('mass').iloc[1]
    # --- line (c) ends ---

    stats = np.append(stats, stat)

What would happen if we replaced line (a) with

with_shuffled = adelie_chinstrap.assign(
    species=np.random.permutation(adelie_chinstrap.get('species'))
)

and replaced line (b) with

with_shuffled = with_shuffled.assign(
    mass=np.random.permutation(adelie_chinstrap.get('mass'))
)

Select the best answer.


Problem 7.5

Suppose we run the code for the hypothesis test and see the following empirical distribution for the test statistic. In red is the observed statistic.

Suppose our alternative hypothesis is that Chinstrap penguins weigh more on average than Adelie penguins. Which of the following is closest to the p-value for our hypothesis test?



Problem 8

Choose the best tool to answer each of the following questions. Note the following:


Problem 8.1

Are incomes of applicants with 2 or fewer dependents drawn randomly from the distribution of incomes of all applicants?


Problem 8.2

What is the median income of credit card applicants with 2 or fewer dependents?


Problem 8.3

Are credit card applications approved through a random process in which 50% of applications are approved?


Problem 8.4

Is the median income of applicants with 2 or fewer dependents less than the median income of applicants with 3 or more dependents?


Problem 8.5

What is the difference in median income of applicants with 2 or fewer dependents and applicants with 3 or more dependents?


