← return to practice.dsc10.com
These problems are taken from past quizzes and exams. Work on them
on paper, since the quizzes and exams you take in this
course will also be on paper.
We encourage you to complete these
problems during discussion section. Solutions will be made available
after all discussion sections have concluded. You don’t need to submit
your answers anywhere.
Note: We do not plan to cover all of
these problems during the discussion section; the problems we don’t
cover can be used for extra practice.
Arya was curious how many UCSD students used Hulu over Thanksgiving break. He surveys 250 students and finds that 130 of them did use Hulu over break and 120 did not.
Using this data, Arya decides to test following hypotheses:
Null Hypothesis: Over Thanksgiving break, an equal number of UCSD students did use Hulu and did not use Hulu.
Alternative Hypothesis: Over Thanksgiving break, more UCSD students did use Hulu than did not use Hulu.
Which of the following could be used as a test statistic for the hypothesis test?
The proportion of students who did use Hulu minus the proportion of students who did not use Hulu.
The absolute value of the proportion of students who did use Hulu minus the proportion of students who did not use Hulu.
The proportion of students who did use Hulu plus the proportion of students who did not use Hulu.
The absolute value of the proportion of students who did use Hulu plus the proportion of students who did not use Hulu.
For the test statistic that you chose in part (a), what is the observed value of the statistic? Give your answer either as an exact decimal or a simplified fraction.
If the p-value of the hypothesis test is 0.053, what can we conclude, at the standard 0.05 significance level?
We reject the null hypothesis.
We fail to reject the null hypothesis.
We accept the null hypothesis.
At the San Diego Model Railroad Museum, there are different admission prices for children, adults, and seniors. Over a period of time, as tickets are sold, employees keep track of how many of each type of ticket are sold. These ticket counts (in the order child, adult, senior) are stored as follows.
= np.array([550, 1550, 400]) admissions_data
Complete the code below so that it creates an array
admissions_proportions
with the proportions of tickets sold
to each group (in the order child, adult, senior).
def as_proportion(data):
return __(a)__
= as_proportion(admissions_data) admissions_proportions
What goes in blank (a)?
The museum employees have a model in mind for the proportions in which they sell tickets to children, adults, and seniors. This model is stored as follows.
= np.array([0.25, 0.6, 0.15]) model
We want to conduct a hypothesis test to determine whether the admissions data we have is consistent with this model. Which of the following is the null hypothesis for this test?
Child, adult, and senior tickets might plausibly be purchased in proportions 0.25, 0.6, and 0.15.
Child, adult, and senior tickets are purchased in proportions 0.25, 0.6, and 0.15.
Child, adult, and senior tickets might plausibly be purchased in proportions other than 0.25, 0.6, and 0.15.
Child, adult, and senior tickets, are purchased in proportions other than 0.25, 0.6, and 0.15.
Which of the following test statistics could we use to test our hypotheses? Select all that could work.
sum of differences in proportions
sum of squared differences in proportions
mean of differences in proportions
mean of squared differences in proportions
none of the above
Below, we’ll perform the hypothesis test with a different test statistic, the mean of the absolute differences in proportions.
Recall that the ticket counts we observed for children, adults, and
seniors are stored in the array
admissions_data = np.array([550, 1550, 400])
, and that our
model is model = np.array([0.25, 0.6, 0.15])
.
For our hypothesis test to determine whether the admissions data is
consistent with our model, what is the observed value of the test
statistic? Input your answer as a decimal between 0 and 1. Round to
three decimal places. (Suppose that the value you calculated is assigned
to the variable observed_stat
, which you will use in later
questions.)
Now, we want to simulate the test statistic 10,000 times under the
assumptions of the null hypothesis. Fill in the blanks below to complete
this simulation and calculate the p-value for our hypothesis test.
Assume that the variables admissions_data
,
admissions_proportions
, model
, and
observed_stat
are already defined as specified earlier in
the question.
= np.array([])
simulated_stats for i in np.arange(10000):
= as_proportions(np.random.multinomial(__(a)__, __(b)__))
simulated_proportions = __(c)__
simulated_stat = np.append(simulated_stats, simulated_stat)
simulated_stats
= __(d)__ p_value
What goes in blank (a)? What goes in blank (b)? What goes in blank (c)? What goes in blank (d)?
True or False: the p-value represents the probability that the null hypothesis is true.
True
False
The new statistic that we used for this hypothesis test, the mean of
the absolute differences in proportions, is in fact closely related to
the total variation distance. Given two arrays of length three,
array_1
and array_2
, suppose we compute the
mean of the absolute differences in proportions between
array_1
and array_2
and store the result as
madp
. What value would we have to multiply
madp
by to obtain the total variation distance
array_1
and array_2
? Input your answer below,
rounding to three decimal places.
For this question, let’s think of the data in app_data
as a random sample of all IKEA purchases and use it to test the
following hypotheses.
Null Hypothesis: IKEA sells an equal amount of beds
(category 'bed'
) and outdoor furniture (category
'outdoor'
).
Alternative Hypothesis: IKEA sells more beds than outdoor furniture.
The DataFrame app_data
contains 5000 rows, which form
our sample. Of these 5000 products,
Which of the following could be used as the test statistic for this hypothesis test? Select all that apply.
Among 2500 beds and outdoor furniture items, the absolute difference between the proportion of beds and the proportion of outdoor furniture.
Among 2500 beds and outdoor furniture items, the proportion of beds.
Among 2500 beds and outdoor furniture items, the number of beds.
Among 2500 beds and outdoor furniture items, the number of beds plus the number of outdoor furniture items.
Let’s do a hypothesis test with the following test statistic: among 2500 beds and outdoor furniture items, the proportion of outdoor furniture minus the proportion of beds.
Complete the code below to calculate the observed value of the test
statistic and save the result as obs_diff
.
= (app_data.get('category')=='outdoor')
outdoor = (app_data.get('category')=='bed')
bed = ( ___(a)___ - ___(b)___ ) / ___(c)___ obs_diff
The table below contains several Python expressions. Choose the correct expression to fill in each of the three blanks. Three expressions will be used, and two will be unused.
Which of the following is a valid way to generate one value of the test statistic according to the null model? Select all that apply.
Way 1:
= np.random.multinomial(2500, [0.5,0.5])
multi 0] - multi[1])/2500 (multi[
Way 2:
= np.random.multinomial(2500, [0.5,0.5])[0]/2500
outdoor = np.random.multinomial(2500, [0.5,0.5])[1]/2500
bed - bed outdoor
Way 3:
= np.random.choice([0, 1], 2500, replace=True)
choice = choice.sum( )
choice_sum - (2500 - choice_sum))/2500 (choice_sum
Way 4:
= np.random.choice(['bed', 'outdoor'], 2500, replace=True)
choice = np.count_nonzero(choice=='bed')
bed = np.count_nonzero(choice=='outdoor')
outdoor /2500 - bed/2500 outdoor
Way 5:
= (app_data.get('category')=='outdoor')
outdoor = (app_data.get('category')=='bed')
bed = app_data[outdoor|bed].sample(2500, replace=True)
samp 'category')=='outdoor'].shape[0]/2500 - samp[samp.get('category')=='bed'].shape[0]/2500) samp[samp.get(
Way 6:
= (app_data.get('category')=='outdoor')
outdoor = (app_data.get('category')=='bed')
bed = (app_data[outdoor|bed].groupby('category').count( ).reset_index( ).sample(2500, replace=True))
samp 'category')=='outdoor'].shape[0]/2500 - samp[samp.get('category')=='bed'].shape[0]/2500 samp[samp.get(
Way 1
Way 2
Way 3
Way 4
Way 5
Way 6
Suppose we generate 10,000 simulated values of the test statistic
according to the null model and store them in an array called
simulated_diffs
. Complete the code below to calculate the
p-value for the hypothesis test.
/10000 np.count_nonzero(simulated_diffs _________ obs_diff)
What goes in the blank?
<
<=
>
>=
Let’s suppose there are 4 different types of shots a basketball player can take – layups, midrange shots, threes, and free throws.
The DataFrame breakdown
has 4 rows and 50 columns – one
row for each of the 4 shot types mentioned above, and one column for
each of 50 different players. Each column of breakdown
describes the distribution of shot types for a single player.
The first few columns of breakdown
are shown below.
For instance, 30% of Kelsey Plum’s shots are layups, 30% of her shots are midrange shots, 20% of her shots are threes, and 20% of her shots are free throws.
Below, we’ve drawn an overlaid bar chart showing the shot distributions of Kelsey Plum and Chiney Ogwumike, a player on the Los Angeles Sparks.
What is the total variation distance (TVD) between Kelsey Plum’s shot distribution and Chiney Ogwumike’s shot distribution? Give your answer as a proportion between 0 and 1 (not a percentage) rounded to three decimal places.
Recall, breakdown
has information for 50 different
players. We want to find the player whose shot distribution is the
most similar to Kelsey Plum, i.e. has the lowest TVD
with Kelsey Plum’s shot distribution.
Fill in the blanks below so that most_sim_player
evaluates to the name of the player with the most similar shot
distribution to Kelsey Plum. Assume that the column named
'Kelsey Plum'
is the first column in breakdown
(and again that breakdown
has 50 columns total).
= ''
most_sim_player = __(a)__
lowest_tvd_so_far = np.array(breakdown.columns).take(__(b)__)
other_players for player in other_players:
= tvd(breakdown.get('Kelsey Plum'),
player_tvd
breakdown.get(player))if player_tvd < lowest_tvd_so_far:
= player_tvd
lowest_tvd_so_far __(c)__
-1
-0.5
0
0.5
1
np.array([])
''
What goes in blank (b)?
What goes in blank (c)?
You survey 100 DSC majors and 140 CSE majors to ask them which video streaming service they use most. The resulting distributions are given in the table below. Note that each column sums to 1.
Service | DSC Majors | CSE Majors |
---|---|---|
Netflix | 0.4 | 0.35 |
Hulu | 0.25 | 0.2 |
Disney+ | 0.1 | 0.1 |
Amazon Prime Video | 0.15 | 0.3 |
Other | 0.1 | 0.05 |
For example, 20% of CSE Majors said that Hulu is their most used video streaming service. Note that if a student doesn’t use video streaming services, their response is counted as Other.
What is the total variation distance (TVD) between the distribution for DSC majors and the distribution for CSE majors? Give your answer as an exact decimal.
Suppose we only break down video streaming services into four categories: Netflix, Hulu, Disney+, and Other (which now includes Amazon Prime Video). Now we recalculate the TVD between the two distributions. How does the TVD now compare to your answer to part (a)?
less than (a)
equal to (a)
greater than (a)
In some cities, the number of sunshine hours per month is relatively consistent throughout the year. São Paulo, Brazil is one such city; in all months of the year, the number of sunshine hours per month is somewhere between 139 and 173. New York City’s, on the other hand, ranges from 139 to 268.
Gina and Abel, both San Diego natives, are interested in assessing how “consistent" the number of sunshine hours per month in San Diego appear to be. Specifically, they’d like to test the following hypotheses:
Null Hypothesis: The number of sunshine hours per month in San Diego is drawn from the uniform distribution, \left[\frac{1}{12}, \frac{1}{12}, ..., \frac{1}{12}\right]. (In other words, the number of sunshine hours per month in San Diego is equal in all 12 months of the year.)
Alternative Hypothesis: The number of sunshine hours per month in San Diego is not drawn from the uniform distribution.
As their test statistic, Gina and Abel choose the total variation distance. To simulate samples under the null, they will sample from a categorical distribution with 12 categories — January, February, and so on, through December — each of which have an equal probability of being chosen.
In order to run their hypothesis test, Gina and Abel need a way to calculate their test statistic. Below is an incomplete implementation of a function that computes the TVD between two arrays of length 12, each of which represent a categorical distribution.
def calculate_tvd(dist1, dist2):
return np.mean(np.abs(dist1 - dist2)) * ____
Fill in the blank so that calculate_tvd
works as
intended.
1 / 6
1 / 3
1 / 2
2
3
6
Moving forward, assume that calculate_tvd
works
correctly.
Now, complete the implementation of the function
uniform_test
, which takes in an array
observed_counts
of length 12 containing the number of
sunshine hours each month in a city and returns the p-value for the
hypothesis test stated at the start of the question.
def uniform_test(observed_counts):
# The values in observed_counts are counts, not proportions!
= observed_counts.sum()
total_count = __(b)__
uniform_dist = np.array([])
tvds for i in np.arange(10000):
= __(c)__
simulated = calculate_tvd(simulated, __(d)__)
tvd = np.append(tvds, tvd)
tvds return np.mean(tvds __(e)__ calculate_tvd(uniform_dist, __(f)__))
What goes in blank (b)? (Hint: The function
np.ones(k)
returns an array of length k
in
which all elements are 1
.)
What goes in blank (c)?
np.random.multinomial(12, uniform_dist)
np.random.multinomial(12, uniform_dist) / 12
np.random.multinomial(12, uniform_dist) / total_count
np.random.multinomial(total_count, uniform_dist)
np.random.multinomial(total_count, uniform_dist) / 12
np.random.multinomial(total_count, uniform_dist) / total_count
What goes in blank (d)?
What goes in blank (e)?
>
>=
<
<=
==
!=
What goes in blank (f)?
For this question, we will use data from the 2021 Women’s National Basketball Association (WNBA) season for the next several problems. In basketball, players score points by shooting the ball into a hoop. The team that scores the most points wins the game.
Kelsey Plum, a WNBA player, attended La Jolla Country Day School,
which is adjacent to UCSD’s campus. Her current team is the Las Vegas
Aces (three-letter code 'LVA'
). In 2021, the Las
Vegas Aces played 31 games, and Kelsey Plum played in all
31.
The DataFrame plum
contains her stats for all games the
Las Vegas Aces played in 2021. The first few rows of plum
are shown below (though the full DataFrame has 31 rows, not 5):
Each row in plum
corresponds to a single game. For each
game, we have:
'Date'
(str
), the date on which the game
was played'Opp'
(str
), the three-letter code of the
opponent team'Home'
(bool
), True
if the
game was played in Las Vegas (“home”) and False
if it was
played at the opponent’s arena (“away”)'Won'
(bool
), True
if the Las
Vegas Aces won the game and False
if they lost'PTS'
(int
), the number of points Kelsey
Plum scored in the game'AST'
(int
), the number of assists
(passes) Kelsey Plum made in the game'TOV'
(int
), the number of turnovers
Kelsey Plum made in the game (a turnover is when you lose the ball –
turnovers are bad!) Consider the definition of the function
diff_in_group_means
:
def diff_in_group_means(df, group_col, num_col):
= df.groupby(group_col).mean().get(num_col)
s return s.loc[False] - s.loc[True]
It turns out that Kelsey Plum averages 0.61 more assists in games
that she wins (“winning games”) than in games that she loses (“losing
games”). Fill in the blanks below so that observed_diff
evaluates to -0.61.
= diff_in_group_means(plum, __(a)__, __(b)__) observed_diff
What goes in blank (a)?
What goes in blank (b)?
After observing that Kelsey Plum averages more assists in winning games than in losing games, we become interested in conducting a permutation test for the following hypotheses:
To conduct our permutation test, we place the following code in a
for
-loop.
= plum.get('Won')
won = plum.get('AST')
ast = plum.assign(Won_shuffled=np.random.permutation(won)) \
shuffled =np.random.permutation(ast)) .assign(AST_shuffled
Which of the following options does not compute a valid simulated test statistic for this permutation test?
diff_in_group_means(shuffled, 'Won', 'AST')
diff_in_group_means(shuffled, 'Won', 'AST_shuffled')
diff_in_group_means(shuffled, 'Won_shuffled, 'AST')
diff_in_group_means(shuffled, 'Won_shuffled, 'AST_shuffled')
More than one of these options do not compute a valid simulated test statistic for this permutation test
Suppose we generate 10,000 simulated test statistics, using one of
the valid options from part 1. The empirical distribution of test
statistics, with a red line at observed_diff
, is shown
below.
Roughly one-quarter of the area of the histogram above is to the left of the red line. What is the correct interpretation of this result?
There is roughly a one quarter probability that Kelsey Plum’s number of assists in winning games and in losing games come from the same distribution.
The significance level of this hypothesis test is roughly a quarter.
Under the assumption that Kelsey Plum’s number of assists in winning games and in losing games come from the same distribution, and that she wins 22 of the 31 games she plays, the chance of her averaging at least 0.61 more assists in wins than losses is roughly a quarter.
Under the assumption that Kelsey Plum’s number of assists in winning games and in losing games come from the same distribution, and that she wins 22 of the 31 games she plays, the chance of her averaging 0.61 more assists in wins than losses is roughly a quarter.
An IKEA fan created an app where people can log the amount of time it
took them to assemble their IKEA furniture. The DataFrame
app_data
has a row for each product build that was logged
on the app. The column 'product'
contains the name of the
product, and the column 'minutes'
contains integer values
representing the number of minutes it took to assemble each product.
You are browsing the IKEA showroom, deciding whether to purchase the
BILLY bookcase or the LOMMARP bookcase. You are concerned about the
amount of time it will take to assemble your new bookcase, so you look
up the assembly times reported in app_data
. Thinking of the
data in app_data
as a random sample of all IKEA purchases,
you want to perform a permutation test to test the following
hypotheses.
Null Hypothesis: The assembly time for the BILLY bookcase and the assembly time for the LOMMARP bookcase come from the same distribution.
Alternative Hypothesis: The assembly time for the BILLY bookcase and the assembly time for the LOMMARP bookcase come from different distributions.
Suppose we query app_data
to keep only the BILLY
bookcases, then average the 'minutes'
column. In addition,
we separately query app_data
to keep only the LOMMARP
bookcases, then average the 'minutes'
column. If the null
hypothesis is true, which of the following statements about these two
averages is correct?
These two averages are the same.
Any difference between these two averages is due to random chance.
Any difference between these two averages cannot be ascribed to random chance alone.
The difference between these averages is statistically significant.
For the permutation test, we’ll use as our test statistic the average assembly time for BILLY bookcases minus the average assembly time for LOMMARP bookcases, in minutes.
Complete the code below to generate one simulated value of the test
statistic in a new way, without using
np.random.permutation
.
= (app_data.get('product') ==
billy 'BILLY Bookcase, white, 31 1/2x11x79 1/2')
= (app_data.get('product') ==
lommarp 'LOMMARP Bookcase, dark blue-green, 25 5/8x78 3/8')
= app_data[billy|lommarp]
billy_lommarp = np.random.choice(billy_lommarp.get('minutes'), billy.sum(), replace=False).mean()
billy_mean = _________
lommarp_mean - lommarp_mean billy_mean
What goes in the blank?
billy_lommarp[lommarp].get('minutes').mean()
np.random.choice(billy_lommarp.get('minutes'), lommarp.sum(), replace=False).mean()
billy_lommarp.get('minutes').mean() - billy_mean
(billy_lommarp.get('minutes').sum() - billy_mean * billy.sum())/lommarp.sum()
The DataFrame apps
contains application data for a
random sample of 1,000 applicants for a particular credit card from the
1990s. The columns are:
"status" (str)
: Whether the credit card application
was approved: "approved"
or "denied"
values
only.
"age" (float)
: The applicant’s age, in years, to the
nearest twelfth of a year.
"income" (float)
: The applicant’s annual income, in
tens of thousands of dollars.
"homeowner" (str)
: Whether the credit card applicant
owns their own home: "yes"
or "no"
values
only.
"dependents" (int)
: The number of dependents, or
individuals that rely on the applicant as a primary source of income,
such as children.
The first few rows of apps
are shown below, though
remember that apps
has 1,000 rows.
In apps
, our sample of 1,000 credit card applications,
applicants who were approved for the credit card have fewer dependents,
on average, than applicants who were denied. The mean number of
dependents for approved applicants is 0.98, versus 1.07 for denied
applicants.
To test whether this difference is purely due to random chance, or whether the distributions of the number of dependents for approved and denied applicants are truly different in the population of all credit card applications, we decide to perform a permutation test.
Consider the incomplete code block below.
def shuffle_status(df):
= np.random.permutation(df.get("status"))
shuffled_status return df.assign(status=shuffled_status).get(["status", "dependents"])
def test_stat(df):
= df.groupby("status").mean().get("dependents")
grouped = grouped.loc["approved"]
approved = grouped.loc["denied"]
denied return __(a)__
= np.array([])
stats for i in np.arange(10000):
= shuffle_status(apps)
shuffled_apps = test_stat(shuffled_apps)
stat = np.append(stats, stat)
stats
= np.count_nonzero(__(b)__) / 10000 p_value
Below are six options for filling in blanks (a) and (b) in the code above.
Blank (a) | Blank (b) | |
---|---|---|
Option 1 | denied - approved |
stats >= test_stat(apps) |
Option 2 | denied - approved |
stats <= test_stat(apps) |
Option 3 | approved - denied |
stats >= test_stat(apps) |
Option 4 | np.abs(denied - approved) |
stats >= test_stat(apps) |
Option 5 | np.abs(denied - approved) |
stats <= test_stat(apps) |
Option 6 | np.abs(approved - denied) |
stats >= test_stat(apps) |
The correct way to fill in the blanks depends on how we choose our null and alternative hypotheses.
Suppose we choose the following pair of hypotheses.
Null Hypothesis: In the population, the number of dependents of approved and denied applicants come from the same distribution.
Alternative Hypothesis: In the population, the number of dependents of approved applicants and denied applicants do not come from the same distribution.
Which of the six presented options could correctly fill in blanks (a) and (b) for this pair of hypotheses? Select all that apply.
Option 1
Option 2
Option 3
Option 4
Option 5
Option 6
None of the above.
Now, suppose we choose the following pair of hypotheses.
Null Hypothesis: In the population, the number of dependents of approved and denied applicants come from the same distribution.
Alternative Hypothesis: In the population, the number of dependents of approved applicants is smaller on average than the number of dependents of denied applicants.
Which of the six presented options could correctly fill in blanks (a) and (b) for this pair of hypotheses? Select all that apply.
Option 6 from the start of this question is repeated below.
Blank (a) | Blank (b) | |
---|---|---|
Option 6 | np.abs(approved - denied) |
stats >= test_stat(apps) |
We want to create a new option, Option 7, that replicates the behavior of Option 6, but with blank (a) filled in as shown:
Blank (a) | Blank (b) | |
---|---|---|
Option 7 | approved - denied |
Which expression below could go in blank (b) so that Option 7 is equivalent to Option 6?
np.abs(stats) >= test_stat(apps)
stats >= np.abs(test_stat(apps))
np.abs(stats) >= np.abs(test_stat(apps))
np.abs(stats >= test_stat(apps))
In our implementation of this permutation test, we followed the
procedure outlined in lecture to draw new pairs of samples under the
null hypothesis and compute test statistics — that is, we randomly
assigned each row to a group (approved or denied) by shuffling one of
the columns in apps
, then computed the test statistic on
this random pair of samples.
Let’s now explore an alternative solution to drawing pairs of samples under the null hypothesis and computing test statistics. Here’s the approach:
"dependents"
column as the new “denied” sample, and the values at the at the bottom
of the resulting "dependents"
column as the new “approved”
sample. Note that we don’t necessarily split the DataFrame exactly in
half — the sizes of these new samples depend on the number of “denied”
and “approved” values in the original DataFrame!Once we generate our pair of random samples in this way, we’ll compute the test statistic on the random pair, as usual. Here, we’ll use as our test statistic the difference between the mean number of dependents for denied and approved applicants, in the order denied minus approved.
Fill in the blanks to complete the simulation below.
Hint: np.random.permutation
shouldn’t appear
anywhere in your code.
def shuffle_all(df):
'''Returns a DataFrame with the same rows as df, but reordered.'''
return __(a)__
def fast_stat(df):
# This function does not and should not contain any randomness.
= np.count_nonzero(df.get("status") == "denied")
denied = __(b)__.get("dependents").mean()
mean_denied = __(c)__.get("dependents").mean()
mean_approved return mean_denied - mean_approved
= np.array([])
stats for i in np.arange(10000):
= fast_stat(shuffle_all(apps))
stat = np.append(stats, stat) stats
Researchers from the San Diego Zoo, located within Balboa Park, collected physical measurements of three species of penguins (Adelie, Chinstrap, or Gentoo) in a region of Antarctica. One piece of information they tracked for each of 330 penguins was its mass in grams. The average penguin mass is 4200 grams, and the standard deviation is 840 grams.
We’re interested in investigating the differences between the masses of Adelie penguins and Chinstrap penguins. Specifically, our null hypothesis is that their masses are drawn from the same population distribution, and any observed differences are due to chance only.
Below, we have a snippet of working code for this hypothesis test,
for a specific test statistic. Assume that adelie_chinstrap
is a DataFrame of only Adelie and Chinstrap penguins, with just two
columns – 'species'
and 'mass'
.
= np.array([])
stats = 500
num_reps for i in np.arange(num_reps):
# --- line (a) starts ---
= np.random.permutation(adelie_chinstrap.get('species'))
shuffled # --- line (a) ends ---
# --- line (b) starts ---
= adelie_chinstrap.assign(species=shuffled)
with_shuffled # --- line (b) ends ---
= with_shuffled.groupby('species').mean()
grouped
# --- line (c) starts ---
= grouped.get('mass').iloc[0] - grouped.get('mass').iloc[1]
stat # --- line (c) ends ---
= np.append(stats, stat) stats
Which of the following statements best describe the procedure above?
This is a standard hypothesis test, and our test statistic is the total variation distance between the distribution of Adelie masses and Chinstrap masses
This is a standard hypothesis test, and our test statistic is the difference between the expected proportion of Adelie penguins and the proportion of Adelie penguins in our resample
This is a permutation test, and our test statistic is the total variation distance between the distribution of Adelie masses and Chinstrap masses
This is a permutation test, and our test statistic is the difference in the mean Adelie mass and mean Chinstrap mass
For your convenience, we copy the code for the hypothesis test below.
= np.array([])
stats = 500
num_reps for i in np.arange(num_reps):
# --- line (a) starts ---
= np.random.permutation(adelie_chinstrap.get('species'))
shuffled # --- line (a) ends ---
# --- line (b) starts ---
= adelie_chinstrap.assign(species=shuffled)
with_shuffled # --- line (b) ends ---
= with_shuffled.groupby('species').mean()
grouped
# --- line (c) starts ---
= grouped.get('mass').iloc[0] - grouped.get('mass').iloc[1]
stat # --- line (c) ends ---
= np.append(stats, stat) stats
What would happen if we removed line (a)
, and replaced
line (b)
with
= adelie_chinstrap.sample(adelie_chinstrap.shape[0], replace=False) with_shuffled
Select the best answer.
This would still run a valid hypothesis test
This would not run a valid hypothesis test, as all values in the
stats
array would be exactly the same
This would not run a valid hypothesis test, even though there would
be several different values in the stats
array
This would not run a valid hypothesis test, as it would incorporate information about Gentoo penguins
For your convenience, we copy the code for the hypothesis test below.
= np.array([])
stats = 500
num_reps for i in np.arange(num_reps):
# --- line (a) starts ---
= np.random.permutation(adelie_chinstrap.get('species'))
shuffled # --- line (a) ends ---
# --- line (b) starts ---
= adelie_chinstrap.assign(species=shuffled)
with_shuffled # --- line (b) ends ---
= with_shuffled.groupby('species').mean()
grouped
# --- line (c) starts ---
= grouped.get('mass').iloc[0] - grouped.get('mass').iloc[1]
stat # --- line (c) ends ---
= np.append(stats, stat) stats
What would happen if we removed line (a)
, and replaced
line (b)
with
= adelie_chinstrap.sample(adelie_chinstrap.shape[0], replace=True) with_shuffled
Select the best answer.
This would still run a valid hypothesis test
This would not run a valid hypothesis test, as all values in the
stats
array would be exactly the same
This would not run a valid hypothesis test, even though there would
be several different values in the stats
array
This would not run a valid hypothesis test, as it would incorporate information about Gentoo penguins
For your convenience, we copy the code for the hypothesis test below.
= np.array([])
stats = 500
num_reps for i in np.arange(num_reps):
# --- line (a) starts ---
= np.random.permutation(adelie_chinstrap.get('species'))
shuffled # --- line (a) ends ---
# --- line (b) starts ---
= adelie_chinstrap.assign(species=shuffled)
with_shuffled # --- line (b) ends ---
= with_shuffled.groupby('species').mean()
grouped
# --- line (c) starts ---
= grouped.get('mass').iloc[0] - grouped.get('mass').iloc[1]
stat # --- line (c) ends ---
= np.append(stats, stat) stats
What would happen if we replaced line (a)
with
= adelie_chinstrap.assign(
with_shuffled =np.random.permutation(adelie_chinstrap.get('species')
species )
and replaced line (b) with
= with_shuffled.assign(
with_shuffled =np.random.permutation(adelie_chinstrap.get('mass')
mass )
Select the best answer.
This would still run a valid hypothesis test
This would not run a valid hypothesis test, as all values in the
stats
array would be exactly the same
This would not run a valid hypothesis test, even though there would
be several different values in the stats
array
This would not run a valid hypothesis test, as it would incorporate information about Gentoo penguins
Suppose we run the code for the hypothesis test and see the following empirical distribution for the test statistic. In red is the observed statistic.
Suppose our alternative hypothesis is that Chinstrap penguins weigh more on average than Adelie penguins. Which of the following is closest to the p-value for our hypothesis test?
0
\frac{1}{4}
\frac{1}{3}
\frac{2}{3}
\frac{3}{4}
1
Choose the best tool to answer each of the following questions. Note the following:
Are incomes of applicants with 2 or fewer dependents drawn randomly from the distribution of incomes of all applicants?
Hypothesis Testing
Permutation Testing
Bootstrapping
What is the median income of credit card applicants with 2 or fewer dependents?
Hypothesis Testing
Permutation Testing
Bootstrapping
Are credit card applications approved through a random process in which 50% of applications are approved?
Hypothesis Testing
Permutation Testing
Bootstrapping
Is the median income of applicants with 2 or fewer dependents less than the median income of applicants with 3 or more dependents?
Hypothesis Testing
Permutation Testing
Bootstrapping
What is the difference in median income of applicants with 2 or fewer dependents and applicants with 3 or more dependents?
Hypothesis Testing
Permutation Testing
Bootstrapping