Interpreting Data with Statistical Models Flashcards Preview


Flashcards in Interpreting Data with Statistical Models Deck (10)
1

A research group analyzed a sample of 30 users. They found that switching to a probabilistic algorithm on an e-commerce platform resulted in the following 90% confidence interval for the increase in sales: [25%, 35%]. Which of the following is true?

If one increases the confidence, the interval is more accurate.

90% of the possible samples of size 30 will have an average increase in sales between 25% and 35%.

The average increase in sales of the sample is between 25% and 35%.

90% of the users bought between 25% and 35% more.

90% of the possible samples of size 30 will have an average increase in sales between 25% and 35%.

2

A study by students of “Probability and statistics” consisted of throwing a dice 30 times and counting the number of times it landed on each side. Given the following Chi-square analysis, what would you conclude at a significance level of 5%?


[PICTURE]

Although the p-value is less than the significance level, the test statistic is not well approximated by the Chi-square distribution, so we cannot say the dice is balanced.

Given that the p-value is less than the significance level, we can say that the dice is balanced.

Although the p-value is less than the significance level, the test statistic is not well approximated by the Chi-square distribution, so we cannot say the dice is not balanced.

Given that the p-value is less than the significance level, we can reject that the dice is balanced.

Although the p-value is less than the significance level, the test statistic is not well approximated by the Chi-square distribution, so we cannot say the dice is not balanced.
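
When the Chi-square approximation is in doubt, as it can be with only 30 throws, one option is to estimate the p-value by simulating a fair dice directly instead of relying on the Chi-square distribution. The sketch below is only an illustration: the counts are hypothetical (the card's picture is not available), and the function names, number of simulations, and seed are assumptions.

```python
import random

def chi_square_stat(observed, expected):
    """Pearson statistic: sum over faces of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def simulated_p_value(observed, n_sims=5000, seed=0):
    """Estimate the p-value by repeatedly rolling a simulated fair dice."""
    rng = random.Random(seed)
    n_rolls = sum(observed)
    sides = len(observed)
    expected = [n_rolls / sides] * sides
    observed_stat = chi_square_stat(observed, expected)
    hits = 0
    for _sim in range(n_sims):
        counts = [0] * sides
        for _roll in range(n_rolls):
            counts[rng.randrange(sides)] += 1
        # Count simulated experiments at least as extreme as the observed one.
        if chi_square_stat(counts, expected) >= observed_stat:
            hits += 1
    return observed_stat, hits / n_sims

# Hypothetical counts for faces 1..6 over 30 throws (not the card's data).
counts = [1, 2, 3, 5, 7, 12]
stat, p_sim = simulated_p_value(counts)
print(f"chi-square = {stat:.2f}, simulated p-value = {p_sim:.4f}")
```

The simulated p-value depends on the seed and the number of simulations, so it should be read as an estimate rather than an exact figure.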

3

You want to understand which type of protein mix is better for farm pigs as measured by weight gain in kg. You have three different types of mix. Which statistical test is optimal?

1 vs 1 T-tests of all possible combinations of protein mix comparisons

Chi-square of expected gained weight against the observed

T-test of the three mix types

One way ANOVA with the protein mix as an explanatory variable

One way ANOVA with the protein mix as an explanatory variable
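
As a rough illustration of what a one-way ANOVA computes, the sketch below derives the F statistic by hand for three groups. The weight gains are invented for the example, and `one_way_anova_f` is a hypothetical helper, not a library function.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups of numbers."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    # Variability of the group means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Variability of the observations around their own group mean.
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical weight gains in kg for three protein mixes (made-up data).
mix_a = [10, 12, 11, 13]
mix_b = [14, 15, 13, 16]
mix_c = [9, 8, 10, 9]
f_stat = one_way_anova_f([mix_a, mix_b, mix_c])
print(f"F = {f_stat:.2f}")
```

With 2 and 9 degrees of freedom, the 5% critical value of the F distribution is roughly 4.26, so an F statistic well above that would lead to rejecting equal means for these made-up numbers.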

4

Why is homoscedasticity such an important assumption in ANOVA?

Because without homoscedasticity, the F statistics are not well approximated by the F distribution

Because without homoscedasticity, we do not have normality of the samples

Because without homoscedasticity, we cannot infer from differences in variabilities to have a difference in means

Because without homoscedasticity, we cannot infer from differences in variabilities to have a difference in means

5

A couple is trying to understand household income per state. They propose an ANOVA experiment with 20 random families from different states. They believed California would have a greater income, and ran a simple planned contrast comparing California's income one-to-one against each of the other states. In all these comparisons, California had significantly more income. In a different planned contrast, the combination of New York and California had a larger average income than any other combination of states.

One of them suggests that, since both contrasts have more power than ANOVA, they can use both results to conclude that the household income order is: California > New York > Rest of the states.

Would you agree?

No, because those contrasts do not align with their previous hypothesis

No, because difference contrasts need to be validated later with Tukey

No, because combining both would lead to an increase in the type 1 error probability

No, because the population of California is significantly larger than the rest of the states, which causes a biased sample

No, because combining both would lead to an increase in the type 1 error probability
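
The inflation of the type 1 error can be made concrete with a small calculation. The sketch below assumes the tests are independent and each is run at the same level, which is a simplification; the function names are hypothetical.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_alpha(alpha, n_tests):
    """Per-test level that keeps the family-wise rate at or below alpha."""
    return alpha / n_tests

for n_tests in (1, 2, 5, 10):
    rate = familywise_error_rate(0.05, n_tests)
    print(f"{n_tests} tests at 5%: family-wise error rate = {rate:.4f}")
```

Under these assumptions, two tests at 5% each already give a family-wise rate of 9.75%, which is why combining several contrasts calls for some correction.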

6

In an experiment comparing algorithms, you get the following output. As additional information, we know that the sample average of CNN was the largest, followed by SRU and finally LR.

What would you conclude?

[PICTURE]

We can safely assume that CNN is the best algorithm, as it is different from LR and its average is larger than SRU's

We need more data to narrow the confidence intervals of each algorithm and find significance

We can choose randomly between SRU or CNN

We need more data to narrow the confidence intervals of each algorithm and find significance

7

Given that the null hypothesis of the linear regression model concerns the linear component, while the ANOVA null hypothesis concerns the constant component, which test would you expect to have higher statistical power?

ANOVA

They are the same

Cannot be predicted as they validate different things

Linear Regression

Linear Regression

8

In which case can we assume there is homoscedasticity and a high R²?

[PICTURE]

e

b

a

g

a

9

When plotting the residuals against the fitted values of a linearized model, which of the following should we do?

Take the residuals of the transformed variables, as those are being tested in the linear model.

Take the residuals of the original variables, because we care about those.

Take the absolute residuals of the transformed variables for making a Levene test.

Take the residuals of the transformed variables, as those are being tested in the linear model.
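
To make "residuals of the transformed variables" concrete, the sketch below linearizes an exponential-looking relationship with a log transform, fits a straight line by least squares, and computes the residuals on the log scale. The data points and the helper `fit_line` are assumptions made for this example.

```python
import math

def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data that grows roughly exponentially (made-up values).
x = [1, 2, 3, 4, 5, 6]
y = [2.3, 2.9, 4.4, 5.4, 7.9, 10.2]

# Linearize: the model actually being fitted is log(y) = a + b * x.
log_y = [math.log(v) for v in y]
intercept, slope = fit_line(x, log_y)
fitted_log = [intercept + slope * xi for xi in x]

# Residuals on the transformed (log) scale, the ones to plot against fitted_log.
residuals_log = [ly - f for ly, f in zip(log_y, fitted_log)]
for f, r in zip(fitted_log, residuals_log):
    print(f"fitted = {f:.3f}, residual = {r:+.3f}")
```

The pairs printed here are what would go on the residual plot; residuals computed on the original scale of `y` would not correspond to the model whose assumptions are being checked.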

10

A biologist selects a random sample of 50 pines from a forest in California and measures the diameter at breast height (DBH), obtaining an average value of 21 cm. The biologist claims that this will be the average DBH of all the pines in the forest. Would you agree with this claim?

No, because the average changes over samples. This would only be true if the variance of the sample was 0.

Yes, because the average is a center statistic and we can safely assume it is distributed normally with mean 21 cm.

Yes, because the average of the samples is the same as the mean.

No, because the average changes over samples. To estimate the actual mean, you would need the average of all possible averages.

No, because the average changes over samples. To estimate the actual mean, you would need the average of all possible averages.
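
The idea that the average changes from sample to sample can be checked with a short simulation. The sketch below uses an entirely hypothetical forest of 10,000 trees with DBH drawn from a normal distribution; the population parameters, the seed, and the number of repeated samples are all assumptions.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical population: DBH in cm for every pine in a made-up forest.
population = [rng.gauss(21.0, 5.0) for _ in range(10_000)]
population_mean = statistics.fmean(population)

# Draw many random samples of 50 trees and record each sample average.
sample_means = [
    statistics.fmean(rng.sample(population, 50)) for _ in range(2000)
]

print(f"population mean       = {population_mean:.2f}")
print(f"smallest sample mean  = {min(sample_means):.2f}")
print(f"largest sample mean   = {max(sample_means):.2f}")
print(f"mean of sample means  = {statistics.fmean(sample_means):.2f}")
```

Individual sample averages scatter around the population mean, while the average of many sample averages lands close to it.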