4 - In All Probability Flashcards

1
Q

What does probability deal with?

A

Reasoning in the presence of uncertainty

2
Q

What is the Monty Hall dilemma?

A

A probability problem involving three doors, one hiding a car and two hiding goats

3
Q

What is the initial probability of choosing the car behind Door No. 1?

A

One-third

4
Q

What does the host do after you pick a door in the Monty Hall dilemma?

A

Opens another door revealing a goat

5
Q

According to Marilyn vos Savant, should you switch doors?

A

Yes; you should switch

6
Q

What is the probability of winning if you switch doors?

A

Two-thirds

7
Q

What is the probability of winning if you do not switch doors?

A

One-third

8
Q

Who was outraged by vos Savant’s answer to the Monty Hall dilemma?

A

Mathematicians and PhDs from American universities

9
Q

What did Paul Erdős initially believe about switching doors in the Monty Hall dilemma?

A

He believed it made no difference

10
Q

What did Andrew Vázsonyi use to convince Erdős that switching doors was advantageous?

A

A computer program running simulations

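Vázsonyi's simulation is easy to reproduce today. Here is a minimal Python sketch of that kind of Monte Carlo experiment (an illustration, not Vázsonyi's original program): it plays many rounds under each strategy and reports the win rates.

```python
import random

def play(switch: bool) -> bool:
    """Play one round of Monty Hall; return True if the player wins the car."""
    doors = [1, 2, 3]
    car = random.choice(doors)
    pick = random.choice(doors)
    # The host opens a door that is neither the player's pick nor the car.
    opened = random.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Switch to the one remaining unopened door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

trials = 100_000
print(sum(play(switch=False) for _ in range(trials)) / trials)  # ~0.333 (stay)
print(sum(play(switch=True) for _ in range(trials)) / trials)   # ~0.667 (switch)
```
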
11
Q

What are the two main approaches to thinking about probability discussed in the text?

A

Frequentist and Bayesian

12
Q

What does the frequentist approach involve?

A

Dividing the number of times an event occurs by the total number of trials

13
Q

What is Bayes’s theorem used for?

A

To draw conclusions with mathematical rigor amid uncertainty

14
Q

What is the prior probability of having a disease if it occurs in 1 in 1,000 people?

A

0.001

15
Q

What does P(H) represent in Bayes’s theorem?

A

The prior probability of a hypothesis being true

16
Q

What does P(E|H) represent in Bayes’s theorem?

A

The probability of the evidence given the hypothesis

17
Q

What is the posterior probability?

A

The prior probability updated given the evidence

18
Q

If a test is 90% accurate and the disease occurs in 1 in 1,000 people, what is the probability of having the disease given a positive test result?

A

About 0.89 percent (a probability of 0.0089)

19
Q

What happens to the posterior probability if the test accuracy increases to 99%?

A

It rises to about 0.09, almost a 1-in-10 chance

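Both results follow mechanically from Bayes's theorem. A short Python sketch, assuming "accuracy" means the test's sensitivity and specificity are equal, with the 1-in-1,000 prevalence from card 14:

```python
def posterior(prevalence: float, sensitivity: float, specificity: float) -> float:
    """P(disease | positive test) via Bayes's theorem."""
    false_positive_rate = 1 - specificity
    # Total probability of testing positive (the P(E) of later cards).
    p_positive = (sensitivity * prevalence
                  + false_positive_rate * (1 - prevalence))
    return sensitivity * prevalence / p_positive

print(posterior(0.001, 0.90, 0.90))  # ~0.0089: about 0.89 percent (card 18)
print(posterior(0.001, 0.99, 0.99))  # ~0.09: almost a 1-in-10 chance (card 19)
print(posterior(0.100, 0.90, 0.90))  # 0.5: one prevalence that yields card 21's 50 percent
```
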
20
Q

What is the significance of Thomas Bayes’s contributions?

A

He laid the foundation for Bayesian probability and statistics

21
Q

What happens if the disease becomes more common with the same test accuracy?

A

The probability of having the disease given a positive test rises to 0.5 or 50 percent

22
Q

What is the probability that the car is behind Door No. 1 after the host opens Door No. 3?

A

One-third, as calculated using Bayes's theorem (cards 28 through 34 work through the computation)

23
Q

What is Bayes’s theorem formula?

A

P(H|E) = P(E|H) × P(H) / P(E)

24
Q

What is P(E)?

A

The probability of testing positive

25
How do you calculate P(E)?
By the law of total probability: P(E) = P(E | disease) × P(disease) + P(E | no disease) × P(no disease)
26
What does the term 'sensitivity' refer to in the context of a medical test?
The probability that the test is positive when the subject has the disease
27
What does 'specificity' refer to in the context of a medical test?
The probability that the test is negative when the subject does not have the disease
28
What is the prior probability that the car is behind Door No. 1?
1/3
29
What is the probability that the host opens Door No. 3 if the car is behind Door No. 1?
1/2
30
What is P1 in the context of the probability that the host opens Door No. 3?
P(C1) × P(H3|C1) = 1/3 × 1/2 = 1/6
31
What is the probability that the host opens Door No. 3 if the car is behind Door No. 2?
1
32
What is P2 in the context of the probability that the host opens Door No. 3?
P(C2) × P(H3|C2) = 1/3 × 1 = 1/3
33
What is P3 in the context of the probability that the host opens Door No. 3?
P(C3) × P(H3|C3) = 1/3 × 0 = 0
34
What is the total probability that the host opens Door No. 3?
1/2
35
What should you do after the host opens Door No. 3, revealing a goat?
Switch doors
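
Cards 28 through 35 compress into a few lines of Python. A sketch of the full calculation, writing C1, C2, C3 for "the car is behind door 1, 2, 3" and H3 for "the host opens Door No. 3" after you picked Door No. 1:

```python
from fractions import Fraction

third = Fraction(1, 3)
prior = {1: third, 2: third, 3: third}                            # P(Ci)
likelihood = {1: Fraction(1, 2), 2: Fraction(1), 3: Fraction(0)}  # P(H3 | Ci)

p_h3 = sum(prior[i] * likelihood[i] for i in prior)               # total probability
posterior = {i: prior[i] * likelihood[i] / p_h3 for i in prior}   # Bayes's theorem

print(p_h3)              # 1/2
for door, p in posterior.items():
    print(door, p)       # 1 -> 1/3, 2 -> 2/3, 3 -> 0: switching doubles your odds
```
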
36
True or False: Most machine learning is inherently deterministic.
False
37
What does the perceptron algorithm find?
A hyperplane that can divide the data
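
As a reminder of what "finding a hyperplane" means in code, here is a generic perceptron sketch in Python (a textbook version with made-up toy data, not an implementation from the book):

```python
import numpy as np

def perceptron(X: np.ndarray, y: np.ndarray, epochs: int = 100) -> np.ndarray:
    """Learn weights w (bias folded in) so that sign(w @ x) matches y for separable data."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append a constant 1 for the bias term
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(Xb, y):
            if yi * (w @ xi) <= 0:              # misclassified: nudge the hyperplane
                w += yi * xi
    return w

# Toy linearly separable data with labels in {-1, +1}
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
print(np.sign(np.hstack([X, np.ones((4, 1))]) @ w))  # [ 1.  1. -1. -1.]
```
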
38
What is a random variable?
A number assigned to the outcome of an experiment
39
What type of distribution is a Bernoulli distribution?
A discrete distribution; it dictates the way the values of a binary (0 or 1) discrete random variable are distributed.
40
In a Bernoulli distribution, what is the probability mass function P(X)?
P(X = 1) = p and P(X = 0) = 1 - p
41
What is the expected value of a random variable?
The value expected over a large number of trials
42
How is variance calculated?
Var(X) = Σ (x - E(X))² × P(X = x), summed over every value x that X can take
43
What is the standard deviation?
The square root of the variance
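
Cards 39 through 43 fit in a few lines of Python. A sketch for a Bernoulli random variable, with p = 0.6 chosen arbitrarily for illustration:

```python
p = 0.6                                        # arbitrary Bernoulli parameter
pmf = {1: p, 0: 1 - p}                         # probability mass function

expected = sum(x * prob for x, prob in pmf.items())                    # E(X) = p = 0.6
variance = sum((x - expected) ** 2 * prob for x, prob in pmf.items())  # p(1 - p) = 0.24
std_dev = variance ** 0.5                                              # ~0.49

print(expected, variance, std_dev)
```
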
44
What shape does the normal distribution have?
A bell-shaped curve
45
What percentage of observed values lie within one standard deviation of the mean in a normal distribution?
68 percent
46
What is the variance in relation to the standard deviation?
Variance is the square of the standard deviation
47
What does a larger standard deviation indicate about the distribution?
A broader, squatter plot
48
What is the mean of the distribution also known as?
Expected value
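
The 68 percent figure is easy to verify empirically. A sketch that samples a normal distribution and counts the draws landing within one standard deviation of the mean (mean 0 and SD 1 are arbitrary choices):

```python
import random

mu, sigma, n = 0.0, 1.0, 100_000
samples = [random.gauss(mu, sigma) for _ in range(n)]
within = sum(abs(x - mu) <= sigma for x in samples) / n
print(within)   # ~0.68, as the card states
```
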
49
What is the probability of X = 0 in the coin toss experiment with 10 trials if 6 heads and 4 tails were observed?
0.6
50
What is the probability of X = 1 in the coin toss experiment with 10 trials if 6 heads and 4 tails were observed?
0.4
51
What does the expected value E(X) represent?
The average outcome of the random variable over many trials
52
Fill in the blank: The theoretical probability of getting heads on a single coin toss is ______.
1/2
53
What does sampling from an underlying distribution help us understand in machine learning?
How representative the data we have is of the true underlying distribution
54
What is the relationship between the number of trials and the expected difference in counts of heads and tails?
On the order of the square root of the total number of trials
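
A quick simulation illustrates both points: the empirical frequency of heads settles toward the theoretical 1/2, while the raw gap between the heads and tails counts typically grows like the square root of the number of trials (a sketch assuming a fair coin):

```python
import random

for trials in (100, 10_000, 1_000_000):
    heads = sum(random.random() < 0.5 for _ in range(trials))
    tails = trials - heads
    print(f"{trials:>9} tosses: freq(heads) = {heads / trials:.4f}, "
          f"|heads - tails| = {abs(heads - tails)}")
# freq(heads) approaches 0.5; |heads - tails| grows roughly like sqrt(trials).
```
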
55
What is the variance in relation to the standard deviation?
The variance is simply the square of the standard deviation.
56
What does a larger standard deviation indicate about a data distribution?
A larger standard deviation gives you a broader, squatter plot.
57
What characterizes a discrete random variable?
A discrete random variable is characterized by its probability mass function (PMF).
58
What characterizes a continuous random variable?
A continuous random variable is characterized by its probability density function (PDF).
59
Can you determine the probability of a specific value for a continuous random variable?
No, the probability of a specific, infinitely precise value is actually zero.
60
How is the probability that a continuous random variable falls within a range determined?
It is given by the area under the probability density function (PDF) bounded by the endpoints of that range.
61
What is the total area under a probability density function (PDF)?
The total area under the entire PDF equals 1.
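
Cards 58 through 61 can be checked numerically for the standard normal distribution using only Python's standard library, since its CDF (the accumulated area under the PDF) can be written with the error function:

```python
from math import erf, sqrt

def normal_cdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """P(X <= x) for a normal random variable: the area under the PDF up to x."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

# P(a <= X <= b) is the area under the PDF between a and b.
print(normal_cdf(1.0) - normal_cdf(-1.0))  # ~0.6827: within one SD of the mean
print(normal_cdf(float("inf")))            # 1.0: the total area under the PDF
# P(X == x) for any single exact value is zero: the area over one point vanishes.
```
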
62
What parameters are needed for the Bernoulli distribution?
The probability p.
63
What parameters are needed for the normal distribution?
The mean and variance.
64
In supervised learning, what does each instance of data represent?
Each instance of data is a d-dimensional vector.
65
In the context of supervised learning, what does the label y indicate?
y is -1 if the person did not have a heart attack, and 1 if they did.
66
What is the underlying probability distribution denoted as in supervised learning?
P(X, y).
67
What is the Bayes optimal classifier?
It is a classifier that predicts the category with the higher probability based on the underlying distribution.
68
What is maximum likelihood estimation (MLE)?
MLE estimates the best underlying distribution that maximizes the likelihood of observing the data.
69
What is the difference between MLE and MAP?
MLE maximizes P(D | θ), while MAP maximizes P(θ | D).
70
What does MAP stand for?
Maximum a posteriori estimation.
71
What is a common assumption made in Bayesian statistics?
That θ follows a distribution, meaning it is treated as a random variable.
72
What does the term 'prior distribution' refer to in Bayesian statistics?
It refers to the prior belief about the value of θ before observing the data.
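
A coin-toss example makes the MLE/MAP contrast concrete. Suppose we observe 6 heads in 10 tosses and, for MAP, place a Beta prior on θ; the Beta prior and its parameters are illustrative choices (the cards only require that θ receive some prior distribution), and both estimates then have closed forms:

```python
heads, tails = 6, 4                      # observed data D

# MLE: the theta maximizing P(D | theta) for Bernoulli data.
theta_mle = heads / (heads + tails)      # 0.6

# MAP: place a Beta(alpha, beta) prior on theta and maximize P(theta | D).
alpha, beta = 5, 5                       # prior belief that the coin is roughly fair
theta_map = (heads + alpha - 1) / (heads + tails + alpha + beta - 2)  # 10/18 ~ 0.556

print(theta_mle, theta_map)
# As the number of tosses grows, the prior's pull fades and MAP approaches MLE (card 76).
```
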
73
What is a concrete example of a distribution characterized by parameters?
A Bernoulli distribution characterized by the value p.
74
What is a key feature of the Gaussian distribution?
It is characterized by its mean and variance.
75
What approach is often used when there is no closed-form solution to a maximization problem?
Gradient descent.
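
As a toy illustration, the sketch below applies that idea to the coin-toss likelihood: gradient descent on the negative log-likelihood, or equivalently gradient ascent on the log-likelihood. This particular problem does have a closed form (θ = 0.6 for 6 heads in 10 tosses), so the code only shows the mechanics:

```python
heads, tails = 6, 4
theta, lr = 0.5, 0.01                         # starting guess and learning rate

for _ in range(2_000):
    # Derivative of the Bernoulli log-likelihood with respect to theta:
    grad = heads / theta - tails / (1 - theta)
    theta += lr * grad                        # ascend the log-likelihood
    theta = min(max(theta, 1e-6), 1 - 1e-6)   # keep theta strictly inside (0, 1)

print(theta)                                  # ~0.6, matching the closed-form MLE
```
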
76
How do MLE and MAP behave as the amount of sampled data grows?
They begin converging in their estimate of the underlying distribution.
77
Who were the two statisticians that first used Bayesian reasoning for authorship attribution?
Frederick Mosteller and David Wallace.
78
What problem did Mosteller and Wallace tackle using Bayesian reasoning?
The authorship of the disputed Federalist Papers.
79
What was the primary reason for the dispute over the authorship of the Federalist Papers?
Neither Madison nor Hamilton hurried to enter his claim, and by the time the question mattered they had become bitter political enemies
80
What was the outcome of Mosteller and Williams' initial analysis of sentence lengths in the Federalist Papers?
The average lengths for Hamilton and Madison were practically identical, providing little discriminatory power.
81
What statistical measure did Mosteller and Williams calculate to analyze sentence lengths?
Standard deviation (SD).
82
What were the average sentence lengths for Hamilton and Madison?
34.55 words and 34.59 words, respectively
83
What were the standard deviations of sentence lengths for Hamilton and Madison?
19 words for Hamilton and 20 words for Madison
84
What did Mosteller use as a teaching moment to educate his students on?
The difficulties of applying statistical methods
85
Who collaborated with Mosteller in the mid-1950s to explore Bayesian methods?
David Wallace
86
What did Douglass Adair suggest to Mosteller regarding The Federalist Papers?
To revisit the issue of authorship
87
What type of words did Mosteller and Wallace focus on for their analysis?
Function words
88
How did Mosteller and Wallace initially count the occurrence of function words?
By typing each word on a long paper tape
89
What issue did Mosteller encounter with the computer program used for counting?
It would malfunction after processing about 3000 words
90
What method did Mosteller and Wallace use to calculate authorship probability?
Bayesian analysis
91
What was the outcome of Mosteller and Wallace's analysis regarding the disputed papers?
Overwhelming evidence for Madison's authorship
92
What was the odds for Madison's authorship of paper number 55?
80 to 1
93
What was the significance of Mosteller and Wallace's work according to Patrick Juola?
It was a seminal moment for statisticians and was done objectively
94
What species of penguins were studied in the Palmer Archipelago?
Adélie, Gentoo, and Chinstrap
95
How many attributes were considered for each penguin in the study?
Five attributes
96
What is the function that the ML algorithm needs to learn?
f(x) = y
97
What is the problem with the assumption of linearly separable data?
It may not hold true with more data
98
What does Bayesian decision theory establish?
The bounds for the best predictions given the data
99
What does the histogram of Adélie penguins' bill depth show?
The distribution of bill depths
100
What type of probability is calculated for a specific value of bill depth?
Class-conditional probability
101
What is Bayes's theorem used for in the context of the penguin study?
To calculate the probabilities for each hypothesis
102
What is the prior probability that a penguin is a Gentoo based on the sample?
119/(119+146)
103
What is P(y = Gentoo)?
The prior probability that the penguin is a Gentoo, estimated as 119 / (119 + 146) ≈ 0.45.
104
How is P(x | y = Gentoo) determined?
It is read off from the distribution depicted in the plot, specifically from the Gentoo part.
105
What does P(x) represent?
The probability that the bill has some particular depth; by the law of total probability, P(x) = P(x | Adélie) × P(Adélie) + P(x | Gentoo) × P(Gentoo)
106
What is P(y = Gentoo | x)?
The posterior probability that the penguin is a Gentoo, given some bill depth x.
107
What is the Bayes optimal classifier?
A simple classifier using one feature (bill depth) to classify between two types of penguins, Gentoo and Adélie.
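
Cards 102 through 107 condense into a short sketch. The version below models each species' bill depth with a Gaussian; the means and standard deviations are hypothetical stand-ins (the book reads the class-conditional probabilities off histograms), while the priors follow card 102's sample counts:

```python
from math import exp, pi, sqrt

def gaussian_pdf(x: float, mu: float, sigma: float) -> float:
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

# Hypothetical bill-depth parameters (mm); priors from the sample counts in card 102.
classes = {
    "Gentoo": {"mu": 15.0, "sigma": 1.0, "prior": 119 / (119 + 146)},
    "Adélie": {"mu": 18.3, "sigma": 1.2, "prior": 146 / (119 + 146)},
}

def classify(x: float) -> str:
    """Bayes optimal rule: pick the class with the larger posterior P(y | x)."""
    scores = {name: c["prior"] * gaussian_pdf(x, c["mu"], c["sigma"])
              for name, c in classes.items()}
    return max(scores, key=scores.get)   # dividing by P(x) would not change the winner

print(classify(14.5))  # Gentoo
print(classify(18.5))  # Adélie
```
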
108
True or False: The Bayes optimal classifier is the best any ML algorithm can do.
True.
109
What does the term 'posterior probability' refer to?
The probability of a hypothesis after considering the evidence.
110
What limitations exist when estimating underlying distributions in machine learning?
We often do not have access to the true underlying distribution.
111
What are maximum likelihood estimation (MLE) and maximum a posteriori (MAP) estimation used for?
To approximate underlying distributions from a sample of data.
112
What happens when bill depth is used to distinguish Adélie from Chinstrap penguins?
They are indistinguishable using only bill depth.
113
What additional feature can improve classification between penguin species?
Bill length.
114
What is a probability density function (PDF)?
A function that describes the relative likelihood of a continuous random variable taking on a particular value; probabilities correspond to areas under it.
115
How does increasing the number of features affect the complexity of estimating probability distributions?
It increases the complexity and data requirements for accurate estimation.
116
Fill in the blank: If we have five features, each penguin can be represented as a vector in _______ space.
5D
117
What assumption simplifies the problem of estimating probability distributions in machine learning?
That all features are sampled independently from their own distributions.
118
What is a naïve Bayes classifier?
A classifier that assumes mutually independent features to simplify probability calculations.
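
Under that independence assumption, the class-conditional probability of a whole feature vector factors into a product of per-feature probabilities. A naïve Bayes sketch with two features (bill depth and bill length) and hypothetical Gaussian per-feature parameters:

```python
from math import exp, pi, sqrt

def gaussian_pdf(x: float, mu: float, sigma: float) -> float:
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

# Hypothetical per-class (mu, sigma) pairs for bill depth and bill length (mm).
params = {
    "Gentoo": {"prior": 0.45, "features": [(15.0, 1.0), (47.5, 3.1)]},
    "Adélie": {"prior": 0.55, "features": [(18.3, 1.2), (38.8, 2.7)]},
}

def naive_bayes(x):
    """Score each class as prior times the product of per-feature likelihoods."""
    scores = {}
    for name, c in params.items():
        score = c["prior"]
        for xi, (mu, sigma) in zip(x, c["features"]):
            score *= gaussian_pdf(xi, mu, sigma)   # independence: probabilities multiply
        scores[name] = score
    return max(scores, key=scores.get)

print(naive_bayes([15.2, 46.0]))  # Gentoo
print(naive_bayes([18.0, 39.0]))  # Adélie
```
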
119
What is the probability mass function?
A function that gives the probability that a discrete random variable is equal to a specific value.
120
What does D ~ P(X, y) signify?
The data D is sampled from the underlying distribution P(X, y).
121
What is the parameter θ in the context of probability distributions?
The parameters that define the distribution, varying for different types.
122
What is the goal of maximum likelihood estimation (MLE)?
To find the parameter θ that maximizes the likelihood of the data.
123
True or False: The more samples we have, the better the histogram will be in representing the true underlying distribution.
True.
124
What is maximum likelihood estimation (MLE)?
MLE tries to find the θ that maximizes the likelihood of the data, meaning it finds the θ that maximizes P_θ(X, y). Footnote: MLE is a method used in statistics to estimate parameters of a statistical model.
125
What does maximum a posteriori (MAP) estimation assume about θ?
MAP assumes that θ is a random variable and allows for specifying a probability distribution for it. Footnote: MAP incorporates prior beliefs about θ, known as the prior.
126
What is the prior in the context of MAP estimation?
The prior is the initial assumption about how θ is distributed. Footnote: for example, assuming a coin is fair or biased before observing any data.
127
What is the relationship between MAP estimation and the posterior probability distribution?
MAP finds the θ that maximizes the posterior probability of θ given the prior and the data. Footnote: the posterior represents updated beliefs about θ after observing the data.
128
What does learning the entire joint probability distribution P_θ(X, y) enable?
It enables generating new data that resemble the training data, leading to generative AI. Footnote: this process involves sampling from the learned distribution.
129
What is the naïve Bayes classifier?
It is an algorithm that learns the joint probability distribution with simplifying assumptions and uses Bayes's theorem. Footnote: the naïve Bayes classifier is often used for classification tasks.
130
What is discriminative learning?
Discriminative learning focuses on calculating conditional probabilities of the data belonging to one class or another. Footnote: it contrasts with generative learning, which models the entire data distribution.
131
What does P_θ(y | x) represent?
P_θ(y | x) represents the probability of the most likely class for a given feature vector x and optimal θ. Footnote: this is used in discriminative learning to make predictions.
132
What is an example of an algorithm that uses discriminative learning?
An example is the nearest neighbor (NN) algorithm. Footnote: the NN algorithm does not make assumptions about the underlying distribution of the data.
133
What kind of boundary does discriminative learning identify?
Discriminative learning identifies a boundary that separates clusters of data points. Footnote: it can be a linear hyperplane or a nonlinear surface.
134
What is the significance of the nearest neighbor (NN) algorithm?
The NN algorithm achieved results nearly as good as the Bayes optimal classifier without underlying distribution assumptions. Footnote: it was developed at Stanford in the 1960s.
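
For contrast with the generative methods above, a minimal 1-nearest-neighbor classifier: it makes no assumptions about the underlying distribution, just as the card says (toy data below; the 1960s Stanford work is only summarized here):

```python
import numpy as np

def nearest_neighbor(X_train: np.ndarray, y_train: np.ndarray, x: np.ndarray):
    """Return the label of the closest training point (Euclidean distance)."""
    distances = np.linalg.norm(X_train - x, axis=1)
    return y_train[np.argmin(distances)]

# Toy data: two clusters labeled -1 and +1.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [-1.0, -1.0], [-0.8, -1.2]])
y_train = np.array([1, 1, -1, -1])
print(nearest_neighbor(X_train, y_train, np.array([0.9, 1.1])))    # 1
print(nearest_neighbor(X_train, y_train, np.array([-1.1, -0.9])))  # -1
```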