Statistics Flashcards
Explain the balance between exploration and exploitation in clinical trials
You want to explore new types of medicine, but you have to make a trade-off with exploiting, that is, trying to make the patient well on methods you already know work.
Explain the epsilon greedy approach in a multi armed bandit algorithm
Select action with highest expected return (exploitation) but have some probability epsilon to rather select a random action (exploration)
What is softmax exploration (in the multi armed bandit algorithm)?
The probability of selecting an action is proportional to the expected return of an action (such that with two similar, highly rated actions, both will be selected given some time)
Explain the following term in the Upper Confidence Bound algoritm (from multi armed bandit), with respect to exploration/exploitation:
max_a{Q(a) + sqrt(2*log t / N(a))}
You want to maximize the expected value of an action Q(a) (exploit) while also maximize the sqrt term (explore), which will increase as long as the action is not selected
What is the difference between probability mass function and probability density function?
PMF is for discrete variables, PDF is for continuous variables.
What is the difference between a binomial and a bernoulli distribution?
The Bernoulli distribution is a special case of the binomial distribution where a single trial is conducted (so n would be 1 for such a binomial distribution)
What is sequential decision making?
In artificial intelligence, sequential decision making refers to algorithms that take the dynamics of the world into consideration, thus delay parts of the problem until it must be solved.
What is the difference between a multi armed bandit and the contextual bandit?
The contextual bandit also uses information about the state of the environment. For example, my Quora feed may be using some kind of contextual bandits, where both exploration and exploitation is used to provide high quality content (M.A.B.) but also the contextual information (about me, the user).
What three criteria are met in a Poisson Process?
- Events are independent of each other. The occurrence of one event does not affect the probability another event will occur.
- The average rate (events per time period) is constant.
- Two events cannot occur at the same time.
What is the name of a Python library that offers a high level interface for MCMC algorithms
PYMC3
Can Gaussian Processes be used for tuning neural network hyperparams?
Yes. Acquiring the optimal hyperparams is expensive, and using a Gaussian Process can be much more effective than a brute force grid search.
How much CPU does a Gaussian Process with d dimensions/features and n number of training samples use?
O(n³) + O(dn²). The first term is for the inversion; the second term for the prediction.
If you run a relay stage close to the average pace of the runners in the stage, and you observe the distribution of the other runners, what distribution will you observe? (Hint: Inspection paradox)
A bimodal distribution. The true distribution may peak around your pace, but since you are running in this pace, you will observe most of these runners. And since you run in a relay, your initial ranking is random, that is, the best runners are not necessarily in front of you.
What is the difference between residuals and error?
The errors are the deviations of the observations from the (true) population mean, while the residuals are the deviations of the observations from the sample mean. The sum of the residuals are (by definition) 0, this is generally not the case for errors.
With increasing variance in the distribution, should you focus more (or less) on exploration vs exploitation (in Thompson sampling)
More on exploration.
What is the Poisson distribution?
A discrete PMF expressing the probability of a given number of events in a fixed interval of time. For example, the number of mails received per day may (if conditions are satisfied) have a Poisson distribution
What does d_1, … , d_n ~poisson(theta) say?
That each sample d_i follows a Poisson distribution, with expected value theta
Conjugacy is required to do exact Bayesian inference. What does this imply?
That the posterior distribution p(theta|x) is in the same family (such as gamma) as the prior distribution p(theta). This happens when the prior is a “conjugate prior” for the likelihood function p(x|theta). For example, the Beta distribution is a conjugate prior to the binomial and bernoulli distributions.
What is random forest?
An algorithm where you split your data into multiple decision trees, where the decision trees are a subset of your original data. During prediction, let all your “subset decision trees” make a decision, and select the option that most trees predicted.
How does the decision tree algorithm work?
It’s a divide & conquer approach with respect to the features in the data. Assume that we have three categorical features A, B, C, and output 0 or 1. First, split data into all possible values of feature A. If data in one subset only has output 1 (or only 0), it is a “pure subset”, and you can stop. If not, continue splitting with feature B and so on, until you are left with only pure subsets.
What is pruning (in a decision tree setting)
Construct an overfitted tree, and then delete leaves based on selected criteria.
Explain the difference between bagging and boosting
Bagging: you create an ensemble of predictions (with replacement, such that each dataset is different), where you output the average (or some other measure) of the predictions. This reduces the variance, and handles overfitting.
Boosting: you do a prediction on parts of the training dataset, and sequentially, you select the points with high error for the next dataset to train. Hence, you train selectively on difficult parts of the dataset. This can reduce bias and variance, but you may also overfit.
What is meant by additive modelling? (in gradient boosting terminology)
When modelling a complex function, you can add several simple functions (e.g. 30, 2x, sin x) to model something very complex.
What are weak learners (in gradient boosting terminology)
That you add simple models (i. e. weak learners) to learn complicated functions.