Statistics Flashcards by Tarjei Bondevik

Explain the balance between exploration and exploitation in clinical trials

You want to explore new types of medicine to learn whether they work better, but this trades off against exploitation, that is, treating the patient with methods you already know work.

Explain the epsilon-greedy approach in a multi-armed bandit algorithm

With probability 1 - epsilon, select the action with the highest expected return (exploitation); with probability epsilon, select a random action instead (exploration).
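A minimal sketch of the selection rule, assuming the estimated returns are already stored in a list. The function name, the example values in `q`, and the seed are illustrative and not from the original card.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an arm index: random with probability epsilon, else the best estimate."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: any arm, uniformly
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

q = [0.2, 0.8, 0.5]      # hypothetical estimated returns for three arms
rng = random.Random(0)   # seeded only so the sketch is reproducible
picks = [epsilon_greedy(q, epsilon=0.1, rng=rng) for _ in range(1000)]
```

With epsilon = 0.1 most picks land on the arm with the highest estimate, while the other arms are still tried occasionally.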

What is softmax exploration (in the multi-armed bandit algorithm)?

The probability of selecting an action increases with its expected return: it is proportional to the exponential of the expected return, usually scaled by a temperature parameter. With two similar, highly rated actions, both will therefore be selected given some time.
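As a rough illustration, the selection probabilities can be computed as below. The `temperature` parameter and the example values are assumptions for the sketch; the card itself does not name them.

```python
import math

def softmax_probs(q_values, temperature=1.0):
    """Selection probabilities proportional to exp(Q(a) / temperature)."""
    m = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

# Two equally good arms and one worse arm (hypothetical estimates).
probs = softmax_probs([1.0, 1.0, 0.0], temperature=0.5)
```

The two equally rated arms receive the same probability, so neither is starved the way it could be under a purely greedy rule.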

Explain the following term in the Upper Confidence Bound algorithm (from multi-armed bandits), with respect to exploration/exploitation:

max_a{Q(a) + sqrt(2*log t / N(a))}

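You want to maximize the expected value of an action Q(a) (exploit) while also favoring a large sqrt term (explore). Here t is the total number of steps so far and N(a) is the number of times action a has been selected, so the sqrt term keeps increasing as long as the action is not selected and shrinks each time it is.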
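A small sketch of how the rule could be applied, assuming `q_values` holds Q(a), `counts` holds N(a), and `t` is the total number of steps so far. Trying every untried arm first is a common convention added here to avoid dividing by zero.

```python
import math

def ucb1_select(q_values, counts, t):
    """Pick the arm maximizing Q(a) + sqrt(2 * log(t) / N(a))."""
    for a, n in enumerate(counts):
        if n == 0:
            return a  # assumption: try each arm once before using the bound
    scores = [q + math.sqrt(2 * math.log(t) / n)
              for q, n in zip(q_values, counts)]
    return max(range(len(scores)), key=lambda a: scores[a])
```

For example, `ucb1_select([0.5, 0.4], [100, 1], 101)` picks the second arm: its estimate is lower, but it has been tried only once, so its exploration bonus dominates.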

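What is the difference between a probability mass function and a probability density function?

A PMF is for discrete variables and gives the probability of each value directly. A PDF is for continuous variables; its value at a point is a density, and probabilities are obtained by integrating it over an interval.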

What is the difference between a binomial and a Bernoulli distribution?

The Bernoulli distribution is a special case of the binomial distribution where a single trial is conducted (so n would be 1 for such a binomial distribution)
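The special case can be checked numerically. This is a small sketch with illustrative function names; `p = 0.3` is an arbitrary example value.

```python
import math

def binomial_pmf(k, n, p):
    """P(k successes in n independent trials with success probability p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def bernoulli_pmf(k, p):
    """P(k) for a single trial, k in {0, 1}."""
    return p if k == 1 else 1 - p

p = 0.3
same = all(abs(binomial_pmf(k, 1, p) - bernoulli_pmf(k, p)) < 1e-12
           for k in (0, 1))
```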

What is sequential decision making?

In artificial intelligence, sequential decision making refers to algorithms that take the dynamics of the world into consideration and decide step by step, so that parts of the problem can be delayed until they must be solved.

What is the difference between a multi-armed bandit and a contextual bandit?

The contextual bandit also uses information about the state of the environment. For example, my Quora feed may be using some kind of contextual bandit, where exploration and exploitation are both used to provide high-quality content (as in a multi-armed bandit), but the choice also depends on contextual information (about me, the user).

What three criteria are met in a Poisson Process?

Events are independent of each other. The occurrence of one event does not affect the probability another event will occur.
The average rate (events per time period) is constant.
Two events cannot occur at the same time.
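One common way to simulate such a process is to draw exponentially distributed waiting times between events, which is the standard construction for a constant-rate Poisson process. The rate, horizon, and seed below are hypothetical example values.

```python
import random

def simulate_poisson_process(rate, horizon, rng):
    """Event times on [0, horizon] with independent exponential gaps."""
    times, t = [], 0.0
    while True:
        t += rng.expovariate(rate)  # waiting time until the next event
        if t > horizon:
            return times
        times.append(t)

rng = random.Random(42)
events = simulate_poisson_process(rate=3.0, horizon=1000.0, rng=rng)
average_rate = len(events) / 1000.0  # should be close to the chosen rate
```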

What is the name of a Python library that offers a high-level interface for MCMC algorithms?

PyMC3 (called PyMC in newer releases)

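Can Gaussian Processes be used for tuning neural network hyperparameters?

Yes. Finding the optimal hyperparameters is expensive because every candidate setting requires training the network. Using a Gaussian Process as a surrogate model to choose which setting to try next can be much more efficient than a brute-force grid search.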

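How does the computational cost of a Gaussian Process scale with d dimensions/features and n training samples?

O(n³) + O(dn²). The first term is for inverting the n×n kernel matrix; the second term is for evaluating the kernel between pairs of d-dimensional points, as needed for the prediction.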

If you run a relay stage close to the average pace of the runners in the stage, and you observe the distribution of the paces of the other runners, what distribution will you observe? (Hint: Inspection paradox)

A bimodal distribution. The true distribution may peak around your pace, but since you are running at this pace, you will rarely pass or be passed by these runners, so you observe few of them. You mostly see runners who are clearly faster or clearly slower than you. And since you run in a relay, your initial ranking is random, that is, the best runners are not necessarily in front of you.
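The effect can be sketched with a small simulation. The modelling assumption here, which is not stated in the card, is that the chance of encountering another runner is proportional to the absolute difference between their speed and yours. All numbers are hypothetical.

```python
import random

rng = random.Random(1)
my_speed = 10.0  # km/h, hypothetical
# True speeds of the other runners, centred on my own speed.
speeds = [rng.gauss(my_speed, 1.5) for _ in range(20000)]
# Assumption: runners are observed in proportion to the speed difference.
weights = [abs(v - my_speed) for v in speeds]
observed = rng.choices(speeds, weights=weights, k=20000)

def share_near(values, center, width=0.5):
    """Fraction of values within `width` of `center`."""
    return sum(abs(v - center) < width for v in values) / len(values)

true_share = share_near(speeds, my_speed)
observed_share = share_near(observed, my_speed)
```

Under this assumption `observed_share` comes out far below `true_share`: the runners near your own pace almost vanish from what you see, which leaves one hump of slower runners and one of faster runners.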

What is the difference between residuals and errors?

The errors are the deviations of the observations from the (true) population mean, while the residuals are the deviations of the observations from the sample mean. The sum of the residuals is (by definition) 0; this is generally not the case for the errors.
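A short numeric check of the distinction, assuming a simulated sample where the true mean `mu` is known (which it would not be in practice). The values are illustrative.

```python
import random
import statistics

rng = random.Random(0)
mu = 5.0  # true population mean, known here only because we simulate
sample = [rng.gauss(mu, 2.0) for _ in range(10)]
sample_mean = statistics.mean(sample)

errors = [x - mu for x in sample]              # deviations from the true mean
residuals = [x - sample_mean for x in sample]  # deviations from the sample mean
```

`sum(residuals)` is zero up to floating-point rounding, while `sum(errors)` equals n times the gap between the sample mean and `mu`.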

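With increasing variance in the distribution, should you focus more (or less) on exploration vs exploitation (in Thompson sampling)?

More on exploration. A wider distribution means the sampled values for that action vary more, so it is selected more often until its estimate becomes more certain.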
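A sketch of this behaviour, assuming Bernoulli rewards with Beta posteriors (a common setup, not specified in the card). The success and failure counts are hypothetical.

```python
import random

def thompson_select(successes, failures, rng):
    """Sample one value per arm from its Beta posterior and pick the largest."""
    samples = [rng.betavariate(s + 1, f + 1)
               for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=lambda a: samples[a])

rng = random.Random(0)
# Arm 0 has 60 successes and 40 failures. Arm 1 has a posterior mean of 0.5
# in both scenarios, but far fewer observations in the first one.
wide = sum(thompson_select([60, 1], [40, 1], rng) == 1 for _ in range(2000))
narrow = sum(thompson_select([60, 50], [40, 50], rng) == 1 for _ in range(2000))
```

The second arm is chosen much more often when its posterior is wide (`wide`) than when it is narrow (`narrow`), even though its mean estimate is the same.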

What is the Poisson distribution?

A discrete distribution whose PMF gives the probability of a given number of events in a fixed interval of time. For example, the number of emails received per day may (if the conditions of a Poisson process are satisfied) have a Poisson distribution.
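For reference, the PMF is P(k) = lambda^k * exp(-lambda) / k!, where lambda is the expected number of events in the interval. A minimal sketch, with a hypothetical rate of 4 emails per day:

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events when the expected count is lam."""
    return lam**k * math.exp(-lam) / math.factorial(k)

lam = 4.0  # hypothetical: 4 emails per day on average
p_none = poisson_pmf(0, lam)                           # no emails at all
total = sum(poisson_pmf(k, lam) for k in range(50))    # should be close to 1
mean = sum(k * poisson_pmf(k, lam) for k in range(50)) # should be close to lam
```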

What does d_1, …, d_n ~ Poisson(theta) say?

That each sample d_i follows a Poisson distribution with expected value theta.

Conjugacy makes exact (closed-form) Bayesian inference possible. What does this imply?

That the posterior distribution p(theta|x) is in the same family (such as gamma) as the prior distribution p(theta). This happens when the prior is a “conjugate prior” for the likelihood function p(x|theta). For example, the Beta distribution is a conjugate prior for the binomial and Bernoulli distributions.
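For the Beta and binomial pair, the update reduces to adding counts to the prior parameters. A minimal sketch with hypothetical prior parameters and data:

```python
def beta_binomial_update(alpha, beta, successes, failures):
    """Posterior Beta parameters after observing binomial data."""
    return alpha + successes, beta + failures

# Hypothetical prior Beta(2, 2) and data: 7 successes, 3 failures.
a, b = beta_binomial_update(2, 2, successes=7, failures=3)
posterior_mean = a / (a + b)  # mean of a Beta(a, b) distribution
```

The posterior is again a Beta distribution, here Beta(9, 5), so no numerical integration or sampling is needed.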

What is a random forest?

An algorithm where you train multiple decision trees, each on a random subset of your original data (typically sampled with replacement, and often with a random subset of the features). During prediction, let all your “subset decision trees” make a decision, and select the option that most trees predicted.

How does the decision tree algorithm work?

It’s a divide & conquer approach with respect to the features in the data. Assume that we have three categorical features A, B, C, and output 0 or 1. First, split the data into one subset per value of feature A. If the data in one subset only has output 1 (or only 0), it is a “pure subset”, and you can stop splitting it. If not, continue splitting that subset with feature B and so on, until you are left with only pure subsets.
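The splitting procedure can be sketched as a short recursive function. This simplified version takes the features in a fixed order, as in the description; practical implementations usually choose the next feature with an impurity measure such as information gain. The toy rows and all names are hypothetical.

```python
def build_tree(rows, features):
    """rows: list of (feature_dict, label); split until subsets are pure."""
    labels = [label for _, label in rows]
    if len(set(labels)) == 1 or not features:
        # Pure subset (or no features left): predict the most common label.
        return max(set(labels), key=labels.count)
    feature, rest = features[0], features[1:]
    branches = {}
    for value in {x[feature] for x, _ in rows}:
        subset = [(x, y) for x, y in rows if x[feature] == value]
        branches[value] = build_tree(subset, rest)
    return (feature, branches)

def predict(tree, x):
    """Follow the branches until a leaf (a plain label) is reached."""
    while isinstance(tree, tuple):
        feature, branches = tree
        tree = branches[x[feature]]
    return tree

rows = [({"A": "x", "B": "p"}, 1), ({"A": "x", "B": "q"}, 1),
        ({"A": "y", "B": "p"}, 0), ({"A": "y", "B": "q"}, 1)]
tree = build_tree(rows, ["A", "B"])
```

Here the subset with A = "x" is already pure, so only the subset with A = "y" is split further on B. A feature value that was never seen in training would raise a `KeyError` in this sketch.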

What is pruning (in a decision tree setting)?

Construct an overfitted tree, and then delete leaves (or whole subtrees) that do not improve performance according to selected criteria.

Explain the difference between bagging and boosting

Bagging: you train an ensemble of models, each on a dataset sampled with replacement from the training data (such that each dataset is different), and you output the average (or some other aggregate, such as a majority vote) of their predictions. This reduces the variance and helps against overfitting.

Boosting: you train models sequentially, and each new model focuses on the points the previous models predicted with high error. Hence, you train selectively on the difficult parts of the dataset. This can reduce bias and variance, but you may also overfit.

What is meant by additive modelling (in gradient boosting terminology)?

When modelling a complex function, you can add several simple functions (e.g. 30, 2x, sin x) to model something very complex.

What are weak learners (in gradient boosting terminology)?

Simple models (for example shallow decision trees) that perform only modestly on their own, but that are added together to learn complicated functions.

Which quantity minimizes the L1 loss? (sigma | y_i - ??? | )

The median, and not the mean. The mean is the quantity that minimizes the L2 loss (i.e. the mean squared error).
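A quick numeric check on a small dataset with one outlier. The data values are hypothetical.

```python
import statistics

def l1_loss(c, ys):
    """Sum of absolute deviations from the constant c."""
    return sum(abs(y - c) for y in ys)

def l2_loss(c, ys):
    """Sum of squared deviations from the constant c."""
    return sum((y - c) ** 2 for y in ys)

ys = [1, 2, 3, 4, 100]  # hypothetical data with one outlier
median, mean = statistics.median(ys), statistics.mean(ys)
```

The median gives the smaller L1 loss and the mean gives the smaller L2 loss. The example also shows that the median is much less affected by the outlier than the mean.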

What is the MAE?

The mean absolute error, defined as MAE = (1/n) * sigma |e_i|, where e_i is the error of prediction i and n is the number of predictions.

Explain the histogram-based methods used on features in LightGBM

To reduce training time, the values of each continuous feature are grouped into discrete bins, so that finding a split only requires scanning the bins instead of every data value. The cost of split finding then scales roughly as O(n_bins*n_features) rather than O(n_data*n_features). However, using fewer, coarser bins reduces the accuracy, and finding the optimal binning is very complex.

Name four methods LightGBM utilizes to speed up the rate-determining splitting step in its algorithm

1. Histograms (binning feature values together).
2. Ignoring sparse inputs (i.e. ignoring zeros when calculating the optimal split).
3. Subsampling (assuming that data points with low gradients are already well trained and do not need as much focus).
4. Feature bundling (merging features that are rarely nonzero at the same time into a single feature).

What do the L1 and L2 losses optimize?

The mean absolute error, and the mean square error, respectively. (Or similarly, the least absolute deviations, and the least square errors, respectively).

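How do gradient boosting machines (GBM) perform gradient descent when you use MSE as the loss function?

For the squared error 1/2*(y_i - y_hat_i)^2, the negative gradient with respect to the prediction y_hat_i is (y_i - y_hat_i), i.e. the difference between the exact value and the predicted value, i.e. the residual. Each new model in GBM is fitted to these residuals, so adding it is a gradient descent step on the loss.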
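A minimal sketch of this idea, assuming one-dimensional inputs and regression stumps as the simple models. The learning rate, number of rounds, and toy data are hypothetical, and this is not how any particular library implements it.

```python
def fit_stump(xs, residuals):
    """Best single-split stump on 1-D inputs under squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, lm, rm)
    _, threshold, lm, rm = best
    return lambda x: lm if x <= threshold else rm

def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    stumps, preds = [], [base] * len(xs)
    for _ in range(n_rounds):
        # Residuals = negative gradient of 1/2 * (y - prediction)^2.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(f, xs, ys):
    return sum((f(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [x * x for x in xs]  # toy target
model = gradient_boost(xs, ys)
baseline = sum(ys) / len(ys)
```

Each round moves the predictions a small step towards the targets, so the training MSE of `model` ends up well below that of the constant `baseline` prediction.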

Explain the difference between using gradient descent on a neural network, and gradient boosting machines (GBM)

When using gradient descent on a neural network, we use the gradient to update the weights of a fixed model based on the training data. In GBM, we use the gradient to decide which model to add next, based on the predictions of our current model: it is helpful to think that in GBM, we are doing gradient descent in prediction space rather than in weight space.

What is the confusion matrix?

A table used to evaluate the accuracy of a classifier, with the counts of true negatives/positives on the diagonal, and the counts of false negatives/positives on the off-diagonal elements.
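A small sketch for binary labels, assuming the layout where rows are the actual class and columns are the predicted class. The labels below are made up for illustration.

```python
def confusion_matrix(y_true, y_pred):
    """2x2 counts for labels 0/1: rows = actual class, columns = predicted."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix

y_true = [1, 0, 1, 1, 0, 0]  # hypothetical ground truth
y_pred = [1, 0, 0, 1, 1, 0]  # hypothetical predictions
cm = confusion_matrix(y_true, y_pred)
accuracy = (cm[0][0] + cm[1][1]) / len(y_true)  # diagonal over total
```

With this layout `cm[0][0]` holds the true negatives, `cm[1][1]` the true positives, `cm[0][1]` the false positives, and `cm[1][0]` the false negatives.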

When does it make sense to log transform data?

If you are interested in relative differences. For example, a stock price increase from 1 to 1.1 and one from 100 to 110 both have a relative increase of 10%. These relative differences are captured by the log transform: lg 1.1 - lg 1.0 = lg(1.1/1.0) = lg(110/100) = lg 110 - lg 100. Without the log transform, the increase from 1 to 1.1 looks 100 times smaller than the increase from 100 to 110, and is poorly captured by the model.
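The identity is easy to verify numerically, using the stock price example from the answer:

```python
import math

# Absolute increases differ by a factor of 100.
small_abs = 1.1 - 1.0
large_abs = 110 - 100

# After a log transform, both 10% increases give the same difference.
small_log = math.log(1.1) - math.log(1.0)
large_log = math.log(110) - math.log(100)
```

The base of the logarithm does not matter for this argument; it only rescales both differences by the same constant.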

Why may ML algorithms perform worse if one uses too many features?

Then there's a risk that the algorithm is simply fitting noise, i.e. that it overfits.

Explain the filter method within feature selection, and when to use it.

The filter method is to a) drop features that have low correlation with the output variable, and subsequently, b) drop remaining features that are correlated with each other (that is, keep only one of them if two are correlated). This method is extremely fast and may be good if you have many features. However, it is not very accurate.
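The two steps can be sketched as follows, assuming Pearson correlation as the measure. The thresholds, function names, and toy data are hypothetical choices for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def filter_select(features, target, min_target_corr=0.3, max_pair_corr=0.9):
    """a) keep features correlated with the target, b) drop redundant ones."""
    relevant = {name: col for name, col in features.items()
                if abs(pearson(col, target)) >= min_target_corr}
    ranked = sorted(relevant,
                    key=lambda name: -abs(pearson(relevant[name], target)))
    kept = []
    for name in ranked:
        if all(abs(pearson(relevant[name], relevant[k])) < max_pair_corr
               for k in kept):
            kept.append(name)
    return kept

target = [1, 2, 3, 4, 5, 6]
features = {
    "a": [1.1, 1.9, 3.2, 3.9, 5.1, 6.0],         # related to the target
    "a_copy": [2.2, 3.8, 6.4, 7.8, 10.2, 12.0],  # redundant with "a"
    "noise": [3, 1, 1, 3, 3, 1],                 # weakly related
}
kept = filter_select(features, target)
```

Only correlations are computed and no model is trained, which is why the method is fast. In this toy example one of the two redundant features survives and the noise feature is dropped.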

Explain the backward elimination method, which is one of the wrapper methods within feature selection.

a) Evaluate the model with all features. b) Remove the worst-performing feature and re-evaluate the model; hopefully this will improve accuracy. c) Continue until accuracy stops improving. This method is relatively accurate, but may come with a large computational cost.
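A sketch of the loop, assuming the caller supplies a `score` function that trains and evaluates a model on a given feature list (higher is better). The toy scoring function below is a hypothetical stand-in for real model training.

```python
def backward_elimination(features, score):
    """Drop one feature at a time for as long as the score improves."""
    current = list(features)
    best_score = score(current)
    while len(current) > 1:
        # Score every candidate set obtained by removing one feature.
        candidates = [(score([f for f in current if f != drop]), drop)
                      for drop in current]
        cand_score, drop = max(candidates)
        if cand_score <= best_score:
            break  # no removal improves the score: stop
        current.remove(drop)
        best_score = cand_score
    return current, best_score

# Toy stand-in: useful features add to the score, all others cost a little.
useful = {"age": 0.3, "dose": 0.5}

def toy_score(feats):
    return sum(useful.get(f, -0.05) for f in feats)

selected, final_score = backward_elimination(
    ["age", "dose", "noise1", "noise2"], toy_score)
```

Every round calls `score` once per remaining feature, which is where the large computational cost comes from when each call means retraining a model.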

What is the n_estimators parameter in LightGBM, and what can happen if it's too large?

It is the number of boosting rounds, essentially the number of trees that are trained in sequence on the errors of the model so far. If it becomes too large, the model will overfit (imagine n_estimators being 10⁶ without a proper stopping criterion: then the model would obviously overfit).

How can you detect overfitting with k-fold cross-validation?

If the error differs a lot between the folds (i.e. large variance across the k folds), it may be an indication of overfitting. A validation error that is much higher than the training error is another sign.

Given two causal possibilities, 1. A -> B: P(A, B) = P(A)*P(B|A) and 2. B -> A: P(A, B) = P(B)*P(A|B), assume that A -> B is the true causal relation. When applying a trained model to transfer data whose marginal probabilities are slightly different, why do you need fewer samples to get low regret with model 1?

If A->B is the causal relation, P(B|A) will stay constant even if you change P(A). If e.g. A is a discrete variable with 10 possibilities, you will now have a model with 10 parameters (roughly). In the second case, both P(B) and P(A|B) will change if you change P(A), hence you get a model with 10x10 parameters. In such a case, you need more transfer data to get a good model with low regret.

If you have infinite computational power, and know all relevant parameters in your system, how can you always find the causal relations?

By trying all possible causal models on the transfer data, and observing which one learns fastest.