General Flashcards
(89 cards)
Define bias
The bias is an error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting)
Define variance
The variance is an error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs (overfitting).
Define bias-variance tradeoff
It is the compromise in choosing a model that both accurately captures the regularities in its training data and generalises well to unseen data. High-variance learning methods represent their training set well but overfit to noisy or unrepresentative training data. In contrast, algorithms with high bias typically produce simpler models that don’t tend to overfit but may underfit their training data, failing to capture important regularities.
Models with low bias are usually more complex (e.g. higher-order regression polynomials), enabling them to represent the training set more accurately. In the process, however, they may also represent a large noise component in the training set, making their predictions less accurate - despite their added complexity. In contrast, models with higher bias tend to be relatively simple (low-order or even linear regression polynomials) but may produce lower variance predictions when applied beyond the training set.
How to overcome overfitting
- Reduce the model complexity (fewer features)
- Regularization (features contribute less)
What is a vector norm
A way of measuring the length of a vector
Give examples of vector norms
- L1
- L2
Define length of L2 norm ||B||_2
√(B_0^2 + B_1^2)
Define length of L1 norm ||B||_1
|B_0|+|B_1|
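A quick sketch of both norms in plain Python, generalising the two-component definitions above to any number of components:

```python
import math

def l2_norm(beta):
    # ||B||_2 = square root of the sum of squared components
    return math.sqrt(sum(b * b for b in beta))

def l1_norm(beta):
    # ||B||_1 = sum of absolute components
    return sum(abs(b) for b in beta)

beta = [3.0, 4.0]
print(l2_norm(beta))  # 5.0
print(l1_norm(beta))  # 7.0
```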
Sketch ||B||_2 = 2 and ||B||_1 = 2
See https://en.wikipedia.org/wiki/File:L1_and_L2_balls.jpg: the L2 ball is a circle and the L1 ball a diamond, each crossing the axes at 2
Describe Ordinary Least Squares
OLS chooses the parameters of a linear function of a set of explanatory variables by minimizing the sum of the squares of the differences between the observed dependent variable (values of the variable being predicted) in the given dataset and those predicted by the linear function. Geometrically, this is the sum of the squared distances, parallel to the axis of the dependent variable, between each data point in the set and the corresponding point on the regression line.
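The description above can be sketched with the closed-form solution for simple (one-feature) linear regression; the data here is a made-up toy example:

```python
def ols_fit(xs, ys):
    # minimise sum of (y_i - (b0 + b1 * x_i))^2 via the closed-form
    # solution: b1 = cov(x, y) / var(x), b0 = mean(y) - b1 * mean(x)
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b0 = my - b1 * mx
    return b0, b1

# toy data lying exactly on the line y = 2x + 1
b0, b1 = ols_fit([0, 1, 2, 3], [1, 3, 5, 7])
print(b0, b1)  # 1.0 2.0
```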
What is iid?
A sequence or other collection of random variables is independent and identically distributed (iid) if each random variable has the same probability distribution as the others and all are mutually independent.
The assumption is important in the classical form of the central limit theorem, which states that the probability distribution of the sum (or average) of IID variables with finite variance approaches a normal distribution.
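A minimal simulation of the CLT claim, using averages of iid Uniform(0, 1) draws; the sample sizes here are arbitrary choices:

```python
import random
import statistics

random.seed(0)

# Each entry is the average of 50 iid Uniform(0, 1) draws. By the CLT the
# distribution of such averages should concentrate near the mean 0.5,
# with variance (1/12) / 50 ~= 0.00167, and look approximately normal.
averages = [statistics.mean(random.random() for _ in range(50))
            for _ in range(2000)]

print(round(statistics.mean(averages), 2))      # close to 0.5
print(round(statistics.variance(averages), 4))  # close to 0.0017
```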
What is the problem with highly correlated explanatory variables in OLS?
The coefficient estimates have very high variance between different samples, so feature weights can become abnormally large
What is C in Ridge Regression (L2)?
C is the radius of the CIRCLE (the L2 ball) constraining the coefficients,
where you define ||B||^2_2 <= C^2
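A minimal sketch of how the L2 penalty shrinks a coefficient, using no-intercept one-feature ridge regression, whose closed-form slope is sum(x*y) / (sum(x^2) + lambda); the data and lambda values are made up:

```python
def ridge_slope(xs, ys, lam):
    # no-intercept ridge: b = sum(x * y) / (sum(x^2) + lam)
    # lam = 0 recovers the OLS slope; larger lam shrinks b toward 0,
    # keeping the solution inside a smaller L2 ball ||B||_2 <= C
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs, ys = [1, 2, 3], [2, 4, 6]     # toy data on the line y = 2x
print(ridge_slope(xs, ys, 0.0))   # 2.0 (plain OLS)
print(ridge_slope(xs, ys, 14.0))  # 1.0 (shrunk: 28 / (14 + 14))
```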
What is the main difference in outcome between using L1 and L2 space for regularisation?
Given the L1 diamond shape as opposed to the L2 circle, you’re more likely to hit a corner which zeros coefficients.
Which regularisation gives a sparse response?
L1, as it zeros some coefficients
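One way to see why L1 zeros coefficients is the soft-thresholding operator, which appears in coordinate-descent solvers for L1-regularised regression; a minimal sketch with made-up numbers:

```python
def soft_threshold(b, t):
    # proximal operator of the L1 penalty: shifts b toward zero by t,
    # and sets any coefficient with |b| <= t exactly to zero
    if b > t:
        return b - t
    if b < -t:
        return b + t
    return 0.0

coefs = [2.5, -0.3, 0.1, -1.8]
print([soft_threshold(b, 0.5) for b in coefs])  # [2.0, 0.0, 0.0, -1.3]
```

An L2 penalty, by contrast, only rescales coefficients and never maps a nonzero value exactly to zero.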
What is a generative model?
A generative model describes how data is generated, in terms of a probabilistic model.
In the scenario of supervised learning, a generative model estimates the joint probability distribution of data P(X, Y) between the observed data X and corresponding labels Y
Give examples of generative models
- Naive Bayes
- Hidden Markov Models
- Latent Dirichlet Allocation
- Boltzmann Machines
Why would you choose a discriminative model?
Because you don’t have enough data to estimate the density f reliably, so a generative model’s estimates would have massive variance.
Generative
p(x,y) = f(x|y)p(y)
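A minimal Naive Bayes sketch instantiating this factorisation with counts; the toy data, binary features, and Laplace smoothing are my own choices, not from the card:

```python
from collections import Counter, defaultdict

def train(samples):
    # samples: list of (features tuple, label)
    # estimate p(y) from label counts and p(x_j | y) from per-feature counts,
    # i.e. the pieces of p(x, y) = p(x | y) * p(y)
    prior = Counter(label for _, label in samples)
    cond = defaultdict(Counter)  # cond[(j, label)][value] = count
    for feats, label in samples:
        for j, v in enumerate(feats):
            cond[(j, label)][v] += 1
    return prior, cond

def predict(prior, cond, feats):
    n = sum(prior.values())
    def score(label):
        # p(y) * product over features of p(x_j | y), with add-one
        # (Laplace) smoothing for the two possible binary values
        p = prior[label] / n
        for j, v in enumerate(feats):
            p *= (cond[(j, label)][v] + 1) / (prior[label] + 2)
        return p
    return max(prior, key=score)

data = [((1, 0), "spam"), ((1, 1), "spam"), ((0, 0), "ham"), ((0, 1), "ham")]
prior, cond = train(data)
print(predict(prior, cond, (1, 0)))  # spam
```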
Generative versus discriminative, discuss
Discriminative models the probability of a class given an observation, P(C|x); generative models the probability of an observation given a class, P(x|C). For generative, given data, you model the whole distribution. For discriminative, given data, you model only the decision boundary. https://www.youtube.com/watch?v=OWJ8xVGRyFA
Pros and cons of discriminative model
Pros: easy to train & needs fewer observations
Cons: can classify, but cannot generate the data/observations back
Pros and cons of generative model
Pros: you get the underlying idea of what the classifier is built on
Cons: very expensive (lots of parameters) and needs lots of data
Define SVM
A non-probabilistic binary linear classifier that separates the categories with a hyperplane (or set of hyperplanes) by a clear gap that is as wide as possible: the hyperplane is chosen so that the distance between it and the nearest point x_i from either group is maximised
How do SVM perform non-linear classification?
Via the kernel trick
Describe the kernel trick
The idea is that data that isn’t linearly separable in n-dimensional space may be linearly separable in a higher-dimensional space. But, thanks to the Lagrangian dual formulation, we need not compute the explicit transformation of our data; we only need the inner product of our data in that higher-dimensional space.
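A small check of the trick for the polynomial kernel k(u, v) = (u·v)^2: the kernel value computed in the original 2-D space equals the inner product under an explicit degree-2 feature map φ (the map used here is a standard textbook choice, not from the card):

```python
import math

def phi(v):
    # explicit degree-2 feature map for 2-D input (x1, x2):
    # (x1^2, x2^2, sqrt(2) * x1 * x2) -- three coordinates we never
    # need to materialise when using the kernel directly
    x1, x2 = v
    return [x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2]

def poly_kernel(u, v):
    # k(u, v) = (u . v)^2, computed entirely in the original 2-D space
    return sum(a * b for a, b in zip(u, v)) ** 2

u, v = [1.0, 2.0], [3.0, 4.0]
explicit = sum(a * b for a, b in zip(phi(u), phi(v)))
print(poly_kernel(u, v))  # 121.0
print(explicit)           # agrees up to float rounding
```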
https://towardsdatascience.com/understanding-the-kernel-trick-e0bc6112ef78