Machine Learning Flashcards

Question

Cost Function

Answer 1

A measurement of the cost to the performance task of making a prediction Y' when the actual label is y. The use of accuracy to evaluate a model assumes uniform costs of errors and uniform benefits of correct classifications. Example: "Squared error function".

Answer 2

A method for estimating the accuracy of an inducer by dividing the data into k mutually exclusive subsets, or folds, of approximately equal size. The inducer is trained and tested k times. Each time it is trained on the data set minus a fold and tested on that fold. The accuracy estimate is the average accuracy for the k folds.

Answer 3

A tree of questions to guide an end user to a conclusion based on values from a single vector of data. The classic example is a medical diagnosis based on a set of symptoms for a particular patient. A common problem in data science is to automatically or semi-automatically generate decision trees based on large sets of data coupled to known conclusions. Example algorithms are CART and ID3. (Submitted by Michael Malak)

Answer 4

each node in the tree tests an attribute, decisions are represented by the edges leading away from that node with leaf nodes representing the final decision for all instances that reach that leaf node

Answer 5

1) have a specific edge for no value; 2) track the number of instances that follow each path and assign that instance to the most popular edge; 3) give instances weights and split the instance evenly down each possible edge. Once the value has reached the leaf nodes, recombine it using the weights

Answer 6

since decision trees are often created by continuously splitting the instances until there is no more information gain, it is possible that some splits were done with too little information gain and result in overfitting of the model to the training data. Basic idea is to check child nodes to see if combining them with their parent would result in an increase in entropy below some threshold

Answer 7

because only a single attribute is tested at a node, it is possible that two copies of the same subtree will need to be placed in a tree if the attributes from that subtree were not tested in the root node

Answer 8

if a node has edges for each possible value of that nominal value, then that value will not be tested again further down the tree. If a node groups the values into subsets with an edge per subset, then that attribute may be tested again

Answer 9

can be tested for less than, greater than, equal to, within some range. Can result in just two edges or multiple edges. Can also have an edge that represents no value. May be tested multiple times in a single path

Answer 10

Refers to a class of methods that includes neural networks and deep belief nets. Useful for finding a hierarchy of the most significant features, characteristics, and explanatory variables in complex data sets. Particularly useful in unsupervised machine learning of large unlabeled datasets. The goal is to learn multiple layers of abstraction for some data. For example, to recognize images, we might want to first examine the pixels and recognize edges; then examine the edges and recognize contours; examine the contours to find shapes; the shapes to find objects; and so on.

Answer 11

Machine learning approach that combines the results from many different algorithms, whose combined vote (from the ensemble) provides a more robust and accurate predictive output than any single algorithm can muster.

Answer 12

used as a variable reduction technique to identify groups of clustered variables. (submitted by Vincent Granville)

Answer 13

different features will have different effects on the accuracy of classifications. A scaling factor can be used for each feature to reduce its effect on the classification, possibly to 0. These scales can be optimized to not only improve classifications, but to also show the relative importance of the features.

Answer 14

Mean normalization of data to speed up gradient descend

Answer 15

1: measures the expected error rate if one of the results from a set is randomly applied to one of the items in the set.

Answer 16

An iterative optimization algorithm for finding a local minimum. At each iteration, gradient descent operates by moving the current solution in the direction of the negative gradient of the function (the direction of "steepest descent"). Also known as steepest descent.

Answer 17

they use graph structures (a finite set of ordered pairs or certain entities), with edges, properties and nodes for data storage. It provides index-free adjacency, meaning that every element is directly linked to its neighbour element.

Answer 18

The second layer of a three-layer network where the input layer sends its signals, performs intermediary processing

Answer 19

A hyperplane in an n-dimensional Euclidean space is a flat, n-1 dimensional subset of that space that divides the space into two disconnected parts. First think of a line line. Now pick a point. That point divides the real line into two parts (the part above that point, and the part below that point). The real line has 1 dimension, while the point has 0 dimensions. So a point is a hyperplane of the real line. Now think of the two-dimensional plane. Now pick any line. That line divides the plane into two parts ("left" and "right" or maybe "above" and "below"). The plane has 2 dimensions, but the line has only one. So a line is a hyperplane of the 2d plane. Notice that if you pick a point, it doesn't divide the 2d plane into two parts. So one point is not enough. Now think of a 3d space. Now to divide the space into two parts, you need a plane. Your plane has two dimensions, your space has three. So a plane is the hyperplane for a 3d space.

Answer 20

replacing the dot-product function with a new function that returns what the dot product would have been if the data had first been transformed to a higher dimensional space. Usually done using the radial-basis function

Answer 21

gives a single number from two vectors by multiplying each value in the first vector by the corresponding value in the second vector and adding them all together

Answer 22

Binary classification problems arise when we seek to separate two sets of data points in R^n, each corresponding to a given class. We seek to separate the two data sets using simple ''boundaries'', typically hyperplanes. Once the boundary is found, we can use it to predict the class a new point belongs to, by simply checking on which side of the boundary it falls.

Answer 23

Trying to fit a linear continuous function to the data. Univariate or Multivariate.

Answer 24

A kind of regression analysis often used when the dependent variable is dichotomous and scored 0 or 1. It is usually used for predicting whether something will happen or not, such as graduation, business failure, or heart attack-anything that can be expressed as event/non-event. Independent variables may be categorical or continuous in logistic regression analysis.

Answer 25

Analytics in which computers "learn" from data to produce models or rules that apply to those data and to other similar data. Predictive modeling techniques such as neural nets, classification and regression trees (decision trees), naive Bayes, k-nearest neighbor, and support vector machines are generally included. One characteristic of these techniques is that the form of the resulting model is flexible, and adapts to the data. Statistical modeling methods that have highly structured model forms, such as linear regression, logistic regression and discriminant analysis are generally not considered part of machine learning. Unsupervised learning methods such as association rules and clustering are also considered part of machine learning.

Answer 26

A computer program is said to learn from from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

Answer 27

The method preferred by statisticians for determining which variables to include in a model. It is a modified version of R^2 which penalizes each new variable on the basis of how many have already been admitted. Due to its construct, R^2 will always increase as you add new variables, which result in models that over-fit the data and have poor predictive ability. Adjusted R^2 results in more parsimonious models that admit new variables only if the improvement in fit is larger than the penalty, which improves the ultimate goal of out-of-sample prediction. (Submitted by Santiago Perez)

Answer 28

simplest way of expressing output from machine learning, cells in table represent the resulting decision based on the row and column which represent the conditions

Answer 29

Define Test error as summed error for classification (false prediction)

Answer 30

Negative means not having the symptoms

Answer 31

presence of something we are looking for - 1

Answer 32

Of all predicted positives, how many are positive?

Answer 33

Of all positives how many were predicted as positive?

Answer 34

Hypotheses correctly predicts negative output.

Answer 35

Hypotheses correctly predicts positive output.

Answer 36

algorithm that automatically selects a good model function for a dataset.

Answer 37

Computing expectations and probabilities in models of random phenomena using many randomly sampled values. Akin to compute probability of winning a given roulette bet (say black) by repeatedly placing it and counting success ratio. Useful in complex models characterized by uncertainty. (submitted by Renato Vitolo)

Answer 38

Multiclass classification. Reducing a classification problem with multiple features that have to be predicted to a simple classification problem by looking at one feature at a time. Then to determin the final prediction we take the max of all the predicted values.

Answer 39

a field of computer science involved with interactions between computers and human languages

Answer 40

viewing relationships among the nodes in terms of the network or graph theory, meaning analysing connections between nodes in a network and the strength of the ties.

Answer 41

The science of describing and, especially, visualizing the connections among objects. The objects might be human, biological or physical. Graphical representation is a crucial part of the process; Wayne Zachary's classic 1977 network diagram of a karate club reveals the centrality of two individuals, and presages the club's subsequent split into two clubs. The key elements are the nodes (circles, representing individuals) and edges or links (lines representing connections).

Answer 42

Interconnected neural cells. With experience, networks can learn, as feedback strengthens or inhibits connections that produce certain results. Computer simulations of neural networks show analogous learning.

Answer 43

In neuronal networks the process of calculating the subsequent layers of the network. Each layer depends on the calculations done on the layer before it.

Answer 44

Symmetry breaking for neural networks is achieved by:

Answer 45

starts with a set of random solutions called the "population". Costs are calculated for all these solutions and the best solutions are saved (called "elitism"). The other solutions are discarded and are replaced using two methods. The first is "mutation" where simple random changes are made to the elite solutions. The second is "crossover" or "breeding" where two elite solutions are combined, such as by replacing random values in one with values in the other. This is repeated either for a set number of iterations or until several iterations have passed with no improvement.

Answer 46

starts with a random solution and looks at the set of neighboring solutions for those that are better, this is repeated continuously until no neighboring solution has a lower cost. Major drawback is that it will get stuck in a local minimum

Answer 47

difference between the current entropy and the weighted-average entropy of the two new groups. Used when deciding which attribute to use when splitting instances in a decision tree

Answer 48

running a hill climbing algorithm several times using new random initial input to attempt to reach the global minimum

Answer 49

similar to hill climbing, but adds a new variable "temperature" which starts high and gradually cools. With each iteration one variable is randomly changed in a certain direction. The new cost is compared to the old cost and is accepted if the new cost is lower. If the new cost is higher, it may be accepted based on a probability that is computed from the current "temperature". As the temperature cools, it gets less and less likely that the worse solution would be accepted.

Answer 50

when there is no clear slope in the change in cost function as the values are varied. This is because these algorithms all attempt to follow that slope toward a global minimum.

Answer 51

When the model describes an error or noise instead of the actual model. A problem that occurs when data-mining is used to create overly complex models that are not suited to making accurate predictions. "High variance". If we have too many features the learned hypothesis may fit the training set very well, but fails to generalzie to new examples.

Answer 52

A decision tree classifier that produces a "forest of trees", yielding highly accurate models, essentially by iteratively randomizing one input variable at a time in order to learn if this randomization process actually produces a less accurate classifier. If it doesn't, then that variable is ousted from the model.

Answer 53

Gathers information (e.g., demographics, genre, keywords, preferences, survey responses) to generate a profile for each user or item. Users are matched to items based on their profiles. Example: Pandora's Music Genome Project.

Answer 54

Based on past user behavior. Each user's history of behaviors (ratings, purchases, or viewing history) is used to make associations between users with similar behavior and between items of interest to the same users. Example: Netflix. Methods: 1. Neighborhood-based methods, based on user-user or item-item distances; 2. Latent factor or reduced- dimension models, which automatically discover a small number of descriptive factors for users and items; 3. Low-rank matrix factorization is the best-known example of reduced-dimension models and is among the most flexible and successful methods underlying recommendation systems. There are many variants of matrix factorization, including probabilistic and Bayesian versions. Restricted Boltzmann machines, a type of deep learning neural network, are another state-of-the-art approach.

Answer 55

Retailers: Amazon, Target; Movies + Music Sites: Netflix, last.fm, Pandora; Social networks: Facebook, Twitter; Grocery stores: Tesco; Content publishers: Ad networks: Yahoo!, Google; CRM: Next-best offer in marketing decision making

Answer 56

....Probabilistic and Bayesian versions, Restricted Boltzmann machines, a type of deep learning neural network, are another state-of-the-art approach.

Answer 57

We are trying to predict results within a continuous output, meaning that we are trying to map input variables to some continuous function.

Answer 58

to define the dependency between variables. It assumes a one-way causal effect from one variable to the response of another variable.

Answer 59

a single regression equation is much smaller and less complex than a regression tree, but tends to also be much less accurate.

Answer 60

decision trees which predict numeric quantities. The leaf nodes of these trees have a numeric quantity instead of a class. This numeric quantity is often decided by taking the average of all training set values to which the leaf node applies

Answer 61

A way of applying Occam's Razor to real problems. Penalizing the higher-order parameters to avoid overfitting and getting a simpler hypothesis.

Answer 62

lambda - controlls the trade-off between fitting the data well & keeping the parameters small.

Answer 63

an S-shaped mathamatical curve is often used to describe the activation function of a neuron over time

Answer 64

Variable selection process for multivariate regression. In forward stepwise selection, a seed variable is selected and each additional variable is inputed into the model, but only kept if it significantly improves goodness of fit (as measured by increases in R^2). Backwards selection starts with all variables, and removes them one by one until removing an additional one decreases R^2 by a non-trivial amount. Two deficiencies of this method are that the seed chosen disproportionately impacts which variables are kept, and that the decision is made using R^2, not Adjusted R^2. (submitted by Santiago Perez)

Answer 65

We are given a data set and already know what our correct output should look like, having the idea that there is a relationship between the input and the output. Categorized into "regression" and "classification" problems.

Answer 66

use training data to first build a model which is later used to classify instances

Answer 67

can extrapolate information from one dimensional data (input space) and some information about weights & correlative relationships to another dimension (feature space)

Answer 68

divide an instance space by finding the line that is as far as possible from both classes. This line is called the "maximum-margin hyperplane"

Answer 69

when determining the maximum-margin hyperplane for a support vector machine, only the points near the hyperplane are important. These points near the boundary are called the support vectors

Answer 70

Powerful Jedi machine learning classifier. Among classification algorithms used in supervised machine learning, SVM usually produces the most accurate classifications. Read more about SVM in this article "The Importance of Location in Real Estate, Weather, and Machine Learning."

Answer 71

since support vector machines use dot-products (just like linear classifiers) when determining the hyperplane, they can be turned into a non-linear classifier by replacing the dot-product with a kernel such as the radial-basis function

Answer 72

open source library for SVMs written in C++ (w/a Java version as well). Trains an SVM model, makes predictions, and tests predictions w/in a dataset with support for kernel methods such as the radial-basis function

Answer 73

A broader term that includes the preparation of text for mining, the mining itself, and specialized applications such as sentiment analysis. Preparing text for analysis involves automated parsing and interpretation (natural language processing), then quantification (e.g. identifying the presence or absence of key terms). Wikipedia

Answer 74

A set of (t, x) values where x is usually a scalar (though could be a vector) and the t values are usually sampled at regular intervals (though some time series are irregularly sampled). In the case of regularly sampled time series, the t is usually dropped from the actual data, replaced with just a t0 (start time) and delta-t that apply to the whole series. (Submitted by Michael Malak)

Answer 75

Coefficients are generally biased (as well as inconsistent) "high bias" fitting a line to a curve

Answer 76

Allows us to approach problems with little or no idea what our results should look like. We can derive structure from data where we don't necessarily know the effect of the variables.

Answer 77

don't use training data and only find structure within the data, such as in clustering where the instances are placed into groups. Other examples include non-negative matrix factorization and self-organizing maps

Answer 78

Free data mining package, for data exploration, profiling, mining, and visual analytics, containing hundreds of machine learning algorithms, techniques, and methods.

Answer 79

A collection of random variables, representing the evolution of some system of random values over time. This is the probabilistic counterpart to a deterministic process (or deterministic system). Instead of describing a process which can only evolve in one way (as in the case, for example, of solutions of an ordinary differential equation), in a stochastic or random process there is some indeterminacy: even if the initial condition (or starting point) is known, there are several (often infinitely many) directions in which the process may evolve. Examples: Markov chains, stock market fluctuations, EKG, brownian motion, random walks

Answer 80

A mathematical formalization of a path that consists of a succession of random steps. For example, the path traced by a molecule as it travels in a liquid or a gas, the search path of a foraging animal, the price of a fluctuating stock and the financial status of a gambler can all be modeled as random walks, although they may not be truly random in reality. The term random walk was first introduced by Karl Pearson in 1905.

Answer 81

Markov models are a kind of probabilistic model often used in language modeling. The observations are assumed to follow a Markov chain, where each observation is independent of all past observations given the previous one. In a Markov chain, a system transitions stochastically from one state to another. It is a memoryless process, in the sense that the distribution over the next state depends only on the current state, and not on the state at any past time. Markov chains are useful models of many natural processes and the basis of powerful techniques in probabilistic inference and randomized algorithms. A famous Markov chain is the so-called "drunkard's walk", a random walk on the number line where, at each step, the position may change by +1 or ?1 with equal probability. From any position there are two possible transitions, to the next or previous integer. The transition probabilities depend only on the current position, not on the manner in which the position was reached. For example, the transition probabilities from 5 to 4 and 5 to 6 are both 0.5, and all other transition probabilities from 5 are 0. These probabilities are independent of whether the system was previously in 4 or 6.

Answer 82

Stochastic gradient descent (SGD) is an iterative optimization algorithm that can be applied to functions that are a linear combination of differentiable functions. These types of functions often arise when the full objective function is a linear combination of objective functions at each data point, e.g. a least squares objective function. While batch gradient descent uses the full gradient of the function, SGD approximates the full gradient by using the gradient at each of the functions in the linear combination, e.g. the gradient of the objective function at each data point. SGD is often used to optimize non-convex functions, e.g. those that arise in neural networks.

Answer 83

In calculus (a branch of mathematics), a differentiable function of one real variable is a function whose derivative exists at each point in its domain. As a result, the graph of a differentiable function must have a (non-vertical) tangent line at each point in its domain, be relatively smooth, and cannot contain any breaks, bends, or cusps.

Answer 84

aka Recall or True positive rate. Percent of actual Positives that were correctly predicted.

Answer 85

Percent of predicted Positives that were correct.

Answer 86

aka True negative rate. Percent of actual Negatives that were correctly predicted.

Answer 87

Percent of all predictions that were correct.

Answer 88

Models the measurements in each class. It is more work, but it can exploit more prior knowledge, needs less data, is more modular, and can handle missing or corrupted data. Methods include mixture models and Hidden Markov Models.

Answer 89

Focuses only on discriminating one class from another. It can be more efficient once trained and requires fewer modeling assumptions. Methods include logistic regression, generalized linear classifiers, and nearest-neighbor.

Answer 90

Discriminative models learn to discriminate between different inputs. For example, classifying images as containing a dog or not containing a dog is a discriminative task. An example of a discriminative model is a support-vector machine. Generative models usually involve probabilities and their distinguishing feature is that you can generate new data from them. For example, if you tried to estimate the probability distribution over images containing dogs and a different distribution over images not containing dogs, you would have generatively modeled the situation. You could use these distributions to sample new images of dogs (or new images not containing dogs). If you wanted to use this generative model for a discriminative task, you could: given an image, you could see which of the two distributions assigns higher probability to that image and then choose that as your result. Thus, there is a distinction here between discriminative model and discriminative task: it may be possible to use generative models for discriminative tasks.

Answer 91

The Bayesian view is essentially that everything should be done with Bayes' rule: computing posterior probabilities by multiplying priors with likelihoods. In a Bayesian approach, you usually have a posterior distribution over models. Then, if you want to use this model for something, like making a prediction, you integrate over your posterior distribution of models to get a sort of "expected value" of the thing you are trying to predict. Frequentist is often used to mean not Bayesian. In a frequentist approach, you typically find a "best" solution (i.e., model) to the problem you are trying to solve. You then use this best model to make the prediction. I believe there is a relationship between the frequentist approach and discriminative models, and likewise for the Bayesian approach and generative models.

Answer 92

Ways of encoding the structure (independencies) of a probability distribution into a picture. The two main types of graphical models are directed graphical models and undirected graphical models, probability distributions represented by directed and undirected graphs respectively. Each node in the graph represents a random variable, and a connection between two nodes indicates a possible dependence between the random variables. So, for example, a fully disconnected graph would represent a fully independent set of random variables, meaning the distribution could be fully factored as P(x,y,z,...)=P(x)P(y)P(z)... Note that the graphs represent structures, not probabilities themselves.

Answer 93

Adding non-linear features to the representation of our data can make linear models much more powerful. However, often we don't know which features to add, and adding many features (like all possible interactions in a 100 dimensional feature space) might make computation very expensive. The Kernel Trick is a clever mathematical trick that allows us to learn a classifier in a higher dimensional space without actually computing the new, possibly very large representation. It works by directly computing the distance (more precisely, the scalar products) of the data points for the expanded feature representation, without ever actually computing the expansion. There are two ways to map your data into a higher dimensional space that are commonly used with support vector machines: the polynomial kernel, which computes all possible polynomials up to a certain degree of the original features (like feature1 \*\* 2 \* feature2 \*\* 5), and the radial basis function (rbf) kernel, also known as Gaussian kernel. The Gaussian kernel is a bit harder to explain, as it corresponds to an infinite dimensional feature space. One way to explain the Gaussian kernel is that it considers all possible polynomials of all degrees, but the importance of the features decreases for higher degrees.

Answer 94

Cross-validation based on k-folds splits your data into a number k of folds (portions of your data) of equal size. Then, each fold is held out in turn as a test set and the others are used for training. Each iteration uses a different fold as a test, which produces an error estimate. The process continues until all the k-folds are used once as a test set and you have a k number of error estimates that you can compute into a mean error estimate (the cross-validation score)

Machine Learning Flashcards

(134 cards)