Active Learning

Active learning is a special case of semi-supervised machine learning in which a learning algorithm is able to interactively query the user (or some other information source) to obtain the desired outputs at new data points. In statistics literature it is sometimes also called optimal experimental design. There are situations in which unlabeled data is abundant but manually labeling is expensive. In such a scenario, learning algorithms can actively query the user/teacher for labels. This type of iterative supervised learning is called active learning. Since the learner chooses the examples, the number of examples to learn a concept can often be much lower than the number required in normal supervised learning.

Association Rules

Detect relationships or associations between specific values of categorical variables in large data sets. Market basket analysis: uncover hidden patterns in large data sets, such as “customers who order product A often also order product B or C” or “employees who said positive things about initiative X also frequently complain about issue Y but are happy with issue Z.”

Bayes Theorem

P(A | B) = P(B | A) * P(A) / P(B); P(A) being the number of instances of a given value divided by the total number of instances; P(B) is often ignored since this equation is typically used in a probability ratio that compares two different values for A, with P(B) being the same for both

Bayesian Networks

Bayesian networks are a graphical formalism for representing the structure of a probabilistic model, i.e. the ways in which the random variables may depend on each other. Intuitively, they are good at representing domains with a causal structure, and the edges in the graph determine which variables directly influence which other variables. They can be equivalently viewed as representing a factorization structure of the joint probability distribution, or as encoding a set of conditional independence assumptions about the distribution.

Deeply understand the top 20 concepts

…

Discriminative vs. Generative

Discriminative models learn to discriminate between different inputs. For example, classifying images as containing a dog or not containing a dog is a discriminative task. An example of a discriminative model is a support-vector machine. Generative models usually involve probabilities and their distinguishing feature is that you can generate new data from them. For example, if you tried to estimate the probability distribution over images containing dogs and a different distribution over images not containing dogs, you would have generatively modeled the situation. You could use these distributions to sample new images of dogs (or new images not containing dogs). If you wanted to use this generative model for a discriminative task, you could: given an image, you could see which of the two distributions assigns higher probability to that image and then choose that as your result. Thus, there is a distinction here between discriminative model and discriminative task: it may be possible to use generative models for discriminative tasks.

Elastic Nets

…

Ensemble Learning

Machine learning approach that combines the results from many different algorithms, whose combined vote (from the ensemble) provides a more robust and accurate predictive output than any single algorithm can muster.

Factor Analysis

used as a variable reduction technique to identify groups of clustered variables. (submitted by Vincent Granville)

Feature Scaling

Different features will have different effects on the accuracy of classifications. A scaling factor can be used for each feature to reduce its effect on the classification, possibly to 0. These scales can be optimized to not only improve classifications, but to also show the relative importance of the features.

Feature Selection

…

Frequentist vs. Bayesian

The Bayesian view is essentially that everything should be done with Bayes’ rule: computing posterior probabilities by multiplying priors with likelihoods. In a Bayesian approach, you usually have a posterior distribution over models. Then, if you want to use this model for something, like making a prediction, you integrate over your posterior distribution of models to get a sort of “expected value” of the thing you are trying to predict. Frequentist is often used to mean not Bayesian. In a frequentist approach, you typically find a “best” solution (i.e., model) to the problem you are trying to solve. You then use this best model to make the prediction. I believe there is a relationship between the frequentist approach and discriminative models, and likewise for the Bayesian approach and generative models.

Generative Approach

Models the measurements in each class. It is more work, but it can exploit more prior knowledge, needs less data, is more modular, and can handle missing or corrupted data. Methods include mixture models and Hidden Markov Models.

Graph Databases

they use graph structures (a finite set of ordered pairs or certain entities), with edges, properties and nodes for data storage. It provides index-free adjacency, meaning that every element is directly linked to its neighbour element.

HDPs or other Bayesian non-parametric model

…

Hyperplane

A hyperplane in an n-dimensional Euclidean space is a flat, n-1 dimensional subset of that space that divides the space into two disconnected parts. First think of a line line. Now pick a point. That point divides the real line into two parts (the part above that point, and the part below that point). The real line has 1 dimension, while the point has 0 dimensions. So a point is a hyperplane of the real line. Now think of the two-dimensional plane. Now pick any line. That line divides the plane into two parts (“left” and “right” or maybe “above” and “below”). The plane has 2 dimensions, but the line has only one. So a line is a hyperplane of the 2d plane. Notice that if you pick a point, it doesn’t divide the 2d plane into two parts. So one point is not enough. Now think of a 3d space. Now to divide the space into two parts, you need a plane. Your plane has two dimensions, your space has three. So a plane is the hyperplane for a 3d space.

Kernel Trick

Adding non-linear features to the representation of our data can make linear models much more powerful. However, often we don’t know which features to add, and adding many features (like all possible interactions in a 100 dimensional feature space) might make computation very expensive. The Kernel Trick is a clever mathematical trick that allows us to learn a classifier in a higher dimensional space without actually computing the new, possibly very large representation. It works by directly computing the distance (more precisely, the scalar products) of the data points for the expanded feature representation, without ever actually computing the expansion. There are two ways to map your data into a higher dimensional space that are commonly used with support vector machines: the polynomial kernel, which computes all possible polynomials up to a certain degree of the original features (like feature1 ** 2 * feature2 ** 5), and the radial basis function (rbf) kernel, also known as Gaussian kernel. The Gaussian kernel is a bit harder to explain, as it corresponds to an infinite dimensional feature space. One way to explain the Gaussian kernel is that it considers all possible polynomials of all degrees, but the importance of the features decreases for higher degrees.

Lagrange

…

LDA

…

Machine Learning

A computer program is said to learn from from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

Matrix Factorization

…

Monte-Carlo Simulation

Computing expectations and probabilities in models of random phenomena using many randomly sampled values. Akin to compute probability of winning a given roulette bet (say black) by repeatedly placing it and counting success ratio. Useful in complex models characterized by uncertainty. (submitted by Renato Vitolo)

Network Analytics

The science of describing and, especially, visualizing the connections among objects. The objects might be human, biological or physical. Graphical representation is a crucial part of the process; Wayne Zachary’s classic 1977 network diagram of a karate club reveals the centrality of two individuals, and presages the club’s subsequent split into two clubs. The key elements are the nodes (circles, representing individuals) and edges or links (lines representing connections).

Overfitting

When the model describes an error or noise instead of the actual model. A problem that occurs when data-mining is used to create overly complex models that are not suited to making accurate predictions. “High variance”. If we have too many features the learned hypothesis may fit the training set very well, but fails to generalzie to new examples.

Random Walks

A mathematical formalization of a path that consists of a succession of random steps. For example, the path traced by a molecule as it travels in a liquid or a gas, the search path of a foraging animal, the price of a fluctuating stock and the financial status of a gambler can all be modeled as random walks, although they may not be truly random in reality. The term random walk was first introduced by Karl Pearson in 1905.

Stochastic Process

A collection of random variables, representing the evolution of some system of random values over time. This is the probabilistic counterpart to a deterministic process (or deterministic system). Instead of describing a process which can only evolve in one way (as in the case, for example, of solutions of an ordinary differential equation), in a stochastic or random process there is some indeterminacy: even if the initial condition (or starting point) is known, there are several (often infinitely many) directions in which the process may evolve. Examples: Markov chains, stock market fluctuations, EKG, brownian motion, random walks

Underfitting

Coefficients are generally biased (as well as inconsistent) “high bias” fitting a line to a curve

Joint Probability Distribution

In the study of probability, given at least two random variables X, Y, …, that are defined on a probability space, the joint probability distribution for X, Y, … is aprobability distribution that gives the probability that each of X, Y, … falls in any particular range or discrete set of values specified for that variable. In the case of only two random variables, this is called a bivariate distribution, but the concept generalizes to any number of random variables, giving a multivariate distribution.

Supervised Learning

In Supervised Learning, we have a dataset consisting of both features and labels. The task is to construct an estimator which is able to predict the label of an object given the set of features.

Unserpervised Learning

The data has no labels, and we are interested in finding similarities between the objects in question. In a sense, you can think of unsupervised learning as a means of discovering labels from the data itself. Unsupervised learning comprises tasks such as dimensionality reduction, clustering, and density estimation. Some unsupervised learning problems are:

- Given detailed observations of distant galaxies, determine which features or combinations of features are most important in distinguishing between galaxies.
- Given a mixture of two sound sources (for example, a person talking over some music), separate the two (this is called the blind source separation problem).
- Given a video, isolate a moving object and categorize in relation to other moving objects which have been seen.

Dimensionality Reduction

Dimensionality reduction is the task of deriving a set of new artificial features that is smaller than the original feature set while retaining most of the variance of the original data.

The most common technique for dimensionality reduction is called Principal Component Analysis. PCA can be done using linear combinations of the original features using a truncated Singular Value Decomposition of the matrix X so as to project the data onto a base of the top singular vectors.

Bias-variance tradeoff

Leave-one-out cross-validation: gives approximately unbiased estimates of the test error since each training set contains almost the entire data set.

But: we average the outputs of n fitted models, each of which is trained on an almost identical set of observations hence the outputs are highly correlated. Since the variance of a mean of quantities increases when correlation of these quantities increase, the test error estimate from a LOOCV has higher variance than the one obtained with k-fold cross validation

Typically, we choose k=5k=5 or k=10k=10, as these values have been shown empirically to yield test error estimates that suffer neither from excessively high bias nor high variance.