Detect relationships or associations between specific values of categorical variables in large data sets.
Market basket analysis: uncover hidden patterns in large data sets, such as "customers who order product A often also order product B or C" or "employees who said positive things about initiative X also frequently complain about issue Y but are happy with issue Z."
P(A | B) = P(B | A) * P(A) / P(B); P(A) being the number of instances of a given value divided by the total number of instances; P(B) is often ignored since this equation is typically used in a probability ratio that compares two different values for A, with P(B) being the same for both
A graphical formalism for representing the structure of a probabilistic model:
- Show the ways in which the random variables may depend on each other
- Good at representing domains with a causal structure
- Edges in the graph determine which variables directly influence which other variables
- Factorization structure of the joint probability distribution
- Encoding a set of conditional independence assumptions
AdaBoost can be interpreted as a sequential procedure for minimizing the exponential loss on the training set with respect to the coefficients of a particular basis function expansion. This leads to generalizations of the algorithm to different loss functions.
We are trying to predict results in a discrete output. In other words, we are trying to map input variables into discrete categories.
Methods to assign a set of objects into groups. These groups are called clusters and objects in a cluster are more similar to each other than to those in other clusters. Well known algorithms are hierarchical clustering, k-means, fuzzy clustering, supervised clustering.
Cluster Analysis: Distance Metrics Between Items
1. Euclidean distance: The geometric distance between objects in the multidimensional space. The shortest path between two objects. It is used to obtain sphere-shaped clusters. 2. City block (Manhattan) distance. It corresponds to the sum of distances along each dimension and is less sensitive to outliers. It is used to obtain diamond-shaped clusters. 3. Cosine similarity measure. It is calculated by measuring the cosine of angle between two objects. It is used mostly to compute the similarity between two sets of transaction data.
Cluster Analysis: Distance Measures Between Clusters
In hierarchical clustering: 1. Average linkage: It is the average distance between all the points in two clusters. 2. Single linkage: It is the distance between nearest points in two clusters 3. Complete linkage: It is the distance between farthest points in two clusters.
Cluster Analysis: Hierarchical Clustering
Cluster Analysis: Gaussian Mixture Models (GMM)
An unsupervised learning technique for clustering that generates a mixture of clusters from the full data set using a Gaussian (normal) data distribution model for each cluster. The GMM's output is a set of cluster attributes (mean, variance, and centroid) for each cluster, thereby producing a set of characterization metadata that serves as a compact descriptive model of the full data collection.
Cluster Analysis: K-Means Overview
What: K-means is one of the most widely used clustering techniques because of its simplicity and speed. It partitions the data into a user specified number of clusters, k. Why: Simplicity, speed. It is fast for large data sets, which are common in segmentation.
Cluster Analysis: K-Means: Cautions
Clusters may converge to a local minimum. Due to this issue, the clusters that are obtained might not be the right ones. To avoid this, it might be helpful to run the algorithm with different initial cluster centroids and compare the results.
Cluster Analysis: K-Means: Scaling Options
Note that each iteration needs N × k comparisons, which determines the time complexity of one iteration. The number of iterations required for convergence varies and may depend on N, but as a first cut, this algorithm can be considered linear in the dataset size. The k-means algorithm can take advantage of data parallelism. When the data objects are distributed to each processor, step 3 can be parallelized easily by doing the assignment of each object into the nearest cluster in parallel.
Cluster Analysis: K-Means: How
1. Initialization: The algorithm is initialized by picking the initial k cluster representatives or "centroids". These initial seeds can be sampled at random from the dataset, or by taking the results of clustering a small subset of the data;
2. Data Assignment. Each data point is assigned to its closest centroid, with ties broken arbitrarily. This results in a partitioning of the data.;
3. Recompute and reset the location of the "means". Each cluster representative is relocated to the center (mean) of all data points assigned to it.;
Now repeat step 3 and 4 until the convergence criterion is met (e.g., the assignment of objects to clusters no longer changes over multiple iterations) or maximum iteration is reached.
Cluster Analysis: K-Means: 4 Key Steps
1: Initialization of k centroids; 2: Data points assigned to nearest centroid; 3: Relocation of each mean to the center of it's points; 4: Repeat step 2 and 3 until assignments no longer change
K-Nearest Neighbors algorithm (for classification), K-Means (for clustering), K-itemsets (for association rule mining - see A, above), K-Nearest Neighbors Data Distributions (for outlier detection), KD-trees (for indexing and rapid search of high-dimensional data), and more KDD (Knowledge Discovery from Data) things.
K-NN is an algorithm that can be used when you have a bunch of objects that have been classified or labeled in some way, and other similar objects that haven't gotten classified or labeled yet, and you want a way to automatically label them.
Conditional Random Fields
A matrix showing the predicted and actual classifications. A confusion matrix is of size LxL, where L is the number of different label values. Rows for each of the actual values cross-tabbed against columns of the predicted values.
Optimization: Convex optimization
the analysis of data to determine a relationship between variables and whether that relationship is negative (- 1.00) or positive (+1.00).
1) Find items which can be compared; 2) Calculate the sum for each set of those items; 3) Calculate the sum of squares for each set of those items; 4) Calculate the sum of the products for each set of those items; 5) Use the values from 2-4 to calculate the coefficient
a measure of how well two sets of data fit on a straight line; Gives a value between -1 and 1 with 1 meaning the sets vary in an identical way; Tends to give better results than Euclidean distance scores when data isn't well normalized, i.e. it corrects for grade inflation
A measurement of the cost to the performance task of making a prediction Y' when the actual label is y. The use of accuracy to evaluate a model assumes uniform costs of errors and uniform benefits of correct classifications. Example: "Squared error function".
A method for estimating the accuracy of an inducer by dividing the data into k mutually exclusive subsets, or folds, of approximately equal size. The inducer is trained and tested k times. Each time it is trained on the data set minus a fold and tested on that fold. The accuracy estimate is the average accuracy for the k folds.
A tree of questions to guide an end user to a conclusion based on values from a single vector of data. The classic example is a medical diagnosis based on a set of symptoms for a particular patient. A common problem in data science is to automatically or semi-automatically generate decision trees based on large sets of data coupled to known conclusions. Example algorithms are CART and ID3. (Submitted by Michael Malak)
each node in the tree tests an attribute, decisions are represented by the edges leading away from that node with leaf nodes representing the final decision for all instances that reach that leaf node
Decision Trees: Dealing with missing values
1) have a specific edge for no value; 2) track the number of instances that follow each path and assign that instance to the most popular edge; 3) give instances weights and split the instance evenly down each possible edge. Once the value has reached the leaf nodes, recombine it using the weights
Decision Trees: Pruining
since decision trees are often created by continuously splitting the instances until there is no more information gain, it is possible that some splits were done with too little information gain and result in overfitting of the model to the training data. Basic idea is to check child nodes to see if combining them with their parent would result in an increase in entropy below some threshold
Decision Trees: Replicated subtree problem
because only a single attribute is tested at a node, it is possible that two copies of the same subtree will need to be placed in a tree if the attributes from that subtree were not tested in the root node
Decision Trees: Testing nominal values
if a node has edges for each possible value of that nominal value, then that value will not be tested again further down the tree. If a node groups the values into subsets with an edge per subset, then that attribute may be tested again
Decision Trees: Testing numeric values
can be tested for less than, greater than, equal to, within some range. Can result in just two edges or multiple edges. Can also have an edge that represents no value. May be tested multiple times in a single path
Refers to a class of methods that includes neural networks and deep belief nets. Useful for finding a hierarchy of the most significant features, characteristics, and explanatory variables in complex data sets. Particularly useful in unsupervised machine learning of large unlabeled datasets. The goal is to learn multiple layers of abstraction for some data. For example, to recognize images, we might want to first examine the pixels and recognize edges; then examine the edges and recognize contours; examine the contours to find shapes; the shapes to find objects; and so on.
Machine learning approach that combines the results from many different algorithms, whose combined vote (from the ensemble) provides a more robust and accurate predictive output than any single algorithm can muster.
used as a variable reduction technique to identify groups of clustered variables. (submitted by Vincent Granville)
different features will have different effects on the accuracy of classifications. A scaling factor can be used for each feature to reduce its effect on the classification, possibly to 0. These scales can be optimized to not only improve classifications, but to also show the relative importance of the features.
Mean normalization of data to speed up gradient descend
Gaussian Mixture Models
1: measures the expected error rate if one of the results from a set is randomly applied to one of the items in the set.
Boosting: Gradient Boosted Decision Trees
An iterative optimization algorithm for finding a local minimum. At each iteration, gradient descent operates by moving the current solution in the direction of the negative gradient of the function (the direction of "steepest descent"). Also known as steepest descent.
they use graph structures (a finite set of ordered pairs or certain entities), with edges, properties and nodes for data storage. It provides index-free adjacency, meaning that every element is directly linked to its neighbour element.
HDPs or other Bayesian non-parametric model
The second layer of a three-layer network where the input layer sends its signals, performs intermediary processing
Hidden Markov Models
A hyperplane in an n-dimensional Euclidean space is a flat, n-1 dimensional subset of that space that divides the space into two disconnected parts. First think of a line line. Now pick a point. That point divides the real line into two parts (the part above that point, and the part below that point). The real line has 1 dimension, while the point has 0 dimensions. So a point is a hyperplane of the real line. Now think of the two-dimensional plane. Now pick any line. That line divides the plane into two parts ("left" and "right" or maybe "above" and "below"). The plane has 2 dimensions, but the line has only one. So a line is a hyperplane of the 2d plane. Notice that if you pick a point, it doesn't divide the 2d plane into two parts. So one point is not enough. Now think of a 3d space. Now to divide the space into two parts, you need a plane. Your plane has two dimensions, your space has three. So a plane is the hyperplane for a 3d space.
replacing the dot-product function with a new function that returns what the dot product would have been if the data had first been transformed to a higher dimensional space. Usually done using the radial-basis function
gives a single number from two vectors by multiplying each value in the first vector by the corresponding value in the second vector and adding them all together
Classification: Linear Binary
Binary classification problems arise when we seek to separate two sets of data points in R^n, each corresponding to a given class. We seek to separate the two data sets using simple ''boundaries'', typically hyperplanes. Once the boundary is found, we can use it to predict the class a new point belongs to, by simply checking on which side of the boundary it falls.
Trying to fit a linear continuous function to the data. Univariate or Multivariate.
A kind of regression analysis often used when the dependent variable is dichotomous and scored 0 or 1. It is usually used for predicting whether something will happen or not, such as graduation, business failure, or heart attack-anything that can be expressed as event/non-event. Independent variables may be categorical or continuous in logistic regression analysis.
Analytics in which computers "learn" from data to produce models or rules that apply to those data and to other similar data. Predictive modeling techniques such as neural nets, classification and regression trees (decision trees), naive Bayes, k-nearest neighbor, and support vector machines are generally included. One characteristic of these techniques is that the form of the resulting model is flexible, and adapts to the data. Statistical modeling methods that have highly structured model forms, such as linear regression, logistic regression and discriminant analysis are generally not considered part of machine learning. Unsupervised learning methods such as association rules and clustering are also considered part of machine learning.
A computer program is said to learn from from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
Model Evaluation: Adjusted R^2 (R-Square)
The method preferred by statisticians for determining which variables to include in a model. It is a modified version of R^2 which penalizes each new variable on the basis of how many have already been admitted. Due to its construct, R^2 will always increase as you add new variables, which result in models that over-fit the data and have poor predictive ability. Adjusted R^2 results in more parsimonious models that admit new variables only if the improvement in fit is larger than the penalty, which improves the ultimate goal of out-of-sample prediction. (Submitted by Santiago Perez)
Model Evaluation: Decision tables
simplest way of expressing output from machine learning, cells in table represent the resulting decision based on the row and column which represent the conditions
Model Evaluation: Mis-classification error
Define Test error as summed error for classification (false prediction)
Model Evaluation: Negative class
Negative means not having the symptoms
Model Evaluation: Positive class
presence of something we are looking for - 1
Model Evaluation: Precision
Of all predicted positives, how many are positive?
Model Evaluation: Recall
Of all positives how many were predicted as positive?
Model Evaluation: True negative
Hypotheses correctly predicts negative output.
Model Evaluation: True positive
Hypotheses correctly predicts positive output.
Model Selection Algorithm
algorithm that automatically selects a good model function for a dataset.
Computing expectations and probabilities in models of random phenomena using many randomly sampled values. Akin to compute probability of winning a given roulette bet (say black) by repeatedly placing it and counting success ratio. Useful in complex models characterized by uncertainty. (submitted by Renato Vitolo)
Multiclass classification. Reducing a classification problem with multiple features that have to be predicted to a simple classification problem by looking at one feature at a time. Then to determin the final prediction we take the max of all the predicted values.
Natural Language Processing
a field of computer science involved with interactions between computers and human languages
viewing relationships among the nodes in terms of the network or graph theory, meaning analysing connections between nodes in a network and the strength of the ties.
The science of describing and, especially, visualizing the connections among objects. The objects might be human, biological or physical. Graphical representation is a crucial part of the process; Wayne Zachary's classic 1977 network diagram of a karate club reveals the centrality of two individuals, and presages the club's subsequent split into two clubs. The key elements are the nodes (circles, representing individuals) and edges or links (lines representing connections).
Interconnected neural cells. With experience, networks can learn, as feedback strengthens or inhibits connections that produce certain results. Computer simulations of neural networks show analogous learning.
In neuronal networks the process of calculating the subsequent layers of the network. Each layer depends on the calculations done on the layer before it.
Neural Networks: Random Initialization
Symmetry breaking for neural networks is achieved by:
Optimization: genetic algorithms
starts with a set of random solutions called the "population". Costs are calculated for all these solutions and the best solutions are saved (called "elitism"). The other solutions are discarded and are replaced using two methods. The first is "mutation" where simple random changes are made to the elite solutions. The second is "crossover" or "breeding" where two elite solutions are combined, such as by replacing random values in one with values in the other. This is repeated either for a set number of iterations or until several iterations have passed with no improvement.
Optimization: hill climbing
starts with a random solution and looks at the set of neighboring solutions for those that are better, this is repeated continuously until no neighboring solution has a lower cost. Major drawback is that it will get stuck in a local minimum
Optimization: Information gain
difference between the current entropy and the weighted-average entropy of the two new groups. Used when deciding which attribute to use when splitting instances in a decision tree
Optimization: random-restart hill climbing
running a hill climbing algorithm several times using new random initial input to attempt to reach the global minimum
Optimization: simulated annealing
similar to hill climbing, but adds a new variable "temperature" which starts high and gradually cools. With each iteration one variable is randomly changed in a certain direction. The new cost is compared to the old cost and is accepted if the new cost is lower. If the new cost is higher, it may be accepted based on a probability that is computed from the current "temperature". As the temperature cools, it gets less and less likely that the worse solution would be accepted.
Optimization: when do optimization algorithms fail?
when there is no clear slope in the change in cost function as the values are varied. This is because these algorithms all attempt to follow that slope toward a global minimum.
When the model describes an error or noise instead of the actual model. A problem that occurs when data-mining is used to create overly complex models that are not suited to making accurate predictions. "High variance". If we have too many features the learned hypothesis may fit the training set very well, but fails to generalzie to new examples.
Partial differential equations
A decision tree classifier that produces a "forest of trees", yielding highly accurate models, essentially by iteratively randomizing one input variable at a time in order to learn if this randomization process actually produces a less accurate classifier. If it doesn't, then that variable is ousted from the model.
Recommendation Systems: Content-based filtering
Gathers information (e.g., demographics, genre, keywords, preferences, survey responses) to generate a profile for each user or item. Users are matched to items based on their profiles. Example: Pandora's Music Genome Project.
Recommendation Systems: Collaborative filtering
Based on past user behavior. Each user's history of behaviors (ratings, purchases, or viewing history) is used to make associations between users with similar behavior and between items of interest to the same users. Example: Netflix. Methods: 1. Neighborhood-based methods, based on user-user or item-item distances; 2. Latent factor or reduced- dimension models, which automatically discover a small number of descriptive factors for users and items; 3. Low-rank matrix factorization is the best-known example of reduced-dimension models and is among the most flexible and successful methods underlying recommendation systems. There are many variants of matrix factorization, including probabilistic and Bayesian versions. Restricted Boltzmann machines, a type of deep learning neural network, are another state-of-the-art approach.
Recommendation Systems: Companies using them
Retailers: Amazon, Target; Movies + Music Sites: Netflix, last.fm, Pandora; Social networks: Facebook, Twitter; Grocery stores: Tesco; Content publishers: Ad networks: Yahoo!, Google; CRM: Next-best offer in marketing decision making
Recommendation Systems: Collaborative filtering: Matrix Factorization
....Probabilistic and Bayesian versions, Restricted Boltzmann machines, a type of deep learning neural network, are another state-of-the-art approach.
We are trying to predict results within a continuous output, meaning that we are trying to map input variables to some continuous function.
to define the dependency between variables. It assumes a one-way causal effect from one variable to the response of another variable.
a single regression equation is much smaller and less complex than a regression tree, but tends to also be much less accurate.
decision trees which predict numeric quantities. The leaf nodes of these trees have a numeric quantity instead of a class. This numeric quantity is often decided by taking the average of all training set values to which the leaf node applies
A way of applying Occam's Razor to real problems. Penalizing the higher-order parameters to avoid overfitting and getting a simpler hypothesis.
lambda - controlls the trade-off between fitting the data well & keeping the parameters small.
an S-shaped mathamatical curve is often used to describe the activation function of a neuron over time
Variable selection process for multivariate regression. In forward stepwise selection, a seed variable is selected and each additional variable is inputed into the model, but only kept if it significantly improves goodness of fit (as measured by increases in R^2). Backwards selection starts with all variables, and removes them one by one until removing an additional one decreases R^2 by a non-trivial amount. Two deficiencies of this method are that the seed chosen disproportionately impacts which variables are kept, and that the decision is made using R^2, not Adjusted R^2. (submitted by Santiago Perez)
We are given a data set and already know what our correct output should look like, having the idea that there is a relationship between the input and the output. Categorized into "regression" and "classification" problems.
use training data to first build a model which is later used to classify instances
Support Vector Machine
can extrapolate information from one dimensional data (input space) and some information about weights & correlative relationships to another dimension (feature space)
Support Vector Machine
divide an instance space by finding the line that is as far as possible from both classes. This line is called the "maximum-margin hyperplane"
Support Vector Machine
when determining the maximum-margin hyperplane for a support vector machine, only the points near the hyperplane are important. These points near the boundary are called the support vectors
Support Vector Machine
Powerful Jedi machine learning classifier. Among classification algorithms used in supervised machine learning, SVM usually produces the most accurate classifications. Read more about SVM in this article "The Importance of Location in Real Estate, Weather, and Machine Learning."
Support Vector Machine: Kernels
since support vector machines use dot-products (just like linear classifiers) when determining the hyperplane, they can be turned into a non-linear classifier by replacing the dot-product with a kernel such as the radial-basis function
Support Vector Machine: libsvm
open source library for SVMs written in C++ (w/a Java version as well). Trains an SVM model, makes predictions, and tests predictions w/in a dataset with support for kernel methods such as the radial-basis function
A broader term that includes the preparation of text for mining, the mining itself, and specialized applications such as sentiment analysis. Preparing text for analysis involves automated parsing and interpretation (natural language processing), then quantification (e.g. identifying the presence or absence of key terms). Wikipedia
Time Series Analysis
A set of (t, x) values where x is usually a scalar (though could be a vector) and the t values are usually sampled at regular intervals (though some time series are irregularly sampled). In the case of regularly sampled time series, the t is usually dropped from the actual data, replaced with just a t0 (start time) and delta-t that apply to the whole series. (Submitted by Michael Malak)
Coefficients are generally biased (as well as inconsistent) "high bias" fitting a line to a curve
Allows us to approach problems with little or no idea what our results should look like. We can derive structure from data where we don't necessarily know the effect of the variables.
don't use training data and only find structure within the data, such as in clustering where the instances are placed into groups. Other examples include non-negative matrix factorization and self-organizing maps
Free data mining package, for data exploration, profiling, mining, and visual analytics, containing hundreds of machine learning algorithms, techniques, and methods.
A collection of random variables, representing the evolution of some system of random values over time. This is the probabilistic counterpart to a deterministic process (or deterministic system). Instead of describing a process which can only evolve in one way (as in the case, for example, of solutions of an ordinary differential equation), in a stochastic or random process there is some indeterminacy: even if the initial condition (or starting point) is known, there are several (often infinitely many) directions in which the process may evolve. Examples: Markov chains, stock market fluctuations, EKG, brownian motion, random walks
A mathematical formalization of a path that consists of a succession of random steps. For example, the path traced by a molecule as it travels in a liquid or a gas, the search path of a foraging animal, the price of a fluctuating stock and the financial status of a gambler can all be modeled as random walks, although they may not be truly random in reality. The term random walk was first introduced by Karl Pearson in 1905.
Markov models are a kind of probabilistic model often used in language modeling. The observations are assumed to follow a Markov chain, where each observation is independent of all past observations given the previous one. In a Markov chain, a system transitions stochastically from one state to another. It is a memoryless process, in the sense that the distribution over the next state depends only on the current state, and not on the state at any past time. Markov chains are useful models of many natural processes and the basis of powerful techniques in probabilistic inference and randomized algorithms. A famous Markov chain is the so-called "drunkard's walk", a random walk on the number line where, at each step, the position may change by +1 or ?1 with equal probability. From any position there are two possible transitions, to the next or previous integer. The transition probabilities depend only on the current position, not on the manner in which the position was reached. For example, the transition probabilities from 5 to 4 and 5 to 6 are both 0.5, and all other transition probabilities from 5 are 0. These probabilities are independent of whether the system was previously in 4 or 6.
Stochastic Gradient Descent (SGD)
Stochastic gradient descent (SGD) is an iterative optimization algorithm that can be applied to functions that are a linear combination of differentiable functions. These types of functions often arise when the full objective function is a linear combination of objective functions at each data point, e.g. a least squares objective function. While batch gradient descent uses the full gradient of the function, SGD approximates the full gradient by using the gradient at each of the functions in the linear combination, e.g. the gradient of the objective function at each data point. SGD is often used to optimize non-convex functions, e.g. those that arise in neural networks.
In calculus (a branch of mathematics), a differentiable function of one real variable is a function whose derivative exists at each point in its domain. As a result, the graph of a differentiable function must have a (non-vertical) tangent line at each point in its domain, be relatively smooth, and cannot contain any breaks, bends, or cusps.
aka Recall or True positive rate. Percent of actual Positives that were correctly predicted.
Percent of predicted Positives that were correct.
aka True negative rate. Percent of actual Negatives that were correctly predicted.
Percent of all predictions that were correct.
Models the measurements in each class. It is more work, but it can exploit more prior knowledge, needs less data, is more modular, and can handle missing or corrupted data. Methods include mixture models and Hidden Markov Models.
Focuses only on discriminating one class from another. It can be more efficient once trained and requires fewer modeling assumptions. Methods include logistic regression, generalized linear classifiers, and nearest-neighbor.
Discriminative vs. Generative
Discriminative models learn to discriminate between different inputs. For example, classifying images as containing a dog or not containing a dog is a discriminative task. An example of a discriminative model is a support-vector machine. Generative models usually involve probabilities and their distinguishing feature is that you can generate new data from them. For example, if you tried to estimate the probability distribution over images containing dogs and a different distribution over images not containing dogs, you would have generatively modeled the situation. You could use these distributions to sample new images of dogs (or new images not containing dogs). If you wanted to use this generative model for a discriminative task, you could: given an image, you could see which of the two distributions assigns higher probability to that image and then choose that as your result. Thus, there is a distinction here between discriminative model and discriminative task: it may be possible to use generative models for discriminative tasks.
Frequentist vs. Bayesian
The Bayesian view is essentially that everything should be done with Bayes' rule: computing posterior probabilities by multiplying priors with likelihoods. In a Bayesian approach, you usually have a posterior distribution over models. Then, if you want to use this model for something, like making a prediction, you integrate over your posterior distribution of models to get a sort of "expected value" of the thing you are trying to predict. Frequentist is often used to mean not Bayesian. In a frequentist approach, you typically find a "best" solution (i.e., model) to the problem you are trying to solve. You then use this best model to make the prediction. I believe there is a relationship between the frequentist approach and discriminative models, and likewise for the Bayesian approach and generative models.
Probabilistic Graphical Model (a.k.a. Graphical Model)
Ways of encoding the structure (independencies) of a probability distribution into a picture. The two main types of graphical models are directed graphical models and undirected graphical models, probability distributions represented by directed and undirected graphs respectively. Each node in the graph represents a random variable, and a connection between two nodes indicates a possible dependence between the random variables. So, for example, a fully disconnected graph would represent a fully independent set of random variables, meaning the distribution could be fully factored as P(x,y,z,...)=P(x)P(y)P(z)... Note that the graphs represent structures, not probabilities themselves.
Adding non-linear features to the representation of our data can make linear models much more powerful. However, often we don't know which features to add, and adding many features (like all possible interactions in a 100 dimensional feature space) might make computation very expensive. The Kernel Trick is a clever mathematical trick that allows us to learn a classifier in a higher dimensional space without actually computing the new, possibly very large representation. It works by directly computing the distance (more precisely, the scalar products) of the data points for the expanded feature representation, without ever actually computing the expansion. There are two ways to map your data into a higher dimensional space that are commonly used with support vector machines: the polynomial kernel, which computes all possible polynomials up to a certain degree of the original features (like feature1 ** 2 * feature2 ** 5), and the radial basis function (rbf) kernel, also known as Gaussian kernel. The Gaussian kernel is a bit harder to explain, as it corresponds to an infinite dimensional feature space. One way to explain the Gaussian kernel is that it considers all possible polynomials of all degrees, but the importance of the features decreases for higher degrees.
Cross-validation based on k-folds splits your data into a number k of folds (portions of your data) of equal size. Then, each fold is held out in turn as a test set and the others are used for training. Each iteration uses a different fold as a test, which produces an error estimate. The process continues until all the k-folds are used once as a test set and you have a k number of error estimates that you can compute into a mean error estimate (the cross-validation score)