definitions and terms Flashcards

1
Q

data science

A

a set of fundamental principles that guide the extraction of knowledge from data, the main aim of which in the business community is to improve decision making; a combination of analytical engineering and exploration; data science is a broader term than “data mining”

2
Q

data mining

A

data mining focuses on the automated search for knowledge, patterns, or regularities from data; aka KDD: Knowledge Discovery and Data Mining; relative to formal statistical techniques, data mining might be considered partly as hypothesis generation (vs testing)–ie can we find patterns in the data in the first place (after which ordinary statistical hypothesis testing might be applied)?

3
Q

machine learning

A

machine learning is, for one, the collection of methods for extracting (predictive) models from data; ML is concerned with the analysis of data to find useful or informative patterns; the methods were developed in several fields, notably machine learning itself, applied statistics, and pattern recognition

vs. data mining, ML is more general in application, eg to robotics or computer vision, and may put more emphasis on theory (than real-world application), while data mining is more narrowly concerned with practical, commercial and business applications

4
Q

model (for machine learning)

A

a model is an abstraction that can perform a prediction, (re-)action or transformation on an instance of input values; a simplified representation of reality created to serve a purpose; it is usually simplified to fit a specific purpose, or because of constraints on information or tractability

5
Q

predictive vs descriptive models

A

predictive model: a formula for estimating the unknown value of interest: the target; the formula is often mathematical or a logical statement (such as a rule), and usually (the formula) is a hybrid of the two

descriptive model: a model whose primary purpose is to gain insight into the underlying phenomenon or process (eg a descriptive model of customer churn behavior would reveal what feature attributes customers who churn (leave) typically have); in some sense, we have all the data, and are trying to understand it

6
Q

induction

A

the process of generalizing from specific cases to general rules, laws, or truths

the creation of models from data is termed model induction; the tool or procedure that creates the model from the data is aka the induction algorithm, or learner; eg a linear regression procedure will induce a ~parametrized-surface model to fit the data

7
Q

instance

A

an instance (aka an example) represents a single data point, described by a set of attributes or predictors; conventionally it corresponds to a row in a database or spreadsheet; an instance is sometimes called a feature vector, and in statistics may be called a “case”

8
Q

loss function

A

a loss function determines how much penalty should be assigned to an instance based on the error in the model’s predicted value (some form of aggregate penalty may be used for training the model, or evaluating the model after it’s trained)

some types:

  • zero-one function: penalty=0 for correct decision; penalty=1 for incorrect decision; often used for classification problems
  • hinge function / hinge loss: in context of how “wrong-sided” from a desired separation boundary an instance is in the attribute phase space (the loss graph looks like a hinge); often used for classification problems, the penalty increases the more on the wrong side of the dividing line an instance is
  • squared function: squared error is the square of the distance from the desired value; often used in regression contexts
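
the three loss types above can be sketched in a few lines; this is a minimal illustration, assuming labels in {-1, +1} and a signed model score for the hinge loss (the function names are made up for this card):

```python
def zero_one_loss(y_true, y_pred):
    """Penalty 0 for a correct decision, 1 for an incorrect one."""
    return 0 if y_true == y_pred else 1


def hinge_loss(y_true, score):
    """Hinge loss for a label in {-1, +1} and a signed score f(x).

    Zero once the instance is on the correct side with margin >= 1;
    grows linearly the further it sits on the wrong side.
    """
    return max(0.0, 1.0 - y_true * score)


def squared_loss(y_true, y_pred):
    """Square of the distance between the desired and predicted values."""
    return (y_true - y_pred) ** 2


print(zero_one_loss("churn", "stay"))  # 1
print(hinge_loss(+1, -0.5))            # 1.5
print(squared_loss(3.0, 2.5))          # 0.25
```

an aggregate penalty (eg the mean of these per-instance losses over a dataset) is what would typically be minimized in training or reported in evaluation.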
9
Q

generalization

A

the property of a model or modeling process, whereby the model applies to data that were not used to build the model

10
Q

overfitting

A

finding chance occurrences in the dataset that look like interesting patterns but do not generalize is called overfitting the data

the underlying reasons for overfitting when building models from data are essentially problems of multiple comparison

11
Q

mutual information

A

the amount of information one can obtain about one random variable by observing another; a measure of dependence or “mutual dependence” between two random variables

I(X;Y) = H(X)-H(X|Y) = H(Y)-H(Y|X)
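
a small sketch of I(X;Y) = H(X) - H(X|Y), assuming the joint distribution is given as a table with joint[i][j] = P(X=i, Y=j) and entropy is measured in bits (both are choices made for this example):

```python
import math


def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def mutual_information(joint):
    """I(X;Y) = H(X) - H(X|Y) for a joint table joint[i][j] = P(X=i, Y=j)."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    # H(X|Y) = sum_j P(Y=j) * H(X | Y=j)
    h_x_given_y = 0.0
    for j, p_yj in enumerate(p_y):
        if p_yj > 0:
            h_x_given_y += p_yj * entropy([row[j] / p_yj for row in joint])
    return entropy(p_x) - h_x_given_y


independent = [[0.25, 0.25], [0.25, 0.25]]  # X tells us nothing about Y
dependent = [[0.5, 0.0], [0.0, 0.5]]        # X determines Y exactly
print(mutual_information(independent))
print(mutual_information(dependent))
```

the independent table gives zero bits and the fully dependent one gives one bit, matching the reading of mutual information as a measure of dependence.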

12
Q

classifiers and “positive” vs “negative” examples

A

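for a classifier, “positive” and “negative” are labels of convention rather than value judgments: the “positive” example is usually the one worthy of attention or alarm, which is often a bad outcome (eg fraud, churn, disease), while the “negative” example is the uninteresting or benign (often good) outcome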

13
Q

data leakage

A

in supervised learning, data leakage occurs when the training data instances inadvertently contain information about their own target (eg putting the value of the target variable “in” the attribute vector accidentally)–the target / label value leaks into the attribute / feature vectors we’re training on

14
Q

base rate

A
  • for classification problems, the base rate classifier (usually) always predicts the majority class in the dataset; the base rate is then the number of times that class appears in the dataset, divided by dataset size
  • for regression problems, the baseline is simply the mean or median value of the numeric target variable–a simple model that always predicts this average value exhibits base rate performance
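
a minimal sketch of both baselines, using made-up labels and target values (the names here are illustrative only):

```python
from collections import Counter
from statistics import mean


def majority_class_base_rate(labels):
    """Return the majority class and its share of the dataset."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


def mean_baseline(targets):
    """The constant prediction a mean-based regression baseline would make."""
    return mean(targets)


labels = ["stay"] * 9 + ["churn"]        # hypothetical, imbalanced labels
print(majority_class_base_rate(labels))  # ('stay', 0.9)
print(mean_baseline([1.0, 2.0, 3.0]))    # 2.0
```

any candidate model should beat these numbers before it is considered useful.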
15
Q

validation set

A

an intermediate holdout set used to optimize different classes of models or, say, over a region in parameter space; after finalizing the model an outer holdout or final test set may be used for performance metrics

16
Q

sequential forward selection (SFS) / sequential backward selection (SBS)

A

SFS is a method for choosing relevant features for model building, using an iterative process and holdout nesting / cross-validation, starting with a single feature, optimizing, then selecting for another feature to pair with it, and so on

SBS goes “backwards” from some oversize set of features, paring them down
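
the greedy loop behind SFS might look roughly like the sketch below; the `score` callback stands in for whatever holdout or cross-validated evaluation is used, and `toy_score` with its feature names is entirely hypothetical:

```python
def sequential_forward_selection(features, score, max_features=None):
    """Greedily add the feature that most improves score(subset).

    `score` is assumed to return a higher-is-better number, eg a
    cross-validated accuracy for a model built on that feature subset.
    """
    selected, remaining = [], list(features)
    best_score = float("-inf")
    limit = len(remaining) if max_features is None else max_features
    while remaining and len(selected) < limit:
        candidate, candidate_score = max(
            ((f, score(selected + [f])) for f in remaining),
            key=lambda pair: pair[1],
        )
        if candidate_score <= best_score:
            break  # no remaining feature improves the score
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected, best_score


def toy_score(subset):
    """Hypothetical stand-in for a validation score."""
    gains = {"age": 0.3, "usage": 0.5}  # useful features; others cost 0.05
    return sum(gains.get(f, -0.05) for f in subset)


selected, best = sequential_forward_selection(
    ["age", "zip", "usage", "id"], toy_score
)
print(selected)  # ['usage', 'age']
```

SBS would run the same loop in reverse, starting from the full set and removing the feature whose removal hurts the score least.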

17
Q

regularization (aka shrinkage)

A

when fitting a function-based model to the data, we include not just the accuracy of the fit, but also the simplicity of the model; we’re fitting on both accuracy and simplicity
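
one way to picture “fitting on both accuracy and simplicity” is an objective that adds a complexity penalty to the fit error; the sketch below assumes a linear model with an L2 (ridge-style) penalty, which is just one common choice and not something this card specifies:

```python
def regularized_objective(weights, rows, targets, lam):
    """Sum of squared errors plus lam * (sum of squared weights)."""
    fit_error = sum(
        (y - sum(w * x for w, x in zip(weights, row))) ** 2
        for row, y in zip(rows, targets)
    )
    complexity_penalty = lam * sum(w * w for w in weights)
    return fit_error + complexity_penalty


rows = [[1.0], [2.0]]
targets = [1.0, 2.0]
print(regularized_objective([1.0], rows, targets, lam=0.0))  # 0.0
print(regularized_objective([1.0], rows, targets, lam=0.5))  # 0.5
```

the weight lam (itself a hyperparameter) controls how strongly simplicity is traded against accuracy.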

18
Q

the curse of dimensionality

A

there are so many feature attributes, ie the phase space is of such high dimension, that the data become sparse in it and modeling gets difficult

eg for nearest neighbor methods, there may be so many irrelevant dimensions that the distance metric fails on all these extraneous distance components

19
Q

Jaccard distance

A

used eg for nearest neighbor models–treat each instance object as a set of elements (eg whiskey attributes, like “light yellow” and “salty” and “peaty”, etc)

the distance treats two instance objects as sets of characteristics, with binary “in or out” tags for each attribute (as w/ whiskey case study); the distance is one minus the cardinality of the set intersection (logical AND) of the two instance sets, divided by the cardinality of the set union (logical OR) of the two instance sets (it’s close to 1 if the sets have little in common, and goes to 0 if the sets are identical)
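
a minimal sketch of the distance, with made-up whiskey attribute sets; treating two empty sets as identical is an assumption made here to avoid dividing by zero:

```python
def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B| for two collections of attributes."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets count as identical
    return 1.0 - len(a & b) / len(union)


print(jaccard_distance({"peaty", "salty"}, {"peaty", "salty"}))  # 0.0
print(jaccard_distance({"peaty"}, {"sweet"}))                    # 1.0
print(jaccard_distance({"peaty", "salty"}, {"peaty", "sweet"}))
```

the last pair shares one attribute out of three in the union, so the distance is 1 - 1/3.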

20
Q

cosine distance

A

used eg in text classification contexts, to measure the similarity of two documents

for feature vectors x and y, it’s: 1 - (x·y) / (||x||_2 ||y||_2); ie 1 minus the dot product of the feature vectors divided by the product of their L2 norms (so really 1 minus the cosine of the angle between the vectors)
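
a short sketch of the formula for plain Python lists; the vectors are hypothetical term-count vectors, and a zero vector is not handled (it would raise ZeroDivisionError):

```python
import math


def cosine_distance(x, y):
    """1 minus the cosine of the angle between vectors x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


print(cosine_distance([1, 0], [0, 1]))  # 1.0 (orthogonal: no shared terms)
print(cosine_distance([1, 2], [2, 4]))  # approximately 0 (same direction)
```

because only the angle matters, a document and a copy of it repeated twice end up at distance (approximately) zero, which is one reason the measure suits text.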

21
Q

class priors

A

in classification problems the class prior is an estimate of the probability that randomly sampling an instance from a population will yield the given class (regardless of any attributes of the instance); this has overtones of Bayes, where the “prior probability distribution” is the distribution under consideration “prior” to the reception of “new information” (such as that gleaned by a predictive model)

22
Q

precision and recall

A

recall: true positives divided by (true positives + false negatives) (same as true positive rate); this is aka sensitivity

precision: true positives divided by (true positives + false positives); this is aka positive predictive value; note it is not the same as specificity, which is true negatives divided by (true negatives + false positives)
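
a small sketch using the standard confusion-matrix definitions (recall = TP/(TP+FN), precision = TP/(TP+FP), specificity = TN/(TN+FP)); the counts are invented for illustration:

```python
def recall(tp, fn):
    """True positive rate, aka sensitivity."""
    return tp / (tp + fn)


def precision(tp, fp):
    """Positive predictive value."""
    return tp / (tp + fp)


def specificity(tn, fp):
    """True negative rate; a different quantity from precision."""
    return tn / (tn + fp)


tp, fp, fn, tn = 30, 10, 20, 940  # hypothetical counts
print(recall(tp, fn))       # 0.6
print(precision(tp, fp))    # 0.75
print(specificity(tn, fp))
```

with many true negatives, specificity sits close to 1 even though precision is only 0.75, which shows the two are not interchangeable.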

23
Q

F-measure, aka F1 score

A

harmonic mean of precision and recall: 2*(precision*recall)/(precision + recall)

features:

  • if both recall (sensitivity) and precision are “perfect” at 1.0, then the F-measure is 1.0 as well
  • if both recall (sensitivity) and precision are “worst” at epsilon, then the F-measure is 2*eps^2/(2*eps) = eps
  • the F-measure may be useful for imbalanced data sets with very few positive cases, where there is the risk of a classifier “hedging” by just predicting all cases to be negative; the F1 score may be used to score the model(s) for training
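
a minimal sketch of the score and of the “hedging” case; returning 0 when precision and recall are both zero is a convention assumed here, and the counts are hypothetical:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(f1_score(1.0, 1.0))  # 1.0

# a classifier that predicts "negative" for all 1000 cases, 10 of which
# are actually positive: accuracy looks great, F1 does not
tp, fp, fn, tn = 0, 0, 10, 990
accuracy = (tp + tn) / (tp + fp + fn + tn)
hedging_recall = tp / (tp + fn)
hedging_precision = 0.0  # no positive predictions; taken as 0 by convention
print(accuracy)                                      # 0.99
print(f1_score(hedging_precision, hedging_recall))   # 0.0
```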
24
Q

ranking classifier

A

a ranking classifier does more than just predict a class for the test instance; a ranking classifier returns, with the predicted class membership, a “certainty” level (eg a score in [0,1]; note this may be different from a true probability estimate of class inclusion); the instances may then be ranked in order of predicted certainty level

features:
* ranking classifiers allow setting better cutoffs re accuracy, and allow more detailed performance analysis, through eg confusion matrices and profit curves, ROC curves, and lift curves
* note, a confusion matrix for a ranking classifier with binary categories generally means we’ve assumed the “top” n instances are “positive,” whence, for the 2x2 classification table the top row has n entries, and the bottom has the rest of the entries in the test set
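
the “top n are positive” construction can be sketched as follows, assuming true labels coded as 1 (positive) and 0 (negative) and made-up scores:

```python
def confusion_at_top_n(scores, labels, n):
    """Treat the n highest-scored instances as predicted positive.

    Returns (tp, fp, fn, tn) for true labels coded 1 / 0.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top, rest = ranked[:n], ranked[n:]
    tp = sum(label for _, label in top)
    fp = len(top) - tp
    fn = sum(label for _, label in rest)
    tn = len(rest) - fn
    return tp, fp, fn, tn


scores = [0.9, 0.8, 0.4, 0.3, 0.1]  # hypothetical classifier scores
labels = [1, 0, 1, 0, 0]            # hypothetical true classes
print(confusion_at_top_n(scores, labels, 2))  # (1, 1, 1, 2)
print(confusion_at_top_n(scores, labels, 3))  # (2, 1, 0, 2)
```

sweeping n from 0 to the size of the test set yields the sequence of confusion matrices from which ROC, lift, and profit curves are drawn.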

25
Q

lift (association rules)

A

lift(A,B) = p(A,B) / (p(A)p(B)) (where p(A,B) is the probability of A and B occurring together)

eg this can come up in context of co-occurrences–here the probability that items A and B are bought together, relative to the product of probabilities they’re bought at all (ie the probability of both if they were independent of each other)–so this is a measure of how beyond-independent the co-occurrence is

26
Q

leverage (association rules)

A

leverage(A,B) = p(A,B) - p(A)p(B) (p(A,B) is probability of A and B occurring together)
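
lift (the ratio p(A,B) / (p(A)p(B))) and leverage (the difference) compare the same two quantities; a quick sketch with invented probabilities:

```python
def lift(p_ab, p_a, p_b):
    """Ratio of the observed co-occurrence to what independence would give."""
    return p_ab / (p_a * p_b)


def leverage(p_ab, p_a, p_b):
    """Difference between observed co-occurrence and the independent case."""
    return p_ab - p_a * p_b


p_a, p_b, p_ab = 0.3, 0.4, 0.2  # hypothetical purchase probabilities
print(lift(p_ab, p_a, p_b))      # about 1.67: more often than by chance
print(leverage(p_ab, p_a, p_b))  # about 0.08
```

independent items would give a lift of 1 and a leverage of 0.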

27
Q

association terms (re co-occurrence relations)

A

the “support” of the association is the prevalence of both elements occurring together (P(A,B))

the “strength” of the association is the conditional probability, eg: p(lottery tickets|beer) = 67% (from P(A)P(B|A) = P(A,B) where P(beer)=0.3 and P(beer,tickets)=0.2)
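
the beer / lottery ticket numbers can be reproduced from raw transactions; the ten baskets below are invented so that P(beer)=0.3 and P(beer,tickets)=0.2:

```python
baskets = [  # hypothetical shopping baskets
    {"beer", "tickets"}, {"beer", "tickets"}, {"beer"},
    {"tickets"}, {"milk"}, {"bread"}, {"milk", "bread"},
    {"eggs"}, {"tickets", "milk"}, {"bread", "eggs"},
]


def prob(items, baskets):
    """Fraction of baskets containing every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)


support = prob({"beer", "tickets"}, baskets)   # P(beer, tickets)
strength = support / prob({"beer"}, baskets)   # P(tickets | beer)
print(support)              # 0.2
print(round(strength, 2))   # 0.67
```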

28
Q

hyperparameter

A

in machine learning (eg) a hyperparameter is a parameter that is set from outside the actual training runs, such as the learning rate (for eg gradient descent in a neural network), which specifies details of the learning process; this is contrasted with parameters that determine the model itself (such as weights in neural network)

29
Q

discriminative vs generative methods

A

discriminative:

  • find the best way to distinguish target values in the attribute phase space; ie find the best way to discriminate targets; learn the (hard or soft) boundary between classes
  • based on P(y|x) (something like, what is the probability of class value y, given instance x)
  • to predict the label y for a given instance x, evaluate f(x) = argmax_y P(y|x) (ie what is the most probable class y conditioned on the given instance x; eg for a decision tree, such a process is ~literally followed–we bin training instances to a given region, amounting to a leaf, then we compute maximum probability based on the leaf target label distribution); this is simply modeling the boundary between classes

generative:

  • model how different target segments produce feature values; ie we’re modeling how the data was generated; when given a test instance, ask, which class most likely generated this example?; model the distribution of individual classes
  • based on P(x,y) (something like, what is the probability of instance x and class y)
  • to predict label y for a given instance x, evaluate f(x) = argmax_y P(x|y)P(y) (used for eg Naive Bayes); note that P(x|y)P(y) = P(x,y)–ie this is explicitly modeling the distribution of each class (over the attribute phase space); each instance gets its own little mini-distribution, over all possible class outcomes
30
Q

analytic engineering

A

the methods of bringing existing data mining and data engineering techniques to bear on a business problem; eg a cell phone customer churn problem can be analyzed in a data mining / data engineering context to formulate appropriate models, appropriate baselines, appropriate data acquisition (possibly funding for more data, if it’s deemed important), etc.

31
Q

ensemble model

A

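an ensemble model is a combination of many different models whose predictions are combined (eg by voting or averaging); eg a combination of many different recommendation models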

32
Q

kriging

A

a method used sometimes in spatial analysis, working kind of like a spline fit to interpolate/extrapolate between points

this method has a Bayesian flavor (it is closely related to Gaussian process regression): it assumes a Gaussian prior (so by the prior the sample points are drawn from a jointly Gaussian distribution) and a Gaussian likelihood for the observed points, which, combined with the prior, yields the posterior distribution

33
Q

semi-supervised learning

A

machine learning with datasets that contain a mixture of labeled and unlabeled data (in the case of the labeled data only being of a certain category (eg in binary, “positive”), it is referred to as the subcase of “positive-unlabeled” learning)

eg “label propagation”–label propagation is based on the assumption that closer data points have similar class labels; so if we have a labeled instance, we can consider “propagating” its label through to “close by” (so presumably similar) instances; once this propagation is done, we may then apply (fully) supervised learning to the data

34
Q

A/B testing

A

basic technique from statistics, and comes up eg in context of causal modeling; a general term for a method of comparing the outcomes of two different choices; there are two treatments and one acts as the control for the other

also seen is the term “A/A” testing, which is just using the same choice on both the control and treatment group (ideally, I suppose, there should be no difference between the 2)

35
Q

softmax function

A

this basically converts any vector (over reals) into something like a pmf (non-negative components that sum to 1)

for each component v_i of vector v, take the quotient exp(v_i) / sum_j exp(v_j)

it’s called “softmax” because it tends to “amplify” the largest component(s) of v, via taking the exponential; eg for (1,1.1,3), the softmax is (0.11,0.12,0.78)
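
a minimal sketch that reproduces the example above; subtracting the maximum before exponentiating is a standard numerical-stability trick added here and does not change the result:

```python
import math


def softmax(v):
    """exp(v_i) / sum_j exp(v_j), computed stably by shifting by max(v)."""
    shift = max(v)
    exps = [math.exp(x - shift) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1, 1.1, 3])
print([round(p, 2) for p in probs])  # [0.11, 0.12, 0.78]
```

the outputs are non-negative and sum to 1, so they can be read as a distribution over the components.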

36
Q

softmax regression

A

this is similar in spirit to logistic regression

instead of logistic regression’s ~scheme of giving a Bernoulli pmf to a binary outcome, softmax regression can produce a categorical (generalized Bernoulli) pmf “predictor”–ie it can deal with more than 2 categories, producing a pmf over the various categories (similar to logistic regression, it can have a linear function of the features as the argument of the exponentials)

37
Q

feature engineering

A

feature engineering is the process of selecting, manipulating, and transforming raw data into features that can be used in supervised learning; feature engineering is a machine learning technique that leverages data to create new variables that aren’t in the training set

38
Q

classification model entropy and information gain

A

each segment has its target variable value proportions treated as probabilities

each segment gets an entropy score

a kind of total entropy is calculated, based on adding all segment entropies, weighted by the segments’ sizes relative to total dataset size

information gain = H(S) - H(S|x) = h(parent) - ( w_1 * h_1 + … + w_k * h_k), where h(parent) is the whole-set entropy before the classification model has been applied (note similarity to general mutual information formula re entropy)
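
the weighted-entropy calculation can be sketched as below, assuming entropy in bits and segments given as lists of target labels (the churn labels are made up):

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (bits) of the label proportions within one segment."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, segments):
    """h(parent) - (w_1 * h_1 + ... + w_k * h_k), with w_i = |segment_i| / |parent|."""
    n = len(parent)
    weighted = sum(len(s) / n * entropy(s) for s in segments)
    return entropy(parent) - weighted


parent = ["churn"] * 5 + ["stay"] * 5
perfect_split = [["churn"] * 5, ["stay"] * 5]
useless_split = [["churn", "stay"], ["churn"] * 4 + ["stay"] * 4]
print(information_gain(parent, perfect_split))  # 1.0
print(information_gain(parent, useless_split))  # 0.0
```

a split into pure segments recovers the full bit of parent entropy, while a split that leaves every segment at 50/50 gains nothing.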

39
Q

Gini impurity for classification models

A

Gini impurity measures how often a randomly chosen element of a set would be incorrectly labeled if it were labeled randomly and independently according to the distribution of labels in the set

this works out to be, 1-sum_i (p_i)^2, where the p_i are class relative frequencies in the set

for a decision tree, the terminal-leaf frequency weighted sum of all the terminal leaf Gini impurities is taken
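
a short sketch of 1 - sum_i (p_i)^2 and of the leaf-weighted sum, with leaves given as lists of labels (example labels are arbitrary):

```python
from collections import Counter


def gini_impurity(labels):
    """1 - sum_i p_i^2 over the class relative frequencies p_i."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def weighted_gini(leaves):
    """Sum of leaf impurities, each weighted by the leaf's share of instances."""
    total = sum(len(leaf) for leaf in leaves)
    return sum(len(leaf) / total * gini_impurity(leaf) for leaf in leaves)


print(gini_impurity(["a", "a", "b", "b"]))       # 0.5
print(gini_impurity(["a"] * 4))                  # 0.0
print(weighted_gini([["a", "a"], ["a", "b"]]))   # 0.25
```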

40
Q

variance reduction for regression models

A

used for eg CART algorithm

the variance before the model segmentation is computed

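the variance is computed after segmentation by combining the segment variances, typically weighted by segment size; each segment variance is the MSE within that segment–ie (1/n) sum_i (y_i-mu)^2, where n is the number of instances in the segment and mu is the mean of their target values; the reduction is the before-variance minus the after-variance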
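
a minimal sketch of the before/after comparison; weighting each segment's variance by its share of the instances is an assumption made here (a common convention), and the target values are invented:

```python
def variance(values):
    """Mean squared deviation from the segment mean."""
    mu = sum(values) / len(values)
    return sum((y - mu) ** 2 for y in values) / len(values)


def variance_reduction(parent, segments):
    """Parent variance minus the size-weighted sum of segment variances."""
    n = len(parent)
    after = sum(len(s) / n * variance(s) for s in segments)
    return variance(parent) - after


parent = [1.0, 1.0, 5.0, 5.0]
print(variance(parent))                                         # 4.0
print(variance_reduction(parent, [[1.0, 1.0], [5.0, 5.0]]))     # 4.0
print(variance_reduction(parent, [[1.0, 5.0], [1.0, 5.0]]))     # 0.0
```

the first split isolates the two target levels and removes all the variance; the second leaves each segment as mixed as the parent.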

41
Q

Laplace correction

A

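used eg in probability classification trees to smooth category probabilities in leaves with few samples

eg binary categories: p(c) = (n+1)/(n+m+2), where n is the number of examples in the leaf belonging to class c and m is the number not belonging to c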
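
a quick sketch of the binary formula, taking n as the count of examples of the class of interest in the leaf and m as the count of the other class (the leaf counts are made up):

```python
def laplace_corrected_probability(n, m):
    """Smoothed class probability (n + 1) / (n + m + 2) for a binary leaf."""
    return (n + 1) / (n + m + 2)


print(laplace_corrected_probability(2, 0))   # 0.75 (raw frequency would be 1.0)
print(laplace_corrected_probability(20, 0))  # closer to 1.0 with more evidence
print(laplace_corrected_probability(0, 0))   # 0.5 for an empty leaf
```

small leaves are pulled toward 0.5, and the correction fades as the leaf accumulates more examples.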