Final Flashcards

(77 cards)

1
Q

Entity

A

Object, instance, observation, element, example, line, row, feature vector

2
Q

Attribute

A

characteristic, (independent/dependent) variable, column, feature

3
Q

Unsupervised setting

A

to identify a pattern (descriptive)

4
Q

Supervised setting

A

to predict (predictive)

5
Q

Induction

A

Generalizing from a specific case to general rules

6
Q

Deduction

A

Applying general rules to create other specific facts

7
Q

Induction is developing …

A

Classification and regression models

8
Q

Deduction is using

A

Classification and regression models (apply induction)

9
Q

Supervised segmentation

A

How can we segment the population into groups that differ from each other with respect to some quantity of interest?

10
Q

Entropy

A

A mathematical measure of disorder, used to separate different data points

11
Q

Entropy is a method

A

That tells us how ordered a system is

12
Q

Information gain

A

The difference between the parent's entropy and the weighted sum of the children's entropies

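The entropy and information-gain cards above can be sketched in a few lines of Python (a minimal illustration with toy data; the function names `entropy` and `information_gain` are my own):

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(parent, children):
    """Parent entropy minus the weighted sum of the children's entropies."""
    n = len(parent)
    weighted = sum(len(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted

# A 50/50 parent split perfectly into two pure children gains 1 bit.
parent = ["yes"] * 4 + ["no"] * 4
children = [["yes"] * 4, ["no"] * 4]
print(information_gain(parent, children))  # 1.0
```

The weighting by child size matters: a split that sends only one instance into a pure child gains much less than one that cleanly separates both classes.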
13
Q

Laplace correction

A

Learn the underlying distribution that generated the data we are working with

14
Q

Support vector machine

A

Computes the line (hyperplane) that best separates the classes, maximizing the margin to the data points closest to the decision boundary

15
Q

Support Vector Machines (SVM) –>

A

Use when your data does not give you probabilities but only a ranking

16
Q

Overfitting

A

Tendency of methods to tailor models exactly to the training data, finding false patterns in chance occurrences

17
Q

Overfitting leads to

A

lack of generalization: model cannot predict on new cases (out-of-sample)

18
Q

Bias

A

difference between predicted and real data (when missing the real trends = underfitting)

19
Q

Variance

A

Variation caused by random noise (modeling the random noise --> overfitting)

20
Q

SVM sensitive to outliers?

A

No

21
Q

Logistic regression sensitive to outliers?

A

Yes

22
Q

Increase of complexity in classification trees

A

Number of nodes and small leaf size

23
Q

Increase of complexity in regressions

A

number of variables, complex functional forms

24
Q

Avoiding overfitting (ex ante)

A

Min size of leaves, max number of leaves, max length of paths, statistical tests

25
Q

Avoiding overfitting (ex post, based on holdout & cross-validation)

A

Pruning, sweet spot, ensemble methods (bagging, boosting, random forest)

26
Q

Ensemble methods

A

One model can never fully reduce overfitting --> use multiple models

27
Q

Avoid overfitting: logistic regression --> solution for a too complex relationship

A

Regularization --> Ridge regression (L2-norm penalty) & Lasso regression (L1-norm penalty)

28
Q

Distance (measures)

A

Manhattan, Euclidean, Jaccard, Cosine

29
Q

Clustering

A

Use methods to see if elements fall into natural groupings (hierarchical clustering, k-means clustering)

30
Q

Accuracy

A

Number of correct decisions made / total number of decisions made = (TP+TN)/(P+N)

31
Q

Problems with accuracy

A

Unbalanced classes & problems with unequal costs and benefits

32
Q

Classification

A

Model is used to classify instances into one category

33
Q

Ranking

A

Model is used to rank-order instances by the likelihood of belonging to a category

34
Q

Visualization: Profit curves

A

When you know the base rate, the classifiers, and the costs and benefits. Determines the best classifier to obtain maximum expected profit

35
Q

Visualization: ROC graphs

A

If we don't have the costs/benefits and the base rate, or our sample is balanced. Compare the classification performance of models and the rank-order performance of models. They plot the false positive rate and true positive rate for the different classifiers. The ROC curve shows the trade-off between sensitivity (or TPR) and specificity (1 - FPR).
36
Q

Visualization: Cumulative response curves

A

Are intuitive, demonstrate model performance

37
Q

Visualization: Lift curve

A

Shows the effectiveness of classifiers. Performance of rank-ordering classifiers compared to random

38
Q

Naive Bayes' rule different from Bayes' rule

A

Naive Bayes assumes that the probability of testing positive on one test, given someone has cancer, is independent of all other test results (for example)

39
Q

Bag of words approach

A

Treat every document as a collection of individual tokens. Pre-process the text --> term frequencies --> normalized frequencies --> determine outcome

40
Q

Advanced text analysis methods

A

N-gram sequences & named entity extraction & topic models

41
Q

Co-occurrence and association rules

A

Idea: to measure the tendency of events to (not) occur together. Co-occurrence: measures the relation between one X and one Y. Association rules: measure the relation between multiple X's and one Y

42
Q

Profiling

A

To find the typical behavior/features of an individual/group/entity. To predict future behavior, to detect abnormal behavior

43
Q

Link prediction

A

To predict connections between entities based upon dyadic similarities (similarities between pairs of entities) and existing links to others. Level of analysis: dyadic: pairs of firms or people (not single entities). Methods: various: regression, social network analysis, etc.

44
Q

Latent dimensions and data reduction

A

To replace a large dataset with a long list of variables by a smaller dataset while minimizing information loss. Statistical techniques allow us to reduce the original list of variables to fewer key dimensions or 'factors'

45
Q

Sustaining competitive advantage with data science

A

VRIO questions --> valuable, rare, imitability, organization

46
Q

Sustainability factors

A

Historical advantage, unique IP, unique complementary assets, superior data scientists, superior data science management

47
Q

Laplace correction --> formula to learn the distribution/nature of the data

A

P(c) = (n+1)/(n+m+2)
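The Laplace formula on the card above can be evaluated directly (a minimal sketch; here `n` counts instances of class c and `m` counts instances of the other class at a leaf, following the card's notation):

```python
def laplace_corrected(n, m):
    """Laplace-corrected probability estimate: P(c) = (n + 1) / (n + m + 2)."""
    return (n + 1) / (n + m + 2)

# A leaf with 2 positives and 0 negatives: the raw frequency says 100%,
# but the Laplace correction tempers this small sample to 75%.
print(laplace_corrected(2, 0))  # 0.75
```

Note how an empty leaf (n = m = 0) gives 0.5 rather than an undefined 0/0, which is the point of the correction.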
48
Q

Support Vector Machines (pros + cons)

A

Pros: simple and fast/flexible loss function/non-linear functions. Cons: relatively unknown/may not give solutions/may require a large sample size

49
Q

Logistic regression (pros + cons)

A

Pros: importance of individual factors/pretty well-known. Cons: time consuming, may have no solution, requires a minimum number of observations

50
Q

Avoid overfitting - holdout

A

Use part of the data set to train the model (training data set) --> use the remaining part of the data to test the predictive performance of the model (holdout data)

51
Q

Avoid overfitting - cross-validation

A

Use different parts of the data set as holdout data --> repeat the holdout method many times on different parts
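The holdout and cross-validation cards above can be sketched as a manual k-fold split (a toy sketch; `k_fold_indices` is my own helper, and the actual model training/scoring is left as comments):

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    fold_size, folds, start = n // k, [], 0
    for i in range(k):
        extra = 1 if i < n % k else 0  # spread the remainder over the first folds
        folds.append(list(range(start, start + fold_size + extra)))
        start += fold_size + extra
    return folds

data = list(range(10))  # stand-in for 10 observations
for holdout in k_fold_indices(len(data), 5):
    training = [i for i in data if i not in holdout]
    # train the model on `training`, score it on `holdout`,
    # then average the k scores to estimate out-of-sample performance
```

A single holdout split is just the k = 2 case used once; cross-validation repeats the idea so every observation serves as holdout exactly once.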
52
Q

Pruning

A

Grow a large tree with training data --> cut branches that do not improve accuracy based on holdout data (and replace them with a leaf)

53
Q

Sweet spot

A

Create many trees with increasing complexity (number of nodes) --> evaluate predictive performance. Find the optimal complexity based on performance on holdout data

54
Q

Bagging (ensemble method)

A

Repeatedly select a random subset of the observations in the dataset --> create a separate tree with each of these subsets. Combine the predictions from all trees

55
Q

Boosting (ensemble method)

A

Select a random subset of the observations in the dataset to create the first tree --> select another random subset plus the wrong predictions of the first model to create the second model, and so on. Combine the predictions from all trees

56
Q

Random forest (ensemble method)

A

Repeatedly select a random subset of the variables in the dataset, create separate trees for each of these subsets. Combine the predictions from all trees
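The bagging recipe from the ensemble cards above can be sketched as follows (a toy illustration: `train_stump` is a hypothetical one-split learner standing in for a real decision tree, and the data is made up):

```python
import random

rng = random.Random(0)
data = [(x, int(x > 5)) for x in range(11)]  # toy labeled data: label is 1 iff x > 5

def train_stump(sample):
    """Hypothetical learner: a one-split 'tree' at the mean x of the sample."""
    threshold = sum(x for x, _ in sample) / len(sample)
    return lambda x: int(x > threshold)

# Bagging: repeatedly draw bootstrap samples (random subsets with replacement),
# train one model per sample, then combine all predictions by majority vote.
models = []
for _ in range(25):
    sample = [rng.choice(data) for _ in data]
    models.append(train_stump(sample))

def bagged_predict(x):
    votes = [m(x) for m in models]
    return max(set(votes), key=votes.count)  # majority vote

print(bagged_predict(9), bagged_predict(1))  # 1 0
```

Boosting differs in that each new sample is biased toward the previous model's mistakes, and a random forest resamples the variables (columns) rather than only the observations.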
57
Q

Manhattan distance

A

Given by the sum of absolute differences

58
Q

Euclidean distance

A

Square root of the sum of squared differences

59
Q

Jaccard distance

A

Similarity equals the intersection divided by the union (ranges from 0 to 1). Distance = 1 - (the divisions they have in common / all divisions)

60
Q

Cosine distance

A

Measure of similarity based on the angle between frequency vectors (ranges from 0 to 1). 1 if the companies have no overlap, 0 if they have the same distribution.
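The four distance cards above can be written directly from their definitions (a small sketch; the function names are my own):

```python
import math

def manhattan(a, b):
    """Sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """Square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def jaccard_distance(a, b):
    """1 minus (intersection / union) for two sets."""
    return 1 - len(a & b) / len(a | b)

def cosine_distance(a, b):
    """1 minus the cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norms

print(manhattan([0, 0], [3, 4]))         # 7
print(euclidean([0, 0], [3, 4]))         # 5.0
print(jaccard_distance({1, 2}, {2, 3}))  # 1 - 1/3 ~ 0.667
print(cosine_distance([1, 0], [0, 1]))   # 1.0 (no overlap)
```

Manhattan and Euclidean act on numeric vectors, Jaccard on sets, and cosine on frequency vectors, which is why the cards describe them so differently.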
61
Q

Hierarchical clustering

A

Compute distances among all objects/clusters, group the closest objects/clusters together, repeat. Distance measures: Manhattan, Euclidean. Linkage function: distance between cluster centres, distance between nearest objects, etc.

62
Q

k-means clustering

A

Determine the number of clusters (k) and put k 'centroids' at random positions. Determine for each element the closest centroid. Move each centroid to its cluster mean. Repeat.
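The k-means loop above can be sketched from scratch (a minimal toy version on 2-D points; a production implementation would also check for convergence and re-seed empty clusters):

```python
import random

def k_means(points, k, iters=20, seed=0):
    """Minimal k-means: random centroids, assign to closest, recompute means."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k centroids at random positions
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # assign each element to its closest centroid
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            clusters[i].append(p)
        for i, cl in enumerate(clusters):  # move each centroid to its cluster mean
            if cl:
                centroids[i] = tuple(sum(c) / len(cl) for c in zip(*cl))
    return centroids

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(sorted(k_means(pts, 2)))  # [(0.0, 0.5), (10.0, 10.5)]
```

On this toy data the two centroids settle on the means of the two obvious groups regardless of which points are drawn as the initial centroids.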
63
Q

How to choose the optimal complexity

A

Nested holdout method / nested cross-validation

64
Q

Precision

A

If you want to minimize false positives: TP/(TP+FP)

65
Q

Recall

A

If you want to minimize false negatives: true positive rate = TPR = TP/(TP+FN) = same as sensitivity

66
Q

Specificity

A

TNR = TN/(FP+TN)

67
Q

F-measure

A

2 x (precision x recall)/(precision + recall)
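The metric cards above (plus accuracy from card 30) all follow from the four confusion-matrix counts; a small sketch with made-up counts:

```python
def metrics(tp, fp, tn, fn):
    """Confusion-matrix metrics computed from the four cell counts."""
    precision = tp / (tp + fp)      # minimize false positives
    recall = tp / (tp + fn)         # minimize false negatives (TPR, sensitivity)
    specificity = tn / (fp + tn)    # TNR
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return precision, recall, specificity, f_measure, accuracy

# Illustrative counts: 8 true positives, 2 false positives,
# 85 true negatives, 5 false negatives.
print(metrics(tp=8, fp=2, tn=85, fn=5))
```

With unbalanced classes (here 87 negatives vs 13 positives) accuracy looks strong at 0.93 while recall is only about 0.62, which is exactly the "problems with accuracy" card in action.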
68
Q

Maximizing the number correctly classified

A

Accuracy: keep the base rate in mind. F-measure: a more refined measure that also balances false positives and false negatives

69
Q

When optimizing the cost/benefit trade-off

A

Area under the curve (AUC): correctly predict 'true positives' (profits) and minimize 'false negatives' (losses). Do not use accuracy

70
Q

When resources are limited

A

Profit curve: find the maximum within resource (budget) constraints. Lift curve: similar, but might go beyond the maximum profit point

71
Q

When maximizing profits

A

Profit curve: the only way to see maximum profits and the fraction to target
72
Q

Conviction

A

How many more times x without y would occur by chance compared to how many times x but not y actually occurred

73
Q

Correlation

A

How likely are x and y to occur/not occur together

74
Q

Support

A

What is the probability of x and y occurring together

75
Q

Confidence/strength

A

Given x, how likely is y to occur

76
Q

Lift

A

How many more times do x and y occur together than we would expect by chance

77
Q

Leverage

A

How much more likely do x and y occur together than we would expect by chance (a difference in probabilities, where lift is a ratio)
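The six rule-mining measures on cards 72-77 can all be computed from toy transaction data (an illustrative sketch; the `transactions` data and the `p` helper for empirical probabilities are my own):

```python
# Toy transaction data: each set is one "market basket".
transactions = [{"x", "y"}, {"x", "y"}, {"x"}, {"y"}, {"z"}]

def p(*items):
    """Empirical probability that all given items occur together in a basket."""
    return sum(all(i in t for i in items) for t in transactions) / len(transactions)

support = p("x", "y")                     # P(x and y)
confidence = p("x", "y") / p("x")         # P(y | x)
lift = p("x", "y") / (p("x") * p("y"))    # co-occurrence vs chance, as a ratio
leverage = p("x", "y") - p("x") * p("y")  # co-occurrence vs chance, as a difference
# Conviction: expected rate of "x without y" under independence,
# divided by the observed rate of "x without y".
conviction = p("x") * (1 - p("y")) / (p("x") - p("x", "y"))

print(support, confidence, lift, leverage, conviction)
```

Here x and y co-occur in 2 of 5 baskets, so support is 0.4, confidence is 0.67, and a lift above 1 (about 1.11) means they appear together slightly more often than chance.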