Data Mining Flashcards

1
Q

Data Mining

A

Extraction of implicit, previously unknown, and potentially useful information from data

2
Q

KDD

A

Knowledge Discovery from Data

3
Q

Descriptive vs. Predictive

A

Descriptive methods extract interpretable models describing the data, while predictive methods predict unknown or future values.

4
Q

Types of attributes

A

Nominal: not ordered data (eye color, ID numbers)
Ordinal: ordered data (grades, rankings, height in {short, medium, tall})
Interval: differences are meaningful, but there is no true zero (calendar dates, temperature in Celsius)
Ratio: scaled data with a true zero (temperature in Kelvin, length, time)

5
Q

Properties of attributes

A

Distinctness (D)
Order (O)
Addition (A)
Multiplication (M)

Nominal -> D
Ordinal -> D, O
Interval -> D, O, A
Ratio -> D, O, A, M

6
Q

Discrete vs. Continuous

A

Discrete attributes have a finite or countably infinite set of values, while continuous attributes have infinitely many possible values (real numbers).

7
Q

Data Quality Problems

A
  • Noise: Modification of original values.
  • Outliers: Data objects considerably different than most of the other data.
  • Missing Values: Data not collected or attributes not applicable for all objects.
  • Duplicate data: e.g., records duplicated when merging data from heterogeneous sources.
8
Q

Data Reduction Types

A
  • Sampling
  • Feature Selection / Dimensionality reduction
10
Q

Discretization

A

Supervised or unsupervised conversion of continuous values into a finite set of discrete values.
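A minimal sketch of unsupervised equal-width discretization in Python (the values and bin count are made up for illustration):

```python
# Unsupervised equal-width discretization of a continuous attribute.
values = [2.1, 3.7, 5.0, 6.4, 8.9, 9.5]   # hypothetical continuous values
n_bins = 3
lo, hi = min(values), max(values)
width = (hi - lo) / n_bins

# Map each continuous value to a bin index 0..n_bins-1.
bins = [min(int((v - lo) / width), n_bins - 1) for v in values]
print(bins)   # [0, 0, 1, 1, 2, 2]
```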

11
Q

Binarization

A

Mapping a categorical or discrete attribute to a set of binary (0/1) attributes, e.g., one-hot encoding.
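A minimal one-hot encoding sketch in Python (the attribute values are made up for illustration):

```python
# One-hot encoding: map each distinct nominal value to its own 0/1 attribute.
colors = ["red", "green", "blue", "green"]    # hypothetical nominal attribute
categories = sorted(set(colors))              # ["blue", "green", "red"]

encoded = [[1 if value == cat else 0 for cat in categories] for value in colors]
print(categories)   # column order of the binary attributes
print(encoded)      # e.g. "green" -> [0, 1, 0]
```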

12
Q

Similarity and Dissimilarity

A

Two measures of how alike and how different two data objects are.
On a [0, 1] scale, 1 refers to maximum similarity.

13
Q

Types of distance measures

A
  • Euclidean Distance
  • Minkowski Distance
  • Mahalanobis Distance
  • Cosine Similarity
  • SMC Similarity
  • Combined similarities
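A small Python sketch of two measures from this list, Euclidean distance and cosine similarity (the example vectors are made up):

```python
import math

def euclidean(x, y):
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_similarity(x, y):
    # Dot product divided by the product of the vector norms.
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

p, q = [3, 2, 0, 5], [1, 0, 0, 0]
print(euclidean(p, q), cosine_similarity(p, q))
```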
14
Q

Correlation

A

Measure in [-1, 1] of the linear relationship between two data objects.

corr(x, y) = covariance(x, y) / (std_deviation(x) * std_deviation(y))
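A minimal Python sketch of the formula above (the example series are made up for illustration):

```python
import math

def correlation(x, y):
    # Pearson correlation: covariance divided by the product of standard deviations.
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (std_x * std_y)

print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))   # 1.0: perfect linear relationship
print(correlation([1, 2, 3, 4], [8, 6, 4, 2]))   # -1.0: perfect inverse relationship
```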

15
Q

Association Rules

A

Extraction of frequent correlations or patterns from a transactional database.

16
Q

AR: Itemset

A

A set of items.
Example: {Beer, Diapers}

17
Q

AR: Support

A

Fraction of transactions that contain an itemset.
Example: sup({beer, diapers})=2/5

18
Q

AR: Frequent Itemset

A

An itemset whose support is greater than or equal to a minimum support (minsup) threshold.

19
Q

AR: Confidence

A

For a rule A => B: frequency of {A, B} in the transactions containing A.
Example: conf(A => B) = sup(A, B) / sup(A)
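A small Python sketch computing support and confidence on a toy transactional database (the transactions are made up for illustration):

```python
# Hypothetical transactional database.
transactions = [
    {"beer", "diapers", "bread"},
    {"beer", "diapers"},
    {"bread", "milk"},
    {"diapers", "milk"},
    {"beer", "bread", "milk"},
]

def support(itemset):
    # Fraction of transactions containing every item of the itemset.
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent):
    # conf(A => B) = sup(A u B) / sup(A)
    return support(antecedent | consequent) / support(antecedent)

print(support({"beer", "diapers"}))        # 2/5 = 0.4
print(confidence({"beer"}, {"diapers"}))   # 2/3
```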

20
Q

Association Rule Extraction Steps

A
  1. Extract frequent item sets
  2. Extract association rules
21
Q

AR: Brute-force approach

A

Generate all possible itemsets and extract association rules. Computationally infeasible.

22
Q

AR: Apriori Principle

A

If an itemset is frequent, then all of its subsets must also be frequent.
Example: {A,B} -> Frequent
{A} and {B} must also be frequent.

23
Q

AR: Apriori Algorithm

A
  1. Extract 1-element itemsets.
  2. Prune itemsets with support below minsup.
  3. Generate new candidate itemsets.
  4. Loop (see the sketch below).
    Draw it or create an example.
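A compact Python sketch of this loop, assuming a toy in-memory database and a made-up minsup value:

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return all frequent itemsets (as frozensets) with support >= minsup."""
    n = len(transactions)

    def support(items):
        return sum(1 for t in transactions if items <= t) / n

    # Step 1: frequent 1-element itemsets.
    items = {i for t in transactions for i in t}
    level = {frozenset([i]) for i in items if support(frozenset([i])) >= minsup}
    frequent = set(level)

    # Steps 2-4: generate (k+1)-item candidates from frequent k-item itemsets,
    # prune those below minsup, and loop until no new itemsets survive.
    while level:
        k = len(next(iter(level)))
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Apriori pruning: every k-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k))}
        level = {c for c in candidates if support(c) >= minsup}
        frequent |= level
    return frequent

db = [{"beer", "diapers"}, {"beer", "diapers", "bread"}, {"bread", "milk"}]
print(apriori(db, minsup=2 / 3))
```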
24
Q

Association Rule Extraction Algorithms

A
  • Brute-force approach
  • Apriori Algorithm
  • FP-growth algorithm
25
Q

AR: FP-growth Algorithm

A
  1. Build the header table by sorting items (NOT itemsets) in decreasing support order.
  2. Add header items to the frequent set if their support is greater than minsup.
  3. Build the FP-tree.
  4. Select the lowest item in the header table.
  5. Create a new set of transactions (conditional pattern base) from the FP-tree.
  6. Repeat, exploring all items in the latest header table.

26
Q

AR: Correlation

A

For a rule r: A => B, correlation = conf(r) / sup(B).
Statistical independence -> correlation = 1
Positive correlation -> correlation > 1
Otherwise -> negative correlation

27
Q

Accuracy

A

Quality of the prediction.

28
Q

Interpretability

A

Whether the model is interpretable by a human.

29
Q

Incrementality

A

Model updating in the presence of new labelled records.

30
Q

Efficiency

A

Model building and prediction times.

31
Q

Scalability

A

Behaviour with respect to training set size and attribute number.

32
Q

Robustness

A

How the model reacts to noise and missing data.
33
Q

Gini Index Formula

A

NODES: Gini(Nk) = 1 - SUM over classes [P(Cj)^2]
Example: C1 = 1, C2 = 5 -> P1 = 1/6, P2 = 5/6 -> Gini = 1 - P1^2 - P2^2
SPLITS: Gini(split) = SUM over child nodes [(#NodeKRecords / #SplitRecords) * Gini(Nk)]
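A minimal Python sketch of both formulas (the class counts are made up for illustration):

```python
def gini(counts):
    # Gini(node) = 1 - sum over classes of P(Ck)^2
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gini_split(nodes):
    # Weighted average of the children's Gini indexes,
    # weights = records in the node / records in the split.
    total = sum(sum(n) for n in nodes)
    return sum(sum(n) / total * gini(n) for n in nodes)

print(gini([1, 5]))                   # node with C1=1, C2=5 -> about 0.278
print(gini_split([[1, 5], [4, 2]]))   # two-way split example
```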
34
Q

Decision Tree Elements

A

Node: leaf node giving the final prediction.
Split: internal node testing another attribute.

35
Q

Gini Index

A

Measures the impurity of a node or split.
0: lowest impurity (all records in one class)
1 - 1/#classes: highest impurity (records equally distributed among the classes)

36
Q

Gini Index for continuous values

A
  1. Consider candidate split points between consecutive sorted values.
  2. Calculate the Gini index for each candidate split.
  3. Select the split with the lowest Gini index.

37
Q

Entropy Index

A

Entropy = - SUM over classes [P(Ck) * log2 P(Ck)]
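A minimal Python sketch of the formula (the class counts are made up for illustration):

```python
import math

def entropy(counts):
    # Entropy(node) = - sum over classes of P(Ck) * log2 P(Ck)
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]   # skip empty classes: 0*log(0) := 0
    return -sum(p * math.log2(p) for p in probs)

print(entropy([3, 3]))   # 1.0: maximum impurity for two equally likely classes
print(entropy([6, 0]))   # 0.0: pure node
```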
38
Q

Decision Tree Evaluation

A

Accuracy: average
Interpretability: for small trees
Incrementality: not incremental
Efficiency: fast
Scalability: scalable
Robustness: difficult management of missing data

39
Q

Random Forest

A

Divides the original training data into random subsets, builds a decision tree on each subset, and decides the label by majority voting.

40
Q

Random Forest Evaluation

A

Accuracy: better than decision trees
Interpretability: no
Incrementality: no
Efficiency: fast
Scalability: scalable
Robustness: robust

41
Q

Rule Based Classifiers

A

Classify records by using a collection of “if...then...” rules: (Condition) → y

42
Q

Rule Based Classifier Evaluation

A

Accuracy: better than decision trees
Interpretable
Not incremental
Efficient
Scalable
Robust

43
Q

Instance Based Classifiers

A

Store the training records and use them to predict the class label of unseen cases.

44
Q

K-nearest Neighbors

A

Selects the label based on the k nearest neighbors (typically by majority vote).
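A minimal Python sketch of k-nearest-neighbor prediction with Euclidean distance (the training records are made up for illustration):

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (feature_vector, label); query: feature vector."""
    # Rank training records by Euclidean distance to the query.
    neighbors = sorted(train, key=lambda rec: math.dist(rec[0], query))[:k]
    # Majority vote among the k closest labels.
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

train = [([1.0, 1.0], "A"), ([1.2, 0.8], "A"), ([5.0, 5.0], "B"), ([5.5, 4.5], "B")]
print(knn_predict(train, [1.1, 0.9], k=3))   # "A"
```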
45
Q

K-nearest Neighbor Evaluation

A

Accuracy: average
Not interpretable
Incremental
Efficient model building but slow classification
Scalable
Distances might affect robustness

46
Q

Bayesian Classification

A

Computes the probability that a given record belongs to class C using Bayes' theorem, assuming the attributes are independent given the class.
X: record with attribute values X1, X2, ...
For C1: P(X1|C1) * P(X2|C1) * ... * P(C1)
P(Xi|C1) = #records of C1 with value Xi / #records of C1
P(C1) = #records of C1 / #records
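A minimal naive Bayes scoring sketch for categorical attributes, following the formula above (the records and labels are made up; no smoothing):

```python
def naive_bayes_score(records, labels, x, c):
    """Score P(X1|c) * P(X2|c) * ... * P(c) for categorical attributes."""
    n = len(records)
    in_class = [r for r, lab in zip(records, labels) if lab == c]
    score = len(in_class) / n                       # P(c)
    for i, value in enumerate(x):                   # naive independence assumption
        score *= sum(1 for r in in_class if r[i] == value) / len(in_class)
    return score

records = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "hot")]
labels = ["no", "yes", "yes", "no"]
x = ("sunny", "mild")
print({c: naive_bayes_score(records, labels, x, c) for c in set(labels)})
```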
47
Q

Bayes Classifier Evaluation

A

Accuracy: average
Not interpretable
Incremental
Efficient
Scalable
Robust for statistically independent attributes

48
Q

SVM Evaluation

A

Good accuracy
Not interpretable
Not incremental
Averagely efficient
Medium scalability
Robust

49
Q

Cross validation

A
  1. Partition the data into K subsets (folds).
  2. Train on K-1 folds and test on the remaining one.
  3. Repeat for all folds (see the sketch below).
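A minimal sketch of how the K folds can be laid out (round-robin index assignment, made up for illustration; real tools usually shuffle the records first):

```python
def kfold_indices(n_records, k):
    """Partition record indices into k folds; each fold serves once as the test set."""
    folds = [list(range(i, n_records, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

for train_idx, test_idx in kfold_indices(n_records=10, k=5):
    print("train:", train_idx, "test:", test_idx)
```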
50
Q

Confusion Matrix

A

Table comparing predicted and actual class labels: true positives, true negatives, false positives, and false negatives.

51
Q

Accuracy

A

Number of correct predictions / number of classified objects.
Not appropriate for unbalanced class label distributions or classes with different relevances.

52
Q

Recall

A

Number of objects correctly assigned to C / number of objects belonging to C.

53
Q

Precision

A

Number of objects correctly assigned to C / number of objects assigned to C.
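A small sketch tying accuracy, precision, and recall to the confusion-matrix counts (the counts are made up for illustration):

```python
# Counts for a single class C (hypothetical values):
tp, fp, fn, tn = 40, 10, 5, 45   # true/false positives, false negatives, true negatives

accuracy = (tp + tn) / (tp + fp + fn + tn)   # correct / classified objects
precision = tp / (tp + fp)                   # correctly assigned to C / assigned to C
recall = tp / (tp + fn)                      # correctly assigned to C / belonging to C

print(accuracy, precision, recall)           # 0.85, 0.8, ~0.889
```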
54
Q

ROC

A

Receiver Operating Characteristic: curve of the true positive rate vs. the false positive rate.

55
Q

K-means clustering

A
  1. Select K random points as initial centroids.
  2. Assign each point to the closest centroid.
  3. Define a new centroid for each cluster, normally the mean of its points.
  4. Iterate until the centroids no longer change (see the sketch below).
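A plain Python sketch of the four steps above (the points and K are made up for illustration):

```python
import math
import random

def kmeans(points, k, iterations=100):
    """Plain k-means on small point tuples (toy sketch; real tools handle empty clusters, etc.)."""
    centroids = random.sample(points, k)                  # 1. K random points as initial centroids
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                                  # 2. assign each point to the closest centroid
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        new_centroids = [                                 # 3. new centroid = mean of assigned points
            tuple(sum(coords) / len(cluster) for coords in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:                    # 4. stop when centroids no longer move
            break
        centroids = new_centroids
    return centroids, clusters

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
print(kmeans(pts, k=2))
```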
56
Q

Bisecting K-means Clustering

A
  1. Split all the points into two clusters.
  2. Split the cluster with the highest SSE into two.
  3. Repeat until K clusters are obtained.

57
Q

Hierarchical Clustering

A

Produces a set of nested clusters organized as a hierarchical tree (dendrogram).

58
Q

DBSCAN

A

Density-based clustering: groups points in dense regions and labels points in sparse regions as noise/outliers.