BIS II - Data Mining Flashcards

1
Q

Data mining definition

A

o A process that uses statistical, mathematical, AI, and machine-learning techniques
o To extract and identify useful information and subsequent knowledge from large databases
• Data mining tools find patterns in data and may even infer rules/models from them
• Other names:
o Knowledge extraction, pattern analysis, knowledge discovery, information harvesting, …

2
Q

Data Mining Process

A

Different groups have different versions; most common standard processes are:

  • CRISP-DM (Cross-Industry Standard Process for Data Mining)
  • SEMMA (Sample, Explore, Modify, Model, and Assess)

1) CRISP-DM Process: the first 4 steps account for 85% of total project time; a highly iterative and experimental process
1. Develop Business Understanding
2. Then, develop Data Understanding
3. Prepare Data
4. Build model
5. Test and evaluate
6. Deploy

2) SEMMA Process
1. Sample: generate a representative sample of the data
2. Explore: visualize the data and make a basic description of it
3. Modify: select variables, transform the variable representations
4. Model: use a variety of statistical and machine learning models
5. Assess: Evaluate the accuracy and usefulness of the models

3
Q

Data Preparation

A
  • Most critical task in DM
  • Steps involved in obtaining well-formed data from raw data

1) Data Consolidation: collecting, selecting, and integrating data
2) Data cleaning: imputing missing values, reducing noise and eliminating inconsistencies in data
3) Data transformation: Normalizing data, discretizing & aggregating data, constructing new attributes
4) Data reduction: reducing the number of variables and cases, balancing skewed data
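Two of the steps above, imputing missing values (cleaning) and normalizing (transformation), can be sketched in plain Python; the sample ages and helper names are illustrative, not part of any particular tool:

```python
# Hypothetical data-preparation sketch: mean imputation + min-max normalization.

def impute_mean(values):
    """Replace None entries with the mean of the observed values (cleaning)."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def min_max_normalize(values):
    """Scale values into the range [0, 1] (transformation)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = [20, None, 40, 60]
clean = impute_mean(ages)          # [20, 40.0, 40, 60]
scaled = min_max_normalize(clean)  # [0.0, 0.5, 0.5, 1.0]
```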

4
Q

What does Data Mining do? How does it work?

A
  • DM extracts patterns from data
  • Pattern = mathematical (numeric and/or symbolic) relationship among data items

Types of patterns:

  1. Association
  2. Prediction
  3. Cluster (segmentation)
  4. Sequential (or time series) relationships
5
Q

Applications of Data Mining

A
  • In Customer Relationship Management:
    1. To maximize return on marketing campaigns
    2. To improve customer retention
    3. To maximize customer value
  • Banking/financial
    1. To automate loan application process
    2. Detect fraudulent transactions
    3. To optimize cash reserves with forecasting
  • Other domains: retail & logistics, manufacturing and maintenance, brokerage and securities trading, insurance
6
Q

Data Mining Terminology

A

Data science/data mining = Statistics/operations research
Features/attributes = Independent variables; predictors; explanatory variables
Target variable/attribute/label = Dependent variable
Bias = Intercept in regression analysis

7
Q

Taxonomy of Data Mining Tasks

A
  • Unsupervised learning aims at identifying associations, i.e. grouping data into previously unknown classes
  • Supervised learning: the classes are known
  • Different classification approaches differ regarding:
    1. Search strategy
    2. Efficiency with regard to resources
    3. Input data requirements
    4. Interpretability of results, generated rules/models
8
Q

Data Mining Methods – Classification

A

Definition:

  1. Supervised induction used to analyze historical data stored in databases
  2. To automatically generate a model that can predict future behavior
  • Most frequently used DM method
  • Employ supervised learning from past data, then classification of new data
  • Output variable is categorical (nominal or ordinal) in nature
  • Classification techniques:
    1. Decision tree analysis
    2. Artificial neural networks
    3. Logistic regression
    4. Support vector machines
    5. Etc.
9
Q

Estimation Methodologies for Classification

A

Simple split (or holdout or test sample estimation)

  1. Split the data into two mutually exclusive sets: training for model development (ca. 70%) and testing for model assessment/scoring (ca. 30%) to determine prediction accuracy
  2. For ANNs, the data is split into three subsets (training, ca. 60%; validation, ca. 20%; and testing, ca. 20%)
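A minimal sketch of the simple split in plain Python; the 70/30 ratio follows the text, while the function name and fixed seed are illustrative choices:

```python
import random

def simple_split(data, train_frac=0.7, seed=42):
    """Shuffle and split data into mutually exclusive training/testing sets."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

train_set, test_set = simple_split(list(range(100)))
# 70 training records and 30 testing records, with no overlap
```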

K-fold cross validation (rotation estimation)

  1. Split the data into k mutually exclusive subsets
  2. Use each subset as testing while using the rest of the subsets as training
  3. Repeat the experimentation k times
  4. Aggregate the test results for a true estimation of prediction accuracy
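The four k-fold steps above can be sketched as an index generator (the names and the stride-based fold assignment are illustrative):

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation:
    each of the k mutually exclusive folds serves once as the test set."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train_idx, test_idx

# Each record appears in exactly one test fold across the k repetitions;
# a model would be trained/tested per pair and the results aggregated.
pairs = list(k_fold_indices(10, k=5))
```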

Other estimation methodologies:

  1. Leave-one-out, bootstrapping, jackknifing
  2. Area under the ROC curve
10
Q

Accuracy of Classification Models

A
  • In classification problems, the primary source for accuracy estimation is the confusion matrix

Accuracy = (True Positives + True Negatives) / (All Predictions)
True Positive Rate (Recall) = True Positives / (True Positives + False Negatives)
True Negative Rate = True Negatives / (True Negatives + False Positives)
Precision = True Positives / (True Positives + False Positives)
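These measures follow directly from the four confusion-matrix counts; the counts in the example call are made up for illustration:

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the standard accuracy measures from confusion-matrix counts."""
    return {
        "accuracy":  (tp + tn) / (tp + tn + fp + fn),
        "tpr":       tp / (tp + fn),   # true positive rate (recall)
        "tnr":       tn / (tn + fp),   # true negative rate
        "precision": tp / (tp + fp),
    }

m = classification_metrics(tp=40, fp=10, tn=45, fn=5)
# accuracy 0.85, tpr ~0.889, tnr ~0.818, precision 0.8
```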

11
Q

Decision Trees

A
  • The likelihood that a data subject shows a specific outcome of the target variable is determined from the attributes observed (essentially, how often an attribute value co-occurs with the target variable)
  • Question: “Which of the attributes would be best to segment these people (in the example) into groups, in a way that will distinguish write-offs from non-write-offs?”
  • Find the most informative attributes by using a formula/algorithm that evaluates how well each attribute splits a set of examples into segments, with respect to a chosen target variable: the ID3 decision tree algorithm
12
Q

Entropy and Information Gain

A
  • Information gain = the most common splitting criterion
    1. Based on a purity measure: entropy
  • Entropy
    1. A measure of disorder
    2. Disorder corresponds to how mixed (impure) a segment is with respect to the values of the attribute of interest
    3. A mixed-up segment with many realizations of both target values (write-offs and non-write-offs) has high entropy

Entropy = -p1 log(p1) - p2 log(p2) - …

  • pi = the probability of value i within the set
  • pi = 1 when all members of the set have attribute value i
  • pi = 0 when no members of the set have attribute value i
  • There may be more than two attribute values (properties)
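A minimal entropy function following the formula above, using log base 2 (the usual choice when information is measured in bits):

```python
import math

def entropy(probabilities):
    """Entropy = -sum(p_i * log2(p_i)); a pure segment has entropy 0."""
    # Terms with p == 0 contribute nothing, so they are skipped to avoid log(0).
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

entropy([0.5, 0.5])  # 1.0: a maximally mixed two-class segment
entropy([1.0])       # 0.0: a pure segment
```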
13
Q

Information Gain

A

IG(parent, children) = entropy(parent) - p(c1) entropy(c1) - p(c2) entropy(c2) - …

  • Measures how informative an attribute is with respect to the target: how much the attribute decreases (improves) entropy over the whole segmentation it creates
  • An attribute segments a set of instances into k subsets. Terminology:
    1. Parent set: the original set of examples
    2. The k child sets are the result of splitting on the attribute values
  • The entropy of each child is weighted by the proportion of instances belonging to that child
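The weighted-entropy formula above can be sketched as follows; the label values (“w” for write-off, “n” for non-write-off) are illustrative:

```python
import math
from collections import Counter

def entropy_of(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """IG = entropy(parent) minus the child entropies, each weighted by
    the proportion of instances belonging to that child."""
    n = len(parent)
    return entropy_of(parent) - sum(len(c) / n * entropy_of(c) for c in children)

# A perfect split of a 50/50 parent yields the maximum gain of 1.0:
information_gain(["w", "w", "n", "n"], [["w", "w"], ["n", "n"]])
```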
14
Q

To test the accuracy of the model

A
  • Predict the values of the hold-out sample using the developed decision tree and compare it to the true values
  • Create a confusion matrix and compare the values for accuracy (hit rate), recall and precision of the prediction model
15
Q

Summary: ID3 Decision Trees

A
  • General algorithm for building an ID3 decision tree:
    1. Create the root node and select the splitting attribute
    2. Add a branch to the root node for each split candidate value and label it
    3. Iterate:
       1. Classify the data by applying the information gain measure
       2. If a stopping point is reached, create a leaf node and label it; otherwise, build another subtree
  • Disadvantages of ID3:
    1. ID3 tends to prefer splits that result in a large number of partitions, each being small but pure
    2. Overfitting, less generalization capability
    3. Cannot handle numeric values or missing values
    4. The C4.5 algorithm aims at curing these shortcomings
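The core of the ID3 loop, choosing the splitting attribute with the highest information gain, can be sketched as below; the toy records and attribute names are invented for illustration:

```python
import math
from collections import Counter, defaultdict

def entropy_of(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split_attribute(rows, target):
    """Pick the attribute with the highest information gain, as ID3 does."""
    base = entropy_of([r[target] for r in rows])
    best, best_gain = None, -1.0
    for attr in rows[0]:
        if attr == target:
            continue
        groups = defaultdict(list)  # attribute value -> target labels
        for r in rows:
            groups[r[attr]].append(r[target])
        remainder = sum(len(g) / len(rows) * entropy_of(g)
                        for g in groups.values())
        if base - remainder > best_gain:
            best, best_gain = attr, base - remainder
    return best

data = [
    {"employed": "yes", "balance": "high", "write_off": "no"},
    {"employed": "yes", "balance": "low",  "write_off": "no"},
    {"employed": "no",  "balance": "high", "write_off": "yes"},
    {"employed": "no",  "balance": "low",  "write_off": "yes"},
]
best_split_attribute(data, "write_off")  # -> "employed"
```

In a full ID3 tree builder this selection would be applied recursively to each child subset until a stopping point (e.g. a pure segment) is reached.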