Test 1 ISYS 4293 Flashcards

(80 cards)

1
Q

Business Intelligence and Data Mining

A

Data mining is a collection of knowledge-discovery technologies used to perform business intelligence (BI) in support of an organization’s decision-making.

2
Q

Cross-Industry Standard Process for Data Mining (CRISP-DM)

A

The standard six-phase process for carrying out a data mining project

3
Q

(1) Business Problem Understanding

A

-Define business requirements and objectives
-Translate objectives into data mining problem definition
-Prepare initial strategy to meet objectives

4
Q

(2) Data Understanding Phase

A

-Collect data
-Assess data quality
-Perform exploratory data analysis (EDA)

5
Q

(3) Data Preparation Phase

A

-Cleanse, prepare, and transform data set
-Prepares for modeling in subsequent phases
-Select cases and variables appropriate for analysis

6
Q

(4) Modeling Phase

A

-Select and apply one or more modeling techniques
-Calibrate model settings to optimize results
-If necessary, additional data preparation may be required

7
Q

(5) Evaluation Phase

A

-Evaluate one or more models for effectiveness
-Determine whether defined objectives are achieved
-Make decision regarding data mining results before deploying to field

8
Q

(6) Deployment Phase

A

-Make use of models created
-Simple deployment: generate report
-Complex deployment: implement additional data mining effort in another department
-In business, customer often carries out deployment based on model

9
Q

How many data mining tasks?

A

6 data mining tasks: description, estimation, classification, prediction, association, and clustering

10
Q

Data Mining Task: Description

A

-Describes general patterns and trends
-Easy to interpret and explain
-Transparent Models
-Pictures and numbers
-E.g. Scatterplots, Descriptive Stats

11
Q

Data Mining Task: Estimation

A

-Target Variable = Numerical
-Uses numerical or categorical predictor variables (IVs) to approximate changes in a numerical target variable (DV)
-Ex: Estimate a student’s Graduate GPA from their Undergrad GPA
-E.g. Correlation, Linear Regression

12
Q

Data Mining Task: Classification

A

-Target variable (DV) = categorical
-Examples:
Simple vs. complex tasks
Fraudulent card transactions
Income brackets (e.g., high, middle, low)

13
Q

Data Mining Task: Prediction

A

-Results lie in the future
-There is a time component in this task
-Ex: What is the probability of Razorbacks winning a game with a particular combination of player profiles?

14
Q

Data Mining Task: Association

A

-Finding attributes of data that go together
-Profiling relationships between two or more attributes
-Understand consequent behaviors based on prior (antecedent) behaviors
-Ex: Supermarkets use affinity analysis to see what items are purchased together

15
Q

Data Mining Task: Clustering

A

-no target variables
-segmentation of data
-Ex: Focused marketing campaigns

16
Q

Data mining Task: Learning Types

A

Supervised and Unsupervised

17
Q

Supervised

A

-Have a target variable
-Tasks:
Classification (categorical target variable)
Estimation (numeric target variable)
Description
Prediction

18
Q

Unsupervised

A

-No target variable
-Tasks:
Association
Clustering

19
Q

Fallacy 1:
-Set of tools can be turned loose on data repositories
-Finds answers to all business problems

A

Reality 1:
-No automatic data mining tools solve problems
-Rather, data mining is a process (CRISP-DM)
-Integrates into overall business objectives

20
Q

Fallacy 2:
-Data mining process is autonomous
-Requires little oversight

A

Reality 2:
-Requires significant intervention during every phase
-After model deployment, new models require updates
-Continuous evaluative measures monitored by analysts

21
Q

Fallacy 3:
-Data mining quickly pays for itself

A

Reality 3:
-Return rates vary
-They depend on startup, personnel, and data preparation costs, among others

22
Q

Fallacy 4:
-Data mining software is easy to use

A

Reality 4:
-Ease of use varies across projects
-Analysts must combine subject-matter knowledge with an understanding of the specific problem domain

23
Q

Fallacy 5:
-Data mining identifies the causes of business problems

A

Reality 5:
-Knowledge discovery process uncovers patterns of behavior
-Humans interpret results and identify causes

24
Q

Fallacy 6:
-Data mining automatically cleans data in databases

A

Reality 6:
-Data mining often uses data from legacy systems
-Data possibly not examined or used in years
-Organizations starting data mining efforts are often confronted with a huge data preprocessing task

25
Fallacy 7:
-Data mining will always yield positive results
Reality 7:
-Positive results are not guaranteed
-Can sometimes provide actionable results and improve decisions, but not always
26
Data preparation
Accounts for about 60% of the effort in the data mining process
27
Data Cleaning
-Replacing missing values
-Normalization: converting variables to a standardized scale
-Testing for normality
-Dummy variables
-Outliers
28
Why Preprocess data
-Raw data is often incomplete or noisy
-Data often comes from legacy databases where values are missing or irrelevant
-Data may be in a form not suitable for data mining, with obsolete fields or outliers
29
Three Alternate Methods For Replacing Data
-Replace missing values with a user-defined constant
-Replace missing values with the mode or the mean/median
-Replace missing values with random values
30
Replace Values with User-Defined Constant
-Missing numeric values replaced with 0.0
-Missing categorical values replaced with "Missing"
31
Replace Missing Values with Mode or Mean/Median
-Mode for a categorical field
-Mean/median for a continuous field
32
Replace Missing Values with Random Values
-Values are taken at random from the underlying distribution
-Method is superior to mean substitution
-Measures of location and spread remain closer to the original
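The three replacement methods above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and sample data are made up, missing values are represented as `None`, and "random values from the underlying distribution" is approximated by drawing from the observed (non-missing) values.

```python
import random
import statistics

def impute_constant(values, constant):
    """Replace None with a user-defined constant."""
    return [constant if v is None else v for v in values]

def impute_central(values, how="mean"):
    """Replace None with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    pick = {"mean": statistics.mean,
            "median": statistics.median,
            "mode": statistics.mode}[how]
    fill = pick(observed)
    return [fill if v is None else v for v in values]

def impute_random(values, rng):
    """Replace None with a value drawn at random from the observed values
    (an assumption standing in for the field's underlying distribution)."""
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]

incomes = [40.0, None, 55.0, 61.0, None, 48.0]  # made-up numeric field
colors = ["red", "blue", None, "red"]            # made-up categorical field

const_filled = impute_constant(incomes, 0.0)
mean_filled = impute_central(incomes, "mean")
mode_filled = impute_central(colors, "mode")
random_filled = impute_random(incomes, random.Random(0))
```

Random replacement keeps the spread of the field closer to the original because the filled-in values vary, whereas mean substitution piles every missing record onto a single value.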
33
Data Transformation: Normalization
-Standardizes the scale of effect each variable has on the results
-Puts the mean and variance, or the range, of every variable on a common scale (numeric field values should be normalized)
34
Min-Max Normalization
-Determines how much greater the field value is than the field's minimum value
-Scales this difference by the field's range
-X* stands for "min-max normalized X"
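As a minimal sketch of the description above, min-max normalization computes X* = (X - min(X)) / (max(X) - min(X)). The function name and the sample values are assumptions for illustration only.

```python
def min_max_normalize(values):
    """Scale each value by its distance above the minimum, divided by the range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("range is zero; min-max normalization is undefined")
    return [(x - lo) / (hi - lo) for x in values]

weights = [1600, 2400, 3200, 4800]  # made-up field values
normalized = min_max_normalize(weights)
```

The minimum always maps to 0 and the maximum to 1, so a single extreme value compresses all the other normalized values toward one end of the scale.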
35
Z-score Standardization
-Widely used in statistical analysis
-Takes the difference between the field value and the field mean
-Scales this difference by the field's standard deviation
-Values typically fall in the range [-3, 3]
-Data values equal to the field's mean have a z-score of 0
-Data values that lie above the mean have positive z-scores
36
In z-score standardization, data values equal to the field's mean
have a z-score standardization value of 0
37
In z-score standardization, data values that lie above the mean
have positive z-score standardization values
38
In z-score standardization, data values that lie below the mean
have negative z-score standardization values
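The sign behavior described in the z-score cards can be checked with a short sketch. The function name and data are hypothetical, and the sample standard deviation (n - 1 in the denominator) is assumed.

```python
import statistics

def z_scores(values):
    """Subtract the field mean, then divide by the field's standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    return [(x - mean) / sd for x in values]

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values with mean 5
z = z_scores(scores)

# Values equal to the mean get 0, values below it are negative,
# and values above it are positive.
for x, zx in zip(scores, z):
    print(x, round(zx, 3))
```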
39
Normality
Transforming a variable so that its distribution is closer to normal without changing its basic information
40
Data Transformation: Normality
Common transformations:
-Natural log = $\ln(\text{Bank})$
-Square root = $\sqrt{\text{Bank}}$
-Inverse square root = $1/\sqrt{\text{Bank}}$
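A small sketch of how these transformations pull a right-skewed field toward symmetry. The `bank` values are made up, and skewness is measured here with 3 * (mean - median) / standard deviation, which is one common estimate and an assumption about the measure intended.

```python
import math
import statistics

def median_skewness(values):
    """Skewness estimate: 3 * (mean - median) / standard deviation."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    return 3 * (mean - median) / statistics.stdev(values)

bank = [1, 2, 2, 3, 3, 4, 50]  # made-up, strongly right-skewed values

log_bank = [math.log(x) for x in bank]            # natural log
sqrt_bank = [math.sqrt(x) for x in bank]          # square root
inv_sqrt_bank = [1 / math.sqrt(x) for x in bank]  # inverse square root

skew_raw = median_skewness(bank)
skew_sqrt = median_skewness(sqrt_bank)
skew_log = median_skewness(log_bank)
```

For this data the square root reduces the skewness somewhat and the natural log reduces it further. Which transformation works best depends on the field, so each one should be checked rather than assumed.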
41
Right-skewed data
mean > median; skewness is positive
42
Left-skewed data
mean < median; skewness is negative
43
Symmetric data
mean = median = mode; skewness is zero
44
Outliers
-Values that lie near the extreme limits of the data range
-Outliers may represent errors in data entry
45
Z-score Standardization
sensitive to outliers
46
Min Max normalization
sensitive to variation
47
Interquartile Range (IQR)
-Used to identify outliers
-Robust statistical method, less sensitive to the presence of outliers
-Measure of variability
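One way to use the IQR for outlier detection is sketched below. The 1.5 x IQR fences are a widely used convention and are assumed here, since the card does not state a cutoff. The function name and data are illustrative, and quartile values can differ slightly between calculation methods.

```python
import statistics

def iqr_outliers(values):
    """Flag values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < lower or x > upper]

data = [10, 12, 13, 15, 16, 18, 19, 21, 95]  # made-up values with one extreme
outliers = iqr_outliers(data)
```

Because the quartiles depend only on the middle of the sorted data, the extreme value 95 does not move the fences, which is what makes the method robust.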
48
Hypothesis
A statement or claim about a parameter
49
Null Hypothesis
Represents the assumed value of the parameter
50
Alternative Hypothesis
Represents an alternative claim about the value of the parameter
51
Statistical Inference
Methods for estimating and testing hypotheses about population characteristics based on information contained in a sample
52
A data analyst meets with superiors to discuss whether to use kNN or Association on the data. Which CRISP-DM phase is this?
Modeling Phase (still discussing which model to use)
53
The Chief Analyst meets with the CIO, who says that she would like to investigate and scope out how analytics can be used in HR hiring projects. Which CRISP-DM phase is this?
Business Understanding Phase (key phrase: "investigate and scope out")
54
Estimate the amount of money a randomly chosen family of 4 will spend shopping, given a time and date?
Estimation or prediction (check the target variable: categorical or continuous, numerical or non-numerical)
55
Forecast the stock price of Microsoft for next year?
estimation or prediction
56
What does this equation represent? $z_i = (x_i - \bar{x}) / s$
Z-score equation:
$z_i$ = z-score
$x_i$ = observed value
$\bar{x}$ = sample mean
$s$ = sample standard deviation
57
Simple linear regression equation
$y = \beta_0 + \beta_1 x + \varepsilon$
58
What is the use of standardizing variables?
– Automatically remove outliers in the variables
– Convert variables to the same scale *
– Helps in computing IQR
– Make interpretation of the results easier
59
When handling missing data, one could:
– Replace missing values with a user-defined constant
– Replace missing values with the mode or the mean/median
– Replace missing values with random values
– All of the above *
60
IQR is more robust than the Z-score method for outlier detection; however, it is highly sensitive to the mean and standard deviation
1. True
2. False * (note the phrase "highly sensitive")
3. It depends on the context
4. It depends on the observations
5. Only 3 and 4
61
Is the IQR or the z-score method more sensitive to outliers?
The z-score method
62
In data mining tasks, one could reduce the margin of error by…
– Reducing the sample size
– Increasing the sample size *
– Changing the standard deviation
– Keeping the sample size constant
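To see why a larger sample reduces the margin of error, consider the usual margin of error for a sample mean, z * s / sqrt(n). The sketch below assumes that formula with a 95% confidence multiplier of 1.96; the function name and numbers are illustrative.

```python
import math

def margin_of_error(sd, n, z=1.96):
    """Margin of error for a sample mean: z * sd / sqrt(n)."""
    return z * sd / math.sqrt(n)

small_sample = margin_of_error(sd=10, n=100)
large_sample = margin_of_error(sd=10, n=400)
```

With the standard deviation held fixed, quadrupling the sample size halves the margin of error, because n enters through its square root.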
63
Normalization of the data can be done using
None of the above (use the min-max equation): $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$
64
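In Forward Regression, you start with all variables of interest in the model and then at each step, the least significant variable is dropped, assuming its p-value is above a pre-set level ($\alpha$ = .05 or .10)
False (this describes backward elimination; forward selection starts with no variables and adds the most significant one at each step)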
65
Before running a k-nearest neighbor model, it is required to
set k, the number of neighbors to compare instances to
66
In k-nearest neighbor, distance for categorical variable can be computed by
A "different from" function: the distance is 0 if the two values are the same and 1 otherwise
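A minimal sketch of a k-nearest neighbor distance that mixes numeric and categorical fields. It assumes a "different from" function (0 if two categories match, 1 otherwise) combined with Euclidean distance, and that numeric fields have already been normalized. The record layout and field names are hypothetical.

```python
import math

def different(a, b):
    """'Different from' function for categorical values: 0 if equal, else 1."""
    return 0 if a == b else 1

def mixed_distance(rec_a, rec_b, categorical):
    """Euclidean distance where categorical fields contribute 0 or 1."""
    total = 0.0
    for field in rec_a:
        if field in categorical:
            d = different(rec_a[field], rec_b[field])
        else:
            d = rec_a[field] - rec_b[field]
        total += d ** 2
    return math.sqrt(total)

# Made-up records; "age" is assumed to be min-max normalized already.
a = {"age": 0.2, "gender": "F"}
b = {"age": 0.5, "gender": "M"}
c = {"age": 0.5, "gender": "F"}

d_ab = mixed_distance(a, b, categorical={"gender"})
d_ac = mixed_distance(a, c, categorical={"gender"})
```

Records `b` and `c` are equally far from `a` in age, but `b` also differs in the categorical field, so it ends up farther away.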
67
Choose appropriate fit statistics for estimation model selection:
– Misclassification
– Gini Coefficient
– Average Squared Error *
– Schwarz’s Bayesian Criterion *
– Average Profit/Loss
– Log Likelihood *
68
Choose appropriate fit statistics for ranking model selection:
– ROC Index *
– Gini Coefficient *
69
Choose appropriate fit statistics for decision model selection:
– Misclassification *
– Gini Coefficient
– ASE
– MSE
– Average Profit/Loss *
– Log Likelihood
– Kolmogorov-Smirnov statistic *
70
Which is true when modeling a Decision Tree?
– Each variable is evaluated at each node to determine the splitting variable
– The same variable may be used for splitting at different locations in the decision tree
– CART (Phi) / information gain criteria can be used for selecting candidate splits
– If not pruned, a stopping criterion in creating a decision tree is when the tree reaches the leaf nodes
– All of the above *
71
Categorical data
-Labels or names used to identify an attribute of each element
-Generally qualitative
-Nominal or ordinal
72
Quantitative data
-Indicates how many or how much
-Either discrete or continuous
73
The sum of the differences between each $x_i$ and $\bar{x}$, $\sum_i (x_i - \bar{x})$ =
0
74
Variance
-Measures how far a set of numbers is spread out from its average value
-Sample variance: $s^2 = \frac{1}{n-1} \sum_i (x_i - \bar{x})^2$
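A short numeric check of the two facts above: the deviations from the mean sum to zero, while the squared deviations give the variance. The data are made up, and the sample variance (dividing by n - 1) is assumed.

```python
import statistics

data = [4, 8, 6, 5, 7]  # made-up values with mean 6
mean = statistics.mean(data)

deviations = [x - mean for x in data]
sum_of_deviations = sum(deviations)  # always 0, up to rounding

# Squaring keeps positive and negative deviations from cancelling out.
sample_variance = sum(d ** 2 for d in deviations) / (len(data) - 1)
```

This is why variance is built from squared deviations: the raw deviations always cancel, so their sum says nothing about spread.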
75
Overfitting
-When your model memorizes the exact training data but does not learn the underlying pattern in the data
-The model is fit too closely to the training data
76
Underfitting
-When your model is too simple
-The model isn't complex enough to match the training data
77
Continuous variable (comparing two samples)
Use the two-sample t test for the difference in means
78
Flag variable (comparing two samples)
Use the two-sample Z test for the difference in proportions
79
Multinomial variable (comparing two samples)
Use the test for the homogeneity of proportions
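For the flag-variable case, the two-sample Z test for the difference in proportions can be sketched as follows. The pooled-proportion form of the statistic is assumed; the function name and counts are made up for illustration.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Z statistic for the difference between two sample proportions,
    using the pooled proportion in the standard error."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Made-up counts: 60 of 100 flagged in sample 1, 40 of 100 in sample 2.
z_stat = two_proportion_z(60, 100, 40, 100)
z_same = two_proportion_z(50, 100, 50, 100)
```

A Z statistic far from 0 suggests that the two samples differ in the proportion of flagged records; equal proportions give exactly 0.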
80
Goodness-of-fit equation (CART measure for a candidate split $s$ at node $t$):
$\Phi(s \mid t) = 2 P_L P_R \sum_j \left| P(j \mid t_L) - P(j \mid t_R) \right|$
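A hedged sketch of how this measure could be computed for one candidate split. It takes P_L and P_R to be the proportions of records sent to the left and right child nodes, and P(j|t_L) and P(j|t_R) to be the class proportions within each child. The function name and labels are illustrative.

```python
from collections import Counter

def phi(left_labels, right_labels):
    """CART goodness of a candidate split:
    2 * P_L * P_R * sum over classes j of |P(j|t_L) - P(j|t_R)|."""
    n_left, n_right = len(left_labels), len(right_labels)
    if n_left == 0 or n_right == 0:
        raise ValueError("both child nodes must contain records")
    n = n_left + n_right
    p_left, p_right = n_left / n, n_right / n
    left_counts, right_counts = Counter(left_labels), Counter(right_labels)
    classes = set(left_counts) | set(right_counts)
    q = sum(abs(left_counts[c] / n_left - right_counts[c] / n_right)
            for c in classes)
    return 2 * p_left * p_right * q

# Made-up class labels for the records sent to each child node.
perfect_split = phi(["good"] * 4, ["bad"] * 4)
useless_split = phi(["good", "bad"], ["good", "bad"])
```

The measure is largest when the split sends roughly equal numbers of records to each side and the two child nodes have very different class mixes. A split that leaves both children with the same mix scores 0.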