Test 1 ISYS 4293 Flashcards

Question

Fallacy 7: -Data mining will always yield positive results

Answer 1

Reality 7:
- Positive results are not guaranteed
- Data mining can sometimes provide actionable results and improve decisions, but not always

Answer 2

60% of effort for data mining process

Answer 3

- Replacing missing values
- Normalization: converting variables to a standardized scale
- Testing for normality
- Dummy variables
- Outliers

Answer 4

- Raw data are often incomplete and noisy
- Data often come from legacy databases where values are missing or not relevant
- Data may be in a form not suitable for data mining; obsolete fields; outliers

Answer 5

- Replace missing values with a user-defined constant
- Replace missing values with the mode or mean/median
- Replace missing values with random values

Answer 6

- Missing numeric values are replaced with 0.0
- Missing categorical values are replaced with "Missing"

Answer 7

- Mode for a categorical field
- Mean/median for a continuous field

Answer 8

- Values are taken at random from the underlying distribution
- Superior to mean substitution
- Measures of location and spread remain closer to the original
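The three replacement strategies on the cards above can be sketched in a few lines of Python. This is only an illustration, not any particular tool's implementation: the helper names (`fill_constant`, `fill_central`, `fill_random`), the use of `None` to mark a missing value, and drawing random replacements from the observed values (as a stand-in for the underlying distribution) are all assumptions made for the example.

```python
import random
import statistics

def fill_constant(values, constant):
    """Replace missing entries (None) with a user-defined constant."""
    return [constant if v is None else v for v in values]

def fill_central(values, categorical=False):
    """Replace missing entries with the mode (categorical) or the mean (continuous)."""
    observed = [v for v in values if v is not None]
    fill = statistics.mode(observed) if categorical else statistics.mean(observed)
    return [fill if v is None else v for v in values]

def fill_random(values, rng):
    """Replace missing entries with values drawn at random from the observed ones."""
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]

# Made-up sample fields; None marks a missing value.
ages = [20, None, 30, 40, None]
colors = ["red", None, "blue", "red"]

print(fill_constant(ages, 0.0))                # numeric default named on the cards
print(fill_constant(colors, "Missing"))        # categorical default named on the cards
print(fill_central(ages))                      # fills with the mean of 20, 30, 40
print(fill_central(colors, categorical=True))  # fills with the mode, "red"
print(fill_random(ages, random.Random(0)))     # seeded only for repeatability
```

Mean substitution pulls every filled value to the center, which shrinks the spread of the field; random draws avoid that, which is the point made on the random-values card.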

Answer 9

- Standardizes the scale of effect each variable has on the results
- Puts the mean and variance (or range) of every variable on a comparable scale; numeric field values should be normalized

Answer 10

- Determines how much greater the field value is than the field's minimum value
- Scales this difference by the field's range
- X* stands for "min-max normalized X"
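A minimal sketch of min-max normalization as described on this card, X* = (X - min) / (max - min). The function name `min_max_normalize` and the sample values are hypothetical.

```python
def min_max_normalize(values):
    """Scale each value to [0, 1]: X* = (X - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Every value is identical, so the range is zero and X* is undefined.
        raise ValueError("range is zero; min-max normalization is undefined")
    return [(v - lo) / (hi - lo) for v in values]

weights = [1600, 2500, 3400, 5200]  # made-up sample values
print(min_max_normalize(weights))   # the minimum maps to 0.0, the maximum to 1.0
```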

Answer 11

- Widely used in statistical analysis
- Takes the difference between the field value and the field's mean
- Scales this difference by the field's standard deviation
- Values typically fall in the range [-3, 3]
- Data values equal to the field's mean have a z-score standardization value of 0
- Data values that lie above the mean have positive z-score standardization values
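A short sketch of z-score standardization as this card describes it, assuming the sample standard deviation (n - 1 in the denominator) is the intended scale. The function name `z_scores` and the data are made up for illustration.

```python
import statistics

def z_scores(values):
    """Standardize: subtract the mean, then divide by the sample standard deviation."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (n - 1 in the denominator)
    return [(v - mean) / s for v in values]

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values whose mean is 5
for x, z in zip(scores, z_scores(scores)):
    # Values below the mean print a negative z, values above it a positive z.
    print(x, round(z, 3))
```

The two entries equal to the mean get a z-score of 0, matching the cards that follow.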

Answer 12

have z-score Standardization value = 0

Answer 13

have positive z-score Standardization values

Answer 14

have negative z-score Standardization values

Answer 15

to transform variable so that its distribution is closer to normal without changing its basic information

Answer 16

Common transformations:
- Natural log = ln(Bank)
- Square root = √Bank
- Inverse square root = 1/√Bank

Answer 17

mean > median; skewness is positive

Answer 18

mean < median; skewness is negative

Answer 19

mean = median = mode; skewness is zero
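The skewness cards and the transformation card above can be tried together. The sketch below assumes Pearson's second skewness coefficient, 3 × (mean − median) / standard deviation, as the skewness measure because its sign follows directly from comparing mean and median. The cards do not say which measure the course uses, and the `bank` values are invented.

```python
import math
import statistics

def pearson_skewness(values):
    """Pearson's second skewness coefficient: 3 * (mean - median) / standard deviation."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    return 3 * (mean - median) / statistics.stdev(values)

bank = [1, 2, 2, 3, 3, 4, 5, 8, 20, 50]  # made-up, right-skewed values (mean > median)

# The three transformations named on the card, plus the untransformed data.
transforms = {
    "raw": lambda x: x,
    "ln": math.log,
    "sqrt": math.sqrt,
    "inverse sqrt": lambda x: 1 / math.sqrt(x),
}

for name, f in transforms.items():
    transformed = [f(x) for x in bank]
    print(f"{name:>12}: skewness = {pearson_skewness(transformed):.3f}")
```

On this sample the natural log lowers the coefficient compared with the raw data. Which transformation works best depends on the variable, so it is worth checking each one.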

Answer 20

- Values that lie near the extreme limits of the data range
- Outliers may represent errors in data entry

Answer 21

sensitive to outliers

Answer 22

sensitive to variation

Answer 23

- Used to identify outliers
- Robust statistical method, less sensitive to the presence of outliers
- A measure of variability
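One common way to use the IQR for outlier detection is the 1.5 × IQR rule: flag anything below Q1 − 1.5·IQR or above Q3 + 1.5·IQR. The card does not state the cutoff, so the 1.5 multiplier, the function name `iqr_outliers`, and the quartile convention (`method="inclusive"`) are assumptions in this sketch.

```python
import statistics

def iqr_outliers(values):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    # quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

data = [10, 12, 12, 13, 14, 15, 15, 16, 95]  # made-up values with one extreme entry
print(iqr_outliers(data))
```

Because the quartiles ignore how far the extreme values lie from the center, the single large entry barely moves the fences, which is why the method is called robust.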

Answer 24

A statement or claim about a parameter

Answer 25

represents assumed value

Answer 26

represents alternative claim about the value

Answer 27

Methods for estimating and testing hypotheses about population characteristics based on information contained in a sample

Answer 28

Modeling Phase (still discussing which model to use)

Answer 29

Business Understanding Phase (look at, investigate, and scope out)

Answer 30

estimation or prediction (target variable, categorical or continuous, numerical or nonnumerical)

Answer 31

estimation or prediction

Answer 32

z-score equation: $z_i = (x_i - \bar{x}) / s$, where $z_i$ = z-score, $x_i$ = observed value, $\bar{x}$ = mean of the sample, $s$ = standard deviation

Answer 33

$y = \beta_0 + \beta_1 x + \varepsilon$
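For a concrete feel for this model, the sketch below estimates the intercept and slope by ordinary least squares. The card gives only the model equation, so the choice of least squares, the function name `fit_line`, and the data points are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Least-squares estimates (b0, b1) for the line y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    # Intercept: forces the fitted line through the point (x_bar, y_bar).
    b0 = y_bar - b1 * x_bar
    return b0, b1

xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]  # made-up points lying exactly on y = 1 + 2x
b0, b1 = fit_line(xs, ys)
print(b0, b1)

# The residuals play the role of the error term; they are all zero for this data.
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
print(residuals)
```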

Answer 34

– Automatically remove outliers in the variables
– Convert variables to the same scale *
– Helps in computing IQR
– Make interpretation of the results easier

Answer 35

– Replace Missing Values with User-defined Constant
– Replace Missing Values with Mode or Mean/Median
– Replace Missing Values with Random Values
– All of the above *

Answer 36

1. True
2. False * (look at "highly sensitive")
3. It depends on the context
4. It depends on the observations
5. Only 3 and 4

Answer 37

– Reducing the sample size
– Increasing the sample size *
– Changing the standard deviation
– Keeping the sample size constant

Answer 38

None of the above (min-max equation): $x' = (x - x_{\min}) / (x_{\max} - x_{\min})$

Answer 39

set the number of neighbors to compare instances to

Answer 40

Different function

Answer 41

– Misclassification
– Gini Coeff
– Average Squared Error *
– Schwarz’s Bayesian Criterion *
– Average Profit/Loss
– Log Likelihood *

Answer 42

ROC Index * Gini Coefficient *

Answer 43

– Misclassification *
– Gini Coeff
– ASE
– MSE
– Average Profit/Loss *
– Log Likelihood
– Kolmogorov-Smirnov statistic *

Answer 44

– Each variable is evaluated at each node to determine the splitting variable
– The same variable may be used for splitting at different locations in the Decision Tree
– CART (Phi) / information gain criteria can be used for selecting candidate splits
– If not pruned, a stopping criterion in creating a Decision Tree is when the tree reaches the leaf nodes
– All of the above *

Answer 45

- Labels or names used to identify an attribute of each element
- Generally qualitative
- Nominal or ordinal

Answer 46

- Indicates how many or how much
- Either discrete or continuous

Answer 47

- Measures how far a set of numbers is spread out from its average value
- Sample variance: $s^2 = \sum_i (x_i - \bar{x})^2 / (n - 1)$

Answer 48

- When your model memorizes the exact training data but doesn't figure out the pattern in the data
- The model is fit too closely to the training data

Answer 49

- When your model is too simple
- The model isn't complex enough to match the training data

Answer 50

use the two-sample t test for the difference in means

Answer 51

use the two-sample Z test for the difference in proportions

Answer 52

use the test for the homogeneity of proportions

Answer 53

$\Phi(s \mid t) = 2 P_L P_R \sum_j |P(j \mid t_L) - P(j \mid t_R)|$
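A small sketch of evaluating this measure for one candidate split, reading P_L and P_R as the shares of records sent to the left and right child nodes and P(j|t_L), P(j|t_R) as the class proportions within each child. The function name `phi` and the labels are hypothetical, and both children are assumed to be non-empty.

```python
from collections import Counter

def phi(left_labels, right_labels):
    """Goodness of a candidate split, given the class labels sent to each child node."""
    n_left, n_right = len(left_labels), len(right_labels)
    n = n_left + n_right
    p_left, p_right = n_left / n, n_right / n  # P_L and P_R
    left_counts, right_counts = Counter(left_labels), Counter(right_labels)
    classes = set(left_counts) | set(right_counts)
    # Sum over classes j of |P(j|t_L) - P(j|t_R)|; Counter returns 0 for absent classes.
    spread = sum(
        abs(left_counts[j] / n_left - right_counts[j] / n_right) for j in classes
    )
    return 2 * p_left * p_right * spread

print(phi(["good"] * 4, ["bad"] * 4))         # perfectly separating, balanced split
print(phi(["good", "bad"], ["good", "bad"]))  # children with identical class mixes
```

The measure is largest when the split is balanced and the two children have very different class mixes, and it drops to zero when the children look alike.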

(80 cards)