Final Exam Flashcards

1
Q

The classification tree algorithm:

  • estimates how likely a data point is to be a member of one group or the other depending on which group the data points nearest to it are in
  • uses a tree-like structure to illustrate the choices available for each possible decision and its estimated outcome by showing them as separate branches of the tree
  • predicts the probability that an instance is a member of a certain class, based on Bayes' theorem
  • uses an equation based on ordinary least squares regression to predict the probability of the possible categorical outcomes
A

2

2
Q

The naive Bayes classification algorithm:

  • estimates how likely a data point is to be a member of one group or the other depending on which group the data points nearest to it are in
  • uses a tree-like structure to illustrate the choices available for each possible decision and its estimated outcome by showing them as separate branches of the tree
  • predicts the probability that an instance is a member of a certain class, based on Bayes' theorem
  • uses an equation based on ordinary least squares regression to predict the probability of the possible categorical outcomes
A

3
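
A minimal sketch of the Bayes' theorem idea for categorical attributes. The toy records, attribute values, and function names are hypothetical, and no smoothing is applied, so an attribute value never seen with a class drives that class's score to zero:

```python
from collections import Counter

def train_nb(records):
    """records: list of (attribute_tuple, label). Returns the counts used for scoring."""
    class_counts = Counter(label for _, label in records)
    value_counts = Counter()  # keyed by (attribute position, value, label)
    for attrs, label in records:
        for i, value in enumerate(attrs):
            value_counts[(i, value, label)] += 1
    return class_counts, value_counts

def predict_nb(model, attrs):
    """Return P(class | attrs), treating attributes as independent given the class."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total                                  # prior P(class)
        for i, value in enumerate(attrs):
            score *= value_counts[(i, value, label)] / n   # P(value | class)
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}

# Hypothetical training records: (weather, temperature) -> bought?
records = [(("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
           (("rainy", "mild"), "yes"), (("rainy", "hot"), "yes"),
           (("sunny", "mild"), "yes")]
model = train_nb(records)
probs = predict_nb(model, ("sunny", "mild"))
print(probs)
```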

3
Q

The kNN classification algorithm:

  • estimates how likely a data point is to be a member of one group or the other depending on which group the data points nearest to it are in
  • uses a tree-like structure to illustrate the choices available for each possible decision and its estimated outcome by showing them as separate branches of the tree
  • predicts the probability that an instance is a member of a certain class, based on Bayes' theorem
  • uses an equation based on ordinary least squares regression to predict the probability of the possible categorical outcomes
A

1
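
A minimal sketch of the nearest-neighbor vote, assuming Euclidean distance and a simple majority rule. The toy data and the name `knn_predict` are hypothetical:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs; distance is Euclidean.
    """
    by_distance = sorted(train, key=lambda pair: math.dist(pair[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical toy data: two numeric features, two classes.
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
         ((4.0, 4.2), "B"), ((4.1, 3.9), "B"), ((3.8, 4.0), "B")]

print(knn_predict(train, (1.1, 1.0)))  # A
print(knn_predict(train, (4.0, 4.0)))  # B
```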

4
Q

Classification algorithms that do not use assumptions about the structure of the data are ___ algorithms

A

data driven

5
Q

A good use of a classification algorithm would be:

  • estimating the net profit for dishwashers for a major manufacturer
  • identifying the seasonal sales for wood stoves over the last 3 years
  • forecasting sales for a new product
  • upselling or cross-selling to customers through an online store when a customer makes a purchase
A

4

6
Q

In a CART model, classification rules are extracted from:

A

the decision tree

7
Q

The kNN technique is what type of technique?

A

a classification technique

8
Q

In setting up the kNN model:

  • the user allows XLMiner to select the optimal value of k
  • the optimal k is set by the user at 10
  • the data is normalized in order to take into account the categorical variables
  • it is necessary to set an optimal value for k
A

1

9
Q

Below are the 8 actual values of the target variable in the training partition:
(0,0,0,1,1,1,1,1)
What is the entropy of the target variable?

  • -5/8 log2(5/8) - 3/8 log2(3/8)
  • 5/8 log2(5/8) - 3/8 log2(3/8)
  • -3/8 log2(3/8) + 5/8 log2(3/8)
  • -5/8 log2(3/8) + log2(5/8)

A

1
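
The chosen expression can be checked numerically. A short sketch (the function name `entropy` is an illustration, not course code):

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy in bits: -sum(p * log2(p)) over the observed classes."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

target = (0, 0, 0, 1, 1, 1, 1, 1)   # the 8 values from the card
print(round(entropy(target), 4))    # 0.9544
```

A 50/50 split would give the maximum of 1 bit, so 5 ones against 3 zeros is close to maximally uncertain.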

10
Q

Classification problems are distinguished from estimation problems in that

  • classification problems require the output attribute to be numerical
  • classification problems require the output attribute to be categorical
  • classification problems do not allow an output attribute
  • classification problems are designed to predict future outcomes
A

2

11
Q

Which statement is true about the decision tree attribute selection process:

  • a categorical attribute may appear in a tree node several times but a numeric attribute may appear at most once
  • a numeric attribute may appear in several tree nodes but a categorical attribute may appear at most once
  • both numeric and categorical may appear in several tree nodes
  • numeric and categorical attributes may appear in at most 1 tree node
A

2

12
Q
What is the ensemble enhancement that is a method of creating pseudo-data from the data in an original data set?
partitioning
overfitting
sampling
bagging
A

bagging

13
Q
What is the ensemble enhancement that is an iterative technique that adjusts the weight of any record based upon the last classification?
bootstrapping
boosting
sampling
bagging
A

boosting
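
A sketch of one reweighting round, assuming an AdaBoost-style update (the card does not name a specific variant, so the formula here is one common choice, not necessarily the course's). Weights are assumed to sum to 1 and the error to lie strictly between 0 and 1:

```python
import math

def reweight(weights, correct):
    """One AdaBoost-style round: raise weights of misclassified records."""
    err = sum(w for w, ok in zip(weights, correct) if not ok)  # weighted error
    alpha = 0.5 * math.log((1 - err) / err)                    # this round's learner weight
    new = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(new)
    return [w / total for w in new]                            # renormalize to sum to 1

weights = [0.25, 0.25, 0.25, 0.25]
correct = [True, True, True, False]     # hypothetical result of the last classification
new_weights = reweight(weights, correct)
print(new_weights)
```

The one misclassified record ends up carrying half of the total weight, so the next learner concentrates on it.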

14
Q

What is the most often used ensemble enhancement

A

bagging

15
Q

What are the 3 most popular methods for creating ensembles?

  • sampling, summarizing, random forest
  • bagging, boosting, random forest
  • bagging, boosting, clustering
  • overfitting, clustering, sampling
A

2

16
Q

What is one benefit of using an ensemble model?

  • it better establishes the relationship between one dependent variable and multiple independent variables
  • it strengthens the relationship between the multiple independent variables
  • it reduces the number of errors that result
  • it is more efficient at adding and removing predictors
A

3

17
Q

What is the most common use of clustering algorithms?

  • to minimize variance and bias error
  • to segment customers
  • to determine how effectively the model can reorder the data set
  • to validate the data set
A

2

18
Q

In logit, p/(1-p) represents:

A

the odds of success
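
A small numeric sketch of odds and the logit (log-odds) transform. The function names are illustrative:

```python
import math

def odds(p):
    """Odds of success: p / (1 - p)."""
    return p / (1 - p)

def logit(p):
    """Log-odds, the quantity a logit model treats as linear in the predictors."""
    return math.log(odds(p))

def inverse_logit(z):
    """Map a log-odds value back to a probability."""
    return 1 / (1 + math.exp(-z))

print(odds(0.8))                  # about 4, i.e. 4-to-1 odds of success
print(logit(0.5))                 # 0.0, even odds
print(inverse_logit(logit(0.8)))  # about 0.8
```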

19
Q

In a naive Bayes model it is necessary:

  • that all attributes are categorical
  • to partition the data into 3 parts (training, validation, scoring)
  • to set cutoff values to less than .75
  • to have a continuous target variable

A

1 (e.g., gender, blood type); the attributes can never be continuous variables

20
Q

Generally, an ensemble method works better if the individual base models have _____
Assume each individual base model has accuracy greater than 50%
-less correlation among predictors
-high correlation among predictors
-correlation does not have any impact on ensemble output
-none of the above

A

1
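
A sketch of why low correlation helps, under the idealized assumption that the base models make fully independent errors and are combined by majority vote (both assumptions are illustrative, not from the card):

```python
from math import comb

def majority_vote_accuracy(n_models, p):
    """P(majority of n independent models is correct), each correct with probability p."""
    need = n_models // 2 + 1
    return sum(comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
               for k in range(need, n_models + 1))

print(round(majority_vote_accuracy(1, 0.6), 3))   # 0.6
print(round(majority_vote_accuracy(3, 0.6), 3))   # 0.648
print(majority_vote_accuracy(11, 0.6))            # higher still
```

If the models were perfectly correlated they would all make the same mistakes, and the vote would stay at 60% no matter how many were combined.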

21
Q
A dendrogram is used with which analytics algorithm?
text mining
clustering
ensemble models
all of the above
A

clustering

22
Q

What is a bootstrap?

  • a procedure that allows data scientists to reduce the dimensions of the training data set
  • one of many classification-type algorithms
  • a procedure for aggregating many attributes into a few attributes
  • a procedure based on repeatedly and systematically sampling with replacement from the data
A

4

23
Q

what is clustering

  • ensemble algorithm for improving the accuracy of classification models
  • could be thought of as a set of nested algorithms whose purpose is to choose weak learners
  • it is the process of grouping the data into classes or clusters so that objects within a cluster have high similarity in comparison to one another
  • none of the above
A

3

24
Q

Which of the following is not a type of clustering?

  • k-means
  • hierarchical
  • agglomerative
  • splitting
A

4

25
Q

a major part of text mining is to

  • reduce the dimensions of the data
  • generalize the use of modifiers
  • screen the articles from the data set
  • reduce the word count of the text actually used
A

1

26
Q

semantic processing seeks to

  • extract meaning
  • group indiv. terms into bins
  • eliminate “extra” or unnecessary terms from an analysis
  • uncover undefined words or terms in a set of textual data
A

1

27
Q

What is the process of extracting token words from a block of text after performing cleanup procedures?

A

tokenization

28
Q

What would normalized text look like?

  • all duplicate words are removed
  • all stop words removed
  • all spelling errors corrected
  • all text is converted to lower case
A

4

29
Q

What would the result be if you were asked to use stemming to these terms: agreed, agrees, agreeable, agreeing?

A

all terms would change to agree
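
A toy illustration of the idea. The suffix list below is hypothetical and chosen only for these four words; real stemmers such as Porter's use many more rules and may produce a stem that is not a dictionary word:

```python
SUFFIXES = ("able", "ing", "s", "d")   # toy list, not a real stemming rule set

def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 leading characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

words = ["agreed", "agrees", "agreeable", "agreeing"]
print({w: toy_stem(w) for w in words})   # every value is 'agree'
```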

30
Q

What type of standard diagnostic is used for text mining algorithms?

A

lift charts and confusion matrices

31
Q

A model that goes beyond a bag of words analysis and assigns and defines consumer sentiment to words would be a ___ model

A

NLP

32
Q

Which of the following are other procedures that could be used to reduce the text dimensions to prepare for analysis?

  • numbers and items that appear to be monetary values are removed
  • words of more than 20 letters in length are removed
  • headers and page numbers are removed
  • duplicates of all words are removed
A

1,2,3

33
Q

What is entity extraction?

A

identifying a group of words as a single item

34
Q

The words extracted from a block of text after the cleanup procedures have been performed are

A

tokens

35
Q

latent semantic indexing:

  • uses SVD to identify patterns in the relationships between terms and concepts
  • reduces the dimensions of the text by treating all versions of the same (or a very similar) concept identically
  • collates the most common words and phrases and identifies them as keywords
  • identifies a group of words as a single item
A

1,3

36
Q

What is a method for clearing away clutter in raw text documents and extracting useful characteristics to serve as attributes?

A

dimension reduction

37
Q

What algorithm takes a large number of words and compresses them into a much smaller number of linear combinations?

A

SVD
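
As a sketch of the compression, assume $A$ is a hypothetical term-by-document count matrix with $m$ terms and $n$ documents (this notation is an assumption, not taken from the cards). A truncated SVD keeps only the $k$ largest singular values:

```latex
A_{m \times n} \approx U_k \, \Sigma_k \, V_k^{\top},
\qquad U_k \in \mathbb{R}^{m \times k},\quad
\Sigma_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k),\quad
V_k \in \mathbb{R}^{n \times k},\quad k \ll m
```

Each column of $U_k$ is a linear combination of the original $m$ terms, so every document is described by $k$ numbers instead of $m$ word counts.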

38
Q

Which of the following best describe target leakage?

  • it is difficult to detect and harder to eliminate
  • it is the difference between the expected prediction of a model and the correct value that is targeted
  • it allows an algorithm to make predictions that are too good to be true
  • it is the introduction of information about the text mining target that should not legitimately be available to the algorithm
A

1,3,4

39
Q

the process of collecting data from websites is

A

web scraping

40
Q

What is the goal of text analytics?

to reduce the dimensions of the ___ text to manageable attributes that can be used in data mining algorithms

A

unstructured

41
Q

What is the most difficult decision to make when considering the use of text as data?

  • to know the attributes of the data
  • the format of the data
  • to define the problem you are trying to solve
  • the data mining algorithm that will be used
A

3

42
Q

What looks at unprocessed text as a collection of words without regard to grammar?

A

bag of words
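
A minimal bag-of-words sketch using only word counts. The tokenizing regex and the function name are illustrative choices:

```python
import re
from collections import Counter

def bag_of_words(text):
    """Count each lower-cased word, ignoring order and grammar."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

bag = bag_of_words("The dog chased the cat. The cat ran.")
print(bag["the"], bag["cat"], bag["dog"])   # 3 2 1

# Word order carries no information in this representation:
print(bag_of_words("dog bites man") == bag_of_words("man bites dog"))   # True
```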

43
Q

Which of the following describe stemming?

  • stem would be the token remembered and used in place of all other forms of this word
  • stemming reduces words to their stem
  • a stem word needs to be a real word, not a made up word
  • reduces the dimensions of the text by treating all versions of the same concept identically
A

1,2,4

44
Q

What uses SVD to identify patterns in the relationships between the terms and concepts?

A

latent semantic indexing

45
Q

“the erupting guyser with diet coke and memtos is a fun experiment for kids working at home”
what is the final step in dimension reduction?
-correct the spelling
-eliminate the stop words
-identify associations
-perform stemming

A

1

46
Q

“the erupting guyser with diet coke and memtos is a fun experiment for kids working at home”
what is the 2nd step in dimension reduction?
-correct the spelling
-eliminate the stop words
-identify associations
-perform stemming

A

4

47
Q

What two main problems is the random forest methodology used to address?

  • settings where the # of attributes is much smaller than the # of records
  • assess and rank attributes w respect to their ability to predict the classification
  • construct classification rules for a learning problem
  • yield a classification given the attributes for current observations
A

2,3

48
Q

What clustering algorithm is an iterative process that uses some "best" criterion over multiple passes to see if it can improve the clusters?
k-means
hierarchical method

A

k-means

49
Q

What are the two basic types of clustering algorithms?

A

k-means and hierarchical

50
Q

Which of the following would be correct about the random forest algorithm?

  • it is limited to a random subset of attributes at each stage of the alg
  • it is based on applying bagging to a decision tree algorithm that also samples attributes in addition to records
  • it produces more accurate predictions than a simple CART
  • it is a collection of 1 CART tree that is independent when constructed
A

1,2,3

51
Q

What is the most common form of unsupervised learning?

A

clustering

52
Q

What is a decision stump? It is a decision tree with _____ root nodes

A

1

53
Q

Which of the following correctly compare the traditional CART model and random forest?

  • overfitting can be a problem with both CART and random forest
  • random forest produces more accurate predictions than a simple CART
  • random forest samples records like CART, but it also samples the attributes
  • CART and random forest have the same number of steps
A

1,2,3

54
Q

What is the k-means clustering alg?

A

an iterative process that uses some "best" criterion over multiple passes to see if it can improve the clusters
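
A sketch of those repeated passes on one-dimensional data, assuming the usual criterion of assigning each point to its nearest center. The data and starting centers are hypothetical:

```python
def kmeans_1d(points, centers, max_passes=100):
    """Plain k-means on 1-D data; `centers` holds the starting guesses."""
    for _ in range(max_passes):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:   # nothing moved, so another pass cannot improve
            break
        centers = new_centers
    return centers, clusters

points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centers, clusters = kmeans_1d(points, centers=[0.0, 5.0])
print(centers)    # [1.5, 10.0]
print(clusters)   # [[1.0, 1.5, 2.0], [9.0, 10.0, 11.0]]
```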

55
Q

If the correlation coefficient is .983, it indicates that

A

there appears to be a strong positive linear association bw x and y
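
A sketch of how such a coefficient is computed (Pearson's r). The data values are hypothetical and were made up to lie close to a straight line:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]   # hypothetical, roughly y = 2x
r = pearson_r(xs, ys)
print(r)   # close to +1: strong positive linear association
```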

56
Q

What collates common words and identifies them as keywords?

A

latent semantic indexing

57
Q

3 steps in dimension reduction

A
  1. eliminate stop words
  2. perform stemming
  3. correct spelling errors
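
The three steps can be sketched on the sentence used in the earlier cards. The stop-word list, the one-rule stemmer, and the spelling lookup are all hypothetical stand-ins for what a real text mining tool would supply:

```python
import re

STOP_WORDS = {"the", "with", "and", "is", "a", "for", "at"}   # hypothetical list
SPELLING = {"guyser": "geyser", "memtos": "mentos"}            # hypothetical lookup

def toy_stem(word):
    """Toy rule: strip a trailing 'ing' only; real stemmers use many more rules."""
    return word[:-3] if word.endswith("ing") and len(word) > 5 else word

def reduce_dimensions(text):
    tokens = re.findall(r"[a-z]+", text.lower())           # normalize and tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]    # 1. eliminate stop words
    tokens = [toy_stem(t) for t in tokens]                 # 2. perform stemming
    return [SPELLING.get(t, t) for t in tokens]            # 3. correct spelling errors

sentence = ("the erupting guyser with diet coke and memtos is a fun "
            "experiment for kids working at home")
print(reduce_dimensions(sentence))
# ['erupt', 'geyser', 'diet', 'coke', 'mentos', 'fun', 'experiment', 'kids', 'work', 'home']
```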
58
Q

What reduces phrase to basic identity?

A

phrase reduction

59
Q

what is the process of extracting “token” words from a block of text after performing cleanup?

A

tokenization

60
Q

What takes a large number of words and compresses them into a small number of linear combinations

A

singular value decomposition (SVD)

61
Q

Error due to the difference between the expected prediction of our model and the correct value we're trying to predict

A

bias

62
Q

Error due to the variability of a model's prediction for a given point

A

variance

63
Q

What ensemble technique adjusts the weight of any given record based upon the last classification?

A

boosting

64
Q

What is a method of creating pseudo-data from data in an original set?

A

bagging (bootstrap)
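
A sketch of how bootstrap pseudo-data is usually drawn: each new data set has the same size as the original and is sampled with replacement, so some records repeat and others are left out. The seed and names are illustrative:

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) records from `data`, sampling with replacement."""
    return [rng.choice(data) for _ in data]

rng = random.Random(0)          # fixed seed so the sketch is repeatable
data = list(range(10))          # hypothetical record IDs
samples = [bootstrap_sample(data, rng) for _ in range(3)]
for s in samples:
    print(sorted(s))
```

In bagging, one model is fit to each pseudo-data set and their predictions are combined by vote or average.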

65
Q

What are the 4 types of classification models?

A

KNN, classification trees (CART, decision tree, regression tree), naive bayes, logit

66
Q

The naive Bayes classification technique could best be described by:

  • it's comparable in performance to decision trees
  • is fast and accurate
  • uses statistical classifiers
  • uses the same algorithm as the regression decision tree
A

1,2,3

67
Q

How do data mining trees look?

A

upside down tree w leaves at bottom and root on top

68
Q

A good classification tree will make the best split first, followed by decision rules that are made up with:

  • successively larger and larger numbers of training records
  • successively smaller and smaller numbers of training records
A

2

69
Q

The most important distinction bw a logit and ordinary regression is that the dependent variable is

A

categorical not continuous

70
Q

classification is ___ learning

A

supervised

71
Q

What is a mathematical concept that measures the uncertainty associated w random variables

A

info entropy

72
Q

What does the k in KNN refer to

  • # of nearest neighbors used in determining a category correctly
  • # of char. of the unknown used in determining a category correctly
  • # of unknowns in a category
  • # of attributes nearest the unknown
A

1

73
Q

What classification technique predicts numeric quantities?

A

classification tree

74
Q

What are the steps in the data mining process?

A
Sample
Explore
Modify
Model
Assess
75
Q

What are the 4 characteristics of data mining

A
  1. volume: the size of the data set; data sets are large because analytics can use unstructured data
  2. velocity: the rate at which data arrives is expected to be fast, so algorithms must be fast in order to be useful
  3. variety: the types of data available; both structured and unstructured data, not just data from Excel sheets
  4. value: the valuation of data; much past data was not useful, so the data scientist's job is to determine what is useful and what is not
76
Q

What is the common form of prediction in data mining

A

classification tool

77
Q

What is the primary goal for data mining?

  • to model the noise in the data
  • to overfit the model so there’s a low misclassification rate
  • to have accuracy and fit as characteristics of the model
  • to do a good job of representing our known data set
A

3,4

78
Q

Which correctly define a data warehouse?

  • a firm's central repository of integrated historical data
  • a location where data is stored
  • the memory of the firm
  • collective info on every aspect of what has happened in the past
A

1,3,4

79
Q

Which of the following describes a data mart:

  • collective info on every aspect of what has happened in the past for a company
  • holds info that is specialized and has been grouped or chosen specifically
  • a firm's central repository of integrated historical data
  • a subset of a data warehouse
A

2,4

80
Q

What does data mining refer to

  • the physical tools used to access data and make predictions
  • knowledge gained from mass data
  • patterns in a mass of data
  • tools that are used in the large scale or big data arena
A

4

81
Q

Which of the following are true regarding data mining and business forecasting?

  • in data mining you simultaneously search for different patterns in parallel, but in business forecasting you search for set patterns
  • for business forecasting, the expectation is that the data will contain some level of variation, whereas in data mining patterns are not prespecified
  • in data mining you are searching for seasonal variability, but in business forecasting you are searching for trend patterns only
A

1,2,3

82
Q

Which of the following correctly contrast data mining with database management?

  • queries are well defined in database management but less structured in data mining
  • data mining is more forward-looking, whereas database management is more focused on the past
  • a query in database management would be “find all customers in Atlanta”; in data mining it would be “group all customers with similar buying habits”
  • database management is extracting useful info from large, unstructured databases, whereas data mining is extracting specialized or grouped data
A

1,2,3

83
Q

classification tools distinguish bw:

  • data concepts and objects
  • data objects and classes
  • data classes or fields
  • data classes or concepts
A

4

84
Q
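Match each data mining term with its forecasting equivalent: target, algorithm, feature/attribute, record, score
A
target: dependent variable
algorithm: forecasting model
feature/attribute: explanatory variable
record: observation
score: forecast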
85
Q

What are the reasons for sampling or partitioning in data mining?

  • it is common practice in database management to set aside a portion of the data
  • partitioning and testing for accuracy are standard practice in analytics
  • it has its roots in the “holdout” and “holdbacks” used for standard forecasting models
  • in most cases, the entire data set is not needed to build a model
A

2,3,4

86
Q

what is the process called of transforming text into numbers

A

datafication

87
Q

bag of words analysis looks at what?

A

unprocessed text as a collection of words w/out regard to grammar

88
Q

The frequency of any single word is inversely proportional to its rank in the frequency table. What law is this?

A

Zipf's law

89
Q

What is the most common mistake in text mining?

A

target leakage: the introduction of information that should not be available to the algorithm

90
Q

In the Universal Bank data in this chapter, only 10% of the records represented customers who had taken out a personal loan (the target variable). If we were to score a new customer based upon the attributes we used in the algorithm, we would be accurate in the prediction about 90% of the time if we always scored the individual as “not accepting a loan,” because that is indeed what most customers have done in the past. Why not accept being right 90% of the time with this very simple decision rule?

A

Because with data mining we have access to the information and tools that can help us do better than predicting correctly 90% of the time. So in this scenario, we could look at our lift chart and find the customers with the highest probability of accepting a personal loan, and market to them, in order to have a better chance of finding people who will accept the loan.

91
Q

Data has the characteristic of being nonrivalrous. Explain its importance

A

Nonrivalry: characteristic that means that one person’s use of the good to create value does not diminish the value another can extract from the data.
It’s important to realize that data has this characteristic so more researchers and data scientists can use data, because every time the data set is used, it can be used to obtain different results. Every researcher can use a data set with a different purpose and get different conclusions.

92
Q

The lift chart and the confusion matrix are both standard diagnostic tools used to evaluate a data mining algorithm. Don’t the two measures display the same info? Explain the difference between the two measures.

A

Both the confusion matrix and the lift chart provide information about model performance but display the information in different ways.
Confusion matrix: This shows model performance. There is a confusion matrix for both the validation data and the training data. Most often, the results from the validation model are most relevant since they show how the model performed on unseen data. The validation confusion matrix shows model performance in classification on data that was not used to build the model. Gives results for the amount of correct classifications and the misclassifications.
Lift chart: This is the standard for accuracy in data mining. These charts help to determine how effectively the model can reorder the data set, by placing the individuals who have the highest probability of success on top, and those with the lowest probability of success on bottom. By looking at the chart, you can determine how well your model is doing compared to a naïve model.
Confusion matrix: what you got right and wrong
Lift chart: for each % of the data, how much you got right
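
A sketch contrasting the two diagnostics on the same scored records. The validation data, the 0.5 cutoff, and the function names are hypothetical:

```python
def confusion_matrix(actual, scores, cutoff=0.5):
    """Return (TP, FP, FN, TN) when scores >= cutoff are classified as 1."""
    tp = fp = fn = tn = 0
    for y, s in zip(actual, scores):
        predicted = 1 if s >= cutoff else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def lift_at(actual, scores, fraction):
    """Success rate in the top-scored fraction divided by the overall success rate."""
    ranked = [y for _, y in sorted(zip(scores, actual), reverse=True)]
    top = ranked[: max(1, round(len(ranked) * fraction))]
    return (sum(top) / len(top)) / (sum(actual) / len(actual))

# Hypothetical validation data: 10 records, 2 successes.
actual = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.2, 0.1, 0.4, 0.3, 0.05, 0.15, 0.6, 0.1, 0.2]

print(confusion_matrix(actual, scores))   # (1, 1, 1, 7)
print(lift_at(actual, scores, 0.3))       # about 3.33
```

The confusion matrix depends on the cutoff, while lift depends only on how well the scores rank the records; here the top 30% of records contain successes at about 3.3 times the overall rate.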

93
Q

Why do we need to sample the data?

A

Because we need to see how the model works on the current data set and how it will work in the real world
The risk in ignoring this step is creating bias. If a data scientist uses the same data to both build and test the model, and that model is overfit, then most likely the results will also be overfit.
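
A minimal partitioning sketch, assuming a 60/40 training/validation split. The fraction, seed, and function name are illustrative, not taken from the course:

```python
import random

def partition(records, train_fraction=0.6, seed=0):
    """Shuffle a copy of the records and split it into training and validation sets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

records = list(range(10))            # hypothetical record IDs
train, valid = partition(records)
print(len(train), len(valid))        # 6 4
```

The model is built on `train` only; its performance on `valid` indicates how it may behave on data it has not seen.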

94
Q

Structured vs unstructured

A

Structured data: data that has a predefined data model
Unstructured data: data that does not have a predefined data model
Unstructured data is the more prevalent form of data because it comes in many different forms that we are exposed to daily
Excel spreadsheet: structured
A thousand text files: unstructured
A thousand video images: unstructured
A thousand audio files: unstructured

95
Q

Some data mining algorithms work so well they have a tendency to overfit the data. What does this mean and what difficulties does overlooking it cause for the data scientist?

A

Overfitting: When we put too many attributes (or try to account for too many patterns) in a model, including some unrelated to the target.
If a data scientist overfits their data they will incorrectly explain some variation in the data that is nothing more than a chance variation. In other words, they will have mislabeled the noise in the data as part of the “true signal”

96
Q

To find if the coefficients are statistically significant..

A

|t-stat| > 2 means the coefficient is statistically significant (roughly the 5% level)

97
Q

What does R2 say?

A

Explains ___% of the variation in the data
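
A sketch of how the blank gets filled in, using the standard definition R² = 1 - SS_res / SS_tot. The actual and predicted values are hypothetical:

```python
def r_squared(actual, predicted):
    """Share of the variation in `actual` that the predictions account for."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))   # unexplained
    ss_tot = sum((a - mean) ** 2 for a in actual)                   # total
    return 1 - ss_res / ss_tot

actual = [1, 2, 3, 4]
predicted = [1.1, 1.9, 3.2, 3.8]
print(round(r_squared(actual, predicted), 2))   # 0.98, i.e. explains 98% of the variation
```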

98
Q

How do you monitor autocorrelation?

A

If the Durbin-Watson (DW) statistic is between 1.5 and 2.5, you have no first-order serial correlation (rule of thumb)
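
A sketch of the statistic itself, using the standard formula: the sum of squared changes in consecutive residuals divided by the sum of squared residuals. The residual series are hypothetical extremes:

```python
def durbin_watson(residuals):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2); values near 2 suggest no serial correlation."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

print(durbin_watson([1, 1, 1, 1]))     # 0.0, strong positive serial correlation
print(durbin_watson([1, -1, 1, -1]))   # 3.0, negative serial correlation
```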