Exam 2 Flashcards

1
Q

What does the straight line on the lift chart represent

A

expected number of positives on any class we would predict if we used the naive model
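
A minimal sketch of that comparison, assuming made-up validation labels and model scores (the function names and data are illustrative, not from the course material). It builds the model's cumulative curve and the straight naive line it is judged against.

```python
def cumulative_gains(actual, scores):
    """Cumulative positives found when records are sorted by model score, high to low."""
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    gains, running = [], 0
    for _, label in ranked:
        running += label
        gains.append(running)
    return gains


def naive_gains(actual):
    """Straight reference line: expected positives if records are picked at random."""
    rate = sum(actual) / len(actual)
    return [rate * (i + 1) for i in range(len(actual))]


# Hypothetical validation data: 1 = positive class, with made-up model scores.
actual = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.3, 0.35, 0.05, 0.15]

model_curve = cumulative_gains(actual, scores)
naive_line = naive_gains(actual)
print(model_curve)
print([round(v, 1) for v in naive_line])
```

The further the model curve sits above the straight line, the more the model's ranking helps compared with picking records at random.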

2
Q

What is a validation set used to do?

A

compare models and pick the best one

3
Q

Estimating a model that explains the training set data points perfectly and leaves little error but that is unlikely to be accurate in prediction is

A

overfitting

4
Q

With most data mining techniques, why do we partition the data?

A

in order to judge how our model will do when we apply it to new data
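
One way to sketch the partition, assuming a simple random 60/40 split (the fraction, seed, and function name are illustrative assumptions, not a prescribed procedure):

```python
import random


def partition(records, train_fraction=0.6, seed=42):
    """Shuffle a copy of the records and split it into training and validation sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


records = list(range(100))  # stand-in for 100 data rows
training, validation = partition(records)
print(len(training), len(validation))
```

The model is built on `training` only; `validation` stands in for the new data the model has not seen.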

5
Q

What data mining technique groups objects together based upon maximizing the intraclass similarity and minimizing interclass similarity

A

clustering
inter- = among
intra- = within

6
Q

What are the tools and techniques that are used in the large scale or big data arena

A

data mining

7
Q

What new mindset is needed to begin data mining using big data

A

we will need to be open to finding relationships and patterns we never imagined existed in the data we are about to examine

8
Q

In “The Big Data Future Has Arrived” by Michael Malone, the statement is made that:

  • metadata is more important than the big data itself
  • the major challenge to big data analysis will be overcome bc the fruits of big data are too valuable
  • discovery of this “metadata” may prove to be the undoing of big data analysis
  • that privacy issues will prevent big data analysis from advancing beyond what we’ve already seen
A

2

9
Q

What are the four categories of analytic tool available in data mining?

A

prediction, classification, clustering, association

10
Q

What data mining tool allows us to predict a class of objects whose label is unknown to us?

A

prediction

11
Q

Forecasting model in stats, is what in data mining?

A

algorithm

12
Q

Data mining term “score” is known as __ in stats

A

forecast

13
Q

What stat terminology is referred to as a record in data mining terminology?

A

observation

14
Q

What are the 5 steps identified by SAS for the data mining process

A

sample, explore, modify, model, and assess

15
Q

The data mining process step that involves creating, selecting, or transforming data is called

A

modify

16
Q

The data mining process step that involves data cleansing is called

A

explore

17
Q

In The invisible digital hand, the replacement of the visible hand in competition by the digitized hand…

  • is usually accompanied by fewer firms in the marketplace
  • could result in less price comparison and more impulse buying
  • can give rise to anticompetitive behavior
  • does not give rise to the “frenemy” relationships
A

3

18
Q

Most economic time series are integrated in what order?

A

one
Can’t use ARIMA on a series with a trend. Must difference it first (the I in ARIMA stands for integrated and counts the differences taken). Most economic time series are integrated of order one
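
A small sketch of first differencing, assuming a made-up series with a linear trend (the function name and data are illustrative):

```python
def difference(series, order=1):
    """Apply first differencing `order` times: y[t] - y[t-1]."""
    for _ in range(order):
        series = [b - a for a, b in zip(series, series[1:])]
    return series


# Hypothetical trending series; one round of differencing removes the trend.
trending = [10, 12, 14, 16, 18, 20]
print(difference(trending))
```

Each round of differencing shortens the series by one observation.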

19
Q

Which of the following models utilizes a transformed series to induce a stationary series?

  • ARIMA(1,0,1)
  • ARIMA(1,0,0)
  • ARIMA(1,1,1)
  • ARIMA(0,0,1)
A

3: the I (the d term) has to be 1 bc that is the term that transforms (differences) the series

20
Q

Which of the following is NOT a char of a time series best represented as an ARIMA (3,0,1)

  • og series is stationary
  • autocorrelation function has one dominant spike
  • the partial autocorrelation function has one dominant spike
  • the partial autocorrelation function has 3 spikes
  • none are correct
A

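the partial autocorrelation function has one dominant spike (w 3 AR terms the PCF should show 3 spikes; the one dominant spike in the ACF matches the single MA term)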

21
Q

Which of the following is not a first step in the ARIMA model selection process

  • examine the ACF of the raw series
  • examine the PCF of the raw series
  • test the data for stationarity
  • estimate an ARIMA (1,1,1) model for reference purposes
  • all of the options are correct
A

4

22
Q

What is the Q stat based on?

A

estimated autocorrelation function

23
Q

What is the Q stat used to test?

A

whether a series is white noise or not

24
Q

T/F the Q stat follows the chi squared distr.

A

T

25
Q

What tests whether the residual autocorrelations as a set are sig diff from zero

A

q stat

26
Q

ARIMA models require that data be….

A

stationary

27
Q

In what situation does the ARIMA model have a decided advantage over standard regression models?

A

when we don’t know the predictors of the variable to be forecast

28
Q

The philosophy of the box-jenkins methodology of using ARIMA model assumes what?

A

that the series we are observing started as white noise and was transformed by the black box process into the series

29
Q

Which of the following does the box-jenkins methodology of using ARIMA models attempt to discern?

  • that the correct black box could have produced explanatory variables
  • that the correct black box could have produced an observed time series
  • that the correct black box could have produced such a series from white noise
  • that the correct black box could have produced a patterned time series
A

that the correct black box could have produced such a series from white noise

30
Q

How could you graphically describe the process of the box-jenkins methodology of using ARIMA?

A

white noise -> black box -> observed time series

31
Q

A moving average model is simply one that predicts Yt as a function of the __ in predicting Yt

A

white noise

32
Q

How is the equation for the autoregressive model diff than the equation for the MA model

A

the dep variable depends on its own previous values rather than the white noise series or residuals
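
To make the contrast concrete, here is a rough sketch of first-order versions of both models. The function names, coefficients, and the plus sign on the MA term are assumptions for illustration (some texts write the MA term with a minus sign).

```python
def ma1(noise, theta, mean=0.0):
    """MA(1): Y[t] = mean + e[t] + theta * e[t-1], built from the white noise only."""
    return [mean + noise[t] + theta * (noise[t - 1] if t > 0 else 0.0)
            for t in range(len(noise))]


def ar1(noise, phi, start=0.0):
    """AR(1): Y[t] = phi * Y[t-1] + e[t], built from the series' own previous value."""
    series, previous = [], start
    for shock in noise:
        previous = phi * previous + shock
        series.append(previous)
    return series


noise = [1.0, 0.0, 0.0, 0.0]  # a single shock, to show how each model responds
print(ma1(noise, theta=0.5))
print(ar1(noise, phi=0.5))
```

In the MA(1) series the shock disappears after one period, while in the AR(1) series it dies out gradually because each value feeds into the next.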

33
Q

Why is stationarity important?

A

bc a series needs to be stationary before you identify the correct model

34
Q

What is one method to help us achieve stationarity?

A

differencing

35
Q

An ARIMA(p,d,q) model is one that has had differencing used to make a time series

A

stationary

36
Q

Using the Ljung-Box stat applied to a sample w 30 degrees of freedom we cannot reject the null of a white noise process if the sample Q-value is less than __ at the 10% level of significance

A

40 (approx.; the chi squared critical value w 30 degrees of freedom at the 10% level is about 40.26)
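
The cutoff can be sanity-checked without a chi squared table. This sketch uses the Wilson-Hilferty approximation, which is a swapped-in shortcut and not an exact table lookup; the function name is illustrative.

```python
from statistics import NormalDist


def chi2_critical_approx(df, alpha):
    """Approximate upper-tail chi squared critical value (Wilson-Hilferty)."""
    z = NormalDist().inv_cdf(1 - alpha)
    term = 2 / (9 * df)
    return df * (1 - term + z * term ** 0.5) ** 3


critical = chi2_critical_approx(df=30, alpha=0.10)
print(round(critical, 1))
```

A sample Q-value below this cutoff means the white noise null cannot be rejected at the 10% level.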

37
Q

What is the null hypothesis being tested using the Ljung-box stat?

A

the autocorrelations as a set are jointly equal to zero
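
A minimal sketch of the statistic itself, Q = n(n+2) * sum of r_k^2 / (n-k) over the lags tested. The helper names and the alternating example series are made up for illustration.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of the series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    numer = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return numer / denom


def ljung_box_q(series, max_lag):
    """Ljung-Box Q over lags 1..max_lag."""
    n = len(series)
    return n * (n + 2) * sum(
        autocorrelation(series, k) ** 2 / (n - k) for k in range(1, max_lag + 1)
    )


# Hypothetical series that alternates sign, so it is clearly not white noise.
series = [1.0, -1.0] * 10
q_value = ljung_box_q(series, max_lag=1)
print(round(q_value, 2))
```

A large Q relative to the chi squared critical value leads to rejecting the null that the autocorrelations are jointly zero.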

38
Q

What problem arises when applying ARIMA type models to highly seasonal monthly data?

A

extremely high order AR and MA processes

39
Q

Besides using sophisticated ARIMA type models capable of internally handling data seasonality, an alt is to use which of the following?

  • seasonal dummy variable
  • trend dummy variables
  • deseasonalized data, then reseasonalize to generate forecasts
  • Holt’s smoothing
  • all are correct

A

deseasonalized data

40
Q

What is the key diff bw ARIMA type models and multiple regression models?

A

use of explanatory variables (arima doesn’t have any)

41
Q

In the classical time series decomposition algebraic model, Y=TxSxCxI, what is C

A

measurement of the very long term movements of the data that are often thought of as waves

42
Q

In the classical time series decomp algebraic model, Y=TxSxCxI, what is I

A

measurement of the irregular movement or random variations in the series

43
Q

If a biz cycle always had the same vertical distance from trough to peak, it would be called

A

constant amplitude

44
Q

What is the centered moving average

A

the series that remains after the seasonality and irregular components have been smoothed out by using moving averages

46
Q

A classification model’s misclassification rate on the validation data is a better measure of the model’s predictive ability on new (unseen) data than its misclassification rate on the training data. Explain whether this statement is accurate and why that is so

A

This statement is accurate. The model was built on the training data, so its misclassification rate there is optimistic. The validation data were not used to build the model, so the misclassification rate on the validation data is a better indicator of the model’s predictive ability than the rate on the training data.
The misclassification rate on the validation set is a better measure because we want to see how well the model can function on unseen data.

47
Q

The 1st step in data mining procedures according to SAS and IBM is to “sample” the data. Sampling here refers to dividing the data available for analysis into at least two parts: a training data set and a validation data set. Why do both SAS and IBM/SPSS recommend this as a first step? What are the risks of ignoring this procedural requirement?

A

Both SAS and IBM recommend sampling as the first step since we need the training data set to build the model and the validation data set to test the model’s accuracy. The risk in ignoring this step is creating bias. If a data scientist uses the same data to both build and test the model, and that model is overfit, then the test results will most likely look better than the model will actually perform.
We need to see how this model will work on the data we have and in the real world.

48
Q

How do unstructured and structured data differ? Which is the more prevalent form of data? How would the following be classified: # in excel spreadsheet, text files, video images, audio files

A

Structured data: data that does have a predefined model
Unstructured data: Data that does not have a predefined data model
Unstructured data is a more prevalent form of data because it comes in many different forms which we are exposed to daily
Excel spreadsheet: structured data
A thousand text files: unstructured
A thousand video images: unstructured
A thousand audio files: unstructured

49
Q

Some data mining algorithms work so well they have the tendency to overfit the training data. What does the term overfit mean, and what does overlooking it cause for the data scientist?

A

Overfitting: When we put too many attributes (or try to account for too many patterns) in a model, including some unrelated to the target.
If a data scientist overfits their data they will incorrectly explain some variation in the data that is nothing more than a chance variation. In other words, they will have mislabeled the noise in the data as part of the “true signal”
If you overfit, you model the noise in the data. If you model it, then replicate it, your model will have a great fit on the training data, but a low accuracy on new data

50
Q

What are ways to make a forecast less biased

A

diff forecast methods
diff forecasters
diff sources of data

51
Q

For a combined forecast to be unbiased, each of the forecasts cannot

A

consistently over or underestimate any values

52
Q

If we were to score a new cust based upon the attributes we used in the algorithm, we would be accurate in the prediction about 90% of the time if we always scored the indiv as “not accepting a loan” bc that indeed is what most cust. have done in the past. Why not accept being correct 90% of the time w this very simple decision rule?

A

Because with data mining we have access to the information and tools that can help us do better than predicting correctly 90% of the time. So in this scenario, we could look at our lift chart and find the customers with the highest probability of accepting a personal loan, and market to them, in order to have a better chance of finding people who will accept the loan.

53
Q

Data has the char of being nonrivalry. What is this and why is it IMP to realize that data has this char?

A

Nonrivalry: characteristic that means that one person’s use of the good to create value does not diminish the value another can extract from the data.
It’s important to realize that data has this characteristic so more researchers and data scientists can use data, because every time the data set is used, it can be used to obtain different results. Every researcher can use a data set with a different purpose and get different conclusions.

54
Q

The lift chart and the confusion matrix are both standard diagnostic tools used to evaluate a data mining algorithm. Don’t the two measures display the same info? Explain any diff between the two measures?

A

Confusion matrix: This shows model performance. There is a confusion matrix for both the validation data and the training data. Most often, the results from the validation model are most relevant since they show how the model performed on unseen data. The validation confusion matrix shows model performance in classification on data that was not used to build the model. Gives results for the amount of correct classifications and the misclassifications.
Lift chart: This is the standard for accuracy in data mining. These charts help to determine how effectively the model can reorder the data set, by placing the individuals who have the highest probability of success on top, and those with the lowest probability of success on bottom. By looking at the chart, you can determine how well your model is doing compared to a naïve model.
confusion- what you misclassified vs classified correctly
lift- for each % of data, how many you got right or wrong
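
A rough sketch of the confusion-matrix side, assuming a made-up set of ten customers where one accepts the loan and a naive rule that always predicts "no" (names and data are illustrative):

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for a 0/1 classification."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == 1:
            counts["TP" if a == 1 else "FP"] += 1
        else:
            counts["TN" if a == 0 else "FN"] += 1
    return counts


def misclassification_rate(counts):
    """Share of records classified incorrectly."""
    return (counts["FP"] + counts["FN"]) / sum(counts.values())


# Hypothetical validation data: 1 = accepted the loan.
actual = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
always_no = [0] * len(actual)  # naive rule: nobody accepts

counts = confusion_counts(actual, always_no)
print(counts, misclassification_rate(counts))
```

The naive rule is right 90% of the time here yet finds no acceptors at all (TP = 0), which is the kind of gap a lift chart is meant to expose.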

55
Q

What’s the 1st step in combining forecasts? (2 parts)

A
  • make sure the # of rows of hist data exceed the # of forecast values
  • consider how the data should be set up
56
Q

For ensemble models, what are two IM char of the 45 degree line for a perf forecast?

A

slope is equal to one

intercept is equal to zero
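
These two properties can be checked by regressing actual values on forecast values. The sketch below uses made-up numbers and an illustrative helper name.

```python
def simple_ols(x, y):
    """Intercept and slope from regressing y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    return mean_y - slope * mean_x, slope


forecast = [100.0, 110.0, 120.0, 130.0]
perfect_actual = list(forecast)              # forecasts hit every actual value
shifted_actual = [f + 5.0 for f in forecast]  # forecasts are always 5 too low

print(simple_ols(forecast, perfect_actual))
print(simple_ols(forecast, shifted_actual))
```

The perfect forecast gives intercept 0 and slope 1; the consistently off forecast keeps slope 1 but shows a nonzero intercept, which signals bias.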

57
Q

What would a graph of a downward bias forecast look like?

A

forecast values above the perf forecast line

58
Q

What info is likely lost if a particular forecast is ignored bc it is not the best forecast?

  • the discarded forecast may make use of a type of relationship ignored by the best forecast
  • judgemental data included in the discarded forecast may not be included in the best forecast
  • data included in the discarded forecast may not be included in the “best” forecast
  • Some variables included in the discarded forecast may not be included in the “best” forecast
A

1 and 4

59
Q

how can you obtain forecast improvement?

A

by combining forecasts from diff models

60
Q

What is the most common source of bias in forecasting?

A

preconceived notions of the forecaster

61
Q

if a forecast model w a lower MAPE is more heavily weighted, the combined forecast will ___

A

improve

62
Q

Steps to combine forecasts

A
  1. first consider how the data should be set up
  2. regress the actual values of the variable to be forecast on the two forecast results for the historic period.
  3. When there’s no bias, proceed to the same regression but force the constant to be 0
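
The steps above can be sketched as follows. The data are contrived so that the true weights are 0.6 and 0.4 with no bias, the helper names are illustrative, and judging whether the constant is statistically different from zero (the actual bias test) is left out.

```python
def solve(matrix, vector):
    """Solve a small linear system by Gauss-Jordan elimination with partial pivoting."""
    n = len(vector)
    aug = [row[:] + [value] for row, value in zip(matrix, vector)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def least_squares(columns, y):
    """Coefficients from regressing y on the given columns (normal equations)."""
    xtx = [[sum(a * b for a, b in zip(ci, cj)) for cj in columns] for ci in columns]
    xty = [sum(a * b for a, b in zip(ci, y)) for ci in columns]
    return solve(xtx, xty)


# Hypothetical historic-period data: two forecasts and the actual values.
forecast_a = [102.0, 108.0, 123.0, 128.0, 141.0, 152.0]
forecast_b = [98.0, 112.0, 117.0, 133.0, 139.0, 148.0]
actual = [0.6 * a + 0.4 * b for a, b in zip(forecast_a, forecast_b)]
ones = [1.0] * len(actual)

# Step 2: regression with a constant; a constant near zero suggests no bias.
constant, w_a, w_b = least_squares([ones, forecast_a, forecast_b], actual)

# Step 3: same regression with the constant forced to 0; the coefficients are the weights.
weight_a, weight_b = least_squares([forecast_a, forecast_b], actual)
combined = [weight_a * a + weight_b * b for a, b in zip(forecast_a, forecast_b)]
print(round(constant, 4), round(weight_a, 4), round(weight_b, 4))
```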
63
Q

The premise of constructing combined forecasts would be satisfied by which of the following scenarios?

  • using exp smoothing method
  • using delphi method
  • using the MA method
  • using multiple MA methods
A

4

64
Q

What is the purpose of the 45 degree line of a perfect forecast?

A

to show that the forecast would not have bias

65
Q

What are the 3 types of biz cycle indicators (give ex of each)

A

leading economic index- lead turning points in economic activity (stock prices, avg weekly manufacturing hours, ISM new order index)
coincident index- are coincident with turning points in economic activity (ie personal income and industrial production)
lagging index- lag turning points in economic activity (avg prime rates, avg duration of unemployment, labor cost per unit of output)

66
Q

Which of the following is true about the convention in which all forecasts are equally weighted in the composite process?

  • such a weighting process removes any model bias on the part of the forecaster
  • such a weighting process minimizes RMSE
  • Such a weighting process minimizes forecast bias
  • Such a weighting process usually minimizes forecast error variance
A

1

67
Q

what is an advantage of using regression process to estimate the optimal weights in the forecast combination process?

A

a test of the combined forecast model’s bias can be performed

68
Q

How can ensembles help make better decisions?

A

they minimize the total error, giving you more predictive accuracy

69
Q

a biz cycle w the same length of time bw successive peaks would be called

A

constant periodicity

70
Q

Which are benefits of using the decomp method?

  • info is consistent w the way managers look at info
  • they provide excellent forecasts
  • do not involve a lot of math and stats which make them easy to explain
  • they are simple bc they make no adjustments for seasonality
A

1,2,3

71
Q

the series that remains after the seasonality and irregular components have been smoothed out by using MA

A

Centered moving average

72
Q

if a biz cycle has the same vertical dist from trough to peak its called

A

constant amplitude

73
Q

How is the cyclicality measured in time series decomp?

what are the rules associated w it?

A

by the cycle factor (CMA/CMAT)

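if CF>1: indicates the deseasonalized value for that period is above the long term trend of the data

if CF<1: indicates the deseasonalized value for that period is below the long term trend of the data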
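
A small sketch of the cycle factor and its reading, assuming made-up CMA and CMAT values (the function names are illustrative):

```python
def cycle_factors(cma, cmat):
    """Cycle factor for each period: CMA divided by its trend value (CMAT)."""
    return [c / t for c, t in zip(cma, cmat)]


def position(cf):
    """Read a cycle factor against the long term trend."""
    if cf > 1:
        return "above trend"
    if cf < 1:
        return "below trend"
    return "on trend"


# Hypothetical centered moving averages and their trend values.
cma = [104.0, 99.0, 101.0]
cmat = [100.0, 100.0, 100.0]

factors = cycle_factors(cma, cmat)
print([(round(cf, 2), position(cf)) for cf in factors])
```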

74
Q

what is the ratio of the CMA to the CMAT?

A

cycle factor

75
Q

what is the most diff component of a time series decomp to analyze and project into the forecast period

A

cycle factor

76
Q

What does decomp do?

A

identify long term trend, seasonal fluctuation, cyclical movements, and irregular fluctuation. Then break the series into its component parts and reassemble the parts to construct a forecast

77
Q

What is the equation for decomp?

A

Y=T(rend)xS(easonality)xC(yclicality)xI(rregular variations)

78
Q

Which of the following statements are correct for classical time series decomp?

  • classical time series decomp are very similar to the delphi method
  • classical time series decomp use the concepts of trend projections
  • they account for seasonality in a multiplicative way
  • they use the concepts of MA
A

2,3,4

79
Q

How is the trend measured in a decomp? How is seasonality? Cyclicality?

A

t: CMAT
S: seasonal indices
C: cycle factor

80
Q

why do we deseasonalize the data before we create a decomp?

A

it allows us to better see the underlying pattern in the data and provides a measure of the extent of seasonality in the form of seasonal indices

81
Q

period of time between the beginning trough and peak is

A

expansion phase

82
Q

when deseasonalizing the data, the MA represents the typical level of Y for the

A

year that is centered on that MA

83
Q

When business cycles are true cycles they have

A

constant periodicity and constant amplitude and regularity

84
Q

what is a two period MA of the MA

A

centered MA
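
A minimal sketch of that calculation, assuming quarterly data (4 periods per year) and made-up values; the function names are illustrative.

```python
def moving_average(series, window):
    """Simple moving average over every run of `window` consecutive values."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]


def centered_moving_average(series, periods=4):
    """Two-period MA of the `periods`-period MA (for an even number of periods)."""
    return moving_average(moving_average(series, periods), 2)


# Hypothetical quarterly series covering two years.
quarterly = [10, 20, 30, 40, 14, 24, 34, 44]
cma = centered_moving_average(quarterly)
print(cma)
```

With 4 quarters per year, the first CMA value lines up with the third quarter of the series, so two periods are lost at each end.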

85
Q

What is the best way to get the best measure of the degree of seasonality for a decomp?

A

compare the actual value with the deseasonalized value

86
Q

If the seasonal factor is greater than 1

A

indicates a period where the value is greater than the quarterly avg for the year

87
Q

Which of the following are correct deciding how to estimate the long term trend from deseasonalized data?

  • values of a and b are estimated using the MA
  • simple linear equation is used
  • time =1 for the first period, and increases by 1 for each quarter after
  • trend equation is used to estimate the trend value of the cma for the historical and forecast periods
A

2,3,4
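
As a rough illustration of the trend step, the sketch below fits CMAT = a + b(time) by least squares to made-up CMA values, with time = 1 for the first period, then extends the line into the forecast periods. The function name and data are assumptions.

```python
def fit_linear_trend(values):
    """Estimate a and b in CMAT = a + b * time, with time = 1 for the first period."""
    n = len(values)
    times = list(range(1, n + 1))
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    b = (sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
         / sum((t - mean_t) ** 2 for t in times))
    a = mean_v - b * mean_t
    return a, b


# Hypothetical centered moving averages for four historical quarters.
cma = [25.5, 26.5, 27.5, 28.5]
a, b = fit_linear_trend(cma)

historical_trend = [a + b * t for t in range(1, len(cma) + 1)]
forecast_trend = [a + b * t for t in range(len(cma) + 1, len(cma) + 3)]
print(a, b, forecast_trend)
```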

88
Q

building permits, int rate on treasury bonds, avg consumer expectation for biz conditions are

A

leading index

89
Q

Consumer installment credit outstanding to personal income ratio, commercial and industrial loans, and consumer price index are what type of indicator

A

lagging index

90
Q

Which of the following statements are correct about the first step in deseasonalizing a time series decomp model?

  • the series used should contain the same number of periods as there are in the seasonality that you want to identify
  • remove the short term fluctuations from the data so that longer term trend and cycle components can be more clearly identified
  • short term fluctuations can include both seasonal patterns and irregular variations
  • short term fluctuations can be removed by calculating the appropriate multiple regression
A

1,2,3

91
Q

ARIMA looks at only the ___ of a time series

A

past pattern

92
Q

ARIMA uses the ___ as a starting value and proceeds to analyze ___ forecasting errors to select the appropriate adjustment for future time periods

A

recent observation

recent

93
Q

Which of the following are true regarding ARIMA?

  • it’s a combo of AR and MA model
  • produced from white noise series
  • ind variable depends on its own previous values rather than the white noise series or residuals
  • it’s similar to exp smoothing and could be called a weighted avg model
A

1,2

94
Q

What are 3 statements explaining white noise?

A
  • numbers are normally and ind. distributed
  • purely random series of numbers
  • it is assumed the observed time series started as white noise
95
Q

What are two tools used to determine which black box is an appropriate representation of any given time series?

A

partial autocorrelation coefficient

autocorrelation coefficient

96
Q

What are good guidelines for selecting an ARIMA model?

  • measures of accuracy like MAPE and RMSE are useful in identifying the degree of fit
  • the less complex the model the more useful it is. Simple models w few coefficients are best
  • determine if the model fits the data well
  • only the ljung box should be used for selecting a model. nothing else
A

1,2,3

97
Q

How should you make your selection of ARIMA

A
  • simple models are the best
  • it is possible for two or more models to be very similar in their fit of data

98
Q

What are the advantages of the box-jenkins methodology of using ARIMA as compared to other time series methods

A

allows for greater flexibility in the choice of the correct model
extracts a great deal of info from the time series
encourages examination of a wide variety of models in search for an acceptable one

100
Q

What are two methods to achieve stationarity?

A

take logs of the og time series to transform a trend in the variance into a trend in the mean
differencing the time series to remove a trend
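
Both methods can be chained. A minimal sketch, assuming a made-up series that grows 10% per period (the function name is illustrative):

```python
import math


def log_difference(series):
    """Take natural logs, then first differences of the logged series."""
    logs = [math.log(x) for x in series]
    return [b - a for a, b in zip(logs, logs[1:])]


# Hypothetical series with 10% growth per period: level and swings both grow.
growing = [100 * 1.1 ** t for t in range(6)]
stationary = log_difference(growing)
print([round(d, 4) for d in stationary])
```

After logging, the growth becomes a straight-line trend, and differencing then leaves a constant value (the log of the growth factor).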

101
Q

if the correlation coefficient is closer to +1, this indicates

A

more positive correlation

102
Q

which are char about nonstationary time series?

  • real world time series are most often non stationary
  • the variability of the time series does not change over time
  • the autocorrelations are usually sig diff from zero at first and then gradually falls off to zero
  • the mean value of the time series changes over time
A

1,3,4

103
Q

what is the 3rd rule in the identification stage of the box jenkins identification process?

A

if neither function falls off abruptly, but both decline toward zero in some fashion, the appropriate model is an ARMA(p,q) type

105
Q

What is one of the basic types of models used when choosing the correct black box?

A

autoregressive model

106
Q

Which of the following is true regarding the box jenkins identification process?

  • first step must be done manually rather than automatically
  • a stationary raw series is necessary for the first step
  • raw series is examined to identify one of the many available models
  • it is an iterative process since it loops thru the processes many times before finding an appropriate model
A

2,3,4

107
Q

What is the final step in arima/boxjenkins?

A

if the remaining series is not white noise, pass it thru another black box

108
Q

what is the second step in box jenkins

A

estimate parameters of tentative model

109
Q

what are two char of white noise

  • no relationship bw consecutively observed values
  • white noise forecasted does not start out as white noise but is transformed into white noise thru the black box process
  • previous values do not help in predicting future values
  • it is a set series of predefined numbers
A

1,3

110
Q

What does the ljung box stat test

A

whether the residual autocorrelations as a set are sig diff from zero

111
Q

How is the equation for the autoregressive model similar to the MA model?

A

the autoregressive is similar to the MA model except that the dep variable depends on its own previous values

112
Q

spikes in the partial autocorrelation function indicate?

A

autoregressive terms

113
Q

Which of the following would be considered tests for correctness of the model in the box jenkins identification process?

  • ljung box pierce q stat
  • perform a chi sqrd test on the autocorrelations of the residuals using the ljung box
  • pass an MA (1) box over the contrived data set to identify if its white noise
  • pass a MA (1) over the white noise and turn it into a MA(1) data set
A

1,2,3

114
Q

Which of the following are correct about box jenkins

  • exponential smoothing is a special case of ARIMA
  • Box jenkins is a way of forecasting a variable by looking only at the most recent pattern of the time series
  • box jenkins is best suited for longer range rather than shorter range
  • it uses arima models
A

1,3,4

115
Q

a time series’ numbers are monotonically increasing throughout the time period and it shows dominant autocorrelations that only gradually become smaller. What type of time series would this be?

A

nonstationary

116
Q

Which of the following are considered to be definitions of data mining?

  • analysis of databases, data warehouses, and data marts
  • extraction of knowledge or info from large amts of data
  • assigning a model to data for research purposes
  • extraction of useful info from large, unstructured databases
A

1,2,4

117
Q

What is an ex of clustering analysis tool?

  • bank deems u credit worthy or a credit risk based on info u submit
  • post office recognizes the handwriting on an envelope as alphabetic characters and numbers
  • retail store determines if ur credit card is legitimate or not when you try to make a purchase
  • university students are identified who have special needs
A

4

118
Q

The model phase of a data mining process would include which of the following?

  • to set the parameters necessary to execute the process
  • to determine the data mining task required
  • to examine the data graphically
  • to select an appropriate algorithm
A

1,2,4

119
Q

The modify step in the data mining process involves…

A

creating, selecting, or transforming the data

120
Q

Big data provides a mesh of data in which few mistakes may ____ the outcomes predicted by the preponderance of data

A

not affect

121
Q

Where does partitioning in data mining come from?

A

holdout periods in standard data forecasting models

122
Q

In data mining, T/F the entire data set is needed to build a model

A

F, it is not needed

123
Q

Which of the following would be part of the “explore” activity in the data mining process?

  • creating summary stats of the attributes
  • examining the data graphically
  • data cleansing
  • running regressions on the data
A

1,2,3

124
Q

Which of the following would correctly contrast data mining w database management?

  • database management is extracting useful info from large, unstructured databases where data mining is extracting specialized or grouped data
  • queries are well defined in database management but less structured in data mining
  • a query in database management would be “find all cust in atlanta”, and in data mining would be “group all cust w similar buying habits”
  • data mining is more forward looking, while database management is more past focus
A

2,3,4

125
Q

Compare 3 things that data mining and biz forecasting have in common or are diff

A
  • both measure the certainty or trustworthiness associated w the patterns discovered
  • in data mining u simultaneously search for diff kinds of patterns in parallel
  • in biz forecasting search for set patterns
  • in biz forecasting the expectation is that the data will contain some level of variation
  • in data mining, patterns are not pre specified
126
Q

Spikes in the PCF mean

A

autoregressive terms

127
Q

Spikes in the ACF mean

A

moving average terms