Data Mining - Chapter 2 Flashcards

1
Q

What is classification?

A
Examining data and deciding which class or category each record falls into.
--> Trying to predict a class.
2
Q

What is prediction?

A

Trying to predict the value of a numerical variable.

--> Can be used for both continuous and categorical data.

3
Q

What are association rules?

A

Rules designed to find general association patterns between items in a large database. The resulting rules apply to an entire population rather than to an individual.

4
Q

What is collaborative filtering?

A

Making rules for an individual user, as opposed to the general public, based on that user's history as well as the history of others.

5
Q

What is data reduction?

A

The process of consolidating a large number of records into a smaller set.

  • -> You do this because data mining algorithms often perform better when large numbers of records are grouped into a smaller set of homogeneous groups.
  • -> Often done by clustering.
6
Q

What is dimension reduction?

A

Reducing the number of variables (instead of the number of rows).

7
Q

What is data visualization?

A

Data exploration through creating charts and dashboards.

8
Q

What are supervised learning algorithms?

A

Algorithms that predict numerical values or classes and that are trained using training, validation and testing data.
For the training data, the value of the outcome of interest is already known. Therefore, you can see how well the algorithm performs, tune it with validation data, and compare it against other algorithms.

9
Q

What are unsupervised learning algorithms?

A

Algorithms that use no outcome variable to predict or classify.
Examples: association rules, dimension reduction methods and clustering techniques.

10
Q

What are the 10 steps of data mining?

A
  1. Develop an understanding of the purpose of the data mining project.
  2. Obtain the dataset to be used in the analysis.
  3. Explore, clean, and preprocess the data
  4. Reduce the data dimension, if necessary
  5. Determine the data mining task
  6. Partition the data
  7. Choose the data mining techniques to be used
  8. Use algorithms to perform the task
  9. Interpret the results of the algorithms
  10. Deploy the model.
11
Q

What is SEMMA?

A

A methodology of data mining by the company SAS. It encompasses the previous 10 steps.

  1. Sample
    Take a sample. Partition into training/testing
  2. Explore
    Examine the data set statistically and graphically
  3. Modify
    Transform variables/impute missing values
  4. Model
    Fit predictive models
  5. Assess
    Compare models using a validation dataset.
12
Q

What is a slice of data?

A

A slice returns an object usually containing a portion of a sequence, such as a subset of rows and columns from a data frame.
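
As a minimal sketch using plain Python lists (the `records` data is made up for illustration), slicing a sequence returns a new object that holds only the requested portion:

```python
# Hypothetical records: each row is [age, income, city].
records = [
    [25, 30000, "Utrecht"],
    [41, 52000, "Leiden"],
    [33, 41000, "Delft"],
    [58, 67000, "Breda"],
]

# Slice of rows: positions 1 and 2 (the stop index 3 is excluded).
row_slice = records[1:3]

# Slice of rows and columns: the first two columns of those rows.
row_col_slice = [row[0:2] for row in records[1:3]]

print(row_slice)
print(row_col_slice)
```

The original `records` list is left unchanged; each slice is a separate, smaller object.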

13
Q

Which two indexers does pandas provide to access rows in a data frame?

A
  1. loc
    More general, allows accessing rows using labels
  2. iloc
    Less general, only allows integer positions.
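
The difference between the two can be sketched on a small, made-up data frame (the column names and row labels below are assumptions for illustration). Note that a label slice with `loc` includes its end label, while a position slice with `iloc` excludes its stop position:

```python
import pandas as pd

# Hypothetical data frame with string row labels.
df = pd.DataFrame(
    {"age": [25, 41, 33], "income": [30000, 52000, 41000]},
    index=["ann", "bob", "cem"],
)

# loc selects by label; a label slice includes both endpoints.
by_label = df.loc["ann":"bob", ["age"]]

# iloc selects by integer position; the stop position is excluded.
by_position = df.iloc[0:2, 0:1]

# Both selections return the same rows and column here.
print(by_label.equals(by_position))
```
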
14
Q

What is oversampling?

A

Putting heavier weights in your sampling procedure to overweight the rare class relative to the majority class. Otherwise your model might not be able to identify that records belong to the rare class.
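
One simple way to do this, sketched below under assumptions, is to resample the rare class with replacement until it is as large as the majority class. The class labels, the 95/5 split and the 50/50 target are made up for illustration; other weighting schemes are possible.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical labelled records: 95 "ok" and 5 "fraud".
records = [("ok", i) for i in range(95)] + [("fraud", i) for i in range(5)]

majority = [r for r in records if r[0] == "ok"]
rare = [r for r in records if r[0] == "fraud"]

# Draw rare records with replacement until both classes are equally large.
rare_oversampled = random.choices(rare, k=len(majority))

training_sample = majority + rare_oversampled
random.shuffle(training_sample)

# Share of the rare class after oversampling.
rare_share = sum(1 for r in training_sample if r[0] == "fraud") / len(training_sample)
print(rare_share)
```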

15
Q

Which types of variables are there?

A
  1. Numerical (Continuous, integer & date)
  2. Text
  3. Categorical (numerical/text)
    - Nominal
    - Ordinal
16
Q

Which data mining technique cannot deal with continuous variables?

A

Naive Bayes (in its basic form, it only accepts categorical predictors)

17
Q

Which data mining techniques cannot deal with categorical variables?

A

Trick question - Almost every technique can deal with them.

Note: Ordered variables can sometimes be coded numerically to treat them as continuous variables.

18
Q

How can you use nominal categorical variables?

A

They often cannot be used directly in data mining techniques. You can decompose them into dummy variables (one 0/1 variable per category) to be able to use them.
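
A minimal sketch of this decomposition in plain Python, assuming a made-up `colors` variable with three categories:

```python
# Hypothetical nominal variable with three categories.
colors = ["red", "green", "blue", "green"]

categories = sorted(set(colors))  # ['blue', 'green', 'red']

# One 0/1 dummy per category for every record.
dummies = [
    {f"color_{c}": int(value == c) for c in categories}
    for value in colors
]

print(dummies[0])  # {'color_blue': 0, 'color_green': 0, 'color_red': 1}
```

For techniques such as linear regression, one of the dummies is usually dropped, because its value is implied by the others.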

19
Q

What is meant by explore, clean and preprocess the data?

A

Verifying that the data are in reasonable condition.

This includes deciding how to handle missing data, checking for outliers, etc.

20
Q

What is meant by determining the data mining task?

A

Translating the general question of the project into a specific data mining question: do you need to do classification, prediction, clustering, etc.?

21
Q

What is meant by partitioning the data?

A

Dividing it up into training, validation and testing sets in case you have a supervised task.
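
A minimal sketch of such a partition, assuming a random 50/30/20 split (the function name `partition`, the fractions and the seed are illustrative choices, not prescribed values):

```python
import random

def partition(records, train_frac=0.5, valid_frac=0.3, seed=1):
    """Shuffle records and split them into training, validation and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains
    return train, valid, test

train, valid, test = partition(range(100))
print(len(train), len(valid), len(test))  # 50 30 20
```

Every record ends up in exactly one of the three sets.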

22
Q

What are the disadvantages of having too many variables in your model?

A
  • Your model becomes more complex and you need more records to assess relationships between variables
  • There will be more data quality and availability issues
  • More data cleaning and preprocessing is required
  • There is a higher risk of overfitting

Rule of thumb: 10 records per predictor variable

23
Q

What is normalization of data?

A

Bringing all variables to the same scale. Sometimes needed to run your algorithm well.

-> Subtract the mean and divide by the standard deviation.
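
The calculation can be sketched with the standard library. The `incomes` and `ages` values are made up, and using the sample standard deviation (`statistics.stdev`) rather than the population version is an assumption:

```python
from statistics import mean, stdev

def standardize(values):
    """Return z-scores: (value - mean) / sample standard deviation."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical incomes and ages on very different scales.
incomes = [30000, 52000, 41000, 67000]
ages = [25, 41, 33, 58]

z_incomes = standardize(incomes)
z_ages = standardize(ages)

print([round(z, 2) for z in z_incomes])
print([round(z, 2) for z in z_ages])
```

After this step both variables have mean 0 and standard deviation 1, so neither dominates purely because of its units.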

24
Q

What is overfitting?

A

The goal of a model is to make good predictions on any additional data over which you run your algorithm.

If you have a function that represents your sample too perfectly, it captures only the relations specific to that sample and not the 'general' relation between the variables. Therefore, it will not be able to predict future values well. This is overfitting.

-> Can be seen in a graph when the fitted function follows the actual data points too closely.

25
Q

How can you prevent overfitting?

A

Partition your data: train the model on one dataset, then test it on a different dataset to assess its performance.
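
The sketch below shows how a separate dataset makes overfitting visible. It uses made-up data around y = 2x and compares a simple least-squares line with a deliberately overfitted "memorizer" that just stores the training points. All names and data are illustrative assumptions.

```python
def fit_line(points):
    """Ordinary least squares for y = a + b * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return lambda x: a + b * x

def fit_memorizer(points):
    """Deliberately overfitted 'model': a lookup table of the training data."""
    table = dict(points)
    return lambda x: table[x]

def mean_abs_error(model, points):
    """Average absolute difference between predictions and actual values."""
    return sum(abs(model(x) - y) for x, y in points) / len(points)

# Made-up data around y = 2x; the noise has opposite sign in the two sets.
noise = [1, -1, 1, -1, 1, -1, 1, -1]
train = [(x, 2 * x + n) for x, n in enumerate(noise)]
holdout = [(x, 2 * x - n) for x, n in enumerate(noise)]

line = fit_line(train)
memorizer = fit_memorizer(train)

# Print training error and holdout error for each model.
for name, model in [("line", line), ("memorizer", memorizer)]:
    print(name, mean_abs_error(model, train), mean_abs_error(model, holdout))
```

The memorizer has zero error on the training data but a larger error than the line on the held-out data, which is the pattern that signals overfitting.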

26
Q

What is CRISP-DM?

A

A methodology similar to SEMMA. It stands for

CRoss Industry Standard Process for Data Mining

27
Q

What are the six steps of CRISP-DM?

A
  1. Business understanding
  2. Data understanding
  3. Data preparation
    (These three steps take up 85% of the project time)
  4. Model building
  5. Testing and evaluation
  6. Deployment
    (Last three steps are similar to SEMMA)
28
Q

What is underfitting of a model?

A

The model performs poorly on the training data. This is because the model is unable to capture the relationship between the input examples (often called X) and the target values (often called Y).