Pensum Flashcards

(33 cards)

1
Q

What are the two main machine learning techniques for data mining?

A

Supervised machine learning and unsupervised machine learning

2
Q

What are examples of supervised machine learning?

A

Decision trees, linear classifiers, linear regression

3
Q

What are examples of unsupervised machine learning?

A

Clustering

4
Q

What are the characteristics of unsupervised machine learning?

A

Unsupervised methods have no specific target value. The system simply looks for patterns in the data, with no “teacher” supplying the right answers. The data can often be grouped into a small number of categories, and we then have to inspect the result ourselves.

5
Q

What is the goal of clustering?

A

The goal is to group similar instances together using some similarity metric, that is, to create groupings whose members are similar to each other. For example, group similar customers together and design a different campaign for each group.
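
The grouping idea can be sketched with a very small k-means on one attribute. This is only an illustration under assumptions: the spend figures, the starting centers, and the helper `kmeans_1d` are hypothetical, and k-means is just one of several clustering methods.

```python
def kmeans_1d(values, centers, iterations=10):
    """Tiny k-means for 1-D data: assign each value to its nearest center, then re-average."""
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Keep the old center if a group ends up empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# Hypothetical yearly spend per customer, invented for illustration.
spend = [10, 12, 11, 95, 100, 105]
centers, groups = kmeans_1d(spend, centers=[0.0, 50.0])
```

Here the customers fall into a low-spend group and a high-spend group, which could then receive different campaigns.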

6
Q

What are the characteristics of clustering?

A

It is like classification, but the groupings are not predefined. It is more open-ended than classification and regression. It could find a way to group similar customers together, which may or may not relate to the churn question.

7
Q

What is the fundamental goal of data mining techniques?

A

Exploration to find patterns in a dataset.

8
Q

What is similarity matching?

A

Instances are compared based on their attributes to determine how similar they are. Example: Amazon finds books that are similar to a book you have read. The most similar will be a book that shares all three attributes (if the book you already read had three).
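
One simple way to compare instances is to count the attributes they share. The sketch below is only an illustration: the book names, the attribute sets, and the helper `shared_attributes` are hypothetical and do not describe how Amazon actually works.

```python
def shared_attributes(a: set[str], b: set[str]) -> int:
    """Count the attributes two items have in common."""
    return len(a & b)


# Hypothetical attribute sets, invented for illustration.
read_book = {"crime", "nordic", "paperback"}
candidates = {
    "book_a": {"crime", "nordic", "paperback"},
    "book_b": {"crime", "hardcover"},
    "book_c": {"cooking", "hardcover"},
}

# Rank candidates by how many attributes they share with the book already read.
ranked = sorted(
    candidates,
    key=lambda name: shared_attributes(read_book, candidates[name]),
    reverse=True,
)
```

The candidate that shares all three attributes comes first in `ranked`.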

9
Q

When do we use similarity matching?

A

The general idea of similarity matching is applicable in many different forms of data mining, including classification, regression, and clustering.

10
Q

What is important with similarity matching?

A

It is important to have information about the relevant attributes, and about which attributes are most important.

11
Q

What is regression?

A

Regression predicts a numerical value. It is related to classification, but with one difference: classification predicts whether something will happen, while regression predicts how much.

12
Q

Give an example of when we would use regression.

A

How much will a customer spend? That question is solved with regression.
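
A minimal sketch of this, assuming a straight-line relationship between store visits and spend. The data and the helper `fit_line` are hypothetical, and ordinary least squares is only one possible regression method.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: number of visits and amount spent.
visits = [1, 2, 3, 4]
spend = [15, 25, 35, 45]

slope, intercept = fit_line(visits, spend)
predicted_spend = slope * 5 + intercept  # estimate for a customer with 5 visits
```

The output is a number (an amount), not a class label, which is what separates regression from classification.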

13
Q

What does supervised machine learning do in general?

A

A target value is specified for each instance. The system examines instances one by one, and we can simply compute how often it makes the right choice.
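
Computing how often the system makes the right choice can be sketched as follows. The labels and the helper `accuracy` are hypothetical, invented for illustration.

```python
def accuracy(predictions, targets):
    """Fraction of instances where the prediction equals the target value."""
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)


# Hypothetical target values and model predictions for five customers.
targets = ["churn", "stay", "stay", "churn", "stay"]
predictions = ["churn", "stay", "churn", "churn", "stay"]

score = accuracy(predictions, targets)  # 4 of the 5 predictions match
```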

14
Q

What is classification?

A

Classification involves defining a small number of classes and then trying to predict, for each instance, which class it belongs to. In the churn example, classification is a natural fit, with one class for customers who will churn and one for those who will not.
Each instance is labelled with a target value indicating which class it belongs to.

15
Q

What is data preparation?

A

Data preparation is about constructing a dataset from one or more data sources to be used for exploration and modeling

16
Q

Why do we need data preparation?

A

It is good practice to start with an initial dataset to get familiar with the data, to discover first insights, and to understand any possible data quality issues. Data preparation is often a time-consuming and highly error-prone process.

17
Q

What is a database?

A

A database collects, stores, and manages information so that users can retrieve, add, update, or remove it. It presents information in tables with rows and columns.

18
Q

What tools can you use to assess data?

A

Through accuracy, precision/recall, training/test splits, and cross-validation
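
Precision and recall can be sketched in a few lines. The labels below and the helper `precision_recall` are hypothetical; 1 stands for the positive class (for example, "will churn").

```python
def precision_recall(predictions, targets, positive=1):
    """Precision: share of predicted positives that are right.
    Recall: share of actual positives that were found."""
    pairs = list(zip(predictions, targets))
    tp = sum(p == positive and t == positive for p, t in pairs)
    fp = sum(p == positive and t != positive for p, t in pairs)
    fn = sum(p != positive and t == positive for p, t in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical test-set labels and predictions.
targets = [1, 1, 1, 0, 0, 0]
predictions = [1, 1, 0, 1, 0, 0]

precision, recall = precision_recall(predictions, targets)
```

In this made-up case there are two true positives, one false positive, and one false negative, so both measures come out at two thirds.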

19
Q

What technique do you use when you want to find out how much a customer wants to use a service?

A

Regression, because the question asks how much (a numerical value).

20
Q

What does classification do?

A

Predicts the class each individual belongs to

21
Q

What does regression do?

A

Estimates a numerical value for each individual

22
Q

What does clustering do?

A

Identifies similar individuals based on data known about them

23
Q

Can we find groups of customers who are likely to cancel the service when the contract expires?

A

This is a problem for supervised learning

24
Q

What can a CSV data file look like?

A

sunny,short,boring,no
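
Such a line can be read with Python's standard `csv` module. The column names below are hypothetical, since the card does not say what the four values stand for.

```python
import csv
import io

# The example line from the card, as it would appear in a file.
raw = "sunny,short,boring,no\n"

# Hypothetical column names, assumed for illustration only.
column_names = ["outlook", "length", "content", "target"]

# DictReader maps each comma-separated value to its column name.
rows = list(csv.DictReader(io.StringIO(raw), fieldnames=column_names))
```

Each row becomes a dictionary, so `rows[0]["target"]` gives the last value on the line.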

25
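Q

How do you calculate entropy? (Entropy is the measure used to compute information gain.)

A

entropy = -p1 log2(p1) - p2 log2(p2) - ..., where pi is the proportion of instances in the set that belong to class i.
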
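A minimal sketch of entropy and information gain, assuming the standard definition of entropy as the negative sum of p * log2(p) over the class proportions p. The class counts and the helper names `entropy` and `information_gain` are hypothetical.

```python
from math import log2


def entropy(counts):
    """Entropy in bits of a set described by its class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def information_gain(parent_counts, children_counts):
    """Parent entropy minus the size-weighted entropy of the child sets."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted


# Hypothetical split: 10 instances (5 yes, 5 no) separated perfectly by one attribute.
parent = [5, 5]
children = [[5, 0], [0, 5]]

parent_entropy = entropy(parent)
gain = information_gain(parent, children)
```

An evenly mixed two-class set has the maximum entropy of 1 bit, and a split into pure subsets removes all of it.
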
26
Q

A linear classifier is a parameterized model: the parameters are what is learned in the training process. What are the parameters of a linear classifier?

A

The weights

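A sketch of where the weights sit, assuming the usual form f(x) = w0 + w1*x1 + w2*x2 + ... with the class decided by the sign of f(x). The weight values are hypothetical; in practice they would come out of training.

```python
def linear_score(weights, features):
    """f(x) = w0 + w1*x1 + w2*x2 + ...; weights[0] is the intercept w0."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))


def classify(weights, features):
    """Assign the class by which side of the decision boundary the instance falls on."""
    return "positive" if linear_score(weights, features) > 0 else "negative"


# Hypothetical learned weights: w0, w1, w2.
weights = [0.5, 1.0, -1.5]

first = classify(weights, [2.0, 1.0])
second = classify(weights, [1.0, 2.0])
```

Changing the weights moves the decision boundary, which is why they are the part that training has to learn.
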
27
Q

A good way to recognise overfitting is:

A

Compare accuracy on holdout data with accuracy on training data

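The comparison can be sketched with a deliberately overfitted model that just memorizes its training data. The data, the `memorizer` model, and the helper `accuracy` are hypothetical.

```python
def accuracy(model, data):
    """Fraction of (input, label) pairs the model labels correctly."""
    return sum(model(x) == y for x, y in data) / len(data)


# Hypothetical labelled data, split into a training set and a holdout set.
training = [(1, "no"), (2, "no"), (3, "yes"), (4, "yes")]
holdout = [(5, "yes"), (6, "yes"), (0, "no"), (7, "yes")]

# A table model: perfect recall of the training data, a fixed guess otherwise.
lookup = dict(training)


def memorizer(x):
    return lookup.get(x, "no")


train_accuracy = accuracy(memorizer, training)
holdout_accuracy = accuracy(memorizer, holdout)
gap = train_accuracy - holdout_accuracy
```

A large gap between the two accuracies suggests the model has memorized the training data rather than learned a pattern that generalizes.
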
28
Q

What can the data mining technique kNN be used for?

A

Classification and regression

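Both uses can be sketched with the same neighbour search: a majority vote for classification and an average for regression. The data and helper names are hypothetical, and a single numeric attribute is assumed to keep the distance simple.

```python
from collections import Counter


def k_nearest(train, query, k):
    """Return the k training examples closest to the query (1-D distance)."""
    return sorted(train, key=lambda example: abs(example[0] - query))[:k]


def knn_classify(train, query, k=3):
    """Majority vote among the labels of the k nearest neighbours."""
    labels = [label for _, label in k_nearest(train, query, k)]
    return Counter(labels).most_common(1)[0][0]


def knn_regress(train, query, k=3):
    """Average of the numeric targets of the k nearest neighbours."""
    values = [value for _, value in k_nearest(train, query, k)]
    return sum(values) / len(values)


# Hypothetical training data: (attribute value, target).
labelled = [(1, "stay"), (2, "stay"), (3, "stay"), (8, "churn"), (9, "churn")]
priced = [(1, 10.0), (2, 20.0), (3, 30.0), (8, 80.0), (9, 90.0)]

predicted_class = knn_classify(labelled, 8.5)
predicted_value = knn_regress(priced, 2.5)
```
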
29
Q

What are the most widely used techniques in data mining?

A

Classification, regression and clustering.

30
Q

What does the data mining technique of co-occurrence and association analysis (market-basket analysis) do?

A

It finds items that go together. For example, by analyzing market-basket data, you might find that customers who bought a pork sandwich also bought water. Learning these associations can be very useful.

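Finding items that go together can be sketched by counting pairs across baskets. The baskets and the helper `confidence` are hypothetical, invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket data: one set of items per purchase.
baskets = [
    {"pork sandwich", "water"},
    {"pork sandwich", "water", "chips"},
    {"pork sandwich", "soda"},
    {"salad", "water"},
]

# Count how often each pair of items appears in the same basket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1


def confidence(antecedent, consequent):
    """Share of baskets containing `antecedent` that also contain `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)


sandwich_then_water = confidence("pork sandwich", "water")
```

Here two of the three baskets with a pork sandwich also contain water.
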
31
Q

What is the core of data analytical thinking?

A

Data should be considered an asset
Can help to structure business problems
Applying data science to a well-structured problem vs exploratory data mining

32
Q

What is the aim of generalisation in data analytical thinking?

A

We want patterns that generalize to data we have not seen

33
Q

Mention four ways to extract knowledge from data

A

1. Identifying informative attributes
2. Fitting a numeric function model to data
3. Controlling complexity - generalization and overfitting
4. Calculating similarity between objects