Pensum Flashcards by Sine Klingsholm

What are the two main machine learning techniques for data mining?

Supervised machine learning and unsupervised machine learning

What are examples of supervised machine learning?

Decision trees, linear classifiers, linear regression

What are examples of unsupervised machine learning?

Clustering

What are the characteristics of unsupervised machine learning?

There is no specific target value in unsupervised methods. The system just looks for patterns in the data, with no "teacher" supplying the correct answers. The data may group neatly into a small number of categories, and we have to inspect the result to see what was found.

What is the goal of clustering?

The goal is to group together similar instances using some metric of similarity, that is, to create groupings whose members are similar to each other. For example, group similar customers together and design a different campaign for each group.
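
A minimal sketch of this idea, assuming k-means as the clustering method and Euclidean distance as the similarity metric (the card names neither). The customer data (age, purchases per month) and the choice of the first k points as starting centroids are hypothetical.

```python
from math import dist

def k_means(points, k, iterations=10):
    """Group points into k clusters around the nearest centroid."""
    # Assumption: use the first k points as starting centroids.
    centroids = [tuple(p) for p in points[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to the centroid it is closest to.
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its members (keep it if empty).
        centroids = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return clusters

# Hypothetical customers: (age, purchases per month)
customers = [(20, 1), (22, 2), (21, 1), (60, 9), (62, 10), (61, 8)]
groups = k_means(customers, k=2)
```

Each resulting group could then receive its own campaign. The groups are not predefined; they fall out of the data.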

What are the characteristics of clustering?

It is like classification, but the groupings are not predefined. It is more open-ended than classification and regression. It could find a way to group similar customers together, which may or may not relate to the churn question.

What is the fundamental goal of data mining techniques?

Exploring a dataset to find patterns.

What is similarity matching?

Instances are compared on their attributes to determine how similar they are. Example: Amazon finding books that are similar to a book you have read. If the book you read has three attributes, the most similar book is one that shares all three.
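
One possible way to score this, assuming Jaccard similarity (shared attributes divided by all attributes) as the measure; the card does not name one. The book titles and attributes are hypothetical.

```python
def jaccard_similarity(a, b):
    """Share of attributes two instances have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical book you have read, described by three attributes.
read_book = {"crime", "nordic", "paperback"}

# Hypothetical catalogue to compare against.
catalog = {
    "Book A": {"crime", "nordic", "paperback"},   # shares all three
    "Book B": {"crime", "nordic", "hardcover"},   # shares two
    "Book C": {"romance", "french", "hardcover"}, # shares none
}

# Rank catalogue titles from most to least similar.
ranked = sorted(
    catalog,
    key=lambda title: jaccard_similarity(read_book, catalog[title]),
    reverse=True,
)
```

The book sharing all three attributes ranks first, as the card describes.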

When do we use similarity matching?

The general idea of similarity matching is applicable to many different forms of data mining, including classification, regression, and clustering.

What is important with similarity matching?

It is important to have information about the relevant attributes, and about which attributes matter most.

What is regression?

Regression predicts a numerical value. It is related to classification, but there is a difference: classification predicts whether something will happen, while regression predicts how much.

Give an example of when we would use regression.

"How much will a customer spend?" is a question that is solved with regression.
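
A minimal sketch, assuming a straight-line model fitted by ordinary least squares on a single attribute. The visit and spend figures are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical history: store visits per month and amount spent.
visits = [1, 2, 3, 4]
spend = [30.0, 50.0, 70.0, 90.0]

slope, intercept = fit_line(visits, spend)

# Predicted spend (a number, not a class) for a customer with 5 visits.
predicted = slope * 5 + intercept
```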

What does supervised machine learning do in general?

A target value is specified for each instance. By examining the instances one by one, we can simply compute how often the system makes the right choice.
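
Computing "how often the system makes the right choice" can be sketched as follows. The target values and predictions are hypothetical.

```python
def accuracy(predictions, targets):
    """Fraction of instances where the prediction equals the target value."""
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)

# Hypothetical target values (known labels) and the system's predictions.
targets = ["churn", "stay", "stay", "churn", "stay"]
predictions = ["churn", "stay", "churn", "churn", "stay"]

score = accuracy(predictions, targets)  # 4 of 5 instances are right
```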

What is classification?

Classification involves defining a small number of classes and then trying to predict, for each instance, which class it belongs to. The churn example is a natural classification problem with two classes: will churn and will not churn.
Each instance is labelled with a target value indicating which class it belongs to.

What is data preparation?

Data preparation is about constructing a dataset from one or more data sources to be used for exploration and modeling

Why do we need data preparation?

It is good practice to start with an initial dataset to get familiar with the data, discover first insights, and understand any data quality issues. Data preparation is often time-consuming and highly error-prone.

What is a database?

A database collects, stores, and manages information so that users can retrieve, add, update, or remove it. It presents information in tables with rows and columns.

What tools can you use to assess data?

Accuracy, precision/recall, training/test splits, and cross-validation
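
As an illustration of two of these measures, precision and recall for one class can be computed as below. This is a sketch on hypothetical churn labels, with "churn" assumed to be the class of interest.

```python
def precision_recall(predictions, targets, positive):
    """Precision and recall for the class named by `positive`."""
    pairs = list(zip(predictions, targets))
    tp = sum(p == positive and t == positive for p, t in pairs)  # true positives
    fp = sum(p == positive and t != positive for p, t in pairs)  # false positives
    fn = sum(p != positive and t == positive for p, t in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels and predictions.
targets = ["churn", "stay", "stay", "churn", "churn"]
predictions = ["churn", "churn", "stay", "churn", "stay"]

precision, recall = precision_recall(predictions, targets, positive="churn")
```

Precision asks how many predicted churners really churned; recall asks how many real churners were found.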

What technique do you use when you want to find out how much a customer will use a service?

Regression

What does classification do?

Predicts the class each individual belongs to

What does regression do?

Estimates a numerical value for each individual

What does clustering do?

Identifies similar individuals based on data known about them

Can we find groups of customers who are likely to cancel the service when the contract expires?

This is a problem for supervised learning

What can a CSV data file look like?

sunny,short,boring,no
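
A row like this can be read with Python's standard `csv` module. In the sketch below the second row is hypothetical, and treating the last field as the target value is an assumption; the card does not name the columns.

```python
import csv
import io

# First row is from the card; the second row is a made-up example.
raw = "sunny,short,boring,no\nrainy,long,exciting,yes\n"

# csv.reader splits each line into a list of string fields.
rows = list(csv.reader(io.StringIO(raw)))

# Assumption: the last field is the target, the rest are attributes.
features = [row[:-1] for row in rows]
targets = [row[-1] for row in rows]
```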

How do you calculate entropy? (Entropy is the measure used to compute information gain.)
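
No answer is recorded for this card. The standard definition, presumably what is meant, is Shannon entropy: the negative sum over classes of p * log2(p), where p is the proportion of instances in each class. Information gain is the parent's entropy minus the weighted entropy of the subsets after a split. A sketch with hypothetical labels:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * log2(count / total)
        for count in Counter(labels).values()
    )

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical target labels before and after splitting on one attribute.
parent = ["yes", "yes", "no", "no"]
children = [["yes", "yes"], ["no", "no"]]  # a split that separates the classes

parent_entropy = entropy(parent)
gain = information_gain(parent, children)
```

An evenly mixed set has the highest entropy for two classes, and a split that produces pure subsets recovers all of it as information gain.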

A Linear Classifier is a Parameterized Model -- the Parameters are what is learned in the training process. What are the parameters for a Linear Classifier?

The weights
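
A sketch of how the weights are used once learned. The weight values, the attribute meanings, and the class names are hypothetical, and the intercept (bias) is usually counted among the learned parameters as well.

```python
def linear_classify(features, weights, bias=0.0):
    """Classify by the sign of the weighted sum of the attribute values."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return "churn" if score > 0 else "stay"

# Hypothetical learned parameters for two attributes.
weights = [2.0, -1.0]
bias = -1.0

first = linear_classify((3, 1), weights, bias)   # score = -1 + 6 - 1 = 4
second = linear_classify((0, 2), weights, bias)  # score = -1 + 0 - 2 = -3
```

Training only changes `weights` and `bias`; the form of the function stays fixed.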

A good way to recognise overfitting is:

Compare accuracy on holdout data with accuracy on training data
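
To illustrate, the sketch below uses a deliberately overfitting model: a lookup table that memorises every training example and falls back to the majority label otherwise. The model and the data are hypothetical; the point is the gap between the two accuracies.

```python
from collections import Counter

def train_lookup(examples):
    """Memorise each training example; unseen inputs get the majority label."""
    table = {features: label for features, label in examples}
    majority = Counter(label for _, label in examples).most_common(1)[0][0]
    return lambda features: table.get(features, majority)

def accuracy(model, examples):
    """Fraction of examples the model labels correctly."""
    return sum(model(f) == label for f, label in examples) / len(examples)

# Hypothetical training and holdout sets: (attributes, target).
training = [
    (("sunny", "short"), "no"),
    (("rainy", "long"), "yes"),
    (("sunny", "long"), "no"),
]
holdout = [
    (("rainy", "short"), "yes"),  # never seen during training
    (("sunny", "short"), "no"),
]

model = train_lookup(training)
train_acc = accuracy(model, training)
holdout_acc = accuracy(model, holdout)
gap = train_acc - holdout_acc  # a large gap suggests overfitting
```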

kNN is a data mining technique that can be used for?

Classification and regression
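
A minimal sketch of both uses, assuming Euclidean distance, a majority vote for classification, and an unweighted average for regression. All data is hypothetical.

```python
from collections import Counter
from math import dist

def k_nearest(query, examples, k):
    """The k examples whose attributes are closest to the query."""
    return sorted(examples, key=lambda ex: dist(query, ex[0]))[:k]

def knn_classify(query, examples, k=3):
    """Predict the most common label among the k nearest neighbours."""
    votes = Counter(label for _, label in k_nearest(query, examples, k))
    return votes.most_common(1)[0][0]

def knn_regress(query, examples, k=3):
    """Predict the average target value of the k nearest neighbours."""
    neighbours = k_nearest(query, examples, k)
    return sum(value for _, value in neighbours) / len(neighbours)

# Hypothetical labelled instances: (attributes, class).
labelled = [((1, 1), "stay"), ((1, 2), "stay"), ((2, 1), "stay"),
            ((8, 8), "churn"), ((9, 8), "churn")]
# Hypothetical valued instances: (attributes, numeric target).
valued = [((1,), 10.0), ((2,), 20.0), ((3,), 30.0), ((10,), 100.0)]

predicted_class = knn_classify((2, 2), labelled, k=3)
predicted_value = knn_regress((2,), valued, k=3)
```

The same neighbour search serves both tasks; only the way the neighbours' targets are combined differs.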

What are the most widely used techniques in data mining?

Classification, regression and clustering.

What does the data mining technique of co-occurrences and associations (market-basket analysis) do?

It finds items that go together. For example, by analyzing market-basket data, you might find that customers who bought a pork sandwich also bought water. Learning these associations can be very useful.
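
A sketch of the counting behind this, assuming simple pair counts and a confidence ratio (the share of baskets containing one item that also contain the other). The baskets are hypothetical.

```python
from collections import Counter
from itertools import combinations

def pair_counts(baskets):
    """Count how often each pair of items appears in the same basket."""
    counts = Counter()
    for basket in baskets:
        # Sort so each pair is always stored in the same order.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

def confidence(baskets, antecedent, consequent):
    """Share of baskets containing `antecedent` that also contain `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)

# Hypothetical market-basket data.
baskets = [
    ["pork sandwich", "water"],
    ["pork sandwich", "water", "chips"],
    ["salad", "water"],
    ["pork sandwich", "chips"],
]

counts = pair_counts(baskets)
sandwich_then_water = confidence(baskets, "pork sandwich", "water")
```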

What is the core of data analytical thinking?

1. Data should be considered an asset 2. It can help to structure business problems 3. Applying data science to a well-structured problem vs. exploratory data mining

What is the aim of generalisation in data analytical thinking?

We want patterns that generalize to data we have not seen

Mention four ways to extract knowledge from data

1. Identifying informative attributes 2. Fitting a numeric function model to data 3. Controlling complexity - generalization and overfitting 4. Calculating similarity between objects
