1.1.2 Introduction to Data Science - Business Problems and Data Science Solutions Flashcards

1
Q

Describe when each type of data mining algorithm, such as classification, regression, similarity matching, clustering, co-occurrence grouping, profiling, link-prediction, data reduction, and causal modeling, should be used.

A

Classification: When you want to predict which class an individual belongs to
Regression: When you want to estimate or predict a numerical variable for each individual
Similarity matching: When you want to identify similar individuals based on data known about them
Clustering: When you want to group individuals in a population by their similarity, without being driven by any specific purpose
Co-occurrence grouping: When you want to find associations between entities based on transactions involving them
Profiling: When you want to characterize the typical behavior of an individual, group, or population
Link-prediction: When you want to predict connections between data items
Data reduction: When you want to take a large data set and replace it with a smaller set for efficiency reasons
Causal modeling: When you want to understand what events or actions actually influence others
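As one illustration of the list above, co-occurrence grouping can be sketched by counting how often pairs of items appear in the same transaction. This is a minimal sketch only: the transactions and the helper name `pair_counts` are hypothetical, and real association mining usually adds measures such as support and confidence.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: each set holds the items bought together.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

def pair_counts(transactions):
    """Count how many transactions contain each unordered pair of items."""
    counts = Counter()
    for items in transactions:
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return counts

counts = pair_counts(transactions)
top_pair, top_count = counts.most_common(1)[0]
print(top_pair, top_count)
```

With this toy data the pair that co-occurs most often is bread and butter, which appears in three of the four transactions.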

2
Q

Explain the differences between regression and classification.

A

Regression estimates or predicts the numerical value of a variable; classification assigns items (e.g. people, companies) to classes, so its target is categorical
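The difference shows up in the type of the target. The sketch below uses the same hypothetical feature (monthly usage) with a numerical target for regression and a categorical target for classification. The data, the least-squares line, and the nearest-centroid rule are illustrative assumptions, not methods named on the card.

```python
# Hypothetical data: monthly usage (hours) for five customers.
usage = [1.0, 2.0, 3.0, 4.0, 5.0]
spend = [10.0, 20.0, 30.0, 40.0, 50.0]      # numerical target -> regression
churned = ["yes", "yes", "no", "no", "no"]  # categorical target -> classification

def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def fit_centroids(xs, labels):
    """Average feature value per class (a nearest-centroid classifier)."""
    groups = {}
    for x, label in zip(xs, labels):
        groups.setdefault(label, []).append(x)
    return {label: sum(v) / len(v) for label, v in groups.items()}

def classify(centroids, x):
    """Return the class whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))

slope, intercept = fit_line(usage, spend)
centroids = fit_centroids(usage, churned)

predicted_spend = slope * 6.0 + intercept   # a number
predicted_class = classify(centroids, 2.5)  # a class label
print(predicted_spend, predicted_class)
```

The regression output is a number on a continuous scale, while the classifier can only return one of the class labels it saw during training.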

3
Q

Contrast supervised learning with unsupervised learning.

A

Supervised learning has a specific target (e.g. will a customer leave after their contract expires?), while unsupervised learning has no specific target or stated purpose (e.g. do customers fall into different groups?)
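The unsupervised half of the contrast can be sketched with a tiny clustering routine: it receives only feature values, with no target column, and looks for groups on its own. The data and the function `two_means` are hypothetical, and the one-dimensional two-cluster k-means used here is just one simple choice of clustering method.

```python
# Hypothetical feature values with no labels attached.
values = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]

def two_means(values, iterations=10):
    """One-dimensional k-means with k=2, started from the min and max values."""
    centers = [min(values), max(values)]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            # Assign each value to its nearest center.
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

centers, groups = two_means(values)
print(centers, groups)
```

A supervised method would need a labeled target for each value; here the two groups emerge from similarity alone, and it is up to the analyst to decide whether they mean anything.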

4
Q

List the algorithms that can be used for supervised and unsupervised learning.

A

Supervised: regression, classification
Unsupervised: clustering

5
Q

Contrast data mining with the use of data mining results.

A

Data mining: mining historical data to produce a model
Use of data mining results: applying the model to new data
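The two phases can be kept apart in code as two functions: one that mines labeled historical data and returns a model, and one that applies that model to records it has never seen. This is a minimal sketch under assumed data; the single-threshold model and the names `mine_threshold` and `apply_model` are hypothetical.

```python
# Hypothetical history: days since last purchase, and whether the customer churned.
days_inactive = [5, 10, 40, 60, 90]
churned = [False, False, True, True, True]

def mine_threshold(values, labels):
    """Data mining step: find the cut-off that classifies the history best."""
    best_threshold, best_correct = None, -1
    for t in sorted(set(values)):
        # Count records where the rule "value >= t" agrees with the label.
        correct = sum((v >= t) == label for v, label in zip(values, labels))
        if correct > best_correct:
            best_threshold, best_correct = t, correct
    return best_threshold

def apply_model(threshold, value):
    """Use step: apply the mined rule to a new, unlabeled record."""
    return value >= threshold

model = mine_threshold(days_inactive, churned)  # done once, on historical data
new_customers = [7, 75]
predictions = [apply_model(model, v) for v in new_customers]
print(model, predictions)
```

Mining happens once on data where the outcome is known; the resulting model is then reused on each new record, where the outcome is not yet known.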

6
Q

List and describe the steps used in Cross Industry Standard Process for Data Mining (CRISP-DM).

A

1. Business Understanding - the business problem should be modelled as one or more data science problems
2. Data Understanding - assess the strengths and weaknesses of the available data, and weigh its costs against its benefits
3. Data Preparation - data often needs to be manipulated (e.g. converted to tabular form) to produce better results
4. Modelling - produce a model or pattern that captures regularities in the data
5. Evaluation - assess the data mining results for validity
6. Deployment - put the data mining results to use to realize a return on investment

7
Q

Explain the reason for having an iterative process involved in CRISP-DM.

A
8
Q

Describe the characteristics of credit card and Medicare fraud.

A

Nearly all credit card fraud is caught, either by the customer or by the credit card company, so credit card transactions carry reliable labels (which suits supervised techniques).

Medicare fraud is more complicated and requires unsupervised learning approaches such as profiling and clustering.
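When no reliable fraud labels exist, one profiling-style approach is to describe typical behavior and flag records that deviate strongly from it. The sketch below assumes hypothetical claim amounts, a simple mean and standard deviation profile, and an arbitrary cut-off of 3 standard deviations; none of these come from the card.

```python
from statistics import mean, pstdev

# Hypothetical past claim amounts for one provider (no fraud labels available).
history = [100.0, 110.0, 95.0, 105.0, 90.0]
new_claims = [102.0, 400.0]

def build_profile(amounts):
    """Summarize typical behavior as a mean and a population standard deviation."""
    return mean(amounts), pstdev(amounts)

def is_unusual(profile, amount, cutoff=3.0):
    """Flag an amount lying more than `cutoff` standard deviations from the mean."""
    center, spread = profile
    return abs(amount - center) / spread > cutoff

profile = build_profile(history)
flags = [is_unusual(profile, amount) for amount in new_claims]
print(profile, flags)
```

A flag here means only that a claim is atypical for this provider; deciding whether it is actually fraudulent still needs investigation.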

9
Q

List the reasons for deploying the data mining system itself rather than the models produced by a data mining system.

A
  1. The world may change faster than the data science team can adapt, as with fraud and intrusion detection
  2. A business has too many modeling tasks for its data science team to manually curate each model individually