L5: Data science solutions Flashcards

1
Q

CRISP-DM

A

Import –> Tidy –> Transform <–> Visualize <–> Model
(Transform, Visualize, and Model repeat as a cycle)

2
Q

Six types of questions

A

Data analysis flowchart

Descriptive
Exploratory
Inferential
Predictive
Causal
Mechanistic

3
Q

What is a model

A

A simplified representation of reality created to serve a purpose

4
Q

What is a predictive model?

A

· A formula for estimating the unknown value of interest: the target
· The formula can be mathematical, a logical statement (e.g., a rule), etc.
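
A minimal sketch of the two formula styles named above: one mathematical, one rule-based. The attribute names (`income`, `balance`, `employed`), the coefficients, and the thresholds are hypothetical and chosen only for illustration.

```python
def predict_spend(instance):
    """Mathematical formula: estimates a numeric target (made-up coefficients)."""
    return 50.0 + 0.02 * instance["income"]


def predict_default(instance):
    """Logical rule: estimates a categorical target (made-up thresholds)."""
    if instance["balance"] > 5000 and not instance["employed"]:
        return "default"
    return "no default"


# One instance, described by a set of attributes.
customer = {"income": 40000, "balance": 6000, "employed": False}
spend = predict_spend(customer)    # numeric estimate of the target
label = predict_default(customer)  # class estimate of the target
```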

5
Q

What is prediction?

A

· Estimate an unknown value (i.e. the target)

Instance / example:
· Represents a fact or a data point
· Described by a set of attributes (fields, columns, variables, or features)

6
Q

Model induction:

A

The creation of models from data
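
As an illustration, model induction can be sketched with ordinary least squares, one of many possible induction algorithms (the card does not prescribe one). The function name `induce_linear_model` and the data are assumptions made for this example.

```python
def induce_linear_model(training_data):
    """Induce y = intercept + slope * x from (x, y) pairs by least squares."""
    n = len(training_data)
    mean_x = sum(x for x, _ in training_data) / n
    mean_y = sum(y for _, y in training_data) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in training_data)
    sxx = sum((x - mean_x) ** 2 for x, _ in training_data)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # The induced model is a formula that maps an attribute to a target.
    return lambda x: intercept + slope * x


# Made-up data that follows y = 1 + 2x exactly.
training_data = [(1, 3), (2, 5), (3, 7), (4, 9)]
model = induce_linear_model(training_data)
prediction = model(5)  # estimate for an unseen instance
```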

7
Q

Training data

A

The input data for the induction algorithm

8
Q

Test data

A

Data used to test the model
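
A small sketch of how training data and test data are kept apart, assuming a simple threshold rule as the induced model. The helper names and the labelled instances are hypothetical.

```python
def induce_threshold_rule(training_data):
    """Pick the threshold on x that classifies the training data best."""
    best_threshold, best_correct = None, -1
    for threshold, _ in training_data:
        correct = sum((x >= threshold) == label for x, label in training_data)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return lambda x: x >= best_threshold


def accuracy(model, data):
    """Share of instances whose predicted label equals the true label."""
    return sum(model(x) == label for x, label in data) / len(data)


# Made-up labelled instances: (attribute value, target).
data = [(1, False), (2, False), (6, True), (7, True),
        (3, False), (8, True), (5, True), (4, False)]
training_data, test_data = data[:6], data[6:]  # test data is held out

model = induce_threshold_rule(training_data)
train_accuracy = accuracy(model, training_data)
test_accuracy = accuracy(model, test_data)
```

In this toy run the rule fits the training data perfectly but misclassifies one of the two held-out instances, which is the kind of gap that test data is meant to reveal.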

9
Q

How to choose a model

A

Based on the question!
· Descriptive questions call for descriptive statistics or unsupervised learning
· Predictive questions are best answered with predictive models, e.g. machine learning
· For all other questions, inferential statistics are probably your best bet
· Pay attention to
o Your dependent variable: its kind
o Your independent variables: their kind and number
o Assumptions

10
Q

Non parametric tests

A

Non-parametric tests are used when the data do not meet the assumptions of parametric tests. This is the case if any of the following holds:

· the dependent variable is not a continuous interval-scaled or ratio-scaled variable

· the errors (also called residuals), i.e. the differences between the observed values and the expected or predicted values, do not approximate a normal distribution

· the dependent variable is ordinal (it represents ranks)
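
As one example of a rank-based approach, the sketch below computes the Mann-Whitney U statistic for two independent groups of ordinal ratings. The choice of test is an assumption (the card names no specific test), the data are made up, and only the statistic is computed, not a p-value.

```python
def mann_whitney_u(sample_a, sample_b):
    """U statistic for sample_a: pairs where a > b count 1, ties count 0.5."""
    u = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Made-up ordinal ratings (1 = worst, 5 = best) from two groups.
group_a = [4, 5, 3, 5]
group_b = [2, 3, 1, 2]

u_a = mann_whitney_u(group_a, group_b)
u_b = mann_whitney_u(group_b, group_a)
# u_a + u_b always equals len(group_a) * len(group_b).
```

Because only the ordering of the values matters, the statistic does not depend on the ratings being interval-scaled or normally distributed.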

11
Q

Chi-square test

A

The test evaluates whether the observed frequencies in a contingency table match the frequencies that would be expected if the two categorical variables were independent.

E.g. suppose you want to test whether there is an association between gender (male, female) and preference for a particular product (like, dislike).

Assumptions:
· Observations are independent of each other.
· Categories are mutually exclusive.
· Expected frequency for each cell should be 5 or more for a 2x2 table. For larger tables, 80% of the cells should have expected frequencies of 5 or more, and all cells should have expected frequencies of 1 or more.
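
The gender and preference example can be sketched as follows. The counts are made up, the function name `chi_square_independence` is hypothetical, and no continuity correction is applied. The closed-form p-value holds only for a 2x2 table (1 degree of freedom).

```python
import math


def chi_square_independence(table):
    """Return the chi-square statistic and the expected frequencies."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    # Expected count under independence: row total * column total / grand total.
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    statistic = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    return statistic, expected


# Made-up counts. Rows: male, female. Columns: like, dislike.
observed = [[30, 20], [20, 30]]
statistic, expected = chi_square_independence(observed)

# Assumption check for a 2x2 table: every expected frequency is 5 or more.
assumption_ok = all(e >= 5 for row in expected for e in row)

# With 1 degree of freedom the upper-tail probability is erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(statistic / 2))
```

A small p-value suggests that the observed frequencies differ from what independence would produce. For larger tables the degrees of freedom change and this shortcut for the p-value no longer applies.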
