L5: Data science solutions Flashcards

Question 1

Q

CRISP-DM

Answer

A

Import –> Tidy <–> Visualize <–> Model <–> Transform

Question 2

Q

Six types of questions

Answer

A

Data analysis flowchart

Descriptive
Exploratory
Inferential
Predictive
Causal
Mechanistic

Question 3

Q

What is a model

Answer

A

A simplified representation of reality created to serve a purpose

Question 4

Q

What is a predictive model?

Answer

A

· A formula for estimating the unknown value of interest: the target
· The formula can be a mathematical expression, a logical statement (e.g., a rule), etc.
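
A minimal sketch of the two forms named on this card, a mathematical formula and a logical rule. The targets (house price, churn), coefficients, thresholds, and function names are all hypothetical and chosen only for illustration.

```python
def predict_price(area_m2):
    """Mathematical formula: a linear model with hypothetical coefficients."""
    return 50_000 + 2_000 * area_m2


def predict_churn(months_inactive, support_tickets):
    """Logical statement (rule) with hypothetical thresholds."""
    return months_inactive >= 3 and support_tickets > 2


# Each call estimates the unknown target for one instance.
print(predict_price(80))
print(predict_churn(4, 3))
```

In both cases the model takes the attributes of an instance as input and returns an estimate of the target.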

Question 5

Q

What is prediction?

Answer

A

· Estimating an unknown value (i.e., the target)

Instance / example:
· Represents a fact or a data point
· Described by a set of attributes (fields, columns, variables, or features)

Question 6

Q

Model induction:

Answer

A

The creation of models from data

Question 7

Q

Training data

Answer

A

The input data for the induction algorithm

Question 8

Q

Test data

Answer

A

Data used to test the model
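
The roles of training data, model induction, and test data can be sketched together. This is an illustrative example, not a prescribed method: the split fraction, the seed, the least-squares line used as the induction algorithm, and the synthetic data are all assumptions.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split them into training and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def induce_line(train):
    """Induce a model y = a + b * x from the training data by least squares."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def mean_absolute_error(model, test):
    """Average absolute gap between predicted and observed targets."""
    a, b = model
    return sum(abs(y - (a + b * x)) for x, y in test) / len(test)


# Hypothetical instances (attribute x, target y), generated as y = 3 + 2x.
data = [(x, 3 + 2 * x) for x in range(20)]
train, test = train_test_split(data)
model = induce_line(train)              # induction sees only the training data
error = mean_absolute_error(model, test)  # the test data checks the model
print(len(train), len(test), model, error)
```

Keeping the test instances out of the induction step is what makes the error a fair estimate of how the model behaves on unseen data.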

Question 9

Q

How to choose a model

Answer

A

Based on the question:
· Descriptive questions demand descriptive statistics or unsupervised learning
· Predictive questions are best answered with predictive models, e.g. machine learning
· For all other questions, inferential statistics are probably your best bet
· Pay attention to
o Your dependent variable: its kind
o Your independent variables: their kind and number
o Assumptions
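
The rule of thumb above can be written as a small lookup. This is only a sketch of the card's own wording; the function name `suggest_approach` is hypothetical, and a real choice also depends on the variable kinds and assumptions listed above.

```python
def suggest_approach(question_type):
    """Map a question type to the family of methods named on this card."""
    kind = question_type.strip().lower()
    if kind == "descriptive":
        return "descriptive statistics or unsupervised learning"
    if kind == "predictive":
        return "predictive models, e.g. machine learning"
    # Exploratory, inferential, causal, and mechanistic questions fall here.
    return "inferential statistics"


for kind in ("Descriptive", "Predictive", "Causal"):
    print(kind, "->", suggest_approach(kind))
```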

Question 10

Q

Non parametric tests

Answer

A

Non-parametric tests are used when the data do not meet the assumptions of parametric tests. This is the case if
· the dependent variable does not represent a continuous interval-scaled or ratio-scaled variable
· the errors (also called residuals), which represent the difference between the expected or predicted values and the observed values, do not approximate a normal distribution
· the dependent variable is ordinal (it represents ranks)
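
As an illustration of a rank-based approach to ordinal data, the sketch below computes the Mann-Whitney U statistic for two groups. The card does not name a specific test, so the choice of Mann-Whitney is an assumption, and the ratings are made up.

```python
def mann_whitney_u(sample_a, sample_b):
    """U statistic for sample_a: pairs (a, b) with a > b, ties counted as 0.5."""
    u = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Hypothetical ordinal ratings (1 = worst, 5 = best) from two groups.
group_a = [1, 2, 2, 3, 3]
group_b = [3, 4, 4, 5, 5]

u_a = mann_whitney_u(group_a, group_b)
u_b = mann_whitney_u(group_b, group_a)
print(u_a, u_b)  # the two values always sum to len(group_a) * len(group_b)
```

The statistic uses only the ordering of the values, not their distances, which is why it suits ranked data. Judging significance would additionally need a critical-value table or a statistics library.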

Question 11

Q

Chi-square test

Answer

A

The test evaluates whether the observed frequencies in a contingency table match the expected frequencies if the two categorical variables are independent.

E.g., suppose you want to test whether there is an association between gender (male, female) and preference for a particular product (like, dislike).

Assumptions:
· Observations are independent of each other.
· Categories are mutually exclusive.
· Expected frequency for each cell should be 5 or more for a 2x2 table. For larger tables, 80% of the cells should have expected frequencies of 5 or more, and all cells should have expected frequencies of 1 or more.
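
A minimal sketch of the test for the gender and product-preference example, using only the standard library. The counts are hypothetical, no continuity correction is applied, and the p-value formula shown holds only for a 2x2 table (1 degree of freedom).

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    # Expected count under independence: row total * column total / grand total.
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    statistic = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    # With 1 degree of freedom, P(X > s) = erfc(sqrt(s / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    # 2x2 assumption from the card: every expected frequency is 5 or more.
    assumption_ok = all(exp >= 5 for row in expected for exp in row)
    return statistic, p_value, expected, assumption_ok


# Hypothetical counts: rows = gender (male, female), columns = (like, dislike).
observed = [[30, 20], [20, 30]]
statistic, p_value, expected, assumption_ok = chi_square_2x2(observed)
print(statistic, round(p_value, 4), expected, assumption_ok)
```

A small p-value suggests the observed frequencies differ from what independence would predict, so the two variables appear to be associated.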

(12 cards)