notions Flashcards

(37 cards)

1
Q

actionable insight

A

an operational insight that can be implemented

2
Q

bad data

A

garbage in, garbage out: an analysis built on bad data produces bad conclusions

3
Q

re-create analysis

A

re-creating (reproducing) an analysis is quite difficult but important

4
Q

bias

A

factors that can influence the decision in the wrong way

5
Q

analytics workflow

A

modularity: which tools and approaches are used at each step of the workflow

6
Q

Directed acyclic graph

A

a directed graph with no directed cycles. That is, it consists of vertices and edges, with each edge directed from one vertex to another, such that following those directions never forms a closed loop.
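
A minimal sketch of this idea using Python's standard-library graphlib (node names are made up for illustration): a topological order exists exactly when the graph is a DAG, and a closed loop raises CycleError.

```
from graphlib import TopologicalSorter, CycleError

# Each key maps a vertex to its predecessors: a -> b, a -> c, b -> c.
dag = {"b": {"a"}, "c": {"a", "b"}}
print(list(TopologicalSorter(dag).static_order()))  # ['a', 'b', 'c']

# Adding the edge c -> a closes a loop, so this is no longer a DAG.
cyclic = {"b": {"a"}, "c": {"a", "b"}, "a": {"c"}}
try:
    list(TopologicalSorter(cyclic).static_order())
except CycleError:
    print("cycle detected: not a DAG")
```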

7
Q

Airflow

A

a workflow manager used instead of crontab; it schedules and runs workflows defined as DAGs
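
A minimal sketch of an Airflow DAG file (the dag_id, task names, and daily schedule are made up; the parameter is named schedule in Airflow 2.4+ and schedule_interval in older releases):

```
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="daily_report",            # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                # what a crontab entry would otherwise do
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=lambda: print("extract"))
    report = PythonOperator(task_id="report", python_callable=lambda: print("report"))

    extract >> report  # a DAG edge: extract must finish before report starts
```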

8

Q

Sum of squares total / regression / error

A

SST/TSS (total sum of squares): sum over i = 1..n of (y_i - mean)^2
SSR/ESS (regression / explained sum of squares): sum of squared differences between the predicted values and the mean
SSE/RSS (residual sum of squares): sum of squared differences between the predicted and the actual values

SST = SSR + SSE
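
A quick numpy check of these definitions on made-up points, fitting an OLS line and verifying SST = SSR + SSE (the identity holds for an OLS fit with an intercept):

```
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

slope, intercept = np.polyfit(x, y, 1)  # OLS line
y_hat = slope * x + intercept

sst = np.sum((y - y.mean()) ** 2)       # total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)   # explained (regression) sum of squares
sse = np.sum((y - y_hat) ** 2)          # residual sum of squares

print(np.isclose(sst, ssr + sse))       # True
```
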
9

Q

Dependent variable

A

The one we are trying to predict

10

Q

OLS

A

Ordinary Least Squares: fits the regression by minimizing SSE
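
For a single feature the OLS solution has a closed form; a sketch on the same made-up points, checked against np.polyfit:

```
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# The slope and intercept that minimize the sum of squared residuals (SSE).
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

print(slope, intercept)
print(np.polyfit(x, y, 1))  # same pair: [slope, intercept]
```
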
11

Q

R-squared

A

R^2 = SSR/SST; 1 is best, 0 is worst

R-squared measures how much of the total variability is explained by the model

12
Q

adjusted R-squared

A

measures how well your model fits the data. However, it penalizes the use of variables that are meaningless for the regression.

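A sketch of both measures above (n observations, p features; the adjusted formula used here is the standard 1 - (1 - R^2)(n - 1)/(n - p - 1)):

```
import numpy as np

def r_squared(y, y_hat):
    sse = np.sum((y - y_hat) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1 - sse / sst  # equals SSR/SST for an OLS fit

def adjusted_r_squared(r2, n, p):
    # Penalizes extra features: a meaningless variable lowers the score.
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
y_hat = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

r2 = r_squared(y, y_hat)
print(r2, adjusted_r_squared(r2, n=len(y), p=1))
```
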
13
Q

F-statistic

A

tests the overall significance of the regression: the null hypothesis is that all coefficients (except the intercept) are zero, i.e., that the model explains nothing; a low p-value rejects it

14
Q

Linearity

A

a linear regression assumption: the relationship between the features and the target is linear, i.e., a linear function fits the data well

15

Q

No endogeneity

A

a linear regression assumption: the error term is not correlated with the regressors (violated, for example, by omitted variable bias)

16

Q

Feature/target in ML

A

the independent variable (feature) is used to predict the dependent variable (target)

17

Q

Regression intercept

A

the point where the regression line crosses the y-axis

18

Q

Regression coefficient

A

the coefficient by which the feature is multiplied in the regression equation

19

Q

p-value of the feature

A

The p-value for each term tests the null hypothesis that the coefficient is equal to zero (no effect). A low p-value (< 0.05) indicates that you can reject the null hypothesis, i.e., that the feature is useful in the model.
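
A sketch with statsmodels on made-up data: only the first feature actually drives y, so its p-value comes out near zero while the second one does not:

```
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))            # two candidate features
y = 3 * X[:, 0] + rng.normal(size=100)   # only the first one matters

model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.pvalues)  # tiny for x1, large for x2 (and for the constant here)
```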

20

Q

F-regression

A

runs a separate univariate regression of the target on each feature (useful when there are many features) and scores each one
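
In scikit-learn this is f_regression; a sketch on a synthetic dataset where only one of three features is informative:

```
from sklearn.datasets import make_regression
from sklearn.feature_selection import f_regression

X, y = make_regression(n_samples=100, n_features=3, n_informative=1, random_state=0)

F, p_values = f_regression(X, y)  # one univariate F-test per feature
print(F)
print(p_values)                   # near zero only for the informative feature
```
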
21

Q

median vs mean (average)

A

the median is the middle value (the 50th percentile); the mean (average) is the sum divided by the number of elements

22

Q

standardization

A

find the mean and the standard deviation, then compute (value - mean) / deviation
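
The same recipe in numpy, checked against scikit-learn's StandardScaler (which standardizes each column this way):

```
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0]])

manual = (x - x.mean()) / x.std()           # (value - mean) / std deviation
scaled = StandardScaler().fit_transform(x)  # same result, column by column

print(np.allclose(manual, scaled))          # True
```
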
23

Q

underfitting/overfitting

A

underfitting: low accuracy, the model does not capture the underlying logic; overfitting: deceptively high accuracy, the model captures all the noise. Detected by splitting the data into train (75%) and test (25%) sets, as in the sketch below.
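
The 75/25 split with scikit-learn (made-up regression data); comparing train and test scores is how over- and underfitting show up:

```
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=100, n_features=2, random_state=0)

# Hold out 25% of the rows: a model that memorized noise (overfitting)
# scores well on the train set but much worse on the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(len(X_train), len(X_test))  # 75 25
```
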
24

Q

Multicollinearity

A

two or more features are strongly correlated with each other, which makes the individual coefficient estimates unstable and hard to interpret

25

Q

Dummy variables

A

for a categorical feature (BMW, Audi, Opel) we create n-1 columns with dummy variables (1 if BMW, 0 if not); no column is needed for Opel, since "not BMW and not Audi" already implies Opel
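
With pandas, get_dummies(drop_first=True) produces exactly those n-1 columns; a sketch with the brands from the card:

```
import pandas as pd

df = pd.DataFrame({"brand": ["BMW", "Audi", "Opel", "BMW"]})

# drop_first=True omits one category (Audi, the first alphabetically):
# a row with all zeros therefore means "Audi".
print(pd.get_dummies(df, columns=["brand"], drop_first=True))
```
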
26

Q

Data cleaning

A

remove outliers (e.g., by quantile cutoffs) and handle missing values

27

Q

Models: linear, quadratic, exponential, logistic

A

linear: y = ax + b; quadratic: y = ax^2 + bx + c; exponential: y = a * b^x; logistic: regression for a categorical outcome

28

Q

MLE

A

maximum likelihood estimation

29

Q

Clusters

A

maximize the similarity within a cluster and the dissimilarity between clusters

30

Q

Cluster analysis

A

unsupervised learning, as we don't know the outcomes in advance; classification, by contrast, deals with known outcomes and can be trained on labeled training data

31

Q

Centroid

A

the center of mass of all the data points in a cluster

32

Q

K-means

A

1) Choose K. 2) Pick the seeds (initial centroids). 3) Assign each point to the closest centroid. 4) Adjust each centroid based on its assigned points. Repeat steps 3-4 until the clusters stabilize (see the sketch below).
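
These steps are what scikit-learn's KMeans runs internally; a minimal sketch on synthetic blobs:

```
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# n_clusters is K; the default init picks the seeds (see the cluster seeds card).
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(km.cluster_centers_)  # final centroids after the assign/adjust loop
print(km.labels_[:10])      # cluster assigned to each of the first 10 points
```
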
33

Q

WCSS

A

Within-Cluster Sum of Squares; used with the elbow method to determine the number of clusters. In sklearn, the WCSS of a fitted model is kmeans.inertia_.
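
The elbow method in code: fit KMeans for a range of K and inspect inertia_ (the WCSS); the K where the drop flattens out is the elbow:

```
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

wcss = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)  # within-cluster sum of squares for this K

print(wcss)  # the curve flattens after k=3 on this data: that is the elbow
```
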
34

Q

Cluster seeds

A

we need to choose the points from which to build the clusters; the k-means++ method does this and is already integrated into KMeans as the default initialization

35

Q

Cluster analysis pros and cons

A

Pros: simple to understand; fast to cluster; widely available; easy to implement.

Cons: we need to pick K (use the elbow method); sensitive to initialization (use k-means++); sensitive to outliers (remove them first); produces spherical solutions (since it uses Euclidean distance from the centroid); requires standardization.

36

Q

Class of clusters

A

flat (e.g., k-means) vs hierarchical (e.g., a taxonomy of species)

37

Q

IQR

A

Interquartile Range: the difference between the 75th and the 25th percentiles
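
A common use of the IQR is the boxplot outlier rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]. A numpy sketch on made-up values:

```
import numpy as np

x = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 40])  # 40 is an obvious outlier

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print(x[(x < low) | (x > high)])  # [40]
```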