notions Flashcards

(37 cards)

1
Q

actionable insight

A

an operational insight that can be implemented

2
Q

bad data

A

garbage in, garbage out: an analysis built on bad data produces bad conclusions

3
Q

re-create analysis

A

re-creating (reproducing) an analysis is quite difficult but important

4
Q

bias

A

factors that can influence the decision in the wrong way

5
Q

analytics workflow

A

modularity: which tools and approaches are used at each step of the workflow

6
Q

Directed acyclic graph

A

a directed graph with no directed cycles. That is, it consists of vertices and edges, with each edge directed from one vertex to another, such that following those directions never forms a closed loop.
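
A minimal sketch of this idea using Python's standard-library graphlib (node names are made up for illustration): a topological order exists exactly when the graph is a DAG, and a closed loop raises CycleError.

```
from graphlib import TopologicalSorter, CycleError

# Each key maps a vertex to its predecessors: a -> b, a -> c, b -> c.
dag = {"b": {"a"}, "c": {"a", "b"}}
print(list(TopologicalSorter(dag).static_order()))  # ['a', 'b', 'c']

# Adding the edge c -> a closes a loop, so this is no longer a DAG.
cyclic = {"b": {"a"}, "c": {"a", "b"}, "a": {"c"}}
try:
    list(TopologicalSorter(cyclic).static_order())
except CycleError:
    print("cycle detected: not a DAG")
```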

7
Q

Airflow

A

a workflow manager used instead of crontab; it schedules and runs workflows defined as DAGs
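
A minimal sketch of an Airflow DAG file (the dag_id, task names, and daily schedule are made up; the parameter is named schedule in Airflow 2.4+ and schedule_interval in older releases):

```
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="daily_report",            # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                # what a crontab entry would otherwise do
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=lambda: print("extract"))
    report = PythonOperator(task_id="report", python_callable=lambda: print("report"))

    extract >> report  # a DAG edge: extract must finish before report starts
```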

8

Q

Sum of squares total / regression / error

A

SST/TSS (total sum of squares): sum over i = 1..n of (y_i - mean)^2
SSR/ESS (regression / explained sum of squares): sum of squared differences between the predicted values and the mean
SSE/RSS (residual sum of squares): sum of squared differences between the predicted and the actual values

SST = SSR + SSE
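
A quick numpy check of these definitions on made-up points, fitting an OLS line and verifying SST = SSR + SSE (the identity holds for an OLS fit with an intercept):

```
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

slope, intercept = np.polyfit(x, y, 1)  # OLS line
y_hat = slope * x + intercept

sst = np.sum((y - y.mean()) ** 2)       # total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)   # explained (regression) sum of squares
sse = np.sum((y - y_hat) ** 2)          # residual sum of squares

print(np.isclose(sst, ssr + sse))       # True
```
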
9

Q

Dependent variable

A

The one we are trying to predict

10

Q

OLS

A

Ordinary Least Squares: fits the regression by minimizing SSE
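
For a single feature the OLS solution has a closed form; a sketch on the same made-up points, checked against np.polyfit:

```
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# The slope and intercept that minimize the sum of squared residuals (SSE).
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

print(slope, intercept)
print(np.polyfit(x, y, 1))  # same pair: [slope, intercept]
```
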
11

Q

R-squared

A

R^2 = SSR/SST; 1 is best, 0 is worst

R-squared measures how much of the total variability is explained by the model

12
Q

adjusted R-squared

A

measures how well your model fits the data. However, it penalizes the use of variables that are meaningless for the regression.

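A sketch of both measures above (n observations, p features; the adjusted formula used here is the standard 1 - (1 - R^2)(n - 1)/(n - p - 1)):

```
import numpy as np

def r_squared(y, y_hat):
    sse = np.sum((y - y_hat) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1 - sse / sst  # equals SSR/SST for an OLS fit

def adjusted_r_squared(r2, n, p):
    # Penalizes extra features: a meaningless variable lowers the score.
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
y_hat = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

r2 = r_squared(y, y_hat)
print(r2, adjusted_r_squared(r2, n=len(y), p=1))
```
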
13
Q

F-statistic

A

tests the overall significance of the regression: the null hypothesis is that all coefficients (except the intercept) are zero, i.e., that the model explains nothing; a low p-value rejects it

14
Q

Linearity

A

a linear regression assumption: the relationship between the features and the target is linear, i.e., a linear function fits the data well

15

Q

No endogeneity

A

a linear regression assumption: the error term is not correlated with the regressors (violated, for example, by omitted variable bias)

16

Q

Feature/target in ML

A

the independent variable (feature) is used to predict the dependent variable (target)

17

Q

Regression intercept

A

the point where the regression line crosses the y-axis

18

Q

Regression coefficient

A

the coefficient by which the feature is multiplied in the regression equation

19

Q

p-value of the feature

A

The p-value for each term tests the null hypothesis that the coefficient is equal to zero (no effect). A low p-value (< 0.05) indicates that you can reject the null hypothesis, i.e., that the feature is useful in the model.
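
A sketch with statsmodels on made-up data: only the first feature actually drives y, so its p-value comes out near zero while the second one does not:

```
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))            # two candidate features
y = 3 * X[:, 0] + rng.normal(size=100)   # only the first one matters

model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.pvalues)  # tiny for x1, large for x2 (and for the constant here)
```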

20

Q

F-regression

A

runs a separate univariate regression of the target on each feature (useful when there are many features) and scores each one
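
In scikit-learn this is f_regression; a sketch on a synthetic dataset where only one of three features is informative:

```
from sklearn.datasets import make_regression
from sklearn.feature_selection import f_regression

X, y = make_regression(n_samples=100, n_features=3, n_informative=1, random_state=0)

F, p_values = f_regression(X, y)  # one univariate F-test per feature
print(F)
print(p_values)                   # near zero only for the informative feature
```
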
21

Q

median vs mean (average)

A

the median is the middle value (the 50th percentile); the mean (average) is the sum divided by the number of elements

22

Q

standardization

A

find the mean and the standard deviation, then compute (value - mean) / deviation
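
The same recipe in numpy, checked against scikit-learn's StandardScaler (which standardizes each column this way):

```
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0]])

manual = (x - x.mean()) / x.std()           # (value - mean) / std deviation
scaled = StandardScaler().fit_transform(x)  # same result, column by column

print(np.allclose(manual, scaled))          # True
```
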
23

Q

underfitting/overfitting

A

underfitting: low accuracy, the model does not capture the underlying logic; overfitting: deceptively high accuracy, the model captures all the noise. Detected by splitting the data into train (75%) and test (25%) sets, as in the sketch below.
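
The 75/25 split with scikit-learn (made-up regression data); comparing train and test scores is how over- and underfitting show up:

```
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=100, n_features=2, random_state=0)

# Hold out 25% of the rows: a model that memorized noise (overfitting)
# scores well on the train set but much worse on the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(len(X_train), len(X_test))  # 75 25
```
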
24

Q

Multicollinearity

A

two or more features are strongly correlated with each other, which makes the individual coefficient estimates unstable and hard to interpret

25

Q

Dummy variables

A

for a categorical feature (BMW, Audi, Opel) we create n-1 columns with dummy variables (1 if BMW, 0 if not); no column is needed for Opel, since "not BMW and not Audi" already implies Opel
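
With pandas, get_dummies(drop_first=True) produces exactly those n-1 columns; a sketch with the brands from the card:

```
import pandas as pd

df = pd.DataFrame({"brand": ["BMW", "Audi", "Opel", "BMW"]})

# drop_first=True omits one category (Audi, the first alphabetically):
# a row with all zeros therefore means "Audi".
print(pd.get_dummies(df, columns=["brand"], drop_first=True))
```
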
26

Q

Data cleaning

A

remove outliers (e.g., by quantile cutoffs) and handle missing values

27

Q

Models: linear, quadratic, exponential, logistic

A

linear: y = ax + b; quadratic: y = ax^2 + bx + c; exponential: y = a * b^x; logistic: regression for a categorical outcome

28

Q

MLE

A

maximum likelihood estimation

29

Q

Clusters

A

maximize the similarity within a cluster and the dissimilarity between clusters

30

Q

Cluster analysis

A

unsupervised learning, as we don't know the outcomes in advance; classification, by contrast, deals with known outcomes and can be trained on labeled training data

31

Q

Centroid

A

the center of mass of all the data points in a cluster

32

Q

K-means

A

1) Choose K. 2) Pick the seeds (initial centroids). 3) Assign each point to the closest centroid. 4) Adjust each centroid based on its assigned points. Repeat steps 3-4 until the clusters stabilize (see the sketch below).
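
These steps are what scikit-learn's KMeans runs internally; a minimal sketch on synthetic blobs:

```
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# n_clusters is K; the default init picks the seeds (see the cluster seeds card).
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(km.cluster_centers_)  # final centroids after the assign/adjust loop
print(km.labels_[:10])      # cluster assigned to each of the first 10 points
```
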
33

Q

WCSS

A

Within-Cluster Sum of Squares; used with the elbow method to determine the number of clusters. In sklearn, the WCSS of a fitted model is kmeans.inertia_.
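
The elbow method in code: fit KMeans for a range of K and inspect inertia_ (the WCSS); the K where the drop flattens out is the elbow:

```
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

wcss = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)  # within-cluster sum of squares for this K

print(wcss)  # the curve flattens after k=3 on this data: that is the elbow
```
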
34

Q

Cluster seeds

A

we need to choose the points from which to build the clusters; the k-means++ method does this and is already integrated into KMeans as the default initialization

35

Q

Cluster analysis pros and cons

A

Pros: simple to understand; fast to cluster; widely available; easy to implement.

Cons: we need to pick K (use the elbow method); sensitive to initialization (use k-means++); sensitive to outliers (remove them first); produces spherical solutions (since it uses Euclidean distance from the centroid); requires standardization.

36

Q

Class of clusters

A

flat (e.g., k-means) vs hierarchical (e.g., a taxonomy of species)

37

Q

IQR

A

Interquartile Range: the difference between the 75th and the 25th percentiles
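
A common use of the IQR is the boxplot outlier rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]. A numpy sketch on made-up values:

```
import numpy as np

x = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 40])  # 40 is an obvious outlier

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print(x[(x < low) | (x > high)])  # [40]
```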