data science - statistics I Flashcards
EDA
Exploratory Data Analysis - the first step of a data science project: familiarizing yourself with the data
continuous data
data that can take any value in an interval (float)
discrete data
data that can only take integer values, such as counts (int)
categorical data
data that can take on only a specific set of values representing a set of possible categories
binary data
special case of categorical data with just two category values (ex: 0/1, true/false)
ordinal data
categorical data that has an explicit ordering
feature
often a column in a table, attribute/predictor of a row of data
record
a row in a table
data scientists use features to predict a target, while statisticians...
use predictor variables in a model to predict a response/dependent variable
trimmed mean
average of all values after dropping a fixed number of extreme values from each end of the sorted data
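A minimal sketch of the trimmed mean, assuming the same number of values `p` is dropped from each end of the sorted data. The function name `trimmed_mean` and the sample data are made up for illustration.

```python
def trimmed_mean(values, p):
    """Mean of the values left after dropping the p smallest and p largest."""
    if p < 0 or 2 * p >= len(values):
        raise ValueError("p must satisfy 0 <= 2*p < len(values)")
    ordered = sorted(values)
    kept = ordered[p:len(ordered) - p]  # drop p values from each end
    return sum(kept) / len(kept)


data = [2, 3, 4, 5, 100]          # hypothetical data with one extreme value
print(trimmed_mean(data, 0))      # plain mean: 22.8
print(trimmed_mean(data, 1))      # drops 2 and 100: 4.0
```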
robust
not sensitive to extreme values
x-bar
the mean of a sample drawn from a population (the population mean itself is written μ)
reasons to use a weighted mean
1) observations that are highly variable may be given a lower weight (ex: a sensor that is less accurate)
2) the data collected doesn’t equally represent the different groups that we are interested in measuring (ex: give greater weight to groups that are underrepresented in the sample)
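As an illustration of reason 1, here is a small sketch of a weighted mean in which a less accurate sensor gets a lower weight. The readings and weights are invented for the example, and `weighted_mean` is a hypothetical helper name.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


readings = [10.0, 12.0, 20.0]   # third sensor is assumed to be less accurate
weights = [1.0, 1.0, 0.5]       # so its reading counts half as much
print(weighted_mean(readings, weights))   # pulled less toward 20 than the plain mean
print(sum(readings) / len(readings))      # plain mean for comparison
```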
why is the median more robust than mean as an estimate of central tendency
It isn’t influenced by outliers / extreme cases that could skew the results
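A quick way to see this is to add one extreme value to a small dataset and compare how far each estimate moves. The sketch below uses Python's standard `statistics` module; the data are invented.

```python
import statistics

clean = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 500]   # same data, but the largest value is extreme

# How far does each estimate move when the outlier appears?
mean_shift = statistics.mean(with_outlier) - statistics.mean(clean)
median_shift = statistics.median(with_outlier) - statistics.median(clean)

print("mean moved by", mean_shift)      # large shift
print("median moved by", median_shift)  # no shift for this data
```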
what is thought to be a compromise between mean and median
trimmed mean - robust to extreme values in data, but uses more data to calculate the estimate for central tendency than median
variance
sum of squared deviations from the mean divided by n - 1, where n is the number of data values (aka: mean-squared-error)
standard deviation
the square root of variance (aka: L2-norm, Euclidean norm)
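The two definitions above can be sketched directly and checked against the standard library, whose `statistics.variance` and `statistics.stdev` also divide by n - 1. The helper names and the data are assumptions made for the example.

```python
import math
import statistics


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_std(values):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(values))


data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data, mean is 5
print(sample_variance(data), statistics.variance(data))
print(sample_std(data), statistics.stdev(data))
```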
range
difference between largest and smallest value in a dataset
interquartile range
the difference between the 75th percentile and the 25th percentile
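A short sketch of the range and the interquartile range using `statistics.quantiles` from the standard library. Percentile conventions differ between tools; the `method="inclusive"` choice here is an assumption for the example, and other conventions can give slightly different quartiles.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # hypothetical data

# Range: largest value minus smallest value.
value_range = max(data) - min(data)

# n=4 returns the three cut points: 25th, 50th and 75th percentiles.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print("range:", value_range)
print("IQR:", iqr)
```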
mean absolute deviation
mean of the absolute values of the deviations from the mean
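A minimal sketch of the mean absolute deviation as defined above. The function name and data are made up for illustration.

```python
def mean_absolute_deviation(values):
    """Average of |x - mean| over all values."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)


data = [2, 4, 4, 4, 5, 5, 7, 9]         # hypothetical data, mean is 5
print(mean_absolute_deviation(data))    # 12 / 8 = 1.5
```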
Why use n-1 instead of n when calculating variance?
When dividing by n, you will tend to underestimate the true variance and standard deviation of the population; dividing by n - 1 gives an unbiased estimate of the variance.
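One way to check this claim without random simulation is to list every possible sample of size 2 (drawn with replacement) from a tiny population and average the two variance estimates. The population `[1, 2, 3]` and the helper `variance` are assumptions chosen to keep the arithmetic exact.

```python
from fractions import Fraction
from itertools import product


def variance(values, divisor_offset):
    """Sum of squared deviations divided by (n - divisor_offset)."""
    n = len(values)
    mean = Fraction(sum(values), n)
    return sum((x - mean) ** 2 for x in values) / (n - divisor_offset)


population = [1, 2, 3]
true_variance = variance(population, 0)          # population variance: divide by N

samples = list(product(population, repeat=2))    # all 9 ordered samples of size 2
avg_with_n = sum(variance(s, 0) for s in samples) / len(samples)
avg_with_n_minus_1 = sum(variance(s, 1) for s in samples) / len(samples)

print("true variance:      ", true_variance)
print("average using n:    ", avg_with_n)          # smaller than the true variance
print("average using n - 1:", avg_with_n_minus_1)  # matches the true variance
```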
what measure of variability is most robust to extremes / outliers?
median absolute deviation from the median (MAD) = median(|x1 - m|, |x2 - m|, ..., |xN - m|), where m is the median
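A small sketch of the median absolute deviation from the median, compared with the standard deviation on data that contain one extreme value. The function name and data are invented for the example.

```python
import statistics


def median_absolute_deviation(values):
    """Median of |x - m|, where m is the median of the values."""
    m = statistics.median(values)
    return statistics.median(abs(x - m) for x in values)


data = [1, 2, 3, 4, 100]   # hypothetical data with one extreme value
print("MAD:", median_absolute_deviation(data))   # barely affected by the 100
print("std:", statistics.stdev(data))            # inflated by the 100
```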
Graph that shows the min/max, IQR, median
box plot, box and whisker plot
tally of the count of data falling into intervals/bins
frequency table
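A minimal sketch of building a frequency table with equal-width bins. The bin count, the choice to put the maximum value in the last bin, and the name `frequency_table` are assumptions for the example.

```python
def frequency_table(values, n_bins):
    """Return (bin_start, bin_end, count) for n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    if n_bins < 1 or hi == lo:
        raise ValueError("need n_bins >= 1 and at least two distinct values")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in values:
        # The maximum value would land one past the end, so clamp it
        # into the last bin.
        index = min(int((x - lo) / width), n_bins - 1)
        counts[index] += 1
    return [(lo + i * width, lo + (i + 1) * width, counts[i])
            for i in range(n_bins)]


data = [1, 2, 2, 3, 5, 7, 8, 8, 9, 10]   # hypothetical data
for start, end, count in frequency_table(data, 3):
    print(f"{start:>5.1f} to {end:>5.1f}: {count}")
```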