data science - statistics I Flashcards

1
Q

EDA

A

Exploratory Data Analysis - first step of the data science project, familiarizing yourself with the data

2
Q

continuous data

A

data that can take any value in an interval (float)

3
Q

discrete data

A

data that can only take integer values, such as counts (int)

4
Q

categorical data

A

data that can take on only a specific set of values representing a set of possible categories

5
Q

binary data

A

special case of categorical data with just two categories of values (e.g., 0/1, true/false)

6
Q

ordinal data

A

categorical data that has an explicit ordering

7
Q

feature

A

often a column in a table, attribute/predictor of a row of data

8
Q

record

A

a row in a table

9
Q

data scientists use features to predict a target, while statisticians..

A

use predictor variables in a model to predict a response/dependent variable

10
Q

trimmed mean

A

average of all values after dropping a fixed number of extreme values from each end of the sorted data
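A minimal sketch of the calculation, assuming the trimming drops the p smallest and p largest values of the sorted data. The function name trimmed_mean and the sample data are made up for illustration.

```python
def trimmed_mean(values, p):
    """Mean of the values after dropping the p smallest and p largest."""
    if p < 0 or 2 * p >= len(values):
        raise ValueError("p must satisfy 0 <= 2 * p < len(values)")
    ordered = sorted(values)
    kept = ordered[p:len(ordered) - p]
    return sum(kept) / len(kept)


data = [2, 3, 3, 4, 5, 6, 100]  # made-up data with one extreme value

plain = sum(data) / len(data)    # pulled upward by the value 100
trimmed = trimmed_mean(data, 1)  # drops 2 and 100 before averaging

print("mean:", plain)
print("trimmed mean:", trimmed)
```

With p = 0 the function reduces to the ordinary mean, so the trimmed mean can be read as a dial between the mean and the median.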

11
Q

robust

A

not sensitive to extreme values

12
Q

x-bar

A

the mean of a sample drawn from a population

13
Q

reasons to use a weighted mean

A

1) some observations are intrinsically more variable than others, so highly variable observations may be given a lower weight (ex: a sensor that is less accurate)
2) the data collected doesn’t equally represent the different groups that we are interested in measuring (ex: give greater weight to underrepresented minorities)
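A small sketch of a weighted mean, computed as the sum of weight times value divided by the sum of the weights. The sensor readings and weights are hypothetical; the third sensor is assumed to be less accurate and gets half the weight.

```python
def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


readings = [10.0, 12.0, 20.0]  # hypothetical sensor readings
weights = [1.0, 1.0, 0.5]      # assumed: the third sensor is less accurate

plain = sum(readings) / len(readings)
weighted = weighted_mean(readings, weights)

print("mean:", plain)
print("weighted mean:", weighted)
```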

14
Q

why is the median more robust than mean as an estimate of central tendency

A

It isn’t influenced by outliers / extreme cases that could skew the results

15
Q

what is thought to be a compromise between mean and median

A

trimmed mean - robust to extreme values in data, but uses more data to calculate the estimate for central tendency than median

16
Q

variance

A

sum of squared deviations from the mean divided by n - 1 (aka: mean-squared-error)

17
Q

standard deviation

A

the square root of the variance (aka: L2-norm, Euclidean norm)
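As a sketch, the variance and standard deviation can be computed step by step from their definitions and checked against Python's statistics module, whose variance and stdev functions also divide by n - 1. The data values are made up.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample

n = len(data)
mean = sum(data) / n

# Squared deviation of each value from the mean.
squared_deviations = [(x - mean) ** 2 for x in data]

# Sample variance: sum of squared deviations divided by n - 1.
variance = sum(squared_deviations) / (n - 1)

# Standard deviation: square root of the variance.
std_dev = math.sqrt(variance)

print("variance:", variance)
print("standard deviation:", std_dev)
```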

18
Q

range

A

difference between largest and smallest value in a dataset

19
Q

interquartile range

A

the difference between the 75th percentile and the 25th percentile
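A short sketch comparing the range with the interquartile range on made-up data containing one extreme value. Percentile conventions differ between tools; this example assumes the "inclusive" method of statistics.quantiles (Python 3.8+), so other software may report slightly different quartiles.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 90]  # made-up data with one extreme value

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

iqr = q3 - q1                        # 75th percentile minus 25th percentile
value_range = max(data) - min(data)  # largest minus smallest value

print("IQR:", iqr)
print("range:", value_range)
```

The single extreme value inflates the range but leaves the IQR untouched, which is why the IQR counts as the more robust of the two.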

20
Q

mean absolute deviation

A

mean of the absolute values of the deviations from the mean
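A minimal sketch of the definition in code. The function name mean_absolute_deviation and the data are illustrative.

```python
def mean_absolute_deviation(values):
    """Average of |x_i - mean| over all values."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)


data = [1, 2, 3, 4, 10]  # made-up data; the mean is 4

mad_mean = mean_absolute_deviation(data)
print("mean absolute deviation:", mad_mean)
```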

21
Q

Why use n-1 instead of n when calculating variance?

A

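Dividing by n gives a biased estimate: on average it underestimates the true variance and standard deviation of the population. Dividing by n - 1 (the degrees of freedom) makes the variance estimate unbiased.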
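One way to see the bias is a small simulation: draw many small samples from a population whose variance is known and average the two estimates. This is only a sketch; the normal population, sample size, seed, and trial count are arbitrary choices for illustration.

```python
import random

random.seed(0)       # arbitrary seed so the run is repeatable
TRUE_VARIANCE = 4.0  # population: normal with standard deviation 2
N = 5                # size of each sample
TRIALS = 20_000


def variance(sample, ddof):
    """Sum of squared deviations divided by (n - ddof)."""
    mean = sum(sample) / len(sample)
    return sum((x - mean) ** 2 for x in sample) / (len(sample) - ddof)


biased, unbiased = [], []
for _ in range(TRIALS):
    sample = [random.gauss(0.0, 2.0) for _ in range(N)]
    biased.append(variance(sample, ddof=0))    # divides by n
    unbiased.append(variance(sample, ddof=1))  # divides by n - 1

avg_biased = sum(biased) / TRIALS
avg_unbiased = sum(unbiased) / TRIALS

print("true variance:", TRUE_VARIANCE)
print("average estimate dividing by n:", avg_biased)
print("average estimate dividing by n - 1:", avg_unbiased)
```

The n-divisor average should land near (n - 1)/n times the true variance, while the n - 1 version should land near the true value.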

22
Q

what measure of variability is most robust to extremes / outliers?

A

median absolute deviation from the median (MAD) = median(|x1 - m|, |x2 - m|, ..., |xN - m|), where m is the median
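A sketch of the median absolute deviation next to the standard deviation, on made-up data with one extreme value. The function name median_absolute_deviation is illustrative.

```python
import statistics


def median_absolute_deviation(values):
    """Median of |x_i - m|, where m is the median of the values."""
    m = statistics.median(values)
    return statistics.median(abs(x - m) for x in values)


data = [1, 2, 3, 4, 100]  # made-up data with one extreme value

mad = median_absolute_deviation(data)
std_dev = statistics.stdev(data)

print("median absolute deviation:", mad)
print("standard deviation:", std_dev)
```

The extreme value dominates the standard deviation but barely registers in the median absolute deviation.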

23
Q

Graph that shows the min/max, IQR, median

A

box plot, box and whisker plot

24
Q

tally of the count of data falling into intervals/bins

A

frequency table

25
Q

plot of the frequency table

A

histogram

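A rough sketch of building a frequency table with equal-width bins and printing it as a text histogram. The function name frequency_table, the choice of three bins, and the data are assumptions for illustration; plotting libraries choose bins in their own ways.

```python
def frequency_table(values, num_bins):
    """Count values in num_bins equal-width bins spanning min to max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("values must not all be identical")
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # The maximum value would land in bin num_bins; clamp it into the last bin.
        index = min(int((v - lo) / width), num_bins - 1)
        counts[index] += 1
    return [(lo + i * width, lo + (i + 1) * width, counts[i])
            for i in range(num_bins)]


data = [1, 2, 2, 3, 5, 6, 7, 9, 10, 10]  # made-up data

table = frequency_table(data, 3)
counts = [count for _, _, count in table]

# Text histogram: one row per bin, one '#' per record in the bin.
for low, high, count in table:
    print(f"{low:5.1f} to {high:5.1f} | {'#' * count}")
```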
26
Q

smoothed version of a histogram

A

density plot

27
Q

When the categories can be associated with a numeric value, this gives an average value based on a category’s probability of occurrence.

A

expected value

28
Q

How is the expected value calculated? (2 steps)

A

A marketer for a new cloud technology, for example, offers two levels of service, one priced at $300/month and another at $50/month. The marketer offers free webinars to generate leads, and the firm figures that 5% of the attendees will sign up for the $300 service, 15% for the $50 service, and 80% will not sign up for anything. This data can be summed up, for financial purposes, in a single “expected value,” which is a form of weighted mean in which the weights are probabilities.

The expected value is calculated as follows:
1. Multiply each outcome by its probability of occurring.
2. Sum these values.

In the cloud service example, the expected value of a webinar attendee is thus $22.50 per month:

EV = (0.05)(300) + (0.15)(50) + (0.80)(0) = 22.5

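The two steps can be sketched in a few lines using the cloud service numbers from the card. The variable names are illustrative.

```python
import math

# (monthly revenue in dollars, probability) for each webinar attendee outcome
outcomes = [
    (300, 0.05),  # signs up for the $300 service
    (50, 0.15),   # signs up for the $50 service
    (0, 0.80),    # does not sign up
]

# The probabilities should cover every outcome, so they must sum to 1.
assert math.isclose(sum(p for _, p in outcomes), 1.0)

# Step 1: multiply each outcome by its probability.
products = [value * p for value, p in outcomes]

# Step 2: sum these values.
expected_value = sum(products)

print("expected value per attendee:", expected_value)
```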
29
Q

Exploratory data analysis often begins with what 3 things?

A

1. univariate analysis
2. examining correlation among predictors (features)
3. examining correlation between features and the target

30
Q

A metric that measures the extent to which numeric variables are associated with one another (ranges from -1 to +1).

A

correlation coefficient (r)

31
Q

A table where the variables are shown on both rows and columns, and the cell values are the correlations between the variables.

A

correlation matrix

32
Q

A plot in which the x-axis is the value of one variable, and the y-axis the value of another.

A

scatterplot

33
Q

How do we compute the correlation coefficient?

A

Multiply the deviations from the mean for variable 1 by those for variable 2, sum the products, and divide by (n - 1) times the product of the standard deviations:

r = (sum of (xi - xbar)(yi - ybar)) / ((n - 1) * SDx * SDy)

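A minimal sketch of that formula in code, using sample standard deviations (dividing by n - 1). The function name pearson_r and the data are illustrative; the function raises ZeroDivisionError if either variable is constant.

```python
import math


def pearson_r(xs, ys):
    """Correlation coefficient from deviations and standard deviations."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with n >= 2")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations (divide by n - 1).
    sd_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    # Sum of the products of paired deviations.
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / ((n - 1) * sd_x * sd_y)


x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]    # rises exactly in step with x
y_down = [10, 8, 6, 4, 2]  # falls exactly in step with x

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)

print("r for rising data:", r_up)
print("r for falling data:", r_down)
```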
34
Q

What is an example where the correlation coefficient isn't a useful metric?

A

when the relationship is not linear

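A small illustration with made-up data: y is completely determined by x (y = x squared), yet the correlation coefficient comes out as zero because the relationship is not linear. The function name pearson_r is illustrative.

```python
import math


def pearson_r(xs, ys):
    """Correlation coefficient using sample standard deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sd_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / ((n - 1) * sd_x * sd_y)


x = [-2, -1, 0, 1, 2]
y = [value ** 2 for value in x]  # perfect, but nonlinear, dependence on x

r = pearson_r(x, y)
print("r for y = x^2:", r)
```

The positive and negative halves of the curve cancel out in the sum of paired deviations, so a scatterplot reveals the pattern that the coefficient misses.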
35
Q

A tally of counts between two or more categorical variables.

A

Contingency tables

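A sketch of tallying a contingency table for two categorical variables with collections.Counter. The records (a loan grade and a payment status) are hypothetical and exist only to show the counting.

```python
from collections import Counter

# Hypothetical records: (grade, status) pairs.
records = [
    ("A", "paid"), ("A", "paid"), ("A", "late"),
    ("B", "paid"), ("B", "late"), ("B", "late"),
    ("C", "paid"),
]

# Count how often each (grade, status) combination occurs.
counts = Counter(records)

grades = sorted({grade for grade, _ in records})
statuses = sorted({status for _, status in records})

# Rows are grades, columns are statuses; missing combinations count as 0.
table = {g: {s: counts[(g, s)] for s in statuses} for g in grades}

print("grade", *statuses)
for g in grades:
    print(g, *(table[g][s] for s in statuses))
```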
36
Q

A plot of two numeric variables with the records binned into hexagons.

A

Hexagonal binning

37
Q

A plot showing the density of two numeric variables like a topographical map.

A

Contour plots

38
Q

Similar to a boxplot but showing the density estimate.

A

Violin plots

39
Q

Scatterplots are good for smaller amounts of data. What are good alternatives for large amounts of data?

A

hexagonal binning, contour plots, heat maps