Week 2 Flashcards
(33 cards)
Define Descriptive / Summary Statistics (5)
- A quantitative description of the main features of the data
- A useful summary
- Before actual analysis
- What are the players of the game?
- What are the types of variables? (discrete, continuous, categorical, dummy)
- Get a feel for your data (are there any problems?)
What to pay attention to in summary statistics? (4)
- Min value
- Max value
- Negative values
- Range
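The four checks above can be sketched in a few lines of Python. This is only an illustration: the function name `quick_summary` and the `ages` data are made up for the example, and the values -1 and 350 are planted to show the kind of problem these checks catch.

```python
def quick_summary(values):
    """Return min, max, number of negative values, and range."""
    lo, hi = min(values), max(values)
    return {
        "min": lo,
        "max": hi,
        "negatives": sum(1 for v in values if v < 0),
        "range": hi - lo,
    }

# Hypothetical ages: -1 and 350 are impossible values worth investigating.
ages = [23, 35, -1, 41, 29, 350]
summary = quick_summary(ages)
print(summary)
```

A negative age or a maximum of 350 would not show up in the mean alone, which is why these values are checked before any actual analysis.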
What are the measures of central tendency? (3)
Mean, median, mode
How do extreme values affect the mean, median, and mode?
- Mean: Influenced by extreme values
- Median: Relatively unaffected by extreme values
- Mode: Not often affected by extreme values (unless there are identical outliers)
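A small sketch of this effect, using Python's built-in `statistics` module. The salary figures are invented for the example; the single value 400 plays the role of the extreme value.

```python
import statistics

# Hypothetical salaries (in thousands), then the same data plus one extreme value.
salaries = [30, 32, 32, 35, 38]
with_outlier = salaries + [400]

before = (
    statistics.mean(salaries),
    statistics.median(salaries),
    statistics.mode(salaries),
)
after = (
    statistics.mean(with_outlier),
    statistics.median(with_outlier),
    statistics.mode(with_outlier),
)

print("mean, median, mode before:", before)
print("mean, median, mode after: ", after)
```

In this example the mean jumps from 33.4 to 94.5, the median moves only from 32 to 33.5, and the mode stays at 32.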
Which measure of central tendency do you use for the Nominal Variable?
Mode:
- The numbers in nominal variables only refer to the category
- Calculating the mean would be pointless
Which measure of central tendency do you use for the Ordinal Variable?
Median:
- A median split divides the cases into two groups (a dichotomy); further splits can create more categories
Dichotomy
A division of 2 things that are being represented as different or opposed.
Splitting an ordinal variable at the quartiles (Q1, the median, Q3) divides the data into 4 categories.
Example:
x < Q1 = Small, Q1 < x < Median (Q2) = Small-Medium, Median (Q2) < x < Q3 = Medium-Large, x > Q3 = Large
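The quartile split can be sketched as follows. Two assumptions are made here that the card does not state: a value that falls exactly on a cut point is placed in the upper category, and the quartiles come from `statistics.quantiles` with its default method. The function name `quartile_label` and the `scores` data are illustrative only.

```python
import statistics

def quartile_label(x, q1, q2, q3):
    """Assign one of four categories based on the quartile cut points.

    Assumption: a value equal to a cut point goes into the upper category.
    """
    if x < q1:
        return "Small"
    if x < q2:
        return "Small-Medium"
    if x < q3:
        return "Medium-Large"
    return "Large"

# Hypothetical scores from 1 to 11.
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
q1, q2, q3 = statistics.quantiles(scores, n=4)  # three cut points
labels = [quartile_label(s, q1, q2, q3) for s in scores]
print(q1, q2, q3)
print(labels)
```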
Which measure of central tendency do you use for the Interval (scale) or Ratio Variables?
Mean or Median:
The skewness of the distribution indicates which measure of central tendency to use.
Not skewed –> Mean
Skewed –> Median
Define Skewness
- Describes the shape of the distribution –> Symmetry
- Deviation from the normal bell-shape –> (a)symmetry of a distribution
- Skewness = 0 –> Symmetric; Skewness ≠ 0 –> Asymmetric (skewed)
What is the name for when the skewness values go outside the -1 to +1 range?
Substantially skewed
What kind of skew is a distribution with a longer right tail?
Positively skewed
What is a negatively skewed distribution?
A distribution which has a longer tail to the left
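A minimal sketch of how a skewness value can be computed, assuming the simple moment-based formula (third central moment divided by the second central moment to the power 1.5). Statistical packages often apply a small-sample correction, so their output may differ slightly. The three data sets are invented for the example.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (no small-sample correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]
right_tail = [1, 2, 2, 3, 20]      # longer right tail
left_tail = [-20, -3, -2, -2, -1]  # longer left tail

print(skewness(symmetric))   # zero: symmetric
print(skewness(right_tail))  # positive: positively skewed
print(skewness(left_tail))   # negative: negatively skewed
```

Both skewed examples fall outside the -1 to +1 range, so they would count as substantially skewed.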
Kurtosis
- Kurtosis describes the degree to which values are found at the tails of the distribution (compared to a normal distribution)
- Can also be seen as how pointy a distribution is (peakedness or flatness)
- It is important to mention whether a distribution has heavy (leptokurtic) or light (platykurtic) tails.
Leptokurtic (kurtosis)
- This is where more values are found in the tails than in a normal distribution (heavy-tailed), and the peak is typically pointier
- Kurtosis > 3
- Think of “lep”tokurtic as “leap” –> Tends to be more pointy (leaping upwards)
Platykurtic (kurtosis)
- This is where fewer values are found in the tails than in a normal distribution (light-tailed), so the distribution is typically more rounded and flatter.
- Kurtosis < 3
- Think of “plat”ykurtic as “platform” –> Tends to be more rounded and flatter
Mesokurtic (kurtosis)
This is a normal distribution.
–> Kurtosis = 3
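A minimal sketch of the kurtosis calculation, assuming the plain moment-based formula (fourth central moment divided by the squared second central moment), for which a normal distribution gives 3. Some software reports "excess kurtosis" instead, which subtracts 3. Both data sets are invented for the example.

```python
def kurtosis(values):
    """Moment-based kurtosis: m4 / m2**2 (a normal distribution gives 3)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m4 / m2 ** 2

# Most values at the centre plus two far-out values: heavy tails.
heavy_tails = [-10, 0, 0, 0, 0, 0, 0, 0, 0, 10]
# Values spread evenly with no far-out values: light tails.
light_tails = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

print(kurtosis(heavy_tails))  # above 3 -> leptokurtic
print(kurtosis(light_tails))  # below 3 -> platykurtic
```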
What is the interquartile range?
This is a measure of variability: the spread of the middle 50% of the data (IQR = Q3 − Q1).
Q1: Lower Quartile (P25, 25%)
Q2: Median (P50, 50%)
Q3: Upper Quartile (P75, 75%)
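A short sketch of the quartiles and the interquartile range (IQR = Q3 − Q1) using Python's `statistics.quantiles`. Quartile conventions differ between tools, so other software may return slightly different cut points for the same data; the data below is invented.

```python
import statistics

# Hypothetical, already sorted for readability.
data = [2, 4, 4, 5, 7, 8, 9, 10, 12, 13, 15]

# n=4 returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print("Q1:", q1, "Q2:", q2, "Q3:", q3)
print("IQR:", iqr)
```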
What are the measures of variability?
Variance: The average of the squared differences between each data point and the mean.
Standard Deviation: The square root of variance
High standard deviation–> High dispersion of data points
Low standard deviation –> Low dispersion of data points
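The two definitions above, worked through on a small invented data set. The calculation divides by the number of data points, matching "the average of the squared differences"; note that `statistics.variance` and many packages divide by n − 1 instead (the sample variance), so they give a slightly larger value.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(data) / len(data)
# Variance: average of the squared differences from the mean.
variance = sum((x - mean) ** 2 for x in data) / len(data)
# Standard deviation: square root of the variance.
std_dev = math.sqrt(variance)

print(mean, variance, std_dev)

# Cross-check against the standard library's population versions.
print(statistics.pvariance(data), statistics.pstdev(data))
```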
Correlation significance
Whether the correlation in the population is significantly different from zero.
Directionality problem
A and B can be correlated because A causes B or because B causes A; the correlation alone does not show which.
Third variable problem
A and B can be correlated not because A causes B or B causes A, but because some unmeasured third variable, C, causes both A and B.
What is the correlation coefficient?
This is a value between -1 and +1 that measures the association between two variables, but NOT causation.
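A minimal sketch of the calculation, assuming the Pearson correlation coefficient (the card does not name a specific coefficient). The function name `pearson_r` and the data are illustrative only.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: co-variation of x and y, scaled to lie in [-1, +1]."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]    # rises with x
y_down = [10, 8, 6, 4, 2]  # falls as x rises
y_mixed = [2, 1, 4, 3, 5]  # rises with x, but not perfectly

print(pearson_r(x, y_up))     # perfect positive association
print(pearson_r(x, y_down))   # perfect negative association
print(pearson_r(x, y_mixed))  # positive, between 0 and 1
```

A value of +1 or -1 here says only that the two lists move together perfectly; it says nothing about which variable, if either, causes the other.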
y_i = β_0 + β_1·x_i + ε_i
Define these values
y_i = Dependent variable
x_i = Independent variable
β_0 = Intercept / Constant
β_1 = Slope or regression coefficient for the variable ‘x’
ε_i = Error term –> Everything that the model does not take into account
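A minimal sketch of how the intercept and slope can be estimated from data, assuming ordinary least squares (the card does not say which estimation method is used). The function name `fit_line` and the data points are invented for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares estimates of the intercept and slope."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

x = [1, 2, 3, 4, 5]
y = [3.1, 4.9, 7.2, 8.8, 11.0]

intercept, slope = fit_line(x, y)
# Residuals: the part of each y that the fitted line does not account for.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

print("intercept:", intercept)
print("slope:", slope)
print("residuals:", residuals)
```

The residuals are the sample counterpart of the error term: what is left of each observation after the intercept and slope have been applied.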
What does it mean to measure the coefficient of a specific variable in a multiple regression equation? (3)
The coefficient of each independent variable:
- Indicates change in dependent variable
- When the given independent variable changes
- But keeping all other independent variables constant (important assumption)
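The "keeping all other independent variables constant" idea can be illustrated with a made-up two-variable model. The coefficients `b0`, `b1`, `b2` and the function `predict` are hypothetical values chosen for the example, not estimates from any real data.

```python
def predict(x1, x2, b0=10.0, b1=2.5, b2=-1.0):
    """Prediction from a hypothetical model: y = b0 + b1*x1 + b2*x2."""
    return b0 + b1 * x1 + b2 * x2

base = predict(x1=4, x2=3)
bumped = predict(x1=5, x2=3)  # x1 up by one unit, x2 held constant

change = bumped - base
print(base, bumped, change)   # the change equals the coefficient of x1
```

Raising x1 by one unit while x2 stays at 3 changes the prediction by exactly the coefficient of x1; if x2 were allowed to move at the same time, the change could no longer be attributed to x1 alone.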