L1 Flashcards
(29 cards)
What’s a population?
The entire group of individuals or items, with common characteristics, that we want to study.
What’s a sample?
A subset of the population that we actually observe and use to draw conclusions about the whole population.
What’s a parameter?
An unknown population characteristic we want to estimate.
What is random sampling?
Every group of subjects of a given size from the population has the same chance of being selected (no one is excluded from the chance of selection).
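As a minimal sketch of simple random sampling, assuming a hypothetical population of 100 subject IDs (the IDs, the sample size of 10, and the seed are illustrative and not from the cards), Python's standard `random` module can draw a sample in which every group of 10 subjects is equally likely:

```python
import random

# Hypothetical population: subject IDs 1..100 (an assumption for illustration).
population = list(range(1, 101))

# A seeded generator makes the draw reproducible.
rng = random.Random(42)

# random.sample draws without replacement, so no subject appears twice
# and every subset of size 10 has the same chance of being chosen.
sample = rng.sample(population, k=10)

print(sample)
```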
What is non-probability sampling?
The probability of each subject being selected is unknown, so the sample may reflect selection bias.
Example: observational studies.
What is the hierarchy of study designs?
- RCT
- Quasi experimental
- Observational (cohort/case-control)
- Cross sectional
What are the sources of variation in data?
- Natural variation - Differences between people (experimental units) in the ‘true’ values of the variable of interest
- Measurement variation (error) - Variation due to measuring equipment or measuring technique
What are the sources of bias in data?
Bias is the difference between the true value and the average value of the actual data.
Example: Under-reporting of calories eaten in food diary
What is a variable?
A characteristic of interest in a study that can take different values for different subjects or objects (e.g., days).
What is a categorical variable?
A variable with a categorical scale cannot be “measured”; instead, we observe which category each subject belongs to. The recorded value is then a particular category (label).
Examples: race, sex, physical activity.
Nominal - cannot be ordered
Ordinal - can be ordered
What is a numerical variable?
Variables with numerical scale are measured as numerical values. They are measurements or counts taken on each subject or object.
Discrete variable - takes only fixed, separate values with no fractions in between them (e.g., number of siblings)
Continuous variable - can take any value, including fractions, within a range of numbers (e.g., grams of alcohol consumed)
What is the mode?
Mode is the most common value in a data set. It is the value that occurs the most frequently.
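A short sketch of finding the mode, using a made-up data set (the numbers are an assumption for illustration). It counts occurrences by hand and also shows the standard library's `statistics.multimode`, which returns every value tied for most frequent:

```python
from collections import Counter
import statistics

# Illustrative data set (not from the cards).
data = [2, 3, 3, 5, 7, 3, 5]

# Count how often each value occurs, then take the most frequent one.
counts = Counter(data)
mode_value, mode_count = counts.most_common(1)[0]

# When several values are tied for most frequent, multimode lists them all.
tied_modes = statistics.multimode([1, 1, 2, 2, 3])

print(mode_value, mode_count, tied_modes)
```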
What is mean?
Mean is the sum of all the values divided by the number of values in the list (it always lies between the smallest and largest values). The sample mean is denoted x̄ ("x bar"). The mean is affected by extreme observations.
Population Mean: μ = (1/N) Σ xi
Sample Mean: x̄ = (1/n) Σ xi
Weighted mean example: X̅ = .5(10) + .3(20) + .2(15) = 5 + 6 + 3 = 14
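The formulas above can be sketched in a few lines of Python. The values and weights reuse the weighted mean example from this card; the extra observation of 1000 is an assumption added only to show how an extreme value pulls the mean:

```python
import statistics

values = [10, 20, 15]
weights = [0.5, 0.3, 0.2]  # weights from the card's example; they sum to 1

# Ordinary (unweighted) mean: sum of values divided by number of values.
sample_mean = sum(values) / len(values)

# Weighted mean: each value counts in proportion to its weight.
weighted_mean = sum(w * x for w, x in zip(weights, values)) / sum(weights)

# One extreme observation (hypothetical) drags the mean far from the other values.
mean_with_outlier = statistics.mean(values + [1000])

print(sample_mean, round(weighted_mean, 6), mean_with_outlier)
```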
What is median?
Median is the middle number when the values are put in order (the point that splits the ordered data into two equal halves). The median is a better summary than the mean when there are extreme observations.
How it is computed:
* Arrange the data from smallest to largest
* Find the middle value if there is an odd number of data points
* Find the mean of the two middle values if there is an even number of data points
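The steps above can be sketched as a small function. The data sets are made up for illustration, and the function assumes a non-empty list:

```python
import statistics

def median(values):
    """Middle value of the sorted data (mean of the two middle values if n is even)."""
    ordered = sorted(values)          # step 1: arrange from smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd count: single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: mean of the two middle values

odd_data = [7, 1, 5]
even_data = [7, 1, 5, 3]
skewed_data = [1, 3, 5, 7, 1000]      # one extreme observation

print(median(odd_data), median(even_data))
# The median barely notices the extreme value, while the mean is pulled toward it.
print(median(skewed_data), statistics.mean(skewed_data))
```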
What is the range?
The difference between the largest and the smallest value in the data (largest minus smallest).
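As a one-line illustration with made-up numbers, the range is simply the maximum minus the minimum:

```python
# Illustrative data (an assumption, not from the cards).
data = [4, 9, 1, 7]

# Range = largest value minus smallest value.
data_range = max(data) - min(data)

print(data_range)
```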
What is the variance?
Variance measures the spread of data around the mean.
Variance measures the spread in squared units.
Population Variance: σ² = (1/N) Σ (xi - μ)²
Sample Variance: s² = (1/(n-1)) Σ (xi - x̄)²
Population Variance:
First, find the mean (average) of the entire population.
Subtract the mean from each value to get the deviation of each value from the mean.
Square each of these deviations to remove any negative signs and to give more weight to larger deviations.
Sum all the squared deviations.
Divide this sum by the total number of values in the population. This gives you the average squared deviation from the mean, which is the population variance.
Sample Variance:
First, find the mean (average) of the sample.
Subtract the mean from each value in the sample to get the deviation of each value from the mean.
Square each of these deviations.
Sum all the squared deviations.
Divide this sum by one less than the number of values in the sample. This correction (dividing by n-1 instead of n) is used to account for the fact that we are working with a sample rather than the entire population. This gives you the average squared deviation from the mean, which is the sample variance.
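The two procedures above differ only in the final divisor. A minimal sketch, using a made-up data set and treating the same numbers first as a whole population and then as a sample:

```python
import statistics

# Illustrative data (an assumption, not from the cards).
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)

# Steps 1-4: mean, deviations, squared deviations, and their sum.
mean = sum(data) / n
squared_deviations = [(x - mean) ** 2 for x in data]
sum_of_squares = sum(squared_deviations)

# Step 5: divide by N for a population, by n - 1 for a sample.
population_variance = sum_of_squares / n
sample_variance = sum_of_squares / (n - 1)

print(population_variance, sample_variance)
# The standard library offers the same two calculations.
print(statistics.pvariance(data), statistics.variance(data))
```

The sample variance comes out slightly larger than the population variance because of the smaller divisor.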
What is the standard deviation?
SD measures the spread of data around the mean.
SD is expressed in the same units as the data (not squared units).
Population Standard Deviation: σ = sqrt((1/N) Σ (xi - μ)²)
First, subtract the population mean (μ) from each value (xi) to find the deviation of each value from the mean.
Square each of these deviations.
Sum all the squared deviations.
Divide this sum by the total number of values in the population (N).
Take the square root of the result to get the population standard deviation.
Sample Standard Deviation: s = sqrt((1/(n-1)) Σ (xi - x̄)²)
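Both formulas are the square root of the corresponding variance. A short sketch with made-up data, checked against the standard library's `statistics.pstdev` and `statistics.stdev`:

```python
import math
import statistics

# Illustrative data (an assumption, not from the cards).
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
mean = sum(data) / n
sum_of_squares = sum((x - mean) ** 2 for x in data)

# Population SD divides by N; sample SD divides by n - 1.
population_sd = math.sqrt(sum_of_squares / n)
sample_sd = math.sqrt(sum_of_squares / (n - 1))

print(population_sd, round(sample_sd, 3))
print(statistics.pstdev(data), round(statistics.stdev(data), 3))
```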
What are the properties of the SD?
- Large spread of data = large s (or σ)
- Small spread of data = small s (or σ)
- s (or σ) = 0 … no spread (All the data are the same)
- s (or σ) is never negative (always positive or zero)
- Units for s are the same as units for the data
- Measures how far the data tend to vary from the mean
- Provides, for a typical value, a likely “give or take” from the mean
What happens if you add a number (c) to each data point (a linear operation)?
New mean = old mean + c
New median = old median + c
New SD = old SD
What happens if you multiply each data point by c?
New mean = old mean x c
New median = old median x c
New SD = old SD x |c| (the SD is never negative, so a negative c scales it by its absolute value)
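The shift rule from the previous card and the scaling rule from this one can be checked numerically. This is a sketch with made-up data; the constants `c_add` and `c_mul` are hypothetical, and a negative multiplier is chosen on purpose to show that the SD scales by the absolute value of c:

```python
import statistics

def summary(values):
    """Return (mean, median, population SD) of the values."""
    return (statistics.mean(values),
            statistics.median(values),
            statistics.pstdev(values))

# Illustrative data and constants (assumptions, not from the cards).
data = [2, 4, 4, 4, 5, 5, 7, 9]
c_add = 10
c_mul = -3

original = summary(data)
shifted = summary([x + c_add for x in data])   # add c to every data point
scaled = summary([x * c_mul for x in data])    # multiply every data point by c

print("original:", original)
print("shifted: ", shifted)   # mean and median move by c_add, SD unchanged
print("scaled:  ", scaled)    # mean and median times c_mul, SD times abs(c_mul)
```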
What are the properties of a histogram?
Gives us a very good idea about the shape of the distribution
* Presents the measure of interest along the X-axis and the relative frequency (or frequency) on the Y-axis
* Relative frequency should be used when two groups of subjects are being compared
* The area covered within a histogram represents 100% of the data
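A rough sketch of the numbers behind a histogram, without any plotting library. The ages and the bin width of 10 are assumptions for illustration. Frequencies are counts per bin, and relative frequencies are those counts divided by the total, so they sum to 1 (100% of the data):

```python
from collections import Counter

# Hypothetical ages and bin width (not from the cards).
data = [21, 22, 25, 29, 31, 34, 35, 38, 41, 47]
bin_width = 10

# Left edge of the bin each value falls into: 20, 30, 40, ...
bin_starts = [(x // bin_width) * bin_width for x in data]
frequency = Counter(bin_starts)

# Relative frequency = count in the bin / total number of observations.
n = len(data)
relative_frequency = {start: count / n for start, count in sorted(frequency.items())}

# Text version of the histogram: one row per bin, one '#' per observation.
for start, rel in relative_frequency.items():
    print(f"{start}-{start + bin_width - 1}: {'#' * frequency[start]} ({rel:.0%})")
```

Relative frequencies make two groups of different sizes comparable, because each histogram then covers the same total of 100%.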
What are the properties of a box plot?
Here are the key components and what you can see on a box plot:
- Median (Q2):
  - The line inside the box represents the median, which is the middle value of the dataset when the values are arranged in ascending order. It divides the dataset into two equal halves.
- Interquartile Range (IQR):
  - The box itself represents the interquartile range, which contains the middle 50% of the data. It extends from the first quartile (Q1) to the third quartile (Q3).
  - The first quartile (Q1) is the 25th percentile, meaning 25% of the data points are below this value.
  - The third quartile (Q3) is the 75th percentile, meaning 75% of the data points are below this value.
- Whiskers:
  - The “whiskers” extend from the edges of the box to the smallest and largest values within 1.5 times the IQR from Q1 and Q3, respectively. These whiskers represent the range of the bulk of the data.
  - The end of the lower whisker represents the smallest value within 1.5 times the IQR below Q1.
  - The end of the upper whisker represents the largest value within 1.5 times the IQR above Q3.
- Outliers:
  - Data points that fall outside the range of the whiskers are considered outliers and are typically represented as individual dots or small circles. These are values significantly higher or lower than the rest of the data.
- Minimum and Maximum:
  - The minimum value (excluding outliers) is at the end of the lower whisker.
  - The maximum value (excluding outliers) is at the end of the upper whisker.
What You Can See on a Box Plot:
- Central Tendency:
  - The median line inside the box shows where the center of the data lies.
- Spread and Variability:
  - The length of the box (IQR) indicates the spread of the middle 50% of the data.
  - The whiskers show the spread of the bulk of the data (excluding outliers).
- Skewness:
  - If the median is closer to the bottom or top of the box, it indicates skewness in the data.
  - If the whiskers are uneven in length, it suggests that the data is skewed to the left or right.
- Outliers:
  - Outliers are easily identifiable as individual points beyond the whiskers.
- Comparison Between Groups:
  - When comparing multiple box plots, you can quickly compare the central tendencies, variability, and spread of different datasets.
In summary, a box plot provides a visual summary of the distribution of a dataset, highlighting its central value, spread, skewness, and the presence of outliers.
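The components described on this card can be computed directly. The sketch below uses a made-up data set with one deliberately large value. Quartile conventions differ between textbooks and software; this example assumes the "inclusive" method of Python's `statistics.quantiles`, so other tools may give slightly different Q1 and Q3:

```python
import statistics

# Illustrative data with one extreme value (an assumption, not from the cards).
data = sorted([2, 4, 4, 5, 6, 7, 8, 9, 10, 30])

# Quartiles: the box runs from Q1 to Q3 with a line at the median (Q2).
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Fences: the furthest a whisker may reach, 1.5 x IQR beyond the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points still inside the fences.
inside = [x for x in data if lower_fence <= x <= upper_fence]
lower_whisker, upper_whisker = min(inside), max(inside)

# Anything beyond the fences is drawn as an individual outlier point.
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("Q1, median, Q3:", q1, q2, q3)
print("IQR:", iqr)
print("whiskers:", lower_whisker, upper_whisker)
print("outliers:", outliers)
```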
What are the properties of bar charts?
1) Used for summarizing qualitative data
2) Bars usually do not touch each other (unlike the bars of a histogram)
3) For each category we draw one bar
4) The height of each bar indicates the percentage (or the number) of observations within the given category
What is the definition of probability?
Probability expresses the chance of a certain event occurring under given conditions.
Assume that an experiment (a toss of a coin) can be repeated many times. The probability of a certain outcome (a tail on the coin) is the proportion of trials in which that outcome occurs in the long run: the number of times the outcome occurs divided by the total number of trials.
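A small simulation can illustrate this relative-frequency idea. It assumes a fair coin, 10,000 simulated tosses, and a fixed seed, all of which are choices made for the illustration rather than anything stated on the card:

```python
import random

# Seeded generator so the simulation is reproducible.
rng = random.Random(0)
n_trials = 10_000

# Each toss is a tail with probability 0.5 (fair coin, an assumption).
tails = sum(rng.random() < 0.5 for _ in range(n_trials))

# Estimated probability = number of tails / total number of tosses.
relative_frequency = tails / n_trials

print(tails, relative_frequency)
```

With more tosses, the relative frequency tends to settle closer to 0.5.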