Module 1 Flashcards

(55 cards)

1
Q

Biostatistics

A

A branch of statistics that applies statistical theories and methodologies to the collection, review, and analysis of data arising from biological, agricultural, medical, and public-health-related contexts

2
Q

Population

A

In statistics, a population refers to the entire group or collection of items, individuals, or elements that share a common characteristic and are of interest for the purpose of analysis or study. It encompasses every possible member of the group under consideration.

For example - public health researchers conducted a comprehensive survey to assess the vaccination rates against the flu virus among the population of elderly individuals living in nursing homes

3
Q

Sample

A

A sample is a smaller subset of the population selected to study or analyze the characteristics of the entire population. Samples are used to make inferences and draw conclusions about the larger population without having to examine every single individual within it. The goal is to ensure that the sample is representative of the population to obtain accurate and meaningful results.

Example: public health officials collected a random sample of 500 households in the urban area to study the prevalence of air pollution-related respiratory illnesses among residents.

4
Q

Variable

A

Any characteristic, number, or quantity that can be measured or counted. It varies or changes from one observation to another, hence the name. Variables can be classified into different types, such as categorical (nominal, ordinal) and numerical (discrete, continuous).

Example: in a public health study examining the relationship between dietary habits and the onset of diabetes, the amount of daily sugar intake could be considered a variable.

5
Q

What are the three main components of statistics as a discipline?

A

Design - how we collect data
Description - describing data in a sample
Inference - using data from a sample to make generalizations about a population

6
Q

When is it appropriate to use absolute frequency vs relative frequency

A

Absolute frequency - when you want to convey the actual number of cases or occurrences. Relative frequency - when you want to provide a sense of proportion or rate.

7
Q

What are the advantages and disadvantages of transforming a continuous variable into an ordinal variable?

A

Advantages: simplicity - making it easier for non-experts to understand
Applicability - interventions can be tailored to high, medium, and low risk for example

Disadvantages
Loss of information
Arbitrary boundaries
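
A minimal sketch of this trade-off, assuming hypothetical systolic blood pressure readings and illustrative cut points of 120 and 140 (neither comes from the cards):

```python
def risk_category(systolic_bp: float) -> str:
    """Map a continuous reading to an ordinal category (cut points are illustrative)."""
    if systolic_bp < 120:
        return "low"
    if systolic_bp < 140:
        return "medium"
    return "high"

# Hypothetical readings
readings = [118.0, 139.0, 141.0, 180.0]
categories = [risk_category(x) for x in readings]
print(categories)
```

Here 141 and 180 land in the same category (loss of information), while 139 and 141 are split even though they are nearly identical (arbitrary boundaries).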

8
Q

What are the types of data?

A

Nominal - qualitative and used to name or label variables without assigning any quantitative value or order. Categories are mutually exclusive and cannot be ranked or measured. Examples: Gender, nationality, types of transportation

Ordinal - Ordinal data is categorical data where variables have a natural, ordered sequence, but the intervals between categories are not necessarily equal or known. You can rank the values, but you cannot quantify the difference between them. Examples: Survey ratings (e.g., poor, fair, good, excellent), letter grades, socioeconomic status

Discrete - Discrete data is quantitative and consists of countable, indivisible values. Each data point is a distinct, separate value, often representing “the number of” something. Discrete data cannot take on every possible value within a range, only specific, separate values. Examples: Number of students in a class, number of cars in a parking lot.

Continuous - Continuous data is quantitative and can take on any value within a given range, including fractions and decimals. It represents measurements and can be infinitely subdivided. Continuous data is often obtained through precise measurement tools. Examples: Height, weight, temperature, time spent on a website

9
Q

What are the primary measures of location

A

Percentile - the p-th percentile is the value that p% of observations lie below

Median - the 50th percentile value (the middle value)

Mean - the average value
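
A short sketch of these three measures using Python's standard `statistics` module. The data values are hypothetical, and the "inclusive" quartile method is only one of several interpolation conventions:

```python
import statistics

# Hypothetical observations, e.g. days of hospital stay
data = [2, 4, 4, 5, 7, 9, 11]

mean = statistics.mean(data)      # sum of the values divided by their count
median = statistics.median(data)  # middle value of the sorted data
# Quartiles are the 25th, 50th and 75th percentiles; other tools may
# interpolate differently and give slightly different answers.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

print(mean, median, (q1, q2, q3))
```

The 50th percentile returned by `quantiles` matches the median, as the card states.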

10
Q

How do median and mean interact when data is unimodal

A

Symmetric: Median = Mean
Left Skewed: Mean < Median
Right Skewed: Median < Mean
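
To see why the mean gets pulled toward the tail, here is a small sketch with two hypothetical data sets, one with a long right tail and one with a long left tail:

```python
import statistics

right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]       # long tail of large values
left_skewed = [1, 17, 18, 18, 19, 19, 19, 20]  # long tail of small values

for name, values in [("right", right_skewed), ("left", left_skewed)]:
    print(name, statistics.mean(values), statistics.median(values))
```

In each case the single extreme value drags the mean toward it, while the median barely moves.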

11
Q

What is dispersion

A

How spread out from the center are typical observations
Range - distance between the smallest and largest
Interquartile range (IQR) - distance between the 25th and 75th percentiles
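
A minimal sketch of both measures on hypothetical data. The quartiles use the "inclusive" method of `statistics.quantiles`, which is an assumption; other conventions can give slightly different quartiles:

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 11]  # hypothetical observations

data_range = max(data) - min(data)  # largest minus smallest
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1                       # 75th percentile minus 25th percentile

print(data_range, iqr)
```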

12
Q

What is Variance

A

Average squared distance of the observations from the mean

13
Q

What is standard deviation

A

The square root of the variance. Taking the square root puts the measure back into the same units as the original data.

14
Q

What is anecdotal evidence

A

Refers to unusual observations that are easily recalled because of their striking characteristics. While it cannot be used as a basis for a conclusion, it can inspire the design of a more systematic study

15
Q

Simple Random Sample

A

Each member of a population has the same chance of being sampled. Each case is sampled independently of the other cases.
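
A minimal sketch of drawing a simple random sample, assuming a hypothetical population of 1,000 ID numbers. `random.sample` draws without replacement, so every ID has the same chance of selection and no ID appears twice:

```python
import random

population = list(range(1, 1001))      # hypothetical ID numbers 1..1000
rng = random.Random(42)                # fixed seed so the draw is repeatable
sample = rng.sample(population, k=50)  # 50 distinct IDs, each equally likely

print(sorted(sample)[:5])
```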

16
Q

What can happen with non-response

A

A non-response bias can skew the results and lead to incorrect conclusions about a population

17
Q

What is stratified sampling

A

The population is divided into strata before cases are selected within each stratum. The strata are chosen such that similar cases are grouped together.

This is especially useful when the cases within each stratum are similar with respect to the outcome of interest, but cases in different strata differ.
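
One way to sketch stratified sampling is to draw a simple random sample inside every stratum. The age-band strata, the ID ranges, and the helper name `stratified_sample` below are all hypothetical:

```python
import random

def stratified_sample(strata, per_stratum, seed=0):
    """Draw a simple random sample of `per_stratum` cases from every stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(members, per_stratum)
            for name, members in strata.items()}

# Hypothetical strata: patient IDs grouped by age band
strata = {
    "under_40": list(range(0, 300)),
    "40_to_64": list(range(300, 700)),
    "65_plus": list(range(700, 1000)),
}
sample = stratified_sample(strata, per_stratum=20)
print({name: len(ids) for name, ids in sample.items()})
```

Every stratum is guaranteed to be represented, which a simple random sample of the whole population does not ensure.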

18
Q

What is a cluster sample

A

The population is divided into clusters, then a fixed number of clusters is sampled and all observations from those clusters are included.

Useful when there is high case-to-case variability within a cluster, but the clusters themselves are similar to one another.

19
Q

What is multistage sampling

A

Similar to cluster sampling, but instead of keeping all observations in each cluster, a random sample is collected within each cluster

Useful when there is high case-to-case variability within a cluster, but the clusters themselves are similar to one another.
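
The difference between cluster sampling and multistage sampling can be sketched side by side. The village clusters, resident labels, and function names below are hypothetical illustrations:

```python
import random

def cluster_sample(clusters, n_clusters, rng):
    """Pick whole clusters at random and keep every observation in them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}

def multistage_sample(clusters, n_clusters, per_cluster, rng):
    """Pick clusters at random, then draw a random sample within each one."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: rng.sample(clusters[name], per_cluster) for name in chosen}

# Hypothetical clusters: 10 villages with 40 residents each
clusters = {f"village_{i}": [f"v{i}_r{j}" for j in range(40)] for i in range(10)}
rng = random.Random(1)
by_cluster = cluster_sample(clusters, n_clusters=3, rng=rng)
by_stage = multistage_sample(clusters, n_clusters=3, per_cluster=5, rng=rng)
print(len(by_cluster), len(by_stage))
```

Cluster sampling keeps all 40 residents of each chosen village; multistage sampling keeps only 5 randomly chosen residents per village.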

20
Q

What are the three principles on which experimental design is based?

A

Control - control for extraneous variables and choose a sample that is representative of the population of interest

Randomization - ensures balance and protects against bias. Allows differences in outcomes to be reasonably attributed to a treatment rather than inherent variability between patients

Replication - results from a large study are more likely to be reliable than those from small samples

21
Q

What is a confounding variable

A

A variable associated with both the explanatory and response variables

22
Q

What is the difference between a population and a sample

A

A population is the collection of individuals about whom you want to make inferences. A sample is a subset of the population on whom data is collected.

23
Q

What is Simpson’s Paradox

A

An extreme example of confounding where associations observed in several groups disappear or change direction when the groups are combined
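
A small numeric sketch can make the reversal concrete. The counts below are invented purely for illustration: treatment A does better than B within each severity group, yet B looks better once the groups are pooled, because A was mostly given to severe cases.

```python
# Hypothetical (recoveries, patients) counts, invented to show the reversal
counts = {
    "A": {"mild": (9, 10), "severe": (30, 90)},
    "B": {"mild": (80, 90), "severe": (3, 10)},
}

def rate(recovered, total):
    """Recovery rate as a proportion."""
    return recovered / total

def overall_rate(treatment):
    """Recovery rate after pooling the severity groups for one treatment."""
    recovered = sum(r for r, _ in counts[treatment].values())
    total = sum(n for _, n in counts[treatment].values())
    return recovered / total

for group in ("mild", "severe"):
    print(group, rate(*counts["A"][group]), rate(*counts["B"][group]))
print("overall", overall_rate("A"), overall_rate("B"))
```

Severity acts as the confounder here: it is associated with which treatment a patient received and with the chance of recovery.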

24
Q

What are the three principal points where data analysis could go wrong

A

Collection - when gathering data
Processing - when analyzing the data and its implications
Presentation - when sharing your findings with others

25
Points to a good graph
To have reliable results, any quantitative analysis should include a sample of at least 200 people.
Data should come from a neutral source.
Questions should not imply or provoke a specific response.
Be wary of misleading averages.
Tracking information for each year provides a truer picture of the general trends.
Avoid biased samples.
Look for truncation on an axis.
Percentages can hide small sample sizes.
Watch for comparisons that use different bases.
Data fishing (data dredging) - the analysis of large data sets with the goal of finding an association that is then presented as a correlation.
26
What are some important characteristics of good graphs?
Should convey the general patterns in a set of observations at a single glance. Should be simple, self-explanatory, clearly labeled
27
What are Descriptive Statistics
means of organizing and summarizing observations
28
What is Nominal Data
Values fall into unordered categories or classes (sex, blood type). Proportion measurements can be used.
29
What is ordinal data
When order becomes important, a natural order exists, but we still aren't concerned with the magnitude (pain scale)
30
What is ranked data
Data is arranged from highest to lowest and then assigned a rank (possible causes of death) - disregard magnitude and only care about relative position
31
What is Discrete Data
Order and magnitude are important - numbers represent actual quantities (unlike ranks) and can only take on specific values (e.g., the number of car accidents in MA in a month). These will be integers. Arithmetic can be applied
32
What is continuous data
Data that represents specific quantities but is not restricted to specific values such as integers (e.g., time). The only limiting factor with continuous data is the degree of accuracy with which it can be measured.
33
Frequency Distribution
Nominal and ordinal data: consists of a set of classes or categories and the numerical counts for each.
Discrete or continuous data: 1. Break the range of values down into a series of non-overlapping intervals. 2. Count the observations in each interval.
34
Relative frequency
The proportion of the total number of observations that appears in that interval. Useful for comparing sets of data that contain unequal numbers of observations.
35
Cumulative relative frequency
The cumulative relative frequency is calculated by summing the relative frequencies for the specified interval and all previous ones
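A brief sketch of relative and cumulative relative frequencies, assuming a hypothetical frequency distribution over four intervals:

```python
from itertools import accumulate

# Hypothetical frequency distribution: interval labels and absolute counts
intervals = ["0-9", "10-19", "20-29", "30-39"]
counts = [5, 15, 20, 10]

total = sum(counts)
relative = [c / total for c in counts]
# Running total of the counts, divided by the overall total
cumulative_relative = [c / total for c in accumulate(counts)]

for label, rel, cum in zip(intervals, relative, cumulative_relative):
    print(label, rel, cum)
```

The last cumulative relative frequency is always 1, since every observation falls in that interval or an earlier one.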
36
Stochastically ordered
Two random variables are stochastically ordered if one is more likely to take on larger values than the other, as formalized by the comparison of their probability distributions or cumulative distribution functions
37
Bar Charts
Pictorial representation of frequency distribution for nominal and ordinal data
38
Histogram
Frequency distribution for discrete or continuous data. The total area = 1 or 100%; frequency is associated with the area, not the height, of the bar. Relative frequency and absolute frequency histograms will have the same shape.
39
Frequency Polygon
Superior to histograms for comparing two sets of data. A cumulative frequency polygon
40
One Way Scatterplots
A one-way scatter plot uses a single horizontal axis to display the relative position of each data point in the group. No observations are lost, but they can be difficult to read.
41
Box Plots
Similar to one-way scatter plots in that they require a single axis; instead of plotting every observation, however, they display only a summary of the data. Adjacent values are the most extreme observations that are not more than 1.5 times the interquartile range beyond the quartiles.
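A sketch of how the adjacent values of a box plot might be found, assuming hypothetical data and the "inclusive" quartile method of `statistics.quantiles` (plotting tools may compute quartiles slightly differently):

```python
import statistics

data = sorted([2, 4, 4, 5, 7, 9, 11, 30])  # hypothetical observations

q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Adjacent values: the most extreme observations still inside the fences
inside = [x for x in data if lower_fence <= x <= upper_fence]
lower_adjacent, upper_adjacent = min(inside), max(inside)
# Anything outside the fences is usually drawn as a separate point
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(lower_adjacent, upper_adjacent, outliers)
```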
42
Two-way scatterplots
used to depict the relationship between two different continuous measurements.
43
Mean
The most frequently used measure of central tendency, also called the average. Not appropriate for nominal or ordinal data. The mean is extremely sensitive to unusual values.
44
Median
Can be used for ordinal, discrete, and continuous data. It is the 50th percentile of all measurements. With an odd number of observations, it is the [(n + 1)/2]-th value in the ordered data; with an even number, it is the average of the two middle values. It takes only ordering into account and is not as sensitive to unusual values (robust).
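The odd and even cases can be sketched by hand as follows (the standard library's `statistics.median` does the same job; this hypothetical helper only spells out the rule):

```python
def median(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # the [(n + 1)/2]-th value when counting from 1
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([3, 1, 2]))        # odd number of observations
print(median([4, 1, 3, 2]))     # even number of observations
print(median([1, 2, 3, 1000]))  # an extreme value does not move the median
```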
45
Mode
It can be used for all types of data. It is the set of values that occurs most frequently.
46
Measures of central tendency by shape of data
Unimodal, evenly distributed: mean, median, and mode are about the same.
Bimodal, symmetrically shaped: mean and median are about the same (but that value falls between the two peaks and would be extremely unlikely to occur as a measurement).
Asymmetric data: the median is often the best measure.
47
Range
Difference between the largest observation and the smallest. Highly sensitive to extreme values.
48
Interquartile range
Not as easily impacted by extremes. Calculated by subtracting the 25th percentile of the data from the 75th percentile; consequently, it encompasses the middle 50% of the observations.
49
Variance
quantifies the amount of variability, or spread, around the mean of the measurement
50
Variance Con't
variance is calculated by subtracting the mean of a set of values from each of the observations, squaring these deviations, adding them up, and dividing by 1 less than the number of observations in the data set.
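The steps in this card can be sketched directly, using hypothetical data and checking the result against `statistics.variance`, which also divides by n - 1:

```python
import math
import statistics

data = [2, 4, 4, 5, 7, 9, 11]  # hypothetical observations

n = len(data)
mean = sum(data) / n
squared_deviations = [(x - mean) ** 2 for x in data]  # subtract the mean, then square
variance = sum(squared_deviations) / (n - 1)          # divide by one less than n
std_dev = math.sqrt(variance)                         # back in the original units

# The standard library's sample variance agrees with the manual steps
assert math.isclose(variance, statistics.variance(data))
print(variance, std_dev)
```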
51
Standard Deviation
positive square root of the variance. In a comparison of two groups of data, the group with the smaller standard deviation has the more homogeneous observations; the group with the larger standard deviation exhibits a greater amount of variability.
52
Coefficient of Variation
Relates the standard deviation of a set of values to its mean. Most useful for comparing two or more data sets.
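A minimal sketch, assuming the common definition of the coefficient of variation as the standard deviation expressed as a percentage of the mean. The height and weight values are hypothetical:

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean (unit-free)."""
    return 100 * statistics.stdev(values) / statistics.mean(values)

# Hypothetical measurements in different units
heights_cm = [160, 170, 180]
weights_kg = [60, 70, 80]

print(coefficient_of_variation(heights_cm))
print(coefficient_of_variation(weights_kg))
```

Both sets have the same standard deviation (10), but relative to their means the weights vary more, which is what the coefficient of variation captures.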
53
What is the benefit of calculating a grouped mean?
This procedure can be applied to data that have been summarized in the form of a frequency distribution. Data organized in this way are often referred to as grouped data. The grouped mean can also be interpreted as a weighted average.
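A sketch of the grouped mean as a weighted average, assuming each interval is represented by its midpoint (a common convention) and using hypothetical intervals and frequencies:

```python
# Hypothetical grouped data: intervals 0-10, 10-20, 20-30
midpoints = [5, 15, 25]   # one representative value per interval
frequencies = [2, 5, 3]   # number of observations in each interval

# Weighted average: each midpoint is weighted by its frequency
grouped_mean = sum(m * f for m, f in zip(midpoints, frequencies)) / sum(frequencies)
print(grouped_mean)
```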
54
Chebychev's inequality
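Can be used to summarize the distribution of values when the empirical rule does not apply: for any number k > 1, at least 1 - (1/k)^2 of the observations lie within k standard deviations of the mean. Chebychev's inequality is less specific than the empirical rule, but it is true for any set of observations, no matter what its shape.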
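Chebychev's inequality says that, for any k > 1, at least 1 - (1/k)^2 of the observations lie within k standard deviations of the mean. A small check on a hypothetical, strongly skewed data set (using the population standard deviation, which is an assumption of this sketch):

```python
import statistics

def fraction_within(values, k):
    """Proportion of observations within k standard deviations of the mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population SD (divides by n)
    inside = [x for x in values if abs(x - mean) <= k * sd]
    return len(inside) / len(values)

data = [1, 2, 2, 3, 3, 3, 4, 20]  # hypothetical, right-skewed

for k in (1.5, 2, 3):
    bound = 1 - 1 / k**2
    print(k, fraction_within(data, k), ">=", round(bound, 3))
```

The observed fractions exceed the bound by a wide margin here, which reflects how conservative the inequality is.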
55