lecture 7 Flashcards
(35 cards)
Inference
Inference is the formal name given to learning from data using statistical tools.
The act of taking an estimate from the sample and turning it into an impression of the unknown value in the population of interest.
Parameter
The numerical measure of the quantity of interest in the population.
Parameters are generally unknown, but can be hypothetical.
The true, unknown value in the population is known as a parameter.
e.g. finding the particular value at which we can no longer improve for a physiological event
Risk as a parameter
There is some true value of the relative risk of this event, but we do not know it. So we carry out our study and estimate that value, and this estimate becomes our best guess at the unknown value of the parameter.
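A minimal sketch of estimating a relative risk from study data. The counts and the function name `relative_risk` are hypothetical, invented only to illustrate the idea; the true relative risk in the population stays unknown.

```python
def relative_risk(events_exposed, n_exposed, events_unexposed, n_unexposed):
    """Estimate the relative risk as the ratio of the two sample risks."""
    risk_exposed = events_exposed / n_exposed
    risk_unexposed = events_unexposed / n_unexposed
    return risk_exposed / risk_unexposed

# Hypothetical study counts (assumed for illustration, not from the lecture).
rr_hat = relative_risk(events_exposed=30, n_exposed=200,
                       events_unexposed=15, n_unexposed=200)
print(rr_hat)  # our best guess at the unknown parameter
```

A different sample would give a different value of `rr_hat`, while the parameter itself would not change.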
Random variable
An unknown quantity that varies in an unpredictable way.
Once a random variable is observed…
we refer to an observed or realised value.
Notation of random variables
Random variables are represented by upper case Roman letters.
Notation of observed or realised values
Lower case Roman letters represent the observed or realised value.
Random variables are described by
probability distributions
observed values of random samples are
data
Statistic
A statistic is a numerical summary of data.
Estimate
An estimate is a special kind of statistic used as an intelligent guess for a parameter.
Estimates are often denoted by adding a circumflex: μ̂ is an estimate of the parameter μ.
On this course we will generally use x̄ to denote the estimate of the parameter μ.
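A short sketch of a statistic used as an estimate: the sample mean of some observed values, used as a guess for the population mean μ. The data values are hypothetical and chosen only for illustration.

```python
from statistics import mean

# Hypothetical observed values x_1, ..., x_n (lower case: realised data).
x = [4.1, 5.3, 4.8, 5.0, 4.6]

# The sample mean is a statistic; used as a guess for mu it is an estimate.
x_bar = mean(x)
print(round(x_bar, 2))
```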
A statistical model
A statistical model is a mathematical description of the way the data are generated.
- Expressed in terms of parameters and random variables.
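One possible sketch of a statistical model, assuming the data are generated as independent Normal random variables with mean `mu` and standard deviation `sigma`. The Normal model, the parameter values and the function name `simulate_sample` are all assumptions made for illustration.

```python
import random

def simulate_sample(mu, sigma, n, seed=0):
    """Draw n realised values from the model X_i ~ Normal(mu, sigma^2)."""
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma) for _ in range(n)]

# Hypothetical parameter values, chosen only for illustration.
x = simulate_sample(mu=120.0, sigma=10.0, n=1000)
x_bar = sum(x) / len(x)  # estimate of mu computed from the simulated data
print(len(x), round(x_bar, 1))
```

In a simulation the parameters are known because we chose them; with real data only the observed values are available and the parameters have to be estimated.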
The main types of variables are?
continuous
discrete
categorical
Continuous variables define
Continuous - can be expressed on a continuous scale in which every value is possible.
e.g. rainfall
Discrete variables define
Discrete - can be put in one-to-one correspondence with the counting numbers.
e.g. number of people in this room with brown eyes. The values are whole numbers; decimals would not make sense.
Categorical variables define
Categorical - restricted to one of a set of categories. For example ‘Heads’ or ‘Tails’.
Binary categorical variables
Categorical variables give rise to categorical data. The simplest kind involves just two categories. For example, a person could be:
Māori/non-Māori, smoker/non-smoker, diabetic/non-diabetic.
Such data are also called binary data, dichotomous data, yes/no data and 0-1 data.
The term 0-1 data refers to the codes we use for the different outcomes. For example, 1 would typically represent participants with the outcome and 0 would represent participants without the outcome.
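A small sketch of 0-1 coding for binary data. The participant statuses are hypothetical; the point is that, once coded, the sum counts participants with the outcome and the mean gives their proportion.

```python
# Hypothetical smoking status for six participants.
status = ["smoker", "non-smoker", "non-smoker",
          "smoker", "non-smoker", "non-smoker"]

# Code 1 for participants with the outcome and 0 for those without.
coded = [1 if s == "smoker" else 0 for s in status]

# With 0-1 coding, the sum counts the 1s and the mean is their proportion.
n_smokers = sum(coded)
proportion = n_smokers / len(coded)
print(coded, n_smokers, proportion)
```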
Categorical variable - more than two
Data are nominal if there is no natural (or relevant) ordering:
Blood group: A/B/AB/O.
Prioritised ethnicity: Māori/Pacific/Asian/NZ European/Other.
Note: Ethnicity per se is not a categorical variable because people can identify with more than one ethnicity.
Data are ordinal if there is a natural ordering:
‘Degree of pain’: minimal/moderate/severe/unbearable.
‘Socio-economic deprivation’:
e.g. NZDep06, measured on a scale from 1 (least deprived) to 10 (most deprived).
However, in this case it can be misleading to code the categories as integer values (e.g. 0,1,2,3 for ‘Degree of pain’). Is ‘unbearable’ three times more severe than ‘moderate’?
Categorical variable - more than two - nominal
Data are nominal if there is no natural (or relevant) ordering:
Blood group: A/B/AB/O.
Prioritised ethnicity: Māori/Pacific/Asian/NZ European/Other.
Note: Ethnicity per se is not a categorical variable because people can identify with more than one ethnicity.
nominal - no natural order in magnitude
Categorical variable - more than two - ordinal
Data are ordinal if there is a natural ordering:
‘Degree of pain’: minimal/moderate/severe/unbearable.
‘Socio-economic deprivation’:
e.g. NZDep06, measured on a scale from 1 (least deprived) to 10 (most deprived).
However, in this case it can be misleading to code the categories as integer values (e.g. 0,1,2,3 for ‘Degree of pain’). Is ‘unbearable’ three times more severe than ‘moderate’? The arithmetic implied by the codes does not hold for the categories, so be aware of this distinction.
ordinal - scale based on the nature of the response
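A sketch of why integer codes for ordinal categories can mislead. Two order-preserving codings of ‘Degree of pain’ are compared; the second coding and the helper `same_order` are hypothetical, made up for illustration. Both keep the ordering, but the ratios between codes differ, so the ratios carry no real meaning.

```python
levels = ["minimal", "moderate", "severe", "unbearable"]

# Two order-preserving codings (the second is an arbitrary assumption).
coding_a = {level: i for i, level in enumerate(levels)}  # 0, 1, 2, 3
coding_b = {"minimal": 1, "moderate": 2, "severe": 5, "unbearable": 20}

def same_order(coding):
    """Check that the coding ranks the levels in their natural order."""
    return sorted(levels, key=coding.get) == levels

# The ratio 'unbearable' / 'moderate' depends entirely on the coding chosen.
ratio_a = coding_a["unbearable"] / coding_a["moderate"]
ratio_b = coding_b["unbearable"] / coding_b["moderate"]
print(same_order(coding_a), same_order(coding_b), ratio_a, ratio_b)
```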
Discrete numerical
Discrete variables give rise to discrete data
With discrete data, observations take only certain numerical values, typically integers or whole numbers. For example:
number of cases of cancer diagnosed during a day.
number of children in a family (0,1,2,3,4,…).
It is important to note that these are not like categorical data, because the numerical values keep their arithmetic meaning, e.g. 3 children is three times as many as 1.
This type of data can be treated as though it is categorical if we must, but this discards information about the magnitude of the relationships between the numbers.
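A brief sketch of what is lost when discrete data are treated as categorical. The family sizes are hypothetical. As numbers they support arithmetic such as a mean; as category labels only the frequency of each label remains.

```python
from collections import Counter

# Hypothetical number of children in eight families.
children = [0, 2, 1, 3, 2, 0, 1, 2]

# Treated as discrete numerical data: arithmetic is meaningful.
mean_children = sum(children) / len(children)

# Treated as categorical data: only the frequency of each label is used.
labels = [str(c) for c in children]
counts = Counter(labels)
print(mean_children, dict(counts))
```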
Continuous numerical
Continuous variables give rise to continuous data.
Continuous data arise from some form of measurement. For example:
height, age, blood pressure, serum cholesterol, …
In practice many continuous variables only take positive values. Often there is no further restriction on values other than that caused by the accuracy of the equipment for recording values. However, while the underlying variable may be truly continuous, the data may be coarsened (e.g. age in years rather than age in days/hours/minutes/seconds…).
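A small sketch of coarsening, assuming age is recorded as completed years. The exact ages are hypothetical. Two people with different underlying ages can end up with the same recorded value.

```python
import math

# Hypothetical exact ages in years (the underlying continuous variable).
exact_ages = [34.92, 35.08, 35.99, 61.5]

# Recording "age in years" keeps only the completed years.
recorded_ages = [math.floor(a) for a in exact_ages]
print(recorded_ages)  # [34, 35, 35, 61]
```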
When looking at plotted data for a continuous variable you should look at …
modality (unimodal, bimodal etc) and symmetry (determine whether skew present)
determining which side the skew is on
if asymmetric the skew is in the direction of the tail
i.e. a tail on the left is a left skew, which can also be called a negative skew
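A sketch of checking the direction of skew numerically, using a simple moment-based skewness coefficient (third central moment divided by the second central moment to the power 1.5). This is one common definition, swapped in here for illustration; the function name `sample_skewness` and both data sets are hypothetical. A negative value goes with a tail on the left, a positive value with a tail on the right.

```python
def sample_skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (negative means a left tail)."""
    n = len(values)
    centre = sum(values) / n
    m2 = sum((v - centre) ** 2 for v in values) / n
    m3 = sum((v - centre) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

left_tailed = [1, 7, 8, 8, 9, 9, 9, 10]    # long tail towards small values
right_tailed = [1, 1, 1, 2, 2, 3, 4, 10]   # long tail towards large values
print(sample_skewness(left_tailed) < 0, sample_skewness(right_tailed) > 0)
```

A plot is still worth looking at first, since a single coefficient says nothing about modality.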