Biostats Flashcards

Question

conditional probability

Answer 1

The probability that an event occurs given the outcome of some other event, usually written Pr(A | B). For example, the probability of a person being colour blind given that the person is male is about 0.1, and the corresponding probability given that the person is female is approximately 0.0001. It is not, of course, necessary that Pr(A | B) = Pr(B | A): the probability of having spots given that a patient has measles, for example, is very high, whereas the probability of measles given that a patient has spots is much lower. If Pr(A | B) = Pr(A), then the events A and B are said to be independent.
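
A minimal sketch of the asymmetry between the two conditional probabilities, using Pr(A | B) = Pr(A and B) / Pr(B). The function name and all numbers are hypothetical and chosen only for illustration; they are not estimates for measles.

```python
def conditional(p_a_and_b: float, p_b: float) -> float:
    """Return Pr(A | B) = Pr(A and B) / Pr(B)."""
    if p_b <= 0:
        raise ValueError("Pr(B) must be positive")
    return p_a_and_b / p_b


# Hypothetical probabilities, for illustration only.
p_measles = 0.01   # Pr(measles)
p_spots = 0.05     # Pr(spots)
p_both = 0.009     # Pr(measles and spots)

p_spots_given_measles = conditional(p_both, p_measles)  # high
p_measles_given_spots = conditional(p_both, p_spots)    # much lower
```

With these made-up inputs the first value is about 0.9 and the second about 0.18, so swapping the roles of the two events changes the answer considerably.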

Answer 2

A range of values, calculated from the sample observations, that is believed, with a particular probability, to contain the true value of a population parameter. A 95% confidence interval, for example, implies that were the estimation process repeated again and again, then 95% of the calculated intervals would be expected to contain the true parameter value. Note that the stated probability level refers to properties of the interval and not to the parameter itself which is not considered a random variable.
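
As a rough sketch, an interval for a population mean can be computed from a sample as below. This assumes a large-sample normal approximation (a t-based interval would be wider for small samples); the function name and the data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(sample, level=0.95):
    """Approximate confidence interval for a mean (normal approximation)."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 when level = 0.95
    half_width = z * stdev(sample) / sqrt(n)     # z times the standard error
    centre = mean(sample)
    return centre - half_width, centre + half_width


# Hypothetical sample, for illustration only.
low, high = mean_confidence_interval([1, 2, 3, 4, 5])
```

The interval is centred on the sample mean; it is the procedure, repeated over many samples, that carries the 95% coverage property.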

Answer 3

An extraneous variable in a statistical model that correlates (positively or negatively) with both the dependent variable and the independent variable. The methodologies of scientific studies therefore need to control for these factors to avoid what is known as a type 1 error: a 'false positive' conclusion that the dependent variable is in a causal relationship with the independent variable. Such a relation between two observed variables is termed a spurious relationship. Thus, confounding is a major threat to the validity of inferences made about cause and effect, i.e. internal validity, as the observed effects should be attributed to the confounder rather than the independent variable. By definition, a confounding variable is associated with both the probable cause and the outcome. The confounder is not allowed to lie in the causal pathway between the cause and the outcome: if A is thought to be the cause of disease C, the confounding variable B may not be solely caused by behaviour A, and behaviour B shall not always lead to behaviour C. An example: being female does not always lead to smoking tobacco, and smoking tobacco does not always lead to cancer. Therefore, any study that tries to elucidate the relation between being female and cancer should take smoking into account as a possible confounder. In addition, a confounder is always a risk factor that has a different prevalence in two risk groups (e.g. females/males). (Hennekens, Buring & Mayrent, 1987).

Answer 4

The table arising when observations on a number of categorical variables are cross-classified. Entries in each cell are the number of individuals with the corresponding combination of variable values. Most common are two-dimensional tables involving two categorical variables. The analysis of such two-dimensional tables generally involves testing for the independence of the two variables using the familiar chi-squared statistic. Three- and higher-dimensional tables are now routinely analyzed using log-linear models.

Answer 5

result from infinitely many possible values that correspond to some continuous scale that covers a range of values without gaps, interruptions or jumps, e.g. blood pressure.

Answer 6

A Phase III clinical trial in which an experimental treatment is compared with a control treatment, the latter being either the current standard treatment or a placebo.

Answer 7

An index that quantifies the linear relationship between a pair of variables. In a bivariate normal distribution, for example, the parameter ρ. An estimator of ρ obtained from n sample values of the two variables of interest, (x1, y1), (x2, y2), ..., (xn, yn), is Pearson's product moment correlation coefficient, r, given by r = Σ(xi - x̄)(yi - ȳ) / sqrt[Σ(xi - x̄)^2 Σ(yi - ȳ)^2], where x̄ and ȳ are the sample means. The coefficient takes values between -1 and 1, with the sign indicating the direction of the relationship and the numerical magnitude its strength. Values of -1 and 1 indicate that the sample values fall on a straight line. A value of zero indicates the lack of any linear relationship between the two variables.
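
A short sketch of Pearson's product moment correlation coefficient, computed as the sum of cross-products of deviations from the means divided by the square root of the product of the two sums of squared deviations. The function name and the data are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Sample correlation between paired observations xs and ys."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples of size >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)


# Points on an increasing straight line, so r should be 1.
r_line = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
# A symmetric V shape: a clear relationship, but not a linear one.
r_vee = pearson_r([1, 2, 3], [1, 0, 1])
```

The second example gives r = 0 even though y depends on x, which illustrates that the coefficient measures only linear association.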

Answer 8

Often used simply as an alternative name for explanatory variables, but perhaps more specifically to refer to variables that are not of primary interest in an investigation, but are measured because it is believed that they are likely to affect the response variable and consequently need to be included in analyses and model building.

Answer 9

A statistical model used in survival analysis developed by D.R. Cox in 1972 asserting that the effect of the study factors on the hazard rate in the study population is multiplicative and does not change over time.

Answer 10

The value with which a statistic calculated from sample data is compared in order to decide whether a null hypothesis should be rejected. The value is related to the particular significance level chosen.

Answer 11

The proportion of patients in a clinical trial transferring from the treatment decided by an initial random allocation to an alternative one.

Answer 12

(Syn: disease frequency survey, prevalence study) A study that examines the relationship between diseases (or other health-related characteristics) and other variables of interest as they exist in a defined population at one particular time.

Answer 13

The tabulation of a sample of observations in terms of numbers falling below particular values. The empirical equivalent of the cumulative probability distribution. An example of such a tabulation is shown below.

Answer 14

An elusive concept that occurs throughout statistics. Essentially the term means the number of independent units of information in a sample relevant to the estimation of a parameter or calculation of a statistic. For example, in a two-by-two contingency table with a given set of marginal totals, only one of the four cell frequencies is free and the table has therefore a single degree of freedom. In many cases the term corresponds to the number of parameters in a model. Also used to refer to a parameter of various families of distributions, for example, Student's t-distribution and the F-distribution.

Answer 15

The variable of primary importance in investigations since the major objective is usually to study the effects of treatment and/or other explanatory variables on this variable and to provide suitable models for the relationship between it and the explanatory variables.

Answer 16

A general term for methods of summarizing and tabulating data that make their main features more transparent. For example, calculating means and variances and plotting histograms.

Answer 17

A nominal measure with two outcomes (examples are gender male or female; survival yes or no); also called binary. See dichotomous data.

Answer 18

one that arranges items into either of two mutually exclusive categories, e.g. yes/no, alive/dead.

Answer 19

result when the number of possible values is either a finite number or a “countable” number.

Answer 20

a countable and finite variable, for example grade: 1, 2, 3, 4, ..., 12.

Answer 21

In statistics this term is used for any finite or infinite collection of ‘units', which are often people but may be, for example, institutions, events, etc.

Answer 22

A procedure used in clinical trials to avoid the possible bias that might be introduced if the patient and/or doctor knew which treatment the patient is receiving. If neither the patient nor the doctor is aware of which treatment has been given, the trial is termed double-blind.

Answer 23

Dummy coding provides one way of using categorical predictor variables in various kinds of estimation models, such as linear regression (see also effect coding). Dummy coding uses only ones and zeros to convey all of the necessary information on group membership. http://www.ats.ucla.edu/stat/mult_pkg/faq/general/dummy.htm
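
A minimal sketch of dummy coding, assuming the usual convention that one level is chosen as the reference category and each remaining level gets its own 0/1 indicator. The function name, the group labels and the choice of reference are hypothetical.

```python
def dummy_code(values, reference):
    """Return one dict of 0/1 indicators per observation.

    The reference level is coded as all zeros, so k levels need k - 1
    indicator variables.
    """
    levels = sorted(set(values) - {reference})
    return [{level: int(v == level) for level in levels} for v in values]


# Hypothetical treatment groups, with "placebo" as the reference level.
groups = ["placebo", "low", "high", "low"]
coded = dummy_code(groups, reference="placebo")
```

Here the three groups are represented by two indicators, "high" and "low"; a placebo observation has zeros for both.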

Answer 24

in statistics, a variable taking only one of two possible values, one (usually 1) indicating the presence of a condition, and the other (usually 0) indicating the absence of the condition, used mainly in regression analysis.

Answer 25

a measure of the strength of the relationship between two variables. In scientific experiments, it is often useful to know not only whether an experiment has a statistically significant effect, but also the size of any observed effects. In practical situations, effect sizes are helpful for making decisions. Effect size measures are the common currency of meta-analysis studies that summarize the findings from a specific area of research.

Answer 26

The sample size after dropouts, deaths and other specified exclusions from the original sample.

Answer 27

A term usually encountered in the analysis of contingency tables. Such frequencies are estimates of the values to be expected under the hypothesis of interest. In a two-dimensional table, for example, the values under independence are calculated from the product of the appropriate row and column totals divided by the total number of observations.
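
The calculation under independence can be sketched as follows: each expected value is the row total times the column total divided by the grand total, and the chi-squared statistic sums (observed - expected)^2 / expected over the cells. Function names and counts are hypothetical.

```python
def expected_frequencies(table):
    """Expected cell counts under independence of rows and columns."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_squared_statistic(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_frequencies(table)
    return sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )


# Hypothetical 2 x 2 table of counts.
observed = [[10, 20], [30, 40]]
expected = expected_frequencies(observed)   # [[12.0, 18.0], [28.0, 42.0]]
chi2 = chi_squared_statistic(observed)
```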

Answer 28

A probability experiment involves performing a number of trials to measure the chance of the occurrence of an event or outcome. http://www.uic.edu/classes/upp/upp503/sanders4-5.pdf

Answer 29

study in which the investigator intentionally alters one or more factors under controlled conditions in order to study the effects of doing so.

Answer 30

The variables appearing on the right-hand side of the equations defining, for example, multiple regression or logistic regression, and which seek to predict or 'explain' the response variable. Also commonly known as the independent variables, although this is not to be recommended since they are rarely independent of one another.

Answer 31

An event, characteristic, or other definable entity that brings about a change in a health condition or other defined outcome.

Answer 32

A set of statistical methods for analyzing the correlations among several variables in order to estimate the number of fundamental dimensions that underlie the observed data and to describe and measure those dimensions. Used frequently in the development of scoring systems for rating scales and questionnaires.

Answer 33

Designs which allow two or more questions to be addressed in an investigation. The simplest factorial design is one in which each of two treatments or interventions is either present or absent, so that subjects are divided into four groups: those receiving neither treatment, those having only the first treatment, those having only the second treatment and those receiving both treatments. Such designs enable possible interactions between factors to be investigated. A very important special case of a factorial design is that where each of k factors of interest has only two levels; these are usually known as 2^k factorial designs. A single replicate of a 2^k design is sometimes called an unreplicated factorial.

Answer 34

The proportion of cases in which a diagnostic test indicates disease is absent in patients who have the disease.

Answer 35

The proportion of cases in which a diagnostic test indicates disease is present in disease-free patients.

Answer 36

The distribution of the ratio of two independent quantities each of which is distributed like a variance in normally distributed samples. So named in honor of R.A. Fisher who first described the distribution.

Answer 37

An alternative procedure to use of the chi-squared statistic for assessing the independence of two variables forming a two-by-two contingency table particularly when the expected frequencies are small. The method consists of evaluating the sum of the probabilities associated with the observed table and all possible two-by-two tables that have the same row and column totals as the observed data but exhibit more extreme departure from independence. The probability of each table is calculated from the hypergeometric distribution.
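
A sketch of the calculation for a two-by-two table with cells a, b (first row) and c, d (second row). It assumes one common two-sided convention: summing the hypergeometric probabilities of every table with the same margins that is no more probable than the observed one. Other definitions of "more extreme" exist; the function name and counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided exact p-value for the 2 x 2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(k):
        # Hypergeometric probability that the top-left cell equals k,
        # with all row and column totals held fixed.
        return comb(row1, k) * comb(row2, col1 - k) / comb(n, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Small tolerance so that ties with the observed table are included.
    return sum(
        prob(k)
        for k in range(lowest, highest + 1)
        if prob(k) <= p_observed * (1 + 1e-9)
    )


# Hypothetical small table where expected frequencies are all 2.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
```

For this table the possible top-left values 0 to 4 have probabilities 1/70, 16/70, 36/70, 16/70 and 1/70, so the p-value is 34/70.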

Answer 38

A transformation of Pearson's product moment correlation coefficient, r, given by z = (1/2) ln[(1 + r)/(1 - r)]. The statistic z has approximately a normal distribution with mean (1/2) ln[(1 + ρ)/(1 - ρ)], where ρ is the population correlation value, and variance 1/(n - 3), where n is the sample size. The transformation may be used to test hypotheses and to construct confidence intervals for ρ.
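
One way the transformation might be used to build an interval for the population correlation: transform r, add and subtract a normal quantile times 1/sqrt(n - 3), then transform back with the inverse function (tanh). The function name and inputs are hypothetical.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist


def fisher_z_interval(r, n, level=0.95):
    """Approximate confidence interval for a correlation via Fisher's z."""
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    if n <= 3:
        raise ValueError("n must exceed 3")
    z = atanh(r)                      # equals 0.5 * ln((1 + r) / (1 - r))
    se = 1 / sqrt(n - 3)              # standard deviation of z
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    return tanh(z - crit * se), tanh(z + crit * se)


# Hypothetical sample correlation of 0.5 from 28 pairs.
low, high = fisher_z_interval(0.5, 28)
```

The resulting interval is not symmetric about r, because the back-transformation compresses values near 1 and -1.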

Answer 39

a general term describing the frequency or occurrence of a disease or other attribute or event in a population without distinguishing between incidence and prevalence.

Answer 40

lists data values (either individually or by groups of intervals), along with their corresponding frequencies (or counts).

Answer 41

a way of summarizing data; used as a record of how often each value (or set of values) of a variable occurs. A frequency table is used to summarize categorical, nominal, and ordinal data. It may also be used to summarize continuous data once the data is divided into categories.

Answer 42

A test for the equality of the variances of two populations having normal distributions, based on the ratio of the variances of a sample of observations taken from each. Most often encountered in the analysis of variance, where testing whether particular variances are the same also tests for the equality of a set of means.

Answer 43

A term usually retained for those clinical trials in which there is random allocation to treatments, a control group and double-blinding.

Answer 44

Degree of agreement between an empirically observed distribution and a mathematical or theoretical distribution.

Answer 45

A statistical test of the hypothesis that data have been randomly sampled or generated from a population that follows a particular theoretical distribution or model. The most common such tests are chi-square tests.

Answer 46

Inherent capability of an agent or situation to have an adverse effect. A factor or exposure that may adversely affect health.

Answer 47

A theoretical measure of the risk of an occurrence of an event, e.g. death or new disease, at a point in time, t, defined mathematically as the limit, as Δt approaches zero, of the probability that an individual well at time t will experience the event by t + Δt, divided by Δt.

Answer 48

A graphical representation of a set of observations in which class frequencies are represented by the areas of rectangles centred on the class interval. If the latter are all equal, the heights of the rectangles are also proportional to the observed frequencies. A histogram of heights of elderly women is shown (see below).

Answer 49

A group of patients treated in the past with a standard therapy, used as the control group for evaluating a new treatment on current patients. Although used fairly frequently in medical investigations, the approach is not to be recommended since possible biases, due to other factors that may have changed over time, can never be satisfactorily eliminated.

Answer 50

A term that is used in statistics to indicate the equality of some quantity of interest (most often a variance), in a number of different groups, populations, etc.

Answer 51

homo means “same” and -scedastic means “scattered”; homoscedasticity therefore means the constancy of the variance of a measure over the levels of the factors under study.

Answer 52

A general term for the procedure of assessing whether sample data is consistent or otherwise with statements made about the population.

Answer 53

A measure of the rate at which people without a disease develop the disease during a specific period of time. Calculated as incidence = (number of new cases over a period of time) / (population at risk in the time period); it measures the appearance of disease. More generally, the number of new events, e.g. new cases of a disease in a specified population, within a specified period of time. The term incidence is sometimes wrongly used to denote incidence rate.

Answer 54

Two events are said to be independent if the occurrence of one is in no way predictable from the occurrence of the other. Two variables are said to be independent if the distribution of values of one is the same for all values of the other.

Answer 55

The variables appearing on the right-hand side of the equations defining, for example, multiple regression or logistic regression, and which seek to predict or ‘explain' the response variable. Using the term independent variable is not recommended since they are rarely independent of one another.

Answer 56

The process of drawing conclusions about a population on the basis of measurements or observations made on a sample of individuals from the population.

Answer 57

A term applied when two (or more) explanatory variables do not act independently on a response variable. The graphic below shows an example from a 2 x 2 factorial design. In statistics, interaction is also the necessity for a product term in a linear model.

Answer 58

The parameter in an equation derived from a regression analysis corresponding to the expected value of the response variable when all the explanatory variables are zero.

Answer 59

A measure of spread given by the difference between the first and third quartiles of a sample.
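
A brief sketch using the standard library. Several conventions exist for locating sample quartiles, so the value can differ slightly between packages; the "inclusive" method is assumed here purely for illustration, and the data are hypothetical.

```python
from statistics import quantiles


def interquartile_range(data):
    """Third quartile minus first quartile ("inclusive" quartile method)."""
    q1, _median, q3 = quantiles(data, n=4, method="inclusive")
    return q3 - q1


# Hypothetical sample: quartiles are 3 and 7 under this convention.
iqr = interquartile_range([1, 2, 3, 4, 5, 6, 7, 8, 9])
```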

Answer 60

the degree of agreement among raters. It gives a score of how much homogeneity or consensus there is in the ratings given by judges. It is useful in refining the tools given to human judges, for example by determining if a particular scale is appropriate for measuring a particular variable. If various raters do not agree, either the scale is defective or the raters need to be re-trained. There are a number of statistics which can be used to determine inter-rater reliability. Different statistics are appropriate for different types of measurement. Some options are: joint-probability of agreement, Cohen's kappa and the related Fleiss' kappa, inter-rater correlation, concordance correlation coefficient and intra-class correlation.

Answer 61

A study in which conditions are under the direct control of the investigator. In epidemiology, a study in which a population is selected for a planned trial of a regimen whose effects are measured by comparing the outcome of the regimen in the experimental group with the outcome of another regimen in a control group.

Answer 62

A nonparametric method of compiling life or survival tables. This combines calculated probabilities of survival and estimates to allow for censored observations, which are assumed to occur randomly. The intervals are defined as ending each time an event (death, withdrawal) occurs and are therefore unequal.
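
A small sketch of the product-limit calculation: at each distinct time at which a death occurs, the running survival estimate is multiplied by (1 - deaths / number at risk). It assumes subjects censored at a given time are still counted as at risk at that time. The function name and the data are hypothetical.

```python
def kaplan_meier(observations):
    """Product-limit survival estimates.

    observations: list of (time, event) pairs, where event is True for a
    death and False for a censored observation.
    Returns a list of (time, survival) pairs, one per distinct death time.
    """
    survival = 1.0
    curve = []
    death_times = sorted({time for time, event in observations if event})
    for t in death_times:
        at_risk = sum(1 for time, _ in observations if time >= t)
        deaths = sum(1 for time, event in observations if time == t and event)
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical follow-up data: deaths at times 1, 3, 4; censoring at 2, 5.
curve = kaplan_meier([(1, True), (2, False), (3, True), (4, True), (5, False)])
```

With five subjects the estimate drops to 4/5 at time 1, then to 4/5 x 2/3 at time 3 (three still at risk), then to 4/5 x 2/3 x 1/2 at time 4.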

Answer 63

A measure of the degree of nonrandom agreement between observers or measurements of the same categorical variable: kappa = (Po - Pe)/(1 - Pe), where Po is the proportion of times the measurements agree, and Pe is the proportion of times they can be expected to agree by chance alone. If the measurements agree more often than expected by chance, kappa is positive; if concordance is complete, kappa = 1; if there is no more nor less than chance concordance, kappa = 0; if the measurements disagree more than expected by chance, kappa is negative.
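
A sketch of the calculation from a square agreement table, where cell (i, j) counts the items placed in category i by the first rater and category j by the second. Po is the proportion on the diagonal and Pe is computed from the products of the marginal proportions. The function name and counts are hypothetical.

```python
def cohen_kappa(table):
    """Kappa = (Po - Pe) / (1 - Pe) for a square table of counts."""
    n = sum(sum(row) for row in table)
    p_observed = sum(table[i][i] for i in range(len(table))) / n
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    p_expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical ratings of 50 items by two raters.
kappa = cohen_kappa([[20, 5], [10, 15]])
```

Here Po = 35/50 = 0.7 and Pe = (25 x 30 + 25 x 20)/2500 = 0.5, giving kappa = 0.4.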

Answer 64

the extent to which a unimodal distribution is peaked.

Answer 65

A principle of estimation, attributable to Gauss, in which the estimates of a set of parameters in a statistical model are those quantities that minimize the sum of squared differences between the observed values of the dependent variable and the values predicted by the model.

Answer 66

The level of probability at which it is agreed that the null hypothesis will be rejected. Conventionally set at 0.05.

Answer 67

A procedure often applied in prospective studies to examine the distribution of mortality and/or morbidity in one or more diseases in a cohort study of patients over a fixed period of time. For each specific increment in the follow-up period, the number entering the period, the number leaving during the period, and the number either dying from the disease (mortality) or developing the disease (morbidity), are all calculated. It is assumed that an individual not completing the follow-up period is exposed for half this period, thus enabling the data for those 'leaving' and those 'staying' to be combined into an appropriate denominator for the estimation of the percentage dying from or developing the disease. The advantage of this approach is that all patients, not only those who have been involved for an extended period, can be included in the estimation process.

Answer 68

A function constructed from a statistical model and a set of observed data that gives the probability of the observed data for various values of the unknown model parameters. The parameter values that maximize the probability are the maximum likelihood estimates of the parameters.

Answer 69

The ratio of the likelihood of observing data under actual conditions, to observing these data under the other, e.g., “ideal” conditions; or comparison of various model conditions to assess which model provides the best fit. Likelihood ratios are used to appraise screening and diagnostic tests in clinical epidemiology.

Answer 70

A statistical test based on the ratio of the maximum value of the likelihood function under one statistical model to the maximum value under another statistical model; the models differ in that one includes and the other excludes one or more parameters.

Answer 71

a form of regression analysis in which observational data are modeled by a function which is a linear combination of the model parameters and depends on one or more independent variables. In simple linear regression the model function represents a straight line. The results of data fitting are subject to statistical analysis. The data consist of m values taken from observations of the dependent variable (response variable) y. The independent variables are also called regressors, exogenous variables, input variables and predictor variables. In simple linear regression the data model is written as yi = β0 + β1 xi + εi, where εi is an observational error. β0 (intercept) and β1 (slope) are the parameters of the model.
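
A minimal sketch of fitting the straight-line model, assuming the parameters are estimated by ordinary least squares: the slope is the sum of cross-products of deviations divided by the sum of squared deviations of x, and the intercept makes the line pass through the point of means. The function name and data are hypothetical.

```python
def fit_line(xs, ys):
    """Least squares estimates (intercept, slope) for y = b0 + b1 * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples of size >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("slope is undefined when all x values are equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical error-free data lying exactly on y = 1 + 2x.
b0, b1 = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```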

Answer 72

A form of regression analysis used when the response variable is a binary variable. The method is based on the logistic transformation or logit of a proportion, namely logit(p) = ln(p/(1-p)). As p tends to 0, logit(p) tends to -∞ and as p tends to 1, logit(p) tends to ∞. The function logit(p) is a sigmoid curve that is symmetric about p = 0.5. Applying this transformation, this form of regression is written as ln(p/(1-p)) = β0 + β1x1 + ... + βqxq, where p = Pr(dependent variable = 1) and x1, x2, ..., xq are the explanatory variables. Using the logistic transformation in this way overcomes problems that might arise if p was modeled directly as a linear function of the explanatory variables; in particular it avoids fitted probabilities outside the range (0,1). The parameters in the model can be estimated by maximum likelihood estimation.

Answer 73

A statistical model of an individual's risk (probability of disease y) as a function of a risk factor x: P(y | x) = 1/(1 + e^(-α - βx)), where e is the (natural) exponential function. This model has a desirable range, 0 to 1, and other attractive statistical features. In the multiple logistic model, the term βx is replaced by a linear term involving several factors, e.g., β1 x1 + β2 x2 if there are two factors x1 and x2.

Answer 74

the logarithm of the ratio of frequencies of two different categorical outcomes such as healthy versus sick.

Answer 75

A linear model for the logit (natural log of the odds) of disease as a function of a quantitative factor X: logit(disease given X = x) = α + βx. This model is mathematically equivalent to the logistic model.
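
The equivalence can be checked numerically: taking the logit of a risk computed from the logistic form returns the linear term α + βx. This is only a sketch; the function names and parameter values are hypothetical.

```python
from math import exp, log


def logit(p):
    """Natural log of the odds p / (1 - p)."""
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    return log(p / (1 - p))


def logistic_risk(alpha, beta, x):
    """Risk under the logistic model: 1 / (1 + exp(-alpha - beta * x))."""
    return 1 / (1 + exp(-alpha - beta * x))


# Hypothetical parameter values, for illustration only.
alpha, beta, x = -2.0, 0.5, 3.0
risk = logistic_risk(alpha, beta, x)   # a probability between 0 and 1
linear_term = logit(risk)              # recovers alpha + beta * x = -0.5
```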

Answer 76

A statistical model that uses an analysis of variance type of approach for the modeling of frequency counts in contingency tables.

Answer 77

A test for comparing two or more sets of survival times, to assess the null hypothesis that there is no difference in the survival experience of the individuals in the different groups

Answer 78

Studies that give rise to longitudinal data. The defining characteristic of such a study is that subjects are measured repeatedly through time.

Answer 79

Mantel and Haenszel provided an adjusted odds ratio as an estimate of relative risk that may be derived from grouped and matched sets of data. It is now known as the Mantel-Haenszel estimate. The statistic may be regarded as a type of weighted average of the individual odds ratios, derived from stratifying a sample into a series of strata that are internally homogeneous with respect to confounding factors. The Mantel-Haenszel summarization method can also be extended to the summarization of rate ratios and rate differences from follow-up studies. An estimator of the assumed common odds ratio in a series of two-by-two contingency tables arising from different populations, for example, occupation, country of origin, etc.
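
A sketch of the summary odds ratio, assuming each stratum is a two-by-two table with cells a (exposed cases), b (exposed non-cases), c (unexposed cases) and d (unexposed non-cases) and total n. The estimate is the sum of a*d/n over strata divided by the sum of b*c/n. The function name, cell layout and counts are assumptions for illustration.

```python
def mantel_haenszel_odds_ratio(strata):
    """Summary odds ratio over strata given as (a, b, c, d) tuples."""
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("estimate is undefined when every b * c is zero")
    return numerator / denominator


# Two hypothetical strata (for example, two age groups).
strata = [(10, 20, 5, 40), (15, 10, 10, 20)]
summary_or = mantel_haenszel_odds_ratio(strata)
```

The stratum-specific odds ratios here are 4.0 and 3.0, and the summary value, 356/104 or about 3.42, falls between them, as expected of a weighted average. With a single stratum the formula reduces to the ordinary odds ratio a*d/(b*c).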

Answer 80

A summary chi-square test developed by Mantel and Haenszel for stratified data and used when controlling for confounding.

Answer 81

the row and column totals of a contingency table.

Answer 82

The process of making a study group and a comparison group comparable with respect to extraneous factors. Often used in retrospective studies when selecting cases and controls to control variation in a response variable due to sources other than those immediately under investigation. Several kinds of matching can be identified, the most common of which is when each case is individually matched with a control subject on the matching variables, such as age, sex, occupation, etc. When the variable on which the matching takes place is continuous it is usually transformed into a series of categories (e.g. age), but a second method is to say that two values of the variable match if their difference lies between defined limits. This method is known as caliper matching. Also important is group or category matching in which the distributions of the extraneous factors are made similar in the groups to be compared.

Answer 83

the value for an unknown parameter that maximizes the probability of obtaining exactly the data that were observed. Used to fit logistic regression models.

Answer 84

A test for comparing proportions in data involving paired samples. The test statistic is given by X^2 = (b - c)^2 / (b + c), where b is the number of pairs for which the individual receiving treatment A has a positive response and the individual receiving treatment B does not, and c is the number of pairs for which the reverse is the case. If the probability of a positive response is the same in each group, then X^2 has a chi-squared distribution with a single degree of freedom.
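
A short sketch of the statistic and its p-value. For a chi-squared distribution with one degree of freedom the upper-tail probability beyond x equals erfc(sqrt(x / 2)), which avoids any external library. No continuity correction is applied; the function name and counts are hypothetical.

```python
from math import erfc, sqrt


def mcnemar_test(b, c):
    """Return (statistic, p_value) from the two discordant-pair counts."""
    if b + c == 0:
        raise ValueError("no discordant pairs")
    statistic = (b - c) ** 2 / (b + c)
    # Upper tail of the chi-squared distribution with 1 degree of freedom.
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts of discordant pairs.
mcnemar_stat, mcnemar_p = mcnemar_test(15, 5)
```

With b = 15 and c = 5 the statistic is 100/20 = 5.0, and the p-value is a little over 0.025.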

Answer 85

A measure of location or central value for a continuous variable. For a definition of the population value see expected value.

Answer 86

the expected value of the square of the difference between an estimator and the true value of a parameter. If the estimator is unbiased then the mean squared error (MSE) is simply the variance of the estimator. For a biased estimator the MSE is equal to the sum of the variance and the square of the bias.
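
The decomposition can be illustrated with a finite set of estimates, replacing expectations by averages (an empirical analogue, assumed here for illustration). The function name and values are hypothetical.

```python
def mse_decomposition(estimates, true_value):
    """Return (mse, variance, bias) with averages in place of expectations."""
    n = len(estimates)
    centre = sum(estimates) / n
    bias = centre - true_value
    variance = sum((e - centre) ** 2 for e in estimates) / n
    mse = sum((e - true_value) ** 2 for e in estimates) / n
    return mse, variance, bias


# Hypothetical estimates of a parameter whose true value is 3.
mse, variance, bias = mse_decomposition([2, 4, 6], true_value=3)
```

Here the bias is 1, the variance is 8/3 and the MSE is 11/3, which equals the variance plus the squared bias.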

Answer 87

A mismatch between an estimated value and its true value. Can be observed when using multiple measures of the same entity or concept.

Answer 88

the range of possible values for a measurement (e.g. the set of possible responses to a question, the physically possible range for a set of body weights). Measurement scales can be classified according to the quantitative character of the scale. Dichotomous scale: one that arranges items into either of two mutually exclusive categories, e.g. yes/no, alive/dead. Nominal scale: classification into unordered qualitative categories, e.g. race, religion, country of birth. Measurements of individual attributes are purely nominal scales, as there is no inherent order to their categories. Ordinal scale: classification into ordered qualitative categories, e.g. grade, where the values have a distinct order but their categories are qualitative in that there is no natural (numerical) distance between their possible values. Interval scale: an (equal) interval scale involves assignment of values with a natural distance between them, so that a particular distance (interval) between two values in one region of the scale meaningfully represents the same distance between two values in another region of the scale. Examples include Celsius and Fahrenheit temperature, date of birth. Ratio scale: an interval scale with a true zero point, so that ratios between values are meaningfully defined. Examples are absolute temperature, weight, height, blood count, and income, as in each case it is meaningful to speak of one value as being so many times greater or less than another value.

Answer 89

A general term for several values of the distribution of a set of values or measurements located at or near the middle of the set. The principal measures of central tendency are the mean, median, and mode.

Answer 90

The value in a set of ranked observations that divides the data into two parts of equal size. When there is an odd number of observations the median is the middle value. When there is an even number of observations the measure is calculated as the average of the two central values. Provides a measure of location of a sample that is suitable for asymmetric distributions and is also relatively insensitive to the presence of outliers.

Answer 91

A collection of techniques whereby the results of two or more independent studies are statistically combined to yield an overall answer to a question of interest. The rationale behind this approach is to provide a test with more power than is provided by the separate studies themselves. The procedure has become increasingly popular in the last decade or so but it is not without its critics particularly because of the difficulties of knowing which studies should be included and to which population final results actually apply.

Answer 92

The most frequently occurring value in a set of observations. Occasionally used as a measure of location.

Answer 93

in multiple regression analysis, a situation in which at least some of the independent variables are highly correlated with each other. Such a situation can result in inaccurate estimates of the parameters in the regression model.

Answer 94

the probability distribution associated with the classification of each of a sample of individuals into one of several mutually exclusive and exhaustive categories. When the number of categories is two, the distribution is called binomial.

Answer 95

Procedures for detailed examination of the differences between a set of means, usually after a general hypothesis that they are all equal has been rejected. No single technique is best in all situations and a major distinction between techniques is how they control the possible inflation of the type I error.

Answer 96

A term usually applied to models in which a continuous response variable, y, is regressed on a number of explanatory variables, x1, x2, ..., xq.

Answer 97

a set of techniques used when the variation in several variables has to be studied simultaneously. In statistics any analytic method that allows the simultaneous study of two or more dependent variables.

Answer 98

Data for which each observation consists of values for more than one random variable. For example, measurements on blood pressure, temperature and heart rate for a number of subjects.

Answer 99

Events that cannot occur jointly.

Answer 100

classification into unordered qualitative categories, e.g. race, religion, country of birth. Measurements of individual attributes are purely nominal scales, as there is no inherent order to their categories.

Answer 101

Statistical techniques of estimation and inference that are based on a function of the sample observations, the probability distribution of which does not depend on a complete specification of the probability distribution of the population from which the sample was drawn. Consequently the techniques are valid under relatively general assumptions about the underlying population. Often such methods involve only the ranks of the observations rather than the observations themselves. Examples are Wilcoxon's signed rank test and Friedman's two way analysis of variance. In many cases these tests are only marginally less powerful than their analogues which assume a particular population distribution (usually a normal distribution), even when that assumption is true. Also commonly known as nonparametric methods although the terms are not completely synonymous.

Answer 102

A clinical trial in which a series of consecutive patients receive a new treatment and those that respond (according to some pre-defined criterion) continue to receive it. Those patients that fail to respond receive an alternative, usually the conventional, treatment. The two groups are then compared on one or more outcome variables. One of the problems with such a procedure is that patients who respond may be healthier than those who do not respond, possibly resulting in an apparent but not real benefit of treatment.

Answer 103

A probability distribution, f(x), of a random variable, X, that is assumed by many statistical methods.

Answer 104

The ‘no difference’ or ‘no association’ hypothesis to be tested (usually by means of a significance test) against an alternative hypothesis that postulates a non-zero difference or association.

Answer 105

A study in which the objective is to uncover cause-and-effect relationships but in which it is not feasible to use controlled experimentation, in the sense of being able to impose the procedure or treatments whose effects it is desired to discover, or to assign subjects at random to different procedures. Surveys and most epidemiologic studies fall into this class. Since the investigator does not control the assignment of treatments there is no way to ensure that similar subjects receive different treatments. The classical example of such a study that successfully uncovered evidence of an important causal relationship is the smoking and lung cancer investigation of Doll and Hill.

Answer 106

variation (or error) due to failure of the observer to measure or identify a phenomenon accurately. Observer variation erodes scientific credibility whenever it appears. There are two varieties of observer variation: interobserver variation, i.e. the amount observers vary from one another when reporting on the same material, and intraobserver variation, i.e. the amount one observer varies between observations when reporting more than once on the same material.

Answer 107

the ratio of the probability of occurrence of an event to that of nonoccurrence (a binary variable), or the ratio of the probability that something is so to the probability that it is not so.

Answer 108

The ratio of two odds for a binary variable in two groups of subjects, for example, males and females. If the two possible states of the variable are labeled ‘success’ and ‘failure’ then the odds ratio is a measure of the odds of a success in one group relative to that in the other. When the odds of a success in each group are identical then the odds ratio is equal to one. Usually estimated as ad/bc, where a, b, c and d are the cell frequencies of the two-by-two contingency table formed from the data.
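
The ad/bc estimate can be sketched in a few lines of Python. The cell layout is an assumption made for this illustration: a and b are the successes and failures in the first group, c and d the successes and failures in the second. The function name and the counts are hypothetical.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a two-by-two table.

    Assumed layout (not stated in the card):
        group 1: a successes, b failures
        group 2: c successes, d failures
    """
    if b == 0 or c == 0:
        raise ValueError("odds ratio is undefined when b or c is zero")
    # (a / b) / (c / d) simplifies to ad / bc
    return (a * d) / (b * c)


# Hypothetical counts: odds of success are 20/80 in group 1 and 10/90 in group 2.
example_or = odds_ratio(20, 80, 10, 90)
same_odds = odds_ratio(10, 90, 10, 90)  # identical odds in both groups
```

With identical odds in the two groups the function returns one, matching the definition above.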

Answer 109

A significance test for which the alternative hypothesis is directional; for example, that one population mean is greater than another. The choice between a one-sided and two-sided test must be made before any test statistic is calculated.

Answer 110

classification into ordered qualitative categories, e.g. grade, where the values have a distinct order but their categories are qualitative in that there is no natural (numerical) distance between their possible values.

Answer 111

observations differing so widely from the rest of the data as to lead one to suspect that a gross error may have been committed, or suggesting that these values come from a different population. Statistical handling of outliers varies and is difficult.

Answer 112

A Student's t-test for the equality of the means of two populations, when the observations arise as paired samples. The test is based on the differences between the observations of the matched pairs.
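
As a rough sketch of how the test works on the pair differences, the statistic is the mean difference divided by its standard error, with n - 1 degrees of freedom. This formula is the standard one rather than something quoted from the card, and the function name and the measurements are made up for illustration.

```python
import math
import statistics


def paired_t(before, after):
    """Return (t statistic, degrees of freedom) for paired observations."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    # Standard error of the mean difference uses the sample standard deviation.
    se = statistics.stdev(diffs) / math.sqrt(n)
    return mean_diff / se, n - 1


# Hypothetical blood pressure readings on the same five subjects.
before = [140, 152, 138, 160, 147]
after = [135, 150, 136, 151, 143]
t_stat, df = paired_t(before, after)
```

The resulting t value would then be compared with a t-distribution on df degrees of freedom.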

Answer 113

A numerical characteristic of a population or a model. The probability of a ‘success’ in a binomial distribution, for example.

Answer 114

a statistical test that depends upon assumptions about the distribution of the data, e.g. that the data are normally distributed.

Answer 115

a way of expressing a number as a fraction of 100 (per cent meaning "per hundred").

Answer 116

The set of divisions that produce exactly 100 equal parts in a series of continuous values, such as blood pressure, weight, height, etc. Thus a person with blood pressure above the 80th percentile has a greater blood pressure value than over 80% of the other recorded values.

Answer 117

A treatment designed to appear exactly like a comparison treatment, but which is devoid of the active component.

Answer 118

The process of providing a numerical value for a population parameter on the basis of information collected from a sample. If a single figure is calculated for the unknown parameter the process is called point estimation. If an interval is calculated which is likely to contain the parameter, then the procedure is called interval estimation.

Answer 119

The probability distribution of the number of occurrences, X, of some random event, in an interval of time or space.

Answer 120

In statistics this term is used for any finite or infinite collection of ‘units’, which are often people but may be, for example, institutions, events, etc.

Answer 121

Analyses not explicitly planned at the start of a study but suggested by an examination of the data. Such comparisons are generally performed only after obtaining a significant overall F value.

Answer 122

The probability of rejecting the null hypothesis when it is false. Power gives a method of discriminating between competing tests of the same hypothesis, the test with the higher power being preferred. It is also the basis of procedures for estimating the sample size needed to detect an effect of a particular magnitude. Mathematically, power is 1 − β, where β is the probability of a type II error.

Answer 123

In screening and diagnostic tests, the probability that a person with a positive test is a true positive (i.e., does have the disease) is referred to as the “predictive value of a positive test.” The predictive value of a negative test is the probability that a person with a negative test does not have the disease. The predictive value of a screening test is determined by the sensitivity and specificity of the test, and by the prevalence of the condition for which the test is used.
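
The dependence on sensitivity, specificity and prevalence follows from Bayes' theorem. The sketch below is one way to write it; the function names and the numbers are illustrative assumptions, not values from the card.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(disease | positive test) via Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


def negative_predictive_value(sensitivity, specificity, prevalence):
    """P(no disease | negative test) via Bayes' theorem."""
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    return true_neg / (true_neg + false_neg)


# Same hypothetical test (90% sensitive, 90% specific) at two prevalences.
ppv_common = positive_predictive_value(0.9, 0.9, 0.20)
ppv_rare = positive_predictive_value(0.9, 0.9, 0.01)
npv_common = negative_predictive_value(0.9, 0.9, 0.20)
```

The same test gives a much lower positive predictive value when the condition is rare, which is why prevalence appears in the definition.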

Answer 124

the probability that a person with a negative test does not have the disease.

Answer 125

the probability that a person with a positive test is a true positive (i.e. does have the disease).

Answer 126

A measure of the number of people in a population who have a particular disease at a given point in time. Can be measured in two ways, as point prevalence (# of cases at a particular moment / population at that moment) and period prevalence (# of cases during a specified time period / # in population at midpoint of period). Essentially measures the existence of a disease.

Answer 127

a statistical method to simplify the description of a set of interrelated variables. Its general objectives are data reduction and interpretation; there is no separation into dependent and independent variables; the original set of correlated variables is transformed into a smaller set of uncorrelated variables called the principal components. Often used as the first step in a factor analysis.

Answer 128

A measure associated with an event A and denoted by Pr(A) which takes a value such that 0 ≤ Pr(A) ≤ 1. Essentially the quantitative expression of the chance that an event will occur. In general the higher the value of Pr(A) the more likely it is that the event will occur. If the event cannot happen Pr(A) = 0; if an event is certain to happen Pr(A) = 1. Numerical values can be assigned in simple cases by one of the following two methods: If the sample space can be divided into subsets of n (n ≥ 2) equally likely outcomes and the event A is associated with r (0 ≤ r ≤ n) of these, then Pr(A) = r / n. If an experiment can be repeated a large number of times, n, and in r cases the event A occurs, then r / n is called the relative frequency of A. If this leads to a limit as n → ∞, this limit is Pr(A).
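
Both ways of assigning a value can be tried on a simple case. The sketch below assumes a fair six-sided die and the event "an even number is rolled"; the function names, seed and number of trials are arbitrary choices for illustration.

```python
import random


def classical_probability(event, sample_space):
    """r / n for n equally likely outcomes, r of which belong to the event."""
    outcomes = list(sample_space)
    r = sum(1 for outcome in outcomes if event(outcome))
    return r / len(outcomes)


def relative_frequency(event, trials, rng):
    """r / n from repeating the experiment (one die roll) `trials` times."""
    hits = sum(1 for _ in range(trials) if event(rng.randint(1, 6)))
    return hits / trials


def is_even(outcome):
    return outcome % 2 == 0


p_classical = classical_probability(is_even, range(1, 7))
p_empirical = relative_frequency(is_even, 100_000, random.Random(42))
```

With many trials the relative frequency should settle close to the classical value of 0.5.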

Answer 129

For a discrete random variable, a mathematical formula that gives the probability of each value of the variable. See, for example, binomial distribution and Poisson distribution. For a continuous random variable, a curve described by a mathematical formula which specifies, by way of areas under the curve, the probability that the variable falls within a particular interval. Examples include the normal distribution and the exponential distribution. In both cases the term probability density may also be used. (A distinction is sometimes made between ‘density’ and ‘distribution’, when the latter is reserved for the probability that the random variable falls below some value. In this dictionary, however, the latter will be termed the cumulative probability distribution, and probability distribution and probability density are used synonymously.)

Answer 130

A type of ratio in which the numerator is included in the denominator.

Answer 131

A method that allows the hazard function to be modeled on a set of explanatory variables without making restrictive assumptions about the dependence of the hazard function on time. The model involved is ln h(t) = ln α(t) + β1x1 + β2x2 + ... + βqxq, where x1, x2, …, xq are the explanatory variables of interest, and h(t) the hazard function. The so-called baseline hazard function, α(t), is an arbitrary function of time. For any two individuals at any point in time the ratio of the hazard functions is a constant. Because the baseline hazard function, α(t), does not have to be specified explicitly, the procedure is essentially a distribution free method. Estimates of the parameters in the model, i.e. β1, β2, …, βq, are usually obtained by maximum likelihood estimation, and depend only on the order in which events occur, not on the exact times of their occurrence.
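
The constant hazard ratio can be checked numerically. In the sketch below the baseline hazard, the coefficients and the covariate values are all invented for illustration; only the model form h(t) = α(t) exp(β1x1 + ... + βqxq) comes from the definition.

```python
import math


def hazard(t, x, beta, baseline):
    """h(t) = baseline(t) * exp(beta . x), the model written on the hazard scale."""
    linear_predictor = sum(b * xi for b, xi in zip(beta, x))
    return baseline(t) * math.exp(linear_predictor)


def baseline(t):
    # Hypothetical baseline hazard; any positive function of time would do.
    return 0.1 + 0.05 * t


beta = [0.7, -0.2]   # assumed coefficients
x_a = [1, 3]         # covariates of individual A
x_b = [0, 1]         # covariates of individual B

ratios = [
    hazard(t, x_a, beta, baseline) / hazard(t, x_b, beta, baseline)
    for t in (0.5, 2.0, 10.0)
]
# The baseline cancels, leaving exp(beta . (x_a - x_b)) at every time point.
expected_ratio = math.exp(0.7 * (1 - 0) + -0.2 * (3 - 1))
```

Every entry of ratios equals expected_ratio, whatever baseline function is plugged in.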

Answer 132

Studies in which individuals are followed-up over a period of time. A common example of this type of investigation is where samples of individuals exposed and not exposed to a possible risk factor for a particular disease, are followed forward in time to determine what happens to them with respect to the illness under investigation. At the end of a suitable time period a comparison of the incidence of the disease amongst the exposed and non-exposed is made. A classical example of such a study is that undertaken among British doctors in the 1950s, to investigate the relationship between smoking and death from lung cancer. All clinical trials are prospective.

Answer 133

the probability that a test statistic would be as extreme as or more extreme than observed if the null hypothesis were true.

Answer 134

1. observations or information characterized by measurement on a categorical scale, i.e. a dichotomous (non-numeric) or nominal scale, or if the categories are ordered, an ordinal scale. Examples are sex, hair color, death or survival. 2. systematic non-numerical observations by sociologists, anthropologists, etc. using approved methods such as participant observation or key informants.

Answer 135

Divisions of a probability distribution or frequency distribution into equal, ordered subgroups, for example, quartiles or percentiles.

Answer 136

The values that divide a frequency distribution or probability distribution into four equal parts.

Answer 137

The variation in a data set unexplained by identifiable sources.

Answer 138

Allocation of individuals to groups, e.g., for experimental and control regimens, by chance.

Answer 139

an epidemiologic experiment in which subjects in a population are randomly allocated into groups, usually called study and control groups, to receive or not receive an experimental preventive or therapeutic procedure, maneuver, or intervention. The results are assessed by rigorous comparison of rates of disease, death, recovery, or other appropriate outcome in the study and control groups. RCTs are generally regarded as the most scientifically rigorous method of hypothesis testing available in epidemiology.

Answer 140

Either a set of n independent and identically distributed random variables, or a sample of n individuals selected from a population in such a way that each sample of the same size is equally likely.

Answer 141

A variable, the values of which occur according to some specified probability distribution.

Answer 142

The difference between the largest and smallest observations in a data set. Often used as an easy-to-calculate measure of the dispersion in a set of observations but not recommended for this task because of its sensitivity to outliers and the fact that its value increases with sample size.

Answer 143

The relative positions of the members of a sample with respect to some characteristic.

Answer 144

A measure of the frequency of some phenomenon of interest given by (number of events in a specified period) / (average population during the period).

Answer 145

The value obtained by dividing one quantity by another: a general term of which rate, proportion, percentage, etc., are subsets. The important difference between a proportion and a ratio is that the numerator of a proportion is included in the population defined by the denominator, whereas this is not necessarily so for a ratio.

Answer 146

a graphic means for assessing the ability of a screening test to discriminate between healthy and diseased persons. The term receiver operating characteristic comes from psychometry, where the characteristic operating response of a receiver-individual to faint stimuli or nonstimuli was recorded.

Answer 147

As used by Francis Galton (1822-1911) one of the founders of modern biology and biometry, in his book Hereditary Genius (1869), this meant the tendency of offspring of exceptional parents to possess characteristics closer to the average for the general population. Hence “regression to the mean,” i.e. the tendency of individuals at the extremes to have values nearer to the mean on repeated measurement. Can also be a synonym for regression analysis in statistics.

Answer 148

given data on a dependent variable y and one or more independent or predictor variables x1, x2, etc., regression analysis involves finding the “best” mathematical model (within some restricted class of models) to describe y as a function of the x's, or to predict y from the x's. The most common form is a linear model; in epidemiology, the logistic and proportional hazards models are also common.

Answer 149

A term usually applied to models in which a continuous response variable, y, is regressed on a number of explanatory variables, x1, x2, …, xq. Explicitly the model fitted is E(y) = β0 + β1x1 + β2x2 + ... + βqxq. The model for n observations can be written as y = Xβ + ε, where y is the n×1 vector of responses, X is the n×(q+1) matrix whose first column is all ones (for β0) and whose remaining columns hold the values of the explanatory variables, β′ = [β0, β1, …, βq], and ε contains the residual error terms. Least squares estimation of the parameters involves the following set of equations: β̂ = (X′X)⁻¹X′y. The regression coefficients β1, β2, …, βq give the change in the response variable corresponding to a unit change in the appropriate explanatory variable, conditional on the other variables remaining constant. Significance tests of whether the coefficients take the value zero can be derived on the assumption that for a given set of values of the explanatory variables, y has a normal distribution with constant variance.
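
A minimal sketch of least squares estimation, assuming the usual normal-equation form (X′X)β̂ = X′y with a column of ones added to X for the intercept. It uses only the standard library and a small Gauss-Jordan solver; the helper names and the toy data are hypothetical, and a real analysis would normally rely on a numerical library instead.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def least_squares(xs, y):
    """Return [b0, b1, ..., bq] solving the normal equations (X'X) b = X'y."""
    X = [[1.0] + list(row) for row in xs]   # leading 1.0 is the intercept column
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(a * b for a, b in zip(col, y)) for col in Xt]
    return solve(XtX, Xty)


# Toy data generated exactly from y = 1 + 2*x1 + 3*x2 (no error term).
xs = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y = [1, 3, 4, 6, 8]
coefs = least_squares(xs, y)
```

Because the toy data contain no noise, the fitted coefficients recover the intercept 1 and slopes 2 and 3 up to rounding.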

Answer 150

A measure of the association between exposure to a particular factor and risk of a certain outcome, calculated as incidence rate among exposed/incidence rate among nonexposed. Thus a relative risk of 5, for example, means that an exposed person is 5 times as likely to have the disease as one who is not exposed. Relative risk does not measure the probability that someone with the factor will develop the disease. The disease may be rare among both the nonexposed and the exposed.
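
A small worked example of the calculation, with made-up cohort counts chosen so that the disease is rare in both groups yet the relative risk is 5. The function name and all numbers are illustrative only.

```python
import math


def relative_risk(cases_exposed, n_exposed, cases_unexposed, n_unexposed):
    """Incidence among the exposed divided by incidence among the nonexposed."""
    incidence_exposed = cases_exposed / n_exposed
    incidence_unexposed = cases_unexposed / n_unexposed
    return incidence_exposed / incidence_unexposed


# Hypothetical cohort: 50 cases per 1000 exposed, 10 cases per 1000 nonexposed.
rr = relative_risk(50, 1000, 10, 1000)
risk_if_exposed = 50 / 1000   # only 5% of exposed people develop the disease
```

The ratio is large even though an exposed person's absolute risk stays small, which is the distinction the definition draws.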

Answer 151

The extent to which the same measurements of individuals obtained under different conditions yield similar results. Reliability refers to the degree to which the results obtained by a measurement procedure can be replicated. Lack of reliability may arise from divergences between observers or instruments of measurement or instability of the attribute being measured.

Answer 152

Repeated measures is a type of analysis of variance that generalizes Student's t test for paired samples. It is used when two or more measurements of the same type are made on the same subject. Analysis of variance is characterized by the use of factors, which are composed of levels. Repeated measures analysis of variance involves two types of factors--between subjects factors and within subjects factors. The repeated measures make up the levels of the within subjects factor. For example, suppose each subject has his/her reaction time measured under three different conditions. The conditions make up the levels of the within subjects factor. Depending on the study, subjects may be divided into groups according to levels of other factors called between subjects factors. Each subject is observed at only a single level of a between-subjects factor. For example, if subjects were randomized to aerobic or stretching exercise, form of exercise would be a between-subjects factor. The levels of a within-subject factor change as we move within a subject, while levels of a between-subject factor change only as we move between subjects.

Answer 153

The difference between the observed value of a response variable (yi) and the value predicted by some model of interest (ŷi). Examination of a set of residuals, usually by informal graphical techniques, allows the assumptions made in the model fitting exercise, for example, normality, homogeneity of variance, etc., to be checked. Generally, discrepant observations have large residuals, but some form of standardization may be necessary in many situations to allow identification of patterns among the residuals that may be a cause for concern.

Answer 154

The variable of primary importance in investigations since the major objective is usually to study the effects of treatment and/or other explanatory variables on this variable and to provide suitable models for the relationship between it and the explanatory variables.

Answer 155

A general term for studies in which all the events of interest occur prior to the onset of the study and findings are based on looking backward in time. Most common is the case-control study, in which comparisons are made between individuals who have a particular disease or condition (the cases) and individuals who do not have the disease (the controls). A sample of cases is selected from the population of individuals who have the disease of interest and a sample of controls is taken from among those individuals known not to have the disease. Information about possible risk factors for the disease is then obtained retrospectively for each person in the study by examining past records, by interviewing each person and/or interviewing their relatives, or in some other way. In order to make the cases and controls otherwise comparable, they are frequently matched on characteristics known to be strongly related to both disease and exposure leading to a matched case-control study. Age, sex and socioeconomic status are examples of commonly used matching variables. Also commonly encountered is the retrospective cohort study, in which a past cohort of individuals are identified from previous information, for example, employment records, and their subsequent mortality or morbidity determined and compared with the corresponding experience of some suitable control group.

Answer 156

An aspect of personal behavior or lifestyle, an environmental exposure, or an inborn or inherited characteristic which is thought to be associated with a particular disease or condition.

Answer 157

The ratio of two risks, usually exposed/not exposed.

Answer 158

a selected subset of a population. A sample may be random or nonrandom and may be representative or nonrepresentative. Several types of samples exist:
area sample – a method of sampling that can be used when the numbers in the population are unknown. The total area to be sampled is divided into subareas, e.g. by means of a grid that produces squares on a map; these subareas are then numbered and sampled, using a table of random numbers.
cluster sample – each unit selected is a group of persons (all persons in a city block, a family, a school, etc.) rather than an individual.
grab sample (sample of convenience) – samples selected by easily employed but basically nonprobabilistic methods. It is improper to generalize from the results of a survey based upon such a sample, for there is no way of knowing what types of bias may have been present.
probability (random) sample – all individuals have a known chance of selection. They may all have an equal chance of being selected, or, if a stratified sampling method is used, the rate at which individuals from several subsets are sampled can be varied so as to produce greater representation of some classes than others.
simple random sample – a form of sampling design in which n distinct units are selected from the N units in the population in such a way that every possible combination of n units is equally likely to be the sample selected. With this type of sampling design the probability that the ith population unit is included in the sample is n/N, so that the inclusion probability is the same for each unit. Designs other than this one may also give each unit equal probability of being included, but only here does each possible sample of n units have the same probability.
stratified random sample – this involves dividing the population into distinct subgroups according to some important characteristic, such as age or socioeconomic status, and selecting a random sample out of each subgroup. If the proportion of the sample drawn from each of the subgroups, or strata, is the same as the proportion of the total population contained in each stratum, then all strata will be fairly represented with regard to numbers of persons in the sample.
systematic sample – the procedure of selecting according to some simple, systematic rule, such as all persons whose names begin with specified alphabetic letters, born on certain dates, or located at specified points on a list. A systematic sample may lead to errors that invalidate generalizations.
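
Three of these designs are easy to sketch with Python's standard `random` module. The population, the strata, the sampling fraction and the function names below are invented for illustration; the stratified version uses the same fraction in every stratum, which is the proportional case described above.

```python
import random


def simple_random_sample(units, n, rng):
    """Every combination of n distinct units is equally likely."""
    return rng.sample(units, n)


def stratified_sample(strata, fraction, rng):
    """Draw the same fraction at random from each stratum (proportional allocation)."""
    return {
        name: rng.sample(members, round(len(members) * fraction))
        for name, members in strata.items()
    }


def systematic_sample(units, step, start=0):
    """Take every `step`-th unit from a list, beginning at index `start`."""
    return units[start::step]


rng = random.Random(0)                      # fixed seed so the sketch is repeatable
population = list(range(1, 101))            # hypothetical unit identifiers 1..100
strata = {"under_50": list(range(1, 61)), "50_plus": list(range(61, 101))}

srs = simple_random_sample(population, 10, rng)
strat = stratified_sample(strata, 0.1, rng)
syst = systematic_sample(population, 10)
```

Here each unit's inclusion probability under simple random sampling is n/N = 10/100, and the stratified sample keeps the 60:40 split of the population.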

Answer 159

The probability distribution of a statistic calculated from a random sample of a particular size. For example, the sampling distribution of the arithmetic mean of samples of size n taken from a normal distribution with mean μ and standard deviation σ is a normal distribution also with mean μ but with standard deviation σ/√n.
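
A quick simulation can make this concrete: draw many samples of size n from a normal distribution, compute each sample mean, and compare the spread of those means with σ/√n. The parameter values, the seed and the number of replicates are arbitrary assumptions for the sketch.

```python
import math
import random
import statistics

rng = random.Random(1)          # fixed seed for repeatability
mu, sigma, n = 50.0, 10.0, 25   # hypothetical population mean, SD and sample size
replicates = 5000

# One sample mean per replicate.
means = [
    statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
    for _ in range(replicates)
]

observed_center = statistics.fmean(means)
observed_sd = statistics.stdev(means)
theoretical_sd = sigma / math.sqrt(n)   # 10 / 5 = 2.0
```

The observed standard deviation of the sample means should land close to the theoretical value, and their average close to μ.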

Answer 160

A two-dimensional plot of a sample of bivariate observations. The diagram is an important aid in assessing what type of relationship links the two variables. An example is shown in below.

Answer 161

An index of the performance of a diagnostic test, calculated as the percentage of individuals with a disease who are correctly classified as having the disease, i.e. the conditional probability of having a positive test result given having the disease. A test is sensitive to the disease if it is positive for most individuals having the disease. In the formulas below:
a. diseased individuals detected by the test (true positives)
b. nondiseased individuals positive by the test (false positives)
c. diseased individuals not detectable by the test (false negatives)
d. nondiseased individuals negative by the test (true negatives)
```
Sensitivity = a/(a + c)
Specificity = d/(b + d)
Predictive value (positive test result) = a/(a + b)
Predictive value (negative test result) = d/(c + d)
```
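
The four formulas translate directly into code. The counts in this sketch are hypothetical and the function name is an illustrative choice; a, b, c and d carry the meanings listed above.

```python
import math


def diagnostic_indices(a, b, c, d):
    """a: true positives, b: false positives, c: false negatives, d: true negatives."""
    return {
        "sensitivity": a / (a + c),
        "specificity": d / (b + d),
        "predictive_value_positive": a / (a + b),
        "predictive_value_negative": d / (c + d),
    }


# Hypothetical screening study: 100 diseased and 400 nondiseased people.
indices = diagnostic_indices(a=90, b=40, c=10, d=360)
```

In this made-up table the test is 90% sensitive and 90% specific, yet only about 69% of positive results are true positives.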

Answer 162

An index of the performance of a diagnostic test, calculated as the percentage of individuals without the disease who are classified as not having the disease, i.e. the conditional probability of a negative test result given that the disease is absent. A test is specific if it is positive for only a small percentage of those without the disease.

Answer 163

used to describe the measurement of the steepness, incline, gradient, or grade of a straight line. A higher slope value indicates a steeper incline. The slope is defined as the ratio of the "rise" divided by the "run" between two points on a line, or in other words, the ratio of the altitude change to the horizontal distance between any two points on the line. The slope of a line in the plane containing the x and y axes is generally represented by the letter m, and is defined as the change in the y coordinate divided by the corresponding change in the x coordinate, between two distinct points on the line. This is described by the following equation: m = Δy / Δx. If y is a linear function of x, then the coefficient of x is the slope of the line created by plotting the function. Therefore, if the equation of the line is given in the form y = mx + b, then m is the slope. This form of a line's equation is called the slope-intercept form, because b can be interpreted as the y-intercept of the line, the y-coordinate where the line intersects the y-axis.
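
A short sketch of the rise-over-run calculation and the matching intercept in the slope-intercept form. The function names and the two points are illustrative assumptions.

```python
def slope(p1, p2):
    """m = (change in y) / (change in x) between two distinct points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("vertical line: slope is undefined")
    return (y2 - y1) / (x2 - x1)


def y_intercept(point, m):
    """b in y = m*x + b for a line of slope m passing through `point`."""
    x, y = point
    return y - m * x


# Hypothetical points (1, 3) and (3, 7): rise of 4 over a run of 2.
m = slope((1, 3), (3, 7))
b = y_intercept((1, 3), m)
```

These two points lie on the line y = 2x + 1.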

Answer 164

A measure of dispersion or variation. The most commonly used measure of the spread of a set of observations. Equal to the positive square root of the variance.

Answer 165

The standard deviation of the sampling distribution of a statistic. For example, the standard error of the sample mean of n observations is σ/√n, where σ² is the variance of the original observations.

Answer 166

A set of techniques used to remove as much as possible the effects of age or other confounding variables when comparing two or more populations. The common method uses weighted averaging of rates specific for age, sex or some other confounding variable(s), according to some specified distribution of these variables.

Answer 167

A numerical characteristic of a sample. For example, the sample mean and sample variance.

Answer 168

Statistical methods allow an estimate to be made of the probability of the observed or greater degree of association between independent and dependent variables under the null hypothesis. From this estimate, in a sample of given size, the statistical “significance” of a result can be stated. Usually the level of statistical significance is stated by the p value.

Answer 169

a procedure that is intended to decide whether a hypothesis about the distribution of one or more populations or variables should be rejected or accepted. Statistical tests may be parametric or nonparametric.

Answer 170

A method of displaying data in which each observation is split into two parts labeled the ‘stem’ and the ‘leaf’. A tally of the leaves corresponding to each stem has the shape of a histogram but also retains the actual observation values.
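
A minimal sketch of such a display, assuming non-negative whole numbers with the tens as the stem and the units digit as the leaf. The function name and the data are hypothetical, and stems with no observations are simply omitted in this version.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return one text line per stem: tens digit(s) | units digits in order."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return [
        f"{stem} | {''.join(str(leaf) for leaf in leaves)}"
        for stem, leaves in sorted(stems.items())
    ]


lines = stem_and_leaf([12, 15, 21, 23, 23, 28, 30, 34, 41])
for line in lines:
    print(line)
```

The row lengths give the histogram shape, while each digit after the bar still identifies an individual observation.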

Answer 171

A series of methods for selecting ‘good' (although not necessarily the best) subsets of explanatory variables when using regression analysis. The three most commonly used of these methods are forward selection, backward elimination and a combination of both of these known as stepwise regression. The criterion used for assessing whether or not a variable should be added to an existing model in forward selection or removed from an existing model in backward elimination is, essentially, the change in the residual sum-of-squares produced by the inclusion or exclusion of the variable.

Answer 172

A radically different approach to allocating probabilities to events than, for example, the commonly used long-term relative frequency approach. In this approach, probability represents a degree of belief in a proposition, based on all the available information. Two people with different information and different subjective ignorance may therefore assign different probabilities to the same proposition. The only constraint is that a single person's probabilities should not be inconsistent.

Answer 173

a concept in inferential statistics and descriptive statistics. More properly, it is "the sum of the squared deviations". Mathematically, it is an unscaled, or unadjusted measure of variability. When scaled for the number of degrees of freedom, it estimates the variance, or spread of the observations about their mean value. The distance from any point in a collection of data, to the mean of the data, is the deviation.
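
The relation between the sum of squared deviations and the variance can be sketched as follows. The data are made up, and n - 1 is assumed as the number of degrees of freedom, as is usual for a sample variance.

```python
import math
import statistics


def sum_of_squares(values):
    """Sum of squared deviations of the values from their mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


data = [2, 4, 4, 4, 5, 5, 7, 9]          # hypothetical observations, mean 5
ss = sum_of_squares(data)
variance_estimate = ss / (len(data) - 1)  # scaled by degrees of freedom
```

Dividing by the degrees of freedom gives the same value as the standard library's `statistics.variance`.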

Answer 174

an investigation in which information is systematically collected but in which the experimental method is not used. A population survey may be conducted by face-to-face inquiry, self-completed questionnaires, telephone, postal service, or in some other way.

Answer 175

a class of statistical procedures for estimating the survival function and for making inferences about the effects on it of treatments, prognostic factors, exposures, and other covariates.

Answer 176

A probability distribution or frequency distribution that is symmetrical about some central value.

Answer 177

The collection of individuals, items, measurements, etc., about which it is required to make inferences. Often the population actually sampled differs from the target population and this may result in misleading conclusions being made. The target population requires a clear precise definition, and that should include the geographical area (country, region, town, etc.) if relevant, the age group and gender.

Answer 178

A statistic used to assess a particular hypothesis in relation to some population. The essential requirement of such a statistic is a known distribution when the null hypothesis is true.

Answer 179

A change in the scale of measurement for some variable(s). Examples are the square root transformation and logarithm transformation.

Answer 180

the t-distribution is the distribution of a quotient of independent random variables, the numerator of which is a standard normal variate and the denominator of which is the positive square root of the quotient of a chi-square distributed variate and its number of degrees of freedom. The t-test uses a statistic that, under the null hypothesis, has the t-distribution to test whether two means differ significantly, or to test linear regression or correlation coefficients.

Answer 181

a hypothesis test for answering questions about the mean where the data are collected from two random samples of independent observations, each from an underlying normal distribution.

Answer 182

a statistical significance test based on the assumption that the data are distributed in both directions from the central value(s).

Answer 183

The two-way analysis of variance is an extension to the one-way analysis of variance. There are two independent variables (hence the name two-way). The two independent variables in a two-way ANOVA are called factors. The idea is that there are two variables, or factors, which affect the dependent variable. Each factor will have two or more levels within it, and the degrees of freedom for each factor is one less than the number of levels. The same assumptions apply as for one-way analysis of variance.

Answer 184

The error of rejecting a true null hypothesis; i.e. declaring a difference exists when it does not.

Answer 185

the error of failing to reject a false null hypothesis; i.e. declaring a difference does not exist when it in fact does.

Answer 186

In general terms, deviations of results or inferences from the truth, or processes leading to such deviation. More specifically, the extent to which the statistical method used in a study does not estimate the quantity thought to be estimated, or does not test the hypothesis to be tested. In estimation, usually measured by the difference between the expected value of a parameter estimate, E(θ̂), and the true value of the parameter, θ. An estimator for which E(θ̂) = θ is said to be unbiased.

Answer 187

Some characteristic that differs from subject to subject or from time to time. Any attribute, phenomenon, or event that can have different values.

Answer 188

In a population, the second moment about the mean.

Answer 189

An average of quantities to which have been attached a series of weights in order to make proper allowance for their relative importance.
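
In its most common form (assumed here, since the card gives no formula) each quantity is multiplied by its weight and the total is divided by the sum of the weights. The function name and the numbers are illustrative.

```python
def weighted_average(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * v for v, w in zip(values, weights)) / total_weight


# Hypothetical scores where the last one counts double.
avg = weighted_average([70, 80, 90], [1, 1, 2])
unweighted = weighted_average([70, 80, 90], [1, 1, 1])
```

With equal weights the result reduces to the ordinary arithmetic mean.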

Answer 190

a sample that is not strictly proportional to the distribution of classes in the universe population. A weighted sample has been adjusted to include larger proportions of some than other parts of the population because those parts accorded greater “weight” would otherwise not have sufficient numbers in the sample to lead to generalizable conclusions, or because they are considered to be more important, more interesting, more worthy of detailed study or other reasons.

Answer 191

Variable values transformed to zero mean and unit variance.
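
A brief sketch of the transformation: subtract the mean and divide by the standard deviation. The population standard deviation is used here as an assumption so that the transformed values have variance exactly one; the data and the function name are made up.

```python
import math
import statistics


def standardize(values):
    """Transform values to zero mean and unit (population) variance."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        raise ValueError("all values are identical; cannot standardize")
    return [(v - mean) / sd for v in values]


data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical observations: mean 5, SD 2
z_scores = standardize(data)
```

Each transformed value states how many standard deviations the original observation lies from the mean.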

Answer 192

A test for assessing hypotheses about population means when their variances are known.

Answer 193

A transformation of Pearson's product moment correlation coefficient, r, given by z = 1/2 ln[(1+r)/(1−r)]. The statistic z has an approximately normal distribution with mean 1/2 ln[(1+ρ)/(1−ρ)], where ρ is the population correlation value, and variance 1/(n−3), where n is the sample size. The transformation may be used to test hypotheses and to construct confidence intervals for ρ.
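
One way the transformation might be used for a confidence interval: transform r, add and subtract a normal critical value times 1/√(n−3), then transform the limits back with the inverse function (the hyperbolic tangent). The function names, the default critical value of 1.96 for a 95% interval, and the example inputs are assumptions of this sketch.

```python
import math


def fisher_z(r):
    """z = 1/2 ln[(1 + r) / (1 - r)], defined for -1 < r < 1."""
    return 0.5 * math.log((1 + r) / (1 - r))


def correlation_confidence_interval(r, n, z_crit=1.96):
    """Approximate interval for the population correlation (95% by default)."""
    if n <= 3:
        raise ValueError("sample size must exceed 3")
    z = fisher_z(r)
    se = 1 / math.sqrt(n - 3)
    lower, upper = z - z_crit * se, z + z_crit * se
    # tanh is the inverse of the z transformation, mapping back to (-1, 1).
    return math.tanh(lower), math.tanh(upper)


# Hypothetical sample: r = 0.5 from n = 28 pairs, so the standard error is 1/5.
z_value = fisher_z(0.5)
ci_low, ci_high = correlation_confidence_interval(0.5, 28)
```

Because the limits are computed on the z scale and then back-transformed, the interval is not symmetric about r and always stays inside (−1, 1).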