Final Flashcards

(147 cards)

1
Q

data

A

observations collected from field notes, surveys, experiments, etc

2
Q

what is the backbone of statistical investigation

A

data

3
Q

statistics

A

the study of how to collect, analyze, and draw conclusions from data

4
Q

classic challenge in statistics

A

evaluating the efficacy of medical treatment

5
Q

summary statistic

A

a single number summarizing a large amount of data

6
Q

variables

A

a characteristic of the cases in a data set

7
Q

data matrix

A

a way to organize data: each row represents a case and each column represents a variable

8
Q

numerical variable

A

takes a wide range of numerical values; it is sensible to add, subtract, or take averages with these values

9
Q

types of numerical variables

A

discrete, continuous

10
Q

discrete

A

can only take numerical values with jumps (eg number of siblings)

11
Q

continuous

A

can take numerical values without jumps (eg height)

12
Q

categorical

A

responses are categories

13
Q

types of categorical

A

ordinal, nominal

14
Q

ordinal variable

A

categorical but have a natural ordering (eg Likert scale)

15
Q

nominal variable

A

categorical with no natural ordering (eg favourite ice cream)

16
Q

negative, positive, independent association

A

negative association: as one variable increases, the other tends to decrease; positive association: the two variables tend to increase together; independent: knowing one variable tells us nothing about the other

17
Q

population vs sample

A

the group we want to make a generalization about vs the group we actually have information about

18
Q

anecdotal evidence

A

data collected in haphazard fashion from individual cases, usually composed of unusual cases that we recall based on their striking characteristics

19
Q

random sampling

A

selecting cases from the population at random, which helps avoid adding bias

20
Q

simple random sampling

A

the most basic random sample, like a raffle; every case in the population has an equal chance of being included

21
Q

non-response bias

A

low response rates can bias the results of a random sample, since respondents may differ from non-respondents

22
Q

convenience sampling

A

individuals who are easily accessible are more likely to be included in the sample

23
Q

independent variable

A

explanatory variable

24
Q

response variables

A

dependent variable

25
observational studies
collection of data in a way that doesn't directly interfere with how the data arise | eg: collecting surveys, ethnography, etc
26
randomized experiment
when individuals are randomly assigned to a group
27
confounding variable
variable correlated with both the explanatory and response variables | aka: lurking variable, confounding factor, confounder
28
prospective study
identifies individuals and collects information as events unfold | eg: medical researchers may identify and follow a group of similar individuals over many years
29
retrospective study
collects data after events have taken place | eg: researchers may review past events in medical records
30
simple random sampling
every case in population has equal chance of being included
31
stratified sampling
divide-and-conquer; the population is divided into strata (chosen so similar cases are grouped together), then a second sampling method (usually simple random) is employed within each stratum | eg: who in Canada goes to theme parks? intentionally oversample PEI, because if we didn't, most respondents would probably be from larger provinces like Ontario and PEI might be skipped entirely
32
when is stratified sampling useful?
when cases in each stratum are very similar with respect to the outcome of interest
33
cluster sampling
break up the population into clusters, then sample a fixed number of clusters and include all observations from each selected cluster | eg: surveying Saskatchewan children by sampling Saskatchewan schools randomly, then simple random sampling kids from the selected schools
34
multistage sampling
like a cluster sample, but we collect a random sample within each selected cluster
35
pros and cons of multistage sampling
+cluster/multistage sampling can be more economical than alternative sampling techniques | +most useful when there's a lot of case-to-case variability within a cluster but the clusters themselves don't look very different from one another (eg neighbourhoods, when they are very diverse) | -more advanced analysis techniques are typically required
36
scatter plots and their strength
provide a case-by-case view of two numerical variables | +helpful in quickly spotting associations relating variables, trends, etc
37
dot plots
provides the most basic display for one variable; like a one-variable scatterplot
38
mean
common way to measure the centre of a distribution of data - add up all the observations and divide by n - often labelled x-bar
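In symbols (a standard formula, consistent with the card above): for observations x_1, …, x_n,

```latex
\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i
```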
39
μ
population mean
40
μx
used to indicate which variable the population mean refers to
41
histograms
doesn't show the value of each observation; each value belongs to a bin, and the binned counts are plotted as bars on the histogram; provides a view of data density
42
pros and cons of histogram
+convenient for describing the shape of the data distribution | -doesn't show the value of each observation
43
skewness
right skew (longer right tail), left skew (longer left tail), symmetric (equal tails)
44
one, two, three prominent peaks
unimodal, bimodal, multimodal
45
two measures of variability
variance, standard deviation
46
variance
the average squared deviation | σ², standard deviation squared
47
standard deviation
σ | describes how far away the typical observation is from the mean
48
deviation
distance of an observation from its mean
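Cards 46-48 in symbols (standard formulas; for a sample, x-bar replaces μ and the variance divides by n − 1 instead of n):

```latex
\text{deviation: } x_i - \mu, \qquad \sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2, \qquad \sigma = \sqrt{\sigma^2}
```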
49
box plots
•summarizes a data set using five statistics while also plotting unusual observations
•step 1: draw a dark line denoting the median, which splits the data in half
•step 2: draw a rectangle to represent the middle 50% of the data
⁃aka the interquartile range (IQR), a measure of variability in data
⁃the more variable the data, the larger the standard deviation and IQR
⁃the two boundaries are called the first quartile and third quartile, Q1 and Q3 respectively
⁃IQR = Q3 − Q1
•step 3: whiskers attempt to capture the data outside of the box
⁃their reach is never allowed to be more than 1.5 × IQR
•step 4: any observations beyond the whiskers are identified as outliers
•robust estimates: extreme observations have little effect on the value
⁃the median and IQR are robust estimates
50
mapping data
colours are used to show higher and lower values of a variable; not helpful for getting precise values; helpful for seeing geographic trends and generating interesting research questions
51
contingency tables
summarizes data for two categorical variables | -each value in the table represents the number of times a particular combination of variable outcomes occurred
52
row totals
total counts across each row
53
column totals
total counts down each column
54
relative frequency table
replace counts with percentages or proportions
55
row proportions
computed as counts divided by row totals
56
segmented bar plots
graphical display of contingency table information
57
mosaic plot
graphical display of contingency table information | -uses areas to represent the number of observations
58
probability
proportion of times the outcome would occur if we observed the random process an infinite number of times
59
law of large numbers
as more observations are collected, the proportion p̂n of occurrences with a particular outcome converges to the probability p of that outcome
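In symbols, with p̂n the proportion of the first n observations showing the outcome:

```latex
\hat{p}_n \rightarrow p \quad \text{as} \quad n \rightarrow \infty
```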
60
disjoint outcomes
aka mutually exclusive | when two outcomes cannot happen at the same time
61
probability distributions
table of all disjoint outcomes and their associated probabilities
62
complement of event
all outcomes not in the event
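In symbols (a standard identity), writing A^c for the complement of event A:

```latex
P(A) + P(A^c) = 1 \quad\Longrightarrow\quad P(A^c) = 1 - P(A)
```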
63
sample space
set of all possible outcomes
64
independence
when knowing the outcome of one process provides no useful information about the outcome of the other
65
marginal probability
a probability based on a single variable
66
joint probability
when the probability of an outcome is based on two or more variables
67
defining conditional probability
two parts: outcome of interest and condition
68
condition
information we know to be true
69
conditional probability
the probability of the outcome of interest A, given condition B
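The standard formula tying cards 65-69 together; the numerator is a joint probability and the denominator is the marginal probability of the condition:

```latex
P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)}
```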
70
tree diagrams
organize outcomes and probabilities around the structure of data
71
when are tree diagrams most useful?
when two or more processes occur in a sequence and each process is conditioned on its predecessors
72
expected value of X
average outcome of X | denoted E(X)
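For a discrete random variable, the standard formula weights each outcome by its probability:

```latex
E(X) = \sum_{i} x_i \, P(X = x_i)
```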
73
deductive
reasoning
74
inductive
experience and reasoning
75
wheel of science
a cycle: theory → (deduction) → hypotheses → observations → (induction) → empirical generalizations → back to theory
76
measurement
downward part of wheel of science
77
conceptualization vs operationalize
"lack of money" vs "lack of opportunity" are two conceptualizations of poverty "do you have enough money to feed your family?" operationalizes the conceptualization of poverty different conceptualizations often require different operationalizations
78
quantitative vs qualitative
a little about a lot of people vs a lot about a few people
79
administrative data
a growing source of digital data collected in the process of administering other social goals; everything from information attached to a social insurance number to a credit card number; hard to make generalizations beyond the population | eg: a database built from health card use is hard to generalize to all of Canada, because people who didn't use health cards would be completely ignored
80
survey research
designed to ask research questions; responses are distilled into the data we work with; measurement necessitates some simplification, because we need to compare across different groups of people
81
population vs sample
group we want to make a generalization about vs the group we actually have information about
82
census
a rare kind of sample that covers an entire population; can be very expensive; basically the opposite of an anecdote
83
what is snowball sampling often used for?
vulnerable communities like illegal immigrant workers in America
84
complex random sampling
sample is still random, but we tweak things so that some cases are less/more likely to be selected
85
three sources of bias
non-response, voluntary response, convenience response
86
experiments
typically create artificial situations designed to isolate variables of interest and their effects
87
pros and cons of observational studies?
+can reveal meaningful connections | -hard to make claims of causation
88
R
increasingly popular open source client | accessible because it's free
89
SPSS
popular for undergrads and certain fields | designed for doing experimental research
90
Stata
popular among sociologists and economists
91
stacked dot plot
higher bars represent areas where there are more observations; makes it easier to judge the centre and shape of the distribution
92
shape of distribution is determined by....
modality (how many humps?), skewness (one side of the distribution looks very different from the other side), outliers (one or two observations are unusual)
93
questionaire
contains the actual phrasing of the questions and the options for the responses
94
codebook
summarizes the data set; tells us what the dataset's names mean, like a dictionary
95
CANSIM
micro data, summary statistics (overall estimates)
96
ODESI
contains confidential information; we can use the public-use parts of ODESI, in which everything is anonymized and variables have been "tweaked" a little to make sure that information can't be traced back to respondents
97
RDC
Research Data Centre; stuff you can't find on PUMFs
98
measures of central tendency
mode, median, mean; ie where do the data tend to accumulate?
99
pros and cons of mode
+can be used for all types of measures; relatively quick/simple measure | -doesn't use much information; most common doesn't necessarily mean typical (eg: 53 may be the modal age even though plenty of people are other ages)
100
how to calculate median
odd: middle observation | even: average of the two middle observations
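A quick worked example (illustrative numbers): for the ordered data 1, 3, 5 the median is 3; for 1, 3, 5, 8 it is

```latex
\text{median} = \frac{3 + 5}{2} = 4
```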
101
pros and cons of median
+captures the actual centre of the distribution; less susceptible to outliers | -computationally awkward; cannot be estimated for unordered categorical variables
102
percentiles
general concept closely related to the median (median = 50th percentile); there are 100 percentiles
103
interquartile range
the range between the 25th and 75th percentiles
104
90th percentile
90% of observations are lower, 10% are higher
105
25th percentile
25% of observations are lower, 75% are higher
106
mean cons
more susceptible to outliers
107
measures of dispersion
aim to give us a sense of the breadth of a distribution | eg: compare temperatures in Saskatoon vs Vancouver
108
range
interval between smallest and largest values
109
pros and cons of range
+good for a quick check | -only takes two observations into account; very sensitive to outliers; only useful for numeric variables
110
pros and cons of standard deviation
+variance and SD take all scores into account; accurately describe the "typical" deviation; easily interpreted | -sensitive to outliers; can only be calculated for numerical variables
111
proportions
frequencies are convoluted and make comparisons difficult, so proportions standardize frequency by the number of cases
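In symbols: if a category occurs f times among n cases,

```latex
\text{proportion} = \frac{f}{n}, \qquad \text{percentage} = 100 \times \frac{f}{n}
```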
112
frequency cons
working with them is tough when trying to conceptualize comparisons | -this can be fixed by changing them into percentages
113
cumulative percentage
the percentage in the category + the categories below it | only works for ordinal variables
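A small illustrative example (made-up numbers): for ordered categories with percentages 20%, 30%, 50%,

```latex
20\% \;\to\; 20\%, \qquad 30\% \;\to\; 20 + 30 = 50\%, \qquad 50\% \;\to\; 50 + 50 = 100\%
```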
114
random process
a process where we know what outcomes can happen, but we don’t know which particular outcome will happen
115
rules for probability distribution
1. the outcomes listed must be disjoint 2. each probability must be between 0 and 1 3. all the probabilities must total 1
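The three rules in symbols, for disjoint outcomes A_1, …, A_k:

```latex
0 \le P(A_i) \le 1 \quad \text{for each } i, \qquad \sum_{i=1}^{k} P(A_i) = 1
```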
116
algebra of probability
if we know the probabilities of their component outcomes, we can work out the probabilities of events
117
continuous distribution
another way of summarizing information; a more advanced mathematical concept than a bar graph; the line is called the probability density function -describes the information in the graph -has interesting properties -can be used to infer the probability of any outcome -never loops back (the line only moves from left to right) -is never negative -the area under the curve adds up to 1
118
area equals p
the area under the curve gives the probability of people falling in that range
119
frequency table (and a disadvantage)
lists all the values a variable can take on and how many people gave each response | -impractical for continuous variables because the data get too unwieldy
120
pie charts
they suck; don't use pie charts; they're misleading; only really great for visual appeal and public information; only work for things that sum to 100
121
bar charts
display simple information well; can chart frequencies and proportions; information doesn't need to sum to 100
122
law of large numbers
as more observations of a random process are collected, the proportion of occurrences with a particular outcome converges to the probability of that outcome
123
normal distribution
unimodal, symmetric, bell shaped curve | many variables are nearly normal, but none are exactly normal
124
what are normal distributions defined by?
the mean (where they sit on the number line) and the SD (peakedness/spread)
125
z scores
how many standard deviations does x fall from the mean? | every z score corresponds to a specific percentile
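The standard z-score formula, using the population mean and standard deviation:

```latex
z = \frac{x - \mu}{\sigma}
```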
126
inferential statistics
saying things about society as a whole without the futile attempt to examine the whole society
127
parameters
hypothetical number that exists somewhere | any characteristic of a population can be defined by a parameter
128
sampling error
the difference between estimate and actual parameter | unless we survey every case in the population, we will always have sampling error
129
sampling distribution
the hypothetical distribution we would get if we could sample our population an infinite number of times
130
standard error
typical or expected error, based on the sampling distribution; aka the standard deviation of the sampling distribution | -no obvious way to estimate SE from a single sample
131
central limit theorem
if a sample consists of at least 30 independent observations and the data are not strongly skewed, then the distribution of the sample mean is well approximated by a normal model; as n becomes large, the sampling distribution approaches normality and has less and less error in it; the standard error will be bigger if the population has a larger standard deviation; we can decrease our standard error by taking a bigger sample
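In symbols (the standard result): for n independent observations from a population with standard deviation σ,

```latex
SE_{\bar{x}} = \frac{\sigma}{\sqrt{n}}
```

so a larger σ inflates the standard error, and a larger n shrinks it.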
132
recipe for statistical inference
a point estimate, an estimate of its standard error, and a desired confidence level
133
confidence intervals
a plausible range of values for the population parameter | "what is the probability that the population mean falls within a certain range?" | width trades off with confidence
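The usual recipe in symbols (eg z* ≈ 1.96 for 95% confidence):

```latex
\bar{x} \;\pm\; z^{\star} \times SE
```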
134
narrowing intervals
we can narrow a confidence interval without reducing confidence by reducing our standard error
135
p values
the probability of observing data favourable to the alternative hypothesis if the null is true; p values are controversial; the larger the p value, the more consistent the data are with the null; it is not a quantifier of effect size, only a probability
136
hypothesis testing
comparing the world we actually observe to what we think the world should be like; if our evidence looks nothing like the null, we can reject the null
137
why null?
we don't want to say how certain we are, because we can never collect all the information, so there is always the possibility of one case out there proving us wrong; instead we try to improve our chances that the hypothesis is right; a kind of process of elimination
138
why double negatives?
because we accept the hypothesis conditionally, with some probability, but not absolute certainty
139
alpha level
expresses the same information as the confidence level, except the alpha level shows how unconfident you are | eg: if the confidence level is 95%, the alpha level is 0.05
140
single tail tests
how far away does the x-bar distribution need to be? used when we test whether x-bar is greater than or less than the population mean, but not both, so the critical region sits entirely in one tail (eg rejecting for z-scores beyond roughly 1.29); only common in psychology
141
why don't we use single tail tests that often?
because there's a way of framing single tail tests that makes it accidentally easier to reject the null, and therefore more likely to find positive research findings, which lowers the quality of the results
142
hypothesis testing framework
(1) write the hypothesis in plain language, then in mathematical notation (2) identify an appropriate point estimate of the parameter of interest (mean) (3) verify conditions to ensure the standard error estimate is reasonable and the point estimate is nearly normal and unbiased (4) compute the standard error; draw a picture depicting the distribution of the estimate under the idea that H0 is true; shade the areas representing the p-value (5) using the picture, compute the test statistic (ie Z-score) and identify the p-value to evaluate the hypothesis (6) write the conclusion in plain language
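A minimal sketch of steps 4-5 in R (the client named in card 88); the data values, the null value mu0, and the two-sided alternative are illustrative assumptions, not from the deck:

```r
# illustrative sample and null hypothesis H0: mu = 4.5
x   <- c(4.1, 5.3, 4.8, 5.9, 4.4, 5.1, 4.7, 5.6)
mu0 <- 4.5

n  <- length(x)
se <- sd(x) / sqrt(n)        # estimated standard error, s / sqrt(n)
z  <- (mean(x) - mu0) / se   # test statistic: z-score of x-bar under H0
p  <- 2 * pnorm(-abs(z))     # two-tailed p-value from the normal model

c(z = z, p = p)              # reject H0 if p is below the alpha level, eg 0.05
```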
143
two tail tests
we split the critical region between both tails; we don't assume whether the sampling distribution is above or below, just whether it falls outside or inside; we need a more extreme x-bar value to reject the hypothesis
144
type 1 vs type 2 error
type 1: falsely rejecting the null | type 2: falsely accepting the null
145
writing null vs writing alternative
H0 = null hypothesis -the skeptical perspective or claim to be tested -always written as an equality | HA = alternative hypothesis -the alternative or new claim under consideration
146
testing appropriateness of normal model
(1) fit simple histogram over normal curve | (2) examine normal probability plot
147
bin size
adding more bins provides greater detail; when the sample is large, smaller bins still work well; with smaller sample sizes, small bins are very volatile