Final Flashcards

(147 cards)

1
Q

data

A

observations collected from field notes, surveys, experiments, etc

2
Q

what is the backbone of statistical investigation

A

data

3
Q

statistics

A

the study of how to collect, analyze, and draw conclusions from data

4
Q

classic challenge in statistics

A

evaluating the efficacy of medical treatment

5
Q

summary statistic

A

a single number summarizing a large amount of data

6
Q

variables

A

a characteristic of the cases in a data set

7
Q

data matrix

A

a way to organize data: each row represents a case and each column represents a variable

8
Q

numerical variable

A

takes a wide range of numerical values; it is sensible to add, subtract, or take averages with these values

9
Q

types of numerical variables

A

discrete, continuous

10
Q

discrete

A

can only take numerical values with jumps (eg number of siblings)

11
Q

continuous

A

can take numerical values without jumps (eg height)

12
Q

categorical

A

responses are categories

13
Q

types of categorical

A

ordinal, nominal

14
Q

ordinal variable

A

categorical but have a natural ordering (eg Likert scale)

15
Q

nominal variable

A

categorical with no natural ordering (eg favourite ice cream)

16
Q

negative, positive, independent association

A

negative association: as one variable increases, the other tends to decrease; positive association: the two variables tend to increase together; independent: knowing one variable tells us nothing about the other

17
Q

population vs sample

A

the group we want to make a generalization about vs the group we actually have information about

18
Q

anecdotal evidence

A

data collected in haphazard fashion from individual cases, usually composed of unusual cases that we recall based on their striking characteristics

19
Q

random sampling

A

selecting cases from the population at random, which helps avoid adding bias

20
Q

simple random sampling

A

the most basic random sample, like a raffle; every case in the population has an equal chance of being included

21
Q

non-response bias

A

low response rates can bias the results of a random sample, since respondents may differ from non-respondents

22
Q

convenience sampling

A

individuals who are easily accessible are more likely to be included in the sample

23
Q

independent variable

A

explanatory variable

24
Q

response variables

A

dependent variable

25
observational studies
collection of data in a way that doesn't directly interfere with how the data arise | eg: collecting surveys, ethnography, etc
26
randomized experiment
when individuals are randomly assigned to a group
27
confounding variable
variable correlated with both the explanatory and response variables | aka: lurking variable, confounding factor, confounder
28
prospective study
identifies individuals and collects information as events unfold | eg: medical researchers may identify and follow a group of similar individuals over many years
29
retrospective study
collects data after events have taken place | eg: researchers may review past events in medical records
30
simple random sampling
every case in population has equal chance of being included
31
stratified sampling
divide-and-conquer; the population is divided into strata (chosen so similar cases are grouped together), then a second sampling method (usually simple random) is employed within each stratum | eg: who in Canada goes to theme parks? intentionally oversample PEI, because if we didn't, most respondents would probably be from larger provinces like Ontario and PEI might be skipped entirely
32
when is stratified sampling useful?
when cases in each stratum are very similar with respect to the outcome of interest
33
cluster sampling
break up the population into clusters, then sample a fixed number of clusters and include all observations from each selected cluster | eg: surveying Saskatchewan children by sampling Saskatchewan schools randomly, then simple random sampling kids from the selected schools
34
multistage sampling
like a cluster sample, but we collect a random sample within each selected cluster
35
pros and cons of multistage sampling
+cluster/multistage sampling can be more economical than alternative sampling techniques | +most useful when there's a lot of case-to-case variability within a cluster but the clusters themselves don't look very different from one another (eg neighbourhoods, when they are very diverse) | -more advanced analysis techniques are typically required
36
scatter plots and their strength
provide a case-by-case view of two numerical variables | +helpful in quickly spotting associations relating variables, trends, etc
37
dot plots
provides the most basic display for one variable; like a one-variable scatterplot
38
mean
common way to measure the centre of a distribution of data - add up all the observations and divide by n - often labelled x-bar
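In symbols (a standard formula, consistent with the card above): for observations x_1, …, x_n,

```latex
\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i
```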
39
μ
population mean
40
μx
used to indicate which variable the population mean refers to
41
histograms
doesn't show the value of each observation; each value belongs to a bin, and the binned counts are plotted as bars on the histogram; provides a view of data density
42
pros and cons of histogram
+convenient for describing the shape of the data distribution | -doesn't show the value of each observation
43
skewness
right skew (longer right tail), left skew (longer left tail), symmetric (equal tails)
44
one, two, three prominent peaks
unimodal, bimodal, multimodal
45
two measures of variability
variance, standard deviation
46
variance
the average squared deviation | σ², standard deviation squared
47
standard deviation
σ | describes how far away the typical observation is from the mean
48
deviation
distance of an observation from its mean
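Cards 46-48 in symbols (standard formulas; for a sample, x-bar replaces μ and the variance divides by n − 1 instead of n):

```latex
\text{deviation: } x_i - \mu, \qquad \sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2, \qquad \sigma = \sqrt{\sigma^2}
```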
49
box plots
•summarizes a data set using five statistics while also plotting unusual observations
•step 1: draw a dark line denoting the median, which splits the data in half
•step 2: draw a rectangle to represent the middle 50% of the data
⁃aka the interquartile range (IQR), a measure of variability in data
⁃the more variable the data, the larger the standard deviation and IQR
⁃the two boundaries are called the first quartile and third quartile, Q1 and Q3 respectively
⁃IQR = Q3 − Q1
•step 3: whiskers attempt to capture the data outside of the box
⁃their reach is never allowed to be more than 1.5 × IQR
•step 4: any observations beyond the whiskers are identified as outliers
•robust estimates: extreme observations have little effect on the value
⁃the median and IQR are robust estimates
50
mapping data
colours are used to show higher and lower values of a variable; not helpful for getting precise values; helpful for seeing geographic trends and generating interesting research questions
51
contingency tables
summarizes data for two categorical variables | -each value in the table represents the number of times a particular combination of variable outcomes occurred
52
row totals
total counts across each row
53
column totals
total counts down each column
54
relative frequency table
replace counts with percentages or proportions
55
row proportions
computed as counts divided by row totals
56
segmented bar plots
graphical display of contingency table information
57
mosaic plot
graphical display of contingency table information | -uses areas to represent the number of observations
58
probability
proportion of times the outcome would occur if we observed the random process an infinite number of times
59
law of large numbers
as more observations are collected, the proportion p̂n of occurrences with a particular outcome converges to the probability p of that outcome
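In symbols, with p̂n the proportion of the first n observations showing the outcome:

```latex
\hat{p}_n \rightarrow p \quad \text{as} \quad n \rightarrow \infty
```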
60
disjoint outcomes
aka mutually exclusive | when two outcomes cannot happen at the same time
61
probability distributions
table of all disjoint outcomes and their associated probabilities
62
complement of event
all outcomes not in the event
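In symbols (a standard identity), writing A^c for the complement of event A:

```latex
P(A) + P(A^c) = 1 \quad\Longrightarrow\quad P(A^c) = 1 - P(A)
```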
63
sample space
set of all possible outcomes
64
independence
when knowing the outcome of one process provides no useful information about the outcome of the other
65
marginal probability
a probability based on a single variable
66
joint probability
when the probability of an outcome is based on two or more variables
67
defining conditional probability
two parts: outcome of interest and condition
68
condition
information we know to be true
69
conditional probability
the probability of the outcome of interest A, given condition B
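The standard formula tying cards 65-69 together; the numerator is a joint probability and the denominator is the marginal probability of the condition:

```latex
P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)}
```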
70
tree diagrams
organize outcomes and probabilities around the structure of data
71
when are tree diagrams most useful?
when two or more processes occur in a sequence and each process is conditioned on its predecessors
72
expected value of X
average outcome of X | denoted E(X)
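For a discrete random variable, the standard formula weights each outcome by its probability:

```latex
E(X) = \sum_{i} x_i \, P(X = x_i)
```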
73
deductive
reasoning
74
inductive
experience and reasoning
75
wheel of science
a cycle: theory → (deduction) → hypotheses → observations → (induction) → empirical generalizations → back to theory
76
measurement
downward part of wheel of science
77
conceptualization vs operationalize
"lack of money" vs "lack of opportunity" are two conceptualizations of poverty "do you have enough money to feed your family?" operationalizes the conceptualization of poverty different conceptualizations often require different operationalizations
78
quantitative vs qualitative
a little about a lot of people vs a lot about a few people
79
administrative data
a growing source of digital data collected in the process of administering other social goals; everything from information attached to a social insurance number to a credit card number; hard to make generalizations beyond the population | eg: a database built from health card use is hard to generalize to all of Canada, because people who didn't use health cards would be completely ignored
80
survey research
designed to ask research questions; responses are distilled into the data we work with; measurement necessitates some simplification, because we need to compare across different groups of people
81
population vs sample
group we want to make a generalization about vs the group we actually have information about
82
census
a rare kind of sample that covers an entire population; can be very expensive; basically the opposite of an anecdote
83
what is snowball sampling often used for?
vulnerable communities like illegal immigrant workers in America
84
complex random sampling
sample is still random, but we tweak things so that some cases are less/more likely to be selected
85
three sources of bias
non-response, voluntary response, convenience response
86
experiments
typically create artificial situations designed to isolate variables of interest and their effects
87
pros and cons of observational studies?
+can reveal meaningful connections | -hard to make claims of causation
88
R
increasingly popular open source client | accessible because it's free
89
SPSS
popular for undergrads and certain fields | designed for doing experimental research
90
Stata
popular among sociologists and economists
91
stacked dot plot
higher bars represent areas where there are more observations; makes it easier to judge the centre and shape of the distribution
92
shape of distribution is determined by....
modality (how many humps?), skewness (one side of the distribution looks very different from the other side), outliers (one or two observations are unusual)
93
questionaire
contains the actual phrasing of the questions and the options for the responses
94
codebook
summarizes the data set; tells us what the dataset's names mean, like a dictionary
95
CANSIM
micro data, summary statistics (overall estimates)
96
ODESI
contains confidential information; we can use the public-use parts of ODESI, in which everything is anonymized and variables have been "tweaked" a little to make sure that information can't be traced back to respondents
97
RDC
Research Data Centre; stuff you can't find on PUMFs
98
measures of central tendency
mode, median, mean; ie where do the data tend to accumulate?
99
pros and cons of mode
+can be used for all types of measures; relatively quick/simple measure | -doesn't use much information; most common doesn't necessarily mean typical (eg: 53 may be the modal age even though plenty of people are other ages)
100
how to calculate median
odd: middle observation | even: average of the two middle observations
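A quick worked example (illustrative numbers): for the ordered data 1, 3, 5 the median is 3; for 1, 3, 5, 8 it is

```latex
\text{median} = \frac{3 + 5}{2} = 4
```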
101
pros and cons of median
+captures the actual centre of the distribution; less susceptible to outliers | -computationally awkward; cannot be estimated for unordered categorical variables
102
percentiles
general concept closely related to the median (median = 50th percentile); there are 100 percentiles
103
interquartile range
the range between the 25th and 75th percentiles
104
90th percentile
90% of observations are lower, 10% are higher
105
25th percentile
25% of observations are lower, 75% are higher
106
mean cons
more susceptible to outliers
107
measures of dispersion
aim to give us a sense of the breadth of a distribution | eg: compare temperatures in Saskatoon vs Vancouver
108
range
interval between smallest and largest values
109
pros and cons of range
+good for a quick check | -only takes two observations into account; very sensitive to outliers; only useful for numeric variables
110
pros and cons of standard deviation
+variance and SD take all scores into account; accurately describe the "typical" deviation; easily interpreted | -sensitive to outliers; can only be calculated for numerical variables
111
proportions
frequencies are convoluted and make comparisons difficult, so proportions standardize frequency by the number of cases
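In symbols: if a category occurs f times among n cases,

```latex
\text{proportion} = \frac{f}{n}, \qquad \text{percentage} = 100 \times \frac{f}{n}
```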
112
frequency cons
working with them is tough when trying to conceptualize comparisons | -this can be fixed by changing them into percentages
113
cumulative percentage
the percentage in the category + the categories below it | only works for ordinal variables
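A small illustrative example (made-up numbers): for ordered categories with percentages 20%, 30%, 50%,

```latex
20\% \;\to\; 20\%, \qquad 30\% \;\to\; 20 + 30 = 50\%, \qquad 50\% \;\to\; 50 + 50 = 100\%
```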
114
random process
a process where we know what outcomes can happen, but we don’t know which particular outcome will happen
115
rules for probability distribution
1. the outcomes listed must be disjoint 2. each probability must be between 0 and 1 3. all the probabilities must total 1
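The three rules in symbols, for disjoint outcomes A_1, …, A_k:

```latex
0 \le P(A_i) \le 1 \quad \text{for each } i, \qquad \sum_{i=1}^{k} P(A_i) = 1
```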
116
algebra of probability
if we know the probabilities of their component outcomes, we can work out the probabilities of events
117
continuous distribution
another way of summarizing information; a more advanced mathematical concept than a bar graph; the line is called the probability density function -describes the information in the graph -has interesting properties -can be used to infer the probability of any outcome -never loops back (the line only moves from left to right) -is never negative -the area under the curve adds up to 1
118
area equals p
the area under the curve gives the probability of people falling in that range
119
frequency table (and a disadvantage)
lists all the values a variable can take on and how many people gave each response | -impractical for continuous variables because the data get too unwieldy
120
pie charts
they suck; don't use pie charts; they're misleading; only really great for visual appeal and public information; only work for things that sum to 100
121
bar charts
display simple information well; can chart frequencies and proportions; information doesn't need to sum to 100
122
law of large numbers
as more observations of a random process are collected, the proportion of occurrences with a particular outcome converges to the probability of that outcome
123
normal distribution
unimodal, symmetric, bell shaped curve | many variables are nearly normal, but none are exactly normal
124
what are normal distributions defined by?
the mean (where they sit on the number line) and the SD (peakedness/spread)
125
z scores
how many standard deviations does x fall from the mean? | every z score corresponds to a specific percentile
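The standard z-score formula, using the population mean and standard deviation:

```latex
z = \frac{x - \mu}{\sigma}
```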
126
inferential statistics
saying things about society as a whole without the futile attempt to examine the whole society
127
parameters
hypothetical number that exists somewhere | any characteristic of a population can be defined by a parameter
128
sampling error
the difference between estimate and actual parameter | unless we survey every case in the population, we will always have sampling error
129
sampling distribution
the hypothetical distribution we would get if we could sample our population an infinite number of times
130
standard error
typical or expected error, based on the sampling distribution; aka the standard deviation of the sampling distribution | -no obvious way to estimate SE from a single sample
131
central limit theorem
if a sample consists of at least 30 independent observations and the data are not strongly skewed, then the distribution of the sample mean is well approximated by a normal model; as n becomes large, the sampling distribution approaches normality and has less and less error in it; the standard error will be bigger if the population has a larger standard deviation; we can decrease our standard error by taking a bigger sample
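In symbols (the standard result): for n independent observations from a population with standard deviation σ,

```latex
SE_{\bar{x}} = \frac{\sigma}{\sqrt{n}}
```

so a larger σ inflates the standard error, and a larger n shrinks it.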
132
recipe for statistical inference
a point estimate, an estimate of its standard error, and a desired confidence level
133
confidence intervals
a plausible range of values for the population parameter | "what is the probability that the population mean falls within a certain range?" | width trades off with confidence
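The usual recipe in symbols (eg z* ≈ 1.96 for 95% confidence):

```latex
\bar{x} \;\pm\; z^{\star} \times SE
```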
134
narrowing intervals
we can narrow a confidence interval without reducing confidence by reducing our standard error
135
p values
the probability of observing data favourable to the alternative hypothesis if the null is true; p values are controversial; the larger the p value, the more consistent the data are with the null; it is not a quantifier of effect size, only a probability
136
hypothesis testing
comparing the world we actually observe to what we think the world should be like; if our evidence looks nothing like the null, we can reject the null
137
why null?
we don't want to say how certain we are, because we can never collect all the information, so there is always the possibility of one case out there proving us wrong; instead we try to improve our chances that the hypothesis is right; a kind of process of elimination
138
why double negatives?
because we accept the hypothesis conditionally, with some probability, but not absolute certainty
139
alpha level
expresses the same information as the confidence level, except the alpha level shows how unconfident you are | eg: if the confidence level is 95%, the alpha level is 0.05
140
single tail tests
how far away does the x-bar distribution need to be? used when we test whether x-bar is greater than or less than the population mean, but not both, so the critical region sits entirely in one tail (eg rejecting for z-scores beyond roughly 1.29); only common in psychology
141
why don't we use single tail tests that often?
because there's a way of framing single tail tests that makes it accidentally easier to reject the null, and therefore more likely to find positive research findings, which lowers the quality of the results
142
hypothesis testing framework
(1) write the hypothesis in plain language, then in mathematical notation (2) identify an appropriate point estimate of the parameter of interest (mean) (3) verify conditions to ensure the standard error estimate is reasonable and the point estimate is nearly normal and unbiased (4) compute the standard error; draw a picture depicting the distribution of the estimate under the idea that H0 is true; shade the areas representing the p-value (5) using the picture, compute the test statistic (ie Z-score) and identify the p-value to evaluate the hypothesis (6) write the conclusion in plain language
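A minimal sketch of steps 4-5 in R (the client named in card 88); the data values, the null value mu0, and the two-sided alternative are illustrative assumptions, not from the deck:

```r
# illustrative sample and null hypothesis H0: mu = 4.5
x   <- c(4.1, 5.3, 4.8, 5.9, 4.4, 5.1, 4.7, 5.6)
mu0 <- 4.5

n  <- length(x)
se <- sd(x) / sqrt(n)        # estimated standard error, s / sqrt(n)
z  <- (mean(x) - mu0) / se   # test statistic: z-score of x-bar under H0
p  <- 2 * pnorm(-abs(z))     # two-tailed p-value from the normal model

c(z = z, p = p)              # reject H0 if p is below the alpha level, eg 0.05
```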
143
two tail tests
we split the critical region between both tails; we don't assume whether the sampling distribution is above or below, just whether it falls outside or inside; we need a more extreme x-bar value to reject the hypothesis
144
type 1 vs type 2 error
type 1: falsely rejecting the null | type 2: falsely accepting the null
145
writing null vs writing alternative
H0 = null hypothesis -the skeptical perspective or claim to be tested -always written as an equality | HA = alternative hypothesis -the alternative or new claim under consideration
146
testing appropriateness of normal model
(1) fit simple histogram over normal curve | (2) examine normal probability plot
147
bin size
adding more bins provides greater detail; when the sample is large, smaller bins still work well; with smaller sample sizes, small bins are very volatile