STA8170 Flashcards

(115 cards)

1
Q

Data

A

systematically recorded values (numbers or labels) together with their context

2
Q

Categorical/qualitative variable

A

variable that names categories with words or numbers

3
Q

Context (info required for?) (x6)

A
who was measured
what was measured 
how data was collected
where data was collected
when and why study was done
4
Q

Rows in a data table hold…

A

individual cases, eg respondents, participants, subjects, units, records

5
Q

Columns in a data table hold…

A

variables that give info about each individual case

6
Q

Quantitative variable

A

an amount or degree, measured in meaningful numbers eg scale

7
Q

Identifiers

A

variable that assigns a unique value to each individual/case - labels cases, cannot be analysed

8
Q

Relational database

A

large databases that link data tables together by matching identifiers

9
Q

Ordinal variable

A

categorical variable with ordering of values

10
Q

Data table

A

an arrangement of data in which each row represents a case, and each column a variable

11
Q

Case

A

individual about whom/which we have data

12
Q

Record

A

info about an individual/case in a database

13
Q

Sample (x2)

A

representative subset of population

analysed to estimate/learn about the population

14
Q

Population

A

the collection of all individuals, items or objects of interest

15
Q

Nominal variable

A

variable whose values are only names of categories

16
Q

Units

A

quantity or amount used as standard of measurement

17
Q

Parameter (and greek letter)

A

any numerical characteristic of a population, written with a Greek letter - eg μ (mu) for the population mean

18
Q

Distribution (x2)

A

description of all the values a variable can take, and how often those values occur

19
Q

Three important things pictures can do in data analysis?

A

reveal things not able to be seen in data tables, helping to think about patterns/relationships
show important features in the data
tell others about the data

20
Q

Area principle (for graphing data)

A

the area occupied by a part of the graph should correspond to the magnitude of value it represents

21
Q

Frequency table (x3)

A

organises the cases by counting how many fall into each category of a variable
rows are category names
also records totals
describes the distribution of a categorical variable

22
Q

Relative frequency table (x2)

A

displays percentages, rather than counts, of values in each category
describes the distribution of a categorical variable

23
Q

Bar chart (x3)

A

Display distribution of a categorical variable
Categories on the x-axis, counts on the y-axis
spaces between the bars indicate that freestanding bars can be placed in any order

24
Q

Relative frequency bar chart

A

shows the percentage/proportion of values (y) falling under each category (x)

25
Pie charts are used to display...? | Plus one disadvantage
categorical data | visual comparisons between categories are more difficult than in eg a bar chart
26
Contingency table
a table showing how cases are distributed along each variable, contingent on the value of the other variable
27
Marginal distribution
the totals displayed (as counts or %) in the bottom row and last column of contingency tables
28
Conditional distribution
show the distribution of one variable for just those cases that satisfy a condition on another variable
29
Variables in a contingency table are independent when... (x2)
the distribution of one variable is the same for all categories of another ie there is no association between them
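A minimal sketch of the marginal and conditional distributions described in cards 27-29, using made-up cases (the groups "A"/"B" and the yes/no responses are hypothetical, not from the course):

```python
from collections import Counter

# Hypothetical cases: (group, response) pairs
cases = [
    ("A", "yes"), ("A", "yes"), ("A", "no"), ("A", "no"),
    ("B", "yes"), ("B", "no"), ("B", "no"), ("B", "no"),
]

cells = Counter(cases)                                 # counts in each cell of the table
row_totals = Counter(group for group, _ in cases)      # marginal distribution of group
col_totals = Counter(resp for _, resp in cases)        # marginal distribution of response


def conditional(group):
    """Distribution of response for just the cases in the given group."""
    return {resp: cells[(group, resp)] / row_totals[group] for resp in col_totals}


print(conditional("A"))
print(conditional("B"))
```

If the two conditional distributions matched, the variables would look independent; here they differ, which suggests an association in this toy data.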
30
Histogram (x3)
Bar chart for quantitative data | Counts (y) grouped into bins (x) that make up the bars | No gaps between bars - or gap indicates no values for that bin
31
Relative frequency histogram
Use percentage on y-axis instead of counts
32
Stem and leaf plot (x3)
Similar to histogram, but shows the individual values | Useful for doing by hand or in Word, for <100 values | Stem values on the vertical axis, leaves across the horizontal
33
Dotplots
Like a stem and leaf, but with dots | Can be vertical (like stem plot) or horizontal
34
Categorical data condition (for deciding on how to display data) (x2)
Data is counts or percentages of individual cases in categories | Categories do not overlap
35
Quantitative data condition (for deciding on how to display data)
Data are values of a quantitative variable whose units are known
36
Four components for descriptions of distribution (plus egs) that mean you should be able to...
shape - symmetry, skew, gaps | outliers | centre - median | spread - range, interquartile range | ...roughly sketch the distribution
37
Modes (plus 3 types)
the peaks in distributions | unimodal | bimodal | multimodal
38
A distribution with no modes is described as...
uniform
39
Skew (x2)
a distribution with longer tail on one side | skew is described as to the side with the longer tail
40
Median (x3, plus how to find, x2)
the middle value that divides a histogram into two equal areas | appropriate description of centre for skewed distributions or with outliers | always pair with the IQR | if n is odd, median is the middle value | if n is even, median is the average of the two middle values
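The odd/even rule above can be sketched as follows (a hand-rolled illustration; Python's built-in `statistics.median` does the same job):

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([7, 1, 5]))      # odd n: the middle value
print(median([7, 1, 5, 3]))   # even n: average of the two middle values
```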
41
Range
difference between min and max values in a distribution
42
Quartile
the three dividing points that split the ordered values/cases in a distribution into four equal-sized groups
43
Interquartile range (x2)
= upper quartile - lower quartile | the spread of the middle 50% of the data, between the 25th and 75th percentiles
44
Percentile (plus eg x1)
the value that leaves that percentage of data below it | eg, 25th percentile has 25% of data below it
45
Five number summaries of distribution include...
minimum | Q1 | median | Q3 | maximum
46
Boxplots (x7)
display of the five number summary | vertical axis from min to max of data | box around Q1 and Q3 | horizontal line inside box at the median | 'fences' at 1.5 IQRs beyond lower and upper quartiles (not displayed, just for working) | whiskers from box to most extreme data values found within the fences | add dots for any values found outside the fences
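A rough sketch of the five number summary and the fence working from cards 45-46. The data are made up, and the quartile rule is an assumption: textbooks and software use slightly different conventions, so the `"inclusive"` method of `statistics.quantiles` is just one choice.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 30]   # hypothetical values
ordered = sorted(data)

# Quartiles (assumed convention: the "inclusive" method)
q1, med, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
five_number = (ordered[0], q1, med, q3, ordered[-1])

# Fences sit 1.5 IQRs beyond the quartiles (working only, not drawn)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers reach the most extreme values inside the fences; the rest are plotted as dots
inside = [v for v in ordered if lower_fence <= v <= upper_fence]
whiskers = (inside[0], inside[-1])
outliers = [v for v in ordered if v < lower_fence or v > upper_fence]

print(five_number, whiskers, outliers)
```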
47
Mean (x4)
average of all values in a distribution | appropriate description of centre for roughly symmetrical/normal data sets | always pair with SD | notation - a bar above the symbol, eg ū = the mean of u, pronounced u-bar
48
Standard deviation (x3)
describes the spread of a distribution | root of the average of the squared deviations of each value from the mean | (the average of the unsquared deviations would cancel out)
49
Variance
the average of the squared deviations of each value from the mean
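A small sketch of the variance and SD cards, with made-up data. One caveat worth flagging: the cards say "average", which divides by n; for a sample, the usual formula divides by n - 1 instead, so the `sample` switch below is an illustrative assumption about which version is wanted.

```python
import math


def variance(values, sample=True):
    """Average squared deviation from the mean (n - 1 divisor for a sample)."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(v - mean) ** 2 for v in values]
    return sum(squared_deviations) / (n - 1 if sample else n)


def standard_deviation(values, sample=True):
    """Square root of the variance, so it is back in the original units."""
    return math.sqrt(variance(values, sample))


data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical values, mean = 5
print(variance(data, sample=False), standard_deviation(data, sample=False))
print(variance(data), standard_deviation(data))
```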
50
A calculated summary is described as resistant if... (x2)
outliers only have a small effect on it | eg median and IQRs
51
Timeplot (x2)
a display of values (y) against time (x) | discern patterns by applying the lowess method - makes a smooth trace line of best fit
52
Moving average (plus method x2)
method for smoothing timeplots to identify trends | find the average value for a given time window, then move the window along by one timepoint and take a new average
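The windowing method above might look like this in code (a minimal sketch; the window length of 3 and the values are arbitrary):

```python
def moving_average(values, window):
    """Average each run of `window` consecutive values, sliding along one timepoint at a time."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


print(moving_average([2, 4, 6, 8, 10], window=3))
```

Note that the smoothed series is shorter than the original by `window - 1` points.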
53
Exponential smoothing
method for smoothing timeplots to identify trends | more sophisticated than moving average method | gives more weight to recent values, and less as they recede into the past
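One common form of this is simple exponential smoothing, sketched below. The smoothing weight `alpha` and the choice to start from the first observed value are assumptions for illustration; the card does not specify them.

```python
def exponential_smoothing(values, alpha):
    """Each smoothed point = alpha * newest value + (1 - alpha) * previous smoothed point.

    Older values are repeatedly multiplied by (1 - alpha), so their weight fades.
    """
    smoothed = [values[0]]   # assumed starting point: the first observation
    for value in values[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


print(exponential_smoothing([10, 20, 30], alpha=0.5))
```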
54
Re-expressing/transforming data is... (x3)
applying a simple function to make a skewed distribution more symmetrical | enables better use of centre and spread distribution descriptors | can facilitate the comparison of groups with very different distributions of scores
55
Rules of thumb for transformations of skewed data (x2)
variables that skew to the right are often helped by square roots, logs, reciprocals | variables that skew to the left are often helped by squaring the data
56
When comparing distributions consider their... (x3)
shape | centre | spread
57
When comparing boxplots, consider their...(x4)
shape - symmetric, skewed, diffs between groups | medians - which group has higher centre, any pattern to medians | IQRs - groups with more spread, patterns to change in IQRs | outliers - identify, consider, check for errors
58
For outliers, consider... (x2)
context - what is extreme in one context may be normal in another
59
Order of median, mode and mean in a positive/right skewed distribution
mean>median>mode
60
Order of median, mode and mean in a negative/left skewed distribution
mean<median<mode
61
Positive skew is... (x2)
skew to the right, | ie longer tail to the right
62
Negative skew is... (x2)
skew to the left, | ie longer tail to the left
63
How do you standardise a value? (calculate a z-score) (x2)
Subtract the mean from the value, | Divide the difference by the standard deviation
64
What does a z-score represent?
the distance of a value from the mean in standard deviations
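As a quick illustration of the two steps in card 63 (the score, mean and SD are made-up numbers):

```python
def z_score(value, mean, sd):
    """Subtract the mean, then divide the difference by the standard deviation."""
    return (value - mean) / sd


# Hypothetical example: a score of 85 where the mean is 70 and the SD is 10
print(z_score(85, mean=70, sd=10))   # distance from the mean, in SDs
```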
65
Greek letters are used for...
model parameters
66
Latin letters are used for...
statistics
67
What is the standard normal model/distribution? (x2)
a normal distribution with mean = 0 and SD = 1 | ie after you've standardised/calculated z-scores
68
Nearly normal data condition, and how to check
shape of distribution is unimodal and symmetric | check with histogram or Normal probability plot
69
How much of a normal distribution falls within 1, 2 and 3 SDs of the mean?
68% | 95% | 99.7%
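These percentages can be checked against the standard Normal model with Python's `statistics.NormalDist`; the helper name `share_within` is just for this sketch.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Proportion of a Normal distribution within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(100 * share_within(k), 1))
```

The exact figures are slightly off the rounded rule (for example, a little over 95% within 2 SDs), which is why it is a rule of thumb.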
70
Shifting a distribution... (x2)
is adding a constant to each value, | does not change SD or IQR
71
Rescaling a distribution...(x2)
is multiplying each value by a constant | also multiplies mean, median, quartiles, SD and IQR by the constant
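A short check of the shifting and rescaling cards on made-up data (the constants 10 and 3 are arbitrary):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]       # hypothetical values
shifted = [v + 10 for v in data]      # shifting: add a constant
rescaled = [v * 3 for v in data]      # rescaling: multiply by a constant

for label, values in (("original", data), ("shifted", shifted), ("rescaled", rescaled)):
    # Shifting moves the mean but leaves the SD alone; rescaling multiplies both
    print(label, statistics.mean(values), statistics.pstdev(values))
```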
72
Parameter
a numerically valued attribute of a model
73
Statistic
a value calculated to summarise data
74
Normal percentile
the percentage of values in a standard Normal distribution found at or below a given z-score
75
Normal probability plot (x3)
plots actual vs expected scores | if the plot is straight, the distribution is normal | called P-P plots in SPSS
76
σ (x2)
sigma | standard deviation
77
μ (x2)
mu (pronounced 'mew') | mean
78
N(μ, σ) (x2)
Normal model | Parameters are mean and SD
79
Formula for finding a value from a z-score
y = μ + z * σ
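A sketch of the formula in use, going from a Normal percentile to a z-score to a value. The model N(100, 15) and the 90th percentile are made-up choices for illustration.

```python
from statistics import NormalDist

mu, sigma = 100, 15                  # hypothetical model N(100, 15)

# z-score that has 90% of the standard Normal model at or below it
z = NormalDist().inv_cdf(0.90)

# Unstandardise: y = mu + z * sigma
y = mu + z * sigma
print(round(z, 2), round(y, 1))
```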
80
Scatterplot (and how to describe x4)
dot point graph of two variables on x and y axes | describe with: positive/negative direction/trend | form/shape of dots (straight, curved, no pattern?) | strength of relationship (how close together dots are) | unusual features/outliers
81
How to choose x and y axis for scatterplot vars?
put the variable of interest (DV), the one you want to predict and that responds to levels of the other var, on the y-axis | put the explanatory or predictor var (IV) on the x-axis
82
What assumptions/conditions must be met before using a correlation? (x3)
quantitative variables condition - can't use categorical data | straight enough condition - check the scatterplot for a linear relationship | no outliers condition - outliers can distort the strength or direction of a correlation
83
What is Spearman's Rho (ρ) useful for?
Calculating non-parametric association (correlation) when the relationship is not straight enough or has outliers
84
What is Kendall's tau (τ) useful for? (plus eg x1)
Calculating trend (monotonic relationship - correlation) when relationship is not linear | eg when data not truly quantitative
85
What is a lurking variable?
A hidden variable that influences both variables in our relationship/correlation
86
Transformation through squaring is useful when (x2)
unimodal distribution is skewed to the left | scatterplot bends downwards
87
Transformation through finding the root is useful when
data is a count of something
88
Transformation using log is useful when (plus one note)
measurements cannot be negative, or grow by percentage increases | nb, if there are zeros in the data try adding a small constant first
89
Transformation through negative reciprocal square root (-1 divided by the root of y) is useful when
you want to preserve the direction of the relationship
90
Transformation through negative reciprocal (-1 divided by y) is useful when (plus one note)
your data is the ratio of two quantities, eg miles per hour | nb, if there are zeros in the data try adding a small constant first
91
What is the ladder of powers? (x2, plus the 6 steps)
an ordering of transformations by the strength of their effect on data | if a transformation makes the data worse, move in the other direction on the ladder | Power 2 - squaring the data | Power 1 - no change; going further up or down from here increases the effect | Power 1/2 - square root | Power 0 - we place log in this spot | Power -1/2 - negative reciprocal root (-1 over root of y) | Power -1 - negative reciprocal (-1 over y)
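The six rungs could be written out as functions like this. It is only a sketch: the name `re_express` is made up, the log is assumed to be base 10 (the card does not say which base), and values must be positive for the root, log and reciprocal rungs.

```python
import math

# Ladder of powers: (power, re-expression applied to each value y)
LADDER = [
    (2, lambda y: y ** 2),
    (1, lambda y: y),
    (0.5, lambda y: math.sqrt(y)),
    (0, lambda y: math.log10(y)),        # log sits in the "power 0" spot (base 10 assumed)
    (-0.5, lambda y: -1 / math.sqrt(y)),
    (-1, lambda y: -1 / y),
]


def re_express(values, power):
    """Apply the ladder rung with the given power to every value."""
    transform = dict(LADDER)[power]
    return [transform(v) for v in values]


data = [1, 4, 100]   # hypothetical right-skewed values
for power, _ in LADDER:
    print(power, re_express(data, power))
```

The minus signs on the reciprocal rungs keep larger original values larger after re-expression, so the ordering of the data is preserved.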
92
What is y-hat (ŷ)?
the value predicted by a regression equation/line of best fit
93
What is a residual (in regression)?
the difference between the observed/actual value (y) and the predicted value (y-hat) | residual = observed value - predicted value
94
The least squares line is...
the line of best fit in regression/scatterplot | the line for which the sum of the squared residuals is smallest
95
Why must residuals be squared when calculating line of best fit/least squares?
because some of them will be negative, so unsquared residuals would cancel each other out
96
What does b represent in the linear model?
coefficients
97
What is the slope in the linear model (x2, plus notation)?
always measured/interpreted as units of y per unit of x | how rapidly y-hat responds to changes in x | b1 (1 is subscript)
98
What is the intercept in the linear model (x2, plus notation)?
where the line hits the y-axis | the starting point/baseline for our predictions | b0 (0 is subscript)
99
Equation for the linear model? (notation and in words)
y-hat = b0 + b1x | predicted y = intercept plus slope times x
100
What is the equation for finding slope in linear regression? (notation and in words)
b1 = r × (SDy / SDx) | slope = correlation times (standard deviation of y over the standard deviation of x)
101
What is the equation for finding the intercept in linear regression? (notation and in words)
b0 = mean(y) - b1 × mean(x) | intercept = the mean of y minus (the slope times the mean of x)
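A worked sketch of the slope and intercept formulas from cards 100-101, on a small made-up data set. Here r is computed as the sum of the products of z-scores divided by n - 1, which matches the sample SDs used below.

```python
import statistics

x = [1, 2, 3, 4, 5]   # hypothetical explanatory values
y = [2, 4, 5, 4, 5]   # hypothetical response values
n = len(x)

mean_x, mean_y = statistics.mean(x), statistics.mean(y)
sd_x, sd_y = statistics.stdev(x), statistics.stdev(y)

# Correlation r from z-scores
r = sum(
    ((xi - mean_x) / sd_x) * ((yi - mean_y) / sd_y)
    for xi, yi in zip(x, y)
) / (n - 1)

b1 = r * sd_y / sd_x          # slope = r times (SDy over SDx)
b0 = mean_y - b1 * mean_x     # intercept = mean of y minus (slope times mean of x)


def predict(x_value):
    """y-hat = b0 + b1 * x"""
    return b0 + b1 * x_value


residuals = [yi - predict(xi) for xi, yi in zip(x, y)]   # observed - predicted
print(round(b1, 3), round(b0, 3))
```

For these numbers the line works out to roughly y-hat = 2.2 + 0.6x, and the residuals sum to (essentially) zero, as they should for a least squares line.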
102
Define regression
the linear model fit by least squares
103
What are the conditions/assumptions that must be met before we can use regression?
same as for correlation: | quantitative data | straight enough relationship | no outliers
104
Explain regression to the mean (x3)
You can never predict that y will be further away from its mean (in SDs) than x was from its mean | because the equation for predicting z-scores is z-hat of y = r times the z of x | and r can only be between -1 and 1
105
What is the slope of a line of best fit for the z-scores of any two variables?
r (the correlation coefficient)
106
Formula for standard deviation of the residuals
the square root of (the sum of the squared residuals divided by (n - 2))
107
What does R-squared represent?
the proportion of the variation in y accounted for by the linear model
108
What does 1 - R-squared represent?
the proportion of the variation in y not accounted for by the linear model (the residuals/error)
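A sketch tying cards 106-108 together on made-up data: fit the least squares line, then compute the standard deviation of the residuals and R-squared. The variable names are illustrative only.

```python
import math

x = [1, 2, 3, 4, 5]   # hypothetical explanatory values
y = [2, 4, 5, 4, 5]   # hypothetical response values
n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Least squares slope and intercept
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in residuals)          # variation left over in the residuals
sst = sum((yi - mean_y) ** 2 for yi in y)     # total variation in y

s_e = math.sqrt(sse / (n - 2))                # standard deviation of the residuals
r_squared = 1 - sse / sst                     # share of variation the model accounts for

print(round(s_e, 3), round(r_squared, 3), round(1 - r_squared, 3))
```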
109
Conditions that must be met for regression (x4)
quantitative variables condition | For both data and residuals, check: | straight enough condition (linear relationship on scatterplot) | does the plot thicken? condition (even scatter around the line of best fit, or across scatterplot for residuals) | outlier condition (investigate any - they strongly affect r)
110
Assumptions of the linear model (x4)
variables are quantitative | their relationship is linear | error is approximately normally distributed | variance of the error is constant
111
Leverage in regression models refers to...
the fact that the further a given point's x-value is from the mean of x, the more strongly it pulls on the regression line
112
A data point is 'influential' in regression models if...
removing it from the analysis makes a meaningful difference to the model
113
Best graphical display for exploring two categorical variables
Two-way table
114
Best graphical display for exploring two quantitative variables
Scatterplot
115
Best graphical display for exploring one categorical and one quantitative variable
Boxplot