Module 1: Introduction to Data Flashcards

1
Q

Concept

A

Answer

2
Q

A frequency table exhibits how…

A

frequencies are distributed over various categories (known as a frequency distribution)

3
Q

Associated variables

A

When two variables show some connection/relationship with one another

4
Q

Blocking (experimental design)

A

Grouping the sample based on variables which may affect the outcome and then randomizing within groups
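
As a rough sketch of blocking, the snippet below groups subjects by a blocking variable and then randomizes within each group. The subjects, the "risk" variable, and the 50/50 split are all hypothetical, chosen only for illustration.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical subjects, each tagged with a blocking variable (risk level).
subjects = [{"id": i, "risk": "high" if i % 3 == 0 else "low"} for i in range(12)]

# Step 1: group the sample into blocks by the variable that may affect the outcome.
blocks = {}
for subject in subjects:
    blocks.setdefault(subject["risk"], []).append(subject)

# Step 2: randomize within each block, half to treatment and half to control.
assignment = {}
for members in blocks.values():
    shuffled = members[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    for subject in shuffled[:half]:
        assignment[subject["id"]] = "treatment"
    for subject in shuffled[half:]:
        assignment[subject["id"]] = "control"

for risk, members in blocks.items():
    treated = sum(assignment[s["id"]] == "treatment" for s in members)
    print(risk, "block:", treated, "treated of", len(members))
```

Because the split happens inside each block, both treatment groups end up with the same mix of high-risk and low-risk subjects.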

5
Q

Categorical variable

A

The individual entries are categories; the possible values are called “levels”

6
Q

Cluster sample

A

Break the population into groups and then sample a fixed number of those groups and include all observations from each group; helpful when there’s a lot of variability between cases within a cluster but the clusters themselves don’t differ much from one another

7
Q

Confounding variable

A

A variable that is correlated with both the explanatory and the response variables

8
Q

Continuous variable

A

A numerical variable that can take any value within a range (e.g. it can be measured to any number of decimal places); e.g. height, weight (think "how much")

9
Q

Controlling (experimental design)

A

Mitigate the differences between groups

10
Q

Convenience sample bias

A

When individuals who are more accessible are more likely to be included in the sample

11
Q

Cumulative frequency

A

The total of a frequency and all frequencies below it in a frequency distribution; the running total of frequencies

12
Q

Cumulative relative frequency

A

Cumulative frequency for that category/Sum of all frequencies
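
A small sketch of how frequency, relative frequency, cumulative frequency, and cumulative relative frequency relate to one another. The survey responses are made-up data, and the variable names are only illustrative.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical ordinal data: survey answers on a 1-5 scale.
responses = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]

counts = Counter(responses)
levels = sorted(counts)                        # categories in ascending order
freq = [counts[level] for level in levels]     # frequency per category
total = sum(freq)                              # sum of all frequencies

rel_freq = [f / total for f in freq]           # frequency / sum of all frequencies
cum_freq = list(accumulate(freq))              # running total of frequencies
cum_rel_freq = [c / total for c in cum_freq]   # cumulative frequency / sum of all frequencies

for row in zip(levels, freq, rel_freq, cum_freq, cum_rel_freq):
    print(row)
```

The last cumulative frequency always equals the total count, so the last cumulative relative frequency is always 1.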

13
Q

Data

A

Information we gather with experiments and with surveys

14
Q

Description

A

Summarizing the data that are obtained

15
Q

Descriptive statistics

A

Refers to methods for summarizing the data; describes the sample only (graphs, numerical summaries)

16
Q

Design

A

Planning how to obtain data to answer the questions of interest (experimental design, sample size, power, etc.)

17
Q

Discrete variable

A

A numerical variable that only takes number values in jumps (e.g. whole numbers); e.g. the number that appears when throwing a die (think "how many")

18
Q

Experiment

A

Used to investigate the possible causal connection between variables

19
Q

Explanatory variable

A

The variable (the first) that is thought to causally affect the other

20
Q

Frequency

A

The number of elements that belong in a certain category

21
Q

Graphical methods

A

Histogram, boxplot, bar graph, etc.

22
Q

Graphs (categorical)

A

Bar chart, pie chart; focuses on frequencies or relative frequencies of the levels of the variable

23
Q

Graphs (numerical/scale)

A

Dot chart (discrete variable), stem-and-leaf plot, histogram, boxplot, scatterplot

24
Q

Histogram

A

A bar chart that gives the frequencies or relative frequencies of occurrences of a scale variable in certain intervals; taken together, the heights of the bars show the distribution of the sample
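
To make the idea of counting occurrences within intervals concrete, here is a minimal text-only sketch. The heights and the bin width of 10 are assumptions made for the example.

```python
from collections import Counter

heights_cm = [151, 158, 162, 163, 167, 168, 169, 171, 174, 182]  # hypothetical data
bin_width = 10  # assumed interval width

# Map each value to the lower edge of its interval, then count per interval.
bins = Counter((h // bin_width) * bin_width for h in heights_cm)

for start in sorted(bins):
    # Each '#' is one observation, so the bar length is the frequency.
    print(f"{start}-{start + bin_width - 1}: {'#' * bins[start]}")
```

Dividing each count by the total number of observations would give a relative-frequency histogram with the same shape.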

25
Characteristics of a distribution: left-skewed
Negatively skewed; the values to the left of the center fall further away from the center than those to the right of the center; the mean is less than the median
26
Characteristics of a distribution: Right-skewed
Positively skewed; the values to the right of the center fall further away from the center than those to the left of the center; the mean is greater than the median
27
Characteristics of a distribution: symmetric
Left and right sides of the graph are roughly mirror images of each other; the center is the mean and the mean ≈ the median
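The mean/median relationships on the three cards above can be checked with a quick sketch. The three small data sets are invented so that each has an obvious shape.

```python
from statistics import mean, median

right_skewed = [1, 2, 2, 3, 3, 4, 20]        # long tail to the right
left_skewed = [-20, -4, -3, -3, -2, -2, -1]  # mirror image: long tail to the left
symmetric = [1, 2, 3, 4, 5]

for name, data in [("right-skewed", right_skewed),
                   ("left-skewed", left_skewed),
                   ("symmetric", symmetric)]:
    print(name, "mean =", mean(data), "median =", median(data))
```

The extreme value pulls the mean toward the tail, while the median stays near the bulk of the data.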
28
How to describe graphical data
Center, variation, distribution, outliers, time
29
Independent variables
When two variables are not associated/there is no evident relationship between the two
30
Inference
Making decisions and predictions based on the data
31
Inferential statistics
Are used when data are available only for a sample but we want to make a decision or prediction about the entire population (confidence intervals, significance tests)
32
Intensity map (heat map)
Colors are used to show higher and lower values of a variable
33
Multi-stage sample
Like cluster sampling, but a random sample is taken within each selected cluster rather than including the entire cluster
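The difference between a cluster sample and a multi-stage sample can be sketched as follows. The clusters, their sizes, and the sample sizes are hypothetical.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical population already grouped into clusters (e.g. villages).
clusters = {
    "A": [1, 2, 3, 4],
    "B": [5, 6, 7, 8],
    "C": [9, 10, 11, 12],
    "D": [13, 14, 15, 16],
}

# Both designs start by sampling a fixed number of clusters.
chosen = rng.sample(sorted(clusters), k=2)

# Cluster sample: include every observation from each chosen cluster.
cluster_sample = [x for name in chosen for x in clusters[name]]

# Multi-stage sample: draw a random sample within each chosen cluster instead.
multistage_sample = [x for name in chosen for x in rng.sample(clusters[name], k=2)]

print(chosen, cluster_sample, multistage_sample)
```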
34
Negatively associated
Downward trend between the two variables: as one increases, the other tends to decrease
35
Nominal variable
A categorical variable where the levels have no hierarchy; e.g. eye color, type of car
36
Non-response bias
When the nonresponse rate among those recruited for a sample is high, so it's unclear whether those who responded really represent the population
37
Numerical summaries, location (descriptive statistics)
Mean, median, quantile/percentile, quartile, mode
38
Numerical summaries, spread (descriptive statistics)
Standard deviation, sample variance, range, interquartile range, coefficient of variation
39
Numerical variable
Can take a wide range of number values, and it is sensible to add/subtract/take averages
40
Observational data
No treatment has been explicitly applied or withheld in the data collected
41
Observational study
When data are collected in a way that does not interfere with how the data arise; can provide evidence of a naturally occurring association but alone cannot show a causal connection
42
Ordinal variable
A categorical variable where the levels have a natural ordering; e.g. level of education
43
Population
Is the total set of subjects in which we are interested
44
Positively associated
Upward trend between the two variables: as one increases, the other tends to increase
45
Probability
Is the basic tool for evaluating chances and is also the key to how well inferential statistics work
46
Qualitative data in a one way table can include
Absolute frequency, relative frequency, cumulative frequency, cumulative relative frequency
47
Qualitative data in a two way table can
Indicate the relationship between two variables
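One way to see how a two-way table hints at a relationship is to compare proportions across rows. The sketch below builds such a table from invented paired observations; the variable names and categories are assumptions.

```python
from collections import Counter

# Hypothetical paired observations: (smoking status, exercise level) per subject.
observations = [
    ("smoker", "low"), ("smoker", "low"), ("smoker", "high"),
    ("non-smoker", "low"), ("non-smoker", "high"),
    ("non-smoker", "high"), ("non-smoker", "high"),
]

cell_counts = Counter(observations)                    # joint frequencies (table cells)
row_totals = Counter(row for row, _ in observations)   # frequency of each row level

# Row proportions: share of each exercise level within each smoking group.
row_props = {
    (row, col): count / row_totals[row]
    for (row, col), count in cell_counts.items()
}

for cell in sorted(row_props):
    print(cell, cell_counts[cell], round(row_props[cell], 2))
```

If the row proportions differ noticeably between groups, as they do in this made-up data, the two variables appear to be associated.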
48
Random sample reduces…
The chance of introducing biases
49
Randomization (experimental design)
Accounts for variables that can't be controlled
50
Randomized experiment
When individuals are randomly assigned to a group in an experiment
51
Relative frequency
Frequency for that category/sum of all frequencies
52
Replication (experimental design)
Can be accomplished via a sufficiently large sample, or by duplicating a study
53
Response variable
The second variable that changes based on the explanatory variable
54
Sample
The subset of the population for whom we have or plan to have data
55
Sampling methods are based in the notion of…
Implied randomness, and tend to be a good reflection of population when each subject in the population has the same chance of being included in that sample.
56
Scatterplot
Represents the bivariate relationship between two variables (usually continuous variables) by plotting a data point for each observation in the data set; useful for visualizing the relationship
57
Simple random sampling
Each case in a population has an equal chance of being included in the final sample; knowing a case is included does not provide useful info about what other cases are included (raffle-style)
58
Stratified sampling
Population is divided into strata (similar cases grouped together, e.g. by age), then a second sampling is employed within each stratum (useful when cases in a stratum are similar with respect to the studied outcome)
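A minimal sketch contrasting simple random sampling with stratified sampling. The population, the age-group strata, and the sample sizes are hypothetical.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 12 cases, each tagged with an age group (the stratum).
population = [{"id": i, "age_group": "young" if i < 6 else "old"} for i in range(12)]

# Simple random sampling: every case has the same chance of being drawn (raffle-style).
srs = rng.sample(population, k=4)

# Stratified sampling: divide into strata, then sample within each stratum.
strata = {}
for case in population:
    strata.setdefault(case["age_group"], []).append(case)
stratified = [case for group in strata.values() for case in rng.sample(group, k=2)]

print([c["id"] for c in srs])
print([(c["id"], c["age_group"]) for c in stratified])
```

The stratified sample is guaranteed to contain two cases from each age group, whereas the simple random sample may over- or under-represent a group by chance.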
59
Subjects
The entities that we measure in a study
60
Tabular methods
Table summary with frequency and/or percent frequency
61
Types of descriptive statistics
Numerical methods, tabular methods, graphical methods
62
Characteristic of data: center
A representative or average value that indicates where the middle of the data set is located
63
Characteristic of data: variation
A measure of the amount that the data values vary among themselves
64
Characteristics of data: distribution
The nature or shape of the distribution of the data
65
Characteristics of the data: outliers
Sample values that lie very far away from the vast majority of the other sample values
66
Characteristics of data: time
Changing characteristics of the data over time (is there a trend?)
67
Shape of a distribution: Modality
How many prominent peaks are apparent within the distribution
68
Shape of a distribution: unimodal
A single prominent peak in the distribution
69
Shape of a distribution: bimodal
Two prominent peaks in the distribution
70
Shape of a distribution: multimodal
Several prominent peaks in the distribution
71
Shape of a distribution: uniform
No prominent peaks, mostly smooth
72
Mean (measure of center)
A measure of center: the sum of the observations divided by the number of observations; the sample mean is denoted x̄ (an x with a bar across the top), and the population mean is denoted by the Greek letter μ (mu)
73
Sample mean (x with bar over it)
A sample statistic that serves as a point estimate of the population mean
74
Variance (measures of variability)
The average squared deviation from the mean; we use the squared deviation to get rid of negatives so that observations equally distant from the mean are weighted equally, and to weight larger deviations more heavily
75
Standard deviation (measures of variability)
The square root of the variance, and has the same units as the data
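The variance and standard deviation cards can be worked through by hand as below. The data are invented; note that the sample variance is commonly computed with n - 1 in the denominator rather than n, which is the convention Python's `statistics.variance` follows.

```python
import math
from statistics import mean, pvariance, stdev, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical measurements

x_bar = mean(data)
squared_devs = [(x - x_bar) ** 2 for x in data]  # squaring removes the negatives

pop_var = sum(squared_devs) / len(data)            # divide by n (population form)
sample_var = sum(squared_devs) / (len(data) - 1)   # divide by n - 1 (sample form)
sample_sd = math.sqrt(sample_var)                  # same units as the data

print(x_bar, pop_var, sample_var, sample_sd)
```

The hand-computed values agree with `pvariance`, `variance`, and `stdev` from the standard library.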
76
Median (measures of center)
The value that splits the data in half when ordered in ascending order; if there is an even number of observations then the median is the average of the two values in the middle; also called the 50th percentile
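The odd/even rule for the median can be sketched in a few lines. The function name `median_of` is made up for this example; the standard library's `statistics.median` does the same job.

```python
def median_of(values):
    """Return the middle value of the data once sorted in ascending order."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd count: the single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average the two middle values

print(median_of([5, 1, 3]))
print(median_of([4, 1, 3, 2]))
```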
77
IQR (measures of variability)
The range of the middle 50% of the data, between the first quartile (Q1, the 25th percentile) and the third quartile (Q3, the 75th percentile); IQR = Q3 - Q1
78
Box plot
The box represents the middle 50% of the data, the line dividing the box is the median, the upper and lower whiskers extend to the most extreme observations that are not suspected outliers, and any dots are suspected outliers
79
Box plot: Whiskers
Max upper whisker reach = Q3 + 1.5 × IQR; max lower whisker reach = Q1 - 1.5 × IQR
80
Box plot: Outliers
Defined as an observation beyond the max reach of the whiskers; helpful for identifying extreme skew in the distribution, identifying data collection/entry errors, and providing insight into interesting features of the data
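The box plot rules above (IQR, whisker reach, outliers) can be sketched as follows. The data are invented, and the quartile method is an assumption: textbooks and libraries use slightly different conventions, and `method="inclusive"` is just one of them.

```python
from statistics import quantiles

data = sorted([1, 2, 3, 4, 5, 6, 7, 8, 30])  # hypothetical sample

# Assumed quartile convention: the "inclusive" method of statistics.quantiles.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

lower_reach = q1 - 1.5 * iqr   # max lower whisker reach
upper_reach = q3 + 1.5 * iqr   # max upper whisker reach

# Observations beyond the reach are flagged as suspected outliers.
outliers = [x for x in data if x < lower_reach or x > upper_reach]

# The whiskers stop at the most extreme observations still within reach.
inside = [x for x in data if lower_reach <= x <= upper_reach]
whisker_low, whisker_high = min(inside), max(inside)

print(q1, q2, q3, iqr)
print(lower_reach, upper_reach)
print(outliers, whisker_low, whisker_high)
```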
81
Robust statistics
Median and IQR are more robust to skewness and outliers than the mean and standard deviation
82
For skewed distributions, use…
Median (center) and IQR (spread)
83
For symmetric distributions, use…
Mean (center) and standard deviation (spread)
84
Log transformation
Useful when data are extremely skewed as it can make outliers less prominent, but the results of the analysis might be difficult to interpret because the log of a measured variable often has no direct meaning
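A brief sketch of how a log transformation pulls in an extreme value. The incomes are invented, and base 10 is an arbitrary choice; any base gives the same qualitative effect.

```python
import math
from statistics import mean, median

incomes = [20_000, 25_000, 30_000, 35_000, 40_000, 1_000_000]  # hypothetical, one extreme value
logged = [math.log10(x) for x in incomes]

# On the raw scale the outlier drags the mean far above the median.
raw_ratio = mean(incomes) / median(incomes)

# On the log scale the outlier is much less prominent, so mean and median are close.
log_ratio = mean(logged) / median(logged)

print(round(raw_ratio, 2), round(log_ratio, 2))
```

The trade-off noted on the card still applies: a summary such as the mean of the logged values is not in the original units and is harder to explain.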