Module 1: Introduction to Data Flashcards

Question

Characteristics of a distribution: left-skewed

Answer 1

Negatively skewed; the values to the left of the center fall further away from the center than those to the right of the center; the mean is less than the median

Answer 2

Positively skewed; the values to the right of the center fall further away from the center than those to the left of the center; the mean is greater than the median

Answer 3

Left and right sides of the graph are roughly mirror images of each other; the center is the mean and the mean ~ the median
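
The mean/median relationships on the three shape cards above can be checked with a small sketch. The three data sets are hypothetical values invented only to illustrate each shape.

```python
from statistics import mean, median

# Hypothetical data sets chosen only to illustrate each shape.
left_skewed = [1, 7, 8, 8, 9, 9, 10]    # long tail toward small values
right_skewed = [1, 2, 2, 3, 3, 4, 10]   # long tail toward large values
symmetric = [1, 2, 3, 4, 5, 6, 7]       # mirror image around 4

def center_summary(values):
    """Return (mean, median) so the two measures of center can be compared."""
    return mean(values), median(values)

for name, values in [("left-skewed", left_skewed),
                     ("right-skewed", right_skewed),
                     ("symmetric", symmetric)]:
    m, med = center_summary(values)
    print(f"{name}: mean={m:.2f}, median={med}")
```

In these examples the long tail pulls the mean toward it, while the median stays near the bulk of the data.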

Answer 4

Center, variation, distribution, outliers, time

Answer 5

When two variables are not associated/there is no evident relationship between the two

Answer 6

Making decisions and predictions based on the data

Answer 7

Are used when data are available only for a sample but we want to make a decision or prediction about the entire population (confidence intervals, significance tests)

Answer 8

Colors are used to show higher and lower values of a variable

Answer 9

Clustering, but sample within each cluster rather than the entire cluster

Answer 10

Downward trend between the two poles of the variables

Answer 11

A categorical variable where the levels have no hierarchy; e.g. eye color, type of car

Answer 12

When the nonresponse rate in a sample's recruitment is high, so it's unclear whether those who responded really represent the population

Answer 13

Mean, median, quantile/percentile, quartile, mode

Answer 14

Standard deviation, sample variance, range, interquartile range, coefficient of variation

Answer 15

Can take a wide range of number values, and it is sensible to add/subtract/take averages

Answer 16

No treatment has been explicitly applied/withheld with regard to the data collected

Answer 17

When data is collected in a way that does not interfere with how the data arise; can provide evidence of a naturally occurring association but alone cannot show a causal connection

Answer 18

A categorical variable where the levels have a natural ordering; e.g. level of education

Answer 19

Is the total set of subjects in which we are interested

Answer 20

Upward trend between the two poles of the variables

Answer 21

Is the basic tool for evaluating chances and is also the key to how well inferential statistics work

Answer 22

Absolute frequency, relative frequency, cumulative frequency, cumulative relative frequency

Answer 23

Indicate the relationship between two variables

Answer 24

The chance of introducing biases

Answer 25

Accounts for variables that can't be controlled

Answer 26

When individuals are randomly assigned to a group in an experiment
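
Random assignment as described above can be sketched by shuffling the subjects and splitting them into two groups. The function name `randomize` and the fixed seed are assumptions made for illustration; a real experiment might use more groups or blocking.

```python
import random

def randomize(subjects, seed=0):
    """Randomly split subjects into a treatment group and a control group."""
    rng = random.Random(seed)      # fixed seed only so the sketch is repeatable
    shuffled = list(subjects)      # copy so the caller's data is not modified
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical subject IDs 0..9.
treatment, control = randomize(range(10))
print("treatment:", treatment)
print("control:", control)
```

Every subject lands in exactly one group, and chance alone decides which one.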

Answer 27

Frequency for that category/sum of all frequencies
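
The ratio above, together with the other frequency types listed on an earlier card, can be sketched as follows. The quiz scores are hypothetical data.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical quiz scores; any ordered set of categories would work.
scores = [1, 2, 2, 3, 3, 3, 4, 4]

counts = Counter(scores)
levels = sorted(counts)                          # categories in ascending order
absolute = [counts[v] for v in levels]           # absolute frequency per category
total = sum(absolute)                            # sum of all frequencies
relative = [f / total for f in absolute]         # frequency / sum of frequencies
cumulative = list(accumulate(absolute))          # running total of frequencies
cumulative_relative = [c / total for c in cumulative]

for row in zip(levels, absolute, relative, cumulative, cumulative_relative):
    print(row)
```

The relative frequencies sum to 1, so the last cumulative relative frequency is always 1.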

Answer 28

Can be accomplished via a sufficiently large sample, or by duplicating a study

Answer 29

The second variable that changes based on the explanatory variable

Answer 30

The subset of the population for whom we have or plan to have data

Answer 31

Imply randomness, and tend to be a good reflection of the population when each subject in the population has the same chance of being included in the sample

Answer 32

Represents the bivariate relationship between two variables (usually continuous variables) by plotting a data point for each observation in the data set; useful for visualizing the relationship

Answer 33

Each case in a population has an equal chance of being included in the final sample; knowing a case is included does not provide useful info about what other cases are included (raffle-style)
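
A raffle-style draw like the one described above can be sketched with the standard library. The population of 100 numbered cases and the sample size of 10 are assumptions chosen for illustration.

```python
import random

# Hypothetical population: case IDs 1..100.
population = list(range(1, 101))

rng = random.Random(42)               # fixed seed only so the sketch is repeatable
sample = rng.sample(population, 10)   # draw 10 cases without replacement

print(sorted(sample))
```

`random.sample` draws without replacement, so no case can appear twice and every case has the same chance of being picked.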

Answer 34

Population is divided into strata (similar cases grouped together, like by age), then a second sampling is employed w/in each stratum (useful when cases in a stratum are similar with respect to the studied outcome)
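
The two steps above (group into strata, then sample within each stratum) can be sketched as follows. The helper names `stratified_sample` and `age_group`, the age cut-offs, and the records are all hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(cases, stratum_of, per_stratum, seed=0):
    """Group cases into strata, then draw a simple random sample in each."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    strata = defaultdict(list)
    for case in cases:
        strata[stratum_of(case)].append(case)
    # Sample within every stratum; take the whole stratum if it is small.
    return {
        name: rng.sample(members, min(per_stratum, len(members)))
        for name, members in strata.items()
    }

# Hypothetical (id, age) records, stratified by age group.
ages = [19, 23, 28, 35, 41, 44, 52, 58, 67, 70]
people = [(i, age) for i, age in enumerate(ages)]

def age_group(person):
    """Assign a person to an age stratum (cut-offs are arbitrary)."""
    age = person[1]
    if age < 30:
        return "under 30"
    if age < 60:
        return "30 to 59"
    return "60 and over"

picked = stratified_sample(people, age_group, per_stratum=2)
for name, members in picked.items():
    print(name, members)
```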

Answer 35

The entities that we measure in a study

Answer 36

Table summary with frequency and/or percent frequency

Answer 37

Numerical methods, tabular methods, graphical methods

Answer 38

A representative or average value that indicates where the middle of the data set is located

Answer 39

A measure of the amount that the data values vary among themselves

Answer 40

The nature or shape of the distribution of the data

Answer 41

Sample values that lie very far away from the vast majority of the other sample values

Answer 42

Changing characteristics of the data over time (is there a trend?)

Answer 43

How many prominent peaks are apparent within the distribution

Answer 44

A single prominent peak in the distribution

Answer 45

Two prominent peaks in the distribution

Answer 46

Several prominent peaks in the distribution

Answer 47

No prominent peaks, mostly smooth

Answer 48

A measure of center; the sample mean is denoted x̄ (an x with a bar across the top), and the population mean is denoted by the Greek letter μ (mu, the little u with a tail)

Answer 49

A sample statistic that serves as a point estimate of the population mean

Answer 50

The average squared deviation from the mean; we use the squared deviation to get rid of negatives so that observations equally distant from the mean are weighted equally, and to weight larger deviations more heavily

Answer 51

The square root of the variance, and has the same units as the data
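
The variance and standard deviation cards above can be worked through by hand and compared with the standard library. This sketch assumes the sample version of the variance, which divides by n - 1 rather than n; the data are hypothetical.

```python
import math
from statistics import mean, stdev, variance

# Hypothetical measurements.
data = [2, 4, 4, 4, 5, 5, 7, 9]

x_bar = mean(data)                                   # sample mean
squared_deviations = [(x - x_bar) ** 2 for x in data]
# Sample variance: sum of squared deviations divided by n - 1 (assumed here).
sample_variance = sum(squared_deviations) / (len(data) - 1)
# Standard deviation: square root of the variance, in the data's own units.
sample_sd = math.sqrt(sample_variance)

print(sample_variance, variance(data))
print(sample_sd, stdev(data))
```

Squaring makes deviations of -3 and +3 contribute equally, and a deviation of 4 contributes 16 while a deviation of 1 contributes only 1.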

Answer 52

The value that splits the data in half when ordered in ascending order; if there is an even number of observations then the median is the average of the two values in the middle; also called the 50th percentile
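
The odd and even cases above can be sketched directly. The function name `median_of` is hypothetical; the standard library's `statistics.median` behaves the same way.

```python
def median_of(values):
    """Return the middle value of the data after sorting in ascending order."""
    if not values:
        raise ValueError("median of empty data is undefined")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of observations: the single middle value.
        return ordered[mid]
    # Even number of observations: average of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median_of([5, 1, 3]))
print(median_of([4, 1, 3, 2]))
```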

Answer 53

The middle 50% of the data included between the first quartile (25th percentile) and the third quartile (75th percentile); IQR = Q3 - Q1
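
IQR = Q3 - Q1 can be computed as in the sketch below. Textbooks and software use several slightly different quartile conventions; the "inclusive" method of `statistics.quantiles` is one assumed choice here, and the data are hypothetical.

```python
from statistics import quantiles

# Hypothetical data set.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 splits the data into quarters and returns the three cut points.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print("Q1:", q1, "median:", q2, "Q3:", q3, "IQR:", iqr)
```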

Answer 54

The box represents the middle 50% of the data, the line dividing the box is the median, the upper and lower whiskers extend to the most extreme observations within the maximum whisker reach, and any dots are suspected outliers

Answer 55

Max upper whisker reach = Q3 + 1.5 x IQR; max lower whisker reach = Q1 - 1.5 x IQR
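
The whisker limits above can be turned into a simple outlier check. The function names are hypothetical, and the quartiles use the "inclusive" method of `statistics.quantiles`, which is only one of several conventions.

```python
from statistics import quantiles

def whisker_reach(values):
    """Return (max lower reach, max upper reach) using the 1.5 x IQR rule."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def suspected_outliers(values):
    """Return the observations that fall beyond the maximum whisker reach."""
    lower, upper = whisker_reach(values)
    return [v for v in values if v < lower or v > upper]

# Hypothetical data with one unusually large value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 30]
print(whisker_reach(data))
print(suspected_outliers(data))
```

Here Q1 = 3 and Q3 = 7, so IQR = 4 and the reach runs from 3 - 6 = -3 to 7 + 6 = 13; only the value 30 falls outside it.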

Answer 56

Defined as an observation beyond the max reach of the whiskers; helpful for identifying extreme skew in the distribution, identifying data collection/entry errors, and providing insight into interesting features of the data

Answer 57

Median and IQR are more robust to skewness and outliers

Answer 58

Median (center) and IQR (spread)

Answer 59

Mean (center) and standard deviation (spread)

Answer 60

Useful when data is extremely skewed as it can make outliers less prominent, but the results of the analysis might be difficult to interpret because the log of a measured variable is usually meaningless
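
The effect described above can be seen in a small sketch. The incomes are hypothetical, and base 10 is an arbitrary choice of logarithm; the transformation only works for strictly positive values.

```python
import math
from statistics import mean, median

# Hypothetical right-skewed incomes with one extreme value.
incomes = [20_000, 25_000, 30_000, 35_000, 40_000, 1_000_000]

# Log transformation (base 10 assumed); requires every value to be > 0.
logged = [math.log10(x) for x in incomes]

print("raw:    mean", mean(incomes), "median", median(incomes))
print("logged: mean", round(mean(logged), 2), "median", round(median(logged), 2))
```

On the raw scale the largest income is 50 times the smallest; on the log scale the largest value is less than 1.5 times the smallest, so the extreme observation is far less prominent. The logged values are no longer in dollars, which is the interpretation cost the card mentions.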

(84 cards)