CRP 109 Stats Lecture 1 Flashcards

1
Q

Data definition

A

Collections of observations, such as measurements or survey
responses

2
Q

Statistics definition

A

The science of planning studies and experiments; obtaining data; and organizing, summarizing, presenting, analyzing, and interpreting those data and then drawing conclusions
based on them

3
Q

Population definition

A

The complete collection of all measurements or data that are
being considered. Typically, it is the complete collection of data that we would like to make inferences about

4
Q

Census definition

A

The collection of data from every member of the population

5
Q

Sample definition

A

The subcollection of members selected from a population

6
Q

Variable definition

A

A characteristic that varies (changes) across individuals in a
population. The values (observations) recorded collectively
make up the data

7
Q

Parameter definition

A

A numerical measurement describing some characteristic of a
population

8
Q

Statistic definition

A

A numerical measurement describing some characteristic of a
sample

9
Q

Discrete data

A

Result when the data values are quantitative and the number of
possible values is finite or countable (such as counts: 0, 1, 2, ...)

10
Q

Continuous data

A

Result from infinitely many possible quantitative values, where the
values are not countable. They can be measured, but not counted.

11
Q

Missing completely at random

A

The likelihood of the data value being
missing is independent of its value or any of the other values in
the data set (any data value is just as likely to be missing as
any other data value)

12
Q

Missing not at random

A

The missing value is related to the reason that it is missing. Ignoring such missing values can bias the remaining
values, and the results may then be misleading

13
Q

Simple random sample (SRS)

A

A sample of n subjects selected in such a way that every possible sample of the same size n has the
same chance of being chosen
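
A minimal sketch of drawing a simple random sample in Python, assuming a hypothetical population of 20 ID numbers and a sample size of n = 5 (neither comes from the lecture). `random.sample` draws without replacement, so every subset of size n is equally likely; the fixed seed is only there to make the run repeatable.

```python
import random

# Hypothetical population: ID numbers 1 through 20 (illustrative only).
population = list(range(1, 21))
n = 5

rng = random.Random(0)            # fixed seed so the sketch is repeatable
srs = rng.sample(population, n)   # n distinct members; all subsets of size n equally likely
print(srs)
```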

14
Q

Designed Experiment

A

We apply some treatment and then proceed to observe its effects on the individuals. The individuals in
designed experiments are called experimental units, and they
are often called subjects when they are people

15
Q

Observational Study

A

We observe and measure specific characteristics, but
we do not attempt to modify the individuals being studied

16
Q

Random Sampling Error

A

Occurs when the sample has been selected with a
random method, but there is a discrepancy between a sample
result and the true population result; such an error results from
chance sample fluctuations

17
Q

Non-Sampling Error

A

The result of human error, including such factors as wrong data entries, computing errors, questions with biased wording, false data provided by respondents, forming biased
conclusions, or applying statistical methods that are not
appropriate for the circumstances

18
Q

Non-Random Sampling Error

A

The result of using a sampling method that is not random, such as using a convenience sample or a
voluntary response sample

19
Q

Frequency (of a class)

A

The number of original values that fall into that class.

20
Q

Frequency Distribution/Table

A

Shows how data are partitioned among
several categories/classes by listing the categories along with
the number (frequency) of data values in each of them.
Used to summarize large data sets, see the distribution, identify outliers, and provide a basis for constructing
graphs.

21
Q

Lower Class Limits

A

The smallest numbers that can belong to each of the
different classes

22
Q

Upper Class Limits

A

The largest numbers that can belong to each of the
different classes

23
Q

Class Boundaries

A

The numbers used to separate the classes, but without
the gaps created by class limits.

24
Q

Class Midpoints

A

The values in the middle of the classes. Each class midpoint
is computed by taking the average of the lower and upper class
limits
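
As a rough illustration of how class limits, class boundaries, and class midpoints relate, the sketch below uses three hypothetical classes (10-19, 20-29, 30-39) that are not taken from the lecture. It assumes the gap between consecutive classes is the same throughout.

```python
# Hypothetical classes given as (lower limit, upper limit) pairs.
classes = [(10, 19), (20, 29), (30, 39)]

lower_limits = [lo for lo, hi in classes]
upper_limits = [hi for lo, hi in classes]

# Midpoint of each class: average of its lower and upper class limits.
midpoints = [(lo + hi) / 2 for lo, hi in classes]

# Boundaries split the gap between one class's upper limit
# and the next class's lower limit (here 20 - 19 = 1).
gap = classes[1][0] - classes[0][1]
boundaries = [classes[0][0] - gap / 2] + [hi + gap / 2 for lo, hi in classes]

print(midpoints)    # [14.5, 24.5, 34.5]
print(boundaries)   # [9.5, 19.5, 29.5, 39.5]
```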

25
Class Width
The difference between two consecutive lower class limits (or two consecutive lower class boundaries) in a frequency distribution. When choosing classes, class width ≈ (max value − min value) / number of classes, rounded up to a convenient number.
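
A small sketch of choosing a class width, assuming ten hypothetical data values and five classes. Rounding up to the next whole number is one common convention and is an assumption here; in practice a more convenient number may be chosen.

```python
import math

values = [12, 15, 21, 22, 28, 30, 33, 35, 41, 48]  # hypothetical data
num_classes = 5

raw_width = (max(values) - min(values)) / num_classes   # (48 - 12) / 5
class_width = math.ceil(raw_width)                      # rounded up (assumed convention)

# Lower class limits start at the minimum and step by the class width.
lower_limits = [min(values) + i * class_width for i in range(num_classes)]

print(class_width)    # 8
print(lower_limits)   # [12, 20, 28, 36, 44]
```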
26
Relative (or Percentage) Frequency Distribution
relative frequency = (frequency of class) / (sum of all frequencies). Multiply by 100 to get the percentage frequency.
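
The calculation can be sketched as follows, assuming a hypothetical frequency table with four classes (the class labels and counts are made up for illustration).

```python
# Hypothetical frequency table: class label -> frequency.
frequencies = {"10-19": 2, "20-29": 3, "30-39": 4, "40-49": 1}

total = sum(frequencies.values())

# Relative frequency as a proportion, and as a percentage.
relative = {label: f / total for label, f in frequencies.items()}
percentage = {label: 100 * f / total for label, f in frequencies.items()}

for label in frequencies:
    print(label, relative[label], f"{percentage[label]:.0f}%")
```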
27
Cumulative Frequency Distribution
The frequency for each class is the sum of the frequencies for that class and all previous classes. Class limits are replaced by “less than” expressions that describe the new ranges of values.
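
A brief sketch of building a cumulative frequency distribution from class frequencies. The frequencies and the "less than" cut-offs below are hypothetical values chosen for illustration.

```python
from itertools import accumulate

# Hypothetical classes 10-19, 20-29, 30-39, 40-49 and their frequencies.
less_than = [20, 30, 40, 50]      # cut-offs for the "less than" expressions
frequencies = [2, 3, 4, 1]

# Running total: each entry adds the current class to all previous classes.
cumulative = list(accumulate(frequencies))

for cutoff, count in zip(less_than, cumulative):
    print(f"Less than {cutoff}: {count}")
```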
28
Histogram
A graph consisting of bars of equal width drawn adjacent to each other (unless there are gaps in the data). The horizontal scale represents classes of quantitative data values, and the vertical scale represents frequencies. The heights of the bars correspond to frequency values.
29
Relative Frequency Histogram
A graph that has the same shape and horizontal scale as a histogram, but the vertical scale uses relative frequencies instead of actual frequencies (i.e. proportion or percent)
30
Correlation
Exists between two variables when the values of one variable are somehow associated with the values of the other variable. Correlation does not imply causation.
31
Linear Correlation
Exists between two variables when there is a correlation and the plotted points of paired data result in a pattern that can be approximated by a straight line
32
Linear Correlation Coefficient, r
Measures the strength of the linear correlation between the paired quantitative x and y values in a sample. It is sometimes referred to as the Pearson product moment correlation coefficient
33
Scatterplot
A plot of paired (x, y) quantitative data with a horizontal x-axis and a vertical y-axis
34
Properties of r
- The value of r is always between −1 and 1, inclusive.
- If r is close to −1, there appears to be a strong negative correlation.
- If r is close to 1, there appears to be a strong positive correlation.
- If r is close to 0, there appears to be a weak or no linear correlation.
- A value of exactly −1 or 1 implies that all of the data fall exactly on a line (perfect correlation).
- If all values of either variable are converted to a different scale, the value of r does not change.
- If all x values and y values are interchanged, the value of r does not change.
- r is not designed to measure the strength of a relationship that is not linear.
- r is sensitive to outliers.
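
Several of these properties can be checked numerically. The sketch below computes r from the usual sums of deviations; the function name `pearson_r` and the paired data are hypothetical, and the rescaling shown is a unit change of the kind the card describes (multiplying by a positive constant and shifting).

```python
import math

def pearson_r(xs, ys):
    """Linear correlation coefficient r for paired sample values xs, ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]          # hypothetical paired data
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
r_swapped = pearson_r(y, x)                          # interchange x and y
r_rescaled = pearson_r([10 * v + 3 for v in x], y)   # change the scale of x
r_perfect = pearson_r(x, [2 * v + 1 for v in x])     # points exactly on a line

print(round(r, 4), round(r_swapped, 4), round(r_rescaled, 4), round(r_perfect, 4))
```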
35
Regression
Given a collection of paired sample data, the regression line or line of best fit or least-squares line is the straight line that “best” fits the scatterplot of the data
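
The card does not give the formulas, so the sketch below uses the standard least-squares ones: slope = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and intercept = ȳ − slope · x̄. The function name and the paired data are hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least-squares line for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

x = [1, 2, 3, 4, 5]          # hypothetical paired data
y = [2, 4, 5, 4, 5]

b0, b1 = least_squares_line(x, y)
print(f"y-hat = {b0:.2f} + {b1:.2f}x")   # y-hat = 2.20 + 0.60x
```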
36
Descriptive Statistics
Methods and tools that summarize or describe relevant characteristics of data
37
Inferential Statistics
Methods and tools that make inferences, or generalizations, about populations
38
Mean
The measure of centre found by adding all of the data values and dividing the total by the number of data values. Not resistant to outliers.
39
Median
The measure of centre that is the middle value when the original data values are arranged in order of increasing (or decreasing) magnitude. Resistant to outliers (an extreme value changes it only slightly).
40
Mode
-The value(s) that occurs with the greatest frequency. -The mode can be found with qualitative data. -A data set can have no mode, one mode (unimodal), or multiple modes
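
To see how the three measures of centre react to an outlier, here is a short sketch using Python's standard `statistics` module. The data values are hypothetical, and `statistics.multimode` requires Python 3.8 or later.

```python
import statistics

data = [2, 3, 3, 5, 7]             # hypothetical values
with_outlier = data + [100]        # same data plus one extreme value

mean_before = statistics.mean(data)
mean_after = statistics.mean(with_outlier)        # pulled strongly toward 100
median_before = statistics.median(data)
median_after = statistics.median(with_outlier)    # moves only slightly

modes = statistics.multimode(data)                                  # one mode
colour_modes = statistics.multimode(["red", "blue", "red", "blue", "green"])  # qualitative, two modes

print(mean_before, mean_after)
print(median_before, median_after)
print(modes, colour_modes)
```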
41
s
sample standard deviation
42
s²
sample variance
43
σ
population standard deviation
44
σ²
population variance
45
Range
- The difference between the maximum data value and the minimum data value.
- Very sensitive to outliers (not resistant).
- Does not truly reflect the variation among all of the data values.
46
Standard Deviation of a Sample (s)
A measure of how much data values deviate from the mean.
- The value is never negative. It is zero only when all of the data values are exactly the same.
- Larger values indicate greater amounts of variation.
- Not resistant to outliers.
- The units are the same as the units of the original data values.
47
Variance
- A measure of variation equal to the square of the standard deviation.
- The units are the squares of the units of the original data values.
- Not resistant to outliers.
- The value is never negative. It is zero only when all of the data values are the same number.
- s² is an unbiased estimator of σ².
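
The cards name s, s², σ, and σ² without giving formulas. The sketch below uses the standard definitions (dividing by n − 1 for a sample and by n for a population) on a hypothetical data set, and cross-checks the results against Python's `statistics` module.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]    # hypothetical values
n = len(data)
mean = sum(data) / n

# Sum of squared deviations from the mean.
squared_devs = sum((x - mean) ** 2 for x in data)

sample_variance = squared_devs / (n - 1)          # s squared: divide by n - 1
sample_sd = math.sqrt(sample_variance)            # s
population_variance = squared_devs / n            # sigma squared, if data were a whole population
population_sd = math.sqrt(population_variance)    # sigma

# Cross-check against the standard library.
library_sample_variance = statistics.variance(data)
library_population_sd = statistics.pstdev(data)

print(round(sample_variance, 3), round(sample_sd, 3))
print(population_variance, population_sd)
```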
48
Chebychev’s Rule
for any data set: -at least 75% of data lies within 2 standard deviations of the mean. -at least 89% of data lies within 3 standard deviations of the mean
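
The 75% and 89% figures are the k = 2 and k = 3 cases of the general bound 1 − 1/k² (for k > 1), which the card does not state explicitly. A quick sketch, with a hypothetical skewed data set used to check the k = 2 case:

```python
import statistics

def chebyshev_lower_bound(k):
    """Minimum proportion of data within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2

bound_2 = chebyshev_lower_bound(2)   # 0.75
bound_3 = chebyshev_lower_bound(3)   # about 0.889

# Check on a hypothetical, clearly non-bell-shaped data set.
data = [1, 1, 2, 2, 3, 3, 4, 20]
mean = statistics.mean(data)
sd = statistics.pstdev(data)         # standard deviation of this data set
within_2 = sum(1 for x in data if abs(x - mean) <= 2 * sd) / len(data)

print(bound_2, round(bound_3, 3), within_2)
```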
49
Empirical Rule
The empirical rule states that for bell-shaped data sets, -approximately 68% of data lies within 1 standard deviation of the mean. -approximately 95% of data lies within 2 standard deviations of the mean. -approximately 99.7% of data lies within 3 standard deviations of the mean
50
Percentiles
Measures of location, denoted P1, P2, . . . , P99, which divide a set of data into 100 groups with about 1% of the values in each group -The 50th percentile, P50, has about 50% of the data values below it and about 50% of the data values above it, corresponding to the median
51
Finding the Percentile of a Data Value
percentile of value x = (number of values less than x) / (total number of values) *100
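
A direct translation of this formula, using a hypothetical function name and data set:

```python
def percentile_of_value(values, x):
    """Percentile of x: share of values strictly less than x, times 100."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)

data = [12, 15, 21, 22, 28, 30, 33, 35, 41, 48]   # hypothetical data

p = percentile_of_value(data, 30)   # 5 of the 10 values are less than 30
print(p)                            # 50.0
```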
52
k
percentile being used
53
L
locator that gives the position of a value in a sorted list
54
Pk
kth percentile
55
Converting a Percentile to a Data Value
1. Arrange the values from lowest to highest.
2. Compute L = (k/100) × n, where n is the number of values.
3. If L is a whole number, the kth percentile is midway between the Lth value and the next value in the sorted data: Pk = (Lth value + next value) / 2.
4. If L is not a whole number, round L up to the next whole number; Pk is the Lth value, counting from the lowest.
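
The procedure can be sketched in Python as below. The function name and data are hypothetical, and k is assumed to be a whole number from 1 to 99; integer arithmetic is used for the locator so that the "is L a whole number?" test is not affected by floating-point rounding.

```python
def percentile_value(values, k):
    """kth percentile (integer k, 1..99) using the locator L = (k/100) * n."""
    data = sorted(values)                    # step 1: lowest to highest
    n = len(data)
    whole, remainder = divmod(k * n, 100)    # step 2: L = whole + remainder/100
    if remainder == 0:
        # Step 3: L is whole, so average the Lth value and the next value.
        return (data[whole - 1] + data[whole]) / 2
    # Step 4: round L up to whole + 1; the Lth value sits at index whole.
    return data[whole]

data = [12, 15, 21, 22, 28, 30, 33, 35, 41, 48]   # hypothetical data, n = 10

p25 = percentile_value(data, 25)   # L = 2.5, round up to 3, third value
p50 = percentile_value(data, 50)   # L = 5, average of 5th and 6th values
print(p25, p50)                    # 21 29.0
```

Other textbooks and software packages use different percentile conventions, so results from library functions may not match this procedure exactly.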
56
Quartiles
Measures of location, denoted Q1, Q2, and Q3, which divide a set of data into four groups with about 25% of the values in each group: Q1 = P25, Q2 = P50, Q3 = P75.
57
Interquartile range (IQR)
IQR = Q3 − Q1. A measure of spread that is less sensitive to outliers than the range or standard deviation.
58
5-Number Summary
For a set of data, consists of these five values: 1. Minimum 2. First quartile, Q1 3. Second quartile, Q2 (same as the median) 4. Third quartile, Q3 5. Maximum
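
A sketch of the 5-number summary and IQR, assuming the quartiles are found with the locator rule L = (k/100) × n for k = 25, 50, 75. The helper name and data are hypothetical; statistical software often uses other quartile conventions and may give slightly different values.

```python
def percentile_value(sorted_values, k):
    """kth percentile (integer k) of already sorted data via L = (k/100) * n."""
    n = len(sorted_values)
    whole, remainder = divmod(k * n, 100)
    if remainder == 0:
        return (sorted_values[whole - 1] + sorted_values[whole]) / 2
    return sorted_values[whole]   # L rounded up is whole + 1, i.e. index whole

data = sorted([12, 15, 21, 22, 28, 30, 33, 35, 41, 48])   # hypothetical data

q1, q2, q3 = (percentile_value(data, k) for k in (25, 50, 75))
five_number_summary = (data[0], q1, q2, q3, data[-1])
iqr = q3 - q1

print(five_number_summary)   # (12, 21, 29.0, 35, 48)
print(iqr)                   # 14
```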
59
Constructing a Boxplot
Can be used to identify skewness.
1. Find the 5-number summary.
2. Construct a line segment extending from the minimum data value to the maximum data value.
3. Construct a box (rectangle) extending from Q1 to Q3, and draw a line in the box at the median.
60
Identifying Outliers for Modified Boxplots
1. Find the quartiles.
2. Find the IQR.
3. Evaluate 1.5×IQR.
4. In a modified boxplot, a data value is an outlier if it is above Q3 by an amount greater than 1.5×IQR, or below Q1 by an amount greater than 1.5×IQR.
- A special symbol (such as an asterisk or point) is used to identify outliers as defined above.
- The solid horizontal line extends only as far as the minimum data value that is not an outlier and the maximum data value that is not an outlier.
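
The outlier rule can be sketched as follows, assuming the quartiles have already been found (step 1). The function name, the data, and the quartile values Q1 = 21 and Q3 = 35 are hypothetical.

```python
def find_outliers(values, q1, q3):
    """Return (outliers, whisker_ends) using the 1.5 x IQR rule."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    kept = [v for v in values if lower_fence <= v <= upper_fence]
    # The line (whiskers) runs from the smallest to the largest non-outlier.
    return outliers, (min(kept), max(kept))

data = [12, 15, 21, 22, 28, 30, 33, 35, 41, 95]   # hypothetical data

outliers, whiskers = find_outliers(data, q1=21, q3=35)
print(outliers)   # [95]
print(whiskers)   # (12, 41)
```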