2: Exploratory Data Analysis: Single Variable Flashcards

1
Q

cases

A

objects described by a set of data (companies, subjects, customers)

2
Q

label

A

variable used in some data sets to distinguish different cases

3
Q

variable

A

characteristic of a case

4
Q

distribution

A

of a variable tells us what values it takes and how often it takes these values

5
Q

distribution of categorical variable

A

lists the categories and gives either the count or the percent of cases that fall in each category
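
A minimal sketch of tabulating such a distribution, assuming the cases arrive as a plain list of category labels. The function name and the `majors` data are made up for illustration.

```python
from collections import Counter

def categorical_distribution(values):
    """Return {category: (count, percent)} for a list of category labels."""
    counts = Counter(values)
    n = len(values)
    return {cat: (c, 100 * c / n) for cat, c in counts.items()}

# Hypothetical data, invented for this example.
majors = ["econ", "math", "econ", "bio", "econ", "math", "bio", "econ"]
dist = categorical_distribution(majors)
for category, (count, percent) in sorted(dist.items()):
    print(f"{category}: count={count}, percent={percent:.1f}%")
```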

6
Q

stemplot

A

stem-and-leaf plot. gives a quick picture of the distribution's shape while including the actual numerical values in the graph. separate each observation into a stem, consisting of all but the final (rightmost) digit, and a leaf, the final digit. write the stems in a vertical column with the smallest at the top and draw a vertical line at the right of this column. write each leaf in the row to the right of its stem, in increasing order out from the stem
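
The steps above can be sketched as follows, assuming non-negative whole-number observations. The function name and the sample data are made up for illustration.

```python
from collections import defaultdict

def stemplot(values):
    """Return stemplot rows for non-negative integers, smallest stem at top."""
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # stem = all but the final digit, leaf = final digit
        leaves[stem].append(leaf)   # sorted input keeps leaves in increasing order
    rows = []
    for stem in range(min(leaves), max(leaves) + 1):  # keep stems that have no leaves
        leaf_text = "".join(str(d) for d in leaves.get(stem, []))
        rows.append(f"{stem:>2} | {leaf_text}")
    return rows

# Hypothetical observations, invented for this example.
data = [23, 12, 45, 21, 30, 15, 28, 41, 23]
for row in stemplot(data):
    print(row)
```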

7
Q

histogram

A

breaks range of values of a variable into classes and displays only the count or percent of the observations that fall into each class. classes = equal width.
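
A rough sketch of the counting behind a histogram, assuming each class includes its lower edge and excludes its upper edge. The function name, class boundaries, and data are illustrative choices, not from the card.

```python
def histogram_counts(values, low, width, n_classes):
    """Count observations in n_classes equal-width classes starting at low.

    Class i covers low + i*width <= x < low + (i+1)*width.
    Observations outside the covered range are ignored.
    """
    counts = [0] * n_classes
    for x in values:
        i = int((x - low) // width)
        if 0 <= i < n_classes:
            counts[i] += 1
    return counts

# Hypothetical observations, invented for this example.
data = [3, 7, 8, 12, 13, 14, 18, 21, 22, 29]
counts = histogram_counts(data, low=0, width=10, n_classes=3)
for i, c in enumerate(counts):
    print(f"{i * 10:>2} to <{(i + 1) * 10}: {'*' * c}")
```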

8
Q

tails

A

extreme values of a distribution

9
Q

modes

A

major peaks in a distribution

10
Q

time plot

A

of a variable plots each observation against the time at which it was measured. time is on the horizontal scale of the plot and the variable being measured is on the vertical scale

11
Q

mean vs. median

A

mean is the average value.
(x1 + x2 + ... + xn) / n

median is the middle value of the ordered list.

(1) if the number of observations n is odd: the median's location is found by counting (n+1)/2 observations up from the bottom of the list
(2) if n is even: the median is the mean of the two center observations in the ordered list. the location (n+1)/2 falls halfway between those two observations
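
A small sketch of both measures, showing how one extreme value pulls the mean but not the median. The function names and the numbers are made up for illustration.

```python
def mean(xs):
    """Sum of the observations divided by how many there are."""
    return sum(xs) / len(xs)

def median(xs):
    """Middle value of the ordered list (mean of the two center values if n is even)."""
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # 1-based location (n+1)/2
    return (ordered[mid - 1] + ordered[mid]) / 2

# Hypothetical observations; 100 is a deliberate outlier.
odd_data = [2, 4, 4, 5, 100]
even_data = [1, 3, 5, 7]
print(mean(odd_data), median(odd_data))
print(mean(even_data), median(even_data))
```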

12
Q

quartile

A

upper quartile = median of the upper half of the data. lower quartile = median of lower half of the data

13
Q

pth percentile

A

the value that has p percent of the observations fall at or below it
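
One possible reading of this definition is the nearest-rank method: take the smallest observation that has at least p percent of the data at or below it. This is only one of several conventions in use, and software packages often interpolate instead. The function name and data are illustrative.

```python
import math

def percentile(xs, p):
    """Nearest-rank pth percentile: smallest value with >= p% of data at or below it."""
    ordered = sorted(xs)
    k = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(k, 1) - 1]          # p = 0 falls back to the smallest value

# Hypothetical observations, invented for this example.
scores = list(range(1, 11))
print(percentile(scores, 50), percentile(scores, 90))
```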

14
Q

five number summary

A

of a set of observations consists of the smallest observation, the first quartile, the median, the third quartile, and the largest observation, written in order from smallest to largest.

Min Q1 M Q3 Max
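
A sketch of computing the summary with quartiles taken as medians of the lower and upper halves. It assumes that when n is odd the overall median is left out of both halves; other quartile conventions give slightly different values. The function name and data are made up.

```python
from statistics import median

def five_number_summary(xs):
    """Return (Min, Q1, M, Q3, Max) using the median-of-halves rule."""
    ordered = sorted(xs)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Assumption: for odd n, skip the middle observation itself.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

# Hypothetical observations, invented for this example.
data = [7, 1, 12, 4, 15, 3, 8, 6, 10]
print(five_number_summary(data))
```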

15
Q

boxplot

A

graph of five-number summary

16
Q

interquartile range IQR

A

distance b/w first and third quartiles. IQR = Q3-Q1

17
Q

1.5 X IQR rule for outliers

A

observation = outlier if it falls MORE than 1.5 X IQR above third quartile or below first quartile
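
A minimal sketch of the rule, assuming the quartiles have already been computed and are passed in. The function name and the data are illustrative.

```python
def iqr_outliers(xs, q1, q3):
    """Return observations more than 1.5 x IQR below Q1 or above Q3."""
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Strict comparisons: a value exactly on a fence is not flagged.
    return [x for x in xs if x < low_fence or x > high_fence]

# Hypothetical observations with Q1 = 3.5 and Q3 = 11.0 (median-of-halves rule).
data = [1, 3, 4, 6, 7, 8, 10, 12, 40]
print(iqr_outliers(data, q1=3.5, q3=11.0))
```

Here IQR = 7.5, so the fences sit at 3.5 - 11.25 = -7.75 and 11.0 + 11.25 = 22.25, and only 40 lies beyond them.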

18
Q

standard deviation

A

measures spread by looking at how far the observations are from their mean.

19
Q

variance

A

s^2 of a set of observations is the average of the squares of the deviations from their mean, where the "average" divides by n-1 rather than n.

(1) Work out the mean (the simple average of the numbers)
(2) Then for each number: subtract the mean and square the result (the squared difference)
(3) Then add up those squared differences and divide by n-1
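
The three steps can be sketched as below, assuming the sample variance s^2, whose final "average" divides by n-1 (the degrees of freedom) rather than n. The function names and data are made up for illustration.

```python
import math

def sample_variance(xs):
    """Sample variance s^2: sum of squared deviations divided by n - 1."""
    n = len(xs)
    m = sum(xs) / n                             # step 1: the mean
    squared_devs = [(x - m) ** 2 for x in xs]   # step 2: squared differences
    return sum(squared_devs) / (n - 1)          # step 3: divide by n - 1

def sample_std_dev(xs):
    """Standard deviation s: square root of the variance."""
    return math.sqrt(sample_variance(xs))

# Hypothetical observations with mean 5 and squared deviations summing to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(sample_variance(data), sample_std_dev(data))
```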

20
Q

standard deviation

A

= square root of the variance

21
Q

degrees of freedom

A

the number n-1 is called the degrees of freedom of the variance or standard deviation

22
Q

properties of standard deviation

A

(1) s measures spread about the mean and should be used only when the mean is chosen as the measure of center
(2) s = 0 only when there is no spread. this happens only when all observations have the same value. Otherwise, s > 0. As the observations become more spread out about their mean, s gets larger
(3) s, like the mean, is not resistant. a few outliers can make s very large.

23
Q

Which is better for describing a skewed distribution or a distribution with strong outliers: five number summary, mean, or std deviation?

A

five number summary

24
Q

linear transformation

A

changes the original variable x into the new variable xnew given by this equation:

xnew = a + bx

they don’t change the shape of the distribution

25
effects of linear transformation
(1) multiplying each observation by a positive number b multiplies both measures of center (mean and median) and measures of spread (interquartile range and std dev) by b
(2) adding the same number a (positive or negative) to each observation adds a to measures of center and to quartiles and other percentiles, but does NOT change measures of spread
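
A quick numerical check of both effects for xnew = a + bx. The values of a and b and the data are arbitrary choices for illustration.

```python
from statistics import mean, median, stdev

def linear_transform(xs, a, b):
    """Apply xnew = a + b*x to every observation."""
    return [a + b * x for x in xs]

# Hypothetical observations and constants, invented for this example.
xs = [1, 2, 3, 4, 10]
a, b = 5, 2
new = linear_transform(xs, a, b)

print(mean(xs), mean(new))      # center: multiplied by b, then shifted by a
print(median(xs), median(new))
print(stdev(xs), stdev(new))    # spread: multiplied by b, unaffected by a
```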
26
density curve
overall pattern of a distribution. has a total area of 1 underneath it.
27
normal distributions
are described by bell-shaped, symmetric, unimodal density curves. the mean μ and std dev σ completely specify the Normal distribution.
28
Mean vs. std dev in normal distribution
mean = center of symmetry; std dev = distance from the mean to the change-of-curvature points on either side
29
68-95-99.7 rule
In the Normal distribution with mean μ and std dev σ:
approx 68% of the observations fall within 1 std dev of the mean
approx 95% of the observations fall within 2 x (std dev) of the mean
approx 99.7% of the observations fall within 3 x (std dev) of the mean
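
The three percentages can be checked against the Normal cumulative distribution function. This sketch uses `statistics.NormalDist` from the Python standard library (3.8+); the helper name `share_within` is made up.

```python
from statistics import NormalDist

def share_within(k, mu=0.0, sigma=1.0):
    """Proportion of a Normal(mu, sigma) distribution within k std devs of the mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2, 3):
    print(f"within {k} std dev: {share_within(k):.4f}")
```

The result does not depend on the particular mean and std dev chosen, which is why the rule applies to every Normal distribution.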
30
z-score
standardized value: subtract the mean of the distribution and then divide by the std dev
z = (x - μ) / σ
tells us how many standard deviations the original observation falls away from the mean (and in which direction)
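
A one-line sketch of standardizing, with made-up numbers (mean 100, std dev 15) chosen only for illustration.

```python
def z_score(x, mu, sigma):
    """Number of std devs that x lies from the mean; the sign gives the direction."""
    return (x - mu) / sigma

# Hypothetical distribution: mean 100, std dev 15.
print(z_score(130, mu=100, sigma=15))  # two std devs above the mean
print(z_score(85, mu=100, sigma=15))   # one std dev below the mean
```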
31
frequency distribution table
1. frequency (f): number of times we observe an event (the raw frequency)
2. relative frequency (f/n): # of times the event takes place / total events
3. cumulative freq: running count of the frequencies of a particular value and all preceding values (sum of raw freqs)
4. cum. relative freq: cumulative freq for a particular value in relation to the total (sum of rel freqs)
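
A sketch that builds all four columns for a small data set, assuming the values can be sorted so that "preceding values" is well defined. The function name and data are made up for illustration.

```python
from collections import Counter

def frequency_table(values):
    """Return rows of (value, f, f/n, cumulative f, cumulative f/n), in value order."""
    counts = Counter(values)
    n = len(values)
    rows = []
    cum_f = 0
    for value in sorted(counts):
        f = counts[value]
        cum_f += f  # running count of this value and all preceding values
        rows.append((value, f, f / n, cum_f, cum_f / n))
    return rows

# Hypothetical observations, invented for this example.
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]
rows = frequency_table(data)
for value, f, rel, cum, cum_rel in rows:
    print(f"{value}: f={f} rel={rel:.1f} cum={cum} cum_rel={cum_rel:.1f}")
```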
32
measures of central tendency for cat. variables
median (if cat variables can be ranked) | mode
33
measures of central tendency for quant variables
mean | median if lots of outliers
34
calculate median
1. order data from low to high
2. look at location (n+1)/2
3. if the location is 5.5, then average the 5th and 6th values
35
median provides a ______ reasonable measure of central tendency when distributions are skewed or have outliers
median provides a MORE reasonable measure of central tendency when distributions are skewed or have outliers
36
mean is _____ sensitive to outliers
mean is sensitive to outliers
37
if distribution is exactly symmetric, then mean and median
are the same
38
IQR as a measure of spread is ____ useful to describe skewed distributions
NOT: the 2 "sides" of a skewed distribution have different spreads
39
standard deviation is _____ a good measure when the distribution is highly skewed
NOT