Ch.14, Descriptive Statistics Flashcards

1
Q

Define descriptive stats

A

Describe data in ways that give us a better idea of their characteristics; a number that summarizes a set of data
NOT a correlation statistic (correlations are inferential)

2
Q

What is the simplest measure of dispersion?

A

Range: maximum minus minimum
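
A minimal sketch of the range calculation in Python. The scores are made-up illustration values, not data from the chapter.

```python
# Hypothetical scores, invented for illustration only.
scores = [4, 8, 15, 16, 23, 42]

# Range = maximum - minimum
data_range = max(scores) - min(scores)
print(data_range)  # 38
```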

3
Q

What are data matrices?

A

Putting data into a grid: a matrix
Opportunity to examine all data in one place

4
Q

Histograms

A

Graphical display of values where each bar indicates the frequency of a value or range of values
LIMITATIONS: The more “accessible” a data set is, the less information/less complexity you’re conveying
Advantages: identifies mode, helps to identify potential outliers

5
Q

Binning

A

Binning in data mining is a data preprocessing technique that involves grouping data into smaller, more manageable categories or bins. It can be used for both numerical and categorical data and can help improve the efficiency and accuracy of data analysis.
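
A rough sketch of binning numerical data and printing the bin counts as a text histogram. The ages and the bin width of 10 are assumptions chosen for illustration; they do not come from the chapter.

```python
from collections import Counter

# Hypothetical ages, invented for illustration only.
ages = [18, 19, 21, 22, 22, 25, 31, 34, 38, 45]
bin_width = 10  # assumed bin width

# Map each value to the lower edge of its bin, e.g. 22 -> 20.
bin_counts = Counter((age // bin_width) * bin_width for age in ages)

# Crude text histogram: one '#' per case in the bin.
for lower in sorted(bin_counts):
    print(f"{lower}-{lower + bin_width - 1}: {'#' * bin_counts[lower]}")
```

The tallest row (20-29 here) is the modal bin, which is how a histogram helps identify the mode.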

6
Q

Stem-Plots

A

Both a graph and a chart: displays each score in a data set so that it visually represents the distribution/frequency of scores
Stem: leading digits
Leaves: trailing digits
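
One possible way to build a stem plot in Python, assuming two-digit scores so that the tens digit is the stem and the ones digit is the leaf. The scores are invented for illustration.

```python
from collections import defaultdict

# Hypothetical two-digit scores, invented for illustration only.
scores = [52, 58, 61, 64, 64, 67, 70, 73, 75, 88]

stems = defaultdict(list)
for score in sorted(scores):
    stem, leaf = divmod(score, 10)  # stem = tens digit, leaf = ones digit
    stems[stem].append(leaf)

# One row per stem, with its leaves listed in ascending order.
for stem in sorted(stems):
    print(f"{stem} | {' '.join(str(leaf) for leaf in stems[stem])}")
```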

7
Q

What does sigma mean and what is its symbol?

A

Σ = sum of all scores

8
Q

What does x̄ (x with a bar over it) mean?

A

mean

9
Q

Mean, advantages/disadvantages

A

Advantages: very common, takes into account every entry of a data set
Disadvantages: extremely influenced by outliers, knowledge about individual cases is completely lost with average
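
A small sketch of the mean (sum of all scores divided by the number of scores) and of how one outlier shifts it. The salary figures are made up for illustration.

```python
# Hypothetical salaries in thousands, invented for illustration only.
salaries = [40, 45, 50, 55, 60]

mean = sum(salaries) / len(salaries)  # sum of all scores / n
print(mean)  # 50.0

# Adding one extreme value pulls the mean far from the typical score.
with_outlier = salaries + [1000]
mean_with_outlier = sum(with_outlier) / len(with_outlier)
print(round(mean_with_outlier, 2))  # 208.33
```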

10
Q

Population vs. sample mean

A

Population Mean (μ, the Greek letter mu): mean of the entire population, i.e. whatever you’re trying to make a generalization about (on charts). You can never really know this value.
Sample Mean: mean of your sample (on charts)

11
Q

Median, advantages/disadvantages

A

Middle (from lowest to highest)
At the median half the data set is below that number and half the data set is above that number
Position of Median = (number of entries + 1) / 2
Odd Number of Entries: median is the middle data entry
Even Number of Entries: median is the mean of the two middle data entries
Advantages: not influenced by outliers; a reasonable estimate of what most people mean by the center of a distribution (e.g., a “reasonable” average salary in Canada, not including billionaires)
Disadvantages: may not be good to ignore extreme values in all cases
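
The position rule above can be sketched in Python as follows. The function name `median` and the sample values are illustrative assumptions.

```python
def median(values):
    """Median via the (n + 1) / 2 position rule (1-based positions)."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2
    if n % 2 == 1:
        # Odd n: the position is a whole number; take that entry.
        return ordered[int(position) - 1]
    # Even n: the position falls halfway between two entries; average them.
    lower = ordered[int(position) - 1]
    upper = ordered[int(position)]
    return (lower + upper) / 2

print(median([3, 9, 1, 7, 5]))        # 5
print(median([3, 9, 1, 7, 5, 1000]))  # 6.0
```

Adding the extreme value 1000 moves the median only from 5 to 6, which illustrates why it is described as not influenced by outliers.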

12
Q

Mode, advantages/disadvantages

A

Mode: the least used measure; suited to nominal/categorical variables
Most frequently occurring value; if no entry is repeated, there is no mode
Data can be bimodal (two modes); three or more modes = multimodal
Examples: elections (which party was named the most) or asking what the most popular dish at a cafe is
Advantages: most frequently obtained score which can be useful, not influenced by extreme scores and works when outliers aren’t relevant
Disadvantages: may not represent a large proportion of the scores, there’s still a bunch of answers that might be very frequent as well and it completely ignores those
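
A short sketch of finding the mode of categorical data with `statistics.multimode` from the Python standard library (Python 3.8+). The survey answers are invented for illustration.

```python
from statistics import multimode

# Hypothetical survey answers, invented for illustration only.
votes = ["Party A", "Party B", "Party A", "Party C", "Party B", "Party A"]
print(multimode(votes))  # ['Party A']

# Two values tied for most frequent: a bimodal data set.
dishes = ["soup", "pasta", "soup", "pasta", "salad"]
print(multimode(dishes))  # ['soup', 'pasta']
```

Note that `multimode` returns every value when nothing repeats, whereas the card's convention is to say such data has no mode.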

13
Q

Advantages/disadvantages of range

A

Range can never be negative: ALWAYS HAS TO BE ABSOLUTE VALUES
Advantages: includes all the data, simple
Disadvantages: sensitive to small sample sizes (a small sample of a broader population is unlikely to capture the full range, so small samples give a smaller, unrepresentative range); doesn’t tell you anything about where the bulk of the values are; affected by outliers

14
Q

What are interquartiles?

A

Interquartile measures show dispersion around the median

15
Q

What is a quartile?

A

Quartiles: positions in a range of values representing multiples of 25%

16
Q

What is the first and third quartile?

A

First Quartile: 25% of scores fall below the first quartile, 75% above (Q1: splitting bottom half in half)
Third Quartile: 75% of scores fall below the third quartile, 25% fall above (Q3: splitting top half in half)

17
Q

What does the second quartile do?

A

The second quartile (Q2) is the median: it tells you where the middle of the data is. The interquartile range is the measure of distance between the first and third quartile (a special kind of range that includes just the middle 50% of values), i.e. where the middle half of the data is. Example: (25%, 30%, 30%, 25%), interquartile range would be 30%

18
Q

Deviation Calculation

A

Deviation is one score’s distance from the mean: the difference between each score and the mean of the data set
*Deviation shows dispersion around the MEAN rather than the median
Deviation of x (any given score) = x (that score) minus x̄ (the mean)
(1) First, find the mean
(2) Then determine the deviation with the above formula
(3) Deviation scores all together should always sum to zero
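
The three steps above can be sketched in Python. The scores are made up for illustration.

```python
# Hypothetical scores, invented for illustration only.
scores = [2, 4, 6, 8, 10]

mean = sum(scores) / len(scores)          # (1) find the mean: 6.0
deviations = [x - mean for x in scores]   # (2) deviation = score - mean
print(deviations)       # [-4.0, -2.0, 0.0, 2.0, 4.0]
print(sum(deviations))  # (3) deviations sum to zero: 0.0
```

With less tidy numbers, floating-point rounding can leave a sum that is extremely close to zero rather than exactly zero.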

19
Q

Variance calculation

A

Isn’t usually reported because it’s not as informative as standard deviation, which is much more useful
Single number representing the average amount of variation in a set of scores
(1) Find mean of the data set
(2) Find deviation of each entry
(3) Square each deviation
(4) Add to get sum of squares
(5) Divide by n-1 to get the sample variance
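
A step-by-step sketch of the sample variance, cross-checked against `statistics.variance`, which also divides by n - 1. The scores are illustrative only.

```python
import statistics

# Hypothetical scores, invented for illustration only.
scores = [2, 4, 6, 8, 10]

mean = sum(scores) / len(scores)                   # (1) mean
deviations = [x - mean for x in scores]            # (2) deviations
squared = [d ** 2 for d in deviations]             # (3) squared deviations
sum_of_squares = sum(squared)                      # (4) sum of squares
sample_variance = sum_of_squares / (len(scores) - 1)  # (5) divide by n - 1

print(sum_of_squares)   # 40.0
print(sample_variance)  # 10.0
```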

20
Q

Standard Deviation Calculation

A

Roughly the average distance of scores from the mean (the square root of the variance)
Measure of the spread of scores out from the mean of sample
Calculate Variance then take square root of it
(1) Find mean of the data set
(2) Find deviation of each entry
(3) Square each deviation
(4) Add to get sum of squares
(5) Divide by n-1 to get the sample variance
(6) find the square root of the variance
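
A brief sketch of step (6): taking the square root of the sample variance. Here `statistics.variance` stands in for steps (1) to (5), and the scores are invented for illustration.

```python
import math
import statistics

# Hypothetical scores, invented for illustration only.
scores = [2, 4, 6, 8, 10]

sample_variance = statistics.variance(scores)  # steps (1)-(5), divides by n - 1
sample_sd = math.sqrt(sample_variance)         # step (6)
print(round(sample_sd, 3))  # 3.162
```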

21
Q

What measures of central tendency can be used at each NOIR level (nominal, ordinal, interval, ratio)?

A

Nominal: only use the mode
Ordinal: use mode and median (The mean cannot be computed with ordinal data. Finding the mean requires you to perform arithmetic operations like addition and division on the values in the data set. Since the differences between adjacent scores are unknown with ordinal data, these operations cannot be performed for meaningful results)
Interval: use mode, median, and mean
Ratio: use mode, median, and mean

22
Q

What are confidence intervals?

A

the range of values that you expect your estimate to fall between a certain percentage of the time if you run your experiment again or re-sample the population in the same way. Range of values based on sample data likely to contain a true value
The confidence interval provides a sense of the size of any effect. The figures in a confidence interval are expressed in the descriptive statistic to which they apply (percentage, correlation, regression, etc.). This effect size information is missing when a test of significance is used on its own.
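
As a rough illustration only, one common way to build a 95% confidence interval for a mean is mean ± z × (standard deviation / √n). This sketch assumes the normal approximation (z of about 1.96), which the card does not specify; for a sample this small a t-based interval would usually be preferred. The scores are made up.

```python
import math
from statistics import NormalDist, mean, stdev

# Hypothetical scores, invented for illustration only.
scores = [2, 4, 6, 8, 10]

n = len(scores)
sample_mean = mean(scores)
standard_error = stdev(scores) / math.sqrt(n)
z = NormalDist().inv_cdf(0.975)  # about 1.96 for a 95% interval (assumed)
margin = z * standard_error

lower, upper = sample_mean - margin, sample_mean + margin
print(f"95% CI: {lower:.2f} to {upper:.2f}")
```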

23
Q

Skewed Distribution

A

Distributions that are not normal; a large number of scores are clumped toward one end

24
Q

Left skewed distribution

A

majority of the numbers clumped at the right, with the long tail pointing toward the left

25
Q

Right skewed distribution

A

Majority of numbers at left, with tail pointing toward right

26
Q

Median Cut-Point

A

Cut Point = (n + 1) / 2, where n is the sample size
Provides location of the median in data set
Not affected by outliers

27
Q

Absolute Frequency

A

Absolute Frequency: raw counts of the number of cases associated with each value (found by tallying each case)

28
Q

Relative Frequency

A

Percentage of cases associated with a particular value or category, i.e. how much of the data responded that way (25% responded this way, 15% responded that way, etc.). The percentages add to 100%.

29
Q

Cumulative Frequency

A

sum of cases associated with a value/category and all classes below it

30
Q

Cumulative Percentage

A

Answers questions such as what percent of people spent ten hours together or less, or ten hours or more (pick a mark and go up or down). Always counts from low to high.
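
The four frequency measures on cards 27 to 30 can be sketched together in Python. The hours-together responses are invented for illustration.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical responses (hours spent together), invented for illustration.
hours = [5, 10, 10, 15, 5, 10, 20, 15, 10, 5]

absolute = Counter(hours)
values = sorted(absolute)                 # count from low to high
n = len(hours)

counts = [absolute[v] for v in values]               # absolute frequency
relative = [100 * c / n for c in counts]             # relative frequency (%)
cumulative = list(accumulate(counts))                # cumulative frequency
cumulative_pct = [100 * c / n for c in cumulative]   # cumulative percentage

for row in zip(values, counts, relative, cumulative, cumulative_pct):
    print(row)
```

In this made-up data, 70% of respondents spent ten hours together or less, read off the cumulative percentage for the value 10.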

31
Q

Sum of squares

A

Dispersion of the data set, found by:
1. subtracting the mean from each number
2. squaring that difference
3. adding those squares
SS = Σ (x - x̄)²

32
Q

What are Z scores useful for?

A

allows us to compare one score to another
can compare a score to a score from another data sample that uses a different scale
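
A sketch of that comparison, using the standard formula z = (score - mean) / standard deviation. The formula is not stated on the card, and using the sample standard deviation here is an assumption. The two sets of test scores are invented.

```python
import statistics

def z_score(x, data):
    """Number of standard deviations x lies from the mean of data."""
    return (x - statistics.mean(data)) / statistics.stdev(data)

# Hypothetical scores on two tests with different scales.
test_a = [60, 70, 80, 90, 100]  # marked out of 100
test_b = [10, 12, 14, 16, 18]   # marked out of 20

print(round(z_score(90, test_a), 3))  # 0.632
print(round(z_score(18, test_b), 3))  # 1.265
```

Although 90 looks larger than 18, the score of 18 sits further above its own group's mean once both are expressed in standard deviations.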

33
Q

Scatter Plots

A

Visual representation of the relationship between two variables, in which each case is represented as a dot

34
Q

Line of Best Fit

A

Line that minimizes the distance between the line and the actual data points
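
One common way to make "minimizes the distance" precise is ordinary least squares, which minimizes the sum of squared vertical distances between the line and the points. The card does not name a method, so treat this as an assumed choice; the function name and data are illustrative.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sum of squares of x around their means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical paired values, invented for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
slope, intercept = least_squares_line(xs, ys)
print(round(slope, 2), round(intercept, 2))  # 0.6 2.2
```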

35
Q

Where is the mean located on an asymmetrical distribution?

A

Mode is always furthest away from tail
Mean is closest to the tail because it is the measure most affected by extreme cases (closer to tail = more affected); the mean is better used on normal/symmetrical distributions
Median is always between these two

36
Q

What is the standard deviation in a normal distribution?

A

About two thirds (roughly 68%) of the data in a normal distribution are within one standard deviation on either side of the mean (between -1 and +1 standard deviations)
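
The "two thirds" figure can be checked with `statistics.NormalDist` from the Python standard library, here using the standard normal distribution (mean 0, standard deviation 1).

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Proportion of the distribution between -1 and +1 standard deviations.
within_one_sd = standard_normal.cdf(1) - standard_normal.cdf(-1)
print(round(within_one_sd, 4))  # 0.6827
```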

37
Q

Q Position

A

Q position = Q number × (n + 1) / 4

38
Q

IQR

A

IQR = Q3 - Q1

39
Q

High-End Outliers

A

High-end outliers: any scores that are larger than (not equal to) Q3 + (1.5 × IQR)

40
Q

Low-End Outliers

A

Low-end outliers: any scores that are less than (not equal to) Q1 - (1.5 × IQR)
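
A sketch tying together quartiles, the IQR and the two outlier fences. It uses `statistics.quantiles` with its default "exclusive" method, which places each quartile at position Q number × (n + 1) / 4 and interpolates between neighbours instead of rounding to the nearest 0.5, so results can differ slightly from the course's hand method. The scores are invented.

```python
import statistics

# Hypothetical scores, invented for illustration only.
scores = [2, 4, 5, 7, 8, 9, 11, 12, 13, 15, 40]

q1, q2, q3 = statistics.quantiles(scores, n=4)  # default method="exclusive"
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Strict comparisons: a score equal to a fence is not an outlier.
high_outliers = [x for x in scores if x > upper_fence]
low_outliers = [x for x in scores if x < lower_fence]

print(q1, q3, iqr)     # 5.0 13.0 8.0
print(high_outliers)   # [40]
print(low_outliers)    # []
```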

41
Q

Box and Whisker Plot

A

“Whiskers”: drawn from the smallest value within the lower inner fence to the largest value in the data set that is still within the upper fence

42
Q

Scatterplots

A

Describing association between two variables
Each individual dot is a person or dyad and has two values (an X and a Y)
Shows direction and strength of relationships
Shows degree of covariance

43
Q

Rounding rule for Q1 and Q3?

A

round up to nearest 0.5 for Q1, down to nearest 0.5 for Q3
