Data Representation Flashcards

1
Q

Representation of Data,

qualitative (categorical) variable

quantitative variable

types of quantitative variables;
type 1
type 2

A

Qualitative (categorical) variable:
Unable to take a numerical value

Quantitative variable:
Can take a numerical value

Continuous quantitative variable:
Variable which can take any value in a given range

Discrete quantitative variable:
Variable with clear steps (gaps) between its possible values.

2
Q

Measures of central tendency,

mean;
description
formula

median;
description
process of finding

mode,
description
process of finding

advantages of the measures of central tendency;
mean
median
mode

disadvantages of the measures of central tendency;
mean
median
mode

A

The mean [ x̄ or E(X) ] of a data set is the sum of the values in the data set divided by the number of values. If the total number of values is n and sigma notation ( Σ ) denotes the sum of all the values:

x̄ = Σx / n = ( X1 + X2 + … + Xn ) / n

The median is defined to be the middle value of the data

After sorting the values numerically, if n is odd the median is the middle value; otherwise it is the sum of the 2 middle values divided by 2.

The mode is defined to be the most common, or most frequently occurring, value in a data set

The mode is the value that appears the most. There may be multiple modes if several values tie for the highest frequency, and there may be no mode if no value occurs more often than the others
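
As a rough illustration of the three definitions above, here is a minimal Python sketch. The function names and the sample data are made up for this example, and treating "every value occurs once" as "no mode" is an assumed convention rather than something the card fixes.

```python
from collections import Counter

def mean(values):
    """Sum of the values divided by the number of values."""
    return sum(values) / len(values)

def median(values):
    """Middle value after sorting; mean of the 2 middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def modes(values):
    """All values sharing the highest frequency (empty list if nothing repeats)."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # assumed convention: no repeated value means no mode
    return sorted(v for v, c in counts.items() if c == highest)

data = [3, 7, 7, 2, 9, 7, 2, 5]  # made-up illustrative data
print(mean(data), median(data), modes(data))
```

Returning a list from `modes` keeps the "multiple modes" and "no mode" cases visible instead of silently picking one value.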

Mean (+):
– Uses all values
– Can be found on calculator easily

Median (+):
– Not influenced by extreme high or low values

Mode (+):
– Good for finding the most popular value

Mean (–):
– Influenced by extreme high or low values

Median (–):
– Difficult to work out if there are a large number of values

Mode (–):
– May not at all be representative of a set of values

3
Q

Measures of spread,

description

range;
description
formula

inter-quartile range (IQR);
description
process

standard deviation and variance;
description
3 formulas
process of finding from the table
standard formula for variance

coded data;
mean formula
standard deviation formula
variance formula

combined sets of data;
mean
standard deviation
variance

applications;
mean formula
standard deviation formula
variance formula

A

The range, inter-quartile range, standard deviation and variance tell us about the spread of a data set

Range:
The range is the difference between the largest and smallest value

maximum - minimum = range

Inter-quartile range:
The upper quartile and lower quartile, together with the median, separate a set of data values into 4 quarters

The lower quartile is the value below which 25%, or a quarter, of the values lie. It can be found by finding the middle value of the data lying between the minimum value and the median, or the value corresponding with 0.25n

The upper quartile is the value below which 75%, or three quarters, of the values lie. It can be found by finding the middle value of the data lying between the median and the maximum value, or the value corresponding with 0.75n

The IQR is the upper quartile minus the lower quartile, or the range of the quartiles.
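
The range and quartile descriptions above can be sketched in Python as follows. Quartile conventions differ between textbooks and software; this sketch assumes the "median of each half" method (excluding the median itself when n is odd), and the function names and data are illustrative only.

```python
def median_of_sorted(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def spread_summary(values):
    """Range, quartiles and IQR; assumes at least two values."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]           # values before the median position
    upper_half = ordered[half + n % 2:]   # values after the median position
    lq = median_of_sorted(lower_half)
    uq = median_of_sorted(upper_half)
    return {
        "range": ordered[-1] - ordered[0],
        "LQ": lq,
        "median": median_of_sorted(ordered),
        "UQ": uq,
        "IQR": uq - lq,
    }

summary = spread_summary([4, 8, 15, 16, 23, 42, 7, 11, 30])  # made-up data
print(summary)
```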

Standard deviation [ σ or Sd(X) ], Variance [ σ^2 or Var(X) ]:
The standard deviation of a set of data values is a measure of their spread about the mean.

σ = √[ Σ( x - x̄ )^2 / n ]
σ = √[ ( Σx^2 / n ) - ( x̄ )^2 ]
σ = √{ [ Σx^2 - n( x̄ )^2 ] / n }
  1. Find the mean
  2. Write down the difference of each value from the mean, ( x - x̄ )
  3. Square each difference, ( x - x̄ )^2
  4. Average the squares, Σ( x - x̄ )^2 / n
  5. Take the square root to reverse the effect of squaring
Variance = σ^2 = Σ( x - x̄ )^2 / n
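
A quick numerical check can make it easier to trust that the three standard deviation formulas are the same quantity written in different ways. The sketch below uses made-up data and illustrative function names; all formulas divide by n (the population form used on this card).

```python
import math

def sd_from_deviations(xs):
    """Square root of the average squared difference from the mean."""
    n = len(xs)
    m = sum(xs) / n
    return math.sqrt(sum((x - m) ** 2 for x in xs) / n)

def sd_mean_of_squares(xs):
    """Square root of (mean of the squares minus the square of the mean)."""
    n = len(xs)
    m = sum(xs) / n
    return math.sqrt(sum(x * x for x in xs) / n - m ** 2)

def sd_from_sums(xs):
    """Square root of (sum of squares minus n * mean squared), all over n."""
    n = len(xs)
    m = sum(xs) / n
    return math.sqrt((sum(x * x for x in xs) - n * m ** 2) / n)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data with mean 5 and variance 4
print(sd_from_deviations(data), sd_mean_of_squares(data), sd_from_sums(data))
```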

Coded mean formula:
x̄ = [ Σ( x ± a ) / n ] ∓ a

Coded standard deviation formula:
σ = √{ [ Σ( x ± a )^2 / n ] - [ Σ( x ± a ) / n ]^2 }

Coded variance formula:
σ^2 = { [ Σ( x ± a )^2 ] / n} - { [ Σ( x ± a ) / n ] ^2 }

Combined mean formula:
combined mean = ( Σx + Σy ) / ( Nx + Ny )

Combined standard deviation formula:
combined σ = √{ [ ( Σx^2 + Σy^2 ) / ( Nx + Ny ) ] - [ ( Σx + Σy ) / ( Nx + Ny ) ]^2 }

Combined variance formula:
combined σ^2 = [ ( Σx^2 + Σy^2 ) / ( Nx + Ny ) ] - [ ( Σx + Σy ) / ( Nx + Ny ) ]^2
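
To illustrate combining two data sets from their summary totals, here is a small Python sketch. It assumes the combined variance is the combined mean of squares minus the square of the combined mean; the function name, argument names and data are hypothetical.

```python
import math

def combined_stats(n_x, sum_x, sum_x2, n_y, sum_y, sum_y2):
    """Mean, variance and standard deviation of two data sets pooled together."""
    n = n_x + n_y
    mean = (sum_x + sum_y) / n
    variance = (sum_x2 + sum_y2) / n - mean ** 2
    return mean, variance, math.sqrt(variance)

xs = [2, 4, 6]          # made-up first data set
ys = [1, 3, 5, 7, 9]    # made-up second data set
combined_mean, combined_var, combined_sd = combined_stats(
    len(xs), sum(xs), sum(x * x for x in xs),
    len(ys), sum(ys), sum(y * y for y in ys),
)
print(combined_mean, combined_var, combined_sd)
```

Only the counts, sums and sums of squares are needed, which is why these formulas are handy when the raw values are not available.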

Applied mean formula:
x̄ = 1/a { [ Σ( ax ± b ) / n ] ∓ b }

Applied standard deviation formula (for a > 0):
σ = 1/a √{ [ Σ( ax ± b )^2 / n ] - [ Σ( ax ± b ) / n ]^2 }

Applied variance formula:
σ^2 = 1/a^2 { [ Σ( ax ± b )^2 / n ] - [ Σ( ax ± b ) / n ]^2 }
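
The coded and applied formulas both undo a linear coding of the data. A minimal Python sketch, assuming the coding is written as y = ax + b (so "x ± a" on the coded cards corresponds to a = 1 here, with b carrying the sign), might look like this. The names and numbers are made up for illustration.

```python
import math

def mean_sd(values):
    """Mean and standard deviation (dividing by n)."""
    n = len(values)
    m = sum(values) / n
    return m, math.sqrt(sum(v * v for v in values) / n - m * m)

def decode(coded_mean, coded_sd, a, b):
    """Undo the coding y = a*x + b to recover the mean and sd of x."""
    # Shifting by b moves the mean but not the spread;
    # scaling by a scales both (abs(a) keeps the sd positive).
    return (coded_mean - b) / a, coded_sd / abs(a)

xs = [1010, 1020, 1030, 1040, 1050]   # made-up raw values
a, b = 2, -2000                       # hypothetical coding y = 2x - 2000
ys = [a * x + b for x in xs]

coded_mean, coded_sd = mean_sd(ys)
decoded_mean, decoded_sd = decode(coded_mean, coded_sd, a, b)
print(decoded_mean, decoded_sd)
```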

4
Q

Stem and leaf diagram,

description

steps for construction

notes for constructing a stem and leaf diagram

back to back stem and leaf diagram description

back to back stem and leaf diagram construction steps

points to consider when comparing

A

A useful way of organising data as it is collected. It shows the distribution of the data and, because the values are arranged from smallest to largest, makes it easy to locate the quartiles and median

  1. Rearrange data values in order
  2. Place first digits of values in order on vertical stem
  3. Place remaining digits in order horizontally along leaf
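
The three construction steps above can be sketched in Python. This is only an illustration under stated assumptions: the values are non-negative whole numbers, the tens form the stem and the units form the leaf, and the function name and data are made up.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the rows of a stem and leaf diagram as strings."""
    stems = defaultdict(list)
    for v in sorted(values):               # step 1: put the values in order
        stems[v // 10].append(v % 10)      # tens digit(s) -> stem, units -> leaf
    rows = []
    # step 2: list every stem in order, keeping empty stems so gaps stay visible
    for stem in range(min(stems), max(stems) + 1):
        # step 3: write the leaves in order along the row
        leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
        rows.append(f"{stem} | {leaves}".rstrip())
    return rows

rows = stem_and_leaf([23, 31, 25, 47, 31, 28, 40])  # made-up data
print("\n".join(rows))
print("Key: 2 | 3 means 23")
```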

a) give the diagram a title
b) show values of median, LQ, UQ, IQR, n, range
c) write a key for the diagram

A back to back stem and leaf diagram is useful to compare 2 different groups of data values

  1. Rearrange data values in order
  2. Place first digits of values in order on vertical stem for both of the groups
  3. Place the remaining digits of the first group in order horizontally along leaf, to the right of the stem
  4. Place the remaining digits of the second group in order horizontally along leaf, to the left of the stem

Comment on:

  1. Median
  2. Range
  3. Minimum and maximum values
  4. IQR
5
Q

Box and whisker diagram,

description

steps for construction

notes for constructing

outlier definition

A

A box and whisker diagram shows the location of 5 values from a distribution plotted against a scale in order;
smallest value, LQ, median, UQ, largest value

  1. Rearrange the data values in order
  2. Find the smallest value
  3. Find the LQ
  4. Find the median
  5. Find the UQ
  6. Find the largest value
  7. Rule suitable scale
  8. Mark points
  9. Draw box and whisker diagram
  10. Mark any outliers outside the diagram and label them

a) give the diagram a title
b) show the min, max, med, LQ, UQ, IQR, n, mean, range

An outlier is defined as any value more than 1.5 x IQR above the UQ or more than 1.5 x IQR below the LQ.
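
A short Python sketch of an outlier check, assuming the common 1.5 x IQR fence rule (values below LQ - 1.5 x IQR or above UQ + 1.5 x IQR count as outliers). The function names and sample numbers are illustrative only.

```python
def outlier_fences(lq, uq):
    """Lower and upper fences under the assumed 1.5 x IQR rule."""
    iqr = uq - lq
    return lq - 1.5 * iqr, uq + 1.5 * iqr

def find_outliers(values, lq, uq):
    """Values lying outside the fences, in their original order."""
    low, high = outlier_fences(lq, uq)
    return [v for v in values if v < low or v > high]

fences = outlier_fences(10, 20)                               # made-up quartiles
outliers = find_outliers([3, 12, 15, 18, 36, -6], 10, 20)     # made-up data
print(fences, outliers)
```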

6
Q

Grouping data,

description

description of information collected from group data

A

When dealing with data that has been measured in some way, like height or weight, nearly all the values will be different. To show how the heights of people are distributed, the most you can do is collect together people with similar heights. This process is called grouping: the values are collected into class intervals, each holding the data between 2 values

Grouping data into class intervals means some information is lost: looking at the table does not tell you the exact values. What is known is the number of observations in each class (the frequency). If the values are measured to the nearest ____, half of that unit or increment is subtracted from the lower class value and added to the upper class value to give the “class boundaries”

7
Q

Class boundaries,

common example or rule of thumb

case exceptions (note be)

types of class boundaries;
type 1
example of type 1
type 2
example of type 2
type 3
example of type 3
limitation of type 3

note for age

A

There is no universally accepted rule for class boundaries, but for values recorded to the nearest whole number the common adjustment is 0.5

The class boundaries, for example, for 0 - 9 would be:
-0.5 ≤  x < 9.5

Gap:
given –> 160 - 164, 165 - 169, 170 - 174 …
159.5 ≤ x < 164.5 , 164.5 ≤ x < 169.5 , 169.5 ≤ x < 174.5

No gap:
given –> 55 - 60, 60 - 65, 65 - 70 …
55 ≤ x < 60 , 60 ≤ x < 65 , 65 ≤ x < 70
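
The "gap" and "no gap" cases above can be handled by one rule: split the gap between neighbouring classes in half. The Python sketch below assumes the classes are listed in ascending order with the same gap between every pair of neighbours; the function name is hypothetical.

```python
def class_boundaries(classes):
    """Turn stated class limits (lower, upper) into class boundaries."""
    # Gap between the upper limit of one class and the lower limit of the next
    gap = classes[1][0] - classes[0][1] if len(classes) > 1 else 0
    half = gap / 2
    return [(lower - half, upper + half) for lower, upper in classes]

gap_classes = class_boundaries([(160, 164), (165, 169), (170, 174)])
no_gap_classes = class_boundaries([(55, 60), (60, 65), (65, 70)])
print(gap_classes)
print(no_gap_classes)
```

With a gap of 1 the limits move by 0.5 each way; with no gap the stated limits are already the boundaries.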

Open ended:
given –> 17 - 20, 20 - 23, 23 -

The problem with this frequency distribution is that the last class is open-ended, meaning you cannot deduce the correct class boundaries unless you know the individual data values. A reasonable procedure when dealing with this is to take the width of the last interval to be twice that of the previous one

Age is recorded as the number of completed years, so for example the class interval 17 - 20 contains those who passed the test from the day of their 17th birthday up to, but not including, the day of their 20th birthday

8
Q

Histogram,

when likely used

description or key factors

drawing a histogram

A

When a grouped frequency distribution contains continuous data one of the most common forms of graphical display is a histogram

a) The bars have no spaces between them
b) The area of each bar is proportional to the frequency

  1. Give the histogram a title
  2. Label the y-axis frequency density
  3. Label the x-axis with its corresponding variable and the unit
9
Q

Frequency density,

proportionality between block area and frequency

frequency density/ height formula

frequency table

mean formula with frequency

two standard deviation formulas with frequency

A

The simplest way to make the area of a block proportional to the frequency is to make the area of the block equal to the frequency

frequency density (or height) = frequency / class width

Variable | Class boundaries | Class width | Frequency (f) | Frequency density | Midpoint (x) | xf
with totals labelled Σf and Σxf

x̄ = Σxf / Σf

σ = √[ ( Σfx^2 / Σf ) - ( x̄ )^2 ]
σ = √[ Σf( x - x̄ )^2 / Σf ]
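
As a worked illustration of the frequency table and the grouped formulas, the Python sketch below builds the width, frequency density and midpoint columns and then estimates the mean and standard deviation both ways. The class boundaries and frequencies are made up for the example.

```python
import math

# (lower boundary, upper boundary, frequency) -- made-up grouped data
classes = [(159.5, 164.5, 4), (164.5, 169.5, 10), (169.5, 174.5, 6)]

widths = [upper - lower for lower, upper, _ in classes]
densities = [f / w for (_, _, f), w in zip(classes, widths)]  # frequency / class width
midpoints = [(lower + upper) / 2 for lower, upper, _ in classes]
freqs = [f for _, _, f in classes]

total_f = sum(freqs)                                          # Σf
total_xf = sum(x * f for x, f in zip(midpoints, freqs))       # Σxf
grouped_mean = total_xf / total_f

# Two equivalent forms of the grouped standard deviation
sd_squares = math.sqrt(
    sum(f * x * x for x, f in zip(midpoints, freqs)) / total_f - grouped_mean ** 2
)
sd_deviations = math.sqrt(
    sum(f * (x - grouped_mean) ** 2 for x, f in zip(midpoints, freqs)) / total_f
)
print(densities, grouped_mean, sd_squares, sd_deviations)
```

Because midpoints stand in for the real values, the mean and standard deviation here are estimates rather than exact figures.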
10
Q

Cumulative frequency diagram,

description

finding the median, UQ and LQ from the diagram;
median
LQ
UQ

cumulative frequency table example

note for final value of cumulative frequency

notes for drawing a cumulative frequency diagram

A

The cumulative frequencies are plotted against the upper class boundaries of the corresponding classes. Each point tells us how many values are less than the given measurement (continuous data). To fill in the cumulative frequency table, add up the numbers in the frequency column, up to and including the required position

Median:

  1. Calculate half the total frequency (50%)
  2. Locate this number on the vertical axis
  3. Draw a horizontal line across to the cumulative frequency graph then down to the x-axis

LQ:

  1. Calculate one quarter of the total frequency (25%)
  2. Locate this number on the vertical axis
  3. Draw a horizontal line across to the cumulative frequency graph then down to the x-axis

UQ:

  1. Calculate three quarters of the total frequency (75%)
  2. Locate this number on the vertical axis
  3. Draw a horizontal line across to the cumulative frequency graph then down to the x-axis

Variable | Frequency (f) | Variable | Cumulative frequency (total number < x)
  |  | < 46.5 | 0
47 - 54 | 4 | < 54.5 | 4
55 - 62 | 7 | < 62.5 | 11
63 - 66 | 8 | < 66.5 | 19
67 - 74 | 7 | < 74.5 | 26
75 - 80 | 8 | < 80.5 | 34
81 - 92 | 4 | < 92.5 | 38

The final value of the cumulative frequency should be the same as the total number of, e.g., people

To then draw the graph, the points are plotted as (x, y), with x being the upper class boundary and y being the cumulative frequency, e.g. (46.5, 0), (54.5, 4) etc.
Also ensure your graph is big enough to estimate the median, LQ and UQ
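
Reading values off the diagram can be imitated in Python. The sketch below uses the frequencies and upper class boundaries from the example table and assumes the plotted points are joined by straight lines, so each reading is a linear interpolation. It takes the LQ at one quarter and the UQ at three quarters of the total frequency; the function name is hypothetical.

```python
from itertools import accumulate

frequencies = [4, 7, 8, 7, 8, 4]
upper_boundaries = [54.5, 62.5, 66.5, 74.5, 80.5, 92.5]

# Points (upper class boundary, cumulative frequency), starting from (46.5, 0)
points = [(46.5, 0)] + list(zip(upper_boundaries, accumulate(frequencies)))

def read_off(points, target):
    """x-value where the straight-line graph reaches the target cumulative frequency."""
    for (x0, c0), (x1, c1) in zip(points, points[1:]):
        if c0 < c1 and c0 <= target <= c1:
            return x0 + (target - c0) * (x1 - x0) / (c1 - c0)
    raise ValueError("target lies outside the cumulative frequency range")

total = points[-1][1]                   # final cumulative frequency = total number
median = read_off(points, total / 2)
lq = read_off(points, total / 4)
uq = read_off(points, 3 * total / 4)
print(total, lq, median, uq)
```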

11
Q

Choosing how to represent data,

Advantages of diagrams;
diagram 1
diagram 2
diagram 3
diagram 4
Disadvantages of diagrams;
diagram 1
diagram 2
diagram 3
diagram 4
A

Stem and leaf (+) - Discrete:
Contains all the original data values, so the range, median and quartiles can be easily found from it, and the mean and standard deviation can be calculated exactly

Histogram (+) - Continuous:
For large data sets you can make a frequency table and draw the histogram to show the shape of the distribution, it can also group the data into classes of any width

Cumulative frequency (+) - Continuous:
It is useful for estimating the number of data values that lie below or above a given value of the variable

Box and whisker (+) - Continuous:
Gives the lowest, highest, median and quartile values directly and is also very useful when comparing several related data sets

Stem and leaf (-) - Discrete:
For large data sets it becomes difficult to draw and can look confusing because it contains so much information

Histogram (-) - Continuous:
Some of the information of the original data set is lost therefore making the values of the mean, median, quartiles and standard deviations estimates rather than exact values

Cumulative frequency (-) - Continuous:
The values of the mean, median, quartiles and standard deviation are estimates rather than exact values

Box and whisker (-) - Continuous:
Does not show the mean or standard deviation and gives no indication of the size of the data set

12
Q

Distribution and skewness,

description

type 1;
shape in reference to histogram
description
formula
in reference to box and whisker plot
type 2;
shape in reference to histogram
description
formula
in reference to box and whisker plot
type 3;
shape in reference to histogram
description
formula
in reference to box and whisker plot
A

Another important feature of a set of data is its shape when represented in a frequency diagram. There are 3 different shapes that commonly occur when you draw histograms or bar charts for data sets

Symmetrical (zero-skewness):
The histogram is symmetrical, bell shaped
UQ - Median ≈ Median - LQ
Median in the centre of the box and whisker plot

Positive skew (skewed positively):
Histogram is not symmetrical and has a tail stretching towards the higher values, kind of like a negative parabola that stretches out horizontally as it nears the x-axis
UQ - Median > Median - LQ
Median nearer to the left of the box and whisker plot

Negative skew (skewed negatively):
Histogram is not symmetrical and has a tail stretching towards the lower values, kind of like a positive parabola that starts off as a line running horizontally along the x-axis
UQ - Median < Median - LQ
Median nearer to the right of the box and whisker plot
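
The quartile comparisons for the three shapes can be turned into a small Python check. This is a sketch only: the `tolerance` parameter is a hypothetical stand-in for the "≈" in the symmetrical case, and the function name and sample quartiles are made up.

```python
def skew_from_quartiles(lq, median, uq, tolerance=0.0):
    """Classify skew by comparing UQ - median with median - LQ."""
    upper_gap = uq - median
    lower_gap = median - lq
    if abs(upper_gap - lower_gap) <= tolerance:
        return "symmetrical"
    return "positive skew" if upper_gap > lower_gap else "negative skew"

print(skew_from_quartiles(10, 15, 20))   # gaps equal
print(skew_from_quartiles(10, 12, 20))   # median nearer the LQ
print(skew_from_quartiles(10, 18, 20))   # median nearer the UQ
```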
