r_distributions Flashcards

1
Q

Why is it difficult to summarize data into a visualization when the data is numeric?

A
  • When data is not categorical, reporting the frequency of each unique entry is not an effective summary since most entries are unique
    • For example, in a dataset where students report height:
      • only one student reported a height of 68.503937 inches
      • only one student reported a height of 68.8976377952 inches
  • A more useful method is to define a distribution for numerical data that reports the proportion of the data at or below a value a, for all possible values of a
  • This function is called a cumulative distribution function or CDF
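
The CDF described above can be sketched in a few lines of R. This is a minimal illustration on simulated data, not the students' dataset: the vector `x`, the seed, and the helper name `cdf_at` are assumptions made for the example. Base R's `ecdf()` builds the same function.

```r
# Simulated heights in inches (hypothetical data for illustration only)
set.seed(1)
x <- rnorm(500, mean = 68, sd = 3)

# Almost every value is unique, so a frequency table is not a useful summary
n_unique <- length(unique(x))

# Proportion of the data at or below a (hypothetical helper name)
cdf_at <- function(a) mean(x <= a)
p_manual <- cdf_at(70)

# ecdf() from the stats package returns the empirical CDF as a function
F_hat <- ecdf(x)
p_ecdf <- F_hat(70)

print(c(manual = p_manual, ecdf = p_ecdf))
```

Both values agree because `ecdf()` also counts the observations that are less than or equal to the cut point.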
2
Q

What is the key to creating smooth density plots?

A
  • Very small bin widths
  • As the bins get smaller, the rough bars and peaks of the histogram are smoothed out
  • Eventually, once the bins are small enough, the edges disappear and the outline becomes a smooth curve
3
Q

What is an important difference between histograms and smooth density plots?

A
  • Histograms use a count scale
  • Smooth density plots use a density (relative frequency) scale: the y-axis is scaled so that the area under the curve is 1
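
The difference in scale can be checked numerically. The sketch below uses simulated data (the vector `x` and the seed are assumptions) and compares the counts of a histogram with the area under the histogram on the density scale and under the `density()` estimate.

```r
# Simulated heights in inches (hypothetical data for illustration only)
set.seed(1)
x <- rnorm(500, mean = 68, sd = 3)

# Compute the histogram without drawing it
h <- hist(x, plot = FALSE)
total_count <- sum(h$counts)                     # count scale: adds up to 500
hist_area <- sum(h$density * diff(h$breaks))     # density scale: bar areas add up to 1

# Smooth density estimate; area approximated with the trapezoid rule
d <- density(x)
n <- length(d$x)
dens_area <- sum(diff(d$x) * (d$y[-1] + d$y[-n]) / 2)

print(c(count = total_count, hist_area = hist_area, dens_area = dens_area))
```

The area under the smooth density is only approximately 1 here because `density()` evaluates the curve on a finite grid.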
4
Q

What are some other names for Normal Distribution?

A
  • Bell curve
  • Gaussian distribution
5
Q

What percent of values are within 2 sd of the mean in a normal distribution?

A
  • About 95%
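
This figure can be verified with `pnorm`, which returns the normal CDF. The short check below uses the standard normal (mean 0, sd 1); the variable names are chosen for the example. The result is the same for any mean and standard deviation, since only the distance from the mean in sd units matters.

```r
# Proportion of a normal distribution within 1, 2, and 3 sd of the mean
within_1 <- pnorm(1) - pnorm(-1)   # about 0.683
within_2 <- pnorm(2) - pnorm(-2)   # about 0.954
within_3 <- pnorm(3) - pnorm(-3)   # about 0.997

print(round(c(within_1, within_2, within_3), 3))
```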
6
Q

What is the equation for the mean (average)?

A
  • average <- sum(x) / length(x)
7
Q

What is the equation for the standard deviation?

A
  • SD <- sqrt( sum( (x-average)^2) / length(x) )
  • square root
    • of the sum
      • of the differences between the values and the mean
        • squared
          • divided by the length
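
A small worked example may help here. The five values below are made up for illustration. Note that R's built-in `sd()` divides by `length(x) - 1` rather than `length(x)`, so it gives a slightly larger number than the formula on this card.

```r
# Hypothetical heights in inches
x <- c(66, 68, 69, 70, 72)

average <- sum(x) / length(x)                       # 69
SD <- sqrt(sum((x - average)^2) / length(x))        # sqrt(20 / 5) = 2

# sd() uses n - 1 in the denominator: sqrt(20 / 4), about 2.236
sd_builtin <- sd(x)

# Rescaling sd() recovers the divide-by-n version
n <- length(x)
SD_from_builtin <- sd_builtin * sqrt((n - 1) / n)

print(c(SD = SD, sd_builtin = sd_builtin, rescaled = SD_from_builtin))
```

For large datasets the two versions are nearly identical, because `(n - 1) / n` is close to 1.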
8
Q

‘x’ represents male heights. Determine the following:

  • What proportion of the data is between 69 and 72 inches (taller than 69, but shorter or equal to 72).
A
  • mean( (x > 69) & (x <= 72))
9
Q

Suppose you have the following information:

  • library(dslabs)
  • data(heights)
  • x <- heights$height[heights$sex == "Male"]
  • avg <- mean(x)
  • stdev <- sd(x)

Use the normal approximation to estimate the proportion of the data that is between 69 and 72 inches.

Note: You can only use ‘avg’ and ‘stdev’, and the ‘pnorm’ function

A
  • pnorm(72, avg, stdev) - pnorm(69, avg, stdev)
10
Q

Suppose you have the following information:

  • library(dslabs)
  • data(heights)
  • x <- heights$height[heights$sex == "Male"]
  • exact <- mean(x > 79 & x <= 81)
  1. Use the normal approximation to estimate the proportion of heights between 79 and 81 inches and save it in an object called approx.
  2. Report how many times bigger the actual proportion is compared to the approximation.
A
  • avg <- mean(x)
  • stdev <- sd(x)
  1. approx <- pnorm(81, avg, stdev) - pnorm(79, avg, stdev)
    • approx
  2. exact / approx
11
Q
  • First, we will estimate the proportion of adult men that are 7 feet tall or taller.
  • Assume that the heights of adult men in the world are normally distributed with an average of 69 inches and a standard deviation of 3 inches.
  • Using this approximation, estimate the proportion of adult men that are 7 feet tall or taller, referred to as seven footers. Print out your estimate; don’t store it in an object.
A
  • 1 - pnorm(84, 69, 3)
12
Q

What is a quartile?

A
  • Quantiles are cut points that divide the range of a probability distribution into contiguous intervals with equal probabilities
  • There is one fewer quantile than the number of groups created
  • Thus, quartiles are the three cut points that divide a dataset into four equal-sized groups
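
The definition above can be illustrated with `quantile()` and `cut()`. This is a sketch on simulated data; the vector `x`, the seed, and the sample size are assumptions made for the example.

```r
# Simulated heights in inches (hypothetical data for illustration only)
set.seed(1)
x <- rnorm(1000, mean = 69, sd = 3)

# The three quartiles: 25th, 50th, and 75th percentiles
q <- quantile(x, c(0.25, 0.50, 0.75))

# Use the quartiles as cut points to split the data into four groups
groups <- cut(x, breaks = c(-Inf, q, Inf))
group_sizes <- as.vector(table(groups))

print(q)
print(group_sizes)   # four groups of (nearly) equal size
```

Three cut points produce four groups, and each group holds about a quarter of the observations.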
13
Q

John Tukey quote

A
  • “The greatest value of a picture is when it forces us to notice what we never expected to see”
14
Q

From the height variable within the heights dataset:

  1. Define a variable male that contains the male heights.
  2. Define a variable female that contains the female heights.
  3. Report the length of each variable.
A
  1. male <- heights$height[heights$sex == "Male"]
  2. female <- heights$height[heights$sex == "Female"]
  3. # Report the length of each variable
    • length(male)
    • length(female)
15
Q

Given the following:

  • male <- heights$height[heights$sex == "Male"]
  • female <- heights$height[heights$sex == "Female"]

Complete the following:

  1. Create two five-element vectors showing the 10th, 30th, 50th, 70th, and 90th percentiles for the heights of each sex. Call these vectors female_percentiles and male_percentiles.
  2. Then create a data frame called df with these two vectors as columns. The column names should be female and male and should appear in that order.
  3. Take a look at the df by printing it.
A
  1. # Create two five-element vectors with the 10th, 30th, 50th, 70th, and 90th percentiles for the heights of each sex
    • male_percentiles <- quantile(male, seq(.10, .90, .20))
    • female_percentiles <- quantile(female, seq(.10, .90, .20))
  2. # Then create a data frame called df with these two vectors as columns. The column names should be female and male and should appear in that order.
    • df <- data.frame(female = female_percentiles, male = male_percentiles)
  3. # Take a look at df by printing it
    • df