r_distributions Flashcards
Why is it difficult to summarize data into a visualization when the data is numeric?
- When data is not categorical, reporting the frequency of each unique entry is not an effective summary since most entries are unique
- For example, in a dataset where students report height:
- only one student reported a height of 68.503937 inches
- only one student reported a height of 68.8976377952 inches
- A more useful method is to define a distribution for numerical data that reports the proportion of the data below a value A for all possible values of A
- This function is called a cumulative distribution function or CDF
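A minimal sketch of the CDF idea in R, using simulated heights rather than the real dataset (the simulated values and the names `x`, `a`, and `F_hat` are assumptions for illustration). Base R's `ecdf` returns a function that reports the proportion of the data at or below a given value:

```r
# Simulated heights in inches (illustrative values, not real data)
set.seed(1)
x <- rnorm(500, mean = 69, sd = 3)

# ecdf() returns a function: F_hat(a) is the proportion of x at or below a
F_hat <- ecdf(x)

a <- 70
F_hat(a)        # proportion of heights <= 70
mean(x <= a)    # the same proportion computed directly

# The proportion between two values is a difference of CDF values
F_hat(72) - F_hat(69)
mean(x > 69 & x <= 72)
```

Because the CDF is defined for every value, it summarizes the whole dataset even when nearly every entry is unique.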
What is the key to creating smooth density plots?
- Very small bin widths
- As the bins get smaller, the rough histogram bars/peaks are smoothed out
- eventually, once the bins are small enough, the edges are completely gone
What is an important difference between histograms and smooth density plots?
- Histograms use a count scale
- Smooth density plots use a frequency (density) scale, so the area under the curve adds up to 1
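The difference in scale can be checked numerically. The sketch below assumes simulated heights and an arbitrary choice of about 20 bins; `hist` and `density` are base R functions, and all variable names are illustrative:

```r
# Simulated heights in inches (illustrative values, not real data)
set.seed(1)
x <- rnorm(500, mean = 69, sd = 3)

# Histogram on the count scale: bar heights add up to the number of values
h <- hist(x, breaks = 20, plot = FALSE)
total_count <- sum(h$counts)
total_count                      # equals length(x)

# The same bins on the density scale: bar areas add up to 1
hist_area <- sum(h$density * diff(h$breaks))
hist_area

# Smooth density estimate: the area under the curve is approximately 1
d <- density(x)
curve_area <- sum(d$y) * diff(d$x[1:2])   # simple Riemann sum on the evenly spaced grid
curve_area
```

Calling `plot(h)` and `plot(d)` afterwards draws the two summaries for visual comparison.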
What are some other names for Normal Distribution?
- Bell curve
- Gaussian distribution
What percent of values are within 2 sd of the mean in a normal distribution?
- About 95%
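The figure can be checked with `pnorm`, which gives the CDF of the normal distribution. The helper name `within_k_sd` below is made up for this sketch:

```r
# Proportion of a normal distribution within k standard deviations of the mean
within_k_sd <- function(k) pnorm(k) - pnorm(-k)

within_k_sd(1)   # about 0.68
within_k_sd(2)   # about 0.95
within_k_sd(3)   # about 0.997
```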
What is the equation for the mean (average)?
- average <- sum(x) / length(x)
What is the equation for the standard deviation?
- SD <- sqrt(sum((x - average)^2) / length(x))
- In words: the square root of the average squared difference between the values and the mean (the sum of the squared differences, divided by the length)
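One detail worth knowing when checking this formula in R: the formula on this card divides by the length n, whereas the built-in `sd` function divides by n - 1. A short sketch on simulated values (the data are an assumption for illustration) shows how the two relate:

```r
# Simulated heights in inches (illustrative values, not real data)
set.seed(1)
x <- rnorm(500, mean = 69, sd = 3)

average <- sum(x) / length(x)
SD <- sqrt(sum((x - average)^2) / length(x))   # divides by n, as on the card

n <- length(x)
sd(x)                     # built-in function: divides by n - 1
SD * sqrt(n / (n - 1))    # rescaling the card's SD reproduces sd(x)
```

For a dataset of this size the two values differ only slightly.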
‘x’ represents male heights. Determine the following:
- What proportion of the data is between 69 and 72 inches (taller than 69, but shorter or equal to 72).
- mean(x > 69 & x <= 72)
Suppose you have the following information:
- library(dslabs)
- data(heights)
- x <- heights$height[heights$sex == "Male"]
- avg <- mean(x)
- stdev <- sd(x)
Use the normal approximation to estimate the proportion of the data that is between 69 and 72 inches.
Note: You can only use ‘avg’ and ‘stdev’, and the ‘pnorm’ function
- pnorm(72, avg, stdev) - pnorm(69, avg, stdev)
Suppose you have the following information:
- library(dslabs)
- data(heights)
- x <- heights$height[heights$sex == "Male"]
- exact <- mean(x > 79 & x <= 81)
- Use normal approximation to estimate the proportion of heights between 79 and 81 inches and save it in an object called approx.
- Report how many times bigger the actual proportion is compared to the approximation.
- avg <- mean(x)
- stdev <- sd(x)
- approx <- pnorm(81, avg, stdev) - pnorm(79, avg, stdev)
- approx
- exact / approx
Estimate the proportion of adult men that are 7 feet tall or taller, referred to as seven footers.
- Assume that the heights of adult men in the world are normally distributed with an average of 69 inches and a standard deviation of 3 inches.
- Using this approximation, estimate the proportion of seven footers. Print out your estimate; don’t store it in an object.
- 1 - pnorm(84, 69, 3)
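As a sketch of what this answer computes: 7 feet is 84 inches, which is 5 standard deviations above the assumed mean. The names `avg`, `stdev`, `z`, and `p_upper` are illustrative, and `lower.tail = FALSE` is a standard `pnorm` argument that returns the upper tail directly:

```r
avg <- 69     # assumed mean height in inches (from the card)
stdev <- 3    # assumed standard deviation in inches (from the card)

z <- (84 - avg) / stdev    # 84 inches expressed in standard units: 5

p_upper <- pnorm(84, avg, stdev, lower.tail = FALSE)
p_upper
pnorm(z, lower.tail = FALSE)   # same tail area on the standard normal scale
1 - pnorm(84, avg, stdev)      # the form used on the card
```

The `lower.tail = FALSE` form avoids subtracting a number very close to 1, which can matter for tail areas this small.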
What is a quartile?
- Quantiles are cut points dividing the range of a probability distribution into contiguous intervals with equal probabilities
- There is one fewer quantile than groups created
- Thus, quartiles are three cut points that will divide a dataset into four equal-sized groups
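A minimal sketch of the three cut points in R, assuming simulated heights (the data and variable names are illustrative). `quantile` and `cut` are base R functions:

```r
# Simulated heights in inches (illustrative values, not real data)
set.seed(1)
x <- rnorm(500, mean = 69, sd = 3)

# Three cut points: the 25th, 50th, and 75th percentiles
q <- quantile(x, c(0.25, 0.50, 0.75))
q

# Assign each value to one of the four groups defined by the cut points
groups <- cut(x, breaks = c(-Inf, q, Inf))
group_sizes <- table(groups)
group_sizes     # the four groups are of (nearly) equal size
```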
John Tukey quote
- “The greatest value of a picture is when it forces us to notice what we never expected to see”
From the height variable within the heights dataset:
- Define a variable male that contains the male heights.
- Define a variable female that contains the female heights.
- Report the length of each variable.
- male <- heights$height[heights$sex == "Male"]
- female <- heights$height[heights$sex == "Female"]
- # Report the length of each variable
- length(male)
- length(female)
Given the following:
- male <- heights$height[heights$sex == "Male"]
- female <- heights$height[heights$sex == "Female"]
Complete the following:
- Create two five-row vectors, called female_percentiles and male_percentiles, showing the 10th, 30th, 50th, 70th, and 90th percentiles of the heights for each sex.
- Then create a data frame called df with these two vectors as columns. The column names should be female and male and should appear in that order.
- Take a look at the df by printing it.
- # Create two five-row vectors, male_percentiles and female_percentiles, showing the 10th, 30th, 50th, 70th, and 90th percentiles of the heights for each sex
- male_percentiles <- quantile(male, seq(.10, .90, .20))
- female_percentiles <- quantile(female, seq(.10, .90, .20))
- # Then create a data frame called df with these two vectors as columns. The column names should be female and male and should appear in that order.
- df <- data.frame(female = female_percentiles, male = male_percentiles)
- # Take a look at df by printing it
- df