r_distributions Flashcards

Question 1

Q

Why is it difficult to summarize data into a visualization when the data is numeric?

Answer

A

When data is not categorical, reporting the frequency of each unique entry is not an effective summary, since most entries are unique.
- For example, in a dataset where students report height:
  - only one student reported a height of 68.503937 inches
  - only one student reported a height of 68.8976377952 inches
A more useful method is to define a function that reports the proportion of the data at or below a value a, for all possible values of a.
This function is called the cumulative distribution function, or CDF.
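
As a minimal sketch of this idea, the empirical CDF can be written as a plain proportion or built with base R's ecdf(). The heights here are simulated and the helper name my_cdf is made up for illustration; neither comes from the students' dataset.

```r
# Simulated heights in inches (an assumption, not the real survey data)
set.seed(1)
x <- rnorm(500, mean = 68, sd = 3)

# F(a): proportion of the data at or below a (hypothetical helper)
my_cdf <- function(a) mean(x <= a)

# Base R's ecdf() builds the same function from the data
F_hat <- ecdf(x)

my_cdf(70)   # proportion of heights <= 70
F_hat(70)    # same proportion, computed by ecdf()
```

Unlike a table of unique values, this summary stays readable no matter how many distinct heights were reported.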

Question 2

Q

What is the key to creating smooth density plots?

Answer

A

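Very small bin widths.
As the bins get smaller, the rough histogram bars/peaks are smoothed out.
Eventually, once the bins are small enough, the edges are gone and the curve looks smooth.
This picture assumes far more data than is usually observed, so in practice the smooth curve is an estimate.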

Question 3

Q

What is an important difference between histograms and smooth density plots?

Answer

A

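Histograms use a count scale: the y-axis shows how many values fall in each bin
Smooth density plots use a density scale: the y-axis is scaled so that the area under the curve adds up to 1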
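
The two scales can be compared numerically without drawing anything. This is a sketch on simulated heights (an assumption for illustration), using base R's hist() and density().

```r
set.seed(1)
x <- rnorm(1000, mean = 69, sd = 3)   # simulated heights, for illustration only

# Histogram computed without plotting
h <- hist(x, plot = FALSE)
sum(h$counts)                      # count scale: bar heights add up to length(x)
sum(h$density * diff(h$breaks))    # density scale: bar areas add up to 1

# Smooth density estimate; the area under the curve is approximately 1
d <- density(x)
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)  # trapezoid rule
area
```

Because the density scale does not depend on the sample size, curves from groups of different sizes can be compared directly.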

Question 4

Q

What are some other names for the normal distribution?

Answer

A

Bell curve
Gaussian distribution

Question 5

Q

What percent of values are within 2 sd of the mean in a normal distribution?

Answer

A

About 95%
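
The proportion within a given number of standard deviations can be checked with pnorm(). This is a small sketch; the helper name within_sd is made up for illustration.

```r
# Proportion of a normal distribution within k standard deviations of the mean
within_sd <- function(k) pnorm(k) - pnorm(-k)

within_sd(1)   # about 0.68
within_sd(2)   # about 0.95
within_sd(3)   # about 0.997
```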

Question 6

Q

What is the equation for the mean (average)?

Answer

A

average <- sum(x) / length(x)

Question 7

Q

What is the equation for the standard deviation?

Answer

A

SD <- sqrt( sum( (x-average)^2) / length(x) )
square root
- of the sum
  - of the differences between the values and the mean
    - squared
      - divided by the length
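
A quick check of both formulas on a few made-up values (not from any dataset). One thing worth knowing: R's built-in sd() divides by length(x) - 1 rather than length(x), so it has to be rescaled before comparing it with the formula on this card.

```r
x <- c(66, 68, 69, 70, 72)   # made-up heights, for illustration only

average <- sum(x) / length(x)
SD <- sqrt(sum((x - average)^2) / length(x))   # divides by n, as on the card

all.equal(average, mean(x))                    # the mean formula matches mean()

# sd() divides by n - 1, so rescale it before comparing with SD
n <- length(x)
all.equal(SD, sd(x) * sqrt((n - 1) / n))
```

For large n the two versions of the standard deviation are practically the same.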

Question 8

Q

‘x’ represents male heights. Determine the following:

What proportion of the data is between 69 and 72 inches (taller than 69, but shorter than or equal to 72)?

Answer

A

mean( (x > 69) & (x <= 72))

Question 9

Q

Suppose you have the following information:

library(dslabs)
data(heights)
x <- heights$height[heights$sex == "Male"]
avg <- mean(x)
stdev <- sd(x)

Use the normal approximation to estimate the proportion of the data that is between 69 and 72 inches.

Note: You can only use ‘avg’ and ‘stdev’, and the ‘pnorm’ function

Answer

A

pnorm(72, avg, stdev) - pnorm(69, avg, stdev)
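
To see how the approximation compares with the proportion computed straight from the data, here is a sketch that uses simulated values as a stand-in for the male heights (an assumption, so the block runs without the dslabs package). The names actual and approximation are illustrative.

```r
set.seed(1)
x <- rnorm(800, mean = 69.3, sd = 3.6)   # simulated stand-in for male heights

avg <- mean(x)
stdev <- sd(x)

actual <- mean(x > 69 & x <= 72)                                 # proportion in the data
approximation <- pnorm(72, avg, stdev) - pnorm(69, avg, stdev)   # normal approximation

c(actual = actual, approximation = approximation)
```

When the data are close to normal, the two numbers should be close to each other.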

Question 10

Q

Suppose you have the following information:

library(dslabs)
data(heights)
x <- heights$height[heights$sex == "Male"]
exact <- mean(x > 79 & x <= 81)

Use the normal approximation to estimate the proportion of heights between 79 and 81 inches and save it in an object called approx.
Report how many times bigger the actual proportion (exact) is compared to the approximation.

Answer

A

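avg <- mean(x)
stdev <- sd(x)

# Normal approximation to the proportion between 79 and 81 inches
approx <- pnorm(81, avg, stdev) - pnorm(79, avg, stdev)
approx
# Ratio of the actual proportion, exact = mean(x > 79 & x <= 81), to the approximation
exact / approx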

Question 11

Q

First, we will estimate the proportion of adult men that are 7 feet tall or taller.
Assume that the heights of adult men in the world are normally distributed with an average of 69 inches and a standard deviation of 3 inches.
Using this approximation, estimate the proportion of adult men that are 7 feet tall or taller, referred to as seven footers. Print out your estimate; don’t store it in an object.

Answer

A

1 - pnorm(84, 69, 3)
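
The same tail probability can also be requested directly with pnorm()'s lower.tail argument. This is a sketch of equivalent ways to write it; the name p_upper is illustrative.

```r
# Upper-tail probability beyond 84 inches under a normal with mean 69 and sd 3
p_upper <- pnorm(84, mean = 69, sd = 3, lower.tail = FALSE)
p_upper

# Same quantity via the complement, as on the card
1 - pnorm(84, 69, 3)

# 84 inches is (84 - 69) / 3 = 5 standard deviations above the mean
pnorm(5, lower.tail = FALSE)
```

Asking for the upper tail directly avoids subtracting a number very close to 1, which matters for probabilities far out in the tail.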

Question 12

Q

What is a quartile?

Answer

A

Quantiles are cut points dividing the range of a probability distribution into contiguous intervals with equal probabilities
There is one fewer cut point than groups created
Thus, quartiles are the three cut points that divide a dataset into four equal-sized groups
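
A short sketch of quartiles in R using quantile(), with simulated heights standing in for real data (an assumption for illustration).

```r
set.seed(1)
x <- rnorm(1000, mean = 69, sd = 3)   # simulated heights, for illustration only

# Quartiles: the 25th, 50th, and 75th percentiles
q <- quantile(x, probs = c(0.25, 0.50, 0.75))
q

# The three cut points split the data into four groups of roughly equal size
groups <- cut(x, breaks = c(-Inf, unname(q), Inf))
table(groups)
```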

Question 13

Q

John Tukey quote

Answer

A

“The greatest value of a picture is when it forces us to notice what we never expected to see”

Question 14

Q

From the height variable within the heights dataset:

Define a variable male that contains the male heights.
Define a variable female that contains the female heights.
Report the length of each variable.

Answer

A

library(dslabs)
data(heights)
male <- heights$height[heights$sex == "Male"]
female <- heights$height[heights$sex == "Female"]
# Report the length of each variable
length(male)
length(female)

Question 15

Q

Given the following:

male <- heights$height[heights$sex == "Male"]
female <- heights$height[heights$sex == "Female"]

Complete the following:

Create two five-element vectors showing the 10th, 30th, 50th, 70th, and 90th percentiles of the heights of each sex. Call these vectors female_percentiles and male_percentiles.
Then create a data frame called df with these two vectors as columns. The column names should be female and male and should appear in that order.
Take a look at the df by printing it.

Answer

A

# Create two five-element vectors with the 10th, 30th, 50th, 70th, and 90th percentiles of the heights of each sex
male_percentiles <- quantile(male, seq(0.10, 0.90, 0.20))
female_percentiles <- quantile(female, seq(0.10, 0.90, 0.20))
# Create a data frame called df with these two vectors as columns, named female and male, in that order
df <- data.frame(female = female_percentiles, male = male_percentiles)
# Take a look at df by printing it
df
