r_dplyr Flashcards
What is dplyr and what functions does it include?
- a grammar of data manipulation, providing a consistent set of verbs
- Includes:
- mutate()
- select()
- filter()
- summarise()
- arrange()
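As a rough illustration of how these verbs chain together, here is a minimal sketch on a small made-up data frame. The `people` table and its columns are hypothetical and are not one of the datasets used in these cards.

```r
library(dplyr)

# Hypothetical toy data, invented for this example
people <- data.frame(
  name   = c("Ann", "Bob", "Cy", "Dee"),
  sex    = c("Female", "Male", "Male", "Female"),
  height = c(64, 70, 72, 66)  # inches
)

result <- people %>%
  mutate(height_cm = height * 2.54) %>%  # mutate(): add a new column
  select(name, sex, height_cm) %>%       # select(): keep chosen columns
  filter(sex == "Male") %>%              # filter(): keep matching rows
  arrange(desc(height_cm))               # arrange(): sort rows

result

# summarise() collapses all rows into a single row of statistics
overall <- people %>% summarise(average = mean(height))
overall
```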
Store within the variable ‘s’:
- mean(height) for “Male”
- sd(height) for “Male”
s <- heights %>%
  filter(sex == "Male") %>%
  summarize(average = mean(height), standard_deviation = sd(height))
s
Given the following, print out the “average” and “standard_deviation”:
s <- heights %>%
  filter(sex == "Male") %>%
  summarize(average = mean(height), standard_deviation = sd(height))
- s$average
- s$standard_deviation
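The `$` accessor works here because summarize() returns a data frame, so each summary is an ordinary column. A small sketch, using a hypothetical stand-in (`heights_toy`) for the heights data:

```r
library(dplyr)

# Hypothetical stand-in for the heights data
heights_toy <- data.frame(
  sex    = c("Male", "Male", "Female", "Male"),
  height = c(68, 70, 64, 72)
)

s <- heights_toy %>%
  filter(sex == "Male") %>%
  summarize(average = mean(height), standard_deviation = sd(height))

is.data.frame(s)      # TRUE: the result is a one-row data frame
s$average             # 70
s$standard_deviation  # 2
```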
For the murders dataset, calculate the us_murder_rate (total murders divided by total population, expressed per 100,000 people)
us_murder_rate <- murders %>%
  summarize(rate = sum(total) / sum(population) * 100000)
Calculate the mean and sd of “height” within the heights dataset, grouping by “sex”
heights %>%
  group_by(sex) %>%
  summarize(average = mean(height), standard_deviation = sd(height))
Calculate the median_rate for the “murder_rate” variable within the murders dataset, grouped by “region”
murders %>%
  group_by(region) %>%
  summarize(median_rate = median(murder_rate))
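Several cards rely on a murder_rate column, which presumably has to be added with mutate() before it can be used. The sketch below assumes the rate is defined per 100,000 people, as in the us_murder_rate card, and uses a hypothetical toy table (`murders_toy`) in place of the real dataset.

```r
library(dplyr)

# Hypothetical toy version of the murders data
murders_toy <- data.frame(
  state      = c("A", "B", "C", "D"),
  region     = c("South", "South", "West", "West"),
  population = c(1000000, 2000000, 500000, 4000000),
  total      = c(50, 40, 5, 120)
)

# Add the per-100,000 rate as a new column (assumed definition)
murders_toy <- murders_toy %>%
  mutate(murder_rate = total / population * 100000)

by_region <- murders_toy %>%
  group_by(region) %>%
  summarize(median_rate = median(murder_rate))

by_region
```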
Order the “murders” dataset by “population”, and then print the first five rows
murders %>% arrange(population) %>% head(5)
Order the “murders” dataset by “murder_rate” in descending order, and then print the first five rows
murders %>% arrange(desc(murder_rate)) %>% head(5)
Order the “murders” dataset by “region” first and then by “murder_rate” second, and then print the first five rows
murders %>% arrange(region, murder_rate) %>% head(5)
Show the top 10 murder rates within the “murders” dataset (not ordered)
murders %>% top_n(10, murder_rate)
Show the top 10 murder rates from the “murders” dataset in descending order
murders %>% top_n(10, murder_rate) %>% arrange(desc(murder_rate))
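In dplyr 1.0.0 and later, top_n() is marked as superseded in favour of slice_max(), which also returns the selected rows sorted from highest to lowest. A minimal comparison on hypothetical toy data (`murders_toy` is invented for this example):

```r
library(dplyr)

# Hypothetical toy data with the rate already computed
murders_toy <- data.frame(
  state       = c("A", "B", "C", "D", "E"),
  murder_rate = c(5, 2, 1, 3, 4)
)

# top_n() keeps the rows with the largest values, in their original row order
top_unordered <- murders_toy %>% top_n(3, murder_rate)

# slice_max() keeps the same rows, sorted from highest to lowest
top_ordered <- murders_toy %>% slice_max(murder_rate, n = 3)

top_unordered
top_ordered
```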
Filter the NHANES dataset so that:
- assign this new data frame to the object “tab”
- only “ 20-29” year old “females” are included
- save the average and standard deviation of systolic blood pressure (BPSysAve) as average and standard_deviation
- return the average as a numeric value (not an object)
tab <- NHANES %>%
  filter(AgeDecade == " 20-29", Gender == "female") %>%
  summarize(average = mean(BPSysAve, na.rm = TRUE),
            standard_deviation = sd(BPSysAve, na.rm = TRUE)) %>%
  .$average
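An alternative to `.$average` is dplyr's pull(), which extracts a single column as a plain vector. A sketch on hypothetical toy data (`nhanes_toy` only mimics the column names used in this card):

```r
library(dplyr)

# Hypothetical toy data mimicking the NHANES column names
nhanes_toy <- data.frame(
  Gender    = c("female", "female", "male", "female"),
  AgeDecade = c(" 20-29", " 20-29", " 20-29", " 30-39"),
  BPSysAve  = c(110, NA, 120, 115)
)

avg <- nhanes_toy %>%
  filter(AgeDecade == " 20-29", Gender == "female") %>%
  summarize(average = mean(BPSysAve, na.rm = TRUE)) %>%
  pull(average)  # returns a numeric vector rather than a data frame

avg              # 110
is.numeric(avg)  # TRUE
```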
Filter the NHANES dataset so that:
- assign this new data frame to the object “tab”
- only “ 20-29” year old “females” are included
- save the min and max of systolic blood pressure (BPSysAve) as “min” and “max”
tab <- NHANES %>%
  filter(AgeDecade == " 20-29" & Gender == "female") %>%
  summarize(min = min(BPSysAve, na.rm = TRUE), max = max(BPSysAve, na.rm = TRUE))
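The na.rm = TRUE argument in these NHANES cards matters because, by default, base R summary functions return NA when any input value is missing. A short sketch with hypothetical readings:

```r
# Hypothetical blood pressure readings with one missing value
bp <- c(110, NA, 120)

with_na    <- mean(bp)                # NA: the missing value propagates
without_na <- mean(bp, na.rm = TRUE)  # 115: missing value dropped first
lowest     <- min(bp, na.rm = TRUE)   # 110
highest    <- max(bp, na.rm = TRUE)   # 120
```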
Using the NHANES dataset:
- Use the functions filter, group_by, summarize, and the pipe %>% to compute the average and standard deviation of systolic blood pressure for females for each age group separately.
- Within summarize, save the average and standard deviation of systolic blood pressure (BPSysAve) as average and standard_deviation.
NHANES %>%
  filter(Gender == "female") %>%
  group_by(AgeDecade) %>%
  summarize(average = mean(BPSysAve, na.rm = TRUE), standard_deviation = sd(BPSysAve, na.rm = TRUE))
Using the NHANES dataset:
- Compute the average and standard deviation for each value of Race1 for males in the age decade 40-49.
- Order the resulting table from lowest to highest average systolic blood pressure.
- Use the functions filter, group_by, summarize, arrange, and the pipe %>% to do this in one line of code.
- Within summarize, save the average and standard deviation of systolic blood pressure as average and standard_deviation.
NHANES %>%
  filter(Gender == "male", AgeDecade == " 40-49") %>%
  group_by(Race1) %>%
  summarize(average = mean(BPSysAve, na.rm = TRUE), standard_deviation = sd(BPSysAve, na.rm = TRUE)) %>%
  arrange(average)