tidyverse Flashcards

Question

Give the column sums of a data frame as a vector, as an example of map(). (Of course, there is an easier way to do this: colSums().)

Answer 1

df |> map_dbl(sum)
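As a quick check of the card above, here is a minimal sketch comparing map_dbl(sum) with the base R shortcut colSums(). The data frame df and its columns a and b are made up for illustration and are not part of the original card.

```r
library(purrr)

# Hypothetical example data (assumed for illustration only)
df <- data.frame(a = c(1, 2, 3), b = c(10, 0, 6))

# map_dbl() applies sum() to each column and returns a named double vector
sums_map <- df |> map_dbl(sum)

# The "easier way": base R column sums
sums_base <- colSums(df)
```

Both results are named numeric vectors with one entry per column, so either can be used wherever a plain vector of column totals is needed.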

Answer 2

group_by() uses data-masking, not tidy-selection. We can work around that problem by using the handy pick() function, which allows you to use tidy-selection inside data-masking functions:
df %>% count(pick(starts_with("z")))
#> # A tibble: 3 × 3
count_missing <- function(df, group_vars, x_var) {
  df |>
    group_by(pick({{ group_vars }})) |>
    summarize(
      n_miss = sum(is.na({{ x_var }})),
      .groups = "drop"
    )
}

Answer 3

apropos("replace")

Answer 4

You want apply (see the docs for it). apply(var,1,fun) will apply to rows, apply(var,2,fun) will apply to columns.
> a
  c.1..2..3. c.10..0..6.
1          1          10
2          2           0
3          3           6
> apply(a,1,min)
[1] 1 0 3
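A minimal sketch of the row/column distinction, using the same values as the card but with made-up column names x and y (an assumption for readability):

```r
# Same values as the card's data frame, with hypothetical column names
a <- data.frame(x = c(1, 2, 3), y = c(10, 0, 6))

# MARGIN = 1: apply the function to each row
row_min <- apply(a, 1, min)

# MARGIN = 2: apply the function to each column
col_min <- apply(a, 2, min)
```

Note that apply() converts the data frame to a matrix first, so it is best suited to data frames whose columns all have the same type.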

Answer 5

paramlist <- list(test.this = .20, vec.test = c("hi","bye"))
for (ii.param in seq_along(paramlist)) {
  assign(x = names(paramlist)[[ii.param]], value = paramlist[[ii.param]])
}
This assigns the value .20 to test.this and c("hi","bye") to vec.test.
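One possible alternative to the assign() loop is base R's list2env(), which copies every named element of a list into an environment in one call. The sketch below uses a fresh environment named env (a hypothetical name) so that it does not clutter the global workspace.

```r
paramlist <- list(test.this = 0.20, vec.test = c("hi", "bye"))

# Create a separate environment (hypothetical name) to hold the variables
env <- new.env()

# Copy each named list element into the environment as a variable
list2env(paramlist, envir = env)

# The variables can now be read back by name
value_one <- get("test.this", envir = env)
value_two <- env$vec.test
```

Passing envir = environment() instead would create the variables in the calling environment, which is closer to what the loop on the card does.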

Answer 6

start.time <- proc.time()
# ... code to be timed (here, the data download) goes between these two lines ...
cat("It took", as.numeric(getElement(proc.time(),"elapsed") - getElement(start.time,"elapsed")), "seconds to download the data.\n")
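The same measurement can also be sketched with base R's system.time(), which wraps the timed expression directly. The Sys.sleep() call below is only a stand-in for the real download step, which the card does not show.

```r
# Time an expression; Sys.sleep() is a placeholder for the real work
timing <- system.time({
  Sys.sleep(0.1)
})

# Extract the wall-clock time in seconds
elapsed <- timing[["elapsed"]]

cat("It took", elapsed, "seconds to download the data.\n")
```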

Answer 7

ymd("2017-01-31"), ymd_hms("2017-01-31 20:11:59"), year(), wday(). Use wday(x, label = TRUE) to get the weekday name (abbreviated by default; add abbr = FALSE to get "Tuesday").
expand_dates <- function(df) {
  df |>
    mutate(
      across(where(is.Date), list(year = year, month = month, day = mday))
    )
}

Answer 8

Durations (an exact number of seconds), periods (human units such as weeks and months), intervals (a start and an end point).
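The difference between the three is easiest to see around a daylight-saving change. The sketch below assumes lubridate is installed and that the America/New_York time zone is available on the system; the dates are chosen only for illustration.

```r
library(lubridate)

# 1 pm on the day before US clocks moved forward in 2016 (illustrative date)
one_pm <- ymd_hms("2016-03-12 13:00:00", tz = "America/New_York")

# Duration: exactly 86400 seconds later, so the clock time shifts by an hour
plus_duration <- one_pm + ddays(1)

# Period: one calendar day later, so the clock time stays at 1 pm
plus_period <- one_pm + days(1)

# Interval: tied to a specific start and end; dividing by a duration
# gives the exact length in that unit
span <- interval(ymd("2017-01-01"), ymd("2017-01-31"))
n_days <- span / ddays(1)
```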

Answer 9

mtcars |> summarize(across(everything(), n_distinct))

Answer 10

use break, not break()
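A minimal sketch of break inside a for loop; the loop bounds and the variable name total are arbitrary choices for illustration.

```r
total <- 0
for (i in 1:10) {
  # break is a keyword, written without parentheses
  if (i > 3) break
  total <- total + i
}
# The loop stops before i = 4 is added, so total is 1 + 2 + 3
```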

Answer 11

as.formula(), or reformulate() to build a formula from character vectors:
formula.gbm.without.msa <- reformulate(response = "weekpay", termlabels = predictors.including.msas)
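A small sketch of both approaches side by side. The predictor names below are hypothetical; the card's predictors.including.msas vector is not shown in the source.

```r
# Hypothetical predictor names (the original vector is not given)
predictors <- c("age", "educ", "msa")

# reformulate() builds the formula from a response and a vector of term labels
f_reformulate <- reformulate(termlabels = predictors, response = "weekpay")

# as.formula() builds it from a pasted string
f_string <- as.formula(paste("weekpay ~", paste(predictors, collapse = " + ")))
```

reformulate() avoids manual string pasting, which tends to be less error-prone when the list of predictors is long.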

Answer 12

> x2 <- list(c(5,6,7))
> x2
[[1]]
[1] 5 6 7
> for (f2 in c("mean","min")){
+   print(do.call(f2,x2))
+ }
[1] 6
[1] 5

Answer 13

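mtcars %>% ggplot(mapping=aes(x=wt, y=mpg)) + geom_point()
With colors for different values of cyl (cyl is numeric, so this gives a continuous color scale; use color=factor(cyl) for discrete colors):
mtcars %>% ggplot(mapping=aes(x=wt, y=mpg, color=cyl)) + geom_point()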

Answer 14

labs(x="mpg", y="hwy", title="A Nice title")

Answer 15

ggplot(mpg, aes(x = drv, fill = drv)) + geom_bar() Note that fill is here set to be equal to the same variable as x.

Answer 16

diamonds %>% ggplot(aes(x=color, fill = cut)) + geom_bar(position="dodge")

Answer 17

geom_freqpoly

Answer 18

facet_wrap() draws one plot per level of the faceting variable (here 5, one for each level of marstat):
graphtrain %>% ggplot(aes(x=age, y=weekpay)) + geom_point() + facet_wrap(~ marstat)

Answer 19

facet_grid() draws a grid of panels for the levels of two variables (one variable defines the rows, the other the columns).

Answer 20

ggplot(data = diamonds, mapping = aes(x = cut)) + geom_bar() + coord_flip()

Answer 21

ggplot(smaller, aes(x = carat, y = price)) + geom_bin2d()

Answer 22

ggplot(diamonds, aes(x = cut, y = color)) + geom_count()

Answer 23

ggplot(diamonds, aes(x = y)) + geom_histogram(binwidth = 0.5) + coord_cartesian(ylim = c(0, 50))

Answer 24

ggsave() defaults to saving the last plot: ggsave(filename, device="pdf"). Alternatively, open a pdf() device, print() the ggplot object, and close the device with dev.off().

Answer 25

> x <- c("hid", "hai", "hirsute", "bla")
> grep(pattern="hi",x)
[1] 1 3
> grep(pattern="hi",x, value=TRUE)
[1] "hid"     "hirsute"
Use grep(pattern, x, value = TRUE) to return the matching values instead of their indices.

Answer 26

list_flatten, in purrr

Answer 27

# Extract the first element of each list item
sapply(templist[1:3], FUN="[",1)
# Extract the id component of each list item
sapply(tt.cases, FUN= function(x){ x$id})

Answer 28

x <- list(CA = c(1,2), NV = c(3,4), AZ = c(5,6))
state = "CA"
eval(parse(text= paste("y<- x$", state, sep="")))
This actually builds and evaluates: y <- x$CA
REMEMBER to include the 'text=' part!!
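For this particular case (looking up a list element by a name stored in a variable), double-bracket indexing gives the same result without building code from strings. The sketch below shows the two side by side; the names y_eval and y_index are made up for the comparison.

```r
x <- list(CA = c(1, 2), NV = c(3, 4), AZ = c(5, 6))
state <- "CA"

# The card's approach: build the expression as text, then parse and evaluate it
y_eval <- eval(parse(text = paste0("x$", state)))

# Alternative: index the list by the name held in the variable
y_index <- x[[state]]
```

eval(parse(text = ...)) remains useful when the whole expression, not just a name, has to be generated at run time.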

Answer 29

Use pick() within something like mutate() so you can use the things you can do with select() (mutate() uses data-masking, select() uses tidy-selection). pick() provides a way to easily select a subset of columns from your data using select() semantics while inside a "data-masking" function like mutate() or summarise(). pick() returns a data frame containing the selected columns for the current group.
my_group_by <- function(data, cols) {
  group_by(data, pick({{ cols }}))
}
df %>% my_group_by(c(x, starts_with("z")))

Answer 30

Or maybe use quote() with ggplot2?
plot(x, y,
     xlab = expression(paste("Text here ", hat(x), " here ", z^rho, " and here")),
     ylab = expression(paste("Here is some text of ", phi^{rho})),
     main = "Expressions with Text")

Answer 31

household
#> # A tibble: 5 × 5
#>   family dob_child1 dob_child2 name_child1 name_child2
#> 1      1 1998-11-26 2000-01-29 Susan       Jose
#> 2      2 1996-06-22 NA         Mark        NA
#> 3      3 2002-07-11 2004-04-05 Sam         Seth
#> 4      4 2004-10-10 2009-08-27 Craig       Khai
#> 5      5 2000-12-05 2005-02-28 Parker      Gracie
Note that we have two pieces of information (or values) for each child: their name and their dob (date of birth). These need to go into separate columns in the result. Again we supply multiple variables to names_to, using names_sep to split up each variable name. Note the special name .value: this tells pivot_longer() that that part of the column name specifies the “value” being measured (which will become a variable in the output).
household %>%
  pivot_longer(
    cols = !family,
    names_to = c(".value", "child"),
    names_sep = "_",
    values_drop_na = TRUE
  )

Answer 32

pivot_wider() is the opposite of pivot_longer(): it makes a dataset wider by increasing the number of columns and decreasing the number of rows. It’s relatively rare to need pivot_wider() to make tidy data, but it’s often useful for creating summary tables for presentation, or data in a format needed by other tools.
fish_encounters
#> # A tibble: 114 × 3
#>   fish  station  seen
#> 1 4842  Release     1
#> 2 4842  I80_1       1
#> 3 4842  Lisbon      1
#> 4 4842  Rstr        1
This is converted to wide format (one column per station) with:
fish_encounters %>%
  pivot_wider(
    names_from = station,
    values_from = seen
  )

Answer 33

Type I error: false positive, i.e. mistaken rejection of the null hypothesis.
Type II error: false negative, i.e. mistaken failure to reject the null hypothesis.

Answer 34

Sensitivity: rate of getting a positive result, given that the case is actually positive.
Specificity: rate of getting a negative result, given that the case is actually negative.
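A worked sketch of these rates computed from confusion-matrix counts. The four counts are invented for illustration.

```r
# Hypothetical counts (assumed for illustration only)
tp <- 40  # true positives
fn <- 10  # false negatives
tn <- 45  # true negatives
fp <- 5   # false positives

# Sensitivity: share of actual positives that test positive
sensitivity <- tp / (tp + fn)

# Specificity: share of actual negatives that test negative
specificity <- tn / (tn + fp)

# The false positive rate is the complement of specificity
false_positive_rate <- fp / (fp + tn)
```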

Answer 35

For loops are often slow because a copy of the object is made on each iteration, e.g. when subtracting the median from every column of a data frame. Vectorized functions are also faster because they are implemented in C.
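The median example mentioned above can be sketched both ways. The data frame is made up, and this sketch only shows that the two approaches agree; it does not measure speed.

```r
# Hypothetical data (assumed for illustration only)
df <- data.frame(a = c(1, 2, 9), b = c(4, 6, 8))

# Loop version: modify one column at a time
centered_loop <- df
for (j in seq_along(centered_loop)) {
  centered_loop[[j]] <- centered_loop[[j]] - median(centered_loop[[j]])
}

# Non-loop version: compute all medians, then subtract them column-wise
meds <- vapply(df, median, numeric(1))
centered_vec <- sweep(as.matrix(df), 2, meds)
```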

Answer 36

sample(x, size, replace= FALSE)
rnorm(10)
sample.int(100,10, replace= FALSE)
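A short sketch of the three calls with a fixed seed so the draws are reproducible. The seed value and the variable names are arbitrary.

```r
set.seed(42)  # arbitrary seed, only to make the draws reproducible

# Draw 10 distinct values from 1..100
draw <- sample(1:100, size = 10, replace = FALSE)

# Same idea, drawing directly from the integers 1..100
draw_int <- sample.int(100, 10, replace = FALSE)

# Ten draws from a standard normal distribution
z <- rnorm(10)
```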

Answer 37

x[order(x$B),]
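A minimal sketch of sorting a data frame by one column with order(). The data frame and its columns A and B are invented for illustration.

```r
# Hypothetical data (assumed for illustration only)
x <- data.frame(A = c("p", "q", "r"), B = c(3, 1, 2))

# order() returns the row indices that sort B in ascending order
sorted <- x[order(x$B), ]

# Descending order
sorted_desc <- x[order(x$B, decreasing = TRUE), ]
```

The trailing comma inside the brackets matters: it selects rows and keeps all columns.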

Answer 38

These two are the same:
# We can either use
starwars |> summarize(across(where(is.character), ~ length(unique(.x))))
# Or we can define our anonymous function like this
starwars |> summarize(across(where(is.character), function(x) length(unique(x))))

Answer 39

Note that this is similar to using summarize() to get the maximum delay for each group, but it keeps the information in all columns. Also note that it returns two or more rows for a group if there are ties.
flights |>
  group_by(dest) |>
  slice_max(arr_delay, n = 1) |>
  relocate(dest)
#> # A tibble: 108 × 19
#> # Groups:   dest [105]
#>   dest   year month   day dep_time sched_dep_time dep_delay arr_time
#> 1 ABQ    2013     7    22     2145           2007        98      132
#> 2 ACK    2013     7    23     1139            800       219     1250
#> 3 ALB    2013     1    25      123           2000       323      229
#> 4 ANC    2013     8    17     1740           1625        75     2042
#> 5 ATL    2013     7    22     2257            759       898      121
#> 6 AUS    2013     7    10     2056           1505       351     2347
#> # ℹ 102 more rows
#> # ℹ 11 more variables: sched_arr_time, arr_delay, …

Answer 40

parse_number() is a handy function that will extract the first number from a string, ignoring all other text.

Answer 41

billboard_longer |> ggplot(aes(x = week, y = rank, group = track)) + geom_line(alpha = 0.25) + scale_y_reverse()

Answer 42

ggplot(mpg, aes(x = displ, y = hwy)) + geom_point(aes(color = class)) + geom_smooth()

Answer 43

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_point(
    data = mpg |> filter(class == "2seater"),
    color = "red"
  ) +
  geom_point(
    data = mpg |> filter(class == "2seater"),
    shape = "circle open", size = 3, color = "red"
  )

Answer 44

df %>% relocate(a, .after = c)
df %>% relocate(all_of(c("n_group","gender","group")), .before = sleepy)

(69 cards)