Data Wrangling Flashcards
General Form :
DT[i, j, by]
Take DT, subset rows using i, then calculate j grouped by by
DT[3:5,] or DT[3:5]
Subsetting rows : Selects third to 5th row
DT[ V2 == “A”]
DT[ V2 %in% c(“A”, “C”)]
Selects all rows that have value A in column V2.
Select all rows that have the value A or C in column V2.
DT[,V2]
Select 1 column in j. Column V2 is returned as a vector.
[1] “A” “B” “C” “A” “B” “C” …
DT[,.(V2,V3)]
Select several columns in j.Columns V2 and V3 are returned as a data.table.
V2 V3
1: A -1.1727
2: B -0.3825
3: C -1.0604
.() or list()
.() is an alias to list(). If .() is used, the returned value is a data.table. If .() is not used, the result is a vector.
DT[,sum(V1)]
Call functions in j.Returns the sum of all elements of column V1 in a vector.
[1] 18
DT[,.(sum(V1),sd(V3))]
Computing on several columns. Returns the sum of all elements of column V1 and the standard deviation of V3 in a data.table.
V1 V2
1 : 18 0.7634655
DT[,.(Aggregate = sum(V1), Sd.V3 = sd(V3))]
Assigning column names to computed columns.
The same with previous card, but with new names.
Aggregate Sd.V3
1 : 18 0.7634655
DT[,.(V1, Sd.V3 = sd(V3))]
Columns gets recycled if different length. Selects column V1, and compute std. dev. of V3, which returns a single value and gets recycled.
V1 Sd.V3 1: 1 0.7634655 2: 2 0.7634655 ... 11: 1 0.7634655 12: 2 0.7634655
DT[,{print(V2)
plot(V3)
NULL}]
Multiple expressions can be wrapped in curly braces. Print column V2 and plot V3.
[1] "A" "B" "C" "A" "B" "C" ... #And a plot
DT[,.(V4.Sum = sum(V4)),by=V1]
Doing j by group. Calculates the sum of V4, for every group in V1.
V1 V4.Sum
1:1 36
DT[,.(V4.Sum = sum(V4)),by=.(V1,V2)]
Doing j by several groups using .(). The same with previous cards, but for every group in V1 and V2.
V1 V2 V4.Sum
1: 1 A 8
2: 2 B 10
3: 1 C 12
4: 2 A 14
5: 1 B 16
6: 2 C 18
DT[,.(V4.Sum = sum(V4)),by=sign(V1-1)]
Call functions in by. Calculates the sum of V4, for every group in
sign(V1-1).
sign V4.Sum
1: 0 36
2: 1 42
DT[,.(V4.Sum = sum(V4)),
by=.(V1.01 = sign(V1-1))]
Assigning new column names in by. Same as previous card, but with a new name for the variable we are grouping by.
V1.01 V4.Sum
1: 0 36
2: 1 42
DT[1:5, .(V4.Sum = sum(V4)), by=V1]
Grouping only on a subset by specifying i.Calculates the sum of V4, for every group in V1, after subsetting on the first five rows.
V1 V4.Sum
1: 1 2
2: 9 6
DT[,.N,by=V1]
Using .N to get the total number of observations of each
group. Count the number of rows for every group in V1.
V1 N
1: 1 6
2: 2 6
DT[, V1 := round(exp(V1),2)]
Adding/updating a column by reference using := in one line. Watch out: extra assignment (DT
DT[, c(“V1”,”V2”) := list (round(exp(V1),2), LETTERS [4:6])]
Adding/updating several columns by reference using :=. Column V1 and V2 are updated by what is after :=.
Returns the result invisibly. Column V1 changed as above. Column V2 went from: [1] “A” “B” “C” “A” “B” “C” …to:[1] “D” “E” “F” “D” “E” “F” …
DT[, ‘:=’ (V1 = round(exp(V1),2),
V2 = LETTERS[4:6])] [ ]
Using functional :=. Another way to write the same line as previous card, but easier to write comments side-by-side. Also, when [ ] is added the result is printed to the screen.
Same changes as line above this one, but the result is printed to the screen because of the [ ] at the end of the statement.
DT[, V1 := NULL]
Remove a column instantly using :=. Removes column V1.
Returns the result invisibly. Column V1 became NULL.
DT[, c(“V1”,”V2”) := NULL]
Remove several columns instantly using :=.
Removes columns V1 and V2.
Returns the result invisibly. Column V1 and V2 became NULL.
2 DT[, Cols.chosen := NULL]
Cols.chosen = c(“A”,”B”)
2 Watch out: this deletes the column with column name Cols.chosen.
Wrap the name of a variable which contains column names in parenthesis to pass the contents of that variable to be deleted.
setkey(DT,V2)
Use setkey() to set a key on a DT. The data is sorted on the column we specified by reference. A key is set on column V2.