Week 1 Flashcards

Data Mining

1
Q

Difference between statisticians and machine learning practitioners

A

Statisticians tend to start by making modelling assumptions about how the data is generated. Generally
these assumptions then give a mathematical framework in which to answer specific questions.

Machine learning people tend to treat the mechanism that generates the data as unknown (or
unknowable) and are happy to use any algorithmic model that gets the job done.

2
Q

Key steps in data mining

A
  1. Collect data (or get given it).
  2. Wrangle the data into shape.
  3. Train models (the more the better!)
  4. Choose the best model.
  5. Use the best model for prediction
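
The five steps above can be sketched in base R. This is only an illustration under assumptions not in the card: the built-in `mtcars` data, a 70/30 train/validation split, two linear models as candidates, and validation RMSE as the selection rule.

```r
# Steps 1-2: take a built-in dataset and keep the columns of interest.
cars <- mtcars[, c("mpg", "wt", "hp")]

# Hold out some rows for comparing models (assumed 70/30 split).
set.seed(42)
train_idx <- sample(nrow(cars), size = floor(0.7 * nrow(cars)))
train <- cars[train_idx, ]
valid <- cars[-train_idx, ]

# Step 3: train several candidate models.
models <- list(
  wt_only   = lm(mpg ~ wt, data = train),
  wt_and_hp = lm(mpg ~ wt + hp, data = train)
)

# Step 4: choose the model with the lowest validation RMSE.
rmse <- sapply(models, function(m) {
  sqrt(mean((valid$mpg - predict(m, newdata = valid))^2))
})
best <- models[[which.min(rmse)]]

# Step 5: use the chosen model to predict for a new (hypothetical) car.
new_car <- data.frame(wt = 3.0, hp = 120)
prediction <- predict(best, newdata = new_car)
```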
3
Q

Data wrangling

A

Data wrangling consists of doing everything necessary to get datasets ‘tidy’ and ready for
modelling.

4
Q

Function to choose variables (columns) by name

A

select()
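
A minimal sketch, assuming the card refers to dplyr's `select()`. The `students` table and its column names are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data; names and values are illustrative only.
students <- tibble(
  name  = c("Ana", "Ben", "Cy"),
  age   = c(21, 23, 22),
  score = c(88, 75, 93)
)

# Keep only the name and score columns; all rows are retained.
picked <- select(students, name, score)
```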

5
Q

Function to choose observations (rows) by value

A

filter()
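
A minimal sketch, assuming dplyr's `filter()`. The `students` table and the cutoff of 80 are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data.
students <- tibble(
  name  = c("Ana", "Ben", "Cy"),
  score = c(88, 75, 93)
)

# Keep only the rows where the condition is TRUE; all columns are retained.
high <- filter(students, score > 80)
```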

6
Q

Function to add new variables based on existing variables

A

mutate()
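
A minimal sketch, assuming dplyr's `mutate()`. The `students` table and the new `score_pct` column are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data.
students <- tibble(
  name  = c("Ana", "Ben", "Cy"),
  score = c(88, 75, 93)
)

# Add a new column computed from an existing one; existing columns are kept.
scaled <- mutate(students, score_pct = score / 100)
```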

7
Q

to reduce multiple values down to a single summary.

A

summarise()
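
A minimal sketch, assuming the intended function is dplyr's `summarise()` (dplyr also accepts the spelling `summarize()`). The `sales` table is made up for illustration, and pairing the call with `group_by()` is a common pattern rather than something the card states.

```r
library(dplyr)

# Hypothetical toy data.
sales <- tibble(
  region = c("N", "N", "S"),
  amount = c(10, 20, 5)
)

# Collapse all rows into a single summary row.
total <- summarise(sales, total_amount = sum(amount))

# With group_by(), the result has one summary row per group.
by_region <- sales %>%
  group_by(region) %>%
  summarise(total_amount = sum(amount))
```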

8
Q

changes the order of rows

A

arrange()
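
A minimal sketch, assuming dplyr's `arrange()`. The `students` table is made up for illustration.

```r
library(dplyr)

# Hypothetical toy data.
students <- tibble(
  name  = c("Ana", "Ben", "Cy"),
  score = c(88, 75, 93)
)

# Sort rows by score, ascending by default.
ascending <- arrange(students, score)

# Wrap the column in desc() to sort from highest to lowest.
descending <- arrange(students, desc(score))
```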

9
Q

If you want to rename a column while keeping the other columns

A

rename()
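
A minimal sketch, assuming dplyr's `rename()`, which takes `new_name = old_name`. The `students` table and the name `full_name` are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data.
students <- tibble(
  name  = c("Ana", "Ben", "Cy"),
  age   = c(21, 23, 22),
  score = c(88, 75, 93)
)

# Rename one column; the other columns are kept unchanged and in place.
renamed <- rename(students, full_name = name)
```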

10
Q

Function that removes grouping

A

ungroup()
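
A minimal sketch, assuming dplyr's `group_by()` and `ungroup()`. The `sales` table is made up for illustration.

```r
library(dplyr)

# Hypothetical toy data.
sales <- tibble(
  region = c("N", "N", "S"),
  amount = c(10, 20, 5)
)

# Grouping is stored on the data frame and affects later verbs.
grouped <- group_by(sales, region)

# ungroup() drops the grouping; the rows and columns are unchanged.
ungrouped <- ungroup(grouped)
```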

11
Q

The function that adds a count column instead of summarising

A

add_count()
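
A minimal sketch, assuming dplyr's `add_count()`, shown next to `count()` for contrast. The `sales` table is made up for illustration.

```r
library(dplyr)

# Hypothetical toy data.
sales <- tibble(
  region = c("N", "N", "S"),
  amount = c(10, 20, 5)
)

# count() summarises: one row per region.
counted <- count(sales, region)

# add_count() keeps every row and adds the group size as column n.
with_n <- add_count(sales, region)
```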

12
Q

The functions useful for finding the top (or bottom) few entries

A

slice_min() and slice_max()
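
A minimal sketch, assuming dplyr's `slice_max()` and `slice_min()`. The `students` table is made up for illustration. Note that `n` has to be passed by name.

```r
library(dplyr)

# Hypothetical toy data.
students <- tibble(
  name  = c("Ana", "Ben", "Cy"),
  score = c(88, 75, 93)
)

# The two rows with the highest scores, highest first.
top_two <- slice_max(students, score, n = 2)

# The single row with the lowest score.
bottom_one <- slice_min(students, score, n = 1)
```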

13
Q

Function that can be used to take a random sample

A

slice_sample()
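
A minimal sketch, assuming dplyr's `slice_sample()`. The `students` table and the seed are made up for illustration. Which rows come back depends on the seed, so no particular result is shown.

```r
library(dplyr)

# Hypothetical toy data.
students <- tibble(
  name  = c("Ana", "Ben", "Cy"),
  score = c(88, 75, 93)
)

# Fix the seed so the sample is reproducible between runs.
set.seed(1)

# Draw two rows at random, without replacement by default.
sampled <- slice_sample(students, n = 2)
```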
