Data & Management of Data Flashcards

Question

The approach of replacing outliers with an upper/lower cap involves...

Answer 1

Replacing the data with the upper/lower limit
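
As a rough illustration of capping, the sketch below clamps every value into a fixed range. The function name `cap_outliers`, the sample readings, and the limits 0 and 10 are all assumptions made for the example; in practice the limits are often derived from percentiles or an interquartile-range rule.

```python
def cap_outliers(values, lower, upper):
    """Clamp each value into the closed interval [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]

# Hypothetical readings with two obvious outliers (120 and -40).
readings = [3, 5, 4, 120, 6, -40]

# The limits 0 and 10 are illustrative assumptions, not a recommended rule.
capped = cap_outliers(readings, lower=0, upper=10)
print(capped)
```

The number of observations is unchanged; only the extreme values are pulled back to the nearest limit.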

Answer 2

Applying a logarithm to the outlier data
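
A minimal sketch of a log transform, assuming non-negative data. The helper name `log_transform` and the income figures are hypothetical; `math.log1p` computes log(1 + x), which is one common choice because it tolerates zeros.

```python
import math

def log_transform(values):
    """Apply log(1 + x) to each value; assumes every value is >= 0."""
    return [math.log1p(v) for v in values]

# Hypothetical incomes (in thousands) with one extreme value.
incomes = [20, 35, 50, 40, 5000]
transformed = log_transform(incomes)

# The extreme value is still the largest, but the spread is far smaller.
print(max(incomes) - min(incomes))
print(max(transformed) - min(transformed))
```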

Answer 3

Deleting any observations with outliers

Answer 4

A discrepancy or deviation in the data from the actual values

Answer 5

Unavoidable fluctuations in the data e.g. extraneous noise

Answer 6

Repeatable errors that can be associated with a cause

Answer 7

Rescaling data to have a mean of 0 and standard deviation of 1.
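
This rescaling can be sketched with the standard library alone. The function name `standardise` and the sample heights are assumptions for illustration; the population standard deviation is used here, though some tools use the sample version instead.

```python
import statistics

def standardise(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # would be 0 if all values were identical
    return [(v - mean) / std for v in values]

# Hypothetical heights in centimetres.
heights_cm = [150, 160, 170, 180, 190]
z_scores = standardise(heights_cm)
print(z_scores)
```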

Answer 8

Rescaling data to have a common scale - typically between 0 and 1.
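
A minimal sketch of rescaling to the range 0 to 1, assuming the data contains at least two distinct values. The name `min_max_scale` and the sample ages are hypothetical.

```python
def min_max_scale(values):
    """Map the smallest value to 0 and the largest to 1, linearly."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]  # assumes hi != lo

# Hypothetical ages.
ages = [18, 30, 42, 66]
scaled = min_max_scale(ages)
print(scaled)
```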

Answer 9

Log transformation

Answer 10

Categorical data is converted into separate columns, with their presence indicated by a binary Boolean value
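
One way to picture this encoding is the sketch below, which builds one column per category and marks presence with 1 and absence with 0. The function name `one_hot` and the colour data are assumptions for the example.

```python
def one_hot(values):
    """Return one dict per row, with a 0/1 flag for every category seen."""
    categories = sorted(set(values))
    return [{c: int(v == c) for c in categories} for v in values]

# Hypothetical categorical column.
colours = ["red", "green", "red", "blue"]
encoded = one_hot(colours)

for row in encoded:
    print(row)
```

Each row has exactly one flag set, so no ordering between the categories is implied.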

Answer 11

Categorical data is converted to a number

Answer 12

Categorical data is converted to a real-world numerical equivalent - e.g. the number of years someone has been in education, so 'high school' = 12, 'bachelor's' = 14, etc.

Answer 13

Continuous data is converted to a discrete or finite version - e.g. height and weight converted to BMI, then stored as BMI groups
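
The BMI example can be sketched as follows. The function name `bmi_group` and the group labels are hypothetical, and the cut-offs (18.5, 25, 30) are the commonly quoted adult BMI bands, assumed here rather than taken from the card.

```python
def bmi_group(weight_kg, height_m):
    """Convert continuous weight and height into a discrete BMI group."""
    bmi = weight_kg / height_m ** 2
    # Cut-offs are an assumption: the commonly quoted adult BMI bands.
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:
        return "healthy"
    if bmi < 30:
        return "overweight"
    return "obese"

# Hypothetical (weight in kg, height in m) pairs.
people = [(50, 1.80), (70, 1.75), (85, 1.75), (95, 1.75)]
groups = [bmi_group(w, h) for w, h in people]
print(groups)
```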

Answer 14

Eliminate noise or fluctuation in the data

Answer 15

Different sections of the dataset are taken, from which a local moving average is sampled and stored
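
A simple moving average can be sketched like this. The function name `moving_average`, the window size, and the sample series are assumptions for illustration; each output value is the mean of one window of consecutive points.

```python
def moving_average(values, window):
    """Average each run of `window` consecutive values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

# Hypothetical series that oscillates around 6.
noisy = [3, 9, 3, 9, 3, 9]
smoothed = moving_average(noisy, window=2)
print(smoothed)
```

The output is shorter than the input by `window - 1` points, because the first full window only becomes available once enough values have been seen.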

Answer 16

Categorical data with a frequency - e.g. the number of people who say that a fruit is the tastiest

Answer 17

Horizontal bar chart, which can be used to add an extra dimension to the data - e.g. male vs female responses

Answer 18

Visualising change over time in a continuous category

Answer 19

Visualising relationships between two variables

Answer 20

Size of scatter points, colour of scatter points, visual representation

Answer 21

Visualising proportionality of data

Answer 22

Visualising frequency of something over time or at different points

Answer 23

A technique to remove some of the dimensions from the data, to allow the model to focus only on important data and to remove noise

Answer 24

The removal or filtering of features from the dataset that are redundant or unnecessary for prediction

Answer 25

Filtering out features using some form of metric

Answer 26

Applying a search to the dataset, looking for redundant features

Answer 27

Methods that are embedded into the working of the model like regularisation or decision tree pruning

Answer 28

A filter method where we remove features with low variance, since they likely contain little information
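
A minimal sketch of a low-variance filter, assuming the data is held as a dict of column name to values. The name `drop_low_variance`, the sample table, and the threshold of 0.1 are all hypothetical.

```python
import statistics

def drop_low_variance(columns, threshold):
    """Keep only the columns whose population variance exceeds `threshold`."""
    return {
        name: vals
        for name, vals in columns.items()
        if statistics.pvariance(vals) > threshold
    }

# Hypothetical table: one column never changes, so it carries no information.
table = {
    "constant_flag": [1, 1, 1, 1],
    "age": [20, 35, 50, 65],
}
kept = drop_low_variance(table, threshold=0.1)
print(list(kept))
```

Variance depends on the scale of each feature, so in practice the threshold usually only makes sense after the features have been put on a comparable scale.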

Answer 29

A wrapper method that trains a set of models, each using a single feature, and selects the best one; it then trains another set of models from the remaining features, repeating iteratively until we have a final set of features
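
One common reading of this procedure is greedy forward selection, sketched below under that assumption. The names `forward_select` and `toy_score`, and the usefulness numbers, are hypothetical; `toy_score` stands in for "train a model on these features and measure its validation performance".

```python
def forward_select(features, score, n_select):
    """Greedily add the feature that most improves `score` at each step."""
    selected = []
    remaining = list(features)
    while remaining and len(selected) < n_select:
        # Try each remaining feature alongside those already chosen.
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical per-feature usefulness, used only to fake a model score.
usefulness = {"age": 0.5, "income": 0.3, "postcode": 0.05}

def toy_score(subset):
    """Stand-in for a real train-and-validate step (an assumption)."""
    return sum(usefulness[f] for f in subset)

chosen = forward_select(usefulness, toy_score, n_select=2)
print(chosen)
```

Because every candidate subset requires its own scoring run, a wrapper method like this is typically more expensive than a filter method.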

Answer 30

By creating a decision tree, we leave behind features with impure leaves, leaving only the most efficient features in the decision process

Answer 31

A method of extracting useful combinations of features that can be mixed together to produce more powerful representations

Answer 32

Methods with linear activation functions

Answer 33

Methods with non-linear activation functions

Answer 34

A linear method where we find an orthogonal coordinate transformation such that every new coordinate is very important
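
This description matches principal component analysis (PCA); assuming that is the intended method, the sketch below handles the two-feature case, where the orthogonal transformation is a single rotation. The function name `pca_2d` and the sample data are hypothetical, and the closed-form angle applies only in two dimensions.

```python
import math
import statistics

def pca_2d(xs, ys):
    """Rotate centred 2-D data so the first axis captures the most variance."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    dx = [x - mx for x in xs]
    dy = [y - my for y in ys]
    n = len(xs)
    # Entries of the 2x2 covariance matrix.
    cxx = sum(a * a for a in dx) / n
    cyy = sum(b * b for b in dy) / n
    cxy = sum(a * b for a, b in zip(dx, dy)) / n
    # Rotation angle that diagonalises the covariance matrix.
    theta = 0.5 * math.atan2(2 * cxy, cxx - cyy)
    c, s = math.cos(theta), math.sin(theta)
    pc1 = [a * c + b * s for a, b in zip(dx, dy)]
    pc2 = [-a * s + b * c for a, b in zip(dx, dy)]
    return pc1, pc2

# Hypothetical, strongly correlated features.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 1.9, 3.2, 3.8, 5.0]
pc1, pc2 = pca_2d(xs, ys)
print(statistics.pvariance(pc1), statistics.pvariance(pc2))
```

Keeping only `pc1` would reduce the data from two dimensions to one while retaining most of its variance.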

Answer 35

We plot our dataset, treating each feature as its own dimension, and find a transformation that reduces the dimensionality of the data while improving accuracy

Answer 36

Non-linear dimensionality reduction methods that aren't suited for the learning process, only data visualisation

Answer 37

They take the distribution of distances between the points in the dataset (call it D), scatter the points randomly across 2 or 3 dimensions, then adjust them iteratively until the distribution of distances in the low-dimensional layout resembles D

Answer 38

UMAP is much faster, adjusting each step slightly to reduce memory and time consumption.

(62 cards)