Data & Management of Data Flashcards

(62 cards)

1
Q

What is data integration?

A

Data integration is the process of combining data from separate sources into a single, coherent data store

2
Q

What does it mean for a data source to be heterogeneous?

A

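It differs from the main store (or from the other sources) in format, structure, or technology, for reasons such as incompatibility of version or file type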

3
Q

What does it mean for a data source to be homogeneous?

A

It shares the same format, structure, and technology as the other sources, so the data can be combined without conversion

4
Q

Why may a data store be heterogeneous?

A

Incompatibility, either version (both) or type (file-based systems)

5
Q

Scientists may perform data integration using a common user interface, which entails…

A

A data manager (interface) that handles every step of the integration process, from retrieval to presentation

6
Q

Scientists may perform data integration using middleware data integration, which entails…

A

Using middleware to bridge and facilitate communication between different, heterogeneous systems

7
Q

Scientists may perform data integration using application-based integration, which entails…

A

Using software applications to retrieve and integrate data

8
Q

Scientists may perform data integration using uniform data access, which entails…

A

Providing a consistent view of data from diverse sources without moving or altering it - e.g. via the cloud

9
Q

Scientists may perform data integration using a common data store or data warehouse, which entails…

A

A system where data from other sources is collected and stored as a duplicate, often for data analysis, presenting that data uniformly via some kind of graph or table

10
Q

What is data cleaning?

A

The process of detecting and correcting (or removing) corrupt or inaccurate records in data

11
Q

What is missing data?

A

Data we expect to have, but which is absent from our dataset

12
Q

What can cause missing data? (pick two)

A

Human error, data type errors, incompatibility, lost records, failure to fetch

13
Q

What is data missing completely at random (MCAR)?

A

Data is missing by pure chance, meaning the probability of a missing value is equal for all units

14
Q

What is data missing at random (MAR)?

A

Some data is more likely to be missing than other data, meaning the probability of a missing value is related to the other, observed variables rather than to the missing value itself

15
Q

What is data missing not at random (MNAR)?

A

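Data is missing for a reason tied to the missing values themselves, meaning the probability of a missing value depends on the value that would have been recorded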

16
Q

The keep as-is approach to handling missing values involves…

A

Keeping the data as it is given

17
Q

The remove rows approach to handling missing values involves…

A

Removing the observations with missing values
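
A minimal sketch of the remove rows approach, assuming each observation is a dictionary and that `None` marks a missing value. The sample records and the function name `drop_incomplete_rows` are hypothetical and only serve as an illustration.

```python
# Hypothetical records; None marks a missing value (an assumption of this sketch).
rows = [
    {"age": 34, "height_cm": 170.0},
    {"age": None, "height_cm": 182.5},
    {"age": 51, "height_cm": None},
    {"age": 28, "height_cm": 165.2},
]


def drop_incomplete_rows(records):
    """Return only the records in which no field is None."""
    return [r for r in records if all(v is not None for v in r.values())]


complete = drop_incomplete_rows(rows)
print(complete)
```

Only the first and last records survive, which shows the cost of this approach: every row with even one gap is lost.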

18
Q

The remove columns approach to handling missing values involves…

A

Removing the features with missing values

19
Q

The impute values approach to handling missing values involves…

A

Estimating and imputing missing values via models and problem knowledge

20
Q

How can we generate values to impute? Describe one general solution, and one each for MNAR and MCAR.

A

Central tendency (mean, median, mode), regression analysis (MNAR), interpolation (MCAR)
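
Two of these options can be sketched in a few lines: filling gaps with a measure of central tendency, and linear interpolation between neighbouring observations. This assumes a single numeric column stored as a list with `None` for missing entries; the function names are hypothetical. Regression-based imputation is not shown.

```python
from statistics import mean, median, mode


def impute_central(values, strategy=mean):
    """Fill None entries with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill = strategy(observed)
    return [fill if v is None else v for v in values]


def interpolate_linear(values):
    """Fill interior None entries on a straight line between the nearest
    observed neighbours. Leading and trailing gaps are left as None."""
    result = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = values[left] + step * (i - left)
    return result


print(impute_central([1, None, 3]))             # mean of observed values
print(impute_central([1, 1, None, 5], median))  # median of observed values
print(interpolate_linear([10, None, None, 40]))
```

Interpolation only makes sense when the order of the observations is meaningful, for example in a time series.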

21
Q

What are outliers?

A

Data points that differ significantly from other observations.

22
Q

What can cause outliers?

A

Data errors, outstanding legitimate values, or fraudulent entries

23
Q

How can we detect outliers?

A

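Using quartiles: compute the lower quartile (Q1), the upper quartile (Q3), and the interquartile range (IQR = Q3 - Q1); points below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are commonly flagged as outliers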
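
A minimal sketch of quartile-based detection. It assumes the widely used convention of flagging points more than 1.5 × IQR beyond the lower or upper quartile; the multiplier `k` and the function name `iqr_outliers` are choices of this example, not something the card specifies.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    # quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


readings = [10, 12, 11, 13, 12, 11, 95]
print(iqr_outliers(readings))
```

Different quartile conventions (the `method` argument) can shift the fences slightly on small datasets.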

24
Q

The do nothing approach to handling outliers involves…

A

Doing nothing to the data

25
The replacing with upper/lower cap approach to handling outliers involves...
Replacing each outlier with the upper or lower limit (cap) of the acceptable range
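
Capping can be sketched as a simple clamp. The limits are passed in explicitly here; how they are chosen (for example from quartiles or percentiles) is left open, and `cap_values` is a hypothetical name.

```python
def cap_values(values, lower, upper):
    """Clamp every value into the range [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


print(cap_values([3, 8, 15, 120], lower=5, upper=20))
```

Values already inside the range are left untouched, so the number of observations is preserved.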
26
The log transformation approach to handling outliers involves...
Applying a logarithm to the whole feature, which compresses large values and so reduces the influence of outliers
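
A minimal sketch, assuming non-negative data. It uses `log(1 + x)` rather than a plain logarithm so that zeros are handled; that variant is a choice of this example.

```python
import math


def log_transform(values):
    """Apply log(1 + x) to every value; requires x > -1."""
    return [math.log1p(v) for v in values]


incomes = [1, 10, 100, 10000]
transformed = log_transform(incomes)
print(transformed)
```

The ordering of the values is preserved, but the gap between the largest value and the rest shrinks considerably.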
27
The removing data objects approach to handling outliers involves...
Deleting any observations with outliers
28
What are errors?
Discrepancies or deviations between the recorded data and the actual values
29
What are random errors?
Unavoidable fluctuations in the data e.g. extraneous noise
30
What are systematic errors?
Repeatable errors that can be associated with a cause
31
What is standardisation?
Rescaling data to have a mean of 0 and standard deviation of 1.
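
As a sketch, standardisation (the z-score) subtracts the mean and divides by the standard deviation. The population standard deviation is used here; using the sample version instead is an equally common choice.

```python
from statistics import mean, pstdev


def standardise(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardise a constant feature")
    return [(v - mu) / sigma for v in values]


z = standardise([2, 4, 4, 4, 5, 5, 7, 9])
print(z)
```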
32
What is normalisation?
Rescaling data to a common scale - typically between 0 and 1.
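
One common way to do this is min-max scaling, sketched below under the assumption that the target range is 0 to 1. The function name is hypothetical.

```python
def normalise(values):
    """Min-max scale values so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot normalise a constant feature")
    return [(v - lo) / (hi - lo) for v in values]


print(normalise([10, 20, 30]))
```

Because the minimum and maximum define the scale, a single extreme outlier can squash all other values together.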
33
To address skew in the data, the most effective data transformation is...
Log transformation
34
What is binary coding?
Each category is converted into its own column, with a binary (0/1) value indicating whether that category is present (also known as one-hot encoding)
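
A minimal sketch of binary coding for a single categorical column. Representing each row as a dictionary of 0/1 flags is an assumption of this example.

```python
def one_hot(values):
    """Turn a list of categories into one 0/1 column per distinct category."""
    categories = sorted(set(values))
    return [{c: int(v == c) for c in categories} for v in values]


for row in one_hot(["red", "green", "red"]):
    print(row)
```

Exactly one flag is set per row, so no artificial ordering between categories is introduced.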
35
What is ranking transformation?
Ordered categorical data is converted to numbers that reflect each category's rank - e.g. 'low' = 1, 'medium' = 2, 'high' = 3
36
What is attribute construction?
Categorical data is converted to a real-world numerical version - e.g. number of years someone has been in education, so 'high school' = 12, 'bachelors' = 14, etc.
37
What is discretization?
Continuous data is converted to a discrete or finite version - e.g. height and weight converted to BMI, then stored as BMI groups
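
The BMI example can be sketched as follows. The group boundaries (18.5, 25, 30) follow the commonly cited adult BMI categories and are an assumption of this example, as are the function names.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2


def bmi_group(value):
    """Map a continuous BMI value onto a small set of discrete groups."""
    if value < 18.5:
        return "underweight"
    if value < 25:
        return "normal"
    if value < 30:
        return "overweight"
    return "obese"


print(bmi_group(bmi(70, 1.75)))
```

The continuous value is discarded once the group is stored, so the bin edges should be chosen with the downstream analysis in mind.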
38
Smoothing is used to...
Reduce noise or fluctuation in the data
39
How is a moving average used for smoothing?
A fixed-size window slides along the dataset, and the average of the values in each window is stored as the smoothed value for that section
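
A minimal sketch of a simple moving average, assuming an ordered list of numbers and a window that covers consecutive values. The output is shorter than the input because only full windows are averaged; padding the ends is another option not shown here.

```python
def moving_average(values, window=3):
    """Average each run of `window` consecutive values."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


print(moving_average([1, 2, 3, 4, 5], window=3))  # [2.0, 3.0, 4.0]
```

A larger window smooths more aggressively but also blurs genuine short-term changes.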
40
What is a bar chart used for?
Categorical data with a frequency - e.g. the number of people who say that a fruit is the tastiest
41
If a bar chart becomes too cluttered, how can we resolve this?
Use a horizontal bar chart, which leaves more room for category labels; bars can also be grouped to add an extra dimension to the data - e.g. male vs female responses
42
What is a line plot used for?
Visualising change in a continuous variable over time
43
What is a scatter plot used for?
Visualising relationships between two variables
44
What is one way we can add another variable to a scatter plot?
Size of scatter points, colour of scatter points, or marker shape
45
What is a pie chart used for?
Visualising how a whole is divided into proportions
46
What are heatmaps used for?
Visualising the frequency or magnitude of something over time or at different points, using colour intensity
47
What is dimensionality reduction?
A technique to remove some of the dimensions from the data, to allow the model to focus only on important data and to remove noise
48
What is feature selection?
The removal or filtering of features from the dataset that are redundant or unnecessary for prediction
49
What is a filter method of feature selection?
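Scoring each feature with some metric (e.g. variance or correlation with the target), independently of any model, and filtering out the low-scoring features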
50
What is a wrapper method of feature selection?
Applying a search over subsets of features, training and evaluating a model on each candidate subset to find the features worth keeping
51
What is an embedded method of feature selection?
Methods that are embedded into the working of the model, like regularisation or decision tree pruning
52
What is variance thresholding in feature selection?
A filter method where we remove features with low variance, since they likely contain little information
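
A minimal sketch of variance thresholding, assuming the dataset is a dictionary mapping feature names to columns of numbers. The threshold value and function name are choices of this example.

```python
from statistics import pvariance


def variance_threshold(columns, threshold=0.0):
    """Keep only the features whose variance is strictly above `threshold`."""
    return {
        name: values
        for name, values in columns.items()
        if pvariance(values) > threshold
    }


data = {"constant": [1, 1, 1, 1], "useful": [1, 5, 2, 8]}
print(list(variance_threshold(data)))
```

Variance depends on the units of each feature, so comparing features against one threshold is most meaningful after they have been put on a common scale.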
53
What is a forward search in feature selection?
A wrapper method that creates a set of models that each use one feature and selects the best one, then creates another set that each add one of the remaining features to those already selected, repeating iteratively until we have a set of features
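
The loop can be sketched as a greedy search. The `score` argument is a hypothetical callable standing in for "train a model on this subset of features and return its validation score"; the toy scorer below is made up so the example runs on its own. Stopping early when no candidate improves the score is one possible stopping rule, not the only one.

```python
def forward_search(features, score, max_features):
    """Greedily add the feature that most improves score(subset)."""
    selected = []
    remaining = list(features)
    best_score = float("-inf")
    while remaining and len(selected) < max_features:
        # Try adding each remaining feature to the current selection.
        candidates = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidates, key=lambda pair: pair[0])
        if top_score <= best_score:
            break  # no candidate improves the model, so stop early
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected


# Toy stand-in for model evaluation: each feature adds a fixed amount.
usefulness = {"age": 0.30, "income": 0.25, "shoe_size": 0.01, "noise": -0.05}


def toy_score(subset):
    return sum(usefulness[f] for f in subset)


print(forward_search(list(usefulness), toy_score, max_features=4))
```

With a real model the score of a subset is not a simple sum, which is exactly why the search has to retrain at every step.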
54
What is a decision tree in feature selection?
By creating a decision tree, we discard the features that the tree does not use or that leave impure leaves, keeping only the features that are most effective in the decision process
55
What is feature extraction?
A method of extracting useful combinations of features that can be mixed together to produce more powerful representations
56
What is a linear method of feature extraction?
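Methods in which each new feature is a linear combination of the original features (e.g. PCA, or a network with linear activation functions)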
57
What is a non-linear method of feature extraction?
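Methods in which the new features are non-linear functions of the original features (e.g. t-SNE, UMAP, or a network with non-linear activation functions)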
58
What is Principal Component Analysis (PCA) in feature extraction?
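A linear method where we find an orthogonal coordinate transformation whose new coordinates (principal components) are ordered by how much of the data's variance they capture, so the first few carry most of the information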
59
How does Principal Component Analysis (PCA) extract new features?
We treat each feature as a dimension, find the orthogonal directions along which the data varies most, and project the data onto the first few of them, reducing dimensionality while retaining as much variance as possible
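
For two features the idea can be worked through by hand. The sketch below is restricted to 2-D points so that the eigenvector of the covariance matrix has a closed form; it uses the population covariance, and the function name is hypothetical. Real datasets would normally go through a linear algebra library instead.

```python
import math


def pca_first_component(points):
    """Return (direction, scores, explained_ratio) for 2-D points.

    direction: unit vector of greatest variance
    scores: each centred point projected onto that direction
    explained_ratio: share of total variance captured by the direction
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centred = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the 2x2 population covariance matrix.
    sxx = sum(x * x for x, _ in centred) / n
    syy = sum(y * y for _, y in centred) / n
    sxy = sum(x * y for x, y in centred) / n

    # Largest eigenvalue of a symmetric 2x2 matrix, in closed form.
    largest = (sxx + syy) / 2 + math.hypot((sxx - syy) / 2, sxy)

    # Matching eigenvector; fall back to an axis if features are uncorrelated.
    if sxy != 0:
        vx, vy = largest - syy, sxy
    else:
        vx, vy = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    direction = (vx / norm, vy / norm)

    scores = [x * direction[0] + y * direction[1] for x, y in centred]
    total = sxx + syy
    explained = largest / total if total > 0 else 0.0
    return direction, scores, explained


direction, scores, explained = pca_first_component([(1, 2), (2, 4), (3, 6)])
print(direction, scores, explained)
```

The sample points lie exactly on a line, so a single new coordinate describes them without losing any variance.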
60
What are t-SNE and UMAP?
Non-linear dimensionality reduction methods that are generally suited to data visualisation rather than to the learning process
61
How do t-SNE and UMAP aid in data visualisation?
They take the distribution of distances between each point in the dataset, scatter the points randomly along 2 or 3 dimensions, and adjust them iteratively until the distribution of distances in the new space resembles the original distribution
62
What difference is there between t-SNE and UMAP?
UMAP is much faster, adjusting each step slightly to reduce memory and time consumption.