Data Integration, Learning from data, Supervised Learning Flashcards

1
Q

Define Data Integration… Why is it needed?

A

The process of combining data from heterogeneous sources into a single, coherent data store.

Data sources are usually disparate and siloed. Data integration enables the access and interpretation of data from different sources and types.
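As a minimal sketch of the idea, the snippet below combines two differently formatted sources into one store. The source names, field names and values are all hypothetical and only serve to illustrate the definition above.

```python
import csv
import io
import json

# Two hypothetical, siloed sources holding customer data in different formats.
CRM_CSV = "customer_id,name\n1,Ada\n2,Grace\n"
BILLING_JSON = '[{"id": 1, "monthly_price": 30.0}, {"id": 2, "monthly_price": 45.5}]'


def integrate(crm_csv, billing_json):
    """Combine both sources into one coherent store keyed by customer id."""
    store = {}
    for row in csv.DictReader(io.StringIO(crm_csv)):
        cid = int(row["customer_id"])
        store[cid] = {"customer_id": cid, "name": row["name"]}
    for record in json.loads(billing_json):
        # The billing source calls the key "id"; map it onto the shared key.
        entry = store.setdefault(record["id"], {"customer_id": record["id"]})
        entry["monthly_price"] = record["monthly_price"]
    return store


unified = integrate(CRM_CSV, BILLING_JSON)
```

After the call, each customer has a single record that holds fields from both sources.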

2
Q

What are the 5 main ways of integrating data? Describe each…

A

Common User Interface : Manual data integration by a data manager from retrieval to presentation.

Middleware Data Integration : A piece of Middleware that facilitates integration between systems, usually between legacy and new systems.

Application-Based Integration : A Software Application that locates, retrieves and integrates data into storage. Unlike Middleware, it conducts the entire process itself.

Uniform Data Access : Provides a consistent view of data from a variety of sources, but doesn’t retrieve or manipulate the data.

Common Data Storage : Data from the sources is copied into a separate, central store, e.g. a Data Warehouse.

3
Q

For each type of Data Integration process, give a pro and a con…

A

Common User Interface :
Pro = Total control and handling.
Con = Poor scaling.
Middleware Data Integration :
Pro = Automated integration.
Con = Must be maintained.
Application-Based Integration :
Pro = Automated end-to-end process.
Con = Complex setup.
Uniform Data Access :
Pro = Low storage requirements.
Con = Hosts struggle with the number of data requests.
Common Data Storage :
Pro = Reduces burden on host system.
Con = Increased storage costs.

4
Q

What are the 3 categories of learning from data? Define each…

A

Supervised : Learning that has an Input set and an Output set. The goal is to establish the mapping function that most accurately predicts the output (a continuous target or a class) from the input.

Unsupervised : Learning in which we only have input, and we are tasked with making sense of it.

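Semi-supervised : Learning in which only part of the input has a matching output, so a small labelled set is combined with a larger unlabelled set.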

5
Q

What is the most common type of learning?

A

Supervised

6
Q

What are the 2 types of Supervised Learning?

A

Regression
Classification

7
Q

Define Regression…

A

The process of predicting a continuous target or outcome.

8
Q

Define Classification…

A

The process of assigning each input to one of a set of discrete categories (classes).

9
Q

In Supervised Learning, what are we trying to find?

A

The mapping function.

10
Q

What are the inputs of a supervised learning model?

A

Features, covariates, predictors etc.

11
Q

What are the outputs of a supervised learning model?

A

Target, label, response etc.

12
Q

Give an example of a usage of Supervised Learning. Define the Inputs and Outputs.

A

Inputs : A set of photos, some of which show a dog.
Outputs : A boolean label for each photo (dog or not a dog).
The model learns to predict whether each photo shows a dog.
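A toy sketch of this setup is shown below. Real photos are replaced by two hypothetical numeric features per photo, and the model is a 1-nearest-neighbour classifier. That classifier is a stand-in chosen for brevity, not a method named in these cards.

```python
# Input set: hypothetical features extracted from each photo,
# here (snout_length_cm, weight_kg).
X_train = [(8.0, 20.0), (9.5, 30.0), (2.0, 4.0), (2.5, 5.0)]
# Output set: True if the photo shows a dog, False otherwise.
y_train = [True, True, False, False]


def predict_is_dog(features):
    """1-nearest-neighbour: copy the label of the closest training example."""
    def squared_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    nearest = min(range(len(X_train)),
                  key=lambda i: squared_distance(X_train[i], features))
    return y_train[nearest]


looks_like_dog = predict_is_dog((9.0, 25.0))
looks_like_other = predict_is_dog((2.2, 4.5))
```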

13
Q

Define Unsupervised Learning…

A

The process of making sense of a data set by recognising patterns.

14
Q

Define a Regression Problem and give an example…

A

A problem in which we need to predict a continuous target or outcome, e.g. predicting the sale price of a house from its size.

15
Q

Define what is meant by inputs, outputs and parameter variables of a Mapping Function…

A

Inputs : The input values (features) passed to the function.
Parameters : The values that will change as the model learns from the data.
Output : The predicted value.
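The three roles can be illustrated with a minimal sketch, assuming a straight-line mapping function. The names `mapping_function`, `b0` and `b1` are illustrative only.

```python
def mapping_function(x, b0, b1):
    """Return the predicted output for input x, given parameters b0 and b1."""
    return b0 + b1 * x


# x is the input, b0 and b1 are the parameters, prediction is the output.
prediction = mapping_function(x=3.0, b0=1.0, b1=2.0)
```

Changing `b0` or `b1` changes the prediction for the same input, which is what learning adjusts.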

16
Q

Regarding a data frame in linear regression, what does a column represent? And what does a row represent?

A

Each column is a Feature of the input data, e.g. join date, monthly price etc.

Each row is an Observation, e.g. one customer.

17
Q

Regarding the input and output data frames, what is the difference in their number of rows and columns?

A

The number of rows is always the same in both, N, where N is the number of observations.

The number of columns in the output Y is 1. The number of columns in the input X depends on the number of features.
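A small sketch of these shapes, using plain nested lists and made-up values (two hypothetical features per customer):

```python
# Input X: N rows (observations) by p columns (features).
X = [
    [12, 29.99],  # hypothetical: months since joining, monthly price
    [3, 9.99],
    [24, 49.99],
]
# Output Y: N rows by exactly 1 column (the target).
Y = [[1], [0], [1]]


def shape(table):
    """Return (number of rows, number of columns) of a nested list."""
    return (len(table), len(table[0]))


x_shape = shape(X)
y_shape = shape(Y)
```

Here both tables have N = 3 rows, while X has 2 columns and Y has 1.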

18
Q

Define what parameters are in a mapping function.

A

Values that are adjusted during training to give us the most accurate regression line.

19
Q

What makes good parameters?

A

Parameter values are good when they give us the most accurate model.

20
Q

Define Hyper-Parameters…

A

Settings that we select ourselves rather than the model, typically before or between training runs. These are not learned from the data.

21
Q

What are the 2 phases of learning the parameters? Define each…

A

Training phase : Use past data to find quality parameters. In general, the more past data used, the better the parameters.

Prediction phase : Feed new data into our model, and assess accuracy via a Loss Function.

22
Q

What is a Loss Function and when is it used?

A

A Loss Function is used in the Prediction Phase of Machine Learning, once the model has produced outputs for data whose actual values are known.

Its purpose is to output a value that represents how close our model's output value is to the actual value. Parameters are then updated depending on the Loss Function's output.
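As an illustration, here is a minimal sketch of one common Loss Function, mean squared error. The cards do not name a specific loss, so this choice is an assumption.

```python
def mean_squared_error(actual, predicted):
    """Average squared gap between actual and predicted values (0 is perfect)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


perfect = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
worse = mean_squared_error([1.0, 2.0, 3.0], [2.0, 2.0, 5.0])
```

The closer the predictions are to the actual values, the smaller the returned number.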

23
Q

What is the reason for updating the parameters after the prediction phase? What determines the values to update to?

A

To improve the prediction accuracy of the model.

The Loss Function's output guides the update: the parameters are changed in the direction that reduces the loss.
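One way to turn the loss into a parameter update is gradient descent, sketched below for a straight-line model with a mean squared error loss. Gradient descent, the loss choice and the `learning_rate` value are assumptions made for illustration and are not stated in these cards.

```python
def mse(xs, ys, b0, b1):
    """Mean squared error of the line b0 + b1*x on the data."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient_step(xs, ys, b0, b1, learning_rate):
    """Move b0 and b1 a small step in the direction that lowers the loss."""
    n = len(xs)
    errors = [(b0 + b1 * x) - y for x, y in zip(xs, ys)]
    grad_b0 = 2 * sum(errors) / n
    grad_b1 = 2 * sum(e * x for e, x in zip(errors, xs)) / n
    return b0 - learning_rate * grad_b0, b1 - learning_rate * grad_b1


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 1 + 2x
b0, b1 = 0.0, 0.0          # initial parameter guesses
loss_before = mse(xs, ys, b0, b1)
for _ in range(2000):
    # learning_rate is chosen by us, not learned from the data
    b0, b1 = gradient_step(xs, ys, b0, b1, learning_rate=0.05)
loss_after = mse(xs, ys, b0, b1)
```

On this toy data the loss shrinks with repeated updates and the parameters approach b0 = 1 and b1 = 2.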

24
Q

Give a simple Linear Regression function…

A

y = b0 + (b1 * x) + e

where b0 is the Y-intercept, b1 is the slope and e is the error term.

25
Q

In order to find the linear regression value, what 2 properties of the linear regression line do we need to know?

A

Slope of the line.
Y-intercept.

26
Q

What is the equation of finding the Y-intercept of the simple regression line?

A

b0 = y mean - (b1 * x mean)

27
Q

What is the equation to find the slope of the simple linear regression line?

A

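b1 = r * ( sy / sx )

where r is the Pearson Correlation between x and y, and sx and sy are the standard deviations of x and y.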
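A short sketch of fitting the line this way is given below. It takes the slope from r, sx and sy, then the Y-intercept from the means of x and y. The function name and the sample data are made up.

```python
from statistics import mean, stdev


def fit_simple_regression(xs, ys):
    """Return (b0, b1) using b1 = r * (sy / sx) and b0 = y_mean - b1 * x_mean."""
    x_mean, y_mean = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)  # sample standard deviations
    # Pearson correlation r from the sample covariance.
    covariance = sum((x - x_mean) * (y - y_mean)
                     for x, y in zip(xs, ys)) / (len(xs) - 1)
    r = covariance / (sx * sy)
    b1 = r * (sy / sx)
    b0 = y_mean - b1 * x_mean
    return b0, b1


# Data generated from y = 1 + 2x, so the fit should recover those values.
b0, b1 = fit_simple_regression([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
```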

28
Q

Define the Pearson Correlation…

A

A measure of the strength and direction of the linear relationship between 2 variables, ranging from -1 to 1.
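A minimal sketch of the calculation on made-up data, showing the two extreme cases of a perfectly rising and a perfectly falling linear relationship:

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation of two equally long lists of numbers."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    co_deviation = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    x_spread = sum((x - x_mean) ** 2 for x in xs)
    y_spread = sum((y - y_mean) ** 2 for y in ys)
    return co_deviation / sqrt(x_spread * y_spread)


rising = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])   # y increases with x
falling = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])  # y decreases with x
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one.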