# Machine Learning Basics Flashcards

## Taken from various sources, including: https://d2wvfoqc9gyqzf.cloudfront.net/content/uploads/2018/09/Ng-MLY01-13.pdf

What is Machine Learning?

The art and science of giving computers the ability to learn to make decisions from data (improve at a task based on experience) without being explicitly programmed.

Name three categories of Machine Learning

Supervised Learning

Unsupervised Learning

Reinforcement Learning

What is Unsupervised Learning?

Uncovering hidden patterns in unlabelled data, e.g. grouping customers into distinct categories (clusters) that were previously unknown and are hopefully meaningful.

What is Reinforcement Learning?

Software agents interact with an environment and learn how to optimise their behaviour, trying to find the most efficient pathway to a goal.

Learning is driven by a system of rewards and punishments: successful routes earn a reward, and failed attempts are restarted.
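A minimal sketch of this reward loop, using tabular Q-learning as one possible technique (the source does not name an algorithm). The five-cell corridor, the reward of 1 at the goal, and the names `step`, `train`, `ALPHA` and `GAMMA` are all illustrative assumptions.

```python
import random

N_STATES = 5          # corridor cells 0..4; cell 4 is the goal (assumed environment)
ACTIONS = (-1, +1)    # step left or step right
ALPHA, GAMMA = 0.5, 0.9   # learning rate and discount factor (assumed values)

def step(state, action):
    """Move along the corridor; reward 1 only on reaching the goal."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward

def train(episodes=500, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # q[state][action index]
    for _ in range(episodes):
        state = 0                                # every episode restarts at cell 0
        while state != N_STATES - 1:
            a = rng.randrange(2)                 # explore with random actions
            nxt, reward = step(state, ACTIONS[a])
            # Nudge the estimate toward reward + discounted best future value.
            q[state][a] += ALPHA * (reward + GAMMA * max(q[nxt]) - q[state][a])
            state = nxt
    return q

q_table = train()
# Greedy policy for the non-goal cells: pick the action with the higher value.
policy = ["left" if q[0] > q[1] else "right" for q in q_table[:-1]]
print(policy)
```

After training, the greedy policy should point right in every cell, because moving right reaches the reward in the fewest steps.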

What is Supervised Learning?

The Machine Learning model trains on labelled training data, i.e. sets of input variables (features) paired with known labels (the target variable), often over multiple iterations. It then predicts the labels of subsequent, unseen testing datasets.

Also called Predictive Data Analytics.
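As a rough illustration, the sketch below uses a 1-nearest-neighbour classifier, chosen only because it is short; the source does not prescribe a model. The height/weight data, the labels and the `predict` helper are made up for the example.

```python
import math

# Labelled training data: (height_cm, weight_kg) features with a known label.
train_X = [(150, 50), (160, 55), (155, 52), (180, 85), (185, 90), (178, 80)]
train_y = ["small", "small", "small", "large", "large", "large"]

def predict(point):
    """Return the label of the closest training example (1-nearest neighbour)."""
    distances = [math.dist(point, x) for x in train_X]
    return train_y[distances.index(min(distances))]

# Unseen test data: the model has never been shown these points.
test_X = [(158, 54), (182, 88)]
predictions = [predict(p) for p in test_X]
print(predictions)
```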

What is Exploratory Data Analysis (EDA)?

An initial exploration of the data, using mostly graphical techniques, to gain insight into its nature and structure, e.g. which variables are important and which points are outliers.

Who codified EDA practice?

John Tukey in the 1970s

Complete the following: “Science does not begin with a tidy question …” (EDA)

“… nor does it end with a tidy answer”

What did Tukey refer to EDA as?

Detective work

Name some EDA techniques

Plot data norms: level of the distribution, i.e. measures of central tendency (mode, median, mean)

Range or spread of the distribution: standard deviation, percentiles, quartiles

Relationships between variables/features in datasets/observations

Investigate trends for variables over time.
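The numeric summaries above can be computed with Python's standard `statistics` module. This is a small sketch on a made-up sample (`values` is hypothetical); real EDA would normally pair these numbers with plots.

```python
import statistics

values = [2, 4, 4, 5, 7, 9, 11]   # hypothetical sample

summary = {
    # Measures of central tendency
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "mode": statistics.mode(values),
    # Measures of spread
    "stdev": statistics.stdev(values),               # sample standard deviation
    "quartiles": statistics.quantiles(values, n=4),  # three cut points: Q1, Q2, Q3
    "range": max(values) - min(values),
}
for name, value in summary.items():
    print(f"{name}: {value}")
```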

Describe Data Wrangling

A process that occurs during the Data Preparation stage.

Take messy, incomplete or overly complex data and simplify and/or clean it so that it’s usable for analysis:

- Remove or impute missing values
- Convert categorical to numeric
- Standardise/Normalise data
- Clean data
- Join data together
- Generate new fields

Overlaps with Feature Engineering.
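Two of these wrangling steps, imputing missing values and standardising, might look like the sketch below. The `ages` column is invented, and mean imputation with z-score scaling is just one common choice among several.

```python
import math

ages = [25.0, None, 35.0, 40.0, None]   # None marks a missing value

# Impute: replace each missing value with the mean of the observed ones.
observed = [a for a in ages if a is not None]
mean_age = sum(observed) / len(observed)
imputed = [mean_age if a is None else a for a in ages]

# Standardise: rescale to zero mean and unit (population) standard deviation.
mu = sum(imputed) / len(imputed)
sigma = math.sqrt(sum((a - mu) ** 2 for a in imputed) / len(imputed))
standardised = [(a - mu) / sigma for a in imputed]

print(imputed)
print([round(z, 2) for z in standardised])
```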

What is Feature Engineering?

Taking whatever information you have about your problem and turning it into a usable numeric format that you can use to build your feature matrix.

How does Machine Learning work?

Use data to form a hypothesis. New data exposes errors in the hypothesis, so the error gap is measured and the hypothesis is adjusted to fit. The aim is to get the error gap as low as possible.
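One way to picture this loop is gradient descent on a one-parameter model. The sketch assumes a hypothesis of the form y = w * x, made-up data following y = 2x, and mean squared error as the "error gap"; none of these choices come from the source.

```python
# Hypothetical data generated by the rule y = 2x; the model must discover the 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w = 0.0                 # initial hypothesis: y = 0 * x
learning_rate = 0.01    # assumed step size

def mean_squared_error(w):
    """The 'error gap': average squared difference between prediction and truth."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

for _ in range(200):
    # Gradient of the mean squared error with respect to w.
    gradient = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= learning_rate * gradient   # adjust the hypothesis to shrink the error

print(round(w, 3), mean_squared_error(w))
```

Each pass measures the error and moves `w` a small step in the direction that reduces it, so `w` approaches 2.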

Name some types of Feature Engineering.

Converting Categorical features to numeric - could use one-hot encoding

Encode images to pixel representation

Impute missing data - fill NaN with the mean of the column

Build Feature Pipeline to chain together above tasks
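A minimal sketch of chaining these tasks into a pipeline that produces a numeric feature matrix. The `rows` data and the helpers `impute_size`, `one_hot_colour` and `run_pipeline` are hypothetical names invented for this example.

```python
import math

rows = [
    {"colour": "red", "size": 1.0},
    {"colour": "blue", "size": float("nan")},
    {"colour": "red", "size": 3.0},
]

def impute_size(rows):
    """Fill NaN sizes with the mean of the known sizes."""
    known = [r["size"] for r in rows if not math.isnan(r["size"])]
    mean = sum(known) / len(known)
    return [{**r, "size": mean if math.isnan(r["size"]) else r["size"]} for r in rows]

def one_hot_colour(rows):
    """Replace the categorical 'colour' field with one 0/1 column per category."""
    categories = sorted({r["colour"] for r in rows})
    out = []
    for r in rows:
        encoded = {f"colour_{c}": int(r["colour"] == c) for c in categories}
        encoded["size"] = r["size"]
        out.append(encoded)
    return out

def run_pipeline(rows, steps):
    """Apply each step in order, feeding its output to the next."""
    for step in steps:
        rows = step(rows)
    return rows

features = run_pipeline(rows, [impute_size, one_hot_colour])
matrix = [list(r.values()) for r in features]   # columns: colour_blue, colour_red, size
print(matrix)
```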

What do Machine Learning Algorithms do?

Algorithms learn a pattern inherent in existing data. These patterns can be used to make predictions about data that has not yet been analysed. This pattern, or model, is much smaller than the training data.

Describe the Machine Learning lifecycle.

Derive pattern/data model using training data and algorithm

Check model using test data

Use formal process to check accuracy of model

Apply model to new data
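The four stages can be sketched end to end with a deliberately tiny model. The study-hours data, the midpoint-threshold rule and the `fit` and `predict` helpers are assumptions made for illustration.

```python
# Hypothetical labelled data: (hours_studied, passed) pairs.
data = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1),
        (2.5, 0), (7.5, 1)]

# Hold back the last two examples as test data.
train, test = data[:8], data[8:]

def fit(examples):
    """Learn a pass/fail threshold halfway between the two class means."""
    fails = [x for x, y in examples if y == 0]
    passes = [x for x, y in examples if y == 1]
    return (sum(fails) / len(fails) + sum(passes) / len(passes)) / 2

# Stage 1: derive the model from training data. The whole model is one number.
threshold = fit(train)

def predict(x):
    return 1 if x > threshold else 0

# Stages 2 and 3: check the model on test data with a formal metric (accuracy).
accuracy = sum(predict(x) == y for x, y in test) / len(test)

# Stage 4: apply the model to new, unlabelled data.
new_prediction = predict(5.5)
print(threshold, accuracy, new_prediction)
```

The learned model here is a single threshold, which also illustrates how a model can be far smaller than the data it was trained on.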

What is Dimensionality Reduction in terms of Feature Engineering?

If the number of dimensions is too high, it takes too long to process the data/produce a model

Some dimensions may not be of use

Can either just throw away dimensions, using intuition

Or employ Dimension Reduction techniques: Decision Trees (to rank which features matter), Principal Component Analysis (PCA)

What is Principal Component Analysis (PCA)?

PCA is a feature extraction technique for reducing the dimension of a feature space (mitigating the curse of dimensionality). With fewer dimensions there are fewer relationships between features to consider, so the model is less likely to overfit.
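A small sketch of the idea for two features, reduced to one. It uses the closed-form direction of the leading eigenvector of a 2x2 covariance matrix rather than a general eigen-solver, so it only applies to the 2-D case; the `points` data is made up.

```python
import math

# Hypothetical 2-D data where the two features are strongly correlated.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8)]
n = len(points)

# Centre the data on its mean.
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n
centred = [(x - mean_x, y - mean_y) for x, y in points]

# Entries of the 2x2 sample covariance matrix.
cxx = sum(x * x for x, _ in centred) / (n - 1)
cyy = sum(y * y for _, y in centred) / (n - 1)
cxy = sum(x * y for x, y in centred) / (n - 1)

# For a symmetric 2x2 matrix the leading eigenvector has this closed-form angle.
theta = 0.5 * math.atan2(2 * cxy, cxx - cyy)
component = (math.cos(theta), math.sin(theta))

# Project each 2-D point onto the component: two features become one.
projected = [x * component[0] + y * component[1] for x, y in centred]

# Fraction of the total variance kept by the single new feature.
explained = (sum(p * p for p in projected) / (n - 1)) / (cxx + cyy)
print(component, round(explained, 4))
```

For data this strongly correlated, the single projected feature keeps almost all of the variance, which is why dropping the second dimension loses little information.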

What is Clustering in terms of Unsupervised learning?

Finding islands of similarity in complex data sets.

Uniting singular points into distinct groups or clusters.

Examining data and assembling data points into clusters based on a measure of distance.

Describe the K-Means Clustering algorithm

Unsupervised method utilising clustering.

- Choose the number of clusters (K) to be used by the algorithm (e.g. from a scree plot)
- Randomly place K cluster centre points (centroids) as the start position
- Assign each point to its nearest centroid
- Update the position of each centroid to the centre/average location of its assigned data points
- Repeat the assign and update steps until no point changes cluster.
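The steps above can be sketched in plain Python. This version assumes K is already chosen, picks the starting centres from the data points themselves (one common way to "randomly place" them), and uses Euclidean distance; the `k_means` function and the `points` data are illustrative.

```python
import math
import random

def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)        # start from k random data points
    assignment = None
    for _ in range(max_iter):
        # Assign each point to the index of its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:     # no point changed cluster: converged
            break
        assignment = new_assignment
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                      # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, labels = k_means(points, k=2)
print(sorted(centroids), labels)
```

On this toy data the two centroids settle near the middle of each group of three points.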