Machine Learning + Supervised Flashcards

(60 cards)

1
Q

Who first defined machine learning and when?

A

Arthur Samuel, 1959.

2
Q

Define machine learning.

A

Field of study that gives computers the ability to learn without being explicitly programmed.

3
Q

How did Arthur Samuel implement machine learning?

A

An automated checkers program that learned board positions so it could re-use them instead of searching from scratch each time.

4
Q

How does machine learning operate in a nutshell?

A

Machine receives answers and data as input and outputs rules.

5
Q

Name the steps of machine learning. x6

A
  1. Data Gathering
  2. Data Pre-Processing
  3. Choose a Model
  4. Train the Model
  5. Fine-Tune the Model
  6. Apply the Model
6
Q

Give some more example applications of machine learning.

A

Any one of:
- Building smart robots
- Text understanding
- Computer vision
- Medical informatics
- Database mining

7
Q

Distinguish between supervised and unsupervised learning with examples.

A

Supervised: has training data with known answers and produces a model to predict answers for new data. E.g. classifying tumours.

Unsupervised: data is provided with no answers, and the algorithm finds structure or interesting patterns in the data. E.g. market segmentation research.

8
Q

Distinguish between eager and lazy learning.

A

Eager: Constructs a general, input-independent target function during training. Model constructed before any queries.

Lazy: Generalization beyond the training data is delayed until a query is made. Stores training data with only minor processing until query.

9
Q

T or F. Lazy learners take less time in training and less time in predicting than eager learners.

A

False. Lazy learners take less time in training and more time in predicting.

10
Q

List relevant algorithms studied for eager / lazy respectively.

A

Eager:
- Decision Trees
- Naive Bayes
- Neural Networks

Lazy:
- K-nearest neighbour

11
Q

Timing of Model Building: Difference between Lazy and Eager Learning.

A

Lazy: During prediction
Eager: Before prediction.

12
Q

Data Dependency: Difference between Lazy and Eager Learning.

A

Lazy: Relies heavily on training data during prediction.

Eager: Less dependent on training data during prediction.

13
Q

Computational Efficiency: Difference between Lazy and Eager Learning.

A

Lazy: Faster in training, slower in prediction due to model building.

Eager: Slower in training, faster during prediction due to pre-built model.

14
Q

Memory Usage: Difference between Lazy and Eager Learning.

A

Lazy: Less memory usage in training, more during prediction.

Eager: More memory in training, but less during prediction.

15
Q

Other than supervised and unsupervised, what is the third type of machine learning?

What drives it?
Give one use.
Give one example algorithm.

A

Reinforcement Learning
Learning from mistakes
Gaming
Q-Learning

16
Q

What drives supervised and unsupervised learning respectively?

A

Supervised: Task driven
Unsupervised: Data driven

17
Q

What is regression?

A

A technique used in supervised learning where historical data is used to predict future values.

18
Q

What is a decision tree?

A

Each internal node has a question, each branch a possible answer, and each leaf node a decision. E.g. a series of questions to determine a user's BMI.
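The node / branch / leaf structure can be sketched as nested dictionaries. A minimal sketch, using a hypothetical weather example rather than the BMI questionnaire from the card:

```python
# Sketch of a decision tree as nested dicts (hypothetical weather example).
# Internal nodes hold a question about an attribute, branches the possible
# answers, and leaves the decisions.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {"attribute": "humidity",
                  "branches": {"high": "no", "normal": "yes"}},
        "overcast": "yes",   # leaf node: a decision
        "rain": "no",
    },
}

def classify(node, example):
    # Follow the branch matching the example's answer until a leaf is reached.
    while isinstance(node, dict):
        node = node["branches"][example[node["attribute"]]]
    return node

print(classify(tree, {"outlook": "sunny", "humidity": "normal"}))  # -> yes
```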

19
Q

Define the following terms as parts of decision tree.
1. attribute
2. attribute value
3. classification
4. conjunctive term

A
  1. Internal Node
  2. Branch
  3. Leaf (External) Node
  4. Path to a leaf node
20
Q

Outline the DT learning algorithm.

A

Choose the best attribute.
Add a new node for this attribute and a new branch for each attribute value.
Sort the training examples through the node to the current leaves.
If the training examples are unambiguously classified, then stop.
Otherwise repeat.
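The outline above corresponds to an ID3-style procedure. A minimal sketch, assuming categorical attributes and training examples given as (attribute-dict, label) pairs:

```python
import math
from collections import Counter

# ID3-style sketch of the outline above.
def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def best_attribute(examples, attributes):
    # "Choose best attribute": minimise the weighted entropy after the split
    # (equivalent to maximising information gain).
    def remainder(attr):
        values = Counter(ex[attr] for ex, _ in examples)
        return sum((n / len(examples))
                   * entropy([lbl for ex, lbl in examples if ex[attr] == v])
                   for v, n in values.items())
    return min(attributes, key=remainder)

def id3(examples, attributes):
    labels = [lbl for _, lbl in examples]
    if len(set(labels)) == 1:      # unambiguously classified: stop
        return labels[0]
    if not attributes:             # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(examples, attributes)
    rest = [a for a in attributes if a != attr]
    return {"attribute": attr,
            "branches": {v: id3([(ex, l) for ex, l in examples
                                 if ex[attr] == v], rest)
                         for v in {ex[attr] for ex, _ in examples}}}

data = [({"outlook": "sunny"}, "no"),
        ({"outlook": "rain"}, "no"),
        ({"outlook": "overcast"}, "yes")]
tree = id3(data, ["outlook"])
```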

21
Q

T or F. Finding an optimal decision tree is tractable.

A

False. It is intractable.

22
Q

T or F. Finding a near optimal tree is tractable.

A

True.

23
Q

What is the ideal goal when making a decision tree?

A

That it be as small as possible.

24
Q

T or F. If an attribute leads to an immediate classification then it is a good attribute.

A

True

25
Q

Define entropy. In what units is entropy measured?

A

A computational measure of impurity / uncertainty. Measured in bits.

26
Q

What is the relationship between probability and entropy?

A

H(X) = -sum over all outcomes i of p_i log2 p_i, where p_i is the probability of outcome i.
Large probability = low entropy; small probability = large entropy.

27
Q

Formula for Gain(S, D).

A

Gain(S, D) = H(S) - sum over each value v of attribute D of (proportion of S taking value v) x H(S_v)

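Both formulas can be sketched in a few lines of Python; the coin-flip and split values below are illustrative:

```python
import math

def entropy(probabilities):
    # H(X) = -sum_i p_i * log2(p_i), measured in bits; terms with p = 0
    # contribute nothing.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def gain(parent_labels, splits):
    # Gain(S, D) = H(S) - sum_v (|S_v| / |S|) * H(S_v), where splits holds
    # the subset of labels S_v for each value v of attribute D.
    def h(labels):
        return entropy([labels.count(c) / len(labels) for c in set(labels)])
    total = len(parent_labels)
    return h(parent_labels) - sum(len(s) / total * h(s) for s in splits)

print(entropy([0.5, 0.5]))  # a fair coin -> 1.0 bit
# a perfectly separating split removes all uncertainty:
print(gain(["yes", "yes", "no", "no"], [["yes", "yes"], ["no", "no"]]))  # -> 1.0
```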
28
Q

When do we stop the decision tree making process?

A

When entropy = 0, i.e. the remaining examples are unambiguously classified.

29
Q

What is the problem of noise? How can we fix it?

A

There are no attributes left, but there are both positive and negative examples remaining. Fix: have leaf nodes report the majority class or probabilistic classifications.

30
Q

What is the problem of overfitting? How can we fix it?

A

The algorithm may use irrelevant attributes to make spurious distinctions among the examples. Fix: use statistical significance to determine whether a gain is large enough to proceed.

31
Q

What is the problem of missing data? How can we fix it?

A

Not all new examples will have data for every attribute. Fix: assign the average value for the empty attribute, or try all branches.

32
Q

What is the problem of multi-valued data? How can we fix it?

A

Attributes with a large number of possible values can give non-representative gain results, e.g. an ID attribute. Fix: penalise broad, uniform splits.

33
Q

What is the problem of continuous variables? How can we fix it?

A

Building a tree with continuous values from the training data leads to problems, as new examples are unlikely to have exactly the same value. Fix: use ranges rather than exact values.

34
Q

What is the problem of costly attributes? How can we fix it?

A

Testing certain attributes may carry a substantial cost; a low-cost tree might be more desirable than a small tree. Fix: incorporate cost, e.g. Gain^2 / Cost.

35
Q

On what are decision trees, naive Bayes and KNN based respectively?

A

Decision Trees: information-based
Naive Bayes: probability-based
KNN: similarity-based

36
Q

T or F. Naive Bayes is eager.

A

True

37
Q

T or F. Decision Trees are eager and naive Bayes is not.

A

False

38
Q

T or F. Decision Trees are eager.

A

True

39
Q

T or F. KNN is eager.

A

False

40
Q

What are lazy learners also known as and why?

A

Instance-based learners, because they store training points or instances.

41
Q

What is feature-space?

A

An n-dimensional way to describe a dataset with n attributes (one attribute per dimension). Every datapoint can be positioned at a point in this depiction.

42
Q

What does a nearest-neighbour algorithm do?

A

Seeks to identify the most similar instance in the known training set to a given query.

43
Q

What is the general formula for similarity used in the nearest-neighbour algorithm? Assume m dimensions. What is the name of the general formula? How can Manhattan and Euclidean distance be derived from it?

A

Let a, b be two points in the m-dimensional feature-space:
(sum_{i=1}^{m} abs(a[i] - b[i])^p)^{1/p}
This is the Minkowski distance. Manhattan: p = 1. Euclidean: p = 2.

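A minimal sketch of the Minkowski distance, showing how p = 1 and p = 2 recover Manhattan and Euclidean distance (the sample points are illustrative):

```python
# Minkowski distance over m dimensions; p = 1 gives Manhattan distance,
# p = 2 gives Euclidean distance.
def minkowski(a, b, p):
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

print(minkowski([0, 0], [3, 4], p=1))  # Manhattan -> 7.0
print(minkowski([0, 0], [3, 4], p=2))  # Euclidean -> 5.0
```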
44
Q

What is a Voronoi tessellation?

A

- Shows how the input space divides into classes.
- "Local neighbourhoods" across the feature space, with each region defined by a subset of the training data.
- Each line segment is equidistant between two points of opposite classes.

45
Q

Difference between 1NN and 3NN.

A

In 1NN, if there is an outlier, anything closest to it will be misclassified. We smooth this out by having k (e.g. 3) nearest neighbours vote.

46
Q

What are the three primary requirements of KNN?

A

- A set of training instances
- A distance metric to measure the distance between instances
- The value of k, the number of nearest neighbours to retrieve

47
Q

What is the general idea of KNN?

A

Use the class labels of the k nearest neighbours to determine the class label of an unknown instance.

48
Q

Name for the process of choosing the right k in KNN. Discuss choosing small k vs big k.

A

Parameter tuning.
k too small -> disrupted by noise
k too big -> less precise

49
Q

Give two sample methods of choosing k that are good rules of thumb.

A

Choose an odd k when there is an even number of classes (to avoid tied votes).
Choose k < sqrt(n) for n instances.

50
Q

Outline the KNN basic classification algorithm.

A

- Calculate the similarity of the new point to all the instances in feature-space.
- Rank instances in feature-space by their similarity to the new point.
- Identify the set of k nearest neighbours.
- Report the "majority vote" of this set.

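The four steps above can be sketched as follows; the toy training points are hypothetical, and Euclidean distance is assumed as the metric:

```python
from collections import Counter

# Minimal sketch of the KNN classification steps.
def knn_classify(training, query, k):
    def dist(a, b):
        # 1. Similarity here is (inverse) Euclidean distance.
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    # 2. Rank instances by similarity; 3. keep the k nearest neighbours.
    neighbours = sorted(training, key=lambda inst: dist(inst[0], query))[:k]
    # 4. Report the majority vote of their class labels.
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

training = [((1, 1), "A"), ((1, 2), "A"), ((5, 5), "B"), ((6, 5), "B")]
print(knn_classify(training, (2, 2), k=3))  # -> A
```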
51
Q

Outline some common problems with KNN.

A

Scaling issues & missing values: attributes may have to be scaled so that none dominates, and missing values lead to misclassification.
Outliers: individual instances which are separate from the class to which they belong.
Noise: especially for low k values, causes misclassification.

52
Q

T or F. KNN is supervised and lazy.

A

True

53
Q

T or F. Decision Trees are unsupervised and eager.

A

False. They are supervised.

54
Q

State Bayes' Theorem.

A

P(c | x) = P(x | c)P(c) / P(x)

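As a worked illustration of the theorem; the probabilities below are made up, not from the deck:

```python
# Bayes' theorem sketch: P(c | x) = P(x | c) * P(c) / P(x).
# Hypothetical numbers: likelihood P(x | c) = 0.9, prior P(c) = 0.01,
# evidence P(x) = 0.05.
def posterior(likelihood, prior, evidence):
    return likelihood * prior / evidence

print(posterior(0.9, 0.01, 0.05))  # about 0.18
```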
55
Q

What are the two assumptions when dealing with multiple attributes and Bayes' theorem? Are these assumptions realistic in the real world, and what does this mean?

A

1. Attributes are conditionally independent.
2. Attributes are weighted equally towards the outcome.
In all likelihood, no: this means the model is not perfectly accurate, yet it still yields very good results.

56
Q

T or F. Naive Bayes performs better on categorical data compared to numerical data.

A

True

57
Q

T or F. Naive Bayes is O(N).

A

True

58
Q

Outline how Naive Bayes is performed.

A

Construct a frequency table and then a likelihood table for each attribute against the target class. Then apply the formula
P(c | X) = P(x1 | c) x P(x2 | c) x ... x P(xn | c) x P(c)
for each value of the target class.

59
Q

How do I standardise naive Bayes to have values between 0 and 1?

A

Yes standardised = P(Yes) / (P(Yes) + P(No)), with No standardised being the opposite.

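The frequency-table procedure and the standardisation step can be sketched together; the toy weather data below is hypothetical:

```python
from collections import Counter

# Naive Bayes sketch: derive likelihoods from frequency counts, multiply
# them with the prior, then standardise so the class scores sum to 1.
def naive_bayes(data, query):
    labels = [label for _, label in data]
    priors = {c: n / len(data) for c, n in Counter(labels).items()}
    scores = {}
    for c in priors:
        rows = [attrs for attrs, label in data if label == c]
        score = priors[c]                   # P(c)
        for attr, value in query.items():   # x P(x_i | c) for each attribute
            score *= sum(1 for r in rows if r[attr] == value) / len(rows)
        scores[c] = score
    total = sum(scores.values())            # standardise into [0, 1]
    return {c: s / total for c, s in scores.items()}

data = [
    ({"outlook": "sunny"}, "no"),
    ({"outlook": "sunny"}, "no"),
    ({"outlook": "rain"}, "yes"),
    ({"outlook": "sunny"}, "yes"),
]
print(naive_bayes(data, {"outlook": "sunny"}))
```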
60
Q

Give an application for naive Bayes.

A

Credit scoring.