Week 1: Data and KNN Flashcards

(38 cards)

1
Q

What is the machine learning approach?

A

Programming an algorithm to learn automatically from data or from experience, for example to uncover patterns in data or to build autonomous agents

2
Q

What should be emphasized in machine learning?

A
  • Predictive performance
  • Scalability
  • Autonomy
3
Q

Why might you want to use a learning algorithm?

A
  • Hard to code solution by hand (vision, speech)
  • System needs to adapt to a changing environment (spam detection)
  • Want the system to perform better than human programmers
  • Privacy/fairness (ranking search results)
4
Q

How does machine learning perform compared to humans?

A

It may perform better or worse than humans

5
Q

Define artificial intelligence

A
  • A subfield of CS that refers to computer programs that can solve problems humans are good at
  • E.g., vision, natural language
6
Q

Define machine learning

A

A subfield of AI focused on learning (tuning parameters) from data

7
Q

Define neural networks

A

Parametric model used in ML loosely based on biological neurons

8
Q

What is deep learning?

A

Neural networks with multiple layers

9
Q

What is data science?

A

An emerging field that applies ML techniques to domain-specific problems

10
Q

What are some machine learning domains?

A
  • Computer vision
  • Speech recognition
  • Natural Language Processing
  • Recommender systems
  • Games
11
Q

Types of machine learning

A
  • Supervised learning
  • Semi-supervised learning
  • Reinforcement learning
  • Unsupervised learning
12
Q

What is supervised learning?

A
  • The learner is given labeled examples of the correct behavior
  • Predict unknown values of the data using other known data
  • Classification (is this A or B?)
  • Anomaly detection (is this weird?)
  • Regression (how much/how many?)
13
Q

What is semi-supervised learning?

A

Utilizes both labeled and unlabeled data

14
Q

What is reinforcement learning?

A

A learning system that interacts with the world and learns to maximize a scalar reward signal

15
Q

What is unsupervised learning?

A
  • No labeled examples; instead, look for interesting patterns in the data
  • Find human-interpretable and previously unknown patterns that describe the unlabeled data
  • Clustering (how is the data organized?)
  • Association rule mining (are these related?)
16
Q

Why is machine learning so powerful nowadays?

A
  • Abundance of data
  • Computing power
17
Q

What is the machine learning problem?

A
  1. Should I use ML on this problem?
  2. Gather and organize data (pre-processing, cleaning, visualizing)
  3. Establish a baseline
  4. Choose a model
  5. Optimization
  6. Hyperparameter search
  7. Analyze performance and mistakes
    - Iterate back to step 4 or step 2
18
Q

What is data?

A

Collection of objects and their attributes

19
Q

What does an ML training set consist of?

A
  • Inputs (vectors)
  • Labels
20
Q

Why do we use input vectors in machine learning?

A
  • Algorithms need to handle lots of data
  • A common strategy is mapping data to another space that is easy to manipulate (Representation)
  • Vectors are a good representation since we can do linear algebra
21
Q

What is regression and classification in a training set?

A

Regression: the target t is a real number
Classification: the target t is an element of a discrete set

22
Q

What are the classification metrics for evaluation?

A

Accuracy = (# correct predictions) / (# test instances)
Error = 1 - accuracy = (# incorrect predictions) / (# test instances)
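
A minimal sketch of these two metrics, assuming predictions and true labels arrive as equal-length Python lists. The function names `accuracy` and `error_rate` and the sample labels are illustrative, not part of the course material.

```python
def accuracy(predictions, labels):
    """Fraction of test instances whose predicted label matches the true label."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("need two non-empty sequences of equal length")
    correct = sum(p == t for p, t in zip(predictions, labels))
    return correct / len(labels)


def error_rate(predictions, labels):
    """Error is the complement of accuracy."""
    return 1 - accuracy(predictions, labels)


# Made-up example: 3 of the 4 predictions match the true labels.
preds = ["cat", "dog", "dog", "cat"]
truth = ["cat", "dog", "cat", "cat"]
print(accuracy(preds, truth))    # 0.75
print(error_rate(preds, truth))  # 0.25
```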

23
Q

What is similarity?

A
  • The simplest method of learning we know
  • Classifying according to similar objects you’ve seen
  • aka manohorse
24
Q

What happens as more data points are added to nearest neighbors?

A

More complicated boundaries are possible

25
Q

What is the relationship between nearest neighbors and noise?

A

It is sensitive to noise and to mislabeled data (class noise)

26
Q

What is the solution to noisy data?

A

Have the k nearest neighbors vote and pick the majority

27
Q

What are the steps for k-nearest neighbors?

A
  • Calculate the distance between the new data point and every data point in the set
  • Identify the k points with the shortest distance to the new point; these are the k nearest neighbors
  • Among the nearest neighbors, count how many points belong to each class and pick the majority
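
The three steps above can be sketched in a few lines of plain Python. This is an illustrative version, assuming Euclidean distance and points stored as tuples. The name `knn_predict` and the toy data are made up for this example, and ties are resolved in favor of the label seen first among the nearest neighbors.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Step 1: distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Step 2: keep the k points with the shortest distance.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: count labels among the neighbors and pick the majority.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up 2-D data: class A clusters near (1.5, 1.5), class B near (6.5, 7).
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.0), (6.5, 8.0)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(points, labels, (2.0, 2.0), k=3))  # A
print(knn_predict(points, labels, (6.0, 7.0), k=3))  # B
```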
28
Q

What is k?

A

k, the number of neighbors that vote, controls the tradeoff between underfitting and overfitting the data

29
Q

What happens when there is a small k?

A
  • Good at capturing fine-grained patterns
  • May overfit; sensitive to local variations in the training data
30
Q

What happens when there is a large k?

A
  • Makes stable predictions by averaging over lots of examples
  • May underfit because the model is too general and oversimplifies the underlying patterns in the data
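
A toy sketch of how small and large k behave, assuming one-dimensional inputs and a made-up training set in which one point is deliberately mislabeled. The helper `knn_1d` is a hypothetical name for this illustration.

```python
from collections import Counter


def knn_1d(train, query, k):
    """Majority vote among the k training values closest to `query`."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Class A lives near 1, class B near 5; the point at 1.5 is mislabeled noise.
train = [(1.0, "A"), (1.2, "A"), (1.4, "A"), (1.5, "B"), (1.6, "A"),
         (5.0, "B"), (5.2, "B")]

for k in (1, 3, len(train)):
    # Query 1.52 sits in the A region; query 5.1 sits in the B region.
    print(k, knn_1d(train, 1.52, k), knn_1d(train, 5.1, k))
```

With k = 1 the query at 1.52 copies the noisy label B (overfitting). With k = 3 both queries are classified correctly. With k = 7, every query gets the overall majority label A, including the one at 5.1 (underfitting).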
31
Q

How do you balance k?

A
  • The optimal k depends on the number of data points (n)
  • As a rule of thumb, choose k = 3
  • k < sqrt(n)
32
Q

What is a validation set used for?

A

Tuning hyperparameters

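
As a sketch of what tuning a hyperparameter on a validation set can look like, the snippet below tries several values of k for a one-dimensional KNN classifier and keeps the one with the best validation accuracy. The data, the candidate values, and the helper names `knn_1d` and `validation_accuracy` are all assumptions made for illustration.

```python
from collections import Counter


def knn_1d(train, query, k):
    """Majority vote among the k training values closest to `query`."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def validation_accuracy(train, valid, k):
    """Accuracy on held-out validation pairs, using only `train` as neighbors."""
    correct = sum(knn_1d(train, x, k) == t for x, t in valid)
    return correct / len(valid)


# Made-up data; the training point at 1.5 is deliberately mislabeled.
train = [(1.0, "A"), (1.2, "A"), (1.4, "A"), (1.5, "B"), (1.6, "A"),
         (5.0, "B"), (5.2, "B")]
valid = [(1.1, "A"), (1.52, "A"), (5.1, "B")]

# Score each candidate k on the validation set and keep the best one.
scores = {k: validation_accuracy(train, valid, k) for k in (1, 3, 7)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Here k = 3 classifies all three validation points correctly, while k = 1 and k = 7 each get one wrong, so `best_k` is 3. The test set stays untouched during this search.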
33
Q

What is cross-validation?

A

A way to estimate the generalization error of a learning algorithm when the dataset is too small for a simple train/test or train/validation split to give an accurate estimate

34
Q

What is k-fold cross-validation?

A
  • The dataset is partitioned into k non-overlapping subsets
  • The test error is estimated by averaging the test score across k trials
  • On trial i, the i-th subset is the test set and the rest is the training set
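
The partitioning described above can be sketched as follows. This is a simplified version that splits indices in order without shuffling, which real code would usually do first. The names `k_fold_indices`, `cross_validate`, and `score_fn` are hypothetical, and `score_fn` stands in for whatever train-then-evaluate routine is being measured.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k non-overlapping folds of near-equal size."""
    folds = []
    start = 0
    for i in range(k):
        # The first n % k folds get one extra element.
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(n, k, score_fn):
    """Average `score_fn(train_idx, test_idx)` over the k trials."""
    folds = k_fold_indices(n, k)
    scores = []
    for i, test_idx in enumerate(folds):
        # On trial i, fold i is the test set and the rest is the training set.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(score_fn(train_idx, test_idx))
    return sum(scores) / k


print(k_fold_indices(7, 3))  # [[0, 1, 2], [3, 4], [5, 6]]
```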
35
Q

What are the highlights of k-nearest neighbors?

A
  • Simple
  • No training
  • Easy to justify a classification to a customer
  • Handles multiclass problems easily
36
Q

What are the limitations of KNN on a large dataset?

A
  • It is a lazy learning technique
  • In the training phase KNN does nothing, so training is fast
  • Prediction becomes slow as the dataset grows, since the model has to compute the Euclidean distance from the given point to every point in the dataset
37
Q

What are the limitations of KNN under the curse of dimensionality?

A
  • The feature space becomes increasingly sparse as the number of dimensions (features) grows
  • In high-dimensional spaces, the notion of proximity or similarity becomes less meaningful
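
One way to see proximity losing meaning is to compare the nearest and farthest distance from a query point as the dimension grows. The sketch below assumes points drawn uniformly from the unit hypercube. The function name, sample size, and seed are arbitrary choices for illustration.

```python
import math
import random


def nearest_to_farthest_ratio(dim, n_points=200, seed=0):
    """Ratio of the nearest to the farthest distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return min(distances) / max(distances)


for dim in (2, 10, 100, 1000):
    print(dim, round(nearest_to_farthest_ratio(dim), 3))
```

In low dimensions the ratio is close to 0, meaning the nearest neighbor is much closer than the farthest point. As the dimension grows the ratio moves toward 1, so the "nearest" neighbor is barely closer than any other point.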
38
Q

What are the limitations of KNN on an imbalanced dataset?

A
  • The majority class typically has significantly more samples than the minority class
  • A large number of neighbors from the majority class can overpower the neighbors from the minority class
  • The majority neighbors dominate the decision-making process, biasing predictions towards the majority class
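
A small made-up example of the majority class overpowering the vote: ten "neg" points and only two "pos" points on a line, with a query placed right between the two "pos" points. The helper `knn_1d` is a hypothetical name for this sketch.

```python
from collections import Counter


def knn_1d(train, query, k):
    """Majority vote among the k training values closest to `query`."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


majority = [(x / 10, "neg") for x in range(10)]  # 0.0, 0.1, ..., 0.9
minority = [(2.0, "pos"), (2.1, "pos")]
train = majority + minority

# The query sits between the two minority points.
for k in (1, 3, 5):
    print(k, knn_1d(train, 2.05, k))
```

With k = 1 and k = 3 the query is labeled "pos". With k = 5 there are only two "pos" points available, so three distant "neg" neighbors outvote them and the prediction flips to "neg".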