Week 1: Data and KNN Flashcards

1
Q

What is the machine learning approach?

A

Programming an algorithm to learn automatically from data or experience, rather than hand-coding rules: uncovering patterns in data and building autonomous agents

2
Q

What should be emphasized in machine learning?

A
  • Predictive performance
  • Scalability
  • Autonomy
3
Q

Why might you want to use a learning algorithm?

A
  • Hard to code solution by hand (vision, speech)
  • System needs to adapt to a changing environment (spam detection)
  • Want the system to perform better than human programmers
  • Privacy/fairness (ranking search results)
4
Q

How does machine learning perform compared to humans?

A

It may perform better or worse than humans

5
Q

Define artificial intelligence

A
  • A subfield of CS concerned with computer programs that can solve problems humans are good at
  • E.g., vision, natural language
6
Q

Define machine learning

A

A subfield of AI focused on learning (tuning parameters) from data

7
Q

Define neural networks

A

Parametric models used in ML, loosely based on biological neurons

8
Q

What is deep learning?

A

Neural networks with multiple layers

9
Q

What is data science?

A

An emerging field that applies ML techniques to domain-specific problems

10
Q

What are some machine learning domains?

A
  • Computer vision
  • Speech recognition
  • Natural Language Processing
  • Recommender systems
  • Games
11
Q

Types of machine learning

A
  • Supervised learning
  • Semi-supervised learning
  • Reinforcement learning
  • Unsupervised learning
12
Q

What is supervised learning?

A
  • Uses labeled examples of the correct behavior
  • Predicts unknown values of the data using other known data
  • Classification (is this A or B?)
  • Anomaly detection (is this weird?)
  • Regression (how much/how many?)
13
Q

What is semi-supervised learning?

A

Utilizes both labeled and unlabeled data

14
Q

What is reinforcement learning?

A

A learning system that interacts with the world and learns to maximize a scalar reward signal

15
Q

What is unsupervised learning?

A
  • No labeled examples; instead, looks for interesting patterns in the data
  • Finds human-interpretable and previously unknown patterns that describe the unlabeled data
  • Clustering (how is the data organized?)
  • Association rule mining (are these related?)
16
Q

Why is machine learning so powerful nowadays?

A
  • Abundance of data
  • Computing power
17
Q

What are the steps in approaching a machine learning problem?

A
  1. Decide whether ML should be used on this problem
  2. Gather and organize data (pre-processing, cleaning, visualizing)
  3. Establish a baseline
  4. Choose a model
  5. Optimization
  6. Hyperparameter search
  7. Analyze performance and mistakes, then iterate back to step 4 or 2
18
Q

What is data?

A

Collection of objects and their attributes

19
Q

What does an ML training set consist of?

A
  • Inputs (vectors)
  • Labels
20
Q

Why do we use input vectors in machine learning?

A
  • Algorithms need to handle lots of data
  • A common strategy is mapping data to another space that is easy to manipulate (Representation)
  • Vectors are a good representation since we can do linear algebra
21
Q

How do regression and classification differ in terms of the training targets?

A

  • Regression: the target t is a real number
  • Classification: the target t is an element of a discrete set

22
Q

What are the classification metrics for evaluation?

A

Accuracy = # correct predictions / # test instances
Error = 1 - accuracy = # incorrect predictions / # test instances
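
The two metrics above can be sketched in a few lines of Python. The function names and the sample labels are hypothetical, chosen only for illustration.

```python
def accuracy(predictions, targets):
    """Fraction of test instances whose prediction matches the target."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("need at least one test instance")
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)


def error_rate(predictions, targets):
    """Fraction of test instances predicted incorrectly (1 - accuracy)."""
    return 1 - accuracy(predictions, targets)


# Hypothetical labels for five test instances: four predictions are correct.
targets = ["A", "B", "A", "A", "B"]
predictions = ["A", "B", "B", "A", "B"]
print(accuracy(predictions, targets))
print(error_rate(predictions, targets))
```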

23
Q

What is similarity?

A
  • The simplest method of learning we know
  • Classifying according to similar objects you’ve seen
  • aka manohorse
24
Q

What happens to nearest neighbor as more data points come in?

A

More complicated boundaries are possible

25
Q

How does nearest neighbor behave in the presence of noise?

A

It is sensitive to noise or mislabeled data (class noise)

26
Q

What is the solution to noisy data?

A

Have the k nearest neighbors vote and pick the majority class

27
Q

What are the steps for k-nearest neighbors?

A
  • Calculate the distance between the new data point and every data point in the set
  • Identify the k points with the shortest distance to the new point; these are the k nearest neighbors
  • Among the nearest neighbors, count how many points belong to each class and pick the majority
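
The three steps above can be sketched as follows. This is a minimal illustration, assuming Euclidean distance and a simple tie-break (the class seen first among the nearest neighbors wins); the function name `knn_predict` and the toy data are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Step 1: distance from the query to every training point (Euclidean).
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Step 2: keep the k points with the shortest distance.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: count the labels among the neighbors and pick the majority.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical 2-D toy data: two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.0), (6.5, 8.0)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(points, labels, (1.2, 1.4), k=3))
print(knn_predict(points, labels, (6.8, 6.9), k=3))
```

Note that all the work happens at prediction time; there is no separate fitting step.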
28
Q

What is k?

A

k is the number of neighbors that vote. It controls the tradeoff between underfitting (k too large) and overfitting (k too small)

29
Q

What happens when there is a small k?

A
  • Good at capturing fine-grained patterns
  • May overfit: sensitive to local variations in the training data
30
Q

What happens when there is a large k?

A
  • Makes stable predictions by averaging over lots of examples
  • May underfit because the model is too general and oversimplifies the underlying patterns in the data
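
A small sketch of both extremes, assuming Euclidean distance and hypothetical toy data with one mislabeled point. With k=1 the prediction follows the mislabeled neighbor; with k=3 the vote outweighs it; with k equal to the whole training set, every query gets the overall majority class.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Majority vote among the k training points closest to `query`."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Hypothetical toy data: an "A" cluster, a "B" cluster, and one
# mislabeled point (class noise) sitting inside the "A" cluster.
train = [
    ((1.0, 1.0), "A"), ((1.0, 2.0), "A"), ((2.0, 1.0), "A"),
    ((2.0, 2.0), "A"), ((1.5, 1.5), "A"),
    ((1.2, 1.2), "B"),  # mislabeled: surrounded by "A" points
    ((6.0, 6.0), "B"), ((6.0, 7.0), "B"), ((7.0, 6.0), "B"),
]

near_noise = (1.25, 1.25)    # deep inside the "A" cluster
in_b_cluster = (6.5, 6.5)    # deep inside the "B" cluster

for k in (1, 3, len(train)):
    print(k, knn_predict(train, near_noise, k), knn_predict(train, in_b_cluster, k))
```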
31
Q

How do you balance k?

A
  • Optimal k depends on the number of data points (n)
  • As a rule of thumb, choose k=3
  • k < sqrt(n)
32
Q

What is a validation set used for?

A

Tuning hyperparameters
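
As an illustration, k in k-nearest neighbors is a hyperparameter that can be tuned on a validation set. The sketch below assumes Euclidean distance, a hypothetical training set with one mislabeled point, and a hypothetical held-out validation set; `choose_k` is an illustrative helper, not a library function.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Majority vote among the k training points closest to `query`."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def validation_accuracy(train, validation, k):
    """Accuracy on the held-out validation set for a given k."""
    correct = sum(knn_predict(train, x, k) == t for x, t in validation)
    return correct / len(validation)


def choose_k(train, validation, candidates):
    """Return the candidate k with the highest validation accuracy."""
    # max() keeps the first candidate when several tie.
    return max(candidates, key=lambda k: validation_accuracy(train, validation, k))


# Hypothetical training data with one mislabeled point at (1.2, 1.2).
train = [
    ((1.0, 1.0), "A"), ((1.0, 2.0), "A"), ((2.0, 1.0), "A"),
    ((2.0, 2.0), "A"), ((1.5, 1.5), "A"), ((1.2, 1.2), "B"),
    ((6.0, 6.0), "B"), ((6.0, 7.0), "B"), ((7.0, 6.0), "B"),
]
# Hypothetical validation data, kept separate from the training data.
validation = [
    ((1.25, 1.25), "A"), ((1.8, 1.8), "A"),
    ((6.5, 6.5), "B"), ((6.2, 6.8), "B"),
]

for k in (1, 3, 9):
    print(k, validation_accuracy(train, validation, k))
print("chosen k:", choose_k(train, validation, (1, 3, 9)))
```

The test set stays untouched during this search, so it can still give an honest estimate of performance afterwards.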

33
Q

What is cross validation?

A

A technique for estimating the generalization error of a learning algorithm when the dataset is too small for a single train/test or train/validation split to give an accurate estimate

34
Q

What is k-fold cross validation?

A
  • The dataset is partitioned into k non-overlapping subsets (folds)
  • On trial i, the i-th subset is the test set and the rest is the training set
  • The test error is estimated by averaging the test score across the k trials
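
The procedure above can be sketched as follows. The helper names, the majority-class scoring function, and the data are hypothetical. Shuffling before splitting is common in practice but is omitted here to keep the run repeatable.

```python
from collections import Counter


def k_fold_splits(n, k):
    """Yield (train_indices, test_indices) for k non-overlapping folds."""
    indices = list(range(n))
    # Fold sizes differ by at most one when n is not divisible by k.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_validate(score_fn, data, k):
    """Average the test score over k trials; trial i tests on fold i."""
    scores = []
    for train_idx, test_idx in k_fold_splits(len(data), k):
        train = [data[i] for i in train_idx]
        test = [data[i] for i in test_idx]
        scores.append(score_fn(train, test))
    return sum(scores) / k


def majority_baseline_score(train, test):
    """Accuracy of always predicting the most common training label."""
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return sum(label == majority for _, label in test) / len(test)


# Hypothetical dataset of (input, label) pairs; the baseline ignores the input.
data = [(float(i), label) for i, label in enumerate("AAABAABAA")]
print(cross_validate(majority_baseline_score, data, k=3))
```

Any scoring function with the same `(train, test)` signature could be passed in place of the baseline, for example one that fits and evaluates a KNN classifier.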
35
Q

What are the highlights of k-nearest neighbor?

A
  • Simple
  • No training phase
  • Easy to justify a classification to a customer
  • Handles multiclass problems easily
36
Q

What are the limitations of KNN? (Large datasets)

A
  • KNN is a lazy learning technique
  • In the training phase KNN does nothing, so training is fast
  • Prediction becomes slow as the dataset grows, since the model has to calculate the Euclidean distance from the given point to every point in the dataset
37
Q

What are the limitations of KNN? (Curse of dimensionality)

A
  • The feature space becomes increasingly sparse as the number of dimensions (features) grows
  • In high-dimensional spaces, the notion of proximity or similarity becomes less meaningful
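
One way to see why proximity loses meaning is to compare the nearest and farthest distances from a query point as the dimension grows. The sketch below assumes points drawn uniformly from the unit hypercube and a fixed random seed; the function name and parameters are hypothetical. A ratio close to 1 means the nearest neighbor is barely closer than the farthest point.

```python
import math
import random


def nearest_to_farthest_ratio(dim, n_points=200, seed=0):
    """Ratio of the nearest to the farthest distance from a random query."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    query = [rng.random() for _ in range(dim)]
    distances = [math.dist(p, query) for p in points]
    return min(distances) / max(distances)


# The ratio moves toward 1 as the number of dimensions increases.
for dim in (2, 10, 100, 1000):
    print(dim, round(nearest_to_farthest_ratio(dim), 3))
```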
38
Q

What are the limitations of KNN? (Imbalanced datasets)

A
  • The majority class typically has significantly more samples than the minority class
  • The large number of neighbors from the majority class can overpower the neighbors from the minority class
  • The majority class then dominates the decision-making process, biasing predictions towards it
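
A minimal sketch of this effect, assuming Euclidean distance and hypothetical toy data with two "rare" points and eight "common" points. The query sits right next to the rare points, yet once k exceeds the number of rare samples the common class wins the vote.

```python
import math
from collections import Counter


def knn_votes(train, query, k):
    """Count the class labels among the k training points closest to `query`."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest)


# Hypothetical imbalanced data: 2 minority points, 8 majority points.
train = [
    ((0.0, 0.0), "rare"), ((0.2, 0.1), "rare"),
    ((1.0, 1.0), "common"), ((1.2, 0.9), "common"), ((0.9, 1.3), "common"),
    ((1.5, 1.5), "common"), ((1.1, 1.4), "common"), ((2.0, 2.0), "common"),
    ((1.8, 1.2), "common"), ((1.3, 1.7), "common"),
]
query = (0.1, 0.1)  # right next to the two minority points

for k in (1, 3, 7):
    votes = knn_votes(train, query, k)
    print(k, dict(votes), "->", votes.most_common(1)[0][0])
```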