Fundamentals Flashcards
Keeps the fundamentals of machine learning at your fingertips. (35 cards)
Training Set
The examples that the system uses to learn are called the training set. Each training example is called a training instance (or sample).
Why Machine Learning?
- ML programs are often much shorter, easier to maintain, and most likely more accurate than hand-written rule lists.
- The ML program learns automatically.
- ML solves problems that are either too complex for traditional approaches or have no known algorithm.
- ML can help humans learn.
Machine Learning is great for?
- Problems for which existing solutions require a lot of hand-tuning or long lists of rules: one Machine Learning algorithm can often simplify code and perform better.
- Complex problems for which there is no good solution at all using a traditional approach: the best Machine Learning techniques can sometimes find a solution.
- Fluctuating environments: a Machine Learning system can adapt to new data.
- Getting insights into complex problems and finding patterns in large amounts of data.
Types of Machine Learning Systems?
- Whether or not they are trained with human supervision (supervised, unsupervised, semi-supervised, and Reinforcement Learning)
- Whether or not they can learn incrementally on the fly (online versus batch learning)
- Whether they work by simply comparing new data points to known data points, or instead detect patterns in the training data and build a predictive model, much like scientists do (instance-based versus model-based learning)
What are typical supervised learning tasks?
- Classification (to assign instances to categories)
- Regression (to predict a target value)
Anomaly detection (to detect outliers) and association rule learning (to discover interesting relationships between attributes) are usually treated as unsupervised tasks.
What is the difference between attribute and feature?
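An attribute is a data type (e.g., "Mileage"), while a feature generally means an attribute plus its value (e.g., "Mileage = 15,000"). Many people use the two words interchangeably.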
Which are the most important supervised learning algorithms?
- k-Nearest Neighbors
- Linear Regression
- Logistic Regression
- Support Vector Machines (SVMs)
- Decision Trees and Random Forests
- Neural networks
Which are the most important unsupervised learning algorithms?
Clustering (groups similar data)
- k-Means
- Hierarchical Cluster Analysis (HCA)
- Expectation Maximization
Visualization (plots 2D or 3D representations) and dimensionality reduction (data simplification)
- Principal Component Analysis (PCA)
- Kernel PCA
- Locally-Linear Embedding (LLE)
- t-distributed Stochastic Neighbor Embedding (t-SNE)
Association rule learning
- Apriori
- Eclat
What is feature extraction?
Merging several correlated features into one while losing as little information as possible.
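A minimal sketch of the idea, assuming two strongly correlated attributes (a car's mileage and its age) that are standardized and averaged into a single feature. The data, the `standardize` helper, and the name `wear` are hypothetical; a dimensionality reduction algorithm such as PCA would normally do this more carefully.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to zero mean and unit standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical data: mileage (km) and age (years) rise together.
mileage_km = [20_000, 45_000, 80_000, 120_000, 150_000]
age_years = [1, 3, 5, 8, 10]

# Merge the two standardized features into one "wear" feature by averaging.
wear = [
    (m + a) / 2
    for m, a in zip(standardize(mileage_km), standardize(age_years))
]
```

The model can then be trained on `wear` alone instead of on both original attributes.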
When should you use online learning algorithms?
- When you need a reactive system that receives data as a continuous flow, e.g., a stock price predictor.
- When autonomous learning is needed, e.g., a rover on Mars.
- When computing resources are limited, e.g., a smartphone app.
What is the learning rate?
One important parameter of online learning systems is how fast they should adapt to changing data: this is called the learning rate.
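The effect of the learning rate can be sketched with a one-parameter model that is updated one instance at a time. This is an illustration under stated assumptions, not a production algorithm: the model `y = w * x`, the function `online_fit`, and the data stream are all hypothetical.

```python
def online_fit(stream, learning_rate):
    """Update a one-parameter model y = w * x one instance at a time."""
    w = 0.0
    for x, y in stream:
        error = w * x - y               # prediction error on the new instance
        w -= learning_rate * error * x  # gradient step on half the squared error
    return w

# Hypothetical stream: the true relationship shifts from y = 2x to y = 5x.
stream = [(1.0, 2.0)] * 20 + [(1.0, 5.0)] * 3

fast = online_fit(stream, learning_rate=0.5)   # moves most of the way to 5
slow = online_fit(stream, learning_rate=0.05)  # still close to the old data
```

A high learning rate adapts quickly to the shift but also forgets old data quickly; a low learning rate has more inertia and is less sensitive to noise in new data.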
What is instance-based learning?
The system learns the examples by heart, then generalizes to new cases using a similarity measure.
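As a sketch, k-Nearest Neighbors is one instance-based method: it stores the training set and classifies a new point by majority vote among its closest stored instances, with Euclidean distance as the similarity measure. The function `knn_predict` and the toy data are hypothetical.

```python
from collections import Counter
import math

def knn_predict(training_set, query, k=3):
    """Classify query by majority vote among its k nearest training instances."""
    by_distance = sorted(training_set, key=lambda inst: math.dist(inst[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training set: (features, label) pairs.
training_set = [
    ((1.0, 1.0), "ham"), ((1.5, 2.0), "ham"), ((2.0, 1.0), "ham"),
    ((6.0, 6.0), "spam"), ((7.0, 5.5), "spam"), ((6.5, 7.0), "spam"),
]

prediction_a = knn_predict(training_set, (1.2, 1.4))
prediction_b = knn_predict(training_set, (6.4, 6.1))
```

There is no training step here beyond storing the examples; all the work happens at prediction time.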
What is model-based learning?
Another way to generalize from a set of examples is to build a model of these examples, then use that model to make predictions.
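A minimal sketch, assuming the chosen model is a straight line fitted by ordinary least squares. The function names and the four data points are hypothetical.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    mx, my = mean(xs), mean(ys)
    slope = (
        sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        / sum((x - mx) ** 2 for x in xs)
    )
    intercept = my - slope * mx
    return slope, intercept

# Hypothetical training examples.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]

slope, intercept = fit_line(xs, ys)

def predict(x):
    """Use the fitted model, not the stored examples, to predict a new case."""
    return slope * x + intercept
```

Once `slope` and `intercept` are learned, the training examples are no longer needed to make predictions.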
How do you define the performance measure of your algorithms?
You can either define a utility function (or fitness function) that measures how good your model is, or you can define a cost function that measures how bad it is.
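For illustration, mean squared error is one common cost function for regression, and its negation can serve as a simple utility function. The predictions from the two candidate models below are made up.

```python
def mse(predictions, targets):
    """Mean squared error: lower means a better fit."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

targets = [3.0, 5.0, 7.0]
model_a = [2.5, 5.0, 7.5]  # hypothetical predictions, close to the targets
model_b = [1.0, 5.0, 9.0]  # hypothetical predictions, further away

cost_a = mse(model_a, targets)
cost_b = mse(model_b, targets)

# The same comparison expressed as a utility: higher means a better fit.
utility_a, utility_b = -cost_a, -cost_b
```

Either view ranks `model_a` ahead of `model_b`; the choice between cost and utility is mostly a matter of convention.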
What is the lifecycle of a typical ML project?
- You study the data.
- You select a model.
- You train it on the training data (i.e., the learning algorithm searches for the model parameter values that minimize a cost function).
- Finally, you apply the model to make predictions on new cases (this is called inference), hoping that this model will generalize well.
What are the main challenges of ML?
- Insufficient quantity of training data. It takes a lot of data for most ML algorithms to work properly.
- Non-representative training data. If the sample size is too small you can have sampling noise. If the sampling method is flawed you can have a sampling bias.
- Poor quality data. If your training data is full of errors, outliers, and noise, it will make it harder for the system to detect the underlying patterns, so your system is less likely to perform well.
- Irrelevant features.
- Overfitting the training data.
- Underfitting the training data.
What is overfitting?
The model performs well on the training data, but it does not generalize well.
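One way to see this is with a model that simply memorizes its training data. The sketch below assumes hypothetical noisy data around the line y = 2x and a "memorizer" that returns the target of the closest training instance; both are illustrative choices.

```python
import random

random.seed(0)

def noisy_line(n):
    """Hypothetical data: y = 2x plus Gaussian noise."""
    xs = [random.uniform(0, 10) for _ in range(n)]
    return [(x, 2 * x + random.gauss(0, 2)) for x in xs]

train, test = noisy_line(30), noisy_line(30)

def memorizer(x):
    """Predict the target of the closest training instance, noise included."""
    return min(train, key=lambda inst: abs(inst[0] - x))[1]

def mse(model, data):
    """Mean squared error of a model over (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

train_error = mse(memorizer, train)  # zero: each point is its own nearest neighbor
test_error = mse(memorizer, test)    # greater than zero on unseen data
```

The gap between `train_error` and `test_error` is the symptom of overfitting: the model has learned the noise in the training set along with the signal.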
What is feature engineering?
Feature engineering is coming up with a good set of features to train on. It involves:
- Feature selection: selecting the most useful features to train on among existing features.
- Feature extraction: combining existing features to produce a more useful one (as we saw earlier, dimensionality reduction algorithms can help).
- Creating new features by gathering new data.
What is regularization?
Constraining the parameters of the learning model to prevent it from overfitting.
What are the solutions for overfitting?
- To simplify the model by selecting one with fewer parameters (e.g., a linear model rather than a high-degree polynomial model), by reducing the number of attributes in the training data, or by constraining the model
- To gather more training data
- To reduce the noise in the training data (e.g., fix data errors and remove outliers)
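What is a regularization hyperparameter?
A hyperparameter is a parameter of the learning algorithm, not of the model. It is set before training and stays constant during training. The regularization hyperparameter controls the amount of regularization applied during learning.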
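As a sketch, ridge (L2) regularization on a one-parameter model shows how the hyperparameter constrains the learned parameter. The model `y = w * x` with no intercept, the name `alpha`, and the data are assumptions made for illustration.

```python
def ridge_slope(xs, ys, alpha):
    """Slope w minimizing sum((w*x - y)**2) + alpha * w**2 (no intercept term)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)

# Hypothetical training data, roughly y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.1, 5.9]

unregularized = ridge_slope(xs, ys, alpha=0.0)  # plain least squares
mild = ridge_slope(xs, ys, alpha=1.0)           # slope pulled slightly toward 0
strong = ridge_slope(xs, ys, alpha=100.0)       # slope pulled strongly toward 0
```

Here `alpha` is fixed before fitting and is not learned from the data, while the slope is the model parameter that training finds. A very large `alpha` flattens the model and risks underfitting.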
What is underfitting?
It occurs when your model is too simple to learn the underlying structure of the data.
How do you fix underfitting?
- Selecting a more powerful model, with more parameters
- Feeding better features to the learning algorithm (feature engineering)
- Reducing the constraints on the model (e.g., reducing the regularization hyperparameter)
What is machine learning?
Machine Learning is about building systems that can learn from data. Learning means getting better at some task, given some performance measure.