What is the goal of machine learning?
Find the pattern, apply the pattern.
The goal: filter useful information from great quantities of data by learning from known examples to find a pattern in the data, then determine structure or generate forecasts without human intervention.
What is supervised machine learning?
In supervised machine learning, the model learns from labelled examples: each training observation pairs input features with a known target, and the model learns to predict the target for new inputs.
In supervised machine learning, what characteristics of the target variable determine whether a problem is classified as regression or classification?
If the target variable is continuous (e.g., a price), the problem is regression; if it is categorical (e.g., default vs no default), it is classification.
What is unsupervised machine learning and what kinds of problems are appropriate here?
Clustering: Groups data points into clusters based on similarity
Dimensionality Reduction: Reduces the number of features while preserving important information
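A minimal clustering sketch, assuming scikit-learn is available; the data is synthetic, built as two well-separated groups purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated groups of 2-D points (synthetic data).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=[0, 0], scale=0.2, size=(20, 2))
group_b = rng.normal(loc=[5, 5], scale=0.2, size=(20, 2))
X = np.vstack([group_a, group_b])

# Ask for 2 clusters; KMeans groups points by proximity to cluster centres,
# with no labels supplied -- this is what makes it unsupervised.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Points from the same group end up sharing a cluster label.
print(set(labels[:20]), set(labels[20:]))
```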
What is deep learning and when is it used?
Deep learning is a subset of machine learning that uses artificial neural networks with many layers (hence “deep”) to learn patterns from data.
Inspired by how the human brain works.
Learns directly from raw data (like images, text, or sound).
Automatically extracts features—no need for manual feature engineering.
What is generalisation?
Generalisation refers to a model’s ability to perform well on new, unseen data—not just the data it was trained on.
A well-generalised model captures the underlying patterns in the data.
It avoids being too specific to the training data.
The goal of machine learning is to build models that generalise well.
Example:
If you train a model to predict house prices using data from London, and it performs well on new data from Manchester, it has good generalisation.
What is overfitting?
Overfitting happens when a model learns the noise and random fluctuations in the training data instead of the true patterns.
It performs very well on training data, but poorly on new data.
It’s like memorising answers for a test rather than understanding the material.
Signs of Overfitting:
High accuracy on training data
Low accuracy on validation/test data
Complex models with too many parameters
Example:
A model that perfectly predicts house prices in your training set but fails to predict prices for new listings is overfitting.
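The memorisation idea can be sketched numerically: a very flexible model (here a degree-9 polynomial, chosen only for illustration) fits 10 noisy training points almost exactly, yet does far worse on fresh points drawn from the same linear process.

```python
import numpy as np

rng = np.random.default_rng(42)
true_fn = lambda x: 2 * x + 1                    # the true pattern is linear

x_train = np.linspace(0, 1, 10)
y_train = true_fn(x_train) + rng.normal(0, 0.3, 10)   # noisy observations

# A degree-9 polynomial through 10 points can interpolate them (memorise).
coeffs = np.polyfit(x_train, y_train, deg=9)

x_test = np.linspace(0.05, 0.95, 50)                  # unseen points
y_test = true_fn(x_test) + rng.normal(0, 0.3, 50)

train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
print(train_mse, test_mse)   # tiny training error, much larger test error
```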
What are the 3 sources that out-of-sample error can originate from?
Out-of-sample error = bias error + variance error + base error
What does high bias look like on a graph?
High Bias (Underfitting)
In-sample accuracy is low and flat.
Out-of-sample accuracy is also low and doesn’t improve with more data.
The model is too simple to learn the pattern.
What does high variance error look like on a graph?
High Variance (Overfitting)
In-sample accuracy is very high.
Out-of-sample accuracy starts low and improves slowly.
The model memorises training data but struggles to generalise.
What does a robust model look like?
Good Generalisation
Both in-sample and out-of-sample accuracy improve steadily.
The gap between the two narrows as training size increases.
This is the ideal learning behaviour.
What is the holdout sample problem and what is used to solve it?
A holdout sample is a portion of your dataset that you set aside to test your model after training it. The holdout sample problem refers to the risk and limitations of relying on just one split of the data.
K-Fold Cross Validation rotates the holdout set across different parts of the data:
Every data point gets to be in the test set once.
Every data point gets to be in the training set k−1 times.
You get k performance scores, which you average for a more reliable estimate.
How does k-fold cross validation work?
Step 1: Randomly shuffle the dataset
Before splitting the data, you shuffle it randomly.
This ensures that the data is mixed well, so each fold is likely to be representative of the overall dataset (not grouped by time, location, or category).
✅ Why this matters: Without shuffling, you might accidentally split the data in a biased way — for example, all London assets in one fold, all Amsterdam assets in another.
Step 2: Split the data into k folds
Divide the shuffled data into k equal-sized parts (called folds).
If you have 100 data points and choose k = 5, each fold will have 20 points.
Step 3: Train and validate k times
For each of the k iterations:
Hold out one fold as the validation set.
Train the model on the remaining k−1 folds.
Evaluate the model on the validation fold and record the performance.
Each fold gets to be the validation set once.
Step 4: Average the results
After all k iterations, you average the performance scores (e.g., accuracy, RMSE).
This gives you a more reliable and stable estimate of how your model performs on unseen data.
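The four steps above can be sketched with scikit-learn's KFold (assumed available); the regression data is synthetic and the model choice (linear regression) is just for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                          # 100 points, 3 features
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, 100)

# Steps 1-2: shuffle, then split into k = 5 folds of 20 points each.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Step 3: each fold is held out once; train on the other k-1 folds.
scores = []
for train_idx, val_idx in kf.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[val_idx], y[val_idx]))  # R^2 on held-out fold

# Step 4: average the k scores for a more stable estimate.
print(np.mean(scores))
```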
What are the benefits of k-fold cross validation?
Every observation is used for both training and validation, so no data is wasted.
The averaged score is a more reliable, lower-variance estimate of out-of-sample performance than a single holdout split.
It reduces the risk that one unlucky, unrepresentative split distorts the performance estimate.
What are the 6 examples of supervised machine learning algos?
Penalised regression (e.g., LASSO)
Support vector machines (SVM)
K-nearest neighbours (KNN)
Classification and regression trees (CART)
Ensemble learning (e.g., random forests)
Neural networks
What is LASSO?
It’s a type of penalised regression that:
Adds a penalty based on the absolute values of the coefficients. The penalty increases with the number of features included, much like adjusted R².
Can shrink some coefficients to exactly zero, effectively removing those variables from the model.
Imagine you’re building a model to predict asset value using 20 features (e.g., location, size, tenant type, lease length, etc.).
Some features might be irrelevant or redundant.
LASSO helps by automatically selecting the most important ones.
It does this by penalising large coefficients, and zeroing out the ones that don’t help much.
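The scenario above can be sketched with scikit-learn's Lasso (assumed available): of 10 synthetic features, only the first two actually drive the target, and LASSO zeroes out most of the rest.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))          # 10 candidate features (synthetic)
# Only the first two features actually matter.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.1, 200)

# scikit-learn calls the penalty strength "alpha" (the lambda of the notes).
model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)   # coefficients on the irrelevant features shrink to ~0
```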
Why is LASSO useful?
Feature selection: It simplifies the model by removing unimportant variables. LASSO automatically performs feature selection since it eliminates the least essential features from the model.
Prevents overfitting: Especially helpful when you have more features than observations.
Improves interpretability: You end up with a cleaner, more parsimonious model (fewer predictor variables), since every retained feature must make an adequate contribution.
What is lambda and what effect does it have on LASSO?
Lambda (λ) is the tuning parameter that decides how much we want to penalise the flexibility of our model. As λ rises, the coefficients shrink, reducing variance and consequently avoiding overfitting.
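A small sweep illustrates lambda's effect, again on synthetic data with scikit-learn's Lasso (which names lambda "alpha"): as the penalty grows, the total size of the coefficients shrinks and more of them are driven to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 8))
y = 4.0 * X[:, 0] + 2.0 * X[:, 1] + 1.0 * X[:, 2] + rng.normal(0, 0.5, 200)

# Fit the same data at three penalty strengths.
results = {}
for alpha in [0.01, 0.5, 5.0]:
    results[alpha] = Lasso(alpha=alpha).fit(X, y).coef_
    # Larger alpha -> smaller coefficients, fewer of them nonzero.
    print(alpha, np.count_nonzero(results[alpha]), np.abs(results[alpha]).sum())
```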
What is regularisation?
Regularisation is a technique used to prevent overfitting in machine learning models. Overfitting happens when a model learns the training data too well — including its noise — and performs poorly on new, unseen data.
Regularisation helps by penalising complexity, encouraging the model to be simpler and more generalisable.
What is a support vector machine (SVM) and how does it work?
A Support Vector Machine (SVM) is a tool that learns from examples to classify data into categories. You start by giving it examples from two groups. It then finds the best way to separate them using a straight line (or a flat surface called a hyperplane). This line is placed to create the widest possible gap between the groups, making it easier to decide where new examples belong. This method works best when the data can be clearly separated with a straight line — known as linear separability.
What is a hyperplane?
Hyperplane: A decision boundary that separates different classes in the data.
What might a SVM be used for? Provide an investment related example.
SVMs can be used for:
Credit scoring: Classifying borrowers as likely to default or not.
Fraud detection: Identifying unusual patterns in transaction data.
Market prediction: Predicting stock price movements or asset returns.
Portfolio optimisation: Classifying assets based on risk-return profiles.
E.g. Blue Dots: Represent Bull Market conditions — high momentum, low volatility.
Red Dots: Represent Bear Market conditions — low momentum, high volatility.
Asset A: High growth, low dividend → Growth portfolio
Asset B: Stable returns, high dividend → Income portfolio
Transaction 1: Normal amount, usual location → Not fraud
Transaction 2: Large amount, unusual location → Fraud
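The bull/bear example can be sketched with scikit-learn's SVC (assumed available) on synthetic momentum/volatility data; the regime parameters are invented for illustration.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
# Bull regime: high momentum, low volatility.  Bear regime: the reverse.
bull = np.column_stack([rng.normal(2.0, 0.3, 30), rng.normal(0.5, 0.1, 30)])
bear = np.column_stack([rng.normal(-1.0, 0.3, 30), rng.normal(2.0, 0.2, 30)])
X = np.vstack([bull, bear])
y = np.array([1] * 30 + [0] * 30)       # 1 = bull, 0 = bear

# A linear kernel finds the widest-margin separating hyperplane.
clf = SVC(kernel="linear").fit(X, y)

# New observation: strong momentum, calm market -> should fall on the bull side.
print(clf.predict([[1.8, 0.6]]))
```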
What is KNN and how does it work?
KNN classifies a new data point based on how its neighbors are classified. It looks at the ‘K’ closest points (neighbors) in the training data and assigns the most common label among them.
You have a dataset with labeled examples (e.g., stocks labeled as “rising” or “falling”).
You choose a value for K (e.g., 3 or 5).
For a new data point, KNN:
Measures the distance (usually Euclidean) between the new point and all existing points.
Finds the K closest points.
Assigns the label that is most common among those K neighbors.
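The three steps above can be sketched by hand with NumPy and Euclidean distance; the tiny "rising"/"falling" dataset is made up for illustration.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Step 1: Euclidean distance from the new point to every training point.
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Step 2: indices of the k closest points.
    nearest = np.argsort(dists)[:k]
    # Step 3: majority label among those k neighbours.
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy data: "rising" stocks near (1, 1), "falling" stocks near (-1, -1).
X_train = np.array([[1.0, 1.1], [0.9, 1.0], [-1.0, -0.9], [-1.1, -1.0]])
y_train = np.array(["rising", "rising", "falling", "falling"])

print(knn_predict(X_train, y_train, np.array([0.8, 0.9]), k=3))  # "rising"
```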
What are the main challenges of KNN?
The main challenge facing KNN is defining “near”: the choice of distance metric used to model nearness matters, because an inappropriate measure produces poorly performing models, and selecting the right metric can be quite subjective.