Week 1: Introductions – Organisation & ML Basics Flashcards
(11 cards)
Describe unsupervised, semi-supervised and supervised learning
Unsupervised:
- Unlabelled data
- Understand relationships between the features
- Find correlations
Semi-supervised:
- A combination of labelled and unlabelled data
Supervised:
- Classification or regression on labelled data
- Optimise a cost (loss) function
What is the goal of classification?
We want to learn a decision boundary:
a function f(x; θ) that predicts class A if some condition is met and class B otherwise.
The loss is defined as:
L(θ) = (1/N) · Σ_{n=1}^{N} l(y_n, f(x_n; θ))
L(θ) is the average of the individual losses over all N training points. Each individual loss l compares the true target y_n with the class the model predicted, f(x_n; θ), for data point x_n.
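As a minimal sketch of the average loss described above: the threshold classifier `f`, the 0-1 loss `zero_one_loss` and the toy data are all assumptions made for illustration, not part of the original card.

```python
def f(x, theta):
    """Hypothetical decision boundary: predict "A" if x exceeds theta, else "B"."""
    return "A" if x > theta else "B"


def zero_one_loss(y_true, y_pred):
    """Individual loss l: 0 for a correct prediction, 1 for a wrong one."""
    return 0 if y_true == y_pred else 1


def average_loss(xs, ys, theta):
    """L(theta): the mean of the individual losses over all N training points."""
    losses = [zero_one_loss(y, f(x, theta)) for x, y in zip(xs, ys)]
    return sum(losses) / len(losses)


# Toy training data, made up for this example.
xs = [0.5, 1.5, 2.5, 3.5]
ys = ["B", "B", "A", "A"]

print(average_loss(xs, ys, theta=2.0))  # every point is on the correct side
print(average_loss(xs, ys, theta=3.0))  # the point at 2.5 is misclassified
```

Moving theta changes where the boundary sits, and the average loss measures how many training points end up on the wrong side.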
What is regression?
Predict a continuous (numeric) output for a given input
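One possible illustration is fitting a straight line by least squares and using it to predict the output for a new input. The helper `fit_line` and the data are hypothetical; this is a sketch of one simple regression model, not the only one.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one input feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx  # assumes the xs are not all identical
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data lying exactly on y = 2x + 1, made up for this example.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept  # predicted output for the new input x = 5
print(slope, intercept, prediction)
```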
Translation vs Transcription - what’s the difference?
Transcription: Convert unstructured input to text (audio → English text)
Translation: Convert one language to another (English text → Spanish text)
What is anomaly detection?
Detecting whether something in the data is unusual, for example when monitoring traffic.
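A very simple way to flag unusual values is a z-score rule: mark anything that lies more than a chosen number of standard deviations from the mean. The function `find_anomalies`, the threshold of 2.0 and the hourly counts are assumptions for illustration; real systems often use more robust methods.

```python
import statistics


def find_anomalies(values, threshold=2.0):
    """Return the values whose z-score magnitude exceeds the threshold."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mean) / std > threshold]


# Hypothetical hourly traffic counts with one obvious spike.
counts = [100, 102, 98, 101, 99, 103, 97, 250]
print(find_anomalies(counts))
```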
What is “Structuring/Compression” in ML?
Re-organize data with respect to the relationships between its elements. PCA is an example.
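To make the PCA example concrete, here is a rough sketch that finds the first principal component and compresses 2-D points to one number each. It swaps in power iteration on the covariance matrix for a full eigendecomposition to stay dependency-free; the function name and data are hypothetical, and in practice a library implementation would normally be used.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return (direction, projections) for the top principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centred data.
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly multiply by cov and renormalise.
    # Assumes the data has non-zero variance and that the all-ones start
    # vector is not orthogonal to the top component.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Compress: each d-dimensional point becomes a single coordinate along v.
    projections = [sum(c[j] * v[j] for j in range(d)) for c in centred]
    return v, projections


# Toy points lying on the line y = 2x, made up for this example.
points = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
direction, compressed = first_principal_component(points)
print(direction, compressed)
```

Because the toy points lie on one line, a single coordinate per point is enough to describe them, which is the compression idea in miniature.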
Name some reasons why we do data exploration
- Central tendencies
- Basic measures of shape & dispersion
- Structure/patterns in the input data
- Achieve human understanding!
- Capture all your data well
- Check whether the labelling is complete
- Check for missing values
- Clean data if sensible
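A few of the checks listed above (central tendency, dispersion, missing values) can be sketched with the standard library. The `summarise` helper, the use of `None` to mark a missing value and the sample data are assumptions for illustration.

```python
import statistics


def summarise(values):
    """Basic exploration of one numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),      # central tendency
        "median": statistics.median(present),  # central tendency
        "std": statistics.stdev(present),      # dispersion (sample std)
        "min": min(present),
        "max": max(present),
    }


# Hypothetical column with two missing entries.
heights_cm = [170, 165, None, 180, 175, None, 160]
summary = summarise(heights_cm)
print(summary)
```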
What is “identical representation” in ML data preprocessing?
All data samples must have exactly the same structure/dimensions so they can be:
- Stacked into tensors (mathematical arrays)
- Processed in parallel as batches
- Fed through the same model architecture
Example: All images must be 32×32×3, all text sequences must be 500 tokens, etc.
What does “Cut or pad your data” mean in preprocessing?
Cut: Resize/crop larger data to the standard size
Pad: Add zeros/empty space to smaller data
Goal: All inputs have identical dimensions for tensor operations
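For sequences, cutting and padding might look like the sketch below. The helper name `cut_or_pad`, the target length of 3 and the choice to cut from the end and pad with zeros at the end are assumptions; other conventions (for example padding at the front) are equally possible.

```python
def cut_or_pad(sequence, target_length, pad_value=0):
    """Return a list of exactly target_length items."""
    if len(sequence) >= target_length:
        # Cut: keep only the first target_length items.
        return list(sequence[:target_length])
    # Pad: append pad_value until the target length is reached.
    return list(sequence) + [pad_value] * (target_length - len(sequence))


# Hypothetical token-id sequences of different lengths.
batch = [[5, 3, 8, 1, 9], [7, 2], [4, 4, 4]]
fixed = [cut_or_pad(seq, target_length=3) for seq in batch]
print(fixed)
```

After this step every row has the same length, so the batch can be stacked into a single rectangular array.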
What are the key supervised learning evaluation metrics and their formulas?
Confusion Matrix: TP, TN, FP, FN
Accuracy = (TP + TN)/(P + N) - Overall correctness, where P = TP + FN (actual positives) and N = TN + FP (actual negatives)
Precision (PPV) = TP/(TP + FP) - “Of predicted positives, how many were right?”
Recall (TPR) = TP/(TP + FN) - “Of actual positives, how many did we catch?”
F1-Score = 2·(Precision·Recall)/(Precision + Recall) - Harmonic mean of precision and recall
FP-rate (FPR) = FP/N - False alarm rate
ROC/AUC - ROC plots TPR against FPR as the decision threshold varies; AUC is the area under that curve
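The formulas above can be checked with a small worked example. The function `classification_metrics` and the confusion-matrix counts are made up for illustration, and the sketch assumes no denominator is zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the metrics above from the four confusion-matrix counts."""
    p = tp + fn  # actual positives
    n = tn + fp  # actual negatives
    precision = tp / (tp + fp)
    recall = tp / p
    return {
        "accuracy": (tp + tn) / (p + n),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "fpr": fp / n,
    }


# Hypothetical counts: 100 samples, 50 actual positives, 50 actual negatives.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```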
Please identify the key aspects of ML
Define task, represent your data, select metrics, develop your model