Project Flashcards

(45 cards)

1
Q

USP of our paper

A

o Identification of the vivax parasite: little research exists, yet it is the most geographically widespread species & no vaccine exists on the market yet <-> falciparum
o Zhao et al. (2020) tested a falciparum-trained model to identify vivax -> overfitting
o Binary classification task with this dataset (others only multi-label) -> because information about malaria presence is most important with high accuracy (the stage can be identified next) -> results show that the CNN1 model matched or outperformed multi-label classification on the same BBBC dataset at 98.3% (Li et al. 98.3% & Meng et al. 94.17%)

2
Q

Importance

A

Rapid diagnosis & treatment are the best method to prevent severe outcomes & deaths
o No vaccine
o PCR costly & lack of equipment & health personnel
o Lack of health experts in low-resource countries for manual microscopy identification
-> ML = cost-efficient, less expert knowledge necessary (as the machine detects) & less equipment (no PCR, just a microscope & the ML tool)

3
Q

Business Case

A

o Our business case focuses on leveraging machine learning to improve malaria detection, specifically targeting the P. vivax strain. With over 250 million cases and 600,000 deaths reported annually, malaria remains a significant global health challenge affecting 84 countries. By addressing the neglected P. vivax strain and utilizing advanced machine learning techniques, we aim to provide accurate and cost-effective malaria detection solutions
o Target businesses: clinics (in low resource countries), pharmaceutical companies, public healthcare agencies or NGOs, researchers
o Advantages: cost- & expert-efficient, enhanced diagnostic accuracy & research advancement
o Market: 84 countries affected by malaria (developing countries)

4
Q

Why SVM?

A

o Used in the literature for similar tasks -> allows comparison with other results
o Classical machine learning comparison to deep learning
o Ability to handle high-dimensional data using the kernel trick
(= random forest, <-> KNN, Naïve Bayes, linear models (logistic & linear regression))
o Robust against overfitting (by tuning C -> regularization parameter deciding the size of the hyperplane margin; if high -> low training error but lower generalization)
o Effective for binary classification (due to separation by a hyperplane)
o Memory efficient (images with many pixels): only the support vectors for the decision boundary have to be stored (<-> KNN)

5
Q

Why CNN? Why did we use a CNN and not a fully connected network?

A

o A fully connected network needs to flatten the image to a 1D vector, so any spatial information is lost (position and layout of elements in two-dimensional space); we want to identify the infected cell part, where it doesn't matter at what exact location it is
o Solution: convolutional layers -> only partially connected to the previous layer -> reduced number of connections needed
o Ability to learn & identify intricate patterns & structures
o Learn hierarchical representations
o Widely used for image classification

6
Q

Why different CNNs?

A

o One built from scratch
o One adapted from a similar classification task
o Comparing whether a CNN for a similar task will also be good for another purpose (ours: vivax parasite, this dataset) -> is there a need for other, specifically trained models?

7
Q

Why scale the image pixels between 0 & 1 (normalization)?

A

o Equal importance of features: prevents certain features with large values from dominating the learning process (otherwise the model is biased towards features with larger magnitude)
o Enhancing convergence speed (convergence = reaching a stable & optimal state)
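A minimal sketch of this normalization step (the array is a random stand-in for the cell images):

```python
import numpy as np

# Scale 8-bit pixel values (0-255) to the [0, 1] range before feeding
# them to the SVM or CNN. Array shape and values are illustrative.
images = np.random.randint(0, 256, size=(10, 128, 128, 3), dtype=np.uint8)
images_normalized = images.astype("float32") / 255.0
print(images_normalized.min(), images_normalized.max())  # within 0.0 ... 1.0
```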

8
Q

How did you handle the unbalanced dataset?

A

o Undersampling: randomly choosing 5,000 uninfected images
o Oversampling: ADASYN
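A hedged sketch of both strategies with NumPy and imbalanced-learn; the toy feature matrix and class counts are illustrative:

```python
import numpy as np
from imblearn.over_sampling import ADASYN  # assumes imbalanced-learn is installed

rng = np.random.default_rng(42)

# Toy stand-in data: 7,000 "uninfected" (0) and 1,000 "infected" (1) feature vectors
X = rng.normal(size=(8000, 64)).astype("float32")
y = np.array([0] * 7000 + [1] * 1000)

# Undersampling: randomly keep 5,000 uninfected samples plus all infected ones
uninfected_idx = rng.choice(np.where(y == 0)[0], size=5000, replace=False)
keep_idx = np.concatenate([uninfected_idx, np.where(y == 1)[0]])
X_under, y_under = X[keep_idx], y[keep_idx]

# Oversampling: ADASYN synthesizes new minority-class samples
X_over, y_over = ADASYN(random_state=42).fit_resample(X, y)
print(np.bincount(y_under), np.bincount(y_over))
```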

9
Q

What is Data Augmentation?

A

o Increasing the number of training samples by creating new ones through transformations of existing data
o We used: geometric transformations: rotation & flipping (horizontal & vertical)
o Not: scaling or cropping: would lose important details of the cell images -> bounding boxes close to each other
o Used: ImageDataGenerator from TensorFlow
o Only for the CNNs (the SVM serves as baseline: beneficial to keep that model as simple as possible to have a clear foundation for comparison with the more advanced approaches) -> could have tried it with data augmentation as well
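A sketch of the augmentation setup described above with TensorFlow's ImageDataGenerator (the rotation range is an assumed value):

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Rotation plus horizontal/vertical flips, no scaling or cropping.
datagen = ImageDataGenerator(
    rescale=1.0 / 255,     # normalization to [0, 1]
    rotation_range=20,     # random rotations (assumed value)
    horizontal_flip=True,
    vertical_flip=True,
)

# train_images / train_labels are assumed to be NumPy arrays of cell images:
# augmented_batches = datagen.flow(train_images, train_labels, batch_size=32)
```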

10
Q

Why Data Augmentation?

A

o More variations and diversity in the training data -> model generalizes better
o Mitigating overfitting: larger & more diverse dataset -> model learns the characteristics instead of memorizing training samples -> better generalization

11
Q

Why not use Data Augmentation for SVM?

A

o The ImageDataGenerator in TensorFlow cannot easily be integrated into the SVM pipeline
o Baseline model -> should be trained on the "true" images

12
Q

What optimizer?

A

Adam: Adaptive Moment Estimation
- Extension of gradient descent
- Combines two algorithms: computes first-order momentum (moving average of gradients) and second-order momentum (moving average of squared gradients) of the loss function
-> adapts the learning rate for each parameter based on its historical gradients & momentum
- GD with momentum = takes into consideration the 'exponentially weighted average' of the gradients -> averaging = faster convergence
- RMSProp algorithm = takes into consideration the 'exponential moving average' of the squared gradients (average change over time)
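A minimal sketch of a single Adam update step for one parameter vector, following the moment description above (all values are illustrative, not the project's settings):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad       # first moment: moving average of gradients
    v = beta2 * v + (1 - beta2) * grad**2    # second moment: moving average of squared gradients
    m_hat = m / (1 - beta1**t)               # bias correction (see card 13)
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter adaptive update
    return theta, m, v

theta, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
theta, m, v = adam_step(theta, grad=np.array([0.1, -0.2, 0.3]), m=m, v=v, t=1)
print(theta)
```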

13
Q

ADAM - elements

A
  • Adaptive Learning Rate (dynamically adapts it for each parameter in network -> smooth & fast convergence)
  • Momentum (helps accelerate convergence by adding a fraction of the previous gradient’s direction to the current gradient update -> overcome local minima)
  • Bias correction (to address the bias of the moment estimates towards zero during the first training steps)
14
Q

Why did you use Adam?

A

 Used in a similar study
 Pro: efficiency & robustness, converges faster, works well on noisy & sparse data
o Difference to normal stochastic GD
 Learning rate is adaptive (GD: maintains a single learning rate during training)

15
Q
What is momentum?
A

o Technique where a term is added to the parameter updates that accounts for the previous direction of movement
o E.g. a momentum value of 0.9: 90% of the previous direction is retained, and only 10% is influenced by the current gradient

16
Q

What are different ways to choose the learning rate in gradient descent?

A

o Fixed learning rate: constant learning rate is set before training
o Learning rate schedules: adjust learning rate over time
 Step decay: Learning rate is reduced by a certain factor after a fixed number of epochs or iterations
 Exponential decay: The learning rate decreases exponentially over time.
 Performance-Based Schedules: The learning rate is adjusted based on the performance on a validation set
o Adaptive learning rates: dynamically adjust the learning rate based on the behavior of the optimization process.
 !!Adam combines adaptive learning rates with momentum. It adapts the learning rate for each parameter based on both the first-order (gradient) and second-order (gradient squared) moments of the gradients.
o Learning Rate Search: e.g. grid search
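Hedged Keras sketches of some of the schedule types listed above (all numeric values are assumptions):

```python
from tensorflow import keras

# Exponential decay: learning rate shrinks by a fixed factor every decay_steps
exp_schedule = keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.001, decay_steps=1000, decay_rate=0.9)

# Performance-based schedule: reduce the LR when the validation loss plateaus
plateau_cb = keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.5, patience=3)

# Adaptive learning rates + momentum: Adam (can also take a schedule as its LR)
adam = keras.optimizers.Adam(learning_rate=exp_schedule)
```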

17
Q

What does a learning rate of 0.001 mean?

A

o Represents the step size at which a machine learning algorithm updates the model parameters during training = magnitude of the adjustments made to the model based on the calculated gradients
o Hyperparameter
o 0.001 = low learning rate
o Small step sizes: doesn't overshoot the minimum but takes a long time to converge

18
Q

What does binary cross entropy loss mean?

A

o Cost function that measures the difference between the predicted probability distribution and the true probability distribution for a binary classification problem
o Calculates the average of the logarithmic loss for each instance, where a higher loss is assigned to incorrect predictions and a lower loss to correct predictions
- Loss: H_p(q) = -(1/N) * sum_i [ y_i * log(p(y_i)) + (1 - y_i) * log(1 - p(y_i)) ]
- y_i = true outcome (1 or 0)
- p(y_i) = predicted probability of that outcome
- log term: if p = 1 -> -log(p) = 0; the smaller p, the larger the loss
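A tiny worked example of the formula (made-up probabilities):

```python
import numpy as np

y_true = np.array([1, 0, 1, 1])          # true labels
y_pred = np.array([0.9, 0.1, 0.8, 0.4])  # predicted probabilities

# Average of -[y*log(p) + (1-y)*log(1-p)] over the 4 samples
bce = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
print(bce)  # ~0.34; the poorly predicted last sample contributes the largest term
```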

19
Q

What is overfitting?

A

performance high on training data but low on validation set (low bias, but high variance)
o observing the learning curves reveals a pattern of overfitting when the training loss consistently decreases, but the validation loss begins to rise or remains stagnant.
o Goal: small gap between training & validation curve -> generalization good

20
Q

What did you do to prevent overfitting?

A
  • CNN: dropout layer, early stopping
  • SVM: PCA removes noise in the data and keeps only the most important features; fewer dimensions reduce the risk of overfitting
     Grid search & cross-validation
     Look at the result of each fold of cross-validation -> if approximately the same, no overfitting
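A hedged sketch of the CNN-side measures (dropout layer + early stopping callback); layer sizes and patience are illustrative, not the project's exact values:

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Input(shape=(128, 128, 3)),
    keras.layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Flatten(),
    keras.layers.Dense(512, activation="relu"),
    keras.layers.Dropout(0.5),                 # dropout layer against overfitting
    keras.layers.Dense(1, activation="sigmoid"),
])

# Early stopping: halt training once the validation loss stops improving
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)
# model.fit(X_train, y_train, validation_split=0.2, epochs=50, callbacks=[early_stop])
```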
21
Q

How do you choose the number of folds (cross-validation)?

A

o higher number of folds -> more accurate estimate of performance but require more computational resources and time
o Smaller datasets may benefit from a higher number of folds, while larger datasets may be adequately assessed with fewer folds
o We used 3 folds (cv=3) to reduce the CPU time needed (we could have used 5)
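A minimal sketch of 3-fold cross-validation with scikit-learn; the toy data stands in for the PCA-reduced cell images, and the SVM settings reuse the values from card 31:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=50, random_state=42)
scores = cross_val_score(SVC(kernel="rbf", C=100, gamma=0.0001), X, y, cv=3)
print(scores)  # similar scores across the 3 folds suggest no strong overfitting
```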

22
Q

What is GPU?

A

o GPU stands for Graphics Processing Unit
- is a specialized electronic circuit designed to quickly process and render graphics
- widely used for accelerating computations in areas such as machine learning

23
Q

How did you tune hyperparameters in the CNNs? And which ones?

A

o With babysitting (manual tuning while observing training)
o number of hidden layers, units per layer, learning rate, epochs, batch size, early stop patience

24
Q

What are epochs?

A

o Number of times entire training dataset is passed through NN during training process
o Each epoch consists of 1 forward pass & 1 backward pass (update of weights) for all training examples
o Result: NN can learn & refine parameters gradually

25
Q

What is the activation function ReLU and why does it work well for images?

A

o Replaces all negative values (pixels) with 0, keeps all positive values the same
o ReLUs are easy to compute, computationally efficient; gradients are cleanly defined and constant (except for the piecewise non-linearity at 0)
o Introduces non-linearity into the network, allowing it to learn complex patterns and gradients
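A tiny numerical illustration of the ReLU behaviour described above:

```python
import numpy as np

# ReLU: negative values become 0, positive values pass through unchanged
x = np.array([-3.0, -0.5, 0.0, 2.0, 7.0])
relu = np.maximum(0.0, x)
print(relu)  # [0. 0. 0. 2. 7.]
```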
26
Q

Why PCA only for the SVM?

A

o CNNs, as deep learning models, learn weights and features (feature extraction is built in)
o The SVM, as classical machine learning, only learns weights -> PCA handles the dimensionality reduction / feature extraction beforehand
27
Q

What are kernel functions? (SVM)

A

o Used to classify non-linearly separable data
o Transform the images into a high-dimensional space and then find the optimal decision boundaries in this new high-dimensional space
o Functions include: linear, polynomial, radial basis function
28
Q

Which kernel function did we use & why?

A

o Radial basis function (RBF): combines polynomial kernels of different degrees to project the non-linearly separable data into a higher-dimensional space
o Selected via grid search
29
Q

Why grid search to select hyperparameters?

A

* Random search & trial-and-error: need more time, are random and thus expected to provide worse hyperparameters, not a structured search
* PCA -> fewer dimensions -> grid search computationally feasible
30
Q

What are relevant hyperparameters of SVM?

A

 C = controls the balance between maximizing the margin & minimizing the training error
  * Small -> wide "street" separating the 2 classes, but data points might lie within the margin -> larger training error but better generalization
  * Large -> small "street" -> lower training error but lower generalization
 Gamma = influences how the decision boundary adapts to individual data points = defines how many data points are considered for the hyperplane
  * Small -> considers many data points
  * Large -> considers few data points
31
Q

What were your optimal hyperparameters? What did they tell you? (SVM)

A

 C = 100
  * Relatively high -> small street, lower training error, but lower generalization
 gamma = 0.0001
  * Relatively low -> decision boundary less adjusted to individual data points, considers a broader range of training instances in determining the decision boundary -> more generalizable
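A hedged sketch of how such a grid search could look with scikit-learn (toy data and grid values are illustrative; only the reported optimum C=100, gamma=0.0001 comes from the card above):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# PCA for dimensionality reduction, then a grid search over C and gamma
# for an RBF-kernel SVM with 3-fold cross-validation.
X, y = make_classification(n_samples=300, n_features=200, random_state=42)
pipeline = make_pipeline(PCA(n_components=100), SVC(kernel="rbf"))
param_grid = {"svc__C": [1, 10, 100], "svc__gamma": [0.0001, 0.001, 0.01]}
grid = GridSearchCV(pipeline, param_grid, cv=3)
grid.fit(X, y)
print(grid.best_params_)  # the project reported C=100, gamma=0.0001 on its data
```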
32
Q

Accuracy

A

o Accuracy: measures the overall correctness of the model; ratio of correctly predicted instances to the total number of instances
33
Q

Precision

A

o Precision: measures the model's ability to predict positive instances correctly; ratio of true positives to all predicted positives -> reflects the rate of false positives
34
Q

Recall

A

o Recall: ratio of true positives to the sum of true positives and false negatives -> reflects the rate of false negatives
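A tiny illustration of the three metrics from cards 32-34 with made-up labels:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# accuracy = (TP + TN) / total, precision = TP / (TP + FP), recall = TP / (TP + FN)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 1, 1]
print(accuracy_score(y_true, y_pred))   # 5/8 = 0.625
print(precision_score(y_true, y_pred))  # 3/5 = 0.6
print(recall_score(y_true, y_pred))     # 3/4 = 0.75
```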
35
Q

Which CNN models did you choose? And what are the differences?

A

o CNN1: self-developed model <-> CNN2: adapted from a similar paper
36
Q

Differences between the CNN models

A

o CNN1:
 padding (because of cropping)
 more layers (5 conv., 5 max pooling, 2 dropout, 2 dense)
 dropout
 smaller kernel size (important for malaria -> consider local features, don't miss information)
 max pooling (<-> average) (better because max focuses on the most important / striking features in the surrounding region)
 2 (<-> 3) dense layers
 dense layers with 512 units (<-> 256): more information -> captures more complex patterns & nuances in the input (also due to the larger image size)
 dense layers have an activation function (introduction of non-linearity)
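A rough Keras sketch of an architecture with the CNN1 properties listed above; filter counts, dropout rates and the exact dense-layer arrangement are assumptions for illustration, not the project's configuration:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_cnn1_sketch(input_shape=(128, 128, 3)):
    model = keras.Sequential([keras.layers.Input(shape=input_shape)])
    # 5 conv / max-pooling blocks with small (3,3) kernels and 'same' padding
    for filters in (32, 32, 64, 64, 128):
        model.add(layers.Conv2D(filters, (3, 3), padding="same", activation="relu"))
        model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Flatten())
    # Dense layers with 512 units, ReLU activation and dropout
    model.add(layers.Dense(512, activation="relu"))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(512, activation="relu"))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(1, activation="sigmoid"))  # binary output: infected or not
    return model

model = build_cnn1_sketch()
model.summary()
```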
37
Q

Potential explanations based on the differences from the CNN2 paper

A

o Other parasite
o Different images (background etc.)
o No cropping
o Smaller images (44x44; we have 128x128 -> can have more pooling) -> using their architecture will not work as well (still large feature maps with few pooling layers)
38
Q

Which kernel size should you use?

A

o In our case: used a small one (3,3) -> because the malaria-infected part can be small
o Also don't want to risk missing information or patterns that indicate malaria
39
Q

Why always ReLU?

A

o Introduces non-linearity into the network, allowing it to learn complex patterns and gradients
o Computationally efficient
o Avoids vanishing gradients
o Images typically consist of pixel intensities ranging from 0 to 255 -> black = 0, no negative values needed
o Non-linearity: for negative inputs (x < 0) -> output 0; this non-linear behavior introduces an element of discontinuity, breaking the linearity of the network
o For positive inputs: still linear -> can capture both non-linearity & linearity
40
Q

What is padding? Why used?

A

o Zero borders around the image pixels; there is information at the borders (we had bounding boxes) -> wanted to classify these well
41
Q

Why did CNN2 use average and max pooling, and which one is better?

A

- Max pooling is better for image classification
- Features tend to encode the spatial presence of some pattern or concept over the different tiles of the feature map (hence the term feature map)
- More informative to look at the maximal presence of different features than at their average presence
- Keeps the full contrast (averaging would dilute the strongest activations)
42
Q

Why not KNN?

A

- Does not scale well: as the dataset grows, KNN becomes increasingly inefficient, compromising overall model performance (scaling problem)
- Prone to overfitting
- Curse of dimensionality: the volume of the space grows exponentially with the number of dimensions; need more points to 'fill' a high-dimensional volume
43
Q

Why not Random Forest?

A

- RF better suited for multi-class classification
- Handles high-dimensional data less well compared to SVM
- Research showed that RF performs better with 10 to 100 features, SVM with > 100 features (paper comparing which studies used RF or SVM & how they performed)
- Could be a good option as well (but the literature said that SVM outperformed RF for malaria detection)
44
Q

Logistic regression

A

- Linear model, which is why it can't really work here
- Risk of overfitting with many independent variables
- Limited in capturing complexity
45
Q

In the CNN, why use dropout and early stopping?

A

They address different aspects of overfitting
* Early stopping: controls the capacity of the model by stopping training before it starts overfitting
* Dropout: introduces randomness during training, preventing the network from relying too heavily on any specific set of features or neurons