PROJECT Flashcards
(15 cards)
Why did you choose ResNet-50 and VGG16 to solve your distracted driver detection problem?
ResNet-50: Chosen for its depth and skip connections, which help with efficient training and avoiding vanishing gradients. It also performed well on ImageNet, making it suitable for transfer learning in image tasks.
VGG16: Though older and more parameter-heavy, it had strong benchmark performance on similar Kaggle tasks (based on literature review) and provided a good comparison point.
What were the results from your project?
- VGG16 performed better than ResNet50, which was also what we saw on Kaggle.
- Augmentation did not improve the models.
- All of the models seem to be overfitted on the training data, but the scores on the test set are still acceptable.
- We did not reach competition performance, BUT we also don't have data leakage among drivers, which was probably the reason the other groups reached such high performance.
Model - Accuracy - Precision - Recall - F1-Score
ResNet-50 (no augmentation) - 0.8203 - 0.8355 - 0.8203 - 0.8180
ResNet-50 (with augmentation) - 0.7301 - 0.8201 - 0.7301 - 0.7346
VGG16 (no augmentation) - 0.8548 - 0.8562 - 0.8548 - 0.8498
VGG16 (with augmentation) - 0.7550 - 0.8303 - 0.7550 - 0.7687
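The write-up does not state how precision, recall and F1 were aggregated over the 10 classes; below is a minimal sketch, assuming weighted averaging with scikit-learn (which makes weighted recall coincide with accuracy, as in the table). The `summarise` helper is hypothetical.

```python
# Hypothetical metric computation for the table above; assumes y_true and
# y_pred are integer class ids for the driver-held-out test set.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def summarise(y_true, y_pred):
    accuracy = accuracy_score(y_true, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted", zero_division=0
    )
    return {"Accuracy": accuracy, "Precision": precision, "Recall": recall, "F1-Score": f1}
```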
What is the intuition for the success or failure of your model for a specific class in your task?
The model struggled with class c8 (Hair and Makeup). This class contains subtle cues (like hand near face or cosmetics), which are easily confused with c9 (Talking to passenger).
Models like ResNet may not capture these fine-grained distinctions if they aren’t well represented in the training set.
Please describe the evaluation approaches that you have chosen to identify good hyperparameters.
We did not have time to do a proper hyperparameter optimisation, but with more time we would have done a grid search. Most of our choices were made through trial and error on the smaller dataset, using validation performance as our monitoring signal.
We changed the dropout probability from 20% to 40% to reduce overfitting; this value was also found by trial and error.
The reason we chose to leave 5 layers trainable in both models is consistency across them. The code we took inspiration from had only the last layers trainable for VGG16 and the last 3 for ResNet-50, but we thought that for easier comparison we would set it to 5 for both. Some relation between the two values should probably have been taken into account, as ResNet-50 is much deeper than VGG16.
Our optimiser, Adam, is one of the most popular optimisation algorithms in deep learning; it is robust and efficient. We used a small learning rate to get more stable training.
We halved the learning rate when the validation loss plateaued; this was done to get out of local minima we were stuck in.
We also used early stopping to avoid overfitting during training, with a patience of 5: the model gets 5 epochs to improve further, and if it can't, we stop.
We used Adam(learning_rate=0.001), the default LR for Adam, which balances speed and stability.
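A minimal sketch of this training configuration in Keras; the ReduceLROnPlateau patience and the commented compile/fit arguments are assumptions, not the project's exact code.

```python
# Adam at the Keras default LR, halve the LR when the validation loss
# plateaus, and stop early after 5 epochs without improvement.
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping

optimizer = Adam(learning_rate=0.001)  # default LR: balances speed and stability

callbacks = [
    ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=2),             # halve LR on plateau
    EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True),  # stop after 5 stagnant epochs
]

# model.compile(optimizer=optimizer, loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_data, validation_data=val_data, epochs=50, callbacks=callbacks)
```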
Can you walk me through your entire pipeline, from data loading to model evaluation, and explain key decisions along the way?
Training (and preprocessing) pipeline:
1. We define the paths for our dataset splitting, then we split the data into train, val and test. We make sure to split on the driver, so that driver/subject 1, for example, ends up in only one of train and test, never both (see the split sketch after this pipeline).
2. Then we preprocess the data by normalising all of it and augmenting only the training data.
3. We train two ResNet50 models, one with only normalised data, one with only augmented (and normalised) data. The same for VGG16.
4. We evaluate with different metrics (precision, recall, f1-score and accuracy)
Conclusion: augmenting the data did not improve either of the models; both performed best on just normalised data.
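A minimal sketch of the driver-aware split from step 1, assuming a metadata table with a "subject" (driver id) column as in Kaggle's driver_imgs_list.csv; the split sizes and random seed are illustrative.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

meta = pd.read_csv("driver_imgs_list.csv")  # columns: subject, classname, img

# Hold out ~20% of drivers for testing, then split the rest into train/val,
# always grouping by driver so no subject appears in two splits.
outer = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
trainval_idx, test_idx = next(outer.split(meta, groups=meta["subject"]))
trainval, test = meta.iloc[trainval_idx], meta.iloc[test_idx]

inner = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, val_idx = next(inner.split(trainval, groups=trainval["subject"]))
train, val = trainval.iloc[train_idx], trainval.iloc[val_idx]

assert set(train["subject"]).isdisjoint(test["subject"])  # no driver leaks across splits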
Analysis pipeline:
1. We used a confusion matrix to see which classes the models had a difficult time classifying.
2. Then we used Grad-CAM to see where (for classes c8 and c9) the models struggled to capture the important features.
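Not the project's exact code, but a minimal Grad-CAM sketch in Keras showing what this analysis step does (gradient-weighted conv feature maps); the model and layer names in the usage line are assumptions.

```python
import tensorflow as tf

def grad_cam(model, image, last_conv_layer_name, class_index):
    """Heatmap of the regions that push `model` towards `class_index` for one image."""
    grad_model = tf.keras.Model(
        model.inputs,
        [model.get_layer(last_conv_layer_name).output, model.output],
    )
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[None, ...])  # add a batch dimension
        score = preds[:, class_index]
    grads = tape.gradient(score, conv_out)              # d(score) / d(feature maps)
    weights = tf.reduce_mean(grads, axis=(1, 2))        # global-average-pool the gradients
    cam = tf.reduce_sum(weights[:, None, None, :] * conv_out, axis=-1)[0]
    cam = tf.nn.relu(cam)                               # keep only positive evidence
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()  # normalise to [0, 1]

# e.g. heatmap = grad_cam(vgg16_model, img, "block5_conv3", class_index=8)  # c8: hair and makeup
```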
What are the strengths and weaknesses of your driver-aware data splitting strategy? Could it introduce any unintended biases?
The strength: no data leakage and a more robust evaluation.
The weakness: lower accuracy than what we saw on Kaggle.
Unintended bias: if there is less gender or skin-tone diversity in the training set, the models will have a hard time generalising to the test drivers.
Why did you choose to use transfer learning instead of training a model from scratch, and what are the implications of that choice in your context?
We decided to use only transfer learning, as we deemed there was not enough data for both pre-training and fine-tuning. Maybe we could have used another driver dataset for pre-training, but since we already had well-defined weights from the pre-training on ImageNet, that would have been a waste of resources.
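A minimal sketch of this transfer-learning setup for ResNet-50 (the VGG16 version is analogous), assuming Keras applications with ImageNet weights; the head layers are illustrative, not verbatim project code.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50

base = ResNet50(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
for layer in base.layers[:-5]:     # freeze everything except the last 5 base layers
    layer.trainable = False

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dropout(0.4),                      # raised from 0.2 to fight overfitting
    layers.Dense(10, activation="softmax"),   # the 10 distraction classes c0-c9
])
```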
Why did you try to use YOLO?
We saw that YOLO was used on Kaggle, but we did not have time to train it from scratch ourselves, so we opted for Grad-CAM to capture the saliency maps.
What augmentation techniques did you use, and why are they important for your dataset? Were any of them harmful or neutral?
We probably did not use enough (or aggressive enough) augmentation. We also tried something that was too aggressive, but it did not work as we thought it would, so we settled on techniques that did not change the images that much. We did not find the sweet spot of teaching the model something new while still helping performance.
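For illustration, a mild configuration of the kind we mean, assuming Keras' ImageDataGenerator; the exact ranges are assumptions, not the values we used, and they are applied to training images only.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_gen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=10,            # small rotations: the dashboard camera barely moves
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1,
    brightness_range=(0.8, 1.2),  # lighting varies more than geometry does
)
eval_gen = ImageDataGenerator(rescale=1.0 / 255)  # validation/test: normalisation only
```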
How did you determine which model (ResNet-50 vs. VGG16) was better? Which evaluation metrics did you rely on, and why?
We made sure to rely on several metrics, as well as on the other Kaggle participants.
There we saw better performance for VGG16, and in our project we saw the same: VGG16 performed slightly better than ResNet-50. The reason could be that ResNet-50 is too deep for this dataset while VGG16 has more parameters to train, but keep in mind that we did not do any statistical significance tests (and we did not see any on Kaggle either).
Explain how ResNet’s skip connections work. Why are they useful in deep networks? What would happen if we removed them?
Skip (residual) connections add a block's input directly to its output, so each block only needs to learn a residual on top of the identity. This gives gradients a direct path through the network and avoids the vanishing-gradient and degradation problems of very deep "plain" networks. If we removed them, a network as deep as ResNet-50 would become much harder to train and would likely perform worse than a shallower one.
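To make the idea concrete, a minimal residual block sketch in Keras: the layers learn a residual F(x) and the input is added back at the end, giving gradients an identity path. The sketch assumes the input already has `filters` channels; otherwise a 1x1 convolution would project the shortcut.

```python
from tensorflow.keras import layers

def residual_block(x, filters):
    shortcut = x                                                        # the skip connection
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.Add()([shortcut, y])                                     # output = F(x) + x
    return layers.Activation("relu")(y)
```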
What role does regularization play in your models? Can you explain how dropout or weight decay affects training in deep nets like VGG16?
Regularization helps prevent overfitting by discouraging the model from memorizing the training data too precisely. It promotes generalization, which means better performance on unseen (validation/test) data.
We used dropout, which forces the network not to rely too heavily on any single neuron. It encourages the network to develop redundant, robust representations and helps prevent co-adaptation, where neurons learn to depend on each other. This was especially useful for VGG16, which has many parameters.
We could have added a kernel_regularizer in the dense layers if we had thought of it.
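For illustration, a sketch of a dense head combining dropout with the kernel_regularizer (L2 weight decay) mentioned above; the 1e-4 factor and layer sizes are assumptions.

```python
from tensorflow.keras import layers, regularizers

head = [
    layers.Dense(256, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),  # weight decay: penalises large weights
    layers.Dropout(0.4),                                     # randomly drops 40% of activations
    layers.Dense(10, activation="softmax"),
]
```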
How does transfer learning help in situations with limited labelled data like yours? Could fine-tuning the entire model have helped more than freezing most of the layers? Why or why not?
The dataset is not big enough to train deep models like the ones we used from scratch.
If we had made all the layers trainable during fine-tuning, the model would most likely have overfitted again.
What other model could you have chosen?
Vision Transformers (ViT)
Why use it? Transformers for images, capable of global attention mechanisms.
Tradeoff: Needs more data or strong regularization techniques to perform well.
Good for: Research-focused comparison or if your dataset is large or diverse.
A Vision Transformer is a model that applies the transformer architecture to image patches, using self-attention to learn global image features, unlike CNNs, which focus on local patterns.
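A toy sketch of that idea in Keras: split the image into patches, project them to embeddings, and let self-attention mix information across the whole image. All sizes are illustrative; this is not a model we trained.

```python
import tensorflow as tf
from tensorflow.keras import layers

patch_size, embed_dim = 16, 64
image = tf.random.uniform((1, 224, 224, 3))          # dummy input image

patches = tf.image.extract_patches(
    image,
    sizes=[1, patch_size, patch_size, 1],
    strides=[1, patch_size, patch_size, 1],
    rates=[1, 1, 1, 1],
    padding="VALID",
)                                                    # shape (1, 14, 14, 16*16*3)
tokens = tf.reshape(patches, (1, -1, patch_size * patch_size * 3))  # 196 patch vectors

tokens = layers.Dense(embed_dim)(tokens)                             # linear patch embedding
attended = layers.MultiHeadAttention(num_heads=4, key_dim=embed_dim)(tokens, tokens)
```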
Why did it overfit so much?
Because we probably did too little augmentation, and maybe we should just have frozen all but one layer.