Week 11: Mechanistic Interpretability & Explainability Flashcards
(6 cards)
What is the difference between interpretability and explainability in machine learning?
Interpretability = understanding how a model works internally (its mechanisms and learned representations)
Explainability = providing human-understandable explanations for specific predictions, often via post-hoc methods
Why is interpretability important for deep learning models? Give at least 3 reasons.
Safety & Bias Detection: ML models can make unexpected decisions that cause harm, exhibit biases, or hide subtle safety issues
Shortcut Learning: Neural networks often take shortcuts that don’t generalize as expected (e.g., using background to recognize objects instead of the object itself)
Trust & Verification: We need to verify that the model learns and behaves as intended, especially in high-stakes applications such as healthcare or autonomous driving
Regulatory Requirements: Many domains require explainable AI for legal/ethical compliance
What are the three main categories of methods for explainable machine learning?
Inherently Interpretable Models: The model IS its explanation (e.g., linear regression, decision trees); see the sketch after this list
Semi-Inherently Interpretable Models: Example-based methods (e.g., k-nearest neighbors)
Complex Models: Any explanation gives only a partial view of the model (e.g., deep neural networks), so post-hoc explanation methods are required
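A minimal sketch of an "inherently interpretable" model, assuming scikit-learn is available: the fitted coefficients of a linear regression are themselves the explanation. The feature names and data are made up for illustration.

```python
# Inherently interpretable model: the fitted coefficients ARE the explanation.
# Feature names and data are illustrative; assumes scikit-learn and NumPy.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                     # columns: [age, dose, weight]
y = 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)
for name, coef in zip(["age", "dose", "weight"], model.coef_):
    print(f"{name}: {coef:+.2f}")                 # effect on y per unit increase
```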
What are the main approaches for visualizing and understanding neural network weights?
Direct Weight Visualization: Hinton diagrams showing connection strengths (a weight-image sketch follows this card)
Contextualized Weights: Visualize the meaning of weights in the context of the layer's input/output
Difficulties: lack of contextualization, indirect interactions between weights, and dimensionality/scale issues
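A minimal sketch of direct weight visualization, assuming a first-layer weight matrix from an image model (here a random stand-in so the snippet runs on its own); each hidden unit's incoming weights are reshaped back to the input image geometry.

```python
# Direct weight visualization sketch; W is a random stand-in for a trained
# first-layer weight matrix of shape (n_hidden, 28*28), e.g. from an MNIST model.
import numpy as np
import matplotlib.pyplot as plt

W = np.random.default_rng(0).normal(size=(16, 784))

fig, axes = plt.subplots(4, 4, figsize=(6, 6))
for ax, w in zip(axes.ravel(), W):
    ax.imshow(w.reshape(28, 28), cmap="bwr")   # red = positive, blue = negative weight
    ax.axis("off")
fig.suptitle("First-layer weights, one panel per hidden unit")
plt.tight_layout()
plt.show()
```

With real trained weights, such panels often show stroke- or edge-like templates, which illustrates the contextualization problem above: the picture alone does not say how later layers use these weights.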
What is Grad-CAM and how does it work?
Gradient-weighted Class Activation Mapping (Grad-CAM):
Purpose: attribute a classification decision to local input regions as a saliency heatmap
Process (sketched in code after this card):
Compute gradients of the target class score with respect to the feature maps of a late convolutional layer
Average the gradients spatially to obtain an importance weight for each feature map
Form the importance-weighted sum of feature maps at each location, keeping positive contributions (ReLU)
Resample the contribution map to the original input resolution
Superimpose it on the original image
Applications: Image classification, captioning, visual question answering
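A minimal Grad-CAM sketch following the steps above, assuming PyTorch and torchvision are installed; the choice of ResNet-18 and its layer4 block as the target convolutional layer is illustrative, not prescribed by the card.

```python
# Grad-CAM sketch (illustrative; assumes torch + torchvision, pretrained weights download).
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()
target_layer = model.layer4[-1]                      # last conv block (spatial feature maps)

acts, grads = {}, {}
target_layer.register_forward_hook(lambda m, i, o: acts.update(maps=o.detach()))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(maps=go[0].detach()))

def grad_cam(image, class_idx=None):
    """image: normalized (1, 3, H, W) tensor -> (H, W) heatmap in [0, 1]."""
    scores = model(image)                            # forward pass stores activations
    if class_idx is None:
        class_idx = scores.argmax(dim=1).item()
    model.zero_grad()
    scores[0, class_idx].backward()                  # backward pass stores gradients

    # 1) importance of each feature map = spatially averaged gradient
    weights = grads["maps"].mean(dim=(2, 3), keepdim=True)        # (1, C, 1, 1)
    # 2) importance-weighted sum of feature maps, positive evidence only
    cam = F.relu((weights * acts["maps"]).sum(dim=1, keepdim=True))
    # 3) resample to the input resolution and normalize for overlay
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam[0, 0]                                 # superimpose this on the image

heatmap = grad_cam(torch.randn(1, 3, 224, 224))      # random input just to show usage
```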
What do SHAP values explain and how are they derived?
SHAP (SHapley Additive exPlanations):
Explains: How each feature affects the final prediction compared to a baseline
Properties:
Additive in nature
Show significance of each feature compared to others
Reveal model reliance on feature interactions
Key Insight: SHAP values sum to the difference between the current model output and the baseline (expected) output, i.e. f(x) - E[f(X)] (verified numerically in the sketch after this card)
Applications: Works for any model type (linear regression, deep networks, etc.)
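A minimal brute-force check of the additivity property, assuming a toy 3-feature linear model and mean-marginalization over a background dataset. All names and numbers are illustrative; in practice the shap library computes these values far more efficiently.

```python
# Exact Shapley values by enumerating feature coalitions, then checking that
# they sum to f(x) - E[f(X)]. Toy linear model; illustrative only.
from itertools import combinations
from math import factorial
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                  # background data defining the baseline
w = np.array([2.0, -1.0, 0.5])

def model(data):
    return data @ w + 3.0                      # f(x) = w.x + b

x = np.array([1.0, 2.0, -1.0])                 # instance to explain
baseline = model(X).mean()                     # E[f(X)]

def value(subset):
    """Expected model output when the features in `subset` are fixed to x."""
    data = X.copy()
    data[:, list(subset)] = x[list(subset)]
    return model(data).mean()

n = len(x)
phi = np.zeros(n)
for i in range(n):
    others = [j for j in range(n) if j != i]
    for size in range(n):
        for S in combinations(others, size):
            weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            phi[i] += weight * (value(S + (i,)) - value(S))

print(phi)                                     # per-feature contributions
print(phi.sum(), model(x[None])[0] - baseline) # equal: the additivity property
```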