Week 11: Mechanistic Interpretability & Explainability Flashcards
(6 cards)
What is the difference between interpretability and explainability in machine learning?
Interpretability = understanding how a model works internally (its mechanisms and learned representations)
Explainability = providing human-understandable explanations for specific predictions, often via post-hoc methods
Why is interpretability important for deep learning models? Give at least 3 reasons.
Safety & Bias Detection: ML models can make unexpected decisions that cause harm, exhibit biases, or hide subtle safety issues
Shortcut Learning: Neural networks often take shortcuts that don’t generalize as expected (e.g., using background to recognize objects instead of the object itself)
Trust & Verification: We need to verify that the model learns and behaves as intended, especially in high-stakes applications such as healthcare or autonomous driving
Regulatory Requirements: Many domains require explainable AI for legal/ethical compliance
What are the three main categories of methods for explainable machine learning?
Inherently Interpretable Models: The model IS its explanation (e.g., linear regression, decision trees); see the sketch after this list
Semi-Inherently Interpretable Models: Example-based methods (e.g., k-nearest neighbors)
Complex Models: Any explanation gives only a partial view of the model (e.g., deep neural networks), so post-hoc explanation methods are required
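A minimal sketch of an "inherently interpretable" model, assuming scikit-learn is available: the fitted coefficients of a linear regression are themselves the explanation. The feature names and data are made up for illustration.

```python
# Inherently interpretable model: the fitted coefficients ARE the explanation.
# Feature names and data are illustrative; assumes scikit-learn and NumPy.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                     # columns: [age, dose, weight]
y = 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)
for name, coef in zip(["age", "dose", "weight"], model.coef_):
    print(f"{name}: {coef:+.2f}")                 # effect on y per unit increase
```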
What are the main approaches for visualizing and understanding neural network weights?
Direct Weight Visualization: Hinton diagrams showing connection strengths (a weight-image sketch follows this card)
Contextualized Weights: Visualize the meaning of weights in the context of the layer's input/output
Difficulties: lack of contextualization, indirect interactions between weights, and dimensionality/scale issues
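A minimal sketch of direct weight visualization, assuming a first-layer weight matrix from an image model (here a random stand-in so the snippet runs on its own); each hidden unit's incoming weights are reshaped back to the input image geometry.

```python
# Direct weight visualization sketch; W is a random stand-in for a trained
# first-layer weight matrix of shape (n_hidden, 28*28), e.g. from an MNIST model.
import numpy as np
import matplotlib.pyplot as plt

W = np.random.default_rng(0).normal(size=(16, 784))

fig, axes = plt.subplots(4, 4, figsize=(6, 6))
for ax, w in zip(axes.ravel(), W):
    ax.imshow(w.reshape(28, 28), cmap="bwr")   # red = positive, blue = negative weight
    ax.axis("off")
fig.suptitle("First-layer weights, one panel per hidden unit")
plt.tight_layout()
plt.show()
```

With real trained weights, such panels often show stroke- or edge-like templates, which illustrates the contextualization problem above: the picture alone does not say how later layers use these weights.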
What is Grad-CAM and how does it work?
Gradient-weighted Class Activation Mapping (Grad-CAM):
Purpose: attribute a classification decision to local input regions as a saliency heatmap
Process (sketched in code after this card):
Compute gradients of the target class score with respect to the feature maps of a late convolutional layer
Average the gradients spatially to obtain an importance weight for each feature map
Form the importance-weighted sum of feature maps at each location, keeping positive contributions (ReLU)
Resample the contribution map to the original input resolution
Superimpose it on the original image
Applications: Image classification, captioning, visual question answering
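A minimal Grad-CAM sketch following the steps above, assuming PyTorch and torchvision are installed; the choice of ResNet-18 and its layer4 block as the target convolutional layer is illustrative, not prescribed by the card.

```python
# Grad-CAM sketch (illustrative; assumes torch + torchvision, pretrained weights download).
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()
target_layer = model.layer4[-1]                      # last conv block (spatial feature maps)

acts, grads = {}, {}
target_layer.register_forward_hook(lambda m, i, o: acts.update(maps=o.detach()))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(maps=go[0].detach()))

def grad_cam(image, class_idx=None):
    """image: normalized (1, 3, H, W) tensor -> (H, W) heatmap in [0, 1]."""
    scores = model(image)                            # forward pass stores activations
    if class_idx is None:
        class_idx = scores.argmax(dim=1).item()
    model.zero_grad()
    scores[0, class_idx].backward()                  # backward pass stores gradients

    # 1) importance of each feature map = spatially averaged gradient
    weights = grads["maps"].mean(dim=(2, 3), keepdim=True)        # (1, C, 1, 1)
    # 2) importance-weighted sum of feature maps, positive evidence only
    cam = F.relu((weights * acts["maps"]).sum(dim=1, keepdim=True))
    # 3) resample to the input resolution and normalize for overlay
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam[0, 0]                                 # superimpose this on the image

heatmap = grad_cam(torch.randn(1, 3, 224, 224))      # random input just to show usage
```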
What do SHAP values explain and how are they derived?
SHAP (SHapley Additive exPlanations):
Explains: How each feature affects the final prediction compared to a baseline
Properties:
Additive in nature
Show significance of each feature compared to others
Reveal model reliance on feature interactions
Key Insight: SHAP values sum to the difference between the current model output and the baseline (expected) output, i.e. f(x) - E[f(X)] (verified numerically in the sketch after this card)
Applications: Works for any model type (linear regression, deep networks, etc.)
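A minimal brute-force check of the additivity property, assuming a toy 3-feature linear model and mean-marginalization over a background dataset. All names and numbers are illustrative; in practice the shap library computes these values far more efficiently.

```python
# Exact Shapley values by enumerating feature coalitions, then checking that
# they sum to f(x) - E[f(X)]. Toy linear model; illustrative only.
from itertools import combinations
from math import factorial
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                  # background data defining the baseline
w = np.array([2.0, -1.0, 0.5])

def model(data):
    return data @ w + 3.0                      # f(x) = w.x + b

x = np.array([1.0, 2.0, -1.0])                 # instance to explain
baseline = model(X).mean()                     # E[f(X)]

def value(subset):
    """Expected model output when the features in `subset` are fixed to x."""
    data = X.copy()
    data[:, list(subset)] = x[list(subset)]
    return model(data).mean()

n = len(x)
phi = np.zeros(n)
for i in range(n):
    others = [j for j in range(n) if j != i]
    for size in range(n):
        for S in combinations(others, size):
            weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            phi[i] += weight * (value(S + (i,)) - value(S))

print(phi)                                     # per-feature contributions
print(phi.sum(), model(x[None])[0] - baseline) # equal: the additivity property
```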