Topic 11: Mechanistic Interpretability Flashcards

(23 cards)

1
Q

What is interpretability and explainability?

A

Interpretability: being able to determine cause and effect in an ML model; we want to know what the ML approach actually learned.

Within that we have explainability: knowing what a node represents and how important it is to the model's performance.

2
Q

How do neural networks take shortcuts?

A

This comes from gradient descent: most training schemes penalise things that do not really contribute to reducing the loss. Gradient descent “follows the path of least resistance.” If there is a cheap shortcut that gets the job done (like texture, background, or the last sentence), the model will take it, unless you design your data or architecture to avoid it.

Task: Image Captioning
Problem: Describes a green hillside as “grazing sheep” — even when there are no sheep.
Shortcut: Model relies on background (green fields) as a cue, not the actual object (sheep).

Task: Object Recognition
Problem: Hallucinates objects (e.g. a teapot) in noise patterns that look random to humans.
Shortcut: Uses non-human-recognizable features, likely statistical artifacts, not shape or edges.

Task: Question Answering
Problem: Changes the answer if irrelevant text is added.
Shortcut: Focuses only on the last sentence, ignoring broader context and logical reasoning.

Models often optimise for what works on training data, not what’s semantically correct. This leads to brittle generalisation, vulnerability to adversarial attacks, and trust issues.

3
Q

What is supervision collapse?

A

Supervision collapse exposes how models can look smart but generalise poorly.

Forcing the network to map two different images very close to each other can end with the network learning to cheat and mapping EVERY input to one and the same representation. This happens because the network finds a shortcut for minimising the loss: it gets zero loss by collapsing everything onto a single representation.

4
Q

Describe the interpretability framework

A

We have some goals:
- What blindspots can we discover
- What novel insights can we gain
- Why the model makes a particular decision for a certain task
- How can we make the model behave more like a human, for instance by involving a human in different stages of the standard ML pipeline

For the human involvement, we could add:
- a social factor: the context, the end task and so on
- a technical factor: the methods, metrics, properties and so on

We might have a feedback loop between the development of the model and the inference.

4
Q

What are some potential harms from large language/vision models?

A

Sometimes they can learn things we don't want, such as:
- Membership inference: the model memorises specific parts of the data instead of generalising. This matters because the data is often semi-private and can contain things such as addresses and emails.
- Learning the prejudices and biases of the human beings who produced the online data.

4
Q

What are the challenges in open-ended tasks?

A

Data contamination: happens when LLMs are tested on examples they’ve already seen during training, either exactly or in slightly modified form. Because training and testing datasets often come from the same internet-scale sources, overlap is hard to avoid.

Overfitting: we reach human-aligned performance on the benchmarks too quickly. One solution is DynaBench, a framework that controls the number of times one can see the test set and constantly changes the inputs.

Monoculture of NLP benchmarking: most benchmarks are created in English. Multilingual models and benchmarks exist and should be used instead, but we are far from monitoring these properly with current metrics.

Biases: metrics themselves can be biased, e.g. overlap-based metrics that only reward certain characteristics (not very flexible). LLM-based evaluations can also be biased, e.g. reflecting the preferences of a small group of people or the biases of the annotators.

4
Q

What do you expect a CNN to represent in its layers?

A

We visualise the features: edges, textures, patterns, parts, and finally whole objects of the input.

Early layers: CNNs typically capture low-level features such as edges, textures, and simple patterns. These layers operate almost like edge detectors or filters sensitive to orientation and color gradients.

Middle layers: the network begins to recognise more complex combinations of these low-level features, such as corners, contours, and small object parts like eyes, wheels, or textures of fur and metal. These representations are more semantically meaningful but still not object-specific.

Deepest layers: the CNN forms high-level, abstract representations that correspond to entire objects or concepts, such as recognizing a dog, car, or a specific traffic sign. These layers integrate information from a large receptive field and are more invariant to changes in position, scale, and lighting.

In summary, CNN layers form a hierarchical feature pipeline:
- Early: edges, colors, textures
- Middle: shapes, object parts, patterns
- Deep: objects, categories, semantic understanding

This layered representation is what enables CNNs to perform tasks like object detection, image classification, and segmentation effectively.

5
Q

How do we use PCA to visualise the trained model?

A

Question: What has the network learned regarding the relevance of features for a certain class?
Approach: Reduce the learned representations with PCA:
- We can inspect the hidden activations (on various layers) for the prototypical network output (classes)
- We can reduce the representation complexity for visualisation
- We can then visualise the most important principal components. We can also try to visualise trajectories of the NN, such as the motions from the Multiple Timescale Recurrent Neural Network
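
A minimal sketch of this approach, assuming hidden activations have already been collected for a probe set (the arrays below are random stand-ins for real activations and labels):

```python
# Project hidden-layer activations onto their first two principal components
# and colour the points by class to see how the layer separates the classes.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
activations = rng.normal(size=(300, 128))   # stand-in: (n_samples, n_hidden) activations
labels = rng.integers(0, 3, size=300)       # stand-in: class label per sample

pca = PCA(n_components=2)
proj = pca.fit_transform(activations)       # reduce 128-D activations to 2-D

plt.scatter(proj[:, 0], proj[:, 1], c=labels, cmap="tab10", s=10)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("Hidden activations projected onto the top principal components")
plt.show()
```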

5
Q

What are some methods of explainable ML?

A

Inherently interpretable models:
These models are simple enough that the model itself serves as the explanation.
- Examples: Linear regression, decision trees
- Properties: Fully inspectable, compact, sparse, or logic-based
- Ideal for: Clear understanding of decision boundaries and model reasoning

Semi-inherent interpretable models: These use example-based reasoning, explanations come from comparisons rather than the model’s structure.
- Example: K-nearest neighbors (KNN)
- Properties: Relatively intuitive but can be complex depending on data and distance metrics. They rely on specific examples or test data to justify predictions.

Complex Models: For models that are too complex to be directly understood, explanations are partial or approximated.
- Methods: Post-hoc tools (SHAP) and visualisation techniques
- Example: Deep Neural Network
- Needs: Decisions on what to explain (features, concepts), and how (globally or locally)
- Goal: Improve transparency and interpretability despite model opacity

6
Q

How do we approach transparency and visualisation of the trained model?

A

Goal: To understand qualitative characteristics of the trained model
First example: We visualise by inspecting the weight matrices
Question: What is the relation between (hidden layer) connections?
Approach: Visualise the connection strength directly
Difficulties:
- Lack of contextualisation
- Interactions between weights are only indirect
- Dimensionality and scale

6
Q

What are Hinton Diagrams?

A

Hinton diagrams are useful for visualizing the values of a 2D array (e.g. a weight matrix): Positive and negative values are represented by white and black squares, respectively, and the size of each square represents the magnitude of each value.
- You visualize the meaning of a weight in terms of the input and output it connects.
- For instance, when a network processes an image of a dog, the corresponding weight connections can be visualized alongside input and output patterns. This shows not just that a connection exists, but what it does in practice.
- Additionally, by looking at how multiple input patterns activate certain connections, you get a better picture of how specific neurons participate in complex representations.
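
A minimal matplotlib sketch of a Hinton diagram for an arbitrary weight matrix (the matrix here is a random stand-in for real weights):

```python
# Hinton diagram: white squares for positive weights, black for negative,
# square area proportional to the magnitude of each weight.
import numpy as np
import matplotlib.pyplot as plt

def hinton(W, ax=None):
    ax = ax or plt.gca()
    ax.set_facecolor("gray")
    max_w = np.abs(W).max()
    for (row, col), w in np.ndenumerate(W):
        color = "white" if w > 0 else "black"
        size = np.sqrt(abs(w) / max_w)
        ax.add_patch(plt.Rectangle((col - size / 2, row - size / 2), size, size,
                                   facecolor=color, edgecolor=color))
    ax.autoscale_view()
    ax.invert_yaxis()
    ax.set_xticks([]); ax.set_yticks([])

W = np.random.default_rng(0).normal(size=(10, 16))   # stand-in weight matrix
hinton(W)
plt.title("Hinton diagram of a weight matrix")
plt.show()
```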

6
Q

How do we visualise features by optimisation?

A

We initialise a random image for a prototypical network output (class).
We then:
- Calculate the gradient for increasing the neuron responses
- Adjust the image based on the gradients
- REPEAT.

During training, gradient descent shaped the weights that detect each specific feature in detail. For visualisation we invert the direction: the weights stay frozen and we use the gradients to shift the image itself until it activates what we need for a specific class, so we recover a prototypical input (loosely similar in spirit to the diffusion process).
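
A minimal PyTorch sketch of this loop, assuming a pretrained classifier; the target class id, step count, and penalty weight are arbitrary example values:

```python
# Gradient ascent on a random input image to maximise one class logit of a
# frozen, pretrained classifier (the core of feature visualisation by optimisation).
import torch
import torchvision.models as models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()
for p in model.parameters():
    p.requires_grad_(False)                              # freeze the trained weights

target_class = 207                                       # assumed example ImageNet class id
img = torch.randn(1, 3, 224, 224, requires_grad=True)    # start from noise
optimizer = torch.optim.Adam([img], lr=0.05)

for step in range(200):
    optimizer.zero_grad()
    logit = model(img)[0, target_class]
    # maximise the logit (gradient ascent) plus a small L2 penalty so the image
    # stays bounded; practical recipes add jitter/blur regularisation as well
    loss = -logit + 1e-4 * img.pow(2).sum()
    loss.backward()
    optimizer.step()
```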

7
Q

What do we target when applying optimisation for feature visualisation?

A

We can apply the gradient calculations on units or layers of interest:
- Neuron level to visualize individual learned features.
- CNN channel level to understand feature maps.
- Layer level to explore hierarchical abstraction (as in DeepDream).
- Dense/logit layers to see a combination of prominent features.
- Class probability to reveal the most relevant features that influence classification.

This allows fine-grained inspection of what each part of the model is sensitive to.

8
Q

What is deconvolution?

A

We take a normal convolutional network and invert the process (unpooling, reversing the convolutions, and so on), so that we can reconstruct, at any level, what caused an activation:
- It starts by recording which neuron positions are preserved during max-pooling; this helps track the most significant activations.
- Then, the convolutional operations are reversed to map the activations back into input space, effectively reconstructing an approximation of the original image that caused a particular response.
- Alternatively, instead of explicitly reversing operations, we can optimise an input image so that it reproduces the same activation patterns as observed in the network.
- Another option is to train a separate convolutional network to learn how to reconstruct the original input image from the activation maps. These approaches help us visualize what features the network is responding to and enhance model interpretability.
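
A minimal PyTorch sketch of the first idea on toy, untrained layers, purely to illustrate the unpool-then-transposed-convolution path back to input space:

```python
# Record max-pooling switches on the forward pass, then unpool with those
# switches and apply a transposed convolution to map back towards input space.
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
pool = nn.MaxPool2d(2, stride=2, return_indices=True)   # keep the switch locations
unpool = nn.MaxUnpool2d(2, stride=2)
deconv = nn.ConvTranspose2d(8, 3, kernel_size=3, padding=1)
deconv.weight.data = conv.weight.data                    # reuse the learned filters (transposed)

x = torch.randn(1, 3, 32, 32)                            # stand-in input image

# forward pass: convolve, rectify, pool (remembering which positions won)
a = torch.relu(conv(x))
p, switches = pool(a)

# backward mapping: unpool with the recorded switches, rectify, reverse the conv
u = unpool(p, switches)
reconstruction = deconv(torch.relu(u))                   # approximation in input space
print(reconstruction.shape)                              # torch.Size([1, 3, 32, 32])
```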

9
Q

What is Grad-CAM and how is it used to interpret trained CNN models?

A

Grad-CAM (Gradient-weighted Class Activation Mapping) is a method for visualising the attributions of an image.

We reuse the gradients for a different purpose: instead of updating weights, we use them to figure out attributions in the trained model.
The network has learned filters for the input image, such as a “pointy ear” filter or a “claw” filter, for finding the cat.

For a particular image, we also want to see which part of the image is most relevant for the classification: is it the pointy ears, is it the body?

Through backpropagation we can figure out which filters are most used to describe the cat (the distribution of the gradients).

We can then attribute the importance of the class back to local regions of the input. This is also called “mapping saliency (the most relevant parts) onto the input”.

Saliency mapping can also reveal biases: e.g. in nurse/doctor classification, the model should look at the tools present in the picture rather than at hair.

10
Q

How do we apply Grad-CAM?

A

For a given class, it finds the gradient of the output score with respect to the convolutional feature maps. These gradients show how much each feature map influences the final prediction.

Grad-CAM then:
- Weighs the feature maps by these gradients.
- Aggregates the information across spatial locations.
- Resamples the resulting heatmap to match the input resolution.
- Superimposes this heatmap on the original image to highlight important areas.

This helps visualise what parts of the input are most relevant for tasks like image classification, captioning, or visual question answering.
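
A minimal PyTorch sketch of these steps, assuming a pretrained ResNet and using its last convolutional block (layer4) as the target; the input tensor and class id are stand-ins:

```python
# Grad-CAM: hook the last conv block, weight its feature maps by the pooled
# gradients of the target class score, and upsample to the input resolution.
import torch
import torch.nn.functional as F
import torchvision.models as models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

feats, grads = {}, {}
def fwd_hook(module, inp, out):
    feats["a"] = out                                    # feature maps of the last conv block
def bwd_hook(module, grad_in, grad_out):
    grads["a"] = grad_out[0]                            # gradient of the class score w.r.t. those maps
model.layer4.register_forward_hook(fwd_hook)
model.layer4.register_full_backward_hook(bwd_hook)

img = torch.randn(1, 3, 224, 224, requires_grad=True)  # stand-in for a preprocessed image
target_class = 281                                      # assumed example class id

score = model(img)[0, target_class]
score.backward()

weights = grads["a"].mean(dim=(2, 3), keepdim=True)              # pool gradients per channel
cam = F.relu((weights * feats["a"]).sum(dim=1, keepdim=True))    # gradient-weighted sum of maps
cam = F.interpolate(cam, size=img.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)         # normalise for overlaying on the image
```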

10
Q

How can we visualise multi-head attention with heatmaps?

A

We can visualize attention as a 2D heatmap where each point (𝑖,𝑗) shows how much the model at position 𝑖 attends to position 𝑗. These maps help interpret which parts of the input are most influential during processing.

Key patterns:
- Diagonal: token attends to itself or nearby tokens (e.g. local context).
- Vertical stripes: some tokens (e.g. [CLS], verbs) are globally important for many others.
- Horizontal stripes: the token attends broadly across the entire sequence.

By analysing these patterns across the heads, we can understand how information is aggregated across the sequence.
A token that produces such stripes carries more information for the rest of the sequence.
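
A minimal sketch of plotting one head's attention matrix as a heatmap; the attention weights below are computed from random queries and keys purely as a stand-in:

```python
# Plot an attention matrix: entry (i, j) is how much position i attends to j.
import numpy as np
import matplotlib.pyplot as plt

tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
rng = np.random.default_rng(0)
q = rng.normal(size=(len(tokens), 16))                               # stand-in queries
k = rng.normal(size=(len(tokens), 16))                               # stand-in keys
scores = q @ k.T / np.sqrt(16)
attn = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)   # row-wise softmax

plt.imshow(attn, cmap="viridis")
plt.xticks(range(len(tokens)), tokens, rotation=90)
plt.yticks(range(len(tokens)), tokens)
plt.xlabel("attended-to position j")
plt.ylabel("attending position i")
plt.colorbar(label="attention weight")
plt.show()
```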

10
Q

What is attribution with SHAP?

A

Say we have a black box model, and we want to measure the contribution of the input features to the model prediction.

The approach to this is to use SHAP: it’s a method for attributing a model’s prediction to its input features. It calculates the contribution of each feature by comparing predictions with and without the feature.
- How does each feature affect the final prediction?
- What is the significance of each feature compared to others?
- Does it show the model's reliance on interactions between features?

SHAP values are derived from Shapley values, which compute a feature’s average contribution to the model output across all possible subsets of features. This ensures a fair and consistent way to attribute credit for the prediction.

- Based on Shapley values from game theory
- Quantifies how much each feature increases or decreases the output
- Useful to understand model reliance and feature interaction
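
A minimal sketch using the shap library on a toy tree model; the dataset and model choice are arbitrary examples:

```python
# Explain a tree model's predictions on a toy regression dataset with SHAP.
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)        # Shapley-value estimates for tree models
shap_values = explainer.shap_values(X)       # one attribution per feature per sample

# Global view: which features move the output most, and in which direction
shap.summary_plot(shap_values, X)
```

For black-box models that are not tree-based, shap.KernelExplainer(model.predict, background_data) plays the same role, at a higher computational cost.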
11
Q

What is representation engineering in the trained model?

A

Representation engineering is the process of investigating what information is captured inside a trained model, particularly at different layers. The main goal is to determine whether a specific feature or concept is “present” in a given layer of the model.

A common approach is to use linear probes:
- Select a layer from the model.
- Extract the activations from that layer.
- Train a simple linear classifier on those activations to predict a property (like part-of-speech tags or factuality).
- Evaluate the classifier’s performance — strong performance implies that the information is encoded in that layer.

This helps uncover how features are distributed across the network and how deeply embedded certain types of information are, such as in tasks like lie detection or NLP property probing.
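
A minimal sketch of a linear probe, assuming activations have already been extracted from the chosen layer (the arrays below are random stand-ins for activations and property labels):

```python
# Train a logistic-regression probe on layer activations and check whether it
# can predict a property; high held-out accuracy suggests the layer encodes it.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
activations = rng.normal(size=(1000, 256))        # stand-in: (n_samples, hidden_dim)
property_labels = rng.integers(0, 2, size=1000)   # stand-in: property to probe for

X_tr, X_te, y_tr, y_te = train_test_split(activations, property_labels,
                                          test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Chance-level accuracy suggests the property is not (linearly) encoded here.
print("probe accuracy:", probe.score(X_te, y_te))
```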

12
Q

What is attribution?

A

The point of attribution is to figure out whether something specific in our input data is relevant for the inference. It is difficult to scale, but we now have three methods:
- Grad-CAM
- Multi-head attention
- SHAP

12
Q

What are circuit analysis and monosemantic features in transformer interpretability models?

A

NOT IMPORTANT

Both circuit analysis and monosemantic features are advanced methods used to understand how transformer models represent and process information.

Circuit analysis focuses on identifying computational subgraphs (like induction heads and name mover circuits) that implement specific functions inside the model. Researchers use techniques like activation patching, causal tracing, and ablation studies to isolate and test the behavior of these substructures, helping reveal how complex computations are composed internally.

Monosemantic features refer to interpretable representations where a neuron or direction in the model corresponds to a single, meaningful concept (e.g., a sentiment, entity, or grammatical role). This approach aims to make models more transparent by decomposing them into sparse, human-aligned components, often using dictionary learning. It’s part of ongoing work to make model behavior more explainable at scale.

13
Q

What is the role of the residual stream in transformer models?

A

In transformers, each block modifies the signal sequentially. This modification happens through layers like multi-head attention and feedforward networks, wrapped with residual connections and normalization.

The residual stream offers a key information-processing perspective: it allows us to track how information is read and written across layers. This perspective shifts the focus from the computations done inside each block to the representational space being modified, and provides insight into how features accumulate and evolve throughout the network.
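
A minimal sketch of this read/write view as a pre-norm transformer block (sizes and shapes are arbitrary): each sublayer reads from the stream and writes its result back by addition, so the stream keeps the same shape while accumulating features layer by layer.

```python
# Residual-stream view of a transformer block: attention and MLP each read the
# stream (after layer norm) and add their output back into it.
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        h = self.ln1(x)
        a, _ = self.attn(h, h, h)
        x = x + a                      # attention writes into the stream
        x = x + self.mlp(self.ln2(x))  # MLP writes into the stream
        return x

stream = torch.randn(1, 10, 64)        # (batch, tokens, d_model)
print(Block()(stream).shape)           # the stream keeps the same shape across blocks
```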