Topic 12: Geometric Deep Learning Flashcards

(25 cards)

1
Q

What is a graph?

A

A graph has vertices/nodes connected by edges, which may be directed. It can also have universal/global attributes (features that apply to the entire graph). Nodes, edges, and global attributes are each represented by a vector.

2
Q

What is a Euclidean graph and an arbitrary graph?

A

Euclidean graph: the structure and neighbourhood are well defined in a Euclidean space; it has a simple, regular geometric alignment.

Arbitrary graph: non-Euclidean; it can have any complex geometry, may be irregular, and encodes inherently non-linear relationships.

3
Q

Describe data from a graph perspective

A

the data = signal + structure
so the data is composed of two components:
signal: a specific encoding of features or variables
structure: the encoding of the relations between the variables

4
Q

What are some graph representations?

A

Adjacency matrix: a direct representation of the relationships between the ordered nodes. It can be sparse and banded, and it easily represents relationship weights.
Adjacency list: a condensed representation where it's easy to add nodes, and relationship weights are added as a third value. It consists of tuples of the edges and is inherently a more efficient data structure.

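As a concrete illustration (a minimal Python sketch of our own, not from the card), here are both representations for a small hypothetical weighted graph; all variable names are ours.

```python
import numpy as np

# A small undirected, weighted example graph with 4 nodes (0..3).
# Edges as (u, v, weight) tuples -- essentially the adjacency-list "tuple" view.
edges = [(0, 1, 0.5), (1, 2, 2.0), (2, 3, 1.0), (0, 3, 0.7)]
n = 4

# Adjacency matrix: dense n x n array, entry [u, v] stores the edge weight.
A = np.zeros((n, n))
for u, v, w in edges:
    A[u, v] = w
    A[v, u] = w  # undirected: mirror the entry

# Adjacency list: for each node, a list of (neighbour, weight) pairs.
adj_list = {u: [] for u in range(n)}
for u, v, w in edges:
    adj_list[u].append((v, w))
    adj_list[v].append((u, w))

print(A)         # mostly zeros: sparse structure stored densely
print(adj_list)  # compact: only stores edges that exist
```

Adding a node to the list is a single new dictionary entry, while the matrix would have to grow by a full row and column.
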
5
Q

What can you say about deep learning so far?

A

Neural networks are typically viewed as arbitrary function approximators. They can model both simple and complex functions by combining many basic step-like transformations. The complexity of the function a network can learn depends on the number of neurons per layer and how many layers are stacked. With enough layers and units, a neural network can approximate highly intricate patterns and decision boundaries.

A question is: how do we find the right solution, that is, the correct set of weights? To address this, we apply two crucial strategies:
- Constraining the architecture: to introduce useful structural biases into the model.
- Constraining the learning process: to guide optimisation, using tools like backpropagation, regularisation, and initialization schemes.

CNNs are a specific architectural constraint designed for spatial or image-like data. They enforce a hierarchical layer structure, where each layer captures increasingly abstract features from the input. The central component of CNNs is the convolution operation, where small filters slide across the input to extract local features.

These filters are adjustable and learn arbitrary transformations, but always within a localised region. The use of shared weights and local receptive fields improves both efficiency and generalization. This makes CNNs especially powerful for tasks like image recognition, where spatial structure matters.

Transformers extend neural architectures to capture global dependencies in input data, such as sequences or tokenized text. Unlike CNNs, which rely on local filtering, Transformers use attention mechanisms to dynamically connect any input position with any other.

This means the model learns an arbitrary configuration of connections, often spanning the entire input. Attention vectors are not fixed, they can link to previous or even future positions, depending on the task. The Transformer block, composed of multi-head attention, feed-forward layers, and normalization, forms the backbone of modern models like GPT and BERT.

Generative models aim to learn data distributions and generate new samples that resemble the training data. Architecturally, they often incorporate elements from CNNs and Transformers. A key example is Stable Diffusion, which learns to transform a noise distribution into structured data, such as images, through a series of denoising steps.

These transformations typically operate in a low-dimensional latent space, allowing the model to capture high-level structure efficiently. Generative models include classes like Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and diffusion models—all of which are central to tasks like image synthesis, data augmentation, and creative AI applications.

6
Q

What is the curse of dimensionality?

A

Data modelling: when we model data, we may have many variables/features, which leads to high dimensionality

curse of dimensionality:
- as the dimensionality increases, the volume of the space increases exponentially, making the data appear sparse
- the distances between data points become less meaningful

Lipschitz function example: consider a Lipschitz function composed of many small “bumps.” These bumps are hard to detect in high dimensions unless you’re lucky enough to sample very close to them.

This illustrates that:
Even “well-behaved” functions (like Lipschitz ones) can become very hard to learn in high-dimensional spaces.

You’d need an exponential number of samples to capture all the fine structure.

GEOMETRIC PRIORS CAN TACKLE THE CURSE OF DIMENSIONALITY
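
A quick numerical sketch (ours, not part of the card) of the “distances become less meaningful” point: in high dimensions, the nearest and farthest neighbours of a random point end up almost equally far away.

```python
import numpy as np

rng = np.random.default_rng(0)

for d in [2, 10, 100, 1000]:
    # Sample 1000 points uniformly in the d-dimensional unit cube.
    X = rng.random((1000, d))
    # Distances from the first point to all the others.
    dists = np.linalg.norm(X[1:] - X[0], axis=1)
    # In high dimensions this ratio approaches 1: "near" and "far" blur together.
    print(d, dists.min() / dists.max())
```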

7
Q

what are manifolds?

A

Manifold: a topological space that locally looks like a Euclidean space near each point.
Data usually concentrates around low-dimensional manifolds.

Although real-world data often lives in a high-dimensional space (e.g. 784 pixels for MNIST images), it typically doesn’t fill that space uniformly. Instead, the data tends to concentrate around low-dimensional manifolds embedded within the high-dimensional space.

These manifolds represent the underlying degrees of freedom or factors of variation in the data, such as shape, orientation, lighting, or style, which vary smoothly and in far fewer dimensions than the raw data itself.
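
A small illustrative sketch (our construction, not MNIST): points generated from a single angle parameter and embedded in 100 dimensions. The singular values show the variance concentrating in very few directions, i.e. the data hugs a low-dimensional manifold.

```python
import numpy as np

rng = np.random.default_rng(0)

# One underlying degree of freedom: an angle theta (a 1-D manifold, a circle).
theta = rng.uniform(0, 2 * np.pi, size=2000)
circle = np.stack([np.cos(theta), np.sin(theta)], axis=1)   # (2000, 2)

# Embed the circle into a 100-dimensional ambient space with a random linear map
# plus a little noise -- high-dimensional data, low intrinsic dimension.
embed = rng.normal(size=(2, 100))
X = circle @ embed + 0.01 * rng.normal(size=(2000, 100))

# Singular values of the centred data: only ~2 are large, the rest are tiny.
s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
print(np.round(s[:5], 2))
```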

8
Q

What is the Group Invariance Theorem, and why is it important for deep learning?

A

The Group Invariance Theorem states: “if a neural network is invariant to a group, then its output can be expressed as functions of the orbits of the group”

  • Introduces a geometric interpretation of neural networks.
  • Suggests that symmetry-aware models (e.g., CNNs) can leverage group structures like rotation, translation, etc.
  • Helped move beyond the limits of simple perceptrons.
  • Precursor to the Universal Approximation Theorem and modern Geometric Deep Learning.
9
Q

What is the Hubel & Wiesel experiment?

A

An experiment on how different brain cells react to different visual stimuli, pointing towards a hierarchy of simple and complex cells in the visual cortex.

10
Q

How did early neuroscience and models like the Neocognitron influence deep learning architectures?

A

Key Idea:
Understanding how the brain processes spatial patterns hierarchically laid the groundwork for modern convolutional neural networks.

Hubel & Wiesel (1959–1968):
- Discovered simple and complex cells in the visual cortex.
- Identified increasing receptive fields across layers.
- Inspired neurocognitive models of vision.

Neocognitron (Fukushima, 1980):
- Early geometric neural network.
- Introduced translation invariance in responses.
- Acted as a precursor to convolution & pooling layers in CNNs.

11
Q

What symmetries have you encountered in CNN and/or Transformer transformations? (Transformation: representation change from one layer to the next layer)

A

In a convolution we have an input, a filter, and an output that results from applying the filter to the input. Symmetry describes which input transformations this operation respects.

Symmetry: A transformation of the input that doesn’t change the essential structure or meaning of the task.

Equivariance: A transformation of the input causes a predictable, structured transformation in the model’s internal representation.

The convolution operation is inherently translation equivariant: wherever a pattern appears in the input, the filter produces the same response at the corresponding position in the output.

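A minimal 1-D check (ours) that convolution commutes with translation: convolving a shifted signal gives the shifted convolution. We use circular shifts so the check is exact at the boundaries.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=32)          # input signal
w = rng.normal(size=5)           # learnable filter

def circular_conv(x, w):
    # Correlate the filter with the signal under circular (wrap-around) boundaries.
    n, k = len(x), len(w)
    return np.array([sum(w[j] * x[(i + j) % n] for j in range(k)) for i in range(n)])

shift = 7
x_shifted = np.roll(x, shift)

# Translation equivariance: conv(shift(x)) == shift(conv(x))
lhs = circular_conv(x_shifted, w)
rhs = np.roll(circular_conv(x, w), shift)
print(np.allclose(lhs, rhs))     # True
```
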
12
Q

Explain symmetry in LSTMs

A

LSTM: we have a block with an input and an output, and gates that determine how much of the input is actually used. These gates can completely block the information flow, which also helps reduce the vanishing gradient problem. LSTMs also handle time warping.

Time warping = variation in the rate at which time progresses in the data.
LSTMs are invariant to time warping because their gating mechanisms allow them to adaptively remember or forget information, regardless of how fast or slow it arrives.

13
Q

Explain symmetry in transformers

A

Transformers use Attention
- Each input token attends to all other tokens.
- There are no fixed edges; instead, attention defines dynamic, data-dependent connections.
- This makes Transformers effectively graph neural networks operating on a complete graph (i.e., every token can connect to every other).

Symmetry in Transformers: Permutation
Permutation Invariance:
- The output is the same regardless of the order of input tokens (in tasks where order doesn’t matter).

Permutation Equivariance:
- The model’s internal representations change in a predictable way if you permute the input.

Without positional encodings, self-attention is permutation equivariant (and a pooled, set-level output is permutation invariant): it treats all positions equally.

With positional encodings, the Transformer learns to respect order while still leveraging global attention.

This flexible symmetry handling makes Transformers powerful for both structured (ordered) and unstructured (unordered) data.
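
A small numpy sketch (ours) of single-head self-attention without positional encodings: permuting the input tokens permutes the output in exactly the same way (equivariance), and a mean-pooled readout is unchanged (invariance).

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d = 6, 8
X = rng.normal(size=(n_tokens, d))                 # token embeddings (no positional encoding)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(d))              # attention weights over all tokens
    return A @ V

perm = rng.permutation(n_tokens)

out = self_attention(X)
out_perm = self_attention(X[perm])

print(np.allclose(out_perm, out[perm]))            # True: permutation equivariant
print(np.allclose(out_perm.mean(0), out.mean(0)))  # True: pooled readout is invariant
```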

14
Q

What is the role of symmetry in geometric deep learning, and how do CNNs and Transformers reflect that?

A

Invariance:
Model output is unchanged (e.g., cat image → scaled cat image → still “cat”).

Equivariance:
Input and output both change, but in a consistent, structured way (e.g., feature map shifts with image).

CNNs:
- Use translational symmetry: filters slide across input.
- Equivariant under shifts; invariant after pooling.

Transformers:
- Generalise CNNs by handling permutation symmetry (inputs treated as a set, not fixed sequence).
- Learn structure through attention, with or without order.

These symmetries are built-in priors that help models learn more efficiently from structured data

15
Q

What is geometric stability?

A

In our domain we have a specific signal and a specific shift operator; the shift is, for example, a translation to which we want invariance. We may also allow for a certain distortion: an involuntary modification of the signal, not of the domain, so it is not a systematic deformation of the domain but a local distortion.
Humans and CNNs can handle slight distortions and noise and still recognise the input.

16
Q

Describe how the geometric stability looks

A

The whole space of invertible, structure-preserving maps is called the automorphism group.
The symmetry group sits inside this automorphism group.
Any deformation we allow for may be non-rigid and slight, and can even be quantified.

The automorphism group is the set of invertible, structure-preserving transformations, the ideal symmetries that the model is designed to handle (e.g., exact translations, rotations).

Stability means the model stays close to the correct output even when:
  • Transformations go slightly outside this ideal group,
  • Inputs are distorted or deformed, not exactly shifted.

Think of it like a “soft symmetry”: the model doesn’t just recognize perfect copies, but also slightly altered versions.

17
Q

What is scale separation?

A

Scale separation allows a model to analyse data at different levels of detail, using tools like wavelets to build multi-scale representations, which is crucial for capturing both structure and fine-grained information.

Wavelets are localized in both frequency and space, capturing local structure at multiple resolutions.

Useful for detecting patterns like edges or textures at different scales.

Wavelet atoms form a multi-scale basis that allows decomposition of a signal into coarse and fine components.

Many physical and learning tasks are naturally multiscale (e.g., vision, speech).

CNNs and geometric networks can use this principle to learn hierarchical features.

Helps reduce redundancy, increase interpretability, and improve generalization.
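
A tiny sketch (ours) of one level of a Haar wavelet decomposition as a concrete instance of this idea: the signal splits into a coarse approximation plus fine detail, and the two together reconstruct it exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=16)                      # signal length must be even

# One level of the Haar transform: pairwise averages (coarse) and differences (detail).
coarse = (x[0::2] + x[1::2]) / np.sqrt(2)
detail = (x[0::2] - x[1::2]) / np.sqrt(2)

# Perfect reconstruction from the two scales.
recon = np.empty_like(x)
recon[0::2] = (coarse + detail) / np.sqrt(2)
recon[1::2] = (coarse - detail) / np.sqrt(2)
print(np.allclose(recon, x))                 # True

# Repeating the split on `coarse` yields a full multi-scale (pyramid) representation.
```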

18
Q

Give an example of a scale separation prior

A

If we coarsen images of a beach and a mountain, we can still see the large-scale structures, which is enough to tell whether it is a beach or a mountain.
The assumption is that the coarse scale dominates, so we can reduce the resolution to a lower dimensionality and still preserve the meaning: dim(X(Ω′)) ∝ |Ω′| ≪ |Ω| (lower resolution reduces the curse of dimensionality).

If we were dealing with the MNIST dataset and coarsened/blurred the digits, we might lose too much detail (a 3 now looks like an 8). The model needs local features like edges and loops.
Here the assumption is that the fine-scale details dominate, and we can’t downsample too much, or else the performance will drop.

19
Q

What is compositionality in scale separation priors?

A

Compositionality with scale separation means learning to simplify input (via coarsening) before making predictions.
This leads to better approximation, efficiency, and interpretability, especially when combined with symmetry-based priors.

Like first blurring an image to get the big picture (P), then making a prediction based on that (𝑓̃), instead of trying to learn everything at once from noisy raw data.

Benefits:
- Provable approximation: Theoretical results show that composition of simpler functions can approximate complex ones.
- Efficiency: Coarsening reduces data size and noise, improving learning and generalization.
- Composability: You break a hard problem into two easier ones.
- Theoretical flexibility: Even if the full multiscale hypothesis space 𝐹 is hard to understand, composing functions is tractable.
- Can be combined with symmetry priors: → e.g., translation invariance from CNNs, permutation from Transformers.
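
A schematic sketch (ours, with made-up components) of the composition idea f = f̃ ∘ P: a coarsening operator P (average pooling) followed by a simple predictor f̃ operating on the coarse representation.

```python
import numpy as np

def P(x, factor=4):
    # Coarsening operator: non-overlapping average pooling along a 1-D signal.
    return x.reshape(-1, factor).mean(axis=1)

def f_tilde(z, w):
    # Simple predictor on the coarse representation (a linear score, for illustration).
    return float(z @ w)

rng = np.random.default_rng(0)
x = rng.normal(size=64)            # high-resolution input
w = rng.normal(size=16)            # weights live in the much smaller coarse space

# f = f_tilde ∘ P : predict from the coarsened signal instead of the raw one.
y = f_tilde(P(x), w)
print(y)
```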

20
Q

What are some examples of multi-scale phenomena?

A

Domain - Coarse Scale - Fine Scale
Vision - Scene layout, object placement - Textures, edges, facial features

Speech - Sentence prosody (intonation) - Phonemes, pitch variation

Weather/Climate - Global temperature zones - Local storms, turbulence

Biology - Organs or tissues - Cells, molecular structures

Natural Language - Sentence structure - Word choice, punctuation

21
Q

What are some examples of Multi-Scale ML Solutions?

A

ML Solution: How it captures multi-scale structure
CNNs: Pooling and strides reduce resolution layer by layer

Wavelet Neural Networks: Use wavelet transforms to process inputs at multiple scales

U-Net (for segmentation): Encoder-decoder with skip connections at different scales

Graph Coarsening / Pooling: Hierarchical pooling over graphs (e.g., DiffPool, SAGPool)

Multiresolution RNNs: Handle both slow and fast temporal patterns (e.g., HRNNs)

22
Q

Explain how geometric learning supports learning stable representations

A

Geometric learning builds on the idea that data lives on structured domains (like grids, graphs, or manifolds), and that symmetries in these domains (like translation or rotation) can guide learning.

The key spaces of geometric learning:
Domain Ω: The space the data lives on (e.g., image grid, graph).
Signals/features X(Ω): Features defined over that domain (e.g., pixel values).
Hypothesis class F(X(Ω)): Functions we want to learn.
Symmetries (a group that imposes structure on F): model how transformations (like shifts or rotations) act on data and define:
- Equivariance: f(ρ(g)x) = ρ(g)f(x) → output transforms in the same way.
- Invariance: f(ρ(g)x) = f(x) → output stays the same.

Blueprint for Stable Representations:
- Local equivariant maps (e.g., convolution)
- Global invariant maps (e.g., pooling, mean, classification)
- Nonlinearity (e.g., ReLU, attention)
- Coarsening operators (e.g., pooling, hierarchical layers)

These building blocks help construct models that are robust, efficient, and generalisable, ideal for structured data like images, graphs, and physical simulations.
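
A compact numpy sketch (ours) stacking the four building blocks on a 1-D signal: a local equivariant map (circular convolution), a nonlinearity (ReLU), a coarsening operator (average pooling), and a global invariant map (mean readout).

```python
import numpy as np

rng = np.random.default_rng(0)

def conv(x, w):
    # Local, translation-equivariant map (circular 1-D convolution).
    n, k = len(x), len(w)
    return np.array([sum(w[j] * x[(i + j) % n] for j in range(k)) for i in range(n)])

def relu(x):                        # pointwise nonlinearity
    return np.maximum(x, 0.0)

def coarsen(x, factor=2):           # coarsening operator: average pooling
    return x.reshape(-1, factor).mean(axis=1)

def readout(x):                     # global invariant map: mean over the domain
    return x.mean()

x = rng.normal(size=32)
w1, w2 = rng.normal(size=3), rng.normal(size=3)

# Blueprint: (conv -> ReLU -> coarsen) stacked, then a global invariant readout.
h = coarsen(relu(conv(x, w1)))      # 32 -> 16
h = coarsen(relu(conv(h, w2)))      # 16 -> 8
print(readout(h))                   # a translation-robust scalar representation
```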

23
Q

What are Bronstein's 5Gs?

A

Grids: Images and sequences
- used in CNNs, symmetry -> translation
Groups: Homogeneous spaces
- used in group CNNs, symmetry -> group actions (e.g. rotations)
Graphs: Graphs and sets
- used in GNNs, symmetry -> permutation
Geodesics & gauges: Manifolds, meshes and geometric graphs
- used in intrinsic (mesh) CNNs, symmetry -> isometry and gauge transformations

We started with a perceptron

24
Q

How does GDL apply across common ML models?

A

Model - Symmetry Type - Domain Example

Perceptrons - Function regularity - Vector space 𝑅^𝑛

CNNs - Translation - Grid/image

Group CNNs - Translation + rotation - Sphere

LSTMs - Time warping - 1D sequences

Transformers/DeepSets - Permutation - Sets / tokens

GNNs - Permutation - Graphs

Intrinsic CNNs - Isometry, gauge - Manifolds, meshes

25
Q

Describe two of the three geometric priors according to Bronstein

A

**Symmetry (Invariance/Equivariance)**
What it means: If I transform the input in a certain way, the model’s prediction should transform in a predictable way, or stay the same.
- Invariance: output stays the same.
- Equivariance: output changes in sync with the input.
Example:
- For images, CNNs are translation-equivariant: if you shift the image, the feature maps shift the same way.
- For graphs, GNNs are permutation-invariant: if you shuffle the node order, the output stays the same.
Why it matters: This lets the model generalize better with less data. It doesn’t need to “re-learn” the same pattern in every orientation or order.

**Compositionality**
What it means: The model learns by focusing on local interactions, and builds up complex global understanding by combining many small, local computations.
Example:
- CNNs use local convolution filters (e.g., 3×3) on images.
- GNNs aggregate messages from neighboring nodes in a graph.
Why it matters: This reduces computational complexity and makes learning tractable; the model only needs to learn local rules and can stack layers to model more complex, long-range dependencies.
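
A minimal sketch (ours) of the GNN side of both priors: each node aggregates messages only from its local neighbours (compositionality), and a summed graph-level readout is unchanged when the node order is shuffled with the adjacency relabelled consistently (permutation symmetry).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 4
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)   # adjacency of a small undirected graph
X = rng.normal(size=(n, d))                     # node features
W = rng.normal(size=(d, d))                     # shared (local) weight matrix

def gnn_layer(A, X, W):
    # Compositionality: each node only combines its neighbours' features (A @ X),
    # transformed by the same local rule W.
    return np.maximum(A @ X @ W, 0.0)

def graph_readout(A, X, W):
    # Symmetry: summing over nodes makes the graph-level output permutation invariant.
    return gnn_layer(A, X, W).sum(axis=0)

perm = rng.permutation(n)
P = np.eye(n)[perm]                             # permutation matrix

out = graph_readout(A, X, W)
out_perm = graph_readout(P @ A @ P.T, P @ X, W)
print(np.allclose(out, out_perm))               # True
```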