Week 6: Representation Learning Flashcards
(16 cards)
What is a latent component?
A latent component is a hidden feature or pattern in data that you cannot directly observe but explains important characteristics of the data.
What is the curse of dimensionality and how can we address it?
The curse of dimensionality is the problem that occurs when data has too many features/variables - as dimensions increase, data becomes increasingly sparse and distances between points become less meaningful.
Dimensionality Reduction: Project the data onto fewer, more informative dimensions (PCA, LDA, ICA)
Feature Selection: Keep only the most relevant features
Data Compression: Use autoencoders to learn compressed representations
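A minimal NumPy sketch (toy example, not from the slides) of why distances lose meaning in high dimensions: as the dimension grows, the nearest and farthest neighbour of a random query point become almost equally far away.

```python
# Toy illustration: the nearest/farthest distance ratio approaches 1 as d grows.
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    points = rng.random((1000, d))      # 1000 uniform points in the unit hypercube
    query = rng.random(d)
    dists = np.linalg.norm(points - query, axis=1)
    ratio = dists.min() / dists.max()   # -> approaches 1 in high dimensions
    print(f"d={d:5d}  nearest/farthest distance ratio = {ratio:.3f}")
```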
What is transformation by regression in the context of representation learning?
Transformation by regression is learning a parametric model that maps input data to output values through supervised learning.
Key Points:
Classification: Maps input x to discrete class y
Regression: Maps input x to continuous value y
Goal: Learn from labeled data to predict outputs for new inputs
Method: Supervised machine learning
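A minimal sketch of transformation by regression using scikit-learn (library assumed; the synthetic data and coefficients are purely illustrative): fit a parametric model that maps inputs x to continuous targets y from labelled examples.

```python
# Learn a parametric mapping x -> y from labelled data (synthetic toy data).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.random((200, 3))                                              # 200 inputs, 3 features
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.standard_normal(200)  # noisy continuous targets

model = LinearRegression().fit(X, y)    # supervised learning of the mapping
print(model.predict(X[:2]))             # predictions for new inputs
```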
What is transformation by correlation (PCA) and how does it work?
Transformation by correlation refers to Principal Component Analysis (PCA) - a linear, non-parametric approach to reduce dimensionality by finding directions of maximum variance.
How It Works:
Find correlations in the data
Create new coordinate system where:
First axis = direction of greatest variance (first eigenvector)
Second axis = direction of the second-greatest variance, orthogonal to the first
And so on…
Keep only first few components that explain most variance
Ignore components with small eigenvalues (low variance)
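A minimal NumPy sketch of the procedure above, assuming synthetic correlated data: eigendecompose the covariance matrix, sort by eigenvalue, and keep only the top components.

```python
# PCA via the covariance eigendecomposition (NumPy only; names are illustrative).
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10)) @ rng.standard_normal((10, 10))  # correlated data

Xc = X - X.mean(axis=0)                     # centre the data
cov = Xc.T @ Xc / (len(Xc) - 1)             # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvectors = principal directions
order = np.argsort(eigvals)[::-1]           # sort by variance, largest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2                                       # keep only the first k components
Z = Xc @ eigvecs[:, :k]                     # low-dimensional representation
print(f"{eigvals[:k].sum() / eigvals.sum():.1%} of the variance kept with {k} components")
```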
What is Independent Component Analysis (ICA) and how does it differ from PCA?
ICA (Independent Component Analysis) decomposes observed data into additive source components that are statistically independent, e.g. separating mixed audio recordings back into the original signals.
Key Difference from PCA:
PCA: Finds orthogonal components with maximum variance
ICA: Finds statistically independent components (non-Gaussian sources)
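A hedged sketch with scikit-learn's FastICA (library assumed): two independent, non-Gaussian toy signals are linearly mixed and then unmixed, a miniature "cocktail party" setup.

```python
# Unmix linearly mixed signals into statistically independent components.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
sources = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]  # independent, non-Gaussian sources
mixing = np.array([[1.0, 0.5], [0.5, 2.0]])
X = sources @ mixing.T                                   # observed mixed signals

ica = FastICA(n_components=2, random_state=0)
recovered = ica.fit_transform(X)                         # estimated independent components
print(recovered.shape)                                   # (2000, 2)
```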
What is a bottleneck in an autoencoder and why is it important?
Definition:
A bottleneck is a narrow hidden layer in an autoencoder that forces the network to compress information into a much smaller latent representation.
What is transposed convolution and why do we use it in autoencoders?
Transposed convolution (often loosely called “deconvolution”) reverses the spatial downsampling of a regular convolution: it is a learnable upsampling operation that makes feature maps larger, not a true mathematical inverse of convolution.
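A short PyTorch sketch (framework assumed, not prescribed by the slides) showing how a transposed convolution doubles the spatial size of a feature map:

```python
# Upsample an 8x8 feature map to 16x16 with a transposed convolution.
import torch
import torch.nn as nn

x = torch.randn(1, 16, 8, 8)                   # batch of 8x8 feature maps, 16 channels
up = nn.ConvTranspose2d(in_channels=16, out_channels=8,
                        kernel_size=3, stride=2, padding=1, output_padding=1)
y = up(x)
print(y.shape)                                 # torch.Size([1, 8, 16, 16]) -- spatial size doubled
```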
What is an autoencoder and how does it work?
An autoencoder is a neural network that learns to compress data into a latent representation and then reconstruct the original data from that representation.
Architecture:
Encoder: Compresses input x ∈ ℝᴰ → latent code z ∈ ℝᴸ (where L < D)
Decoder: Reconstructs latent code z → output r(x) ∈ ℝᴰ
Training:
Loss function: Reconstruction error L(θ) = ||r(x) - x||²
Goal: Minimize the difference between input and reconstructed output
Learning: Network learns meaningful latent representations automatically
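A minimal autoencoder sketch in PyTorch (framework and layer sizes are assumptions for illustration): the encoder compresses x ∈ ℝᴰ to z ∈ ℝᴸ with L < D, the decoder reconstructs r(x), and training minimises the reconstruction error.

```python
# Minimal autoencoder: encoder -> bottleneck z -> decoder, trained with MSE.
import torch
import torch.nn as nn

D, L = 784, 32                                  # e.g. flattened 28x28 images -> 32-d code
encoder = nn.Sequential(nn.Linear(D, 128), nn.ReLU(), nn.Linear(128, L))
decoder = nn.Sequential(nn.Linear(L, 128), nn.ReLU(), nn.Linear(128, D))
params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

x = torch.rand(64, D)                           # stand-in batch of data
for step in range(100):
    z = encoder(x)                              # latent code (bottleneck)
    r = decoder(z)                              # reconstruction r(x)
    loss = ((r - x) ** 2).mean()                # reconstruction error ||r(x) - x||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```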
What is a Denoising Autoencoder and why is it useful?
A Denoising Autoencoder (DAE) is an autoencoder trained to reconstruct clean data from corrupted/noisy input.
How It Works:
Add noise to clean input data (Gaussian noise, dropout, etc.)
Feed noisy input to the autoencoder
Train to reconstruct the original clean version
Loss function: ||clean_original - reconstruction||²
Types of Noise:
Gaussian noise: Random pixel values added
Bernoulli/Dropout noise: Random pixels set to zero
Salt & pepper noise: Random black/white pixels
Why It’s Powerful:
Robustness: Learns features that work even with corruption
Better generalization: Prevents overfitting to exact pixel values
Meaningful features: Forces network to learn underlying patterns, not noise
Regularization effect: Acts as implicit regularization
Key Insight:
By learning to “clean up” noisy data, the network learns more robust and meaningful representations of the underlying data structure.
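A minimal denoising-autoencoder sketch in PyTorch (framework and sizes assumed): the input is corrupted with Gaussian noise, but the loss is computed against the clean original.

```python
# Denoising autoencoder: reconstruct the clean input from a noisy version of it.
import torch
import torch.nn as nn

D, L = 784, 32
model = nn.Sequential(                          # encoder + decoder in one module
    nn.Linear(D, 128), nn.ReLU(), nn.Linear(128, L),
    nn.Linear(L, 128), nn.ReLU(), nn.Linear(128, D))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x_clean = torch.rand(64, D)                     # stand-in batch of clean data
for step in range(100):
    x_noisy = x_clean + 0.3 * torch.randn_like(x_clean)  # Gaussian corruption
    r = model(x_noisy)                          # reconstruct from the noisy input
    loss = ((r - x_clean) ** 2).mean()          # ||clean_original - reconstruction||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```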
What is a Sparse Autoencoder and how does L1 regularization create sparsity?
Definition:
A Sparse Autoencoder uses regularization to ensure that only a few neurons in the latent layer are active at any time, creating sparse representations.
L1 Regularization Method:
Add penalty: Ω(z) = λ||z||₁ to the loss function
Effect: Penalizes the sum of absolute values of latent activations
Result: Forces most latent neurons to have zero or near-zero activations
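A minimal sketch of adding the L1 penalty to the loss (PyTorch assumed; layer sizes and λ are illustrative):

```python
# Sparse autoencoder objective: reconstruction error + lambda * ||z||_1.
import torch
import torch.nn as nn

D, L, lam = 784, 64, 1e-3                       # lam = sparsity weight lambda
encoder = nn.Sequential(nn.Linear(D, 128), nn.ReLU(), nn.Linear(128, L), nn.ReLU())
decoder = nn.Sequential(nn.Linear(L, 128), nn.ReLU(), nn.Linear(128, D))

x = torch.rand(64, D)
z = encoder(x)                                  # latent activations
r = decoder(z)
reconstruction = ((r - x) ** 2).mean()
sparsity = lam * z.abs().sum(dim=1).mean()      # Omega(z) = lambda * ||z||_1
loss = reconstruction + sparsity                # total training objective
```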
Alternative: KL Divergence Penalty:
Constrains the average activation of each latent neuron to stay close to a small target sparsity level ρ
Can make representations even sparser and more interpretable, loosely mimicking how most biological neurons are inactive at any given moment
Benefits of Sparsity:
Interpretability: Each active neuron represents a specific, meaningful feature
Efficiency: Only essential neurons are used
Biological realism: Mimics how real brain neurons work (most are inactive)
Feature disentanglement: Different features captured by different neurons
Visual Result:
No regularization: Dense, messy activation patterns
With L1: Clean, sparse patterns with few active neurons
What is the Manifold Hypothesis and why is it important for representation learning?
The Manifold Hypothesis states that most naturally occurring high-dimensional data lies on or near a much lower-dimensional manifold (curved surface) embedded in the high-dimensional space.
It matters for representation learning because an encoder then only needs enough latent dimensions to describe positions on that manifold, so a compact latent code can capture the data with little loss of information.
How do autoencoders differ from other encoder-decoder architectures like U-Net and SegNet?
Standard Autoencoder:
Structure: Encoder → Bottleneck → Decoder
Information flow: All information must pass through narrow bottleneck
Goal: Learn compressed representations
Trade-off: Compression vs. reconstruction quality
U-Net:
Structure: Encoder-decoder WITH skip connections
Information flow: Direct connections between encoder and decoder at same resolutions
Goal: Preserve fine details for tasks like image segmentation
Benefit: Combines low-level details with high-level semantics
SegNet:
Structure: Encoder-decoder with pooling indices
Information flow: Stores the encoder’s max-pooling indices and reuses them for unpooling in the decoder, preserving spatial locations
Goal: Efficient upsampling for segmentation tasks
Key Differences:
Bottleneck tightness: Autoencoders force tight compression; U-Net/SegNet preserve more information
Use cases: Autoencoders for representation learning; U-Net/SegNet for pixel-level tasks
Information preservation: Skip connections vs. forced compression
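A toy PyTorch sketch of the skip-connection idea behind U-Net (framework, layer sizes, and channel counts are assumptions): decoder features are concatenated with same-resolution encoder features instead of forcing everything through a tight bottleneck.

```python
# Skip connection: concatenate upsampled decoder features with encoder features.
import torch
import torch.nn as nn

enc1 = nn.Conv2d(1, 16, 3, padding=1)          # encoder block, full resolution
down = nn.MaxPool2d(2)
enc2 = nn.Conv2d(16, 32, 3, padding=1)         # encoder block, half resolution
up = nn.ConvTranspose2d(32, 16, 2, stride=2)   # upsample back to full resolution
dec = nn.Conv2d(32, 1, 3, padding=1)           # decoder sees skip + upsampled features

x = torch.randn(1, 1, 64, 64)
f1 = torch.relu(enc1(x))                       # features kept for the skip connection
f2 = torch.relu(enc2(down(f1)))
u = up(f2)                                     # back to 64x64
out = dec(torch.cat([u, f1], dim=1))           # skip connection: concatenate channels
print(out.shape)                               # torch.Size([1, 1, 64, 64])
```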
What is the difference between localist and distributed representations?
Localist Representation:
One neuron = one concept/feature
Easy to associate and interpret
Example: One-hot encoding (0,0,1,0,0,0,0,0,0,0) for “horse”
Each position represents a specific category
Distributed Representation:
Multiple neurons work together to represent concepts
Binding by simultaneity: many features are active at once to represent a single concept
Many-to-many relationship between neurons and concepts
Example: Word embeddings, where meaning emerges from patterns across many dimensions
Key Advantages of Distributed:
More efficient: Can represent exponentially more concepts
Generalizable: Similar concepts have similar patterns
Robust: Damage to few neurons doesn’t destroy representation
Compositional: Can combine features in novel ways
Real-World Examples:
Localist: Traditional categorical encoding, lookup tables
Distributed: Neural network hidden layers, word embeddings, image features
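A toy NumPy contrast (illustrative numbers only) between a localist one-hot vector and a dense distributed embedding for the same word:

```python
# Localist one-hot vector vs. a dense distributed embedding for "horse".
import numpy as np

vocab = ["cat", "dog", "horse", "car"]
idx = vocab.index("horse")

one_hot = np.zeros(len(vocab))                 # localist: one position = one concept
one_hot[idx] = 1.0                             # -> [0, 0, 1, 0]

embeddings = np.random.default_rng(0).normal(size=(len(vocab), 5))
distributed = embeddings[idx]                  # distributed: pattern across 5 dimensions
print(one_hot, distributed, sep="\n")
```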
What are word embeddings and why are they useful?
Definition:
Word embeddings map words from high-dimensional sparse vectors (like one-hot encoding) to low-dimensional dense vectors that capture semantic meaning.
The Problem They Solve:
Sparse representation: “hotel” and “motel” have completely different one-hot vectors
No semantic relationship: Traditional encoding can’t capture that words are similar
How They Work:
Dense vectors: Each word → continuous vector (e.g., 300 dimensions)
Semantic similarity: Similar words have similar vector representations
Learned relationships: Captures linguistic context and meaning
Famous Example:
vec(Berlin) ≈ vec(Germany) + vec(capital)
vec(queen) ≈ vec(king) - vec(man) + vec(woman)
Key Benefits:
Semantic similarity: Related words cluster together
Compositional: Mathematical operations preserve meaning
Transfer learning: Pre-trained embeddings work across tasks
Efficiency: Dense representation vs. massive vocabulary size
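A hedged sketch of the embedding arithmetic above using gensim’s pretrained vectors (gensim and the 'glove-wiki-gigaword-50' download are assumed to be available; any small pretrained model would do):

```python
# king - man + woman: search near the resulting vector.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")   # pretrained 50-d GloVe embeddings
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3)
print(result)                                  # 'queen' typically ranks near the top
```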
What is Latent Semantic Analysis (LSA) and how does it create word embeddings?
LSA (Latent Semantic Analysis) is a method for creating word embeddings by finding latent semantic relationships in text using matrix factorization.
How It Works:
Create term-document matrix: Count word frequencies across documents (often weighted, e.g. with TF-IDF)
Apply SVD: Use Singular Value Decomposition to find low-rank approximation
Extract embeddings: Use the factorized matrices as word and document representations
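A minimal LSA sketch with scikit-learn (library assumed; the three toy documents are illustrative): build a TF-IDF term-document matrix, then take a truncated SVD to get low-rank document and word representations.

```python
# LSA: TF-IDF matrix + truncated SVD -> low-dimensional semantic space.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the dog chased the cat",
        "a puppy is a young dog",
        "stocks fell on the market today"]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)                  # documents x terms matrix
svd = TruncatedSVD(n_components=2)             # SVD-based low-rank approximation
doc_embeddings = svd.fit_transform(X)          # 2-d representation of each document
word_embeddings = svd.components_.T            # 2-d representation of each term
print(doc_embeddings.shape, word_embeddings.shape)
```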
Key Concepts:
LSI (Latent Semantic Indexing): Same technique, focuses on information retrieval
Context window: For LSA, context can be local neighborhood of words instead of full documents
Cosine similarity: Used to find similar words/documents in the embedding space
Example Results:
When searching for “dog” with different context windows:
h=2: cat, horse, fox, pet, rabbit, pig, animal…
h=30: kennel, puppy, pet, bitch, terrier, rottweiler…
Connection to Modern Methods:
LSA is a precursor to modern word embeddings (Word2Vec, GloVe) - it learns semantic relationships but uses classical matrix factorization instead of neural networks.
What is Word2Vec and what are the two main architectures (CBOW vs Skip-gram)?
Word2Vec is a neural network method for learning word embeddings that represent words as dense vectors in a semantic space where similar words are close together.
Two Main Architectures:
CBOW (Continuous Bag-of-Words):
Input: Context words around a target
Output: Predict the missing target word
Question: “Given context words, what missing word is likely to appear?”
Faster to train, better for frequent words
Skip-gram:
Input: Single target word
Output: Predict surrounding context words
Question: “Given this word, what other words are likely to appear nearby?”
Better for rare words/phrases, works well with small data
Key Innovation:
Shallow neural networks (not deep) with efficient training
Latent code: Low-dimensional dense vectors capture semantic relationships
Context window: Uses local word neighborhoods to learn meaning
Semantic Properties:
Similar words cluster together
Captures relationships: king - man + woman ≈ queen
Enables arithmetic in word space
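A hedged sketch of training both Word2Vec variants with gensim (library assumed; the tiny corpus is illustrative only): sg=0 selects CBOW, sg=1 selects Skip-gram.

```python
# Train CBOW and Skip-gram Word2Vec models on a toy corpus.
from gensim.models import Word2Vec

corpus = [["the", "dog", "chased", "the", "cat"],
          ["a", "puppy", "is", "a", "young", "dog"],
          ["the", "cat", "sat", "on", "the", "mat"]]

cbow = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0)       # CBOW
skipgram = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)   # Skip-gram
print(skipgram.wv.most_similar("dog", topn=3))  # nearest neighbours in the embedding space
```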