Week 6: Representation Learning Flashcards
(16 cards)
What is a latent component?
A latent component is a hidden feature or pattern in data that you cannot directly observe but explains important characteristics of the data.
What is the curse of dimensionality and how can we address it?
The curse of dimensionality is the problem that occurs when data has too many features/variables - as dimensions increase, data becomes increasingly sparse and distances between points become less meaningful.
Dimensionality Reduction: Project the data onto fewer, more informative dimensions (PCA, LDA, ICA)
Feature Selection: Keep only the most relevant features
Data Compression: Use autoencoders to learn compressed representations
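A minimal NumPy sketch (toy example, not from the slides) of why distances lose meaning in high dimensions: as the dimension grows, the nearest and farthest neighbour of a random query point become almost equally far away.

```python
# Toy illustration: the nearest/farthest distance ratio approaches 1 as d grows.
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    points = rng.random((1000, d))      # 1000 uniform points in the unit hypercube
    query = rng.random(d)
    dists = np.linalg.norm(points - query, axis=1)
    ratio = dists.min() / dists.max()   # -> approaches 1 in high dimensions
    print(f"d={d:5d}  nearest/farthest distance ratio = {ratio:.3f}")
```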
What is transformation by regression in the context of representation learning?
Transformation by regression is learning a parametric model that maps input data to output values through supervised learning.
Key Points:
Classification: Maps input x to discrete class y
Regression: Maps input x to continuous value y
Goal: Learn from labeled data to predict outputs for new inputs
Method: Supervised machine learning
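A minimal sketch of transformation by regression using scikit-learn (library assumed; the synthetic data and coefficients are purely illustrative): fit a parametric model that maps inputs x to continuous targets y from labelled examples.

```python
# Learn a parametric mapping x -> y from labelled data (synthetic toy data).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.random((200, 3))                                              # 200 inputs, 3 features
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.standard_normal(200)  # noisy continuous targets

model = LinearRegression().fit(X, y)    # supervised learning of the mapping
print(model.predict(X[:2]))             # predictions for new inputs
```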
What is transformation by correlation (PCA) and how does it work?
Transformation by correlation refers to Principal Component Analysis (PCA) - a linear, non-parametric approach to reduce dimensionality by finding directions of maximum variance.
How It Works:
Find correlations in the data
Create new coordinate system where:
First axis = direction of greatest variance (first eigenvector)
Second axis = direction of the second-greatest variance, orthogonal to the first
And so on…
Keep only first few components that explain most variance
Ignore components with small eigenvalues (low variance)
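A minimal NumPy sketch of the procedure above, assuming synthetic correlated data: eigendecompose the covariance matrix, sort by eigenvalue, and keep only the top components.

```python
# PCA via the covariance eigendecomposition (NumPy only; names are illustrative).
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10)) @ rng.standard_normal((10, 10))  # correlated data

Xc = X - X.mean(axis=0)                     # centre the data
cov = Xc.T @ Xc / (len(Xc) - 1)             # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvectors = principal directions
order = np.argsort(eigvals)[::-1]           # sort by variance, largest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2                                       # keep only the first k components
Z = Xc @ eigvecs[:, :k]                     # low-dimensional representation
print(f"{eigvals[:k].sum() / eigvals.sum():.1%} of the variance kept with {k} components")
```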
What is Independent Component Analysis (ICA) and how does it differ from PCA?
ICA (Independent Component Analysis) decomposes observed data into additive source components that are statistically independent, e.g. separating mixed audio recordings back into the original signals.
Key Difference from PCA:
PCA: Finds orthogonal components with maximum variance
ICA: Finds statistically independent components (non-Gaussian sources)
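A hedged sketch with scikit-learn's FastICA (library assumed): two independent, non-Gaussian toy signals are linearly mixed and then unmixed, a miniature "cocktail party" setup.

```python
# Unmix linearly mixed signals into statistically independent components.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
sources = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]  # independent, non-Gaussian sources
mixing = np.array([[1.0, 0.5], [0.5, 2.0]])
X = sources @ mixing.T                                   # observed mixed signals

ica = FastICA(n_components=2, random_state=0)
recovered = ica.fit_transform(X)                         # estimated independent components
print(recovered.shape)                                   # (2000, 2)
```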
What is a bottleneck in an autoencoder and why is it important?
Definition:
A bottleneck is a narrow hidden layer in an autoencoder that forces the network to compress information into a much smaller latent representation.
What is transposed convolution and why do we use it in autoencoders?
Transposed convolution (often loosely called “deconvolution”) reverses the spatial downsampling of a regular convolution: it is a learnable upsampling operation that makes feature maps larger, not a true mathematical inverse of convolution.
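A short PyTorch sketch (framework assumed, not prescribed by the slides) showing how a transposed convolution doubles the spatial size of a feature map:

```python
# Upsample an 8x8 feature map to 16x16 with a transposed convolution.
import torch
import torch.nn as nn

x = torch.randn(1, 16, 8, 8)                   # batch of 8x8 feature maps, 16 channels
up = nn.ConvTranspose2d(in_channels=16, out_channels=8,
                        kernel_size=3, stride=2, padding=1, output_padding=1)
y = up(x)
print(y.shape)                                 # torch.Size([1, 8, 16, 16]) -- spatial size doubled
```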
What is an autoencoder and how does it work?
An autoencoder is a neural network that learns to compress data into a latent representation and then reconstruct the original data from that representation.
Architecture:
Encoder: Compresses input x ∈ ℝᴰ → latent code z ∈ ℝᴸ (where L < D)
Decoder: Reconstructs latent code z → output r(x) ∈ ℝᴰ
Training:
Loss function: Reconstruction error L(θ) = ||r(x) - x||²
Goal: Minimize the difference between input and reconstructed output
Learning: Network learns meaningful latent representations automatically
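A minimal autoencoder sketch in PyTorch (framework and layer sizes are assumptions for illustration): the encoder compresses x ∈ ℝᴰ to z ∈ ℝᴸ with L < D, the decoder reconstructs r(x), and training minimises the reconstruction error.

```python
# Minimal autoencoder: encoder -> bottleneck z -> decoder, trained with MSE.
import torch
import torch.nn as nn

D, L = 784, 32                                  # e.g. flattened 28x28 images -> 32-d code
encoder = nn.Sequential(nn.Linear(D, 128), nn.ReLU(), nn.Linear(128, L))
decoder = nn.Sequential(nn.Linear(L, 128), nn.ReLU(), nn.Linear(128, D))
params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

x = torch.rand(64, D)                           # stand-in batch of data
for step in range(100):
    z = encoder(x)                              # latent code (bottleneck)
    r = decoder(z)                              # reconstruction r(x)
    loss = ((r - x) ** 2).mean()                # reconstruction error ||r(x) - x||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```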
What is a Denoising Autoencoder and why is it useful?
A Denoising Autoencoder (DAE) is an autoencoder trained to reconstruct clean data from corrupted/noisy input.
How It Works:
Add noise to clean input data (Gaussian noise, dropout, etc.)
Feed noisy input to the autoencoder
Train to reconstruct the original clean version
Loss function: ||clean_original - reconstruction||²
Types of Noise:
Gaussian noise: Random pixel values added
Bernoulli/Dropout noise: Random pixels set to zero
Salt & pepper noise: Random black/white pixels
Why It’s Powerful:
Robustness: Learns features that work even with corruption
Better generalization: Prevents overfitting to exact pixel values
Meaningful features: Forces network to learn underlying patterns, not noise
Regularization effect: Acts as implicit regularization
Key Insight:
By learning to “clean up” noisy data, the network learns more robust and meaningful representations of the underlying data structure.
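A minimal denoising-autoencoder sketch in PyTorch (framework and sizes assumed): the input is corrupted with Gaussian noise, but the loss is computed against the clean original.

```python
# Denoising autoencoder: reconstruct the clean input from a noisy version of it.
import torch
import torch.nn as nn

D, L = 784, 32
model = nn.Sequential(                          # encoder + decoder in one module
    nn.Linear(D, 128), nn.ReLU(), nn.Linear(128, L),
    nn.Linear(L, 128), nn.ReLU(), nn.Linear(128, D))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x_clean = torch.rand(64, D)                     # stand-in batch of clean data
for step in range(100):
    x_noisy = x_clean + 0.3 * torch.randn_like(x_clean)  # Gaussian corruption
    r = model(x_noisy)                          # reconstruct from the noisy input
    loss = ((r - x_clean) ** 2).mean()          # ||clean_original - reconstruction||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```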
What is a Sparse Autoencoder and how does L1 regularization create sparsity?
Definition:
A Sparse Autoencoder uses regularization to ensure that only a few neurons in the latent layer are active at any time, creating sparse representations.
L1 Regularization Method:
Add penalty: Ω(z) = λ||z||₁ to the loss function
Effect: Penalizes the sum of absolute values of latent activations
Result: Forces most latent neurons to have zero or near-zero activations
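A minimal sketch of adding the L1 penalty to the loss (PyTorch assumed; layer sizes and λ are illustrative):

```python
# Sparse autoencoder objective: reconstruction error + lambda * ||z||_1.
import torch
import torch.nn as nn

D, L, lam = 784, 64, 1e-3                       # lam = sparsity weight lambda
encoder = nn.Sequential(nn.Linear(D, 128), nn.ReLU(), nn.Linear(128, L), nn.ReLU())
decoder = nn.Sequential(nn.Linear(L, 128), nn.ReLU(), nn.Linear(128, D))

x = torch.rand(64, D)
z = encoder(x)                                  # latent activations
r = decoder(z)
reconstruction = ((r - x) ** 2).mean()
sparsity = lam * z.abs().sum(dim=1).mean()      # Omega(z) = lambda * ||z||_1
loss = reconstruction + sparsity                # total training objective
```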
Alternative: KL Divergence Penalty:
Constrains the average activation of each latent neuron to stay close to a small target sparsity level ρ
Can make representations even sparser and more interpretable, loosely mimicking how most biological neurons are inactive at any given moment
Benefits of Sparsity:
Interpretability: Each active neuron represents a specific, meaningful feature
Efficiency: Only essential neurons are used
Biological realism: Mimics how real brain neurons work (most are inactive)
Feature disentanglement: Different features captured by different neurons
Visual Result:
No regularization: Dense, messy activation patterns
With L1: Clean, sparse patterns with few active neurons
What is the Manifold Hypothesis and why is it important for representation learning?
The Manifold Hypothesis states that most naturally occurring high-dimensional data lies on or near a much lower-dimensional manifold (curved surface) embedded in the high-dimensional space.
It matters for representation learning because an encoder then only needs enough latent dimensions to describe positions on that manifold, so a compact latent code can capture the data with little loss of information.
How do autoencoders differ from other encoder-decoder architectures like U-Net and SegNet?
Standard Autoencoder:
Structure: Encoder → Bottleneck → Decoder
Information flow: All information must pass through narrow bottleneck
Goal: Learn compressed representations
Trade-off: Compression vs. reconstruction quality
U-Net:
Structure: Encoder-decoder WITH skip connections
Information flow: Direct connections between encoder and decoder at same resolutions
Goal: Preserve fine details for tasks like image segmentation
Benefit: Combines low-level details with high-level semantics
SegNet:
Structure: Encoder-decoder with pooling indices
Information flow: Stores the encoder’s max-pooling indices and reuses them for unpooling in the decoder, preserving spatial locations
Goal: Efficient upsampling for segmentation tasks
Key Differences:
Bottleneck tightness: Autoencoders force tight compression; U-Net/SegNet preserve more information
Use cases: Autoencoders for representation learning; U-Net/SegNet for pixel-level tasks
Information preservation: Skip connections vs. forced compression
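A toy PyTorch sketch of the skip-connection idea behind U-Net (framework, layer sizes, and channel counts are assumptions): decoder features are concatenated with same-resolution encoder features instead of forcing everything through a tight bottleneck.

```python
# Skip connection: concatenate upsampled decoder features with encoder features.
import torch
import torch.nn as nn

enc1 = nn.Conv2d(1, 16, 3, padding=1)          # encoder block, full resolution
down = nn.MaxPool2d(2)
enc2 = nn.Conv2d(16, 32, 3, padding=1)         # encoder block, half resolution
up = nn.ConvTranspose2d(32, 16, 2, stride=2)   # upsample back to full resolution
dec = nn.Conv2d(32, 1, 3, padding=1)           # decoder sees skip + upsampled features

x = torch.randn(1, 1, 64, 64)
f1 = torch.relu(enc1(x))                       # features kept for the skip connection
f2 = torch.relu(enc2(down(f1)))
u = up(f2)                                     # back to 64x64
out = dec(torch.cat([u, f1], dim=1))           # skip connection: concatenate channels
print(out.shape)                               # torch.Size([1, 1, 64, 64])
```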
What is the difference between localist and distributed representations?
Localist Representation:
One neuron = one concept/feature
Easy to associate and interpret
Example: One-hot encoding (0,0,1,0,0,0,0,0,0,0) for “horse”
Each position represents a specific category
Distributed Representation:
Multiple neurons work together to represent concepts
Binding by simultaneity: many features are active at once to represent a single concept
Many-to-many relationship between neurons and concepts
Example: Word embeddings, where meaning emerges from patterns across many dimensions
Key Advantages of Distributed:
More efficient: Can represent exponentially more concepts
Generalizable: Similar concepts have similar patterns
Robust: Damage to few neurons doesn’t destroy representation
Compositional: Can combine features in novel ways
Real-World Examples:
Localist: Traditional categorical encoding, lookup tables
Distributed: Neural network hidden layers, word embeddings, image features
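A toy NumPy contrast (illustrative numbers only) between a localist one-hot vector and a dense distributed embedding for the same word:

```python
# Localist one-hot vector vs. a dense distributed embedding for "horse".
import numpy as np

vocab = ["cat", "dog", "horse", "car"]
idx = vocab.index("horse")

one_hot = np.zeros(len(vocab))                 # localist: one position = one concept
one_hot[idx] = 1.0                             # -> [0, 0, 1, 0]

embeddings = np.random.default_rng(0).normal(size=(len(vocab), 5))
distributed = embeddings[idx]                  # distributed: pattern across 5 dimensions
print(one_hot, distributed, sep="\n")
```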
What are word embeddings and why are they useful?
Definition:
Word embeddings map words from high-dimensional sparse vectors (like one-hot encoding) to low-dimensional dense vectors that capture semantic meaning.
The Problem They Solve:
Sparse representation: “hotel” and “motel” have completely different one-hot vectors
No semantic relationship: Traditional encoding can’t capture that words are similar
How They Work:
Dense vectors: Each word → continuous vector (e.g., 300 dimensions)
Semantic similarity: Similar words have similar vector representations
Learned relationships: Captures linguistic context and meaning
Famous Example:
vec(Berlin) ≈ vec(Germany) + vec(capital)
vec(queen) ≈ vec(king) - vec(man) + vec(woman)
Key Benefits:
Semantic similarity: Related words cluster together
Compositional: Mathematical operations preserve meaning
Transfer learning: Pre-trained embeddings work across tasks
Efficiency: Dense representation vs. massive vocabulary size
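A hedged sketch of the embedding arithmetic above using gensim’s pretrained vectors (gensim and the 'glove-wiki-gigaword-50' download are assumed to be available; any small pretrained model would do):

```python
# king - man + woman: search near the resulting vector.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")   # pretrained 50-d GloVe embeddings
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3)
print(result)                                  # 'queen' typically ranks near the top
```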
What is Latent Semantic Analysis (LSA) and how does it create word embeddings?
LSA (Latent Semantic Analysis) is a method for creating word embeddings by finding latent semantic relationships in text using matrix factorization.
How It Works:
Create term-document matrix: Count word frequencies across documents (often weighted, e.g. with TF-IDF)
Apply SVD: Use Singular Value Decomposition to find low-rank approximation
Extract embeddings: Use the factorized matrices as word and document representations
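A minimal LSA sketch with scikit-learn (library assumed; the three toy documents are illustrative): build a TF-IDF term-document matrix, then take a truncated SVD to get low-rank document and word representations.

```python
# LSA: TF-IDF matrix + truncated SVD -> low-dimensional semantic space.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the dog chased the cat",
        "a puppy is a young dog",
        "stocks fell on the market today"]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)                  # documents x terms matrix
svd = TruncatedSVD(n_components=2)             # SVD-based low-rank approximation
doc_embeddings = svd.fit_transform(X)          # 2-d representation of each document
word_embeddings = svd.components_.T            # 2-d representation of each term
print(doc_embeddings.shape, word_embeddings.shape)
```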
Key Concepts:
LSI (Latent Semantic Indexing): Same technique, focuses on information retrieval
Context window: For LSA, context can be local neighborhood of words instead of full documents
Cosine similarity: Used to find similar words/documents in the embedding space
Example Results:
When searching for “dog” with different context windows:
h=2: cat, horse, fox, pet, rabbit, pig, animal…
h=30: kennel, puppy, pet, bitch, terrier, rottweiler…
Connection to Modern Methods:
LSA is a precursor to modern word embeddings (Word2Vec, GloVe) - it learns semantic relationships but uses classical matrix factorization instead of neural networks.
What is Word2Vec and what are the two main architectures (CBOW vs Skip-gram)?
Word2Vec is a neural network method for learning word embeddings that represent words as dense vectors in a semantic space where similar words are close together.
Two Main Architectures:
CBOW (Continuous Bag-of-Words):
Input: Context words around a target
Output: Predict the missing target word
Question: “Given context words, what missing word is likely to appear?”
Faster to train, better for frequent words
Skip-gram:
Input: Single target word
Output: Predict surrounding context words
Question: “Given this word, what other words are likely to appear nearby?”
Better for rare words/phrases, works well with small data
Key Innovation:
Shallow neural networks (not deep) with efficient training
Latent code: Low-dimensional dense vectors capture semantic relationships
Context window: Uses local word neighborhoods to learn meaning
Semantic Properties:
Similar words cluster together
Captures relationships: king - man + woman ≈ queen
Enables arithmetic in word space
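A hedged sketch of training both Word2Vec variants with gensim (library assumed; the tiny corpus is illustrative only): sg=0 selects CBOW, sg=1 selects Skip-gram.

```python
# Train CBOW and Skip-gram Word2Vec models on a toy corpus.
from gensim.models import Word2Vec

corpus = [["the", "dog", "chased", "the", "cat"],
          ["a", "puppy", "is", "a", "young", "dog"],
          ["the", "cat", "sat", "on", "the", "mat"]]

cbow = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0)       # CBOW
skipgram = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)   # Skip-gram
print(skipgram.wv.most_similar("dog", topn=3))  # nearest neighbours in the embedding space
```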