Deep Learning 1 Flashcards
- What is ‘Deep Learning,’ and how does it fit within Artificial Intelligence and Machine Learning?
Deep Learning is a subfield of Machine Learning that uses neural networks with multiple layers to learn feature representations directly from data. In the broader scope: AI > ML > Deep Learning. AI mimics human behavior, ML learns patterns from data, and Deep Learning specifically relies on multilayer (deep) neural networks.
- Why has Deep Learning risen to prominence now, even though neural networks date back decades?
Three main factors: (1) Big Data availability (large datasets are easier to collect and store), (2) Hardware advances (especially GPUs, which accelerate parallelizable matrix operations), and (3) Improved algorithms and software (new techniques, better models, robust libraries). Together, these developments have revived neural networks and driven their recent success.
- What challenge does hand-engineering features present, and how does Deep Learning address it?
Manually crafting features is time-consuming and not easily scalable to new tasks or data. Deep Learning networks automatically learn underlying features directly from raw inputs, avoiding brittle hand-engineered representations.
- What is the Perceptron, and why is it often called the structural building block of deep learning?
The Perceptron is a simple unit that computes a weighted sum of inputs and applies a nonlinear activation function. It is the foundational element of neural networks, as multiple Perceptron-like units (neurons) can be stacked to form deep architectures.
- Describe the forward propagation steps of a single Perceptron (with bias included).
1) Multiply each input xᵢ by its corresponding weight wᵢ. 2) Sum those products plus a bias term b. 3) Apply a nonlinear activation function to the sum. The result is the Perceptron’s output.
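A minimal NumPy sketch of these three steps (the weights, bias, and choice of sigmoid here are illustrative assumptions):

```python
import numpy as np

def perceptron_forward(x, w, b):
    """Single-perceptron forward pass: weighted sum plus bias, then nonlinearity."""
    z = np.dot(w, x) + b             # steps 1-2: weighted sum of inputs plus bias
    return 1.0 / (1.0 + np.exp(-z))  # step 3: sigmoid activation

# Example with two inputs and illustrative parameters
x = np.array([1.0, 0.5])
w = np.array([3.0, -2.0])
print(perceptron_forward(x, w, b=1.0))  # sigmoid(3.0) ≈ 0.95
```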
- Why are nonlinear activation functions necessary in neural networks?
Without nonlinearity, stacked layers would collapse into an equivalent single linear transformation. Nonlinear activations enable the network to approximate arbitrarily complex functions, giving the model real expressive power.
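A quick NumPy check of the collapse claim: two stacked linear (activation-free) layers are exactly one linear layer whose weight matrix is the product of the two (the random matrices here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # first linear layer
W2 = rng.normal(size=(2, 4))   # second linear layer
x = rng.normal(size=3)

two_layers = W2 @ (W1 @ x)     # stack of two linear layers, no activation
one_layer = (W2 @ W1) @ x      # single equivalent linear layer
assert np.allclose(two_layers, one_layer)
```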
- Give three common activation functions and a short note on each.
(1) Sigmoid: σ(z) = 1/(1+e^(-z)), saturates for large |z|, used historically.
(2) Tanh: tanh(z), zero-centered, saturates similarly to sigmoid.
(3) ReLU: max(0, z), avoids saturation for positive z, widely used for faster training.
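All three are one-liners in NumPy:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))  # squashes to (0, 1); saturates for large |z|

def tanh(z):
    return np.tanh(z)                # squashes to (-1, 1); zero-centered

def relu(z):
    return np.maximum(0.0, z)        # identity for z > 0, zero otherwise
```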
- In a 2D Perceptron example, how is the decision boundary represented by the equation 1 + 3x₁ - 2x₂ = 0?
This equation describes a line in the 2D input plane. An input (x₁, x₂) with 1 + 3x₁ - 2x₂ = 0 lies exactly on the boundary. If the expression is > 0, a sigmoid activation maps it to an output > 0.5, classifying the point one way; if it is < 0, the output is < 0.5, classifying it the other way.
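A sketch of this decision rule in code (the sample points are illustrative):

```python
import numpy as np

def classify(x1, x2):
    """Sigmoid perceptron with weights (3, -2) and bias 1."""
    z = 1 + 3 * x1 - 2 * x2
    prob = 1.0 / (1.0 + np.exp(-z))  # sigmoid output
    return 1 if prob > 0.5 else 0    # equivalent to testing z > 0

print(classify(1.0, 1.0))   # z = 2 > 0  -> class 1
print(classify(-1.0, 1.0))  # z = -4 < 0 -> class 0
```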
- How do we build a Multi-Output Perceptron, and why is it useful?
We have multiple output neurons in parallel, each taking the same input but with its own set of weights and biases. This can produce multiple values simultaneously (e.g., in multi-class tasks or multi-dimensional outputs).
- What is the key difference between a Single-Layer Neural Network and a Deep Neural Network?
A single-layer network has just one layer of perceptrons (the output layer). A deep network has multiple hidden layers of perceptrons stacked before the output layer, allowing the model to learn more abstract representations.
- Provide an example of a small feed-forward deep network in pseudocode using a deep learning framework.
For instance, using Keras (imports added and sizes made concrete so the sketch runs; m and d1 are placeholder dimensions):

```python
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

m, d1 = 10, 32  # placeholder input dimension and hidden-layer width

inputs = Input(shape=(m,))                        # m-dimensional input
hidden = Dense(d1, activation='relu')(inputs)     # hidden layer with ReLU
outputs = Dense(2, activation='softmax')(hidden)  # 2-class probability output
model = Model(inputs, outputs)
```
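Before training, the model is typically compiled with an optimizer and a loss (a sketch; `x_train` and `y_train` are assumed placeholders for your own data):

```python
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(x_train, y_train, epochs=10)  # x_train, y_train: placeholder data
```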
- When building neural networks, how do we typically handle the final output layer for classification vs. regression?
For classification (binary or multi-class), we typically use an output layer with a sigmoid or softmax activation to produce probabilities. For regression, we typically use a linear output (no activation) so the network can predict arbitrary continuous values directly.
- In the class attendance example, how might a neural network decide whether someone will pass or fail?
The network takes two inputs: (1) number of lectures attended, and (2) hours spent on final project. It then processes these inputs through hidden layers and outputs a probability of ‘pass’ vs. ‘fail.’ A threshold (e.g., 0.5) could determine the final decision.
- What is the purpose of a loss function in a neural network, and how does it relate to predictions vs. ground truth labels?
The loss function quantifies the difference (or ‘cost’) between the model’s predictions and the actual labels in the dataset. Minimizing this loss guides the network’s weights to improve future predictions.
- Explain the idea of Empirical Loss and why it is also called the empirical risk or cost function.
Empirical loss sums (or averages) the individual losses over all training examples. It’s called ‘empirical’ because it’s an estimate of the true expected loss, calculated only on the finite training samples we have.
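Written out (N training examples, model f with weights W; the notation matches the optimization card below):

```latex
J(W) = \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\bigl( f_W(x_i),\, y_i \bigr)
```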
- When do we typically use Binary Cross Entropy Loss vs. Mean Squared Error (MSE)?
Binary Cross Entropy is used when the model outputs probabilities for binary classification (values between 0 and 1). MSE is common for regression tasks where outputs are real-valued. Cross Entropy better suits probability-based classification, while MSE suits continuous predictions.
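Both losses take only a few lines of NumPy (clipping by a small epsilon to avoid log(0) is an implementation assumption):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean BCE over a batch; y_pred holds probabilities in (0, 1)."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # guard against log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def mean_squared_error(y_true, y_pred):
    """Mean squared difference; suited to real-valued regression targets."""
    return np.mean((y_true - y_pred) ** 2)
```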
- What does training a neural network amount to, mathematically speaking, in terms of loss optimization?
It’s an optimization problem: we seek W* = argmin_W (1/N) Σᵢ Loss(yᵢ, f_W(xᵢ)). In other words, we find the weights that minimize the total (or average) loss on the training data.
- Describe the general steps of Gradient Descent. What does the learning rate control?
Steps: (1) Initialize weights randomly, (2) Calculate gradient of loss wrt weights, (3) Update weights by moving a small step in opposite direction of the gradient, (4) Repeat until convergence. The learning rate determines how large each update step is; too large can diverge, too small can slow convergence.
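The four steps in a generic loop (the quadratic example loss and the learning rate are illustrative placeholders):

```python
import numpy as np

def gradient_descent(grad_fn, w0, learning_rate=0.1, n_steps=100):
    """Minimize a loss given grad_fn(w), the gradient of the loss at w."""
    w = w0.copy()                  # step 1: start from initial weights
    for _ in range(n_steps):       # step 4: repeat until convergence
        g = grad_fn(w)             # step 2: gradient of loss wrt weights
        w -= learning_rate * g     # step 3: small step against the gradient
    return w

# Example: minimize f(w) = ||w||^2, whose gradient is 2w; converges toward 0
w_star = gradient_descent(lambda w: 2 * w, np.array([3.0, -4.0]))
```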
- Why is computing the gradient of the loss with respect to every weight often called backpropagation in deep neural networks?
Because the gradient for an earlier-layer weight depends on partial derivatives computed in all later layers, we propagate error signals (and thus gradient information) backward from the final output to earlier layers, applying the chain rule repeatedly.
- Summarize the simple numeric chain rule example used to illustrate backpropagation in the slides (x=-2, y=5, z=-4).
In that example, we have a small computational graph: we compute the forward pass on some function of x, y, z to get the output, then apply the chain rule to get the partial derivatives with respect to x, y, and z. The demonstration shows how local derivatives multiply along each path through the graph to yield the overall gradient for each variable.
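As a concrete illustration (assuming the commonly used function f(x, y, z) = (x + y)·z for these values; the slides may use a different f):

```latex
% Forward pass, with intermediate node q
q = x + y = -2 + 5 = 3, \qquad f = q \cdot z = 3 \cdot (-4) = -12
% Backward pass via the chain rule
\frac{\partial f}{\partial z} = q = 3, \qquad
\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \cdot \frac{\partial q}{\partial x} = z \cdot 1 = -4, \qquad
\frac{\partial f}{\partial y} = z \cdot 1 = -4
```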
- What makes neural network loss landscapes notoriously hard to optimize using gradient descent?
These landscapes can have many local minima, saddle points, and flat regions. High-dimensional parameter spaces produce complicated surfaces, requiring careful tuning of hyperparameters like the learning rate to escape poor local minima or plateaus.
- Why do we often need to experiment with different learning rates, or use an adaptive algorithm, when training deep networks?
Because a fixed rate can be too small (leading to slow convergence) or too large (causing divergence). Adaptive methods (e.g., Adam, RMSProp) adjust the step size based on gradients and past updates, often improving speed and stability of learning.
- List some popular adaptive learning rate algorithms and a short reason we might use them.
(1) Momentum: adds velocity for smoother progress.
(2) Adagrad: adapts learning rates based on how frequently each parameter is updated.
(3) RMSProp: scales learning rates by a moving average of squared gradients.
(4) Adam: combines momentum and RMSProp ideas.
These can handle complex, noisy, or sparse gradients better than plain SGD.
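In Keras, each of these is available off the shelf (the learning rates shown are common defaults, not tuned recommendations):

```python
from tensorflow.keras import optimizers

sgd_momentum = optimizers.SGD(learning_rate=0.01, momentum=0.9)  # SGD + momentum
adagrad = optimizers.Adagrad(learning_rate=0.01)
rmsprop = optimizers.RMSprop(learning_rate=0.001)
adam = optimizers.Adam(learning_rate=0.001)
```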
- How does Stochastic Gradient Descent (SGD) differ from full-batch Gradient Descent, and why is it often used in practice?
SGD updates weights using the gradient from a single (or small batch of) training instance(s), rather than the entire dataset. This speeds up computation (useful for large datasets) and introduces some helpful noise that can avoid certain local minima, often improving generalization.
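A mini-batch SGD sketch, in contrast to the full-batch loop above (the data array and gradient function are placeholders):

```python
import numpy as np

def sgd(grad_fn, w0, data, learning_rate=0.01, epochs=10, batch_size=32):
    """Update weights from one shuffled mini-batch at a time."""
    w = w0.copy()
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        rng.shuffle(data)                           # fresh sample order each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]  # small slice, not the full dataset
            w -= learning_rate * grad_fn(w, batch)  # noisy gradient estimate
    return w
```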