Deep Learning 1 Flashcards
- What is ‘Deep Learning,’ and how does it fit within Artificial Intelligence and Machine Learning?
Deep Learning is a subfield of Machine Learning that uses neural networks with multiple layers to learn feature representations directly from data. In the broader scope: AI > ML > Deep Learning. AI mimics human behavior, ML learns patterns from data, and Deep Learning specifically relies on multilayer (deep) neural networks.
- Why has Deep Learning risen to prominence now, even though neural networks date back decades?
Three main factors: (1) Big Data availability (large datasets are easier to collect and store), (2) Hardware advances (especially GPUs, which accelerate parallelizable matrix operations), and (3) Improved algorithms and software (new techniques, better models, robust libraries). Together, these developments have revived neural networks and driven their recent success.
- What challenge does hand-engineering features present, and how does Deep Learning address it?
Manually crafting features is time-consuming and not easily scalable to new tasks or data. Deep Learning networks automatically learn underlying features directly from raw inputs, avoiding brittle hand-engineered representations.
- What is the Perceptron, and why is it often called the structural building block of deep learning?
The Perceptron is a simple unit that computes a weighted sum of inputs and applies a nonlinear activation function. It is the foundational element of neural networks, as multiple Perceptron-like units (neurons) can be stacked to form deep architectures.
- Describe the forward propagation steps of a single Perceptron (with bias included).
1) Multiply each input xᵢ by its corresponding weight wᵢ. 2) Sum those products plus a bias term b. 3) Apply a nonlinear activation function to the sum. The result is the Perceptron’s output.
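A minimal NumPy sketch of these three steps (the weights, bias, and choice of sigmoid here are illustrative assumptions):

```python
import numpy as np

def perceptron_forward(x, w, b):
    """Single-perceptron forward pass: weighted sum plus bias, then nonlinearity."""
    z = np.dot(w, x) + b             # steps 1-2: weighted sum of inputs plus bias
    return 1.0 / (1.0 + np.exp(-z))  # step 3: sigmoid activation

# Example with two inputs and illustrative parameters
x = np.array([1.0, 0.5])
w = np.array([3.0, -2.0])
print(perceptron_forward(x, w, b=1.0))  # sigmoid(3.0) ≈ 0.95
```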
- Why are nonlinear activation functions necessary in neural networks?
Without nonlinearity, stacked layers would collapse into an equivalent single linear transformation. Nonlinear activations enable the network to approximate arbitrarily complex functions, giving the model real expressive power.
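A quick NumPy check of the collapse claim: two stacked linear (activation-free) layers are exactly one linear layer whose weight matrix is the product of the two (the random matrices here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # first linear layer
W2 = rng.normal(size=(2, 4))   # second linear layer
x = rng.normal(size=3)

two_layers = W2 @ (W1 @ x)     # stack of two linear layers, no activation
one_layer = (W2 @ W1) @ x      # single equivalent linear layer
assert np.allclose(two_layers, one_layer)
```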
- Give three common activation functions and a short note on each.
(1) Sigmoid: σ(z) = 1/(1+e^(-z)), saturates for large |z|, used historically.
(2) Tanh: tanh(z), zero-centered, saturates similarly to sigmoid.
(3) ReLU: max(0, z), avoids saturation for positive z, widely used for faster training.
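All three are one-liners in NumPy:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))  # squashes to (0, 1); saturates for large |z|

def tanh(z):
    return np.tanh(z)                # squashes to (-1, 1); zero-centered

def relu(z):
    return np.maximum(0.0, z)        # identity for z > 0, zero otherwise
```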
- In a 2D Perceptron example, how is the decision boundary represented by the equation 1 + 3x₁ - 2x₂ = 0?
This equation describes a line in the 2D input plane. An input (x₁, x₂) with 1 + 3x₁ - 2x₂ = 0 lies exactly on the boundary. If the expression is > 0, a sigmoid activation maps it to an output > 0.5, classifying the point one way; if it is < 0, the output is < 0.5, classifying it the other way.
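A sketch of this decision rule in code (the sample points are illustrative):

```python
import numpy as np

def classify(x1, x2):
    """Sigmoid perceptron with weights (3, -2) and bias 1."""
    z = 1 + 3 * x1 - 2 * x2
    prob = 1.0 / (1.0 + np.exp(-z))  # sigmoid output
    return 1 if prob > 0.5 else 0    # equivalent to testing z > 0

print(classify(1.0, 1.0))   # z = 2 > 0  -> class 1
print(classify(-1.0, 1.0))  # z = -4 < 0 -> class 0
```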
- How do we build a Multi-Output Perceptron, and why is it useful?
We have multiple output neurons in parallel, each taking the same input but with its own set of weights and biases. This can produce multiple values simultaneously (e.g., in multi-class tasks or multi-dimensional outputs).
- What is the key difference between a Single-Layer Neural Network and a Deep Neural Network?
A single-layer network has just one layer of perceptrons (the output layer). A deep network has multiple hidden layers of perceptrons stacked before the output layer, allowing the model to learn more abstract representations.
- Provide an example of a small feed-forward deep network in pseudocode using a deep learning framework.
For instance, using Keras (imports added and sizes made concrete so the sketch runs; m and d1 are placeholder dimensions):

```python
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

m, d1 = 10, 32  # placeholder input dimension and hidden-layer width

inputs = Input(shape=(m,))                        # m-dimensional input
hidden = Dense(d1, activation='relu')(inputs)     # hidden layer with ReLU
outputs = Dense(2, activation='softmax')(hidden)  # 2-class probability output
model = Model(inputs, outputs)
```
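Before training, the model is typically compiled with an optimizer and a loss (a sketch; `x_train` and `y_train` are assumed placeholders for your own data):

```python
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(x_train, y_train, epochs=10)  # x_train, y_train: placeholder data
```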
- When building neural networks, how do we typically handle the final output layer for classification vs. regression?
For classification (binary or multi-class), we typically use an output layer with a sigmoid or softmax activation to produce probabilities. For regression, we typically use a linear output (no activation) so the network can predict arbitrary continuous values directly.
- In the class attendance example, how might a neural network decide whether someone will pass or fail?
The network takes two inputs: (1) number of lectures attended, and (2) hours spent on final project. It then processes these inputs through hidden layers and outputs a probability of ‘pass’ vs. ‘fail.’ A threshold (e.g., 0.5) could determine the final decision.
- What is the purpose of a loss function in a neural network, and how does it relate to predictions vs. ground truth labels?
The loss function quantifies the difference (or ‘cost’) between the model’s predictions and the actual labels in the dataset. Minimizing this loss guides the network’s weights to improve future predictions.
- Explain the idea of Empirical Loss and why it is also called the empirical risk or cost function.
Empirical loss sums (or averages) the individual losses over all training examples. It’s called ‘empirical’ because it’s an estimate of the true expected loss, calculated only on the finite training samples we have.
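Written out (N training examples, model f with weights W; the notation matches the optimization card below):

```latex
J(W) = \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\bigl( f_W(x_i),\, y_i \bigr)
```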
- When do we typically use Binary Cross Entropy Loss vs. Mean Squared Error (MSE)?
Binary Cross Entropy is used when the model outputs probabilities for binary classification (values between 0 and 1). MSE is common for regression tasks where outputs are real-valued. Cross Entropy better suits probability-based classification, while MSE suits continuous predictions.
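Both losses take only a few lines of NumPy (clipping by a small epsilon to avoid log(0) is an implementation assumption):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean BCE over a batch; y_pred holds probabilities in (0, 1)."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # guard against log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def mean_squared_error(y_true, y_pred):
    """Mean squared difference; suited to real-valued regression targets."""
    return np.mean((y_true - y_pred) ** 2)
```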
- What does training a neural network amount to, mathematically speaking, in terms of loss optimization?
It’s an optimization problem: we seek W* = argmin_W (1/N) Σᵢ Loss(yᵢ, f_W(xᵢ)). In other words, we find the weights that minimize the total (or average) loss on the training data.
- Describe the general steps of Gradient Descent. What does the learning rate control?
Steps: (1) Initialize weights randomly, (2) Calculate gradient of loss wrt weights, (3) Update weights by moving a small step in opposite direction of the gradient, (4) Repeat until convergence. The learning rate determines how large each update step is; too large can diverge, too small can slow convergence.
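The four steps in a generic loop (the quadratic example loss and the learning rate are illustrative placeholders):

```python
import numpy as np

def gradient_descent(grad_fn, w0, learning_rate=0.1, n_steps=100):
    """Minimize a loss given grad_fn(w), the gradient of the loss at w."""
    w = w0.copy()                  # step 1: start from initial weights
    for _ in range(n_steps):       # step 4: repeat until convergence
        g = grad_fn(w)             # step 2: gradient of loss wrt weights
        w -= learning_rate * g     # step 3: small step against the gradient
    return w

# Example: minimize f(w) = ||w||^2, whose gradient is 2w; converges toward 0
w_star = gradient_descent(lambda w: 2 * w, np.array([3.0, -4.0]))
```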
- Why is computing the gradient of the loss with respect to every weight often called backpropagation in deep neural networks?
Because the gradient for an earlier-layer weight depends on partial derivatives computed in all later layers, we propagate error signals (and thus gradient information) backward from the final output to earlier layers, applying the chain rule repeatedly.
- Summarize the simple numeric chain rule example used to illustrate backpropagation in the slides (x=-2, y=5, z=-4).
In that example, we have a small computational graph: we compute the forward pass on some function of x, y, z to get the output, then apply the chain rule to get the partial derivatives with respect to x, y, and z. The demonstration shows how local derivatives multiply along each path through the graph to yield the overall gradient for each variable.
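As a concrete illustration (assuming the commonly used function f(x, y, z) = (x + y)·z for these values; the slides may use a different f):

```latex
% Forward pass, with intermediate node q
q = x + y = -2 + 5 = 3, \qquad f = q \cdot z = 3 \cdot (-4) = -12
% Backward pass via the chain rule
\frac{\partial f}{\partial z} = q = 3, \qquad
\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \cdot \frac{\partial q}{\partial x} = z \cdot 1 = -4, \qquad
\frac{\partial f}{\partial y} = z \cdot 1 = -4
```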
- What makes neural network loss landscapes notoriously hard to optimize using gradient descent?
These landscapes can have many local minima, saddle points, and flat regions. High-dimensional parameter spaces produce complicated surfaces, requiring careful tuning of hyperparameters like the learning rate to escape poor local minima or plateaus.
- Why do we often need to experiment with different learning rates, or use an adaptive algorithm, when training deep networks?
Because a fixed rate can be too small (leading to slow convergence) or too large (causing divergence). Adaptive methods (e.g., Adam, RMSProp) adjust the step size based on gradients and past updates, often improving speed and stability of learning.
- List some popular adaptive learning rate algorithms and a short reason we might use them.
(1) Momentum: adds velocity for smoother progress.
(2) Adagrad: adapts learning rates based on how frequently each parameter is updated.
(3) RMSProp: scales learning rates by a moving average of squared gradients.
(4) Adam: combines momentum and RMSProp ideas.
These can handle complex, noisy, or sparse gradients better than plain SGD.
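In Keras, each of these is available off the shelf (the learning rates shown are common defaults, not tuned recommendations):

```python
from tensorflow.keras import optimizers

sgd_momentum = optimizers.SGD(learning_rate=0.01, momentum=0.9)  # SGD + momentum
adagrad = optimizers.Adagrad(learning_rate=0.01)
rmsprop = optimizers.RMSprop(learning_rate=0.001)
adam = optimizers.Adam(learning_rate=0.001)
```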
- How does Stochastic Gradient Descent (SGD) differ from full-batch Gradient Descent, and why is it often used in practice?
SGD updates weights using the gradient from a single (or small batch of) training instance(s), rather than the entire dataset. This speeds up computation (useful for large datasets) and introduces some helpful noise that can avoid certain local minima, often improving generalization.
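A mini-batch SGD sketch, in contrast to the full-batch loop above (the data array and gradient function are placeholders):

```python
import numpy as np

def sgd(grad_fn, w0, data, learning_rate=0.01, epochs=10, batch_size=32):
    """Update weights from one shuffled mini-batch at a time."""
    w = w0.copy()
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        rng.shuffle(data)                           # fresh sample order each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]  # small slice, not the full dataset
            w -= learning_rate * grad_fn(w, batch)  # noisy gradient estimate
    return w
```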