hoai_exam2024_retry02 Flashcards

(40 cards)

1
Q
A
2
Q
A

Correct answers: b and c.

b. … is, e.g., a polynomial of degree 1.
Linear regression models describe relationships as linear equations, which are polynomials of degree 1 (e.g., y = w₀ + w₁x).

c. … has a closed-form solution.
Linear regression can be solved in closed form using the Normal Equation, w = (XᵀX)⁻¹Xᵀy, when the design matrix X has full rank.
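
As a quick illustration (a minimal NumPy sketch with made-up data, not part of the original answer), the Normal Equation can be evaluated directly:

```python
import numpy as np

# Synthetic data following y = 2x + 1 plus noise (purely illustrative)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2 * x + 1 + rng.normal(0, 0.5, size=50)

# Design matrix X with a bias column; solve (X^T X) w = X^T y
X = np.column_stack([np.ones_like(x), x])
w = np.linalg.solve(X.T @ X, X.T @ y)  # numerically safer than an explicit inverse
print(w)  # approximately [1, 2]
```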

3
Q
A
4
Q
A
5
Q
A
6
Q
A
7
Q
A
8
Q
A
9
Q
A
10
Q
A
11
Q
A

Correct answer: c.

c. … may increase the validation performance.
Dropout is a regularization technique that helps prevent overfitting by randomly dropping out neurons during training. This can lead to improved generalization and better validation performance.

Explanation of other options:
* a. … is only used during validation time.
Incorrect: Dropout is applied only during training, not during validation or testing. During validation/testing, all neurons are used, and activations are rescaled so that their expected magnitude matches training (with inverted dropout, as in PyTorch, this rescaling is done during training instead).
* b. … is always used during training as well as validation time.
Incorrect: Dropout is applied only during training, not during validation/testing.
* d. … increases the validation loss in order to decrease the training loss.
Incorrect: This is backwards. By injecting noise during training, dropout tends to increase the training loss; in exchange, the model often becomes more robust and achieves a lower validation loss.
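
A minimal PyTorch sketch (illustrative, not part of the original answer) showing that dropout is active only in training mode:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()    # training mode: roughly half the activations are zeroed
print(drop(x))  # survivors are scaled by 1/(1-p) = 2 (inverted dropout)

drop.eval()     # evaluation/validation mode: dropout is a no-op
print(drop(x))  # identical to the input
```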

12
Q
A
13
Q
A
14
Q
A
15
Q
A

Correct answers: a and b.

a. … dynamically control the learning rate.
Learning rate schedules adjust the learning rate during training based on a predefined strategy, such as reducing it after certain epochs or when validation performance plateaus.

b. … might improve the speed at which the network learns.
By reducing the learning rate at the right time, learning rate schedules can help the network converge faster and more effectively to a good solution.

Explanation of other options:
* c. … guarantee to find the global minimum of the loss.
Incorrect: Learning rate schedules improve optimization but do not guarantee finding the global minimum, especially in non-convex loss surfaces like those in neural networks.
* d. … cannot be applied in fully-connected neural networks.
Incorrect: Learning rate schedules can be applied to any type of neural network, including fully connected networks.
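
As an illustrative sketch (using PyTorch's StepLR; all values are made up), a learning rate schedule can be attached to any optimizer like this:

```python
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# Halve the learning rate every 10 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(30):
    # ... one training epoch (forward, loss, backward) would run here ...
    optimizer.step()
    scheduler.step()  # advance the schedule once per epoch
    if epoch % 10 == 9:
        print(epoch + 1, scheduler.get_last_lr())  # 0.05, 0.025, 0.0125
```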

16
Q
A

Correct answers: b and d.

b. … allow creating deeper neural networks while maintaining trainability.
Residual connections help address the vanishing gradient problem by allowing gradients to flow more easily through the network, making it feasible to train much deeper networks.

d. … create shortcuts for gradients.
Residual connections provide shortcut paths that bypass one or more layers, enabling gradients to flow directly back during backpropagation, thus improving gradient propagation.

Explanation of other options:
* a. … can only be used in convolutional neural networks.
Incorrect: Residual connections can be used in various types of neural networks, not just convolutional ones. They are a general concept applicable wherever a deep architecture is used.
* c. … reduce the spatial size of the input.
Incorrect: Residual connections do not alter the spatial size of the input. Instead, they add the input back to the output of a layer (or block).
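
A minimal residual block in PyTorch (an illustrative sketch; channel counts are arbitrary), showing the gradient shortcut and that the spatial size is unchanged:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: output = F(x) + x (shapes must match)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + x)  # shortcut: gradients flow directly through "+ x"

x = torch.randn(1, 16, 32, 32)
print(ResidualBlock(16)(x).shape)  # torch.Size([1, 16, 32, 32]) -- spatial size unchanged
```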

18
Q
A

Correct answers: b, c, and d.

b. Automatic differentiation.
Frameworks like PyTorch provide automatic differentiation, which simplifies calculating gradients for backpropagation during neural network training.

c. Easy switching of computations between CPU and GPU.
PyTorch makes it simple to move tensors and computations between CPU and GPU by using .to(device) or .cuda() methods.

d. Straightforward construction of neural networks.
PyTorch’s dynamic computation graph and torch.nn module allow for an intuitive and flexible way to build and experiment with neural networks.

Explanation of the incorrect option:
* a. Developed without any influence by industry.
Incorrect: PyTorch was developed by Facebook’s AI Research (FAIR) team and has strong industry influence. This influence has contributed to its wide adoption and practical design.
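
A short sketch touching each of the three correct points (tensor shapes are arbitrary):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"  # c. easy CPU/GPU switching

# b. automatic differentiation
x = torch.tensor(2.0, requires_grad=True)
y = x ** 2
y.backward()
print(x.grad)  # dy/dx = 2x = tensor(4.)

# d. straightforward network construction with torch.nn
net = torch.nn.Sequential(
    torch.nn.Linear(3, 8), torch.nn.ReLU(), torch.nn.Linear(8, 1)
).to(device)
print(net(torch.randn(5, 3, device=device)).shape)  # torch.Size([5, 1])
```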

19
Q
A

Correct answers: a and c.

a. A stride of 2 affects the amount of performed convolution calculations.
A stride of 2 means the kernel is applied every two steps, so fewer convolutions are performed compared to a stride of 1. This results in a reduction of the output size and fewer calculations.

c. The kernel is moved across the input by a step size of 2.
A stride of 2 means that the convolutional kernel moves across the input feature map with a step size of 2 in both the horizontal and vertical directions.

Explanation of the incorrect options:
* b. A stride of 2 is the same as max pooling with a size of 2.
Incorrect: While both operations reduce the spatial dimensions of the input, they are not the same. Max pooling selects the maximum value in each region, while a convolution with a stride of 2 performs a convolution operation, which is different from pooling.
* d. A stride of 2 is the maximum possible value.
Incorrect: The stride can be greater than 2, though typical values are 1 or 2 in most architectures. Larger strides (e.g., 3, 4) are also used in some cases.
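
A small PyTorch check (shapes chosen for illustration) contrasting a stride-2 convolution with 2x2 max pooling:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 8, 8)
conv_s2 = nn.Conv2d(1, 1, kernel_size=3, stride=2, padding=1)
pool = nn.MaxPool2d(kernel_size=2)

print(conv_s2(x).shape)  # torch.Size([1, 1, 4, 4]) -- kernel applied every 2 steps
print(pool(x).shape)     # torch.Size([1, 1, 4, 4]) -- same output size, different operation
```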

25
26
Correct answer: b.

b. In classification, the target values are class labels.
In classification, the goal is to assign input data to one of several predefined categories or classes, and the target values represent those class labels.

Explanation of the incorrect options:
* a. In classification, the target values can be numbers.
Incorrect: While the target values can be numeric in some cases (e.g., for ordinal classification or in certain multi-class setups), classification deals with categorical values or class labels, not continuous numbers.
* c. In classification, the target values are used to predict the corresponding input values.
Incorrect: In classification, the input values (features) are used to predict the target class labels. The target values are the output, not what is predicted from the input.
* d. In classification, the target values must be between 0 and 1.
Incorrect: This is not a requirement for classification. The target values can be any set of discrete class labels (e.g., 0, 1, 2 for a 3-class problem), not necessarily limited to values between 0 and 1.
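
For illustration (a minimal PyTorch sketch; the logits and labels are made up), classification targets are discrete class labels rather than values in [0, 1]:

```python
import torch
import torch.nn as nn

logits = torch.randn(3, 4)         # model outputs for 3 samples and 4 classes
targets = torch.tensor([0, 3, 1])  # targets are class labels (here 0..3)
print(nn.CrossEntropyLoss()(logits, targets))  # a single scalar loss
```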
27
Correct answers: c and d.

c. The softmax, instead of the logistic function, can be used if more than 2 different target classes exist.
When dealing with multi-class classification (more than two classes), the softmax function is used to convert the outputs into a probability distribution across multiple classes. For binary classification, the logistic (sigmoid) function is typically used.

d. Logistic regression is a regression model because it estimates the probability of class membership.
Despite its name, logistic regression is used for classification, yet it counts as a regression model because it predicts continuous probabilities (between 0 and 1) rather than discrete class labels. It estimates the probability of class membership in a binary classification setting.

Explanation of the incorrect options:
* a. Logistic regression always applies a softmax function on top of a linear regression model.
Incorrect: For binary classification, logistic regression uses the logistic (sigmoid) function, not softmax. Softmax is used in multi-class problems, as mentioned in option c.
* b. Linear regression is a regression model because it estimates the probability of class membership.
Incorrect: Linear regression predicts continuous numerical values, not probabilities of class membership. That makes it a regression model, but not one that estimates class membership probabilities; that is the role of logistic regression.
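
A small sketch (illustrative logits) contrasting the logistic (sigmoid) and softmax functions in PyTorch:

```python
import torch

logits = torch.tensor([2.0, -1.0, 0.5])

# Binary case: sigmoid maps one logit to a probability in (0, 1)
print(torch.sigmoid(logits[0]))  # ~0.88

# Multi-class case: softmax turns all logits into a distribution summing to 1
probs = torch.softmax(logits, dim=0)
print(probs, probs.sum())        # sums to 1.0
```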
28
29
Correct answers: a, b, c, and d. All statements are true.

a. CNNs take advantage of the “local structure” in image data (neighboring pixels are often highly correlated).
CNNs are designed to exploit the local structure of images, where neighboring pixels are often highly correlated. Convolutional layers use local receptive fields to capture these local dependencies.

b. In a CNN, a hidden neuron is only connected to a few neurons in the previous layer.
Each neuron in a hidden layer is only connected to a small local region of the previous layer, rather than to all neurons in that layer (as in fully connected layers). This key feature allows CNNs to efficiently learn spatial hierarchies in image data.

c. Weight sharing is an essential part in CNNs.
Weight sharing refers to using the same set of weights for different regions of the input. This reduces the number of parameters and helps the network generalize better.

d. CNNs can be configured to reduce the computational load as well as memory requirements.
CNNs are designed to be computationally efficient. Techniques like pooling, weight sharing, and small receptive fields reduce the computational load and memory requirements compared to fully connected networks.
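
To make the local-connectivity and weight-sharing points concrete, an illustrative parameter count (sizes are arbitrary):

```python
import torch.nn as nn

# Fully connected: every 32x32 input pixel connects to every output unit
fc = nn.Linear(32 * 32, 32 * 32)
# Convolution: one 3x3 kernel shared across all spatial positions
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(fc))    # 1049600 parameters
print(count(conv))  # 10 parameters (9 shared weights + 1 bias)
```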
30
Correct answer: a.

a. ERM is based on minimizing the empirical risk on a fixed dataset.
Empirical Risk Minimization (ERM) is the process of minimizing the average loss (the empirical risk) over a given training dataset. The goal is to find a model that minimizes the error on this dataset.

Explanation of the incorrect options:
* b. ERM is a method of estimating the generalization error/risk.
Incorrect: ERM minimizes the empirical risk on the training dataset; it does not directly estimate the generalization error (the error on unseen data). The generalization error is estimated using techniques like cross-validation or by evaluating the model on a separate test set.
* c. ERM is typically performed on a dedicated test set.
Incorrect: ERM is performed on the training set, not the test set. The test set is used for model evaluation after training, not for training itself.
* d. ERM is a form of gradient descent.
Incorrect: While ERM involves optimization to minimize the empirical risk, it is not inherently tied to gradient descent. ERM can be solved with various optimization techniques; gradient descent is just one of them.
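
A minimal sketch of ERM on toy data (a one-parameter model minimized over a small grid, purely illustrative):

```python
import numpy as np

# Empirical risk: the average loss over a fixed training set
def empirical_risk(model, X, y, loss):
    return np.mean([loss(model(x_i), y_i) for x_i, y_i in zip(X, y)])

squared_loss = lambda pred, target: (pred - target) ** 2
X = np.array([0.0, 1.0, 2.0])
y = np.array([0.1, 0.9, 2.2])

# ERM: pick the model (here just a slope w) minimizing the empirical risk
ws = np.linspace(0, 2, 201)
best_w = min(ws, key=lambda w: empirical_risk(lambda x: w * x, X, y, squared_loss))
print(best_w)  # ~1.06, the least-squares slope on this dataset
```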
31
32
Correct answers: a and c.

a. Loss functions can have an impact on the training process.
The choice of loss function affects how the model is optimized during training, influencing convergence, performance, and the final trained model.

c. Loss functions are used to measure the difference between a model prediction and the true target.
A loss function quantifies how well the model’s predictions align with the true target values. It computes the “error” between the predicted and actual values, guiding the optimization process.

Explanation of the incorrect options:
* b. The output of loss functions is in the range [0, 1].
Incorrect: The output of a loss function is not necessarily in the range [0, 1]; it can take any value depending on the type of loss function. For example, the mean squared error loss can produce values much larger than 1, as shown below.
* d. Loss functions are used to obtain the final model prediction.
Incorrect: Loss functions are used during training to guide the optimization of the model’s parameters. The final prediction is obtained by passing inputs through the trained model, not by using the loss function.
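
For example (made-up values), the mean squared error easily exceeds 1:

```python
import torch
import torch.nn as nn

pred = torch.tensor([0.0, 10.0])
target = torch.tensor([3.0, -4.0])
print(nn.MSELoss()(pred, target))  # tensor(102.5000) -- clearly not restricted to [0, 1]
```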
33
Correct answers: a, c, and d.

a. It is used in logistic regression.
The logistic (sigmoid) function is a key part of logistic regression, where it maps the linear combination of inputs to a probability value between 0 and 1, representing class membership.

c. It yields values in the range [0, 1].
The logistic function outputs values between 0 and 1, which makes it suitable for probability estimation, since it maps any input to a value within this range.

d. It introduces non-linearity.
The logistic function introduces non-linearity by transforming a linear input into a nonlinear output, which helps models like logistic regression capture more complex relationships.

Explanation of the incorrect option:
* b. It is used in linear regression.
Incorrect: Linear regression does not use the logistic function. It predicts continuous values without applying a nonlinear transformation like the sigmoid.
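
A quick numerical illustration of the logistic function's range (strictly (0, 1), approaching 0 and 1 in the limits):

```python
import torch

# sigma(z) = 1 / (1 + exp(-z))
z = torch.tensor([-100.0, -1.0, 0.0, 1.0, 100.0])
print(torch.sigmoid(z))  # tensor([0.0000, 0.2689, 0.5000, 0.7311, 1.0000])
```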
34
35
36
37
Correct answers: a and d.

a. The receptive field is closely related to the terms “kernel” or “filter”.
The receptive field refers to the region of the input that affects the output of a neuron. The kernel (or filter) slides over the input, and the receptive field is determined by the kernel sizes and strides of the convolutions.

d. The receptive field is the (part of the) input that is connected to a node/neuron.
In a CNN, the receptive field is the part of the input that contributes to the activation of a neuron: the region of the input the neuron “sees”, as determined by the convolutional layers.

Explanation of the incorrect options:
* b. The receptive field always remains constant throughout the depth of the network.
Incorrect: The receptive field grows as we move deeper into the network, because each subsequent layer “sees” a larger portion of the input due to the cumulative effect of stacked convolutional kernels (see the calculation below).
* c. The receptive field is often bigger than the original input size.
Incorrect: In the early layers the receptive field is much smaller than the input, and although the effective receptive field grows with depth, the region of the input a neuron actually sees can never exceed the input itself.
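
A small illustrative calculation of how the receptive field grows with depth (stacked 3x3 convolutions with stride 1, a simplified special case of the general formula):

```python
# With stride 1, each 3x3 layer grows the receptive field by (kernel_size - 1)
r, k = 1, 3
for layer in range(1, 4):
    r += k - 1
    print(f"after layer {layer}: receptive field = {r}x{r}")
# after layer 1: 3x3, after layer 2: 5x5, after layer 3: 7x7
```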
38
39
Correct answers: a, b, and d.

a. iteration
In gradient descent, an iteration refers to a single update of the model’s parameters based on the gradient of the loss function with respect to those parameters.

b. momentum
Momentum is an optimization technique that accelerates gradient descent by adding a fraction of the previous update to the current one, smoothing out oscillations and speeding up convergence.

d. learning rate
The learning rate is a hyperparameter that controls the size of the steps taken towards the minimum during each iteration of gradient descent.

Explanation of the incorrect option:
* c. validation size
Incorrect: The validation size refers to the portion of the dataset used for validation and is not directly related to gradient descent. It is used to evaluate the model’s performance during training, but it does not affect the gradient descent process itself.
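
A minimal gradient-descent loop on a toy quadratic, combining all three related terms (values are illustrative):

```python
# Minimize f(w) = w^2 with gradient descent plus momentum
grad = lambda w: 2 * w
w, velocity = 5.0, 0.0
learning_rate, momentum = 0.1, 0.9

for iteration in range(200):                                  # a. iterations
    velocity = momentum * velocity - learning_rate * grad(w)  # b. momentum term
    w += velocity                                             # d. step scaled by the learning rate
print(w)  # ~0, close to the minimum
```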
40
Correct answers: a, b, c, and d. All statements are true.

a. Data augmentation parameters cannot typically be transferred/copied to other tasks and scenarios.
While augmentation techniques themselves (such as rotation, flipping, scaling) can be transferred across tasks, the specific parameters (such as the magnitude of rotation, the scale of transformations, or the amount of noise added) often need to be adjusted for each task or dataset. For example, augmenting handwritten digits requires different rotation parameters than augmenting faces, since digits are highly orientation-sensitive (a 6 rotated by 180° looks like a 9).

b. Using incorrect parameters might just yield duplicate samples without any modification.
If the augmentation parameters are set incorrectly (e.g., zero rotation or scaling), the augmentation might not modify the data at all, producing duplicate or nearly identical samples that do not improve generalization.

c. Extreme data augmentation might introduce heavy artifacts.
Over-aggressive augmentation, such as extreme rotations, cropping, or scaling, can distort the data so much that it introduces artifacts (unrealistic or unnatural features), which can hurt the model’s ability to learn meaningful patterns.

d. Using data augmentation techniques that are not suitable for the task can be disadvantageous.
Augmentation techniques that are inappropriate for the data (for example, flipping or rotating digit images when orientation matters) can confuse the model and hinder performance rather than help it.
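
An illustrative torchvision pipeline (all parameter values made up) showing how augmentation choices and magnitudes are task-dependent:

```python
import torchvision.transforms as T

augment = T.Compose([
    T.RandomRotation(degrees=10),    # extreme angles could introduce artifacts (c)
    T.RandomHorizontalFlip(p=0.5),   # unsuitable for text/digit data (d)
    T.ColorJitter(brightness=0.2, contrast=0.2),
])
# With degrees=0 and p=0.0, the pipeline would return near-duplicates (b),
# and these magnitudes would likely need retuning for a different dataset (a).
```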