AI Flashcards


1
Q

Given a convolution layer with input channels 3, output channels 64, kernel size 4x4, stride 2, dilation 3, and padding 1, what is the parameter size of this convolution layer?

A

3x64x4x4
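A quick PyTorch check, as a minimal sketch (the card's answer counts only the weight tensor; the 64 bias terms are extra):

```python
import torch.nn as nn

# The card's layer; stride, dilation, and padding do not change the parameter count.
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=4,
                 stride=2, dilation=3, padding=1)

print(conv.weight.shape)   # torch.Size([64, 3, 4, 4]) -> 3 x 64 x 4 x 4 = 3072 weights
print(conv.bias.shape)     # torch.Size([64]); bias terms, not counted in the card's answer
print(sum(p.numel() for p in conv.parameters()))   # 3136 = 3072 weights + 64 biases
```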

2
Q

In PyTorch (import torch.nn as nn), which of the following layers downsamples the input size to half?

A

nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, stride=2, padding=1, dilation=1)
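A minimal sketch verifying the halving on a dummy input (the 32x32 input size is an illustrative choice):

```python
import torch
import torch.nn as nn

layer = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3,
                  stride=2, padding=1, dilation=1)
x = torch.randn(1, 3, 32, 32)   # dummy batch: one 32x32 RGB image
y = layer(x)
print(y.shape)                  # torch.Size([1, 64, 16, 16]) -> spatial size halved by stride=2
# Output size: floor((32 + 2*1 - 1*(3-1) - 1) / 2) + 1 = 16
```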

3
Q

Which of the following statements is true about convolution layers?

A

A convolution layer is linear, and it is often used together with an activation function.

4
Q

In the design of an autoencoder, the encoder and the decoder should follow the exact same structure.

A

FALSE

5
Q

All regularizations (e.g., L1 norm, L2 norm) penalize larger parameters.

A

TRUE

6
Q

When updating parameters using gradient descent, which way of calculating the loss works better (i.e., a better trade-off between efficiency and robustness)?

A

calculate loss for a mini-batch of data examples in every iteration
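A minimal sketch of this mini-batch pattern (the toy data, linear model, and learning rate are illustrative assumptions):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy regression data and model, just to show the per-mini-batch loss computation.
X, y = torch.randn(1024, 10), torch.randn(1024, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for xb, yb in loader:                  # one mini-batch per iteration
    optimizer.zero_grad()
    loss = loss_fn(model(xb), yb)      # loss over the mini-batch only, not the full dataset
    loss.backward()
    optimizer.step()
```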

7
Q

MaxPooling preserves detected features and downsamples the feature map (image).

A

TRUE
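For reference, a small illustration with nn.MaxPool2d (kernel size and input shape are arbitrary choices):

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)
x = torch.randn(1, 64, 32, 32)    # dummy feature map
print(pool(x).shape)              # torch.Size([1, 64, 16, 16]) -> downsampled, strongest activations kept
```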

8
Q

What is the size of the receptive field for two stacked dilated convolution layers with kernel size 3x3, stride 1, and dilation 2?

A

9x9
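The arithmetic behind the answer, written out as a short sketch:

```python
# Receptive-field arithmetic for two stacked conv layers: kernel 3, stride 1, dilation 2.
k, s, d = 3, 1, 2
effective_kernel = d * (k - 1) + 1                                # 2 * (3 - 1) + 1 = 5
rf_after_layer1 = effective_kernel                                # 5
rf_after_layer2 = rf_after_layer1 + (effective_kernel - 1) * s    # 5 + 4 * 1 = 9
print(rf_after_layer2)                                            # 9 -> a 9x9 receptive field
```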

9
Q

In a CNN, two conv layers cannot be connected directly; we must use a pooling layer in between.

A

FALSE

10
Q

In the design of a CNN, the fully connected layer usually contains many more parameters than the conv layers.

A

TRUE

11
Q

What is the purpose of the ReLU activation function in a CNN?

A

To introduce non-linearity
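For reference, ReLU applied to a small dummy tensor:

```python
import torch
import torch.nn as nn

relu = nn.ReLU()
x = torch.tensor([-2.0, -0.5, 0.0, 1.5])
print(relu(x))   # tensor([0.0000, 0.0000, 0.0000, 1.5000]): negatives clipped, which adds non-linearity
```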

12
Q

What is the main advantage of using dropout in a CNN?

A

Preventing overfitting
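A minimal illustration (p=0.5 is an arbitrary choice); note that dropout is only active in training mode:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()
print(drop(x))   # roughly half the entries zeroed, survivors scaled by 1 / (1 - p) = 2
drop.eval()
print(drop(x))   # identity at evaluation time: all ones
```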

13
Q

In mini-batch SGD training, an important practice is to shuffle the training data before every epoch. Why?

A

It helps the training converge faster and prevents bias.

14
Q

Logistic regression is widely used to solve classification problems by predicting probabilities of discrete (or categorical) values.

A

TRUE

15
Q

Which of the following statements is true about activation functions in the context of neural networks and backpropagation?

A

Activation functions like ReLU (Rectified Linear Unit) introduce non-linear properties to the model, allowing it to learn complex patterns.

16
Q

Which case indicates overfitting?

A

Training error is low, but testing error is high

17
Q

What approach could be used to handle overfitting?

A

Use regularization

18
Q

Besides penalizing larger parameters, which regularization makes parameters more sparse?

A

L1 norm
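A sketch of adding an L1 penalty to a training loss (the model, dummy data, and the 1e-3 weight are illustrative assumptions):

```python
import torch
from torch import nn

model = nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)

data_loss = nn.MSELoss()(model(x), y)
l1_penalty = sum(p.abs().sum() for p in model.parameters())   # L1 norm pushes weights toward exactly 0
loss = data_loss + 1e-3 * l1_penalty
loss.backward()
```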

19
Q

In Backpropagation, which claim is true?

A

The backward pass uses the information preserved in the forward pass to calculate gradients

20
Q

As an activation function, tanh avoids the vanishing gradient problem.

A

FALSE

21
Q

As an activation function, ReLU solves the vanishing gradient problem.

A

TRUE
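A quick gradient comparison at a large input (x = 5 is an arbitrary probe point):

```python
import torch

x = torch.tensor([5.0], requires_grad=True)
torch.tanh(x).backward()
print(x.grad)    # ~1.8e-4: tanh saturates, so the gradient nearly vanishes

x = torch.tensor([5.0], requires_grad=True)
torch.relu(x).backward()
print(x.grad)    # 1.0: ReLU passes the gradient through unchanged for positive inputs
```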

22
Q

Regarding SGD optimization, which statement is not correct?

A

Randomly initializing the parameters will affect the performance.

23
Q

In reinforcement learning, what is the benefit of using a network instead of a lookup table?

A

Generalization

24
Q

How do we usually train an autoencoder model?

A

We usually train the encoder model and the decoder model together
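A minimal sketch of joint training with a single reconstruction loss (layer sizes, optimizer, and dummy data are illustrative assumptions):

```python
import torch
from torch import nn

encoder = nn.Sequential(nn.Linear(784, 32), nn.ReLU())       # compress
decoder = nn.Sequential(nn.Linear(32, 784), nn.Sigmoid())    # reconstruct

# One optimizer over both modules: the two parts are updated together.
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.rand(16, 784)                        # dummy batch of flattened images
recon = decoder(encoder(x))
loss = nn.functional.mse_loss(recon, x)        # reconstruction loss ties encoder and decoder
optimizer.zero_grad()
loss.backward()
optimizer.step()
```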

25
Q

Which claim is true about attention and self-attention?
1. In the sequence-to-sequence model, at different steps, "attention" lets the model "focus" on different parts of the input.
2. Self-attention is usually used to model dependencies between different parts of one sequence (e.g., words in one sentence).
3. Both of the above claims.
4. None of the above claims.

A

Both of the above claims.
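A minimal single-head self-attention sketch, where every position of one sequence attends to every other position (shapes and random projections are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

seq_len, d_model = 6, 16
x = torch.randn(seq_len, d_model)              # one sequence of 6 token embeddings

Wq, Wk, Wv = (torch.randn(d_model, d_model) for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv               # queries, keys, values from the same sequence

scores = q @ k.T / d_model ** 0.5              # (6, 6) pairwise relevance between positions
weights = F.softmax(scores, dim=-1)            # each row: where that token "focuses"
out = weights @ v                              # (6, 16) context-mixed representations
```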
26
Q

What is the major purpose of multi-head attention?

A

Capturing multiple relationships between the words of the input sequence
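For reference, multi-head self-attention with PyTorch's built-in module (embed_dim, num_heads, and the input shape are arbitrary choices); each head can capture a different kind of relationship:

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)
x = torch.randn(2, 6, 16)          # (batch, sequence length, embedding)
out, attn = mha(x, x, x)           # self-attention: query = key = value = x
print(out.shape, attn.shape)       # torch.Size([2, 6, 16]) torch.Size([2, 6, 6])
```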
27
Q

In the transformer neural network architecture, the encoder blocks usually use an identical neural network structure.

A

TRUE

28
Q

In the transformer neural network architecture, the output of the final encoder block will go to _____.

A

Every decoder block

29
Q

In the autoregressive model, the output variable at the current step depends only on the hidden states at all previous steps.

A

FALSE

30
Q

In the Transformer, how does the decoder use the information (features) from the encoder?

A

Cross-attention (mixed attention from multiple inputs)

31
Q

In the policy gradient approach for reinforcement learning, the reward R(τ^n) is computed based on

A

The cumulative reward over each entire trajectory
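A REINFORCE-style sketch showing how the return of the whole trajectory weights every step in it (the log-probabilities and rewards are dummy stand-ins):

```python
import torch

log_probs = torch.randn(20, requires_grad=True)   # log pi(a_t | s_t) for one sampled trajectory
rewards = torch.rand(20)                          # r_t collected along that same trajectory

R_tau = rewards.sum()                             # cumulative reward of the entire trajectory
loss = -(log_probs * R_tau).sum()                 # every step is weighted by the trajectory return
loss.backward()
```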
32
Q

In the two major approaches of reinforcement learning, i.e., policy gradient and Q-learning, which is usually more sample-efficient?

A

Q-learning

33
Q

In Q-learning, we can use either a Q-table or a neural network to predict the Q-value for a (state, action) pair. Which one is more scalable?

A

Neural network
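A sketch of a Q-network (the state and action sizes are illustrative): unlike a Q-table, it can output Q-value estimates even for states it has never seen:

```python
import torch
from torch import nn

state_dim, n_actions = 4, 2
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

state = torch.randn(1, state_dim)     # any state, including one never stored in a table
q_values = q_net(state)               # Q(s, a) for every action a
action = q_values.argmax(dim=1)       # greedy action selection
```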
34
Q

In the two major approaches of reinforcement learning, i.e., policy gradient and Q-learning, which one uses randomly sampled transitions instead of entire trajectories?

A

Q-learning

35
Q

In the two major approaches of reinforcement learning, i.e., policy gradient and Q-learning, which one is on-policy training?

A

Policy gradient

36
Q

For discrete-event modeling, what approach do we often use?

A

Q-learning

37
Q

What approach do we often use for discrete-event modeling?

A

Q-learning

38
Q

In information theory, which event carries more information?

A

The occurrence of a low-probability event
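The underlying quantity is self-information, I(x) = -log2 p(x); a quick numeric check:

```python
import math

# Rarer events carry more bits of information.
for p in (0.5, 0.1, 0.01):
    print(p, -math.log2(p))   # 1.0, ~3.32, ~6.64 bits
```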
39
Q

What is the correct statement about activation functions in neural networks and backpropagation?

A

Activation functions introduce non-linear properties to the model, allowing it to learn complex patterns.

40
Q

What is a common sign of overfitting?

A

Low training error, high testing error

41
Q

What technique can help to mitigate overfitting?

A

Regularization

42
Q

Do all regularizations penalize larger parameters?

A

TRUE

43
Q

What is the primary purpose of using dropout in a CNN?

A

Preventing overfitting

44
Q

What is the purpose of the ReLU activation function in a CNN?

A

To introduce non-linearity

45
Q

Which statement about convolution layers is true?

A

Convolution layers are linear and often used with activation functions.

46
Q

What is the parameter size of a convolution layer with input channels 3, output channels 64, kernel size 4x4, stride 2, dilation 3, and padding 1?

A

3x64x4x4

47
Q

Which layer in PyTorch downsamples the input size to half?

A

nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, stride=2, padding=1, dilation=1)
48
Q

Does MaxPooling preserve detected features and downsample the feature map?

A

TRUE

49
Q

Do two convolution layers in a CNN always need a pooling layer in between them?

A

FALSE

50
Q

Does a fully connected layer in a CNN usually contain more parameters than the convolution layers?

A

TRUE

51
Q

What is the primary purpose of using dropout in a CNN?

A

To prevent overfitting

52
Q

What is the main advantage of using dropout in a CNN?

A

Preventing overfitting

53
Q

What is the purpose of the ReLU activation function in a CNN?

A

To introduce non-linearity

54
Q

Which statement about convolution layers is true?

A

Convolution layers are linear and often used with activation functions.

55
Q

What is the parameter size of a convolution layer with input channels 3, output channels 64, kernel size 4x4, stride 2, dilation 3, and padding 1?

A

3x64x4x4

56
Q

Which layer in PyTorch downsamples the input size to half?

A

nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, stride=2, padding=1, dilation=1)