Hands-on AI II Flashcards
(74 cards)
Which of the following are true about the ‘softmax’ function?
a. It is often used in the output layer of a multi-class classification network.
b. It converts a vector of real numbers (logits) into a probability distribution, where all elements are between 0 and 1 and sum to 1.
c. It is a generalization of the sigmoid function to multiple classes.
d. It is defined as σ(z)ᵢ = zᵢ / Σzⱼ.
a. It is often used in the output layer of a multi-class classification network.
b. It converts a vector of real numbers (logits) into a probability distribution, where all elements are between 0 and 1 and sum to 1.
c. It is a generalization of the sigmoid function to multiple classes.
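A minimal NumPy sketch of the correct definition (note that option d above omits the exponentials; the actual softmax is σ(z)ᵢ = e^(zᵢ) / Σⱼ e^(zⱼ)):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
p = softmax(logits)
print(p, p.sum())  # probabilities in (0, 1) that sum to 1
```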
Which of the following statements are correct regarding tabular data?
a. Each row typically represents a sample.
b. Each column typically represents a feature.
c. It can be used in both supervised and unsupervised learning.
d. It has a fixed set of properties (columns) for every record (row).
a. Each row typically represents a sample.
b. Each column typically represents a feature.
c. It can be used in both supervised and unsupervised learning.
d. It has a fixed set of properties (columns) for every record (row).
How does a one-to-many RNN architecture typically work?
a. It takes a single input vector and generates a sequence of output vectors.
b. An example application is image captioning, where the input is a feature vector from an image and the output is a sentence.
c. It processes a sequence of inputs to produce a single output.
d. It is the standard architecture for sentiment classification.
a. It takes a single input vector and generates a sequence of output vectors.
b. An example application is image captioning, where the input is a feature vector from an image and the output is a sentence.
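A toy sketch of the one-to-many unrolling (dimensions and weights are arbitrary; real captioning models also feed the previously generated token back in):

```python
import numpy as np

rng = np.random.default_rng(0)
W_x = rng.normal(size=(8, 4))   # input-to-hidden weights
W_h = rng.normal(size=(8, 8))   # hidden-to-hidden (recurrent) weights
W_o = rng.normal(size=(3, 8))   # hidden-to-output weights

x = rng.normal(size=4)          # a single input vector (e.g. image features)
h = np.tanh(W_x @ x)            # initial hidden state from the one input
outputs = []
for _ in range(5):              # generate a sequence of 5 output vectors
    outputs.append(W_o @ h)
    h = np.tanh(W_h @ h)        # recurrence carries state between steps
print(len(outputs), outputs[0].shape)  # 5 outputs of dimension 3
```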
What is ‘perplexity’ in the context of language modeling?
a. It is a measure of how well a probability model predicts a sample.
b. It is calculated as the exponentiation of the cross-entropy loss.
c. A lower perplexity indicates a better language model.
d. A perplexity of N means that the model is as confused as if it had to choose uniformly among N choices at each step.
a. It is a measure of how well a probability model predicts a sample.
b. It is calculated as the exponentiation of the cross-entropy loss.
c. A lower perplexity indicates a better language model.
d. A perplexity of N means that the model is as confused as if it had to choose uniformly among N choices at each step.
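A small sketch of the cross-entropy/perplexity relationship (assuming natural-log cross-entropy; with log base 2 one would use 2**H instead):

```python
import numpy as np

# Probabilities a model assigned to the actual next token at each step.
probs = np.array([0.5, 0.25, 0.25, 0.5])

cross_entropy = -np.mean(np.log(probs))  # average negative log-likelihood
perplexity = np.exp(cross_entropy)       # exponentiated cross-entropy
print(perplexity)  # ~2.83: like choosing uniformly among ~2.83 options per step
```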
Which of the following statements are true about SMILES (Simplified Molecular Input Line Entry Specification)?
a. It is a string sequence representation of a molecule.
b. A molecule can have different SMILES representations.
c. A molecular graph can be transformed into a SMILES string and vice versa.
d. A molecule can only have exactly one SMILES representation.
a. It is a string sequence representation of a molecule.
b. A molecule can have different SMILES representations.
c. A molecular graph can be transformed into a SMILES string and vice versa.
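A round-trip sketch using RDKit (assuming RDKit is installed); a different atom ordering can yield a different but equivalent SMILES string for the same molecule:

```python
from rdkit import Chem

mol = Chem.MolFromSmiles("c1ccccc1O")  # phenol, one of several valid SMILES
canonical = Chem.MolToSmiles(mol)      # molecular graph -> canonical SMILES
print(canonical)                       # e.g. 'Oc1ccccc1'
```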
In the Q-learning update rule, Q(s,a) ← (1−α)Q(s,a) + α(r + γ maxₐ′ Q(s′,a′)), what is the role of the discount factor γ (gamma)?
a. It determines the learning rate of the update.
b. It determines the importance of future rewards. A value close to 0 makes the agent ‘myopic’ (short-sighted), while a value close to 1 makes it value long-term rewards highly.
c. It controls the trade-off between exploration and exploitation.
d. It ensures that the Q-values do not grow infinitely large.
b. It determines the importance of future rewards. A value close to 0 makes the agent ‘myopic’ (short-sighted), while a value close to 1 makes it value long-term rewards highly.
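A sketch of the tabular update with hypothetical environment sizes and variable names:

```python
import numpy as np

n_states, n_actions = 5, 2           # hypothetical environment size
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99             # learning rate and discount factor

def q_update(s, a, r, s_next):
    # gamma near 0 -> myopic agent; gamma near 1 -> long-term rewards matter
    target = r + gamma * Q[s_next].max()
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target

q_update(0, 1, r=1.0, s_next=2)
print(Q[0, 1])  # 0.1
```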
In a standard feed-forward neural network, how is the output of a single neuron calculated?
a. By summing the outputs of all neurons in the previous layer.
b. By calculating a weighted sum of the outputs from the previous layer, adding a bias, and then applying a non-linear activation function.
c. By simply applying a non-linear activation function to the input vector.
d. By performing a convolution operation on the inputs.
b. By calculating a weighted sum of the outputs from the previous layer, adding a bias, and then applying a non-linear activation function.
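A one-function sketch of answer b (tanh stands in for any non-linear activation; values are hypothetical):

```python
import numpy as np

def neuron(x, w, b):
    # Weighted sum of previous-layer outputs, plus bias, then non-linearity.
    return np.tanh(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])   # outputs of the previous layer
w = np.array([0.1, 0.4, -0.2])   # this neuron's weights
print(neuron(x, w, b=0.3))
```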
What is the purpose of a pooling layer in a CNN?
a. To increase the spatial dimensions of the feature maps.
b. To progressively reduce the spatial size of the representation.
c. To reduce the number of parameters and computational load in the network.
d. To make the representation more robust to small translations in the input.
b. To progressively reduce the spatial size of the representation.
c. To reduce the number of parameters and computational load in the network.
d. To make the representation more robust to small translations in the input.
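A NumPy sketch of 2x2 max pooling, the most common case:

```python
import numpy as np

def max_pool2d(fmap, size=2):
    # Non-overlapping size x size windows; for size=2 each spatial dim halves.
    h, w = fmap.shape
    return (fmap[:h - h % size, :w - w % size]
            .reshape(h // size, size, w // size, size)
            .max(axis=(1, 3)))

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool2d(fmap))  # 2x2 output: fewer activations, small shifts matter less
```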
Which of the following statements are true about the vanishing gradient problem?
a. Repeated multiplication of gradients smaller than 1 leads to a vanishing gradient.
b. The choice of the activation functions plays a crucial role in the vanishing gradient problem.
c. A vanishing gradient can be mitigated by decreasing the learning rate.
d. It is more severe in the final layers of a deep network (closer to the output).
e. Vanishing gradients make the training of a neural network extremely difficult.
a. Repeated multiplication of gradients smaller than 1 leads to a vanishing gradient.
b. The choice of the activation functions plays a crucial role in the vanishing gradient problem.
e. Vanishing gradients make the training of a neural network extremely difficult.
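A back-of-the-envelope illustration of statements a and b: the sigmoid's derivative is at most 0.25, so the chain-rule product across many layers collapses toward zero:

```python
# Maximum derivative of the sigmoid is 0.25; multiply it across 50 layers.
grad = 0.25 ** 50
print(grad)  # ~7.9e-31: early layers receive essentially no gradient signal
```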
Which of the following statements is/are true about (deep) Q-learning?
a. Q-learning is one possible implementation of reinforcement learning.
b. Q-learning becomes computationally infeasible for larger MDPs (Markov decision processes) due to the large state-action space.
c. Deep Q-learning is about approximating the Q-value function using a neural network.
d. It is used to learn an optimal policy by estimating Q-values.
a. Q-learning is one possible implementation of reinforcement learning.
b. Q-learning becomes computationally infeasible for larger MDPs (Markov decision processes) due to the large state-action space.
c. Deep Q-learning is about approximating the Q-value function using a neural network.
d. It is used to learn an optimal policy by estimating Q-values.
With respect to the vanishing gradient problem, which of the following statements are true regarding deep neural networks?
a. The deeper the network, the more multiplications (chain rule) we have to perform in the backward pass.
b. The vanishing gradient problem can get more severe when increasing the network depth.
c. The vanishing gradient problem will typically occur towards the input layer.
d. The problem is independent of the network’s depth and only depends on the activation function.
a. The deeper the network, the more multiplications (chain rule) we have to perform in the backward pass.
b. The vanishing gradient problem can get more severe when increasing the network depth.
c. The vanishing gradient problem will typically occur towards the input layer.
What is a ‘hyperparameter’ in machine learning?
a. A parameter of the model that is learned during the training process, such as a weight in a neural network.
b. A configuration that is set before the training process begins, such as the learning rate, the number of hidden layers, or the value of k in k-Means.
c. The output prediction of a model.
d. The loss function used to train a model.
b. A configuration that is set before the training process begins, such as the learning rate, the number of hidden layers, or the value of k in k-Means.
Which of the following are valid representations for small molecules in cheminformatics?
a. SMILES strings.
b. Molecular graphs, where atoms are nodes and bonds are edges.
c. Molecular fingerprints (e.g., Morgan fingerprints), which are binary vectors.
d. A 3D coordinate list for each atom.
a. SMILES strings.
b. Molecular graphs, where atoms are nodes and bonds are edges.
c. Molecular fingerprints (e.g., Morgan fingerprints), which are binary vectors.
d. A 3D coordinate list for each atom.
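A short RDKit sketch of the fingerprint representation from option c (assuming RDKit is installed; radius and bit count are typical defaults, not mandated values):

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CCO")  # ethanol
# Morgan fingerprint: a fixed-length binary vector encoding local substructures.
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)
print(fp.GetNumOnBits(), "of", fp.GetNumBits(), "bits set")
```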
Which of these statements about the chain rule’s role in backpropagation is correct?
a. The chain rule is used to compute the gradient of a composite function.
b. In deep networks, the gradient of the loss with respect to an early layer’s weights is calculated by multiplying the derivatives of all subsequent layers.
c. This multiplicative nature is what can lead to the vanishing or exploding gradient problems.
d. The chain rule simplifies the computation by breaking down the gradient calculation into a product of local derivatives.
a. The chain rule is used to compute the gradient of a composite function.
b. In deep networks, the gradient of the loss with respect to an early layer’s weights is calculated by multiplying the derivatives of all subsequent layers.
c. This multiplicative nature is what can lead to the vanishing or exploding gradient problems.
d. The chain rule simplifies the computation by breaking down the gradient calculation into a product of local derivatives.
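A tiny worked example of the product of local derivatives, for L = (w·x)² with hypothetical values:

```python
# Forward: y = w * x, L = y ** 2. Backward: dL/dw = dL/dy * dy/dw.
w, x = 3.0, 2.0
y = w * x              # 6.0
dL_dy = 2 * y          # local derivative of L = y**2 -> 12.0
dy_dw = x              # local derivative of y = w*x  -> 2.0
dL_dw = dL_dy * dy_dw  # chain rule: product of local derivatives -> 24.0
print(dL_dw)
```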
The convolution of a 64x64 grayscale image (1 channel) with 16 kernels of size 5x5 (with no padding and a stride of 1) produces…
a. … an output with 1 feature map.
b. … an output with 16 feature maps.
c. … an output volume of size 60x60x16.
d. … an output volume of size 64x64x16.
b. … an output with 16 feature maps.
c. … an output volume of size 60x60x16.
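The spatial size follows from output = (input + 2·padding − kernel)/stride + 1; a quick check:

```python
def conv_out(size, kernel, stride=1, padding=0):
    # Standard convolution output-size formula.
    return (size + 2 * padding - kernel) // stride + 1

print(conv_out(64, 5))  # 60 -> a 60x60 map per kernel; 16 kernels -> 60x60x16
```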
Which of the following is true about padding in Convolutional Neural Networks?
a. It is the process of adding extra pixels (usually zeros) around the border of an input image.
b. It can be used to control the spatial size of the output feature maps.
c. ‘Valid’ padding means no padding is applied, which typically causes the output feature map to be smaller than the input.
d. ‘Same’ padding aims to keep the output feature map the same size as the input feature map.
a. It is the process of adding extra pixels (usually zeros) around the border of an input image.
b. It can be used to control the spatial size of the output feature maps.
c. ‘Valid’ padding means no padding is applied, which typically causes the output feature map to be smaller than the input.
d. ‘Same’ padding aims to keep the output feature map the same size as the input feature map.
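Reusing the output-size formula above ('same' padding for an odd kernel and stride 1 uses padding = (kernel − 1)/2):

```python
def conv_out(size, kernel, stride=1, padding=0):
    return (size + 2 * padding - kernel) // stride + 1

print(conv_out(64, 5, padding=0))  # 'valid': 60, output shrinks
print(conv_out(64, 5, padding=2))  # 'same' : 64, output size preserved
```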
Which of the following statements is/are true about QSAR (Quantitative Structure-Activity Relationship)?
a. The bio-activity of a molecule is determined by its molecular structure.
b. The hypothesis is that similar molecular structures have similar activities.
c. The hypothesis is that molecules with similar activities must have similar molecular structures.
d. It is primarily used for predicting the cost of drug development.
a. The bio-activity of a molecule is determined by its molecular structure.
b. The hypothesis is that similar molecular structures have similar activities.
The original CEC (constant error carousel) of an LSTM …
a. … is responsible for countering the vanishing gradient problem.
b. … is responsible for going from the old cell state to the new cell state.
c. … uses non-linear functions to modify the cell state.
d. … is the primary cause of the vanishing gradient problem in RNNs.
a. … is responsible for countering the vanishing gradient problem.
b. … is responsible for going from the old cell state to the new cell state.
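A sketch of the cell-state update that contains the CEC: the old state flows to the new state via element-wise gating and addition, with no squashing non-linearity applied to the state itself (gate values here are hypothetical; normally they are computed from the input and hidden state):

```python
import numpy as np

f = np.array([0.9, 0.5])   # forget gate
i = np.array([0.1, 0.3])   # input gate
g = np.array([0.4, -0.2])  # candidate cell update (tanh-squashed)

c_old = np.array([1.0, -2.0])
c_new = f * c_old + i * g  # additive path: gradients can flow largely unchanged
print(c_new)
```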
In the self-attention formula Attention(Q, K, V) = softmax(QKᵀ/√dₖ)V, what do Q, K, and V represent?
a. Q, K, and V are learnable weight matrices.
b. Q, K, and V (Query, Key, Value) are vectors derived from the input embeddings for each token.
c. The dot product of a query (Q) and a key (K) determines an attention score, indicating how much focus to place on another token.
d. The value vectors (V) are averaged, weighted by the attention scores, to produce the output for each token.
b. Q, K, and V (Query, Key, Value) are vectors derived from the input embeddings for each token.
c. The dot product of a query (Q) and a key (K) determines an attention score, indicating how much focus to place on another token.
d. The value vectors (V) are averaged, weighted by the attention scores, to produce the output for each token.
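A NumPy sketch of the formula for a single attention head (4 tokens of dimension 8 are arbitrary choices):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # query-key dot products -> attention scores
    return softmax(scores) @ V       # attention-weighted average of the values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(attention(Q, K, V).shape)  # (4, 8): one output vector per token
```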
Which of the following are parts of a Markov decision process?
a. A set of possible states.
b. A set of possible actions.
c. A set of possible rewards.
d. A set of possible state transitions.
e. A policy function.
a. A set of possible states.
b. A set of possible actions.
c. A set of possible rewards.
d. A set of possible state transitions.
Which of the following statements are true regarding a loss function?
a. It measures how close the prediction is to the true target.
b. Typically, the lower the loss, the better the prediction.
c. The choice of loss function depends on the task (e.g., Cross-Entropy for classification, MSE for regression).
d. A loss of 0 indicates a perfect prediction on the given sample.
a. It measures how close the prediction is to the true target.
b. Typically, the lower the loss, the better the prediction.
c. The choice of loss function depends on the task (e.g., Cross-Entropy for classification, MSE for regression).
d. A loss of 0 indicates a perfect prediction on the given sample.
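The two named losses in a small sketch (values are hypothetical):

```python
import numpy as np

# Regression: mean squared error -> 0 only when prediction equals target.
y_true, y_pred = np.array([1.0, 2.0]), np.array([1.1, 1.9])
mse = np.mean((y_true - y_pred) ** 2)

# Classification: cross-entropy on the probability assigned to the true class.
p_true_class = 0.8
ce = -np.log(p_true_class)

print(mse, ce)  # both shrink toward 0 as predictions improve
```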
What is a ‘receptive field’ in a Convolutional Neural Network?
a. The entire input image that a neuron in the output layer can ‘see’.
b. The specific region of the input volume that a particular neuron in a convolutional layer is connected to.
c. As we go deeper into the network, the effective receptive field of the neurons increases.
d. It is a learnable parameter of the convolutional layer.
b. The specific region of the input volume that a particular neuron in a convolutional layer is connected to.
c. As we go deeper into the network, the effective receptive field of the neurons increases.
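A quick way to see statement c: stacking k×k convolutions with stride 1 grows the effective receptive field by k − 1 per layer:

```python
def receptive_field(num_layers, kernel=3):
    # Effective receptive field of stacked stride-1 convolutions.
    return 1 + num_layers * (kernel - 1)

print([receptive_field(n) for n in (1, 2, 3)])  # [3, 5, 7]: deeper sees more
```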
Back-propagation through time (BPTT) …
a. … is commonly used to train an RNN.
b. … generates a computational graph that is (potentially) very deep, depending on sequence length.
c. … is the process of calculating the gradient of the loss with respect to an RNN’s weights.
d. … mitigates the vanishing gradient problem by design.
a. … is commonly used to train an RNN.
b. … generates a computational graph that is (potentially) very deep, depending on sequence length.
c. … is the process of calculating the gradient of the loss with respect to an RNN’s weights.
What is the primary advantage of the Transformer architecture over RNNs for language tasks?
a. Transformers process the entire input sequence at once, allowing for massive parallelization during training.
b. The self-attention mechanism allows the model to directly weigh the importance of all other words in the sequence when processing a given word.
c. Transformers are inherently better at handling short sequences than RNNs.
d. Transformers have fewer parameters than LSTMs, making them faster to train.
a. Transformers process the entire input sequence at once, allowing for massive parallelization during training.
b. The self-attention mechanism allows the model to directly weigh the importance of all other words in the sequence when processing a given word.