Hands-on AI II Flashcards

(74 cards)

1
Q

Which of the following are true about the ‘softmax’ function?
a. It is often used in the output layer of a multi-class classification network.
b. It converts a vector of real numbers (logits) into a probability distribution, where all elements are between 0 and 1 and sum to 1.
c. It is a generalization of the sigmoid function to multiple classes.
d. It is defined as σ(z)ᵢ = zᵢ / Σzⱼ.

A

a. It is often used in the output layer of a multi-class classification network.
b. It converts a vector of real numbers (logits) into a probability distribution, where all elements are between 0 and 1 and sum to 1.
c. It is a generalization of the sigmoid function to multiple classes.
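
A minimal NumPy sketch of the softmax function from this card; the max-subtraction is a common numerical-stability trick and the example logits are made up:

```python
import numpy as np

def softmax(z):
    """Map a vector of logits to a probability distribution."""
    z = z - np.max(z)            # subtract max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs, probs.sum())        # all entries in (0, 1), summing to 1.0
```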

2
Q

*Which of the following statements are correct regarding tabular data?
a. Each row typically represents a sample.
b. Each column typically represents a feature.
c. It can be used in both supervised and unsupervised learning.
d. It has a fixed set of properties (columns) for every record (row).

A

a. Each row typically represents a sample.
b. Each column typically represents a feature.
c. It can be used in both supervised and unsupervised learning.
d. It has a fixed set of properties (columns) for every record (row).

3
Q

How does a one-to-many RNN architecture typically work?
a. It takes a single input vector and generates a sequence of output vectors.
b. An example application is image captioning, where the input is a feature vector from an image and the output is a sentence.
c. It processes a sequence of inputs to produce a single output.
d. It is the standard architecture for sentiment classification.

A

a. It takes a single input vector and generates a sequence of output vectors.
b. An example application is image captioning, where the input is a feature vector from an image and the output is a sentence.

4
Q

What is ‘perplexity’ in the context of language modeling?
a. It is a measure of how well a probability model predicts a sample.
b. It is calculated as the exponentiation of the cross-entropy loss.
c. A lower perplexity indicates a better language model.
d. A perplexity of N means that the model is as confused as if it had to choose uniformly among N choices at each step.

A

a. It is a measure of how well a probability model predicts a sample.
b. It is calculated as the exponentiation of the cross-entropy loss.
c. A lower perplexity indicates a better language model.
d. A perplexity of N means that the model is as confused as if it had to choose uniformly among N choices at each step.
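
A small NumPy sketch of the relationship in options b and d; the per-token probabilities the model assigns to the true next words are invented:

```python
import numpy as np

# Hypothetical probabilities a language model assigned to the true next word.
p_true = np.array([0.2, 0.5, 0.1, 0.4])

cross_entropy = -np.mean(np.log(p_true))  # average negative log-likelihood
perplexity = np.exp(cross_entropy)        # exponentiated cross-entropy
print(perplexity)  # ~3.97: roughly as confused as a uniform choice among 4
```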

5
Q

**Which of the following statements are true about SMILES (Simplified Molecular Input Line Entry System)?
a. It is a string sequence representation of a molecule.
b. A molecule can have different SMILES representations.
c. A molecular graph can be transformed into a SMILES string and the other way around.
d. A molecule can only have exactly one SMILES representation.

A

a. It is a string sequence representation of a molecule.
b. A molecule can have different SMILES representations.
c. A molecular graph can be transformed into a SMILES string and the other way around.

6
Q

In the Q-learning update rule, Q(s,a) ← (1-α)Q(s,a) + α(r + γ maxₐ′ Q(s′,a′)), what is the role of the discount factor γ (gamma)?
a. It determines the learning rate of the update.
b. It determines the importance of future rewards. A value close to 0 makes the agent ‘myopic’ (short-sighted), while a value close to 1 makes it value long-term rewards highly.
c. It controls the trade-off between exploration and exploitation.
d. It ensures that the Q-values do not grow infinitely large.

A

b. It determines the importance of future rewards. A value close to 0 makes the agent ‘myopic’ (short-sighted), while a value close to 1 makes it value long-term rewards highly.
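
A toy NumPy sketch of the tabular update rule from the question; the state/action counts and the sampled transition are invented:

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Q(s,a) <- (1-alpha)*Q(s,a) + alpha*(r + gamma * max_a' Q(s',a'))."""
    target = r + gamma * np.max(Q[s_next])   # gamma weights the future value
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
    return Q

Q = np.zeros((5, 2))                         # toy table: 5 states, 2 actions
Q = q_update(Q, s=0, a=1, r=1.0, s_next=3)
print(Q[0, 1])                               # 0.1
```

With gamma close to 0 the target is dominated by the immediate reward r; with gamma close to 1, the future value max Q(s',a') weighs heavily.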

7
Q

In a standard feed-forward neural network, how is the output of a single neuron calculated?
a. By summing the outputs of all neurons in the previous layer.
b. By calculating a weighted sum of the outputs from the previous layer, adding a bias, and then applying a non-linear activation function.
c. By simply applying a non-linear activation function to the input vector.
d. By performing a convolution operation on the inputs.

A

b. By calculating a weighted sum of the outputs from the previous layer, adding a bias, and then applying a non-linear activation function.
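
A minimal NumPy sketch of option b; tanh is an arbitrary choice of non-linearity and the numbers are made up:

```python
import numpy as np

def neuron(x, w, b):
    """Weighted sum of previous-layer outputs, plus bias, through a non-linearity."""
    z = np.dot(w, x) + b    # pre-activation
    return np.tanh(z)       # non-linear activation

x = np.array([0.5, -1.0, 2.0])   # outputs of the previous layer
w = np.array([0.1, 0.4, -0.2])   # this neuron's weights
print(neuron(x, w, b=0.3))
```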

8
Q

What is the purpose of a pooling layer in a CNN?
a. To increase the spatial dimensions of the feature maps.
b. To progressively reduce the spatial size of the representation.
c. To reduce the number of parameters and computational load in the network.
d. To make the representation more robust to small translations in the input.

A

b. To progressively reduce the spatial size of the representation.
c. To reduce the number of parameters and computational load in the network.
d. To make the representation more robust to small translations in the input.

9
Q

**Which of the following statements are true about the vanishing gradient problem?
a. Repeated multiplication of gradients smaller than 1 leads to a vanishing gradient.
b. The choice of the activation functions plays a crucial role in the vanishing gradient problem.
c. A vanishing gradient can be mitigated by decreasing the learning rate.
d. It is more severe in the final layers of a deep network (closer to the output).
e. Vanishing gradients make the training of a neural network extremely difficult.

A

a. Repeated multiplication of gradients smaller than 1 leads to a vanishing gradient.
b. The choice of the activation functions plays a crucial role in the vanishing gradient problem.
e. Vanishing gradients make the training of a neural network extremely difficult.
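
A one-line NumPy illustration of option a, assuming each of 20 layers contributes a gradient factor of 0.25 (the maximum of the sigmoid derivative):

```python
import numpy as np

depth = 20
grad = np.prod(np.full(depth, 0.25))  # chain rule: product of per-layer factors
print(grad)   # ~9.1e-13: almost no gradient signal reaches the early layers
```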

10
Q

**Which of the following statements is/are true about (deep) Q-learning?
a. Q-learning is one possible implementation of reinforcement learning.
b. Q-learning becomes computationally infeasible for larger MDPs (Markov decision processes) due to the large state-action space.
c. Deep Q-learning is about approximating the Q-value function using a neural network.
d. It is used to learn an optimal policy by estimating Q-values.

A

a. Q-learning is one possible implementation of reinforcement learning.
b. Q-learning becomes computationally infeasible for larger MDPs (Markov decision processes) due to the large state-action space.
c. Deep Q-learning is about approximating the Q-value function using a neural network.
d. It is used to learn an optimal policy by estimating Q-values.

11
Q

**With respect to the vanishing gradient problem, which of the following statements are true regarding deep neural networks?
a. The deeper the network, the more multiplications (chain rule) we have to perform in the backward pass.
b. The vanishing gradient problem can get more severe when increasing the network depth.
c. The vanishing gradient problem will typically occur towards the input layer.
d. The problem is independent of the network’s depth and only depends on the activation function.

A

a. The deeper the network, the more multiplications (chain rule) we have to perform in the backward pass.
b. The vanishing gradient problem can get more severe when increasing the network depth.
c. The vanishing gradient problem will typically occur towards the input layer.

12
Q

What is a ‘hyperparameter’ in machine learning?
a. A parameter of the model that is learned during the training process, such as a weight in a neural network.
b. A configuration that is set before the training process begins, such as the learning rate, the number of hidden layers, or the value of k in k-Means.
c. The output prediction of a model.
d. The loss function used to train a model.

A

b. A configuration that is set before the training process begins, such as the learning rate, the number of hidden layers, or the value of k in k-Means.

13
Q

Which of the following are valid representations for small molecules in cheminformatics?
a. SMILES strings.
b. Molecular graphs, where atoms are nodes and bonds are edges.
c. Molecular fingerprints (e.g., Morgan fingerprints), which are binary vectors.
d. A 3D coordinate list for each atom.

A

a. SMILES strings.
b. Molecular graphs, where atoms are nodes and bonds are edges.
c. Molecular fingerprints (e.g., Morgan fingerprints), which are binary vectors.
d. A 3D coordinate list for each atom.

14
Q

Which of these statements about the chain rule’s role in backpropagation is correct?
a. The chain rule is used to compute the gradient of a composite function.
b. In deep networks, the gradient of the loss with respect to an early layer’s weights is calculated by multiplying the derivatives of all subsequent layers.
c. This multiplicative nature is what can lead to the vanishing or exploding gradient problems.
d. The chain rule simplifies the computation by breaking down the gradient calculation into a product of local derivatives.

A

a. The chain rule is used to compute the gradient of a composite function.
b. In deep networks, the gradient of the loss with respect to an early layer’s weights is calculated by multiplying the derivatives of all subsequent layers.
c. This multiplicative nature is what can lead to the vanishing or exploding gradient problems.
d. The chain rule simplifies the computation by breaking down the gradient calculation into a product of local derivatives.

15
Q

The convolution of a 64x64 grayscale image (1 channel) with 16 kernels of size 5x5 (with no padding and a stride of 1) produces…
a. … an output with 1 feature map.
b. … an output with 16 feature maps.
c. … an output volume of size 60x60x16.
d. … an output volume of size 64x64x16.

A

b. … an output with 16 feature maps.
c. … an output volume of size 60x60x16.

16
Q

Which of the following is true about padding in Convolutional Neural Networks?
a. It is the process of adding extra pixels (usually zeros) around the border of an input image.
b. It can be used to control the spatial size of the output feature maps.
c. ‘Valid’ padding means no padding is applied, which typically causes the output feature map to be smaller than the input.
d. ‘Same’ padding aims to keep the output feature map the same size as the input feature map.

A

a. It is the process of adding extra pixels (usually zeros) around the border of an input image.
b. It can be used to control the spatial size of the output feature maps.
c. ‘Valid’ padding means no padding is applied, which typically causes the output feature map to be smaller than the input.
d. ‘Same’ padding aims to keep the output feature map the same size as the input feature map.
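
A small sketch of the standard output-size formula, floor((n + 2p - k) / s) + 1, which the card's statements imply; it also reproduces the 60x60 result from the previous card:

```python
import math

def conv_out_size(n, k, p, s):
    """Spatial output size for input n, kernel k, padding p, stride s."""
    return math.floor((n + 2 * p - k) / s) + 1

print(conv_out_size(64, 5, p=0, s=1))  # 'valid' (no padding): 60, smaller than input
print(conv_out_size(64, 5, p=2, s=1))  # 'same' (p = (k-1)/2 for odd k): 64, unchanged
```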

17
Q

**Which of the following statements is/are true about QSAR (Quantitative Structure-Activity Relationship)?
a. The bio-activity of a molecule is determined by its molecular structure.
b. The hypothesis is that similar molecular structures have similar activities.
c. The hypothesis is that molecules with similar activities must have similar molecular structures.
d. It is primarily used for predicting the cost of drug development.

A

a. The bio-activity of a molecule is determined by its molecular structure.
b. The hypothesis is that similar molecular structures have similar activities.

18
Q

**The original CEC (constant error carousel) of an LSTM …
a. … is responsible for countering the vanishing gradient problem.
b. … is responsible for going from the old cell state to the new cell state.
c. … uses non-linear functions to modify the cell state.
d. … is the primary cause of the vanishing gradient problem in RNNs.

A

a. … is responsible for countering the vanishing gradient problem.
b. … is responsible for going from the old cell state to the new cell state.

19
Q

In the self-attention formula Attention(Q, K, V) = softmax(QKᵀ/√dₖ)V, what do Q, K, and V represent?
a. Q, K, and V are learnable weight matrices.
b. Q, K, and V (Query, Key, Value) are vectors derived from the input embeddings for each token.
c. The dot product of a query (Q) and a key (K) determines an attention score, indicating how much focus to place on another token.
d. The value vectors (V) are averaged, weighted by the attention scores, to produce the output for each token.

A

b. Q, K, and V (Query, Key, Value) are vectors derived from the input embeddings for each token.
c. The dot product of a query (Q) and a key (K) determines an attention score, indicating how much focus to place on another token.
d. The value vectors (V) are averaged, weighted by the attention scores, to produce the output for each token.
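
A minimal NumPy sketch of scaled dot-product self-attention exactly as written in the question; the token count and dimensions are toy values:

```python
import numpy as np

def self_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # query-key scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                                       # weighted average of values

T, d_k = 4, 8                                                # 4 tokens, dimension 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(T, d_k)) for _ in range(3))
print(self_attention(Q, K, V).shape)                         # (4, 8): one output per token
```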

20
Q

**Which of the following are parts of a Markov decision process?
a. A set of possible states.
b. A set of possible actions.
c. A set of possible rewards.
d. A set of possible state transitions.
e. A policy function.

A

a. A set of possible states.
b. A set of possible actions.

21
Q

*Which of the following statements are true regarding a loss function?
a. It measures how close the prediction is to the true target.
b. Typically, the lower the loss, the better the prediction.
c. The choice of loss function depends on the task (e.g., Cross-Entropy for classification, MSE for regression).
d. A loss of 0 indicates a perfect prediction on the given sample.

A

a. It measures how close the prediction is to the true target.
b. Typically, the lower the loss, the better the prediction.
c. The choice of loss function depends on the task (e.g., Cross-Entropy for classification, MSE for regression).
d. A loss of 0 indicates a perfect prediction on the given sample.

22
Q

What is a ‘receptive field’ in a Convolutional Neural Network?
a. The entire input image that a neuron in the output layer can ‘see’.
b. The specific region of the input volume that a particular neuron in a convolutional layer is connected to.
c. As we go deeper into the network, the effective receptive field of the neurons increases.
d. It is a learnable parameter of the convolutional layer.

A

b. The specific region of the input volume that a particular neuron in a convolutional layer is connected to.
c. As we go deeper into the network, the effective receptive field of the neurons increases.

23
Q

**Back-propagation through time (BPTT) …
a. … is commonly used to train an RNN.
b. … generates a computational graph that is (potentially) very deep, depending on sequence length.
c. … is the process of calculating the gradient of the loss with respect to an RNN’s weights.
d. … mitigates the vanishing gradient problem by design.

A

a. … is commonly used to train an RNN.
b. … generates a computational graph that is (potentially) very deep, depending on sequence length.
c. … is the process of calculating the gradient of the loss with respect to an RNN’s weights.

24
Q

What is the primary advantage of the Transformer architecture over RNNs for language tasks?
a. Transformers process the entire input sequence at once, allowing for massive parallelization during training.
b. The self-attention mechanism allows the model to directly weigh the importance of all other words in the sequence when processing a given word.
c. Transformers are inherently better at handling short sequences than RNNs.
d. Transformers have fewer parameters than LSTMs, making them faster to train.

A

a. Transformers process the entire input sequence at once, allowing for massive parallelization during training.
b. The self-attention mechanism allows the model to directly weigh the importance of all other words in the sequence when processing a given word.

25
Q

What does it mean for a dataset to be 'i.i.d.' (independently and identically distributed)?
a. Each data sample is drawn from a different probability distribution.
b. Each data sample has the same probability distribution as the others, and all samples are mutually independent.
c. This assumption is crucial for the test set to be a good estimator of the generalization error.
d. Sequential data, like time series, generally violates the 'i.i.d.' assumption.

A

b. Each data sample has the same probability distribution as the others, and all samples are mutually independent.
c. This assumption is crucial for the test set to be a good estimator of the generalization error.
d. Sequential data, like time series, generally violates the 'i.i.d.' assumption.

26
Q

What is the primary function of the decoder in an RNN-based language model?
a. To convert an input word into a dense vector embedding.
b. To process the sequence of embeddings and maintain a hidden state.
c. To take the final hidden state from the RNN and map it to a probability distribution over the entire vocabulary.
d. To select the most likely next word from the probability distribution.

A

c. To take the final hidden state from the RNN and map it to a probability distribution over the entire vocabulary.

27
Q

**Which of the following statements is/are true about language models?
a. Language modeling is the task of predicting a word given a context.
b. During training of an RNN language model, at each timestep, the decoder provides the probability distribution of the next word over the entire vocabulary.
c. Perplexity is a common metric to evaluate language models, where lower is better.
d. They can only be built using Recurrent Neural Networks.

A

a. Language modeling is the task of predicting a word given a context.
b. During training of an RNN language model, at each timestep, the decoder provides the probability distribution of the next word over the entire vocabulary.
c. Perplexity is a common metric to evaluate language models, where lower is better.

28
Q

Why is a non-linear activation function crucial in a multi-layer neural network?
a. Without non-linearity, stacking multiple layers is equivalent to a single linear layer.
b. They make the backpropagation process computationally cheaper.
c. They allow the network to learn complex, non-linear relationships in the data.
d. They ensure that the output of the network is always between 0 and 1.

A

a. Without non-linearity, stacking multiple layers is equivalent to a single linear layer.
c. They allow the network to learn complex, non-linear relationships in the data.

29
Q

Which statement best describes the difference between a model's parameters and its hyperparameters?
a. Parameters are learned from the data (e.g., weights), while hyperparameters are set by the user before training (e.g., learning rate).
b. Hyperparameters are learned from the data, while parameters are set by the user.
c. Both are learned from the data, but hyperparameters are in the final layer.
d. There is no fundamental difference; the terms are interchangeable.

A

a. Parameters are learned from the data (e.g., weights), while hyperparameters are set by the user before training (e.g., learning rate).

30
Q

What is the main purpose of dimensionality reduction techniques like PCA and t-SNE?
a. To increase the number of features in a dataset.
b. To visualize high-dimensional data in a lower-dimensional space (e.g., 2D or 3D).
c. To reduce the computational complexity of machine learning algorithms.
d. To preserve as much of the data's variance or structure as possible while reducing dimensions.
e. To always improve the performance of a classifier.

A

b. To visualize high-dimensional data in a lower-dimensional space (e.g., 2D or 3D).
c. To reduce the computational complexity of machine learning algorithms.
d. To preserve as much of the data's variance or structure as possible while reducing dimensions.

31
Q

**Which of the following statements are true about data augmentation?
a. It can help prevent overfitting.
b. It might increase the robustness of a model.
c. Some data augmentation techniques (e.g., adding noise) can be applied to various data types.
d. It always improves model performance.

A

a. It can help prevent overfitting.
b. It might increase the robustness of a model.
c. Some data augmentation techniques (e.g., adding noise) can be applied to various data types.

32
Q

In the context of the basic data analysis workflow, why is a held-out test set important?
a. It is used to tune the model's hyperparameters.
b. It provides an unbiased estimate of the model's performance on unseen data (generalization error).
c. The model should only be evaluated on the test set once, after all training and hyperparameter tuning is complete.
d. Repeatedly evaluating on the test set and choosing the model that performs best on it can lead to overfitting to the test set.

A

b. It provides an unbiased estimate of the model's performance on unseen data (generalization error).
c. The model should only be evaluated on the test set once, after all training and hyperparameter tuning is complete.
d. Repeatedly evaluating on the test set and choosing the model that performs best on it can lead to overfitting to the test set.

33
Q

What does 'overfitting' mean in the context of supervised learning?
a. The model performs very well on the training data but poorly on unseen test data.
b. The model has learned the noise and specific artifacts of the training data, rather than the underlying general pattern.
c. A highly complex model is more prone to overfitting than a simpler model.
d. Overfitting can be detected by comparing the performance on the training set with the performance on a validation or test set.

A

a. The model performs very well on the training data but poorly on unseen test data.
b. The model has learned the noise and specific artifacts of the training data, rather than the underlying general pattern.
c. A highly complex model is more prone to overfitting than a simpler model.
d. Overfitting can be detected by comparing the performance on the training set with the performance on a validation or test set.

34
Q

Why might a learning rate schedule be beneficial during training?
a. Using a large learning rate initially can help escape sharp, poor local minima and make faster progress.
b. Decreasing the learning rate over time allows the model to settle into a good minimum more carefully.
c. It helps to prevent the training from getting stuck with a learning rate that is too high or too low.
d. A fixed learning rate is always optimal for all stages of training.

A

a. Using a large learning rate initially can help escape sharp, poor local minima and make faster progress.
b. Decreasing the learning rate over time allows the model to settle into a good minimum more carefully.
c. It helps to prevent the training from getting stuck with a learning rate that is too high or too low.

35
Q

In the context of the basic data analysis workflow, what is 'preprocessing'?
a. The final step where the model's answer is interpreted.
b. An initial step that involves cleaning, normalizing, and transforming the raw data to make it suitable for a machine learning model.
c. The process of choosing the model class (e.g., SVM, Neural Network).
d. The process of training the model on the data.

A

b. An initial step that involves cleaning, normalizing, and transforming the raw data to make it suitable for a machine learning model.

36
Q

**You are given the derivative of the logistic/sigmoid function σ'(x) and the following computation: y = σ'(10) * σ'(-10). Which of the following statements is/are correct?
a. y is very close to 0.
b. y is in the range (0, 1/16].
c. y is a positive value.
d. y is exactly 0.

A

a. y is very close to 0.
b. y is in the range (0, 1/16].
c. y is a positive value.

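A quick NumPy check of the computation in the question:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # maximum value 1/4, reached at x = 0

y = sigmoid_prime(10) * sigmoid_prime(-10)
print(y)                   # ~2.1e-09: positive and tiny, but not exactly 0
```
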
37
Q

In Reinforcement Learning, what is the difference between model-based and model-free approaches?
a. Model-free methods (like Q-learning) learn a policy or value function directly from experience without explicitly learning the environment's dynamics (transition probabilities and rewards).
b. Model-based methods first try to learn a model of the environment and then use this model for planning (e.g., by simulating future trajectories).
c. Model-based methods are generally more sample-efficient but can suffer if the learned model of the environment is inaccurate.
d. There is no difference; all RL algorithms learn a model of the environment.

A

a. Model-free methods (like Q-learning) learn a policy or value function directly from experience without explicitly learning the environment's dynamics (transition probabilities and rewards).
b. Model-based methods first try to learn a model of the environment and then use this model for planning (e.g., by simulating future trajectories).
c. Model-based methods are generally more sample-efficient but can suffer if the learned model of the environment is inaccurate.

38
Q

Which of the following statements about word embeddings are true?
a. They represent words as dense, low-dimensional vectors.
b. Words with similar meanings are expected to have similar vectors (i.e., be close in the vector space).
c. Cosine similarity is a common metric to measure the similarity between two word vectors.
d. One-hot encoding is a form of dense word embedding.

A

a. They represent words as dense, low-dimensional vectors.
b. Words with similar meanings are expected to have similar vectors (i.e., be close in the vector space).
c. Cosine similarity is a common metric to measure the similarity between two word vectors.

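A minimal sketch of option c; the two 'embedding' vectors are invented for illustration:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 1 means same direction."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

king = np.array([0.80, 0.65, 0.10])    # made-up 'embeddings'
queen = np.array([0.75, 0.70, 0.12])
print(cosine_similarity(king, queen))  # close to 1 for similar vectors
```
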
39
Q

In a Transformer model, what is the purpose of positional encodings?
a. To introduce non-linearity into the model.
b. Since the self-attention mechanism does not inherently process sequential order, positional encodings are added to the input embeddings to give the model information about the position of tokens in the sequence.
c. They are learnable parameters that are updated during training.
d. They are fixed vectors calculated using sine and cosine functions of different frequencies.

A

b. Since the self-attention mechanism does not inherently process sequential order, positional encodings are added to the input embeddings to give the model information about the position of tokens in the sequence.
d. They are fixed vectors calculated using sine and cosine functions of different frequencies.

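A sketch of the fixed sinusoidal encodings from option d, following the standard sin/cos construction; an even d_model is assumed:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed encodings added to token embeddings; not learned during training."""
    pos = np.arange(seq_len)[:, None]             # token position
    i = np.arange(d_model // 2)[None, :]          # frequency index
    angles = pos / (10000 ** (2 * i / d_model))   # lower dims -> higher frequency
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                  # odd dimensions: cosine
    return pe

print(sinusoidal_positional_encoding(seq_len=10, d_model=16).shape)  # (10, 16)
```
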
40
Q

Which of the following are potential solutions to the vanishing gradient problem in deep networks?
a. Using ReLU or its variants (like Leaky ReLU) as activation functions.
b. Using residual connections (as in ResNets).
c. Using Batch Normalization.
d. Using a very small, fixed learning rate.
e. Using LSTM or GRU cells in recurrent networks.

A

a. Using ReLU or its variants (like Leaky ReLU) as activation functions.
b. Using residual connections (as in ResNets).
c. Using Batch Normalization.
e. Using LSTM or GRU cells in recurrent networks.

41
Q

**The 3D structure of a molecule ...
a. ... determines its functionality to a high degree.
b. ... is straightforward to predict if the sequence of atoms is known.
c. ... is expensive to determine experimentally.
d. ... can be different for the same molecular graph.
e. ... is unambiguously determined by a molecular graph.

A

a. ... determines its functionality to a high degree.
c. ... is expensive to determine experimentally.
d. ... can be different for the same molecular graph.

42
Q

Which of the following are unsupervised learning tasks?
a. k-Means Clustering
b. Linear Regression
c. Principal Component Analysis (PCA)
d. Classifying images of cats and dogs.
e. Density Based Clustering (DBSCAN)

A

a. k-Means Clustering
c. Principal Component Analysis (PCA)
e. Density Based Clustering (DBSCAN)

43
Q

The protein folding problem, famously tackled by AlphaFold, is about:
a. Predicting the amino acid sequence of a protein from its 3D structure.
b. Predicting the 3D structure of a protein from its amino acid sequence.
c. Synthesizing new proteins in a lab.
d. Finding which drugs will bind to a folded protein.

A

b. Predicting the 3D structure of a protein from its amino acid sequence.

44
Q

What is the core principle of weight sharing in Convolutional Neural Networks (CNNs)?
a. Each neuron in a feature map shares its weights with neurons in different feature maps.
b. The same set of weights (kernel/filter) is applied across different locations of the input image.
c. It drastically reduces the number of learnable parameters compared to a fully-connected network.
d. It helps the network to detect the same feature (e.g., an edge) regardless of its position in the image, promoting translation invariance.

A

b. The same set of weights (kernel/filter) is applied across different locations of the input image.
c. It drastically reduces the number of learnable parameters compared to a fully-connected network.
d. It helps the network to detect the same feature (e.g., an edge) regardless of its position in the image, promoting translation invariance.

45
Q

What is the primary motivation for using Batch Normalization?
a. To reduce the effect of internal covariate shift, which is the change in the distribution of network activations due to changes in network parameters during training.
b. To act as a strong regularizer, often reducing the need for Dropout.
c. To allow for the use of much higher learning rates and accelerate training.
d. To ensure all weights in the network remain positive.

A

a. To reduce the effect of internal covariate shift, which is the change in the distribution of network activations due to changes in network parameters during training.
b. To act as a strong regularizer, often reducing the need for Dropout.
c. To allow for the use of much higher learning rates and accelerate training.

46
Q

**Which of the following statements are true about the weights of an RNN?
a. For each timestep, the same (shared) weight matrix is used.
b. The size of the weight matrix is independent of the sequence length.
c. For each timestep, a dedicated, new weight matrix is learned.
d. For sequences of different length, multiple (shared) weight matrices must be learned (one for each unique sequence length).

A

a. For each timestep, the same (shared) weight matrix is used.
b. The size of the weight matrix is independent of the sequence length.

47
Q

What distinguishes Density-Based Spatial Clustering of Applications with Noise (DBSCAN) from k-Means?
a. DBSCAN does not require the user to specify the number of clusters beforehand.
b. DBSCAN can find arbitrarily shaped clusters.
c. DBSCAN can identify points as outliers (noise).
d. DBSCAN is a parametric clustering algorithm.

A

a. DBSCAN does not require the user to specify the number of clusters beforehand.
b. DBSCAN can find arbitrarily shaped clusters.
c. DBSCAN can identify points as outliers (noise).

48
Q

*In regression, ...
a. ... the target value is numeric.
b. ... we are dealing with a supervised learning scenario.
c. ... Mean Squared Error (MSE) is a common loss function.
d. ... the output of the model is a continuous value.

A

a. ... the target value is numeric.
b. ... we are dealing with a supervised learning scenario.
c. ... Mean Squared Error (MSE) is a common loss function.
d. ... the output of the model is a continuous value.

49
Q

What is the 'exploration vs. exploitation' trade-off in reinforcement learning?
a. The dilemma of choosing between actions that have been effective in the past (exploitation) and trying new actions to discover potentially better rewards (exploration).
b. An epsilon-greedy strategy is a common way to manage this trade-off.
c. Pure exploitation can lead to getting stuck in a suboptimal policy.
d. Pure exploration means acting randomly and may prevent the agent from learning an effective policy.

A

a. The dilemma of choosing between actions that have been effective in the past (exploitation) and trying new actions to discover potentially better rewards (exploration).
b. An epsilon-greedy strategy is a common way to manage this trade-off.
c. Pure exploitation can lead to getting stuck in a suboptimal policy.
d. Pure exploration means acting randomly and may prevent the agent from learning an effective policy.

50
Q

*Which of the following statements are true about text generation with language models?
a. Given a starting word, always choosing the next word with the highest predicted probability will always yield the same generated text.
b. Sampling from the predicted output probability distribution can introduce diversity in the generated text.
c. A "temperature" parameter can be used to control the randomness of the sampling process.
d. The language model needs to be retrained for every new starting word.

A

a. Given a starting word, always choosing the next word with the highest predicted probability will always yield the same generated text.
b. Sampling from the predicted output probability distribution can introduce diversity in the generated text.
c. A "temperature" parameter can be used to control the randomness of the sampling process.

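A small NumPy sketch of sampling with a temperature parameter (options b and c); the next-word logits are made up:

```python
import numpy as np

def sample_next_word(logits, temperature=1.0, seed=None):
    """Sample an index from softmax(logits / T); lower T -> less random."""
    rng = np.random.default_rng(seed)
    z = logits / temperature
    z = z - z.max()                        # numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return rng.choice(len(probs), p=probs)

logits = np.array([2.0, 1.0, 0.5, 0.1])    # hypothetical next-word scores
print(sample_next_word(logits, temperature=0.7, seed=0))
```

As the temperature approaches 0 this converges to always picking the argmax (option a); large temperatures flatten the distribution toward uniform sampling.
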
51
Q

*Which of the following statements are true about many-to-many RNN types?
a. Given an input of length Tx, the output/prediction length can be equal, i.e., Tx = Ty.
b. Given an input of length Tx, the output/prediction length can be unequal, i.e., Tx != Ty.
c. Machine translation is an example where Tx and Ty can be different.
d. Named Entity Recognition is an example where Tx and Ty are typically equal.

A

a. Given an input of length Tx, the output/prediction length can be equal, i.e., Tx = Ty.
b. Given an input of length Tx, the output/prediction length can be unequal, i.e., Tx != Ty.
c. Machine translation is an example where Tx and Ty can be different.
d. Named Entity Recognition is an example where Tx and Ty are typically equal.

52
Q

What problem does subword tokenization (like BPE or SentencePiece) solve?
a. It helps handle rare or out-of-vocabulary (OOV) words by breaking them down into smaller, known units.
b. It allows the model to have a fixed-size vocabulary while still being able to represent an open-ended set of words.
c. It can capture morphological similarities between words (e.g., 'running' and 'ran' might share the subword 'run').
d. It always results in longer sequences than word-level tokenization.

A

a. It helps handle rare or out-of-vocabulary (OOV) words by breaking them down into smaller, known units.
b. It allows the model to have a fixed-size vocabulary while still being able to represent an open-ended set of words.
c. It can capture morphological similarities between words (e.g., 'running' and 'ran' might share the subword 'run').

53
Q

What are molecular fingerprints?
a. A sequence representation of a molecule like SMILES.
b. A method to represent molecular structures as binary vectors.
c. Each bit in the vector typically corresponds to the presence or absence of a specific substructure or chemical feature.
d. They are a form of molecular descriptor used as input for machine learning models.

A

b. A method to represent molecular structures as binary vectors.
c. Each bit in the vector typically corresponds to the presence or absence of a specific substructure or chemical feature.
d. They are a form of molecular descriptor used as input for machine learning models.

54
Q

**In classification, ...
a. ... the target value is a class label.
b. ... we are dealing with a supervised machine learning scenario.
c. ... the target can be a number (e.g., 0, 1, 2 for different classes).
d. ... there can only be two classes.

A

a. ... the target value is a class label.
b. ... we are dealing with a supervised machine learning scenario.
c. ... the target can be a number (e.g., 0, 1, 2 for different classes).

55
Q

**What is the general data flow through an RNN language model?
a. input -> encoder -> RNN -> decoder
b. input -> decoder -> RNN -> encoder
c. input -> RNN -> encoder -> decoder
d. input -> encoder -> decoder -> RNN

A

a. input -> encoder -> RNN -> decoder

56
Q

**Which of the following statements is/are true about the ReLU activation function?
a. Its derivative is 0 for negative inputs.
b. Its derivative is 1 for positive inputs.
c. It can help mitigate the vanishing gradient problem.
d. For positive inputs, ReLU equates to the identity function.
e. It is computationally more expensive than the sigmoid function.

A

a. Its derivative is 0 for negative inputs.
b. Its derivative is 1 for positive inputs.
c. It can help mitigate the vanishing gradient problem.
d. For positive inputs, ReLU equates to the identity function.

57
Q

**A non-convex function...
a. ... is common in deep learning loss landscapes.
b. ... often requires iterative methods like gradient descent to find a minimum.
c. ... might have several local minima.
d. ... typically does not have a closed-form solution for finding the minimum.

A

a. ... is common in deep learning loss landscapes.
b. ... often requires iterative methods like gradient descent to find a minimum.
c. ... might have several local minima.
d. ... typically does not have a closed-form solution for finding the minimum.

58
Q

What are 'residual connections' as used in networks like ResNet?
a. They are connections that add the input of a layer (or block of layers) to its output.
b. They create a 'shortcut' for the gradient to flow through during backpropagation.
c. They help in training very deep neural networks by mitigating the vanishing gradient problem.
d. They are a form of data augmentation.

A

a. They are connections that add the input of a layer (or block of layers) to its output.
b. They create a 'shortcut' for the gradient to flow through during backpropagation.
c. They help in training very deep neural networks by mitigating the vanishing gradient problem.

59
Q

**Which of the following statements are true about virtual screening (VS)?
a. In VS, a trained neural network predicts whether a molecule from a database is likely to bind/react with the target of interest.
b. VS helps to narrow down the number of candidates for expensive experimental testing (assays).
c. It is a computational technique used in the early stages of drug discovery.
d. VS relies on the QSAR principle.

A

a. In VS, a trained neural network predicts whether a molecule from a database is likely to bind/react with the target of interest.
b. VS helps to narrow down the number of candidates for expensive experimental testing (assays).
c. It is a computational technique used in the early stages of drug discovery.
d. VS relies on the QSAR principle.

60
Q

*Which of the following statements are true about a standard, single-layer RNN?
a. The input sequence must be fed timestep by timestep to the RNN.
b. The hidden state is updated at each timestep.
c. It can struggle with long-term dependencies due to vanishing gradients.
d. Unrolling means representing the recurrent computation over time as a deep feed-forward network.

A

a. The input sequence must be fed timestep by timestep to the RNN.
b. The hidden state is updated at each timestep.
c. It can struggle with long-term dependencies due to vanishing gradients.
d. Unrolling means representing the recurrent computation over time as a deep feed-forward network.

61
Q

In an LSTM, what is the role of the forget gate?
a. It decides what new information to store in the cell state.
b. It decides what information from the previous cell state should be discarded.
c. It decides what part of the cell state should be used to compute the hidden state output.
d. It is responsible for resetting the entire cell state to zero at every timestep.

A

b. It decides what information from the previous cell state should be discarded.

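A toy NumPy sketch of one forget-gate step, f_t = sigmoid(W_f [h_prev; x_t] + b_f); all shapes and weight values are invented:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

h_prev = np.array([0.1, -0.3])            # previous hidden state (size 2)
x_t = np.array([0.7, 0.2, -0.5])          # current input (size 3)
W_f = np.full((2, 5), 0.1)                # forget-gate weights (assumed values)
b_f = np.zeros(2)

f_t = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)  # gate values in (0, 1)
c_prev = np.array([1.0, -2.0])            # previous cell state
print(f_t * c_prev)                       # entries near 0 are discarded, near 1 kept
```
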
62
Q

*Which of the following activation functions are prone to the vanishing gradient problem?
a. Sigmoid
b. Tanh
c. ReLU
d. Leaky ReLU

A

a. Sigmoid
b. Tanh

63
Q

*Which of the following activation functions can be used to mitigate the vanishing gradient problem?
a. ReLU
b. Leaky ReLU
c. Sigmoid
d. SELU

A

a. ReLU
b. Leaky ReLU
d. SELU

64
Q

In the context of the RGB color model, what does an 8-bit color depth signify?
a. Each of the three channels (Red, Green, Blue) can take on one of 8 possible values.
b. The total number of possible colors is 8.
c. Each pixel is represented by 8 bits in total.
d. Each of the three channels (Red, Green, Blue) is represented by 8 bits, allowing for 256 different intensity values per channel.

A

d. Each of the three channels (Red, Green, Blue) is represented by 8 bits, allowing for 256 different intensity values per channel.

65
Q

When training a neural network, what does a single 'epoch' refer to?
a. A single pass of the gradient descent algorithm over one batch of data.
b. One complete forward and backward pass for a single training example.
c. One complete pass through the entire training dataset.
d. The process of evaluating the model on the test set.

A

c. One complete pass through the entire training dataset.

66
Q

*Which of the following statements are true when comparing an LSTM to a standard recurrent neural network?
a. An LSTM is better able to learn long-term dependencies.
b. An LSTM is less prone to the vanishing gradient problem.
c. An LSTM has more model parameters.
d. An LSTM is computationally more expensive.

A

a. An LSTM is better able to learn long-term dependencies.
b. An LSTM is less prone to the vanishing gradient problem.
c. An LSTM has more model parameters.
d. An LSTM is computationally more expensive.

67
Q

**Which of the following statements is/are true about gating/gates in an LSTM?
a. Gating is used to control what to read/write/forget in memory.
b. Typically, the sigmoid function is used as the activation for gates to produce values between 0 and 1.
c. Gates depend on the current input and the hidden state from the previous timestep.
d. The forget gate was part of the original LSTM design by Hochreiter & Schmidhuber.

A

a. Gating is used to control what to read/write/forget in memory.
b. Typically, the sigmoid function is used as the activation for gates to produce values between 0 and 1.
c. Gates depend on the current input and the hidden state from the previous timestep.

68
Q

Which of the following statements correctly describe k-Means clustering?
a. It is a supervised learning algorithm.
b. The user must specify the number of clusters (k) beforehand.
c. The algorithm iteratively assigns data points to the nearest cluster centroid and then recalculates the centroid.
d. It is guaranteed to find the globally optimal clustering.
e. It is a non-parametric clustering algorithm.

A

b. The user must specify the number of clusters (k) beforehand.
c. The algorithm iteratively assigns data points to the nearest cluster centroid and then recalculates the centroid.
e. It is a non-parametric clustering algorithm.

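A toy NumPy sketch of the assign/recompute loop from option c; the data and initialization are invented, and empty clusters are not handled:

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Assign points to the nearest centroid, then recompute centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)                         # assignment step
        centroids = np.array([X[labels == j].mean(axis=0)     # update step
                              for j in range(k)])
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, size=(20, 2)) for c in (0.0, 5.0)])  # two blobs
labels, centroids = kmeans(X, k=2)
print(centroids)   # roughly (0, 0) and (5, 5); a local, not guaranteed global, optimum
```
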
69
Q

**Which of the following statements are true about sequence data input and standard, feed-forward neural networks?
a. The relation of elements in the sequence is lost.
b. Sequences with variable lengths must be preprocessed (e.g., padded or truncated) to have a fixed size.
c. A sliding window approach can be used to capture some local sequential information.
d. A feed-forward network treats the input as an unordered set of features.

A

a. The relation of elements in the sequence is lost.
b. Sequences with variable lengths must be preprocessed (e.g., padded or truncated) to have a fixed size.
c. A sliding window approach can be used to capture some local sequential information.

70
Q

How does Dropout work as a regularization technique?
a. During training, it permanently removes a fixed percentage of neurons from the network.
b. During training, it randomly sets the output of a fraction of neurons to zero for each training example.
c. It forces the network to learn more robust features that are not dependent on the presence of specific other neurons.
d. At test time, all neurons are used, but their weights are scaled down to compensate for the effect of dropout during training.

A

b. During training, it randomly sets the output of a fraction of neurons to zero for each training example.
c. It forces the network to learn more robust features that are not dependent on the presence of specific other neurons.
d. At test time, all neurons are used, but their weights are scaled down to compensate for the effect of dropout during training.

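A sketch of dropout per options b and c. Note it uses the 'inverted' variant common in modern frameworks, which rescales surviving activations during training instead of scaling weights at test time (option d describes the classic formulation):

```python
import numpy as np

def dropout(activations, p_drop=0.5, training=True, seed=None):
    """Zero a random fraction of units during training; identity at test time."""
    if not training:
        return activations                       # test time: all neurons active
    rng = np.random.default_rng(seed)
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)   # rescale so expectation matches

a = np.ones(8)
print(dropout(a, p_drop=0.5, seed=0))
```
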
71
Q

**Which of the following statements are true about recurrent neural networks?
a. They are suitable for processing sequences of variable length.
b. They are suitable for processing sequences of fixed length.
c. They incorporate information from previous timesteps/positions of a sequence via the hidden state.
d. The input sequence must be fed timestep by timestep to the RNN.

A

a. They are suitable for processing sequences of variable length.
b. They are suitable for processing sequences of fixed length.
c. They incorporate information from previous timesteps/positions of a sequence via the hidden state.
d. The input sequence must be fed timestep by timestep to the RNN.

72
Q

Which of these statements accurately describes the difference between L1 and L2 regularization?
a. L1 regularization adds the sum of the absolute values of the weights to the loss function.
b. L2 regularization adds the sum of the squared values of the weights to the loss function.
c. L1 regularization can lead to sparse models, where some weights become exactly zero.
d. L2 regularization is also known as 'weight decay'.

A

a. L1 regularization adds the sum of the absolute values of the weights to the loss function.
b. L2 regularization adds the sum of the squared values of the weights to the loss function.
c. L1 regularization can lead to sparse models, where some weights become exactly zero.
d. L2 regularization is also known as 'weight decay'.

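A minimal sketch adding the penalties from options a and b to a base loss; the weights and loss value are invented:

```python
import numpy as np

def regularized_loss(data_loss, weights, l1=0.0, l2=0.0):
    """data_loss + l1 * sum(|w|) + l2 * sum(w^2) over all weight arrays."""
    w = np.concatenate([p.ravel() for p in weights])
    return data_loss + l1 * np.abs(w).sum() + l2 * np.square(w).sum()

weights = [np.array([[0.5, -1.2], [0.0, 2.0]])]
print(regularized_loss(0.8, weights, l1=0.01))  # L1: encourages sparse weights
print(regularized_loss(0.8, weights, l2=0.01))  # L2: 'weight decay'
```
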
73
Q

What is Transfer Learning in the context of deep learning?
a. Training a model from scratch on a new, small dataset.
b. Taking a model pre-trained on a large dataset (e.g., ImageNet) and adapting it for a new, related task.
c. Freezing the weights of the early convolutional layers and only training the final classification layers on the new task.
d. It is a technique that generally requires less data and less computation time for the new task compared to training from scratch.

A

b. Taking a model pre-trained on a large dataset (e.g., ImageNet) and adapting it for a new, related task.
c. Freezing the weights of the early convolutional layers and only training the final classification layers on the new task.
d. It is a technique that generally requires less data and less computation time for the new task compared to training from scratch.

74
Q

**Which of the following statements are true about the concept of an optimal policy in Reinforcement Learning?
a. It maximizes the sum of rewards over an episode.
b. There can be multiple optimal policies for a single MDP.
c. It can be found using Q-learning.
d. An optimal policy is one that always chooses the action with the highest immediate reward.

A

a. It maximizes the sum of rewards over an episode.
b. There can be multiple optimal policies for a single MDP.
c. It can be found using Q-learning.