Machine Learning Flashcards

(200 cards)

1
Q

Which of the following are advantages of transformers over a recurrent sequence model?
a) better at learning long-range dependencies
b) slower to train and run on modern hardware
c) require many fewer parameters to achieve similar results
d) none of the above

A

a) better at learning long-range dependencies

2
Q

Which of these parts of the self-attention operation are calculated by passing inputs through an MLP?
a) values
b) keys
c) queries
d) all the above

A

d) all the above
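
As an aside on this card: in standard self-attention, queries, keys, and values are all produced by passing the same inputs through learned linear projections (the minimal MLPs the question refers to). A toy numpy sketch with made-up dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                    # toy sizes, chosen for illustration
X = rng.normal(size=(seq_len, d_model))    # input token representations

# learned projection matrices (random stand-ins here)
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q, K, V = X @ W_q, X @ W_k, X @ W_v        # queries, keys, values

# scaled dot-product attention
scores = Q @ K.T / np.sqrt(d_model)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
output = weights @ V
print(output.shape)                        # (4, 8)
```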

3
Q

Natural language processing (NLP) is a field of _____
a) computer science
b) artificial intelligence
c) linguistics
d) all of the mentioned

A

d) all of the mentioned

4
Q

What is the main challenge of NLP?
a) handling ambiguity of sentences
b) handling tokenization
c) handling POS tagging
d) All of the mentioned

A

a) handling ambiguity of sentences

5
Q

What is machine translation?
a) Converts one human language to another
b) Converts human language to machine language
c) Converts any human language to English
d) Converts machine language to human language

A

a) Converts one human language to another

6
Q

Choose the areas where NLP can be useful.
a) automatic text summarization
b) automatic question-answering systems
c) information retrieval
d) all the mentioned

A

d) all the mentioned

7
Q

Which of the following properties will a good position encoding ideally have?
a) unique for all positions
b) relative distances are independent of absolute sequence position
c) well-defined for arbitrary sequence lengths
d) all the above

A

d) all the above
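
One encoding that satisfies all three properties is the sinusoidal scheme from the original Transformer paper; a minimal sketch (dimensions are illustrative):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(50, 16)
print(pe.shape)   # unique per position, and defined for arbitrary lengths
```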

8
Q

Which of the following includes the major tasks of NLP?
a) automatic summarization
b) discourse analysis
c) machine translation
d) all the mentioned

A

d) all the mentioned

9
Q

Neural machine translation was based on encoder-decoder _____
a) RNNs
b) LSTMs
c) both a & b
d) neither a nor b

A

c) both a & b

10
Q

The encoder LSTM is used to process the _____ sentence.
a) input
b) output
c) function
d) All the above

A

a) input

11
Q

What type of neural network is an autoencoder?
a) Supervised neural network
b) unsupervised neural network
c) semi-supervised neural network
d) reinforcement neural network

A

b) unsupervised neural network

12
Q

What type of data can the autoencoder apply dimensionality reduction to?
a) linear data
b) nonlinear data
c) both a & b
d) none of the above

A

c) both a & b

13
Q

A module that compresses data into an encoded representation that is typically several orders of magnitude smaller than the input data.
a) The encoder
b) Bottleneck
c) The decoder
d) None of the above

A

a) The encoder

14
Q

A module that contains the compressed knowledge representation and is considered the most important part of the autoencoder network.
a) the encoder
b) bottleneck
c) the decoder
d) None of the above

A

b) bottleneck

15
Q

A module that helps the network “decompress” the knowledge representations and reconstructs the data back from its encoded form.
a) input layer
b) bottleneck
c) output layer
d) none of the above

A

c) output layer

16
Q

What type of autoencoder works by penalizing the activation of some neurons in hidden layers?
a) Sparse autoencoder
b) Variational autoencoder
c) Deep autoencoder
d) Convolution autoencoders

A

a) Sparse autoencoder
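
A hedged sketch of the idea behind this card: the sparsity penalty is added to the reconstruction loss. The classic formulation uses a KL-divergence penalty on average activations; the L1 variant below is a common simplification, and `lam` is an illustrative coefficient:

```python
import numpy as np

def sparse_autoencoder_loss(x, x_hat, hidden, lam=1e-3):
    """Reconstruction error plus an L1 penalty on hidden activations."""
    reconstruction = np.mean((x - x_hat) ** 2)
    sparsity = lam * np.sum(np.abs(hidden))  # penalizes neurons for being active
    return reconstruction + sparsity

x = np.array([1.0, 0.0, 1.0])
x_hat = np.array([0.9, 0.1, 0.8])
hidden = np.array([0.0, 0.7, 0.0, 0.1])  # mostly inactive, as the penalty encourages
print(sparse_autoencoder_loss(x, x_hat, hidden))
```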

17
Q

Which of the following is done by a deep autoencoder?
a) image reconstruction
b) image colorization
c) image search
d) image denoising

A

c) image search

18
Q

Which of the following is done by a convolution autoencoder?
a) data compression
b) image search
c) information retrieval
d) image colorization

A

d) image colorization

19
Q

Which of the following is an autoencoder application?
a) watermark removing
b) dimensionality reduction
c) image generation
d) all the above

A

d) all the above

20
Q

Which autoencoder doesn’t require reducing the number of bottleneck nodes?
a) sparse autoencoder
b) deep autoencoder
c) variational autoencoder
d) None of the above

A

a) sparse autoencoder

21
Q

In NLP, bidirectional context is supported by which of the following embeddings?
a) word2vec
b) BERT
c) GloVe
d) all the above

A

b) BERT

22
Q

For a given token, its input representation is the sum of the token, segment, and position embeddings in _____
a) ELMO
b) GPT
c) BERT
d) none of the above

A

c) BERT

23
Q

BERT Base contains _____ encoder layers
a) 12
b) 24
c) 36
d) 48

A

a) 12

24
Q

BERT Large contains _____ encoder layers
a) 12
b) 24
c) 36
d) 48

A

b) 24

25
BERT aims at tackling various NLP tasks such as _____ a) question answering b) language inference c) text summarization d) all of the mentioned
d) all of the mentioned
26
The BERT model is pre-trained on relatively generic tasks a) masked language modeling (MLM) b) next sentence prediction c) a and b d) none of the mentioned
c) a and b
27
_______ is to hide a word in a sentence and then have the program predict what word has been hidden (masked) based on the hidden word's context. a) Masked language modeling (MLM) b) Next sentence prediction c) Sequence classification d) Named entity recognition (NER)
a) Masked language modeling (MLM)
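
A toy illustration of the masking step itself (the predictive model is omitted, and the sentence is made up):

```python
import random

tokens = ["the", "cat", "sat", "on", "the", "mat"]
i = random.randrange(len(tokens))
masked = tokens.copy()
hidden_word, masked[i] = masked[i], "[MASK]"
print(masked)   # e.g. ['the', 'cat', '[MASK]', 'on', 'the', 'mat']
# the model is then trained to predict hidden_word from the surrounding context
```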
28
_______ is to have the program predict whether two given sentences have a logical, sequential connection or whether their relationship is simply random a) Masked language modeling (MLM) b) Next sentence prediction c) Sequence classification d) Named entity recognition (NER)
b) Next sentence prediction
29
BERT can process text ______ a) left-to-right b) right-to-left c) both d) none of the mentioned
c) both
30
BERT was created and published in 2018 by ______ a) Amazon b) Microsoft c) IBM d) Google
d) Google
31
What is the difference between CNN and ANN? a) CNN has one or more layers of convolution units, which receive their input from multiple units. b) CNN uses a simpler algorithm than ANN. c) They complete each other, so to use ANN, you need to start with CNN. d) CNN is the easiest way to use neural networks.
a) CNN has one or more layers of convolution units, which receive their input from multiple units.
32
The data is fed into the model and the output from each layer is obtained; this step is called _____. a) Feed forward b) Feed backward c) Input layer d) Output layer
a) Feed forward
33
How many common types of pooling layers are there? a) 5 b) 2 c) 3 d) 4
b) 2
34
_____ computes the output volume by computing the dot product between all filters and image patches. a) Input layer b) Convolution layer c) Activation function layer d) Pool layer
b) Convolution layer
35
What is back propagation? a) it is another name given to the curvy function in the perceptron b) it is the transmission of error back through the network to adjust the inputs c) it is the transmission of error back through the network to allow weights to be adjusted so that the network can learn d) all of the mentioned
c) it is the transmission of error back through the network to allow weights to be adjusted so that the network can learn
36
Which of the following functions can be used as an activation function in the output layer if we wish to predict the probabilities of n classes (p1, p2, ..., pn) such that the sum of p over all n classes equals 1? a) ReLU b) Sigmoid c) Softmax d) Tanh
c) Softmax
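
A minimal softmax, written to show that the outputs are positive and sum to 1 over the n classes:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)        # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())            # probabilities over the classes, summing to 1
```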
37
Which of the following would have a constant input in each epoch of training a deep learning model? a) Weight between input and hidden layer b) Weight between hidden and output layer c) Biases of all hidden layer neurons d) Activation function of output layer
a) Weight between input and hidden layer
38
Which of the following neural network training challenges can be solved using batch normalization? a) overfitting b) underfitting c) training is too slow d) none of the mentioned
c) training is too slow
39
The number of nodes in the input layer is 10 and in the hidden layer is 5. The maximum number of connections from the input layer to the hidden layer is: a) 50 b) Less than 50 c) More than 50 d) None of the mentioned
a) 50
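
The count follows from the shape of the fully connected weight matrix; a quick check:

```python
n_input, n_hidden = 10, 5
connections = n_input * n_hidden   # one weight per input-hidden pair
print(connections)                 # 50
```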
40
Is Deep Learning a specialized subset of machine learning? a) true b) false
a) true
41
_____ are models used to generate data similar to the data on which they are trained, by destroying training data through the successive addition of Gaussian noise and then learning to recover the data by reversing this noising process. a) Federated learning. b) Attention learning. c) CNN. d) Diffusion models.
d) Diffusion models.
42
What is the goal of training a diffusion model? a) Learn the reverse process b) Learn to understand the image c) Extract the image features d) Classify the images
a) Learn the reverse process
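
A sketch of the forward (noising) process under the usual Gaussian formulation; the model is then trained to reverse it, commonly by predicting the added noise. The schedule values below are illustrative, not from any specific paper:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)      # illustrative variance schedule
alphas_bar = np.cumprod(1.0 - betas)    # cumulative product of (1 - beta_t)

def q_sample(x0, t):
    """Forward process: x_t = sqrt(a_bar_t)*x0 + sqrt(1 - a_bar_t)*noise."""
    noise = rng.normal(size=x0.shape)
    x_t = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * noise
    return x_t, noise

x0 = rng.normal(size=(8, 8))            # stand-in for a training image
x_t, eps = q_sample(x0, t=500)
# training would fit a network eps_theta(x_t, t) to recover eps,
# which is what "learning the reverse process" amounts to in practice
```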
43
One of the benefits of the diffusion model is _____ a) scalability b) not requiring adversarial training c) parallelizability d) all of the above.
d) all of the above.
44
In general, a diffusion model consists of _____ main processes. a) 5 b) 4 c) 3 d) 2
d) 2
45
A diffusion model is trained by finding the reverse Markov transitions that _____ the likelihood of the training data. a) Maximize b) Minimize c) Increase d) Decrease
a) Maximize
46
For the reverse process in the diffusion model, we must choose _____ a) the Sobel filter b) the Laplacian operator c) the thresholding method d) the Gaussian distribution parameterization
d) the Gaussian distribution parameterization
47
The transition distributions in the Markov chain are Gaussian, where the forward process requires a ______, and the reverse process parameters are learned. a) variance schedule b) Laplacian operator c) Gaussian distribution parameterization d) none of the mentioned
a) variance schedule
48
Our diffusion model is parameterized as a Markov chain, meaning that our latent variables depend only on the _____ timestep. a) previous or following b) previous c) following d) none of the mentioned
a) previous or following
49
A _____ is used to obtain log-likelihoods across pixel values as the last step in the reverse diffusion process. a) KL divergence b) simplified training objective c) U-Net-like model d) discrete decoder
d) discrete decoder
50
Diffusion models can be applied to _____ a) image denoising b) super-resolution c) image generation d) all of the above.
d) all of the above.
51
What is the main goal of federated learning? a) to train a single machine learning model on a centralized dataset b) to train multiple machine learning models on decentralized datasets c) to train a single machine learning model on decentralized datasets d) to train multiple machine learning models on a centralized dataset
c) to train a single machine learning model on decentralized datasets
52
How does federated learning differ from traditional machine learning? a) federated learning requires less data b) federated learning requires more computational resources c) federated learning requires less communication bandwidth d) federated learning requires more data privacy concerns
d) federated learning requires more data privacy concerns
53
What is an advantage of federated learning compared to traditional centralized training? a) it is more accurate b) it is faster c) it requires less data d) it allows for decentralized data to be used
d) it allows for decentralized data to be used
54
How is data privacy protected in federated learning? a) data is encrypted before being sent to the centralized server b) data is never shared with any other parties c) data remains on the individual devices and is only used for model training d) data is aggregated and anonymized before being used for model training
c) data remains on the individual devices and is only used for model training
55
In federated learning, who is responsible for training the model? a) a centralized server b) a third-party organization c) individual clients d) the data owner
c) individual clients
56
Key benefits of federated learning are _____ a) it involves more diverse data b) it’s secure c) it yields real-time predictions d) all of the above
d) all of the above
57
What are the challenges of federated learning? a) efficient communication across the federated network. b) managing heterogeneous systems in the same networks. c) privacy concerns and privacy-preserving methods. d) all of the above
d) all of the above
58
How does federated learning work? a) Transfer of weights and biases to cloud server b) Transfer of data to cloud server c) Transfer of model to cloud server d) Transfer of user info to cloud
a) Transfer of weights and biases to cloud server
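
A toy sketch of the server-side step this card describes: clients send weights (not data), and the server aggregates them. This is FedAvg-style weighted averaging, with made-up client arrays and dataset sizes:

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """Average client model weights, weighted by local dataset size."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

# three clients with locally trained weight vectors (stand-ins)
clients = [np.array([0.1, 0.2]), np.array([0.3, 0.1]), np.array([0.2, 0.4])]
sizes = [100, 50, 150]
global_weights = federated_average(clients, sizes)
print(global_weights)   # the raw data never leaves the clients
```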
59
Is federated learning more efficient than standard ML techniques for a large number of devices? a) True b) False c) Depends on use case d) Cannot say
a) True
60
Federated learning is ______ a) Supervised b) Unsupervised c) Reinforcement learning d) None of the above
b) Unsupervised
61
What is the basic concept of a recurrent neural network? a) use a loop between inputs and outputs in order to achieve a better prediction. b) use recurrent features from the dataset to find the best answers. c) use previous inputs to find the next output according to the training set. d) use loops between the most important features to predict the next output.
c) use previous inputs to find the next output according to the training set.
62
Another RNN issue is called 'vanishing gradients'. What is that? a) when the values of a gradient are too small and the model joins in a loop because of that. b) when the values of a gradient are too big and the model stops learning or takes way too long because of that. c) when the values of a gradient are too small and the model stops learning or takes way too long because of that. d) when the values of a gradient are too big and the model joins in a loop because of that.
c) when the values of a gradient are too small and the model stops learning or takes way too long because of that.
63
LSTM. What is that? a) LSTM networks are an extension of recurrent neural networks, which basically extends their memory; therefore, they are well suited to learn from important experiences that have very low time lags in between b) LSTM networks are an extension of recurrent neural networks, which basically extends their memory; therefore, it is not recommended to use them unless you are using a small dataset c) LSTM networks are an extension of recurrent neural networks, which basically extends their memory; therefore, they are well suited to learn from important experiences that have long time lags in between d) LSTM networks are an extension of recurrent neural networks, which basically shortens their memory; therefore, they are well suited to learn from important experiences that have very low time lags in between
c) LSTM networks are an extension of recurrent neural networks, which basically extends their memory; therefore, they are well suited to learn from important experiences that have long time lags in between
64
The network that involves backward links from output to the input and hidden layers is called _________ a) self-organizing maps b) perceptron c) recurrent neural network d) multi-layered perceptron
c) recurrent neural network
65
RNN stands for _____ a) Recurrent neural networks b) Report neural networks c) Receives neural networks d) Recording neural networks
a) Recurrent neural networks
66
What is the activation function used in the forget gate? a) Sigmoid b) Tanh c) RELU d) None of the above
a) Sigmoid
67
How many gates are there in an LSTM? a) 3 b) 5 c) 4 d) 2
a) 3
68
In _____, the points in the dataset are dependent on the other points in the dataset. a) continuous data b) discrete data c) sequential data d) ordinal data
c) sequential data
69
_____ helps to identify important elements that need to be added to the cell state. a) Forget gate b) Input gate c) Output gate d) None of the above
b) Input gate
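
A sketch of how the gates in the LSTM cards above fit together in one step (sigmoid for the gates, tanh for the candidate state); weights and sizes are illustrative stand-ins:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step. W, U, b hold stacked parameters for the f, i, o, g blocks."""
    z = W @ x + U @ h + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)  # forget, input, output gates
    g = np.tanh(g)                                # candidate cell state
    c_new = f * c + i * g    # the input gate selects what is added to the cell state
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(0)
d, n = 3, 4                                       # toy input and hidden sizes
W = rng.normal(size=(4 * n, d))
U = rng.normal(size=(4 * n, n))
b = np.zeros(4 * n)
h, c = lstm_step(rng.normal(size=d), np.zeros(n), np.zeros(n), W, U, b)
print(h.shape, c.shape)                           # (4,) (4,)
```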
70
LSTM is used in _____ a) speech recognition b) music composition c) time series prediction d) all of the above
d) all of the above
71
What should be the aim of the training procedure in a Boltzmann machine of feedback networks? a) to capture inputs b) to feed back the captured outputs c) to capture the behavior of the system d) none of the mentioned
d) none of the mentioned
72
What does a Boltzmann machine consist of? a) a fully connected network with both hidden and visible units b) asynchronous operation c) stochastic updates d) all of the mentioned
d) all of the mentioned
73
By using which method does a Boltzmann machine reduce the effect of additional stable states? a) No such method exists b) Simulated annealing c) Hopfield reduction d) None of the mentioned
b) Simulated annealing
74
For which other task can a Boltzmann machine be used? a) pattern mapping b) feature mapping c) classification d) pattern association
d) pattern association
75
What effect will the presence of false minima have on the probability of error in recall? a) Direct b) Inverse c) No effect d) Direct or inverse
a) Direct
76
What happens when we use the mean field approximation with Boltzmann learning? a) It slows down b) It speeds up c) Nothing happens d) It may speed up or slow down
b) It speeds up
77
In Boltzmann learning, which algorithm can be used to arrive at equilibrium? a) Hopfield b) mean field c) Hebb d) none of the mentioned
d) none of the mentioned
78
The visible units in a restricted Boltzmann machine are not connected to each other. a) True b) False
a) True
79
What are the two layers of a restricted Boltzmann machine called? a) input and output layers b) recurrent and convolution layers c) activation and threshold layers d) hidden and visible layers
d) hidden and visible layers
80
A deep belief network is a stack of restricted Boltzmann machines. a) True b) False
a) True
81
The main and most important feature of an RNN is _________. a) visible state b) hidden state c) present state d) None of these
b) hidden state
81
An RNN remembers each and every piece of information through ________. a) Work b) Time c) Hours d) Memory
b) Time
82
To create a numerical representation of our text-based dataset, we generate two lookup tables. What are they? a) maps characters to numbers b) maps numbers back to characters c) identifies unique characters present in text d) both a & b
d) both a & b
83
_______ occurs when the gradients become very small and tend towards zero. a) Exploding gradients b) Vanishing gradients c) Long short-term memory networks d) Gated recurrent unit networks.
b) Vanishing gradients
84
On what parameters can the change in the weight vector depend? a) learning parameters b) input vector c) learning signal d) all of the mentioned
d) all of the mentioned
85
________ occurs when the gradients become too large due to back-propagation. a) Exploding gradients b) Vanishing gradients c) Long short-term memory networks d) Gated recurrent unit networks
a) Exploding gradients
86
If a competitive network can perform feature mapping, then what can that network be called? a) self-excitatory b) self-inhibitory c) self-organization d) none of the mentioned
c) self-organization
87
Why do we need biological neural networks? a) to solve tasks like machine vision & natural language processing b) to apply heuristic search methods to find solutions of problems c) to make smart, human-interactive & user-friendly systems d) all of the mentioned
d) all of the mentioned
88
What is the auto-association task in neural networks? a) find relation between 2 consecutive inputs b) related to storage & recall task c) predicting the future inputs d) None of the mentioned
b) related to storage & recall task
89
What is unsupervised learning? a) features of the group are explicitly stated b) the number of groups may be known c) neither features nor the number of groups is known d) none of the mentioned
c) neither features nor the number of groups is known
90
XLNet is an ________ language model which outputs the joint probability of a sequence of tokens based on the transformer architecture with recurrence. a) Auto-regressive b) Auto-Negressive c) Objective d) Bidirectional
a) Auto-regressive
91
XLNet is "generalized" because it captures bidirectional context by means of a mechanism called ____ a) PLM b) BERT c) Transformer-XL d) MLM
a) PLM
92
______ keeps track of the position of each token in a sequence. a) pretrain-finetune discrepancy b) transformer-xl c) positional encoding d) segment recurrence
c) positional encoding
93
______ caches the hidden state of the first segment in memory in each layer and updates attention accordingly; it allows reuse of memory for each segment. a) pretrain-finetune discrepancy b) transformer-xl c) positional encoding d) segment recurrence
d) segment recurrence
94
The attention weights determined by a simple feed-forward neural network are ____ a) query b) keys c) values d) all of the above
d) all of the above
95
_____ traditional methods predict the current token given the previous "n" tokens, or predict the current token given all tokens after it. a) Bidirectional b) Masked language modeling c) XLNet d) BERT
a) Bidirectional
96
______ is a neural network architecture that can model bidirectional contexts in text data using transformers. a) BERT b) XLNet c) MLM d) PLM
a) BERT
97
A disadvantage of BERT is that it corrupts the input with _______ and suffers from a pretrain-finetune discrepancy. a) Mask b) PLM c) MLM d) All of the above
a) Mask
98
XLNet is the latest and greatest model to emerge from the booming field of natural language processing (NLP). a) True b) False
a) True
99
XLNet is "generalized". a) True b) False
a) True
100
The attention learning mechanism has changed the way we work with deep learning algorithms. a) true b) false
a) true
101
The advantage of transformers over recurrent sequence models is that they are slower to train and run on modern hardware. a) true b) false
b) false
102
Fields like NLP and Computer Vision have been revolutionized by the attention mechanism a) true b) false
a) true
103
Attention learning is an interface connecting the encoder and decoder that provides the decoder with information. a) true b) false
a) true
104
The encoder LSTM or RNN units produce the words in a sentence one after another. a) true b) false
b) false
105
The encoder reads the input sentence and tries to make sense of it. a) true b) false
a) true
106
The LSTM is supposed to capture long-range dependencies better than the RNN. a) true b) false
a) true
107
RNNs can’t remember longer sentences and sequences. a) true b) false
a) true
108
If the encoder makes a bad summary, the translation will also be bad. a) true b) false
a) true
109
The decoder is used to process the entire input sentence and decode it into a context vector. a) true b) false
b) false
110
Autoencoders belong to supervised neural networks. a) true b) false
b) false
111
The bottleneck is the most important part of the network. a) true b) false
a) true
112
Convolutional autoencoders can do image reconstruction. a) true b) false
a) true
113
A deep autoencoder is composed of two symmetrical deep-belief networks. a) true b) false
a) true
114
Deep autoencoders can’t do image search. a) true b) false
b) false
115
Sparse autoencoders offer us an alternative method for introducing an information bottleneck without requiring a reduction in the number of nodes. a) true b) false
a) true
116
Sparse autoencoders work by penalizing the activation of neurons in the input layer. a) true b) false
b) false
117
Autoencoders can de-noise images. a) true b) false
a) true
118
Autoencoders can’t be used to reduce dimensionality a) true b) false
b) false
119
The encoder is the module that helps the network "decompress" the knowledge representations and reconstructs the data back from its encoded form. a) true b) false
b) false
120
BERT (Bidirectional Encoder Representations from Transformers) is a recent paper published by researchers at Amazon AI Language. a) true b) false
b) false
121
BERT doesn’t read the text input sequentially. a) true b) false
a) true
122
BERT allows transfer learning on existing pretrained models and hence can be custom trained for a specific subject. a) true b) false
a) true
123
In BERT, the relationship between all words in a sentence is modeled irrespective of their position. a) true b) false
a) true
124
BERT uses a unidirectional language model for producing word embeddings. a) true b) false
b) false
125
BERT is not an open-source machine learning framework for NLP. a) true b) false
b) false
126
BERT does not understand human language as it is spoken naturally. a) true b) false
b) false
127
BERT is expected to have large impact on voice search as well as text-based search. a) true b) false
a) true
128
The same word can have multiple word embeddings with BERT. a) True b) False
a) True
129
BERT is a deep bidirectional, supervised language representation a) true b) false
b) false
130
Pooling is an up-sampling operation that reduces the dimensionality of the feature map. a) true b) false
b) false
131
The ReLU operation is applied to each pixel and replaces all the negative pixel values in the feature map with zero. a) true b) false
a) true
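
The ReLU statement above, verified in one line of numpy:

```python
import numpy as np

feature_map = np.array([[-1.0, 2.0], [3.0, -4.0]])
rectified = np.maximum(feature_map, 0.0)   # negatives replaced with zero
print(rectified)                           # [[0. 2.] [3. 0.]]
```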
132
Pooling or spatial pooling layers are also called sub-sampling layers. a) true b) false
a) true
133
Pooling reduces the dimensionality of each feature map while retaining the most important information. a) true b) false
a) true
134
The aim of the fully connected layer is to use the low-level features of the input image produced by convolutional and pooling layers. a) true b) false
b) false
135
The hyperparameters for a pooling layer are filter size, stride, and max or average pooling. a) true b) false
a) true
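
A sketch of max pooling with filter size 2 and stride 2 on a toy feature map (assuming even dimensions for simplicity):

```python
import numpy as np

def max_pool2x2(fm):
    """Max pooling, filter size 2, stride 2 (assumes even dimensions)."""
    h, w = fm.shape
    return fm.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fm = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool2x2(fm))   # 2x2 output keeping the max of each 2x2 patch
```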
136
When we apply a filter of 1×1, then there is no reduction in the size of the image and hence there is no loss of information. a) true b) false
a) true
137
Flattening means that every neuron in the previous layer is connected to each neuron in the next layer. a) true b) false
b) false
138
ReLU introduces linearity to the network, and the generated output is a rectified feature map. a) true b) false
b) false
139
A convolutional layer receives a set of input feature maps (IFM) and generates a set of output feature maps (OFM). a) true b) false
a) true
140
Diffusion models work by destroying training data through the successive addition of Laplacian noise, and then learning to recover the data by reversing this noising process. a) true b) false
b) false
141
A discrete decoder is used to obtain log-likelihoods across pixel values as the last step in the reverse diffusion process. a) true b) false
a) true
142
A diffusion model is a latent variable model which maps to the latent space Sobel using a fixed chain. a) true b) false
b) false
143
The goal of training a diffusion model is to learn the reverse process. a) true b) false
a) true
144
The transition distributions in the Markov chain are Gaussian, which depends only on the forward process. a) true b) false
b) false
145
A diffusion model is parameterized as a Markov chain, meaning that our latent variables x1, ..., xt depend only on the previous (or following) timestep. a) true b) false
a) true
146
For the reverse process in the diffusion model, we must choose a variance schedule. a) true b) false
b) false
147
The transition distributions in the Markov chain are Gaussian, where the forward process requires a variance schedule, and the reverse process parameters are learned. a) true b) false
a) true
148
Cascade diffusion models (like Stable Diffusion) apply the diffusion process on a smaller latent space for computational efficiency, using a variational autoencoder for the up- and down-sampling. a) true b) false
b) false
149
Diffusion models can be applied to image de-noising, inpainting, super-resolution, and image generation. a) true b) false
a) true
150
Federated Learning is not used to improve the privacy and security of machine learning models. a) true b) false
b) false
151
Federated Learning requires the use of a centralized server. a) true b) false
b) false
152
Federated Learning can’t be used to train models on data that is distributed across multiple devices, such as Smartphones or IoT devices. a) true b) false
b) false
153
Federated Learning requires the use of a centralized database. a) true b) false
b) false
154
Federated Learning can’t be used to improve the privacy of machine learning models by keeping sensitive data on individual devices. a) true b) false
b) false
155
Federated Learning is a Type of machine learning that allows multiple parties to train a model without sharing their data. a) true b) false
a) true
156
Federated Learning requires participating devices to have high computational power. a) true b) false
b) false
157
Federated Learning enables participants to train local models cooperatively on local data without disclosing sensitive data to a central cloud server. a) true b) false
a) true
158
Federated Learning can’t be used to train deep learning models. a) true b) false
b) false
159
Federated Learning can be used to train models on data that is distributed across multiple devices in real-time. a) true b) false
a) true
160
In Sequential Data, the points in the dataset are dependent on the other points in the dataset. a) true b) false
a) true
161
A time series is a common example of sequential data, with each point reflecting an observation at a certain point in time. a) true b) false
a) true
162
The crucial element to remember about sequence models is that the data we’re working with are independently and identically distributed (I.I.D.) samples. a) true b) false
b) false
163
Sequence models are machine learning models that input or output sequences of data. a) true b) false
a) true
164
Structured data includes text streams, audio clips, video clips, and time-series data. a) true b) false
b) false
165
Conventional feedforward artificial neural networks can deal with sequential data and can be trained to hold knowledge about the past. a) true b) false
b) false
166
Traditional RNNs are excellent at capturing long-range dependencies. a) true b) false
b) false
167
LSTMs are explicitly designed to avoid the long-term dependency problem. a) true b) false
a) true
168
The input gate controls what information should be forgotten. a) true b) false
b) false
169
The input gate helps to identify important elements that need to be added to the cell state. a) true b) false
a) true
170
RBMs are a supervised learning technique a) true b) false
b) false
171
An RBM isn’t restricted to having only connections between the visible and the hidden units. a) true b) false
b) false
172
An RBM performs discriminative learning similar to what happens in a classification problem. a) true b) false
b) false
173
If the number of visible nodes is nV and the number of hidden nodes is nH, then the number of connections in an RBM is nV × nH. a) true b) false
a) true
174
Boltzmann machines are non-deterministic generative deep learning models with 3 types of nodes: visible, hidden and output nodes a) true b) false
b) false
175
Boltzmann machines fall into the class of unsupervised learning. a) true b) false
a) true
176
Sparse autoencoders introduce an information bottleneck by reducing the number of nodes in hidden layers. a) true b) false
b) false
177
The idea is to encourage the network to learn an encoding and decoding which rely on activating only a small number of neurons. a) true b) false
a) true
178
To implement an undercomplete autoencoder, constrain the number of nodes present in the hidden layer(s) of the neural network. a) true b) false
a) true
179
Autoencoders are not capable of learning nonlinear manifolds (a continuous, non-intersecting surface). a) true b) false
b) false
180
A neural network with multiple hidden layers and sigmoid nodes can form non-linear decision boundaries. a) true b) false
a) true
181
Neural networks compute non-convex functions of their parameters. a) true b) false
a) true
182
For logistic regression, with parameters optimized using a stochastic gradient method, setting parameters to 0 is an acceptable initialization. a) true b) false
a) true
183
For arbitrary neural networks, with weights optimized using a stochastic gradient method, setting weights to 0 is an acceptable initialization. a) true b) false
b) false
184
Given a design matrix X ∈ R^(n×d) where d << n, if we project our data onto a k-dimensional subspace using PCA where k equals the rank of X, we recreate a perfect representation of our data with no loss. a) true b) false
a) true
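
A quick numerical check of this statement using SVD-based PCA on a toy matrix (sizes are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))
Xc = X - X.mean(axis=0)                  # center the data, as PCA assumes
k = np.linalg.matrix_rank(Xc)            # keep k = rank(X) components

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:k].T                        # project onto the k PCA directions
X_rec = Z @ Vt[:k] + X.mean(axis=0)      # reconstruct from the projection

print(np.allclose(X_rec, X))             # True: no loss when k equals the rank
```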
185
Hierarchical clustering methods require a predefined number of clusters, much like k-means. a) true b) false
b) false
186
Given a predefined number of clusters k, globally minimizing the k-means objective function is NP-hard. a) true b) false
a) true
187
A random forest is an ensemble learning method that attempts to lower the bias error of decision trees. a) true b) false
b) false
188
Bagging algorithms attach weights w1...wn to a set of n weak learners; they re-weight the learners and convert them into strong ones. Boosting algorithms draw n sample distributions (usually with replacement) from an original data set for learners to train on. a) true b) false
b) false
189
Using cross-validation to select hyperparameters will guarantee that our model does not overfit. a) true b) false
b) false
190
Bidirectionality is achieved by a technique called "masked language modeling". a) true b) false
a) true
191
BERT overcomes this shortcoming in that it considers previous and next tokens to predict the current token. a) true b) false
a) true
192
XLNet is not the latest and greatest model to emerge from the booming field of natural language processing (NLP). a) true b) false
b) false
193
XLNet is not "generalized" because it captures bidirectional context by means of a mechanism called "permutation language modeling". a) true b) false
b) false
194
XLNet is not a generalized autoregressive model where the next token is dependent on all previous tokens. a) true b) false
b) false
195
XLNet is the idea of capturing bidirectional context by training an autoregressive model on all possible permutations of words in a sentence. a) true b) false
b) false
196
XLNet integrates the idea of auto-regressive models and bi-directional context modeling, yet overcomes the disadvantages of BERT. a) true b) false
a) true
197
Autoregressive (AR) Language Modeling and Autoencoding (AE) have been the two most successful pretraining objectives. a) true b) false
a) true
198
There are proposed methods used in XLNet, such as the permutation language modeling objective. a) true b) false
a) true
199
For both BERT and XLNet, partial prediction plays the role of reducing optimization difficulty by predicting only tokens with sufficient context. a) true b) false
a) true