Final Exam Review Flashcards

1
Q

back propagation -

A

Training Process:
1. Forward Pass:
The input data is passed through the neural network layer by layer to produce an output. The output is compared to the actual target values, and the error is calculated.
2. Backward Pass (Backpropagation):
The algorithm then works backward through the network. It calculates the gradient of the error with respect to the weights of the network. This is done using the chain rule of calculus. The gradients indicate how much the error would increase or decrease if the weights were adjusted.
3. Weight Update:
The weights of the network are then updated in the opposite direction of the calculated gradients. This process is repeated iteratively, adjusting the weights to minimize the error.
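
As a rough illustration of the three steps above, here is a minimal NumPy sketch of one training step for a toy one-hidden-layer network; the layer sizes, loss, and learning rate are arbitrary assumptions for illustration, not anything specific from the course.

```python
import numpy as np

# One training step for a tiny 1-hidden-layer network with squared-error loss.
rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4))               # one input example with 4 features
y = np.array([[1.0]])                     # target value

W1 = rng.normal(scale=0.1, size=(4, 8))   # input -> hidden weights
W2 = rng.normal(scale=0.1, size=(8, 1))   # hidden -> output weights
lr = 0.1                                  # learning rate (step size)

# 1. Forward pass: compute the prediction and the error.
h = np.tanh(x @ W1)                       # hidden activations
y_hat = h @ W2                            # network output
loss = 0.5 * np.sum((y_hat - y) ** 2)

# 2. Backward pass: chain rule gives the gradient of the loss w.r.t. each weight.
d_yhat = y_hat - y                        # dL/dy_hat
dW2 = h.T @ d_yhat                        # dL/dW2
d_h = (d_yhat @ W2.T) * (1 - h ** 2)      # back through tanh
dW1 = x.T @ d_h                           # dL/dW1

# 3. Weight update: step in the opposite direction of the gradients.
W2 -= lr * dW2
W1 -= lr * dW1
```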
2
Q

how do we train a neural network to be intelligent enough to do tasks like prediction and classification?

A

back propagation.

3
Q

back propagation function -

A

it is through back propagation that we update the weights in the matrix so that it gets a good representation of the information - that’s the neural network approach

4
Q

with back propagation how are you going to do that weight update?

A

The weights are updated in the opposite direction of the computed gradients. The learning rate determines the step size of this update.
This process is typically repeated for multiple iterations (epochs) until the model converges to a set of weights that minimizes the loss.
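
As a sketch, a single plain gradient-descent update looks like the line below; the matrix, gradient, and learning-rate values are placeholders for illustration (in practice the gradient comes from backpropagation).

```python
import numpy as np

# Minimal sketch of the gradient-descent weight update.
W = np.ones((3, 2))              # current weights (placeholder values)
grad_W = np.full((3, 2), 0.5)    # gradient of the loss w.r.t. W (from backprop)
lr = 0.01                        # learning rate: the step size of the update

W = W - lr * grad_W              # move opposite to the gradient to reduce the loss
```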

5
Q

what conceptual things happen during back propagation

A

Training Process:
1. Forward Pass
2. Backward Pass (Backpropagation)
3. Weight Update
6
Q

the difference between the predicted value and the actual value

A
  • Backpropagation takes the difference between the predicted value and the actual value and uses that error term to adjust each node’s weights.
  • The process works backwards from the final layer to earlier layers, one layer at a time, and computes the contribution that each weight in the given layer made to the loss value.
  • The algorithm that uses these gradients to update the weights is called “gradient descent”: it iteratively moves the weights in the direction of greatest improvement in prediction (the direction that most reduces the loss).
7
Q

back propagation speed

A

larger steps allow the learning to happen faster (big vs. little steps) when searching for the optimal point, but steps that are too large can overshoot it; smaller steps are slower but more precise

8
Q

word embedding

A

Word embedding is a technique in natural language processing (NLP) and machine learning that represents words as vectors of real numbers. These vectors capture semantic relationships between words, allowing words with similar meanings to have similar vector representations. In other words, word embedding is a way to map words to dense vectors of real numbers, often in a continuous vector space.
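
As a tiny illustration (the vocabulary, dimension, and vector values below are invented), an embedding is just a lookup from a word index into a matrix of real-valued vectors, and similar words end up with similar vectors:

```python
import numpy as np

# Toy word-embedding lookup: each row of the matrix is the vector for one word.
vocab = {"king": 0, "queen": 1, "apple": 2}
embedding_matrix = np.array([
    [0.80, 0.10, 0.70, 0.20],   # "king"
    [0.79, 0.15, 0.68, 0.25],   # "queen"  (similar meaning -> similar vector)
    [0.10, 0.90, 0.05, 0.80],   # "apple"
])

def embed(word):
    return embedding_matrix[vocab[word]]

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(embed("king"), embed("queen")))  # high similarity
print(cosine(embed("king"), embed("apple")))  # lower similarity
```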

9
Q

epoch

A

one Epoch is when an ENTIRE dataset is passed forward and backward through the neural network only ONCE.

An epoch refers to one complete pass through the entire training dataset during the training of a machine learning model.

10
Q

batch

A

one batch contains the training examples used in one weight update (a commonly cited recommendation: no more than 32)

A batch is a subset of the training dataset that is processed together in one iteration.

11
Q

Iteration

A

number of iterations (batches) per epoch = total number of training examples / batch size

An iteration, in the context of training, refers to one update of the model’s weights.

12
Q

batch size calculation based on code

A

Number of Iterations per Epoch = Total Dataset Size / Batch Size
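
For example (the numbers below are made up), with 1,000 training examples and a batch size of 32:

```python
import math

# Made-up example: 1,000 training examples with a batch size of 32.
total_examples = 1000
batch_size = 32

iterations_per_epoch = math.ceil(total_examples / batch_size)
print(iterations_per_epoch)  # 31 full batches of 32 plus one batch of 8 -> 32
```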

13
Q

name 3 word embedding techniques.

A

One-hot Vector
TF-IDF
Word2Vec
GloVe
fastText
ELMo
Attention Mechanism (BERT)
XLNet

14
Q

why do we need to do word embedding in a neural network approach?

A

word embedding is crucial because it represents words as vectors in a continuous vector space. This helps capture semantic relationships between words, enabling the network to understand context, similarities and differences.

15
Q

GloVe (Global Vectors for Word Representation):

A

GloVe is an unsupervised learning algorithm that learns word representations by examining global word co-occurrence statistics. It creates embeddings by factorizing the logarithm of the word co-occurrence matrix.

16
Q

Word2Vec (Word to Vector):

A

Developed by Google, Word2Vec represents words as dense vectors in a continuous vector space. It uses neural networks to learn word embeddings based on the context in which words appear.

17
Q

FastText:

A

Developed by Facebook, FastText extends Word2Vec by representing each word as a bag of character n-grams. This allows it to generate embeddings for out-of-vocabulary words and capture morphological information.

18
Q

TF-IDF (Term Frequency-Inverse Document Frequency):

A

While not a neural embedding technique, TF-IDF is a traditional method for representing words based on their importance in a document or a corpus. It is commonly used in information retrieval.

19
Q

ELMo (Embeddings from Language Models):

A

ELMo generates word embeddings by considering the context in which words appear in a sentence. It uses a deep, context-dependent bidirectional LSTM (Long Short-Term Memory) model.

20
Q

BERT (Bidirectional Encoder Representations from Transformers):

A

BERT is a transformer-based model that considers bidirectional context information for word embeddings. It has been highly successful in various NLP tasks and captures complex linguistic patterns.

21
Q

XLNet:

A

A successor to BERT, XLNet uses a permutation language modeling objective to capture bidirectional context information. It overcomes some limitations of BERT, particularly the assumption that masked tokens are predicted independently of one another.

22
Q

CNN

A

CNN stands for Convolutional Neural Network. It is a type of artificial neural network designed for processing structured grid data, such as images. CNNs are particularly effective in computer vision tasks, including image recognition, object detection, and image classification.

23
Q

what is the mechanism of CNN

A

Convolutional Layers:
Convolutional layers apply convolution operations to the input data using learnable filters or kernels.
The filters slide across the input, and the convolution operation computes dot products between the filter weights and the local input patches.
The result is a feature map that highlights patterns and features present in the input.

Activation Function:
After the convolutional operation, an activation function (commonly ReLU - Rectified Linear Unit) is applied element-wise to introduce non-linearity.
ReLU is popular because it efficiently handles the vanishing gradient problem and speeds up convergence during training.

Pooling Layers:
Pooling layers perform down-sampling by selecting the maximum or average value from a local region.
Max pooling is often used to retain the most important features from each region.

Fully Connected Layers:
Fully connected layers connect every neuron in one layer to every neuron in the next layer.
They perform classification based on the high-level features learned by the convolutional layers.

Softmax Activation:
For classification tasks, the final layer usually has a softmax activation function to convert the network’s output into probability scores for each class.

Backpropagation:
The network is trained using backpropagation and gradient descent.
The loss between predicted and actual values is calculated, and the gradients are propagated backward through the network.
Optimizers adjust the weights and biases to minimize the loss, iteratively improving the model.

Weight Sharing:
Weight sharing is a key concept in CNNs, where the same set of weights (filters) is used across different spatial locations.
This reduces the number of parameters and helps the network generalize better.
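
A minimal Keras sketch of the pipeline described above (conv → ReLU → pool → flatten → fully connected → softmax); the input shape, filter counts, and 10-class output are arbitrary illustrative choices:

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),                      # e.g. grayscale images
    layers.Conv2D(32, kernel_size=3, activation="relu"),  # convolution + ReLU
    layers.MaxPooling2D(pool_size=2),                     # down-sampling
    layers.Conv2D(64, kernel_size=3, activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),                  # fully connected layer
    layers.Dense(10, activation="softmax"),               # class probabilities
])

# Training uses backpropagation via an optimizer:
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()   # also shows the per-layer parameter counts
```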

24
Q

filters that slide through

A

In the context of Convolutional Neural Networks (CNNs), filters that “slide through” refer to the convolutional filters (kernels) moving across the input data during the convolution operation. At each position the filter computes a dot product with the local patch of the input it covers, the stride controls how far the filter moves between positions, and each position contributes one value to the resulting feature map.

25
Q

CNN calculating the number of parameters

A

input layer = 0 parameters

embedding layer = vocabulary size × word embedding dimension

convolutional layer = (kernel size × word embedding dimension × number of filters) + number of filters (one bias per filter)

dropout layer = 0 parameters
pooling layer = 0 parameters

dense (fully connected) layer = (number of input units × number of output units) + number of output units (one bias per output)

the total number of parameters is the sum of the parameters of all layers
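
As a worked sketch (all sizes below are invented), here is how those per-layer counts add up for a small text CNN:

```python
# Worked example of counting parameters in a small text CNN (invented sizes).
vocab_size = 5000
embedding_dim = 100
kernel_size = 3
num_filters = 64
num_classes = 2

embedding_params = vocab_size * embedding_dim                          # 500,000
conv_params = kernel_size * embedding_dim * num_filters + num_filters  # 19,264
# dropout and (global max) pooling layers have no trainable parameters
dense_params = num_filters * num_classes + num_classes                 # 130

total = embedding_params + conv_params + dense_params
print(total)  # 519,394
```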

26
Q

RNN calculating the number of parameters

A

Number of Parameters = Units × (Input Dimension + Units + 1)
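
For example (sizes are made up), a simple RNN layer with 64 units on 100-dimensional inputs:

```python
# Worked example for the RNN parameter formula above (sizes are made up).
units = 64
input_dim = 100

params = units * (input_dim + units + 1)  # input weights + recurrent weights + biases
print(params)  # 64 * (100 + 64 + 1) = 10,560
```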

27
Q

What is a potential issue with CNN?

A

a CNN doesn’t remember what has happened earlier - it has no memory of previous inputs in a sequence

28
Q

RNN

A

Recurrent Neural Network.

An RNN retains information from earlier steps in a sequence through its hidden state.

29
Q

What is a potential problem with RNN

A

Vanishing Gradient Problem:

In RNNs, during backpropagation through time (BPTT), the gradient is multiplied across many time steps. If the recurrent weights are small, the gradient may become exponentially small as it is propagated backward through the network.
This results in the network being unable to effectively learn long-range dependencies, making it challenging for the model to capture information from earlier time steps.

30
Q

LSTM

A

LSTM stands for Long Short-Term Memory, and it is a type of recurrent neural network (RNN) architecture designed to overcome the limitations of traditional RNNs in capturing and learning long-range dependencies in sequential data.

The LSTM architecture allows the network to selectively remember or forget information based on the context of the input sequence. This makes LSTMs powerful for modeling and learning patterns in sequences with complex dependencies, enabling them to capture information over extended time intervals.

31
Q

GRU

A

GRU stands for “Gated Recurrent Unit.” It is a type of recurrent neural network (RNN) architecture designed to capture and model dependencies in sequential data. GRU is an improvement over traditional RNNs in terms of mitigating the vanishing gradient problem and allowing for better learning of long-term dependencies.

32
Q

may show 2 figures, one is LSTM one is GRU - need to know the difference

A

in the diagrams, the LSTM cell has four interacting neural-network layers (three gates plus the candidate cell-state update), while the GRU has three (two gates plus the candidate state)

LSTM: LSTM has a more complex architecture compared to GRU. It includes three gates - input gate, forget gate, and output gate - and a memory cell to control the flow of information.

GRU: GRU has a simpler architecture with only two gates - update gate and reset gate. It doesn’t have a separate memory cell.

33
Q

which refers to the long term memory

A

the “cell state” or “memory cell” is what refers to the long-term memory.

34
Q

which refers to the short term memory

A

the “hidden state” is often considered to represent the short-term memory

35
Q

LSTM 3 gates

A

there are three main gates that control the flow of information:
the input gate
the forget gate
the output gate

These gates are responsible for regulating the information flow into, out of, and within the memory cell.

36
Q

the four weight matrices that we need to update (LSTM)

A

Input Gate Weight Matrix (Wi):
Controls the input information that should be stored in the cell state.

Forget Gate Weight Matrix (Wf):
Controls the information from the previous cell state that should be discarded or forgotten.

Cell State Weight Matrix (Wc):
Controls the values of the cell state, determining how much of the new input and the retained information from the previous state should contribute to the updated cell state.

Output Gate Weight Matrix (Wo):
Controls the information from the updated cell state that should be output as the hidden state.
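
For reference, one common formulation (concatenating the previous hidden state and the current input; the notation may differ slightly from the slides) shows where each of the four matrices appears:

```latex
\begin{aligned}
i_t &= \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \\
f_t &= \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \\
\tilde{C}_t &= \tanh\left(W_c \cdot [h_{t-1}, x_t] + b_c\right) \\
o_t &= \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
h_t &= o_t \odot \tanh(C_t)
\end{aligned}
```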

37
Q

bi-LSTM (bidirectional) what does it mean

A

a bidirectional LSTM runs two LSTMs over the same sequence: one processes it left to right (x0 to xt) and the other right to left (xt to x0); their hidden states are combined at each time step

38
Q

GRU
equation

A

it combines the forget and input gates into a single update gate (and uses a reset gate)
- simpler than the LSTM but can perform about as well
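
One common formulation of the GRU equations (notation may differ slightly from the slides):

```latex
\begin{aligned}
z_t &= \sigma\left(W_z \cdot [h_{t-1}, x_t]\right) \\
r_t &= \sigma\left(W_r \cdot [h_{t-1}, x_t]\right) \\
\tilde{h}_t &= \tanh\left(W \cdot [r_t \odot h_{t-1}, x_t]\right) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```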

39
Q

Encoder:

A

an LSTM that encodes the input sequence into a fixed-length internal representation W.

The encoder processes the input sequence and transforms it into a fixed-dimensional context vector or a series of hidden states.
Each element of the input sequence is typically embedded into a high-dimensional space, and recurrent or convolutional layers capture the sequential or spatial dependencies.
The final hidden state or context vector summarizes the input sequence.

40
Q

Decoder:

A

another LSTM that takes the internal representation W and generates the output sequence from that vector

The decoder takes the context vector (or hidden states) from the encoder and generates the output sequence.
Similar to the encoder, it usually consists of recurrent layers or other architectures capable of handling sequential data.
During training, the decoder is fed the correct output sequence, one element at a time, and learns to generate the sequence step by step.
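
A minimal Keras sketch of such an LSTM encoder–decoder (seq2seq) with teacher forcing; the vocabulary sizes, dimensions, and layer choices are invented for illustration:

```python
from tensorflow.keras import layers, models

src_vocab, tgt_vocab, emb_dim, hidden = 5000, 5000, 128, 256

# Encoder: reads the input sequence and summarizes it in its final states.
enc_inputs = layers.Input(shape=(None,))
enc_emb = layers.Embedding(src_vocab, emb_dim)(enc_inputs)
_, state_h, state_c = layers.LSTM(hidden, return_state=True)(enc_emb)

# Decoder: starts from the encoder's final states and generates the output
# sequence one step at a time (fed the correct previous token during training).
dec_inputs = layers.Input(shape=(None,))
dec_emb = layers.Embedding(tgt_vocab, emb_dim)(dec_inputs)
dec_out, _, _ = layers.LSTM(hidden, return_sequences=True,
                            return_state=True)(dec_emb,
                                               initial_state=[state_h, state_c])
outputs = layers.Dense(tgt_vocab, activation="softmax")(dec_out)

model = models.Model([enc_inputs, dec_inputs], outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```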

41
Q

how do you update the weights
how do you train the encoder

A

The training process for the encoder in a neural network, especially in the context of an encoder-decoder architecture, involves “adjusting the weights of the encoder’s parameters to minimize the difference between the predicted outputs and the actual outputs”

42
Q

attention - at a particular decoder time step, which encoder states do you pay more attention to and which do you pay less attention to (which ones and how much)?

A

attention is a mechanism that enables neural networks to focus on specific parts of the input sequence when making predictions or generating output. The attention mechanism has proven to be particularly useful in tasks involving sequential data, such as machine translation, text summarization, and question-answering.

43
Q

attention mechanism - how are the weights over the encoder hidden states (h1 to hm) calculated based on the current step in the decoder? - may have a question on this

A

the weights for different parts of the input sequence (encoder hidden states) are calculated based on their relevance to the current step in the decoder. The calculation typically involves a scoring mechanism (often a small neural network or a dot product between the decoder state and each encoder state), followed by a softmax that normalizes the scores into attention weights
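
A minimal sketch of dot-product attention at a single decoder step (the dimensions and values are invented):

```python
import numpy as np

# Dot-product attention at one decoder step.
# h: encoder hidden states h1..hm, s: current decoder state.
m, d = 5, 8
rng = np.random.default_rng(0)
h = rng.normal(size=(m, d))   # encoder hidden states, one row per input position
s = rng.normal(size=(d,))     # current decoder hidden state

scores = h @ s                                    # relevance score per encoder state
weights = np.exp(scores) / np.exp(scores).sum()   # softmax -> attention weights
context = weights @ h                             # weighted sum of encoder states

print(weights)   # how much attention each input position receives
print(context)   # the context vector used by the decoder at this step
```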

44
Q

multihead attention layer

A

Multi-Head Attention layer is a component used in models like the Transformer architecture. The idea behind multi-head attention is to allow the model to jointly attend to different parts of the input sequence in parallel, capturing diverse aspects of the relationships within the data
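
A rough NumPy sketch of the idea (dimensions and random projection weights are invented): each head runs scaled dot-product attention with its own projections, and the head outputs are concatenated.

```python
import numpy as np

# Sketch of multi-head self-attention with invented sizes.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 16, 4
d_head = d_model // num_heads

x = rng.normal(size=(seq_len, d_model))           # input sequence
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

q, k, v = x @ Wq, x @ Wk, x @ Wv
# split the model dimension into heads: shape (num_heads, seq_len, d_head)
q, k, v = (t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2) for t in (q, k, v))

attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))   # per-head weights
heads = attn @ v                                              # per-head outputs
out = heads.transpose(1, 0, 2).reshape(seq_len, d_model)      # concatenate heads
print(out.shape)  # (4, 16)
```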

45
Q

transformer based approach

A

machine learning or deep learning model architecture that utilizes the Transformer architecture. The Transformer is a type of neural network architecture introduced in the paper “Attention is All You Need” by Vaswani et al. in 2017. The Transformer architecture has since become a fundamental building block for various natural language processing (NLP) and machine learning tasks.

46
Q

sequential approach

A

The “sequential approach” in the context of neural networks and machine learning refers to the traditional way of processing sequential data, where information is passed through the network one step at a time, typically in a fixed order. Recurrent Neural Networks (RNNs) are a common example of a model architecture that follows a sequential approach.

47
Q

why do we use tanh? (slide 8)

A

common activation functions include:
hyperbolic tangent (tanh) function
ReLU function
sigmoid function
identity function

tanh is often chosen because it is zero-centered and squashes values into the range (-1, 1), which keeps activations (for example, the LSTM cell state) bounded.

48
Q

what is the neural network approach?

A

the neural network approach uses neural networks as a computational model to perform tasks such as pattern recognition, classification, regression, and more.

49
Q

how many hidden layers can you have?

A

depends (1 or more than 1)
there is no one-size-fits-all answer

50
Q

what is the purpose of back propagation in the neural network approach?

A

to update the weights so as to minimize the loss and thereby improve the performance of the model

51
Q

what is a batch

A

a batch is a set of data samples used in one iteration of the training process

52
Q

what is learning rate

A

The learning rate is a hyperparameter that determines the step size of each weight update during training. Too large a learning rate can overshoot the optimal point; too small a learning rate makes learning slow.
53
Q

why is bert better than one hot vector -

A

BERT has context:

One-hot vectors represent words in a fixed and context-independent manner. Each word is encoded as a vector with all zeros except for one element representing the position of that word in the vocabulary.

BERT, on the other hand, provides contextualized word representations. It considers the entire context of a word within a sentence, capturing dependencies and relationships between words.

54
Q

why do we need to do word embedding in neural network approach?

A

Word embedding allows words to be represented as vectors in a continuous vector space. This representation captures semantic relationships between words. Words with similar meanings are closer in the vector space, enabling the model to understand and leverage semantic similarities.

55
Q

can you have more than 1 hidden layer?

A

yes