Lecture 7 and 8 Flashcards

Neural Networks, Word Vectors

1
Q

Introduction to Neural Nets

A

In 2018, Google introduced new text processing techniques heavily dependent on the use of deep learning neural networks. To understand deep learning, it is valuable to first understand how “regular” artificial neural networks (ANNs) work in a simpler form.

2
Q

Deep Learning

A

Deep learning relies on representation learning: automatically learning good features or representations from the data rather than hand-crafting them.

3
Q

Representation learning:

A

learning representations of the data that make it easier to extract useful information when building classifiers or other predictors

4
Q

Overview of Neural Networks

A
  • Weights: calculated during the training process
  • Bias: like an intercept value in a regression
  • Inputs: the observed variables (see the sketch below for how a node combines these)
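
A minimal sketch of how these three pieces combine in a single node, assuming a sigmoid activation; the input, weight, and bias values are illustrative, not from the lecture:

import numpy as np

def sigmoid(z):
    # squashes the summed input into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

inputs = np.array([0.5, 1.0, -0.3])    # observed variables
weights = np.array([0.4, -0.2, 0.7])   # illustrative values; normally learned in training
bias = 0.1                             # intercept-like term

z = np.dot(weights, inputs) + bias     # weighted sum plus bias
output = sigmoid(z)                    # activation function produces the node's output
print(output)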
5
Q

Overview of Neural Networks

A

The activation function can be virtually any formula that produces an output from the summed input, but for learning to work properly, the function must generally be differentiable. The original perceptron activation function is a step function (not differentiable): f(x) = 1 if x ≥ 0, else 0.
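
A minimal sketch contrasting that step activation with a differentiable one (sigmoid), assuming the threshold sits at zero:

import numpy as np

def perceptron_step(z):
    # original perceptron activation: not differentiable at 0, flat everywhere else
    return np.where(z >= 0, 1.0, 0.0)

def sigmoid(z):
    # differentiable, so gradients can flow during learning
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-3, 3, 7)
print(perceptron_step(z))
print(sigmoid(z))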

6
Q

Overview of Neural Networks

A

  • Activation function: f(x)
  • Output: y

7
Q

Common Activation Functions

A

The activation function of a node defines the output of that node given an input or set of inputs. Programmers choose different activation functions based on the system performance exhibited for various applications.

8
Q

Common Activation Functions

A
  • Hyperbolic Tangent (tanh) Function
  • ReLU Function
  • Sigmoid Function
  • Identity Function (see the sketch after this list)
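
A minimal sketch of these four functions using their standard formulas (the test values are illustrative):

import numpy as np

def tanh(x):        # hyperbolic tangent: output in (-1, 1)
    return np.tanh(x)

def relu(x):        # rectified linear unit: 0 for negative input, x otherwise
    return np.maximum(0.0, x)

def sigmoid(x):     # logistic sigmoid: output in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def identity(x):    # passes the summed input through unchanged
    return x

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (tanh, relu, sigmoid, identity):
    print(f.__name__, f(x))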
9
Q

ReLU Function

A

ReLU (Rectified Linear Unit) outputs f(x) = max(0, x): zero for negative inputs, the input itself otherwise.
[Figure: perceptron neuron model (left) and activation function (right).]

10
Q

Neural Network Models with Hidden Layers

A

A typical neural network consists of a few layers: an input layer, an optional hidden layer, and an output layer. With an identity activation function and no hidden layers, the analysis is equivalent to OLS regression.
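
A sketch of that equivalence, assuming identity activation, no hidden layer, and a single output node; the data and coefficient values are made up for illustration:

import numpy as np

X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5]])          # observed inputs (rows = examples)
w = np.array([0.3, -0.1])           # weights (in OLS these would be the fitted coefficients)
b = 0.05                            # bias (the regression intercept)

# identity activation, no hidden layer: y_hat = Xw + b, exactly the OLS functional form
y_hat = X @ w + b
print(y_hat)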

11
Q

Deep Learning:

A

Deep learning is simply a more complex neural network. There are often many hidden layers – sometimes dozens – and multiple output nodes to estimate multidimensional output(s). It is also possible to use different activation functions on different nodes.
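
A hedged sketch of such a network, assuming the TensorFlow/Keras library is available; the layer sizes and activation choices are illustrative, not from the lecture:

import tensorflow as tf

# several hidden layers, mixing activation functions, and a 2-dimensional output
model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),                     # assume 10 input features
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(32, activation="tanh"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(2),                        # multidimensional output, identity activation
])
model.compile(optimizer="adam", loss="mse")
model.summary()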

12
Q

Training – Forward pass

A

The forward pass
Initially, the weights (filter values) are assigned at random, so performance is expected to be (very) poor.

13
Q

The loss function

A

E_total = Σ ½(target − output)²

Cost/error function: the squared error summed over the output nodes (averaging it over the training examples gives the mean squared error)
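
A sketch of computing that cost, assuming the targets and the forward-pass outputs are held in NumPy arrays (the values are illustrative):

import numpy as np

target = np.array([0.01, 0.99])     # desired outputs
output = np.array([0.75, 0.77])     # what the forward pass actually produced

# E_total = sum over output nodes of 1/2 * (target - output)^2
e_total = np.sum(0.5 * (target - output) ** 2)
print(e_total)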

14
Q

The backward pass

A

One way of visualizing this idea of minimizing the loss is to consider a 3-D graph where the weights of the neural net (there are obviously more than 2 weights, but let's go for simplicity) are the independent variables and the dependent variable is the loss. The task of minimizing the loss involves trying to adjust the weights so that the loss decreases. In visual terms, we want to get to the lowest point in our bowl-shaped object. To do this, we have to take a derivative of the loss (in visual terms: calculate the slope in every direction) with respect to the weights.
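
A minimal gradient-descent sketch of that idea for just two weights, assuming a simple bowl-shaped (quadratic) loss so the derivative can be written by hand:

import numpy as np

def loss(w):
    # a bowl-shaped loss over two weights (illustrative, not a real network's loss)
    return (w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2

def grad(w):
    # derivative of the loss with respect to each weight (the slope in each direction)
    return np.array([2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)])

w = np.array([0.0, 0.0])     # starting weights
lr = 0.1                     # learning rate: size of each downhill step
for _ in range(50):
    w -= lr * grad(w)        # move against the gradient, i.e., downhill
print(w, loss(w))            # w approaches (3, -1), the bottom of the bowl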

15
Q

Learning Through Backpropagation

A

Backpropagation takes the difference between the predicted value and the actual value and uses that error term to adjust each node’s weights.

16
Q

Learning Through Backpropagation

A

The process works backwards from the final layers to earlier layers, one layer at a time, and computes the contribution that each weight in the given layer had in the loss value.

17
Q

Learning Through Backpropagation

A

The algorithm that uses the loss gradient to update the weights is called “gradient descent”: it iteratively moves the weights in the direction that most decreases the loss.

18
Q

Backpropagation

A

includes: the forward pass, loss function, backward pass, and parameter update

19
Q

Epoch:

A

One epoch is when the ENTIRE dataset is passed forward and backward through the neural network exactly ONCE.

20
Q

Batch:

A

one batch contains the training examples used for one weight update (a common recommendation: no more than 32)

21
Q

Iteration:

A

number of iterations (batches) per epoch = total number of training examples / batch size
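
For example, with 2,000 training examples and a batch size of 500, one epoch takes 2,000 / 500 = 4 iterations (weight updates).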

22
Q

Language analysis

A
  • Speech
  • Morphology
  • Syntax
  • Semantics
23
Q

Word Embedding

A

  • One-hot Vector
  • TF-IDF
  • Word2Vec
  • GloVe
  • fastText
  • ELMo
  • Attention Mechanism – BERT
  • XLNet

24
Q

One-hot Vector

A

In a vocabulary set, each word is represented as a vector with a 1 in that word's position and 0s everywhere else. For example, if the word “chair” is the 5391st word in the vocabulary, we can represent it as the one-hot vector O(5391)
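
A minimal sketch of constructing such a vector, assuming a toy vocabulary size of 10,000; only the index for “chair” comes from the example above:

import numpy as np

vocab_size = 10000
chair_index = 5391                 # "chair" is the 5391st word (1-based in the example)

o_chair = np.zeros(vocab_size)
o_chair[chair_index - 1] = 1.0     # a single 1 at chair's position, 0 everywhere else
print(o_chair.sum(), o_chair[5390])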

25
Q

TF-IDF

A

TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a numerical statistic used in information retrieval and text mining to evaluate the importance of a word in a document relative to a collection of documents (corpus). The TF-IDF value increases proportionally to the number of times a word appears in a document but is offset by the frequency of the word in the corpus. This helps to highlight words that are more specific to a particular document and are less common in the entire corpus.

26
Q

Term Frequency (TF):

A

This measures how often a term (word) appears in a document. It is calculated as the ratio of the number of times the term appears in the document to the total number of terms in the document. TF is usually normalized to prevent it from biasing towards longer documents.

27
Q

Inverse Document Frequency (IDF):

A

This measures how important a term is within the entire corpus. It is calculated as the logarithm of the ratio of the total number of documents to the number of documents containing the term. Terms that occur in many documents have a lower IDF, and vice versa.
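
A sketch that puts TF and IDF together over a tiny made-up corpus, using the plain definitions above (real libraries such as scikit-learn add extra smoothing):

import math
from collections import Counter

corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs and cats are pets".split(),
]

def tf(term, doc):
    # term frequency: count of the term divided by the total number of terms in the doc
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    # inverse document frequency: log(total docs / docs containing the term)
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing) if n_containing else 0.0

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("mat", corpus[0], corpus))   # appears in only one document -> higher score
print(tf_idf("the", corpus[0], corpus))   # appears in most documents -> lower score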

28
Q

Word2Vec

A

popular technique in natural language processing (NLP) and machine learning that is used to represent words as vectors in a continuous vector space. Developed by a team at Google led by Tomas Mikolov, Word2Vec is designed to capture semantic relationships between words based on their context in a given corpus.

29
Q

Continuous Bag of Words (CBOW):

A

This model predicts a target word based on its context. It takes a context of surrounding words (the “bag of words”) as input and tries to predict the target word.

30
Q

Skip-Gram:

A

In contrast to CBOW, the Skip-Gram model predicts the context words (surrounding words) given a target word. It takes a target word as input and tries to predict the words that are likely to appear in its context.
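
A hedged sketch of training both variants, assuming the gensim library (version 4.x, where the size parameter is named vector_size); the two-sentence corpus is illustrative only:

from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

# sg=0 -> CBOW: predict the target word from its surrounding context words
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

# sg=1 -> Skip-Gram: predict the surrounding context words from the target word
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["cat"].shape)                    # 50-dimensional vector for "cat"
print(skipgram.wv.most_similar("cat", topn=2))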

31
Q

GloVe

A

“Global Vectors for Word Representation,” is another popular word embedding technique in natural language processing (NLP). Developed by researchers at Stanford University, GloVe is designed to capture the global context of words in a corpus and create vector representations that encode semantic relationships between words

32
Q

fastText

A

open-source, free, lightweight library developed by Facebook’s AI Research (FAIR) lab for efficient learning of word representations and text classification. It is an extension of the Word2Vec model, developed by the same team. What sets fastText apart is its ability to represent each word as a bag of character n-grams, enabling it to capture morphological information and handle out-of-vocabulary words more effectively
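
A sketch of the character n-gram idea, using trigrams and the “<”/“>” word-boundary markers described in the fastText paper; this only enumerates n-grams, it does not train anything:

def char_ngrams(word, n=3):
    # fastText wraps the word in boundary markers before extracting n-grams
    padded = "<" + word + ">"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("where"))
# ['<wh', 'whe', 'her', 'ere', 're>']
# an out-of-vocabulary word still shares n-grams with known words, e.g. "wheres"
print(char_ngrams("wheres"))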

33
Q

ELMo

A

“Embeddings from Language Models,” is a deep contextualized word representation model developed by researchers at the Allen Institute for Artificial Intelligence (AI2). Unlike traditional word embeddings that assign a fixed vector to each word regardless of its context, ELMo produces word representations that are sensitive to the surrounding words in a given sentence. ELMo captures the context-dependent meaning of words by considering their usage in different contexts

34
Q

Attention Mechanism – BERT

A

Bidirectional Encoder Representations from Transformers, is a natural language processing (NLP) model that utilizes an attention mechanism. The attention mechanism is a key component in BERT and many other transformer-based models. Here’s an explanation of the attention mechanism and its role in BERT

35
Q

Attention Mechanism – BERT

A

Attention Mechanism:
The attention mechanism is a mechanism that allows a model to focus on different parts of the input sequence when making predictions. In the context of NLP, this input sequence is often a sequence of words in a sentence. Traditional sequence-to-sequence models or recurrent neural networks (RNNs) process input sequences sequentially, but attention mechanisms enable models to consider all words in the sequence simultaneously
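
A minimal NumPy sketch of the scaled dot-product attention used inside transformer models such as BERT; the matrices Q, K, V and their sizes are illustrative assumptions:

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # how much each word should attend to every other word
    weights = softmax(scores, axis=-1)  # attention weights sum to 1 across the sequence
    return weights @ V                  # weighted combination of the value vectors

seq_len, d_model = 4, 8                 # 4 "words", 8-dimensional representations
rng = np.random.default_rng(0)
Q = rng.normal(size=(seq_len, d_model))
K = rng.normal(size=(seq_len, d_model))
V = rng.normal(size=(seq_len, d_model))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8)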

36
Q

XLNet

A

XLNet is a generalized autoregressive pretraining method for language understanding. It is a language model that learns unsupervised representations of text sequences. XLNet is an extension of Transformer-XL and uses a permutation-based autoregressive objective, which lets it model bidirectional contexts; this is why it outperforms pretraining approaches based on denoising autoencoding, such as BERT. XLNet has been shown to outperform BERT on 20 tasks, including question answering, natural language inference, sentiment analysis, and document ranking.

37
Q

Cosine Similarity

A

Two vectors pointing in the same direction have a cosine similarity of 1; two orthogonal vectors (90-degree angle) have a cosine similarity of 0.

38
Q

Cosine Similarity

A

Cosine distance can be expressed in different ways, e.g., 1 − sim.

39
Q

Cosine Similarity

A

Cosine similarity can be computed as the normalized dot product of the two vectors – very efficient in Python.
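
A sketch of that computation with NumPy, assuming the two word vectors are already dense arrays:

import numpy as np

def cosine_similarity(a, b):
    # normalized dot product: dot(a, b) / (||a|| * ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])       # same direction as a -> similarity 1
c = np.array([-2.0, 1.0, 0.0])      # orthogonal to a -> similarity 0

print(cosine_similarity(a, b))      # 1.0
print(cosine_similarity(a, c))      # 0.0
print(1 - cosine_similarity(a, b))  # cosine distance expressed as 1 - sim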
